Generating local addresses and communication sets for data-parallel
programs
    1.
    Granted patent
    Generating local addresses and communication sets for data-parallel programs (status: expired)

    Publication No.: US5450313A

    Publication date: 1995-09-12

    Application No.: US217404

    Filing date: 1994-03-24

    IPC classification: G06F9/45 G06F15/16

    CPC classification: G06F8/447 G06F8/45

    Abstract: An optimizing compilation process generates executable code which defines the computation and communication actions that are to be taken by each individual processor of a computer having a distributed memory, parallel processor architecture to run a program written in a data-parallel language. To this end, local memory layouts of the one-dimensional and multidimensional arrays that are used in the program are derived from one-level and two-level data mappings consisting of alignment and distribution, so that array elements are laid out in canonical order and local memory space is conserved. Executable code then is generated to produce at program run time, a set of tables for each individual processor for each computation requiring access to a regular section of an array, so that the entries of these tables specify the spacing between successive elements of said regular section resident in the local memory of said processor, and so that all the elements of said regular section can be located in a single pass through local memory using said tables. Further executable code is generated to produce at program run time, another set of tables for each individual processor for each communication action requiring a given processor to transfer array data to another processor, so that the entries of these tables specify the identity of a destination processor to which the array data must be transferred and the location in said destination processor's local memory at which the array data must be stored, and so that all of said array data can be located in a single pass through local memory using these communication tables. And, executable node code is generated for each individual processor that uses the foregoing tables at program run time to perform the necessary computation and communication actions on each individual processor of the parallel computer.
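
The two kinds of tables described in this abstract can be illustrated with a small sketch. It is not the patented table-construction method: it simply enumerates the section by brute force, and it assumes a one-dimensional array with a block-cyclic `cyclic(k)` distribution, a positive stride, and the same distribution for source and destination arrays. The function names (`owner_and_offset`, `gap_table`, `send_table`) are hypothetical.

```python
from typing import List, Optional, Tuple


def owner_and_offset(i: int, p_count: int, k: int) -> Tuple[int, int]:
    """Map global index i to (owning processor, local offset) under cyclic(k)."""
    block = i // k                           # size-k block that holds element i
    owner = block % p_count                  # blocks are dealt out round-robin
    offset = (block // p_count) * k + i % k  # owner packs its blocks contiguously
    return owner, offset


def gap_table(lo: int, hi: int, stride: int, p: int,
              p_count: int, k: int) -> Tuple[Optional[int], List[int]]:
    """Start offset and spacing between successive local elements of lo:hi:stride on p."""
    offsets = []
    for i in range(lo, hi + 1, stride):
        owner, offset = owner_and_offset(i, p_count, k)
        if owner == p:
            offsets.append(offset)
    start = offsets[0] if offsets else None
    gaps = [b - a for a, b in zip(offsets, offsets[1:])]
    return start, gaps


def send_table(lo_a: int, hi_a: int, stride_a: int, lo_b: int, stride_b: int,
               p: int, p_count: int, k: int) -> List[Tuple[int, int, int]]:
    """For B(lo_b::stride_b) = A(lo_a:hi_a:stride_a), list what processor p must send.

    Each entry is (source local offset, destination processor, destination local offset).
    """
    table = []
    for t, i in enumerate(range(lo_a, hi_a + 1, stride_a)):
        src_owner, src_offset = owner_and_offset(i, p_count, k)
        if src_owner != p:
            continue
        dst_owner, dst_offset = owner_and_offset(lo_b + t * stride_b, p_count, k)
        if dst_owner != p:  # elements that stay on p need no communication
            table.append((src_offset, dst_owner, dst_offset))
    return table
```

With two processors and `k = 2`, `gap_table(0, 15, 3, 0, 2, 2)` gives a start offset plus a list of gaps, so processor 0 can step through its part of the section in one pass over local memory by adding each gap in turn.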

    Method of compilation optimization using an N-dimensional template for
relocated and replicated alignment of arrays in data-parallel programs
for reduced data communication during execution
    2.
    Granted patent
    Method of compilation optimization using an N-dimensional template for relocated and replicated alignment of arrays in data-parallel programs for reduced data communication during execution (status: expired)

    Publication No.: US5475842A

    Publication date: 1995-12-12

    Application No.: US104755

    Filing date: 1993-08-11

    CPC classification: G06F8/453

    Abstract: When a data-parallel language like Fortran 90 is compiled for a distributed-memory machine, aggregate data objects (such as arrays) are distributed across the processor memories. The mapping determines the amount of residual communication needed to bring operands of parallel operations into alignment with each other. A common approach is to break the mapping into two stages: first, an alignment that maps all the objects to an abstract template, and then a distribution that maps the template to the processors. This disclosure deals with two facets of the problem of finding alignments that reduce residual communication; namely, alignments that vary in loops, and objects that permit of replicated alignments. It is shown that loop-dependent dynamic alignment is sometimes necessary for optimum performance, and algorithms are provided so that a compiler can determine good dynamic alignments for objects within "do" loops. Also situations are identified in which replicated alignment is either required by the program itself (via spread operations) or can be used to improve performance. An algorithm based on network flow is proposed for determining which objects to replicate so as to minimize the total amount of broadcast communication in replication.
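
The two-stage mapping in this abstract (alignment to a template, then distribution of the template) can be made concrete with a small sketch. It does not implement the dynamic-alignment or network-flow algorithms of the disclosure. It assumes a one-dimensional template, affine alignments, a block distribution, and an elementwise operation such as `A(i) + B(i)`; it then counts the operand pairs that end up on different processors, which is one simple measure of residual communication. All names are hypothetical.

```python
from typing import Callable


def affine_alignment(stride: int, offset: int) -> Callable[[int], int]:
    """Stage 1: map array index i to template cell stride * i + offset."""
    return lambda i: stride * i + offset


def block_owner(cell: int, template_size: int, p_count: int) -> int:
    """Stage 2: block distribution of the template over p_count processors."""
    block = -(-template_size // p_count)  # ceiling division
    return cell // block


def residual_communication(n: int,
                           align_a: Callable[[int], int],
                           align_b: Callable[[int], int],
                           template_size: int,
                           p_count: int) -> int:
    """Count indices i in 0..n-1 whose operands A(i) and B(i) live on different processors."""
    return sum(
        1
        for i in range(n)
        if block_owner(align_a(i), template_size, p_count)
        != block_owner(align_b(i), template_size, p_count)
    )
```

For example, with a 16-cell template on 4 processors, aligning `B` one cell to the right of `A` leaves two of eight operand pairs split across processors, while giving both arrays the same alignment leaves none.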

    System and method for encoding and decoding architecture registers
    3.
    Granted patent
    System and method for encoding and decoding architecture registers (status: expired)

    Publication No.: US07596680B2

    Publication date: 2009-09-29

    Application No.: US10662179

    Filing date: 2003-09-15

    IPC classification: G06F7/38

    Abstract: A system and method to extend the number of architecturally visible registers in a processor while preserving the number of bits of the instruction encoding. The system comprises: an indirection table that encodes register patterns for the registers used in an instruction; instructions to load and store such table entries; a mechanism to identify instructions that use the indirection table; and a mechanism to identify a set of bits in instructions that are used to index into the indirection table. According to another embodiment, a method of encoding registers in a computer instruction comprises constructing a table having a plurality of entries. Each entry specifies a combination of a plurality of registers. The method also comprises generating an instruction having a reference to one of the entries in the table. The method then comprises accessing the plurality of registers specified by the referenced table entry. The method further comprises merging said number of registers into an expanded instruction that is used for remaining stages of instruction processing.
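
The table lookup and merge steps in this abstract can be modeled in a few lines of software. This is only an illustration, not the hardware mechanism claimed: the class name `IndirectionTable`, the tuple representation of an expanded instruction, and the choice of a 2-bit index field are all assumptions made for the example.

```python
from typing import List, Optional, Sequence, Tuple


class IndirectionTable:
    """Toy model: a short index field in the instruction selects a register combination."""

    def __init__(self, index_bits: int) -> None:
        # One slot per value the instruction's index field can hold.
        self.entries: List[Optional[Tuple[int, ...]]] = [None] * (1 << index_bits)

    def store(self, index: int, registers: Sequence[int]) -> None:
        """Model of the 'load table entry' instruction: record a register combination."""
        self.entries[index] = tuple(registers)

    def expand(self, opcode: str, index: int) -> Tuple:
        """Merge the referenced registers into an expanded instruction."""
        registers = self.entries[index]
        if registers is None:
            raise KeyError(f"table entry {index} has not been loaded")
        return (opcode,) + registers
```

With `index_bits = 2`, the instruction spends two bits on operands yet `expand("add", 1)` can name three registers drawn from a much larger register file, for example registers 200, 17 and 33.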

    Method and structure for high-performance linear algebra in the presence of limited outstanding miss slots
    4.
    Patent application
    Method and structure for high-performance linear algebra in the presence of limited outstanding miss slots (status: under examination, published)

    Publication No.: US20060168401A1

    Publication date: 2006-07-27

    Application No.: US11041935

    Filing date: 2005-01-26

    IPC classification: G06F12/00 G06F15/00

    Abstract: A method and structure of increasing computational efficiency in a computer that comprises at least one processing unit, a first memory device servicing the at least one processing unit, and at least one other memory device servicing the at least one processing unit. The first memory device has a memory line larger than an increment of data consumed by the at least one processing unit and has a pre-set number of allowable outstanding data misses before the processing unit is stalled. In a data retrieval responding to an allowable outstanding data miss, at least one additional data is included in a line of data retrieved from the at least one other memory device. The additional data comprises data that will prevent the pre-set number of outstanding data misses from being reached, reduce the chance that the pre-set number of outstanding data misses will be reached, or delay the time at which the pre-set number of outstanding data misses is reached.

    System and method for algorithmic cache-bypass
    5.
    Patent application
    System and method for algorithmic cache-bypass (status: under examination, published)

    Publication No.: US20060179240A1

    Publication date: 2006-08-10

    Application No.: US11052877

    Filing date: 2005-02-09

    IPC classification: G06F13/28

    CPC classification: G06F12/0897 G06F12/0888

    Abstract: A system for (and method of) algorithmic cache-bypass which includes acting on at least one level of cache to at least one of bypass the at least one level of cache, stream through the at least one level of cache, force utilization of at least one other level of cache, bypass at least one level of cache, bypass all levels of cache, force utilization of a main memory, and force utilization of an out-of-core memory.

    Scalable runtime system for global address space languages on shared and distributed memory machines
    6.
    Patent application
    Scalable runtime system for global address space languages on shared and distributed memory machines (status: in force)

    Publication No.: US20050149903A1

    Publication date: 2005-07-07

    Application No.: US10734690

    Filing date: 2003-12-12

    IPC classification: G06F9/44 G06F9/50

    CPC classification: G06F9/5016

    Abstract: An improved scalability runtime system for a global address space language running on a distributed or shared memory machine uses a directory of shared variables having a data structure for tracking shared variable information that is shared by a plurality of program threads. Allocation and de-allocation routines are used to allocate and de-allocate shared variable entries in the directory of shared variables. Different routines can be used to access different types of shared data. A control structure is used to control access to the shared data such that all threads can access the data at any time. Since all threads see the same objects, synchronization issues are eliminated. In addition, the improved efficiency of the data sharing allows the number of program threads to be vastly increased.
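
A directory of shared variables with allocation, de-allocation and access routines might look roughly like the sketch below. It is a single-process illustration only, not the runtime described in the application: the class `SharedVariableDirectory`, its integer handles, and the use of a `threading.Lock` as the control structure are all assumptions made for the example.

```python
import threading
from typing import Any, Dict, List


class SharedVariableDirectory:
    """Toy directory: every thread resolves a shared variable through the same handle."""

    def __init__(self) -> None:
        self._lock = threading.Lock()          # assumed stand-in for the control structure
        self._entries: Dict[int, Dict[str, Any]] = {}
        self._next_handle = 0

    def allocate(self, name: str, size: int) -> int:
        """Create a directory entry for a shared array and return its handle."""
        with self._lock:
            handle = self._next_handle
            self._next_handle += 1
            self._entries[handle] = {"name": name, "data": [0] * size}
            return handle

    def deallocate(self, handle: int) -> None:
        """Remove the entry; later accesses through the handle raise KeyError."""
        with self._lock:
            del self._entries[handle]

    def put(self, handle: int, index: int, value: Any) -> None:
        """Write one element of the shared array identified by handle."""
        with self._lock:
            self._entries[handle]["data"][index] = value

    def get(self, handle: int, index: int) -> Any:
        """Read one element of the shared array identified by handle."""
        with self._lock:
            return self._entries[handle]["data"][index]

    def names(self) -> List[str]:
        """List the names of all currently allocated shared variables."""
        with self._lock:
            return [entry["name"] for entry in self._entries.values()]
```

Because threads hold only handles and never raw addresses, the directory is free to decide where each variable's storage actually lives, which is the property a distributed implementation would rely on.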

    Vector processor with data swap and replication
    7.
    Patent application
    Vector processor with data swap and replication (status: expired)

    Publication No.: US20050102487A1

    Publication date: 2005-05-12

    Application No.: US10704214

    Filing date: 2003-11-07

    Abstract: A microprocessor includes a branch unit, a load/store unit (LSU), an arithmetic logic unit (ALU), and a vector unit to execute a vector instruction. The vector unit includes a vector register file having a primary vector register and a secondary vector register. The processor preferably further includes a first data bus and a second data bus wherein the first and second data busses couple the vector unit to the data memory. The vector unit includes a first input multiplexer enabling data on the first data bus to be provided to the primary register file or the secondary register file and a second input multiplexer, independent of the first input multiplexer enabling data on the second data bus to be provided to the second data bus. The first and second data busses may comprise first and second portions of a data memory bus.

    Vector unit in a processor enabled to replicate data on a first portion of a data bus to primary and secondary registers
    8.
    Granted patent
    Vector unit in a processor enabled to replicate data on a first portion of a data bus to primary and secondary registers (status: expired)

    Publication No.: US08200945B2

    Publication date: 2012-06-12

    Application No.: US10704214

    Filing date: 2003-11-07

    IPC classification: G06F9/30

    Abstract: A microprocessor includes a branch unit, a load/store unit (LSU), an arithmetic logic unit (ALU), and a vector unit to execute a vector instruction. The vector unit includes a vector register file having a primary vector register and a secondary vector register. The processor preferably further includes a first data bus and a second data bus wherein the first and second data busses couple the vector unit to the data memory. The vector unit includes a first input multiplexer enabling data on the first data bus to be provided to the primary register file or the secondary register file and a second input multiplexer, independent of the first input multiplexer enabling data on the second data bus to be provided to the second data bus. The first and second data busses may comprise first and second portions of a data memory bus.

    Scalable runtime system for global address space languages on shared and distributed memory machines
    10.
    Granted patent
    Scalable runtime system for global address space languages on shared and distributed memory machines (status: in force)

    Publication No.: US07380086B2

    Publication date: 2008-05-27

    Application No.: US10734690

    Filing date: 2003-12-12

    IPC classification: G06F12/00 G06F9/45 G06F9/46

    CPC classification: G06F9/5016

    Abstract: An improved scalability runtime system for a global address space language running on a distributed or shared memory machine uses a directory of shared variables having a data structure for tracking shared variable information that is shared by a plurality of program threads. Allocation and de-allocation routines are used to allocate and de-allocate shared variable entries in the directory of shared variables. Different routines can be used to access different types of shared data. A control structure is used to control access to the shared data such that all threads can access the data at any time. Since all threads see the same objects, synchronization issues are eliminated. In addition, the improved efficiency of the data sharing allows the number of program threads to be vastly increased.
