Abstract:
An optimizing compilation process generates executable code which defines the computation and communication actions that are to be taken by each individual processor of a computer having a distributed memory, parallel processor architecture to run a program written in a data-parallel language. To this end, local memory layouts of the one-dimensional and multidimensional arrays that are used in the program are derived from one-level and two-level data mappings consisting of alignment and distribution, so that array elements are laid out in canonical order and local memory space is conserved. Executable code is then generated to produce, at program run time, a set of tables for each individual processor for each computation requiring access to a regular section of an array, so that the entries of these tables specify the spacing between successive elements of said regular section resident in the local memory of said processor, and so that all the elements of said regular section can be located in a single pass through local memory using said tables. Further executable code is generated to produce, at program run time, another set of tables for each individual processor for each communication action requiring a given processor to transfer array data to another processor, so that the entries of these tables specify the identity of a destination processor to which the array data must be transferred and the location in said destination processor's local memory at which the array data must be stored, and so that all of said array data can be located in a single pass through local memory using these communication tables. Finally, executable node code is generated for each individual processor that uses the foregoing tables at program run time to perform the necessary computation and communication actions on each individual processor of the parallel computer.
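The spacing tables described above can be illustrated with a minimal sketch. It assumes a one-level mapping with identity alignment and a block-cyclic distribution of block size `k` over `P` processors; these parameters, and the names `local_address`, `owner`, `spacing_table`, and `walk`, are hypothetical and do not come from the abstract. The table is built here by brute-force enumeration of the section, which is not necessarily how the disclosed compiler constructs it.

```python
def local_address(i, k, P):
    """Local memory offset of global index i (assumed block-cyclic layout)."""
    # Each full round of k*P global elements contributes k local elements.
    return (i // (k * P)) * k + (i % k)


def owner(i, k, P):
    """Processor that holds global index i (assumed block-cyclic layout)."""
    return (i // k) % P


def spacing_table(lo, hi, stride, k, P, p):
    """Return (start, gaps) for processor p and the section lo:hi:stride.

    start is the local address of the first section element on p, and
    gaps[j] is the distance in local memory to the next one.
    Assumes stride > 0 and an inclusive upper bound hi.
    """
    addrs = [local_address(i, k, P)
             for i in range(lo, hi + 1, stride)
             if owner(i, k, P) == p]
    if not addrs:
        return None, []
    gaps = [b - a for a, b in zip(addrs, addrs[1:])]
    return addrs[0], gaps


def walk(start, gaps):
    """Visit every local element of the section in a single pass."""
    if start is None:
        return []
    visited = [start]
    for gap in gaps:
        visited.append(visited[-1] + gap)
    return visited
```

For example, with `k=4`, `P=2`, and the section `0:39:5`, processor 1 holds global elements 5, 15, 20, and 30, so `spacing_table(0, 39, 5, 4, 2, 1)` yields a start address of 1 and the gaps `[6, 1, 6]`.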
Abstract:
When a data-parallel language like Fortran 90 is compiled for a distributed-memory machine, aggregate data objects (such as arrays) are distributed across the processor memories. The mapping determines the amount of residual communication needed to bring operands of parallel operations into alignment with each other. A common approach is to break the mapping into two stages: first, an alignment that maps all the objects to an abstract template, and then a distribution that maps the template to the processors. This disclosure deals with two facets of the problem of finding alignments that reduce residual communication; namely, alignments that vary in loops, and objects that permit replicated alignments. It is shown that loop-dependent dynamic alignment is sometimes necessary for optimum performance, and algorithms are provided so that a compiler can determine good dynamic alignments for objects within "do" loops. Situations are also identified in which replicated alignment is either required by the program itself (via spread operations) or can be used to improve performance. An algorithm based on network flow is proposed for determining which objects to replicate so as to minimize the total amount of broadcast communication in replication.
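The two-stage mapping described above (alignment to a template, then distribution of the template) can be sketched as follows. This is only an illustration under stated assumptions: the alignment is taken to be affine (`stride * i + offset`) and the distribution is a plain block distribution; the function names and parameters are hypothetical, and the sketch does not cover the dynamic or replicated alignments that the disclosure addresses.

```python
def align(i, stride, offset):
    """Stage 1: map array index i to a template cell (assumed affine alignment)."""
    return stride * i + offset


def distribute_block(cell, template_size, num_procs):
    """Stage 2: map a template cell to a processor (assumed block distribution)."""
    block = -(-template_size // num_procs)  # ceiling division
    return cell // block


def owner_of(i, stride, offset, template_size, num_procs):
    """Compose both stages to find the processor that owns array element i."""
    cell = align(i, stride, offset)
    return distribute_block(cell, template_size, num_procs)
```

Two operands whose elements land on the same processor under `owner_of` need no residual communication for an element-wise operation; differing alignments are what force data movement.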
Abstract:
A system and method to extend the number of architecturally visible registers in a processor while preserving the number of bits of the instruction encoding. The system comprises: an indirection table that encodes register patterns for the registers used in an instruction; instructions to load and store such table entries; a mechanism to identify instructions that use the indirection table; and a mechanism to identify a set of bits in instructions that are used to index into the indirection table. According to another embodiment, a method of encoding registers in a computer instruction comprises constructing a table having a plurality of entries. Each entry specifies a combination of a plurality of registers. The method also comprises generating an instruction having a reference to one of the entries in the table. The method then comprises accessing the plurality of registers specified by the referenced table entry. The method further comprises merging said number of registers into an expanded instruction that is used for remaining stages of instruction processing.
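The table-based register encoding described above can be sketched in a few lines. The class and field names (`IndirectionTable`, `CompactInstr`, `ExpandedInstr`, `expand`) are hypothetical, and the sketch models only the lookup and merge steps, not any particular instruction set or hardware mechanism.

```python
from typing import NamedTuple, Tuple


class CompactInstr(NamedTuple):
    """Instruction as encoded: a few bits index the table instead of naming registers."""
    opcode: str
    table_index: int


class ExpandedInstr(NamedTuple):
    """Instruction after expansion, carrying full register numbers."""
    opcode: str
    registers: Tuple[int, ...]


class IndirectionTable:
    """Each entry holds one combination of register numbers."""

    def __init__(self, num_entries):
        self.entries = [()] * num_entries

    def store(self, index, registers):
        # Models the "store table entry" instruction.
        self.entries[index] = tuple(registers)

    def load(self, index):
        # Models the "load table entry" instruction.
        return self.entries[index]


def expand(instr, table):
    """Merge the registers named by the referenced entry into an expanded instruction."""
    return ExpandedInstr(instr.opcode, table.load(instr.table_index))
```

With a four-entry table, two bits in the instruction suffice to select a register combination, even when the register numbers themselves (for example 130 or 255) would not fit in the original register fields.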
Abstract:
A method and structure of increasing computational efficiency in a computer that comprises at least one processing unit, a first memory device servicing the at least one processing unit, and at least one other memory device servicing the at least one processing unit. The first memory device has a memory line larger than an increment of data consumed by the at least one processing unit and has a pre-set number of allowable outstanding data misses before the processing unit is stalled. In a data retrieval responding to an allowable outstanding data miss, at least one additional data is included in a line of data retrieved from the at least one other memory device. The additional data comprises data that will prevent the pre-set number of outstanding data misses from being reached, reduce the chance that the pre-set number of outstanding data misses will be reached, or delay the time at which the pre-set number of outstanding data misses is reached.
Abstract:
A system for (and method of) algorithmic cache-bypass which includes acting on at least one level of cache to at least one of bypass the at least one level of cache, stream through the at least one level of cache, force utilization of at least one other level of cache, bypass at least one level of cache, bypass all levels of cache, force utilization of a main memory, and force utilization of an out-of-core memory.
Abstract:
An improved scalability runtime system for a global address space language running on a distributed or shared memory machine uses a directory of shared variables having a data structure for tracking shared variable information that is shared by a plurality of program threads. Allocation and de-allocation routines are used to allocate and de-allocate shared variable entries in the directory of shared variables. Different routines can be used to access different types of shared data. A control structure is used to control access to the shared data such that all threads can access the data at any time. Since all threads see the same objects, synchronization issues are eliminated. In addition, the improved efficiency of the data sharing allows the number of program threads to be vastly increased.
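A directory of shared variables with allocation, de-allocation, and access routines might look like the sketch below. It is a single-process illustration only: the class name, the integer handles, and the use of a `threading.Lock` as the control structure are all assumptions, not details taken from the abstract.

```python
import threading


class SharedVariableDirectory:
    """Tracks shared variables by handle so that all threads see the same objects."""

    def __init__(self):
        self._entries = {}             # handle -> {"name": ..., "data": [...]}
        self._next_handle = 0
        self._lock = threading.Lock()  # assumed control structure

    def allocate(self, name, size):
        """Create a directory entry and return its handle."""
        with self._lock:
            handle = self._next_handle
            self._next_handle += 1
            self._entries[handle] = {"name": name, "data": [0] * size}
            return handle

    def deallocate(self, handle):
        """Remove the entry for the given handle."""
        with self._lock:
            del self._entries[handle]

    def read(self, handle, index):
        """Access routine for reading one element of a shared array."""
        with self._lock:
            return self._entries[handle]["data"][index]

    def write(self, handle, index, value):
        """Access routine for writing one element of a shared array."""
        with self._lock:
            self._entries[handle]["data"][index] = value
```

Because threads refer to shared data only through handles, the directory can manage where the data lives without changing the code that uses it.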
Abstract:
A microprocessor includes a branch unit, a load/store unit (LSU), an arithmetic logic unit (ALU), and a vector unit to execute a vector instruction. The vector unit includes a vector register file having a primary vector register and a secondary vector register. The processor preferably further includes a first data bus and a second data bus wherein the first and second data busses couple the vector unit to the data memory. The vector unit includes a first input multiplexer enabling data on the first data bus to be provided to the primary register file or the secondary register file and a second input multiplexer, independent of the first input multiplexer enabling data on the second data bus to be provided to the second data bus. The first and second data busses may comprise first and second portions of a data memory bus.
Abstract:
A method (and structure) of managing memory in which a low-level mechanism is executed to signal, in a sequence of instructions generated at a higher level, that at least a portion of a contiguous area of memory is permitted to be overwritten.