摘要:
A massively parallel processor is provided with a plurality of clusters. Each cluster includes a plurality of processor elements ("PEs") and a cluster memory. Each PE of the cluster has associated with it an address register, a stage register, an error register, a PE enable flag, a memory flag, and a grant request flag. A cluster data bus and an error bus connects each of the stage registers and error registers of the cluster to the memory. The grant request flags of the cluster are interconnected by a polling network, which polls only one of the grant request flags at a time. In response to a signal on the polling network and the state of the associated memory flag, the grant request flag determines an I/O operation between the associated data register and the cluster memory over the cluster data bus. Both data and error bits are associated with respective processor elements. The sequential memory operations proceed in parallel with parallel processor operations. The sequential memory operations also may be queued. Addressing modes include direct and indirect. In direct address mode, a PE addresses its own address space by appending its PE number to a broadcast partial address. The broadcast partial address is furnished over a broadcast bus, and the PE number is furnished on a cluster address bus. In indirect address mode, a PE addresses either its own address space or that of other PEs in its cluster by locally calculating a partial address, then appending to it either its own PE number or that of another PE in its cluster. The full address is furnished over the cluster address bus.
摘要:
A massively parallel computer system is disclosed having a global router network in which pipeline registers are spatially distributed to increase the messaging speed of the global router network. The global router network includes an expansion tap for processor to I/O messaging so that I/O messaging bandwidth matches interprocessor messaging bandwidth. A route-opening message packet includes protocol bits which are treated homogeneously with steering bits. The route-opening packet further includes redundant address bits for imparting a multiple-crossbars personality to router chips within the global router network. A structure and method for spatially supporting the processors of the massively parallel system and the global router network are also disclosed.
摘要:
A massively parallel computer system is disclosed having a global router network in which pipeline registers are spatially distributed to increase the messaging speed of the global router network. The global router network includes an expansion tap for processor to I/O messaging so that I/O messaging bandwidth matches interprocessor messaging bandwidth. A route-opening message packet includes protocol bits which are treated homogeneously with steering bits. The route-opening packet further includes redundant address bits for imparting a multiple-crossbars personality to router chips within the global router network. A structure and method for spatially supporting the processors of the massively parallel system and the global router network are also disclosed.
摘要:
To effect a block data transfer between a plurality of physical I/O devices coupled through interfaces to an I/O channel ("IOC") bus, a source logical device is established by programmably assigning to each of the physical device interfaces a logical device identifier, a leaf identifier determining when the physical device participates relative to the first data transfer in the block data transfer, a burst count specifying the number of consecutive transfers for which the physical device is responsible when its interleave period arrives, and an interleave factor identifying how often the physical device participates in the block data transfer. A destination logical device is similarly established. The source and logical devices are then activated to accomplish a block transfer of data between them. To permit different I/O processors to operate independently in making I/O requests, requests from each I/O processor are communicated to an IOC controller over another bus, which need not be a high performance bus, and are serviced to construct header packets in a transaction buffer identifying IOC transactions, including source and destination logical devices. When each packet is finished, the responsible I/O processor puts a pointer into a transaction queue, which is a FIFO register. Each IOC transaction is initiated as its corresponding pointer is popped from the transaction queue. Apparatus embodiments are disclosed as well.
摘要:
A massively parallel processor includes an array of processor elements (20), of PEs, and a multi-stage router interconnection network (30), which is used both for I/O communications and for PE to PE communications. The I/O system (10) for the massively parallel processor is based on a globally shared addressable I/O RAM buffer memory (50) that has address and data buses (52) to the I/O devices (80, 82) and other address and data buses (42) which are coupled to a router I/O element array (40). The router I/O element array is in turn coupled to the router ports (e.g. S2.sub.-- 0.sub.-- X0) of the second stage (430) of the router interconnection network. The router I/O array provides the corner turn conversion between the massive array of router lines (32) and the relatively few buses (52) to the I/O devices.
摘要:
A parallel processor system which operates in a single-instruction multiple-data mode has a highly flexible local control capability for enabling the system to operate fast. The system contains an array of processing elements or PEs (12.sub.1 -12.sub.N) that process respective sets of data according to instructions supplied from a global control unit (20). Each instruction is furnished simultaneously to all the PEs. One local control feature (52) entails selectively inverting certain instruction signals according to a data-dependent signal. Another local control feature (48) involves controlling the amount of a bit shift in a barrel shifter (98) according to a data-dependent signal.
摘要:
One embodiment of the present invention sets forth a technique for coalescing memory barrier operations across multiple parallel threads. Memory barrier requests from a given parallel thread processing unit are coalesced to reduce the impact to the rest of the system. Additionally, memory barrier requests may specify a level of a set of threads with respect to which the memory transactions are committed. For example, a first type of memory barrier instruction may commit the memory transactions to a level of a set of cooperating threads that share an L1 (level one) cache. A second type of memory barrier instruction may commit the memory transactions to a level of a set of threads sharing a global memory. Finally, a third type of memory barrier instruction may commit the memory transactions to a system level of all threads sharing all system memories. The latency required to execute the memory barrier instruction varies based on the type of memory barrier instruction.
摘要:
A parallel thread processor executes thread groups belonging to multiple cooperative thread arrays (CTAs). At each cycle of the parallel thread processor, an instruction scheduler selects a thread group to be issued for execution during a subsequent cycle. The instruction scheduler selects a thread group to issue for execution by (i) identifying a pool of available thread groups, (ii) identifying a CTA that has the greatest seniority value, and (iii) selecting the thread group that has the greatest credit value from within the CTA with the greatest seniority value.
摘要:
One embodiment of an instruction decoder includes an instruction parser configured to process a first non-operative instruction and to generate a first event signal corresponding to the first non-operative instruction, and a first event multiplexer configured to receive the first event signal from the instruction parser, to select the first event signal from one or more event signals and to transmit the first event signal to an event logic block. The instruction decoder may be implemented in a multithreaded processing unit, such as a shader unit, and the occurrences of the first event signal may be tracked when one or more threads are executed within the processing unit. The resulting event signal count may provide a designer with a better understanding of the behavior of a program, such as a shader program, executed within the processing unit, thereby facilitating overall processing unit and program design.
摘要:
Parallelism in a processor is exploited to permute a data set based on bit reversal of indices associated with data points in the data set. Permuted data can be stored in a memory having entries arranged in banks, where entries in different banks can be accessed in parallel. A destination location in the memory for a particular data point from the data set is determined based on the bit-reversed index associated with that data point. The bit-reversed index can be further modified so that at least some of the destination locations determined by different parallel processes are in different banks, allowing multiple points of the bit-reversed data set to be written in parallel.