Parallel processor memory transfer system using parallel transfers
between processors and staging registers and sequential transfers
between staging registers and memory
    1.
    Invention grant
    Status: Expired

    Publication No.: US5581777A

    Publication date: 1996-12-03

    Application No.: US400411

    Filing date: 1995-03-03

    Abstract: A massively parallel processor is provided with a plurality of clusters. Each cluster includes a plurality of processor elements ("PEs") and a cluster memory. Each PE of the cluster has associated with it an address register, a stage register, an error register, a PE enable flag, a memory flag, and a grant request flag. A cluster data bus and an error bus connect each of the stage registers and error registers of the cluster to the memory. The grant request flags of the cluster are interconnected by a polling network, which polls only one of the grant request flags at a time. In response to a signal on the polling network and the state of the associated memory flag, the grant request flag determines an I/O operation between the associated data register and the cluster memory over the cluster data bus. Both data and error bits are associated with respective processor elements. The sequential memory operations proceed in parallel with parallel processor operations. The sequential memory operations also may be queued. Addressing modes include direct and indirect. In direct address mode, a PE addresses its own address space by appending its PE number to a broadcast partial address. The broadcast partial address is furnished over a broadcast bus, and the PE number is furnished on a cluster address bus. In indirect address mode, a PE addresses either its own address space or that of other PEs in its cluster by locally calculating a partial address, then appending to it either its own PE number or that of another PE in its cluster. The full address is furnished over the cluster address bus.

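
The direct and indirect address formation described in the abstract can be sketched as follows. This is a minimal illustrative sketch, assuming a 4-bit PE number field appended as the low-order bits; the actual field widths and bit positions are not given in the abstract:

```python
# Hypothetical sketch of the two addressing modes described above.
# The 4-bit PE field width and low-order placement are assumptions,
# not details taken from the patent.

PE_BITS = 4  # assume 16 PEs per cluster

def direct_address(broadcast_partial, pe_number):
    """Direct mode: every PE appends its own PE number to a partial
    address broadcast to the whole cluster."""
    return (broadcast_partial << PE_BITS) | pe_number

def indirect_address(local_partial, target_pe):
    """Indirect mode: a PE computes a partial address locally, then
    appends either its own PE number or that of another PE."""
    return (local_partial << PE_BITS) | target_pe

# PE 5 resolving the broadcast partial address 0x2A:
addr = direct_address(0x2A, 5)
```

In both modes the low bits select the PE's slice of the cluster memory; the difference is only where the partial address comes from (broadcast bus vs. local calculation).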

    Scalable processor to processor and processor to I/O interconnection
network and method for parallel processing arrays
    2.
    Invention grant
    Status: Expired

    Publication No.: US5598408A

    Publication date: 1997-01-28

    Application No.: US182250

    Filing date: 1994-01-14

    CPC classification: G06F15/17393

    Abstract: A massively parallel computer system is disclosed having a global router network in which pipeline registers are spatially distributed to increase the messaging speed of the global router network. The global router network includes an expansion tap for processor to I/O messaging so that I/O messaging bandwidth matches interprocessor messaging bandwidth. A route-opening message packet includes protocol bits which are treated homogeneously with steering bits. The route-opening packet further includes redundant address bits for imparting a multiple-crossbars personality to router chips within the global router network. A structure and method for spatially supporting the processors of the massively parallel system and the global router network are also disclosed.


    Scalable processor to processor and processor-to-I/O interconnection
network and method for parallel processing arrays
    3.
    Invention grant
    Status: Expired

    Publication No.: US5280474A

    Publication date: 1994-01-18

    Application No.: US461492

    Filing date: 1990-01-05

    CPC classification: G06F15/17393

    Abstract: A massively parallel computer system is disclosed having a global router network in which pipeline registers are spatially distributed to increase the messaging speed of the global router network. The global router network includes an expansion tap for processor to I/O messaging so that I/O messaging bandwidth matches interprocessor messaging bandwidth. A route-opening message packet includes protocol bits which are treated homogeneously with steering bits. The route-opening packet further includes redundant address bits for imparting a multiple-crossbars personality to router chips within the global router network. A structure and method for spatially supporting the processors of the massively parallel system and the global router network are also disclosed.


    Broadcasting headers to configure physical devices interfacing a data
bus with a logical assignment and to effect block data transfers
between the configured logical devices
    4.
    Invention grant
    Status: Expired

    Publication No.: US5488694A

    Publication date: 1996-01-30

    Application No.: US937639

    Filing date: 1992-08-28

    IPC classification: G06F13/42 G06F13/00 G06F13/38

    CPC classification: G06F13/423

    Abstract: To effect a block data transfer between a plurality of physical I/O devices coupled through interfaces to an I/O channel ("IOC") bus, a source logical device is established by programmably assigning to each of the physical device interfaces a logical device identifier, a leaf identifier determining when the physical device participates relative to the first data transfer in the block data transfer, a burst count specifying the number of consecutive transfers for which the physical device is responsible when its interleave period arrives, and an interleave factor identifying how often the physical device participates in the block data transfer. A destination logical device is similarly established. The source and destination logical devices are then activated to accomplish a block transfer of data between them. To permit different I/O processors to operate independently in making I/O requests, requests from each I/O processor are communicated to an IOC controller over another bus, which need not be a high performance bus, and are serviced to construct header packets in a transaction buffer identifying IOC transactions, including source and destination logical devices. When each packet is finished, the responsible I/O processor puts a pointer into a transaction queue, which is a FIFO register. Each IOC transaction is initiated as its corresponding pointer is popped from the transaction queue. Apparatus embodiments are disclosed as well.

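
The header-packet and transaction-queue scheme in the abstract can be modeled in a few lines. This is an illustrative sketch only; the class name, field names, and dictionary-based buffer are assumptions standing in for the patent's hardware structures:

```python
# Toy model of the abstract's scheme: I/O processors build header
# packets in a transaction buffer, then enqueue pointers to them in a
# FIFO; each IOC transaction starts when its pointer is popped.
from collections import deque

class IOCController:
    def __init__(self):
        self.transaction_buffer = {}      # header packets, keyed by pointer
        self.transaction_queue = deque()  # FIFO register of pointers
        self._next_ptr = 0

    def post_request(self, source_dev, dest_dev, burst_count):
        """An I/O processor finishes a header packet in the transaction
        buffer, then puts a pointer to it into the FIFO queue."""
        ptr = self._next_ptr
        self._next_ptr += 1
        self.transaction_buffer[ptr] = {
            "source": source_dev, "dest": dest_dev, "burst": burst_count,
        }
        self.transaction_queue.append(ptr)
        return ptr

    def start_next_transaction(self):
        """Each IOC transaction is initiated as its pointer is popped."""
        ptr = self.transaction_queue.popleft()
        return self.transaction_buffer.pop(ptr)

ioc = IOCController()
ioc.post_request("disk0", "frame_buffer", burst_count=8)
txn = ioc.start_next_transaction()
```

Because each I/O processor only appends finished packets' pointers, the processors stay independent while the controller serializes transaction startup through the single FIFO.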

    Input/output system for parallel processing arrays
    5.
    Invention grant
    Status: Expired

    Publication No.: US5243699A

    Publication date: 1993-09-07

    Application No.: US802944

    Filing date: 1991-12-06

    IPC classification: G06F15/173 G06F15/80

    CPC classification: G06F15/8007 G06F15/17393

    Abstract: A massively parallel processor includes an array of processor elements (20), or PEs, and a multi-stage router interconnection network (30), which is used both for I/O communications and for PE-to-PE communications. The I/O system (10) for the massively parallel processor is based on a globally shared addressable I/O RAM buffer memory (50) that has address and data buses (52) to the I/O devices (80, 82) and other address and data buses (42) which are coupled to a router I/O element array (40). The router I/O element array is in turn coupled to the router ports (e.g., S2_0_X0) of the second stage (430) of the router interconnection network. The router I/O array provides the corner-turn conversion between the massive array of router lines (32) and the relatively few buses (52) to the I/O devices.

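
The "corner turn" the abstract mentions is the regrouping of data between many parallel router lines and a few I/O buses. The sketch below is an illustrative assumption of one such regrouping (round-robin assignment of lines to buses); the patent's actual conversion logic is not specified in the abstract:

```python
# Illustrative corner-turn sketch: one word per router line arrives in
# parallel; words are re-serialized onto a small number of I/O buses.
# The round-robin line-to-bus mapping is an assumption for clarity.

def corner_turn(router_words, num_buses):
    """Regroup one word per router line into per-bus transfer
    sequences, assigning router line i to bus i % num_buses."""
    buses = [[] for _ in range(num_buses)]
    for line, word in enumerate(router_words):
        buses[line % num_buses].append(word)
    return buses

# 8 router lines funneled onto 2 I/O buses:
seq = corner_turn([f"w{i}" for i in range(8)], num_buses=2)
```

Each bus then clocks its sequence out serially, trading the router's spatial parallelism for time on the narrow I/O side.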

    Parallel processor system with highly flexible local control capability,
including selective inversion of instruction signal and control of bit
shift amount
    6.
    Invention grant
    Status: Expired

    Publication No.: US5542074A

    Publication date: 1996-07-30

    Application No.: US965938

    Filing date: 1992-10-22

    IPC classification: G06F15/80 G06F15/76

    Abstract: A parallel processor system which operates in a single-instruction multiple-data mode has a highly flexible local control capability for enabling the system to operate fast. The system contains an array of processing elements, or PEs (12_1-12_N), that process respective sets of data according to instructions supplied from a global control unit (20). Each instruction is furnished simultaneously to all the PEs. One local control feature (52) entails selectively inverting certain instruction signals according to a data-dependent signal. Another local control feature (48) involves controlling the amount of a bit shift in a barrel shifter (98) according to a data-dependent signal.

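
The two local-control features in the abstract can be sketched as one PE's view of a broadcast instruction. All encodings here (16-bit word width, an XOR mask for the selective inversion, a rotate as the barrel-shift operation) are illustrative assumptions, not the patent's circuits:

```python
# Sketch of SIMD local control: every PE receives the same global
# instruction, but (a) may selectively invert instruction bits and
# (b) may override the barrel-shift amount, both driven by locally
# computed, data-dependent values. Encodings are assumptions.

WIDTH = 16  # assumed barrel-shifter word width

def barrel_rotate_left(value, amount):
    """Rotate a WIDTH-bit value left; a barrel shifter does this in
    one step regardless of the amount."""
    amount %= WIDTH
    mask = (1 << WIDTH) - 1
    return ((value << amount) | (value >> (WIDTH - amount))) & mask

def pe_execute(instr_bits, data, invert_mask, local_shift):
    """One PE's execution step. `invert_mask` and `local_shift` stand
    in for the data-dependent local control signals."""
    effective_instr = instr_bits ^ invert_mask  # selective inversion
    return barrel_rotate_left(data, local_shift), effective_instr

result, instr = pe_execute(0b1010, data=0x8001, invert_mask=0b0010,
                           local_shift=1)
```

Because the inversion and shift amount are decided per PE, the array can diverge in behavior without the global controller issuing different instructions, which is what keeps the SIMD broadcast fast.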

    Coalescing memory barrier operations across multiple parallel threads
    7.
    Invention grant
    Status: In force

    Publication No.: US09223578B2

    Publication date: 2015-12-29

    Application No.: US12887081

    Filing date: 2010-09-21

    IPC classification: G06F9/46 G06F9/38 G06F9/30

    Abstract: One embodiment of the present invention sets forth a technique for coalescing memory barrier operations across multiple parallel threads. Memory barrier requests from a given parallel thread processing unit are coalesced to reduce the impact to the rest of the system. Additionally, memory barrier requests may specify a level of a set of threads with respect to which the memory transactions are committed. For example, a first type of memory barrier instruction may commit the memory transactions to a level of a set of cooperating threads that share an L1 (level one) cache. A second type of memory barrier instruction may commit the memory transactions to a level of a set of threads sharing a global memory. Finally, a third type of memory barrier instruction may commit the memory transactions to a system level of all threads sharing all system memories. The latency required to execute the memory barrier instruction varies based on the type of memory barrier instruction.

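
The coalescing idea above can be modeled abstractly: many threads' barrier requests at the same scope level collapse into a single barrier operation. The level names below mirror the abstract's three scopes, but the set-per-level coalescing policy and class interface are illustrative assumptions:

```python
# Toy model of coalescing memory-barrier requests by scope level.
# LEVELS mirrors the abstract's three barrier types; the coalescing
# policy (one pending flag per level) is an assumption for clarity.

LEVELS = ("cta",     # cooperating threads sharing an L1 cache
          "global",  # threads sharing a global memory
          "system")  # all threads sharing all system memories

class BarrierCoalescer:
    def __init__(self):
        self.pending = set()

    def request(self, thread_id, level):
        """Record a barrier request; duplicate requests at the same
        level coalesce into one pending barrier."""
        assert level in LEVELS
        self.pending.add(level)

    def flush(self):
        """Issue one barrier per pending level, narrowest scope first
        (wider scopes cost more latency, per the abstract)."""
        issued = sorted(self.pending, key=LEVELS.index)
        self.pending.clear()
        return issued

bc = BarrierCoalescer()
for t in range(32):
    bc.request(t, "cta")   # 32 requests coalesce into one barrier
bc.request(0, "global")
ops = bc.flush()
```

The payoff matches the abstract's claim: 33 requests become 2 barrier operations, reducing the impact on the rest of the system.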

    Generating event signals for performance register control using non-operative instructions
    9.
    Invention grant
    Status: In force

    Publication No.: US07809928B1

    Publication date: 2010-10-05

    Application No.: US11313872

    Filing date: 2005-12-20

    IPC classification: G06F9/30 G06F17/00 G09G5/02

    Abstract: One embodiment of an instruction decoder includes an instruction parser configured to process a first non-operative instruction and to generate a first event signal corresponding to the first non-operative instruction, and a first event multiplexer configured to receive the first event signal from the instruction parser, to select the first event signal from one or more event signals and to transmit the first event signal to an event logic block. The instruction decoder may be implemented in a multithreaded processing unit, such as a shader unit, and the occurrences of the first event signal may be tracked when one or more threads are executed within the processing unit. The resulting event signal count may provide a designer with a better understanding of the behavior of a program, such as a shader program, executed within the processing unit, thereby facilitating overall processing unit and program design.


    Bit reversal methods for a parallel processor
    10.
    Invention grant
    Status: In force

    Publication No.: US07640284B1

    Publication date: 2009-12-29

    Application No.: US11424514

    Filing date: 2006-06-15

    IPC classification: G06F17/14

    CPC classification: G06F17/142 G06F7/76

    Abstract: Parallelism in a processor is exploited to permute a data set based on bit reversal of indices associated with data points in the data set. Permuted data can be stored in a memory having entries arranged in banks, where entries in different banks can be accessed in parallel. A destination location in the memory for a particular data point from the data set is determined based on the bit-reversed index associated with that data point. The bit-reversed index can be further modified so that at least some of the destination locations determined by different parallel processes are in different banks, allowing multiple points of the bit-reversed data set to be written in parallel.

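
The bit-reversed permutation with a bank-aware destination can be sketched directly. The bit-reversal itself is standard (it is the index permutation used by radix-2 FFTs); the XOR-based bank swizzle and the bank count below are illustrative assumptions, not the patent's exact modification:

```python
# Sketch of a bit-reversed permutation with a bank swizzle: indices
# are bit-reversed, then the destination is modified so parallel
# writers tend to hit distinct banks. The XOR swizzle and 4-bank
# layout are assumptions for illustration.

def bit_reverse(index, bits):
    """Reverse the low `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def destination(index, bits, num_banks):
    """Bit-reverse the index, then XOR-mix the original index into the
    bank selection so consecutive lanes spread across banks."""
    rev = bit_reverse(index, bits)
    bank = (rev ^ index) % num_banks
    offset = rev // num_banks
    return bank, offset

# The classic 3-bit bit-reversal permutation of indices 0..7:
perm = [bit_reverse(i, 3) for i in range(8)]
```

Without the swizzle, parallel lanes writing bit-reversed addresses often collide in the same bank (their destinations share low-order bits); mixing the original index into the bank selection breaks that pattern so the writes can proceed in parallel.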