Parallel processor memory transfer system using parallel transfers between processors and staging registers and sequential transfers between staging registers and memory
    1.
    Granted patent (Expired)

    Publication number: US5581777A

    Publication date: 1996-12-03

    Application number: US400411

    Filing date: 1995-03-03

    Abstract: A massively parallel processor is provided with a plurality of clusters. Each cluster includes a plurality of processor elements ("PEs") and a cluster memory. Each PE of the cluster has associated with it an address register, a stage register, an error register, a PE enable flag, a memory flag, and a grant request flag. A cluster data bus and an error bus connect each of the stage registers and error registers of the cluster to the memory. The grant request flags of the cluster are interconnected by a polling network, which polls only one of the grant request flags at a time. In response to a signal on the polling network and the state of the associated memory flag, the grant request flag determines an I/O operation between the associated data register and the cluster memory over the cluster data bus. Both data and error bits are associated with respective processor elements. The sequential memory operations proceed in parallel with parallel processor operations and may also be queued. Addressing modes include direct and indirect. In direct address mode, a PE addresses its own address space by appending its PE number to a broadcast partial address. The broadcast partial address is furnished over a broadcast bus, and the PE number is furnished on a cluster address bus. In indirect address mode, a PE addresses either its own address space or that of other PEs in its cluster by locally calculating a partial address, then appending to it either its own PE number or that of another PE in its cluster. The full address is furnished over the cluster address bus.
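
    Illustration: both addressing modes above reduce to appending a PE number to a partial address. A minimal sketch, assuming 16 PEs per cluster and the PE number occupying the low-order address bits (both are assumptions; the abstract fixes neither the field widths nor the bit placement), with all function names hypothetical:

        #include <cstdint>
        #include <cstdio>

        // Hypothetical cluster size: 16 PEs -> 4 PE-number bits.
        constexpr unsigned PE_BITS = 4;

        // Direct mode: every PE combines the same broadcast partial address
        // with its own PE number, so each PE reaches its own memory slice.
        uint32_t direct_address(uint32_t broadcast_partial, uint32_t my_pe) {
            return (broadcast_partial << PE_BITS) | my_pe;
        }

        // Indirect mode: the PE computes the partial address locally and may
        // append either its own PE number or that of another PE in its cluster.
        uint32_t indirect_address(uint32_t local_partial, uint32_t target_pe) {
            return (local_partial << PE_BITS) | target_pe;
        }

        int main() {
            printf("direct  : 0x%x\n", direct_address(0x12, 5));   // PE 5, own space
            printf("indirect: 0x%x\n", indirect_address(0x34, 9)); // targeting PE 9
            return 0;
        }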

    Input/output system for parallel processing arrays
    2.
    Granted patent (Expired)

    Publication number: US5243699A

    Publication date: 1993-09-07

    Application number: US802944

    Filing date: 1991-12-06

    IPC classification: G06F15/173 G06F15/80

    CPC classification: G06F15/8007 G06F15/17393

    Abstract: A massively parallel processor includes an array of processor elements (20), or PEs, and a multi-stage router interconnection network (30), which is used both for I/O communications and for PE-to-PE communications. The I/O system (10) for the massively parallel processor is based on a globally shared addressable I/O RAM buffer memory (50) that has address and data buses (52) to the I/O devices (80, 82) and other address and data buses (42) which are coupled to a router I/O element array (40). The router I/O element array is in turn coupled to the router ports (e.g., S2_0_X0) of the second stage (430) of the router interconnection network. The router I/O array provides the corner-turn conversion between the massive array of router lines (32) and the relatively few buses (52) to the I/O devices.
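
    Illustration: the "corner turn" amounts to a transpose between many narrow per-line streams and a few wide bus transfers. A minimal sketch under that reading (the array sizes and names are illustrative, not taken from the patent):

        #include <cstdio>

        // Toy corner turn: R router lines each deliver W values over time;
        // the I/O side wants W bus beats, each carrying one value per line.
        constexpr int R = 4;   // router lines (the real array is far larger)
        constexpr int W = 3;   // words delivered per line

        void corner_turn(int in[R][W], int out[W][R]) {
            for (int line = 0; line < R; ++line)
                for (int word = 0; word < W; ++word)
                    out[word][line] = in[line][word];   // transpose
        }

        int main() {
            int in[R][W], out[W][R];
            for (int l = 0; l < R; ++l)
                for (int w = 0; w < W; ++w)
                    in[l][w] = 10 * l + w;              // tag each value with its origin
            corner_turn(in, out);
            for (int w = 0; w < W; ++w) {
                for (int l = 0; l < R; ++l) printf("%3d ", out[w][l]);
                printf("\n");
            }
            return 0;
        }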

Parallel processor system with highly flexible local control capability, including selective inversion of instruction signal and control of bit shift amount
    3.
    Granted patent (Expired)

    Publication number: US5542074A

    Publication date: 1996-07-30

    Application number: US965938

    Filing date: 1992-10-22

    IPC classification: G06F15/80 G06F15/76

    Abstract: A parallel processor system which operates in a single-instruction multiple-data mode has a highly flexible local control capability for enabling the system to operate fast. The system contains an array of processing elements, or PEs (12_1-12_N), that process respective sets of data according to instructions supplied from a global control unit (20). Each instruction is furnished simultaneously to all the PEs. One local control feature (52) entails selectively inverting certain instruction signals according to a data-dependent signal. Another local control feature (48) involves controlling the amount of a bit shift in a barrel shifter (98) according to a data-dependent signal.
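
    Illustration: selectively inverting instruction bits under a data-dependent signal is a controlled XOR, and a data-dependent barrel shift is a rotate whose amount comes from local data. A minimal CUDA sketch with one thread standing in for each PE (the kernel name, masks, and widths are illustrative):

        #include <cstdint>
        #include <cstdio>

        // The "instruction" is broadcast to all threads; each PE's own data
        // decides (a) whether selected instruction bits are inverted and
        // (b) the barrel-shift amount.
        __global__ void local_control_sketch(const uint32_t* data, uint32_t* out,
                                             uint32_t broadcast_insn,
                                             uint32_t invert_mask) {
            int pe = blockIdx.x * blockDim.x + threadIdx.x;
            uint32_t d = data[pe];

            // Selective inversion: a data-dependent condition XORs the masked
            // instruction bits for this PE only.
            uint32_t local_insn = (d & 1u) ? (broadcast_insn ^ invert_mask)
                                           : broadcast_insn;

            // Data-dependent barrel shift: rotate the operand left by an amount
            // taken from the PE's own data rather than from the instruction.
            uint32_t amount  = d & 31u;
            uint32_t rotated = (d << amount) | (d >> ((32u - amount) & 31u));

            out[pe] = local_insn ^ rotated;   // combine so both effects are visible
        }

        int main() {
            const int N = 8;
            uint32_t h_data[N], h_out[N];
            for (int i = 0; i < N; ++i) h_data[i] = 0x10u + i;

            uint32_t *d_data, *d_out;
            cudaMalloc(&d_data, N * sizeof(uint32_t));
            cudaMalloc(&d_out,  N * sizeof(uint32_t));
            cudaMemcpy(d_data, h_data, N * sizeof(uint32_t), cudaMemcpyHostToDevice);

            local_control_sketch<<<1, N>>>(d_data, d_out, 0xA5u, 0x0Fu);
            cudaMemcpy(h_out, d_out, N * sizeof(uint32_t), cudaMemcpyDeviceToHost);

            for (int i = 0; i < N; ++i) printf("PE %d -> 0x%08x\n", i, h_out[i]);
            cudaFree(d_data); cudaFree(d_out);
            return 0;
        }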

Scalable processor to processor and processor to I/O interconnection network and method for parallel processing arrays
    4.
    Granted patent (Expired)

    Publication number: US5598408A

    Publication date: 1997-01-28

    Application number: US182250

    Filing date: 1994-01-14

    CPC classification: G06F15/17393

    Abstract: A massively parallel computer system is disclosed having a global router network in which pipeline registers are spatially distributed to increase the messaging speed of the global router network. The global router network includes an expansion tap for processor-to-I/O messaging so that I/O messaging bandwidth matches interprocessor messaging bandwidth. A route-opening message packet includes protocol bits which are treated homogeneously with steering bits. The route-opening packet further includes redundant address bits for imparting a multiple-crossbars personality to router chips within the global router network. A structure and method for spatially supporting the processors of the massively parallel system and the global router network are also disclosed.
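
    Illustration: one way to read "protocol bits treated homogeneously with steering bits" is that every router stage strips a fixed-width field off the front of the same serial header, and the protocol bits simply ride behind the last steering field. A minimal sketch under that reading (all widths and field names are illustrative; the redundant address bits are omitted):

        #include <cstdint>
        #include <cstdio>

        constexpr unsigned STAGES        = 3;  // router stages to traverse
        constexpr unsigned BITS_PER_HOP  = 4;  // steering bits consumed per stage
        constexpr unsigned PROTOCOL_BITS = 2;  // e.g. message-type bits

        // Pack one steering field per stage, then the protocol bits, LSB-first.
        uint32_t build_header(const unsigned steer[STAGES], unsigned protocol) {
            uint32_t header = 0;
            unsigned pos = 0;
            for (unsigned s = 0; s < STAGES; ++s, pos += BITS_PER_HOP)
                header |= (steer[s] & ((1u << BITS_PER_HOP) - 1)) << pos;
            header |= (protocol & ((1u << PROTOCOL_BITS) - 1)) << pos;
            return header;
        }

        int main() {
            unsigned steer[STAGES] = {0x5, 0xA, 0x3};
            uint32_t h = build_header(steer, 0x2);
            // Each stage peels off its own field and forwards the rest unchanged,
            // so the protocol bits need no special handling along the way.
            for (unsigned s = 0; s < STAGES; ++s) {
                printf("stage %u sees steering field 0x%x\n",
                       s, h & ((1u << BITS_PER_HOP) - 1));
                h >>= BITS_PER_HOP;
            }
            printf("protocol bits at the destination: 0x%x\n", h);
            return 0;
        }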

Scalable processor to processor and processor-to-I/O interconnection network and method for parallel processing arrays
    5.
    Granted patent (Expired)

    Publication number: US5280474A

    Publication date: 1994-01-18

    Application number: US461492

    Filing date: 1990-01-05

    CPC classification: G06F15/17393

    Abstract: A massively parallel computer system is disclosed having a global router network in which pipeline registers are spatially distributed to increase the messaging speed of the global router network. The global router network includes an expansion tap for processor-to-I/O messaging so that I/O messaging bandwidth matches interprocessor messaging bandwidth. A route-opening message packet includes protocol bits which are treated homogeneously with steering bits. The route-opening packet further includes redundant address bits for imparting a multiple-crossbars personality to router chips within the global router network. A structure and method for spatially supporting the processors of the massively parallel system and the global router network are also disclosed.

    Coalescing memory barrier operations across multiple parallel threads
    6.
    Granted patent (In force)

    Publication number: US09223578B2

    Publication date: 2015-12-29

    Application number: US12887081

    Filing date: 2010-09-21

    IPC classification: G06F9/46 G06F9/38 G06F9/30

    Abstract: One embodiment of the present invention sets forth a technique for coalescing memory barrier operations across multiple parallel threads. Memory barrier requests from a given parallel thread processing unit are coalesced to reduce the impact to the rest of the system. Additionally, memory barrier requests may specify a level of a set of threads with respect to which the memory transactions are committed. For example, a first type of memory barrier instruction may commit the memory transactions to a level of a set of cooperating threads that share an L1 (level one) cache. A second type of memory barrier instruction may commit the memory transactions to a level of a set of threads sharing a global memory. Finally, a third type of memory barrier instruction may commit the memory transactions to a system level of all threads sharing all system memories. The latency required to execute the memory barrier instruction varies based on the type of memory barrier instruction.
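
    Illustration: the three commitment levels described above line up with the three fence scopes that CUDA C++ exposes to programs. The sketch shows only the scopes; the coalescing of barrier requests happens below this level and is not visible in source code:

        #include <cstdio>

        __device__ int payload;
        __device__ int flag;

        // One producer thread publishes a value and then raises a flag. The
        // fence chosen decides which observers are guaranteed to see the
        // payload before the flag; real code would pick exactly one scope,
        // all three appear here only for comparison.
        __global__ void publish(int value) {
            if (blockIdx.x == 0 && threadIdx.x == 0) {
                payload = value;
                __threadfence_block();   // ordering for threads in the same
                                         // block (the shared-L1 level)
                __threadfence();         // ordering for every thread on the
                                         // device (the global-memory level)
                __threadfence_system();  // ordering for the whole system,
                                         // including the host
                flag = 1;
            }
        }

        int main() {
            publish<<<1, 32>>>(42);
            cudaDeviceSynchronize();
            printf("published\n");
            return 0;
        }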

    Generating event signals for performance register control using non-operative instructions
    8.
    Granted patent (In force)

    Publication number: US07809928B1

    Publication date: 2010-10-05

    Application number: US11313872

    Filing date: 2005-12-20

    IPC classification: G06F9/30 G06F17/00 G09G5/02

    Abstract: One embodiment of an instruction decoder includes an instruction parser configured to process a first non-operative instruction and to generate a first event signal corresponding to the first non-operative instruction, and a first event multiplexer configured to receive the first event signal from the instruction parser, to select the first event signal from one or more event signals, and to transmit the first event signal to an event logic block. The instruction decoder may be implemented in a multithreaded processing unit, such as a shader unit, and the occurrences of the first event signal may be tracked when one or more threads are executed within the processing unit. The resulting event signal count may provide a designer with a better understanding of the behavior of a program, such as a shader program, executed within the processing unit, thereby facilitating overall processing unit and program design.
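
    Illustration: a small software model of the decode path described above: designated no-op instructions carry an event id, the parser turns them into event signals, and a multiplexer-like selection decides which event increments a performance counter. The encoding, field names, and selection scheme are all illustrative:

        #include <cstdint>
        #include <cstdio>
        #include <vector>

        // Toy encoding: top byte is the opcode; non-operative instructions
        // carry an event id in the low byte.
        constexpr uint8_t OP_NOP = 0x00;
        constexpr uint8_t OP_ADD = 0x01;

        struct EventCounter {
            unsigned selected_event;   // the event id the multiplexer passes through
            unsigned count;            // performance register
            void on_event(unsigned event_id) {
                if (event_id == selected_event) ++count;   // event logic block
            }
        };

        // "Instruction parser": decodes each word and, for non-operative
        // instructions, raises the event signal encoded in the instruction.
        void decode_stream(const std::vector<uint32_t>& insns, EventCounter& perf) {
            for (uint32_t insn : insns) {
                uint8_t opcode = insn >> 24;
                if (opcode == OP_NOP)
                    perf.on_event(insn & 0xFFu);  // no architectural side effect
                // ...normal execution of the other opcodes would go here...
            }
        }

        int main() {
            EventCounter perf{7, 0};               // count occurrences of event 7
            std::vector<uint32_t> program = {
                (uint32_t)OP_ADD << 24,
                ((uint32_t)OP_NOP << 24) | 7u,     // e.g. marks the start of a region
                (uint32_t)OP_ADD << 24,
                ((uint32_t)OP_NOP << 24) | 7u,     // marks the end of the region
            };
            decode_stream(program, perf);
            printf("event 7 seen %u times\n", perf.count);
            return 0;
        }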

    Bit reversal methods for a parallel processor
    9.
    Granted patent (In force)

    Publication number: US07640284B1

    Publication date: 2009-12-29

    Application number: US11424514

    Filing date: 2006-06-15

    IPC classification: G06F17/14

    CPC classification: G06F17/142 G06F7/76

    Abstract: Parallelism in a processor is exploited to permute a data set based on bit reversal of indices associated with data points in the data set. Permuted data can be stored in a memory having entries arranged in banks, where entries in different banks can be accessed in parallel. A destination location in the memory for a particular data point from the data set is determined based on the bit-reversed index associated with that data point. The bit-reversed index can be further modified so that at least some of the destination locations determined by different parallel processes are in different banks, allowing multiple points of the bit-reversed data set to be written in parallel.
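
    Illustration: the core of the permutation is a bit-reversed destination index per element. CUDA exposes a 32-bit reversal intrinsic, so reversing an n-bit index is a full reversal followed by a shift. The bank-skewing modification described above (adjusting destinations so parallel writes land in different banks) is omitted; this sketch shows only the basic permutation:

        #include <cstdio>

        // Scatter each element to the bit-reversed position of its index.
        // log2n is the number of significant index bits (n = 2^log2n points).
        __global__ void bit_reverse_permute(const float* in, float* out, unsigned log2n) {
            unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < (1u << log2n)) {
                unsigned rev = __brev(i) >> (32u - log2n);   // reverse the low log2n bits
                out[rev] = in[i];
            }
        }

        int main() {
            const unsigned log2n = 3, n = 1u << log2n;       // 8 points for the demo
            float h_in[n], h_out[n];
            for (unsigned i = 0; i < n; ++i) h_in[i] = (float)i;

            float *d_in, *d_out;
            cudaMalloc(&d_in,  n * sizeof(float));
            cudaMalloc(&d_out, n * sizeof(float));
            cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

            bit_reverse_permute<<<1, n>>>(d_in, d_out, log2n);
            cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

            for (unsigned i = 0; i < n; ++i)
                printf("out[%u] = %g\n", i, h_out[i]);       // 0 4 2 6 1 5 3 7
            cudaFree(d_in); cudaFree(d_out);
            return 0;
        }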

    SYSTEMS AND METHODS FOR COALESCING MEMORY ACCESSES OF PARALLEL THREADS
    10.
    Published application (In force)

    Publication number: US20090240895A1

    Publication date: 2009-09-24

    Application number: US12054330

    Filing date: 2008-03-24

    IPC classification: G06F12/00

    Abstract: One embodiment of the present invention sets forth a technique for efficiently and flexibly performing coalesced memory accesses for a thread group. For each read application request that services a thread group, the core interface generates one pending request table (PRT) entry and one or more memory access requests. The core interface determines the number of memory access requests and the size of each memory access request based on the spread of the memory access addresses in the application request. Each memory access request specifies the particular threads that the memory access request services. The PRT entry tracks the number of pending memory access requests. As the memory interface completes each memory access request, the core interface uses information in the memory access request and the corresponding PRT entry to route the returned data. When all the memory access requests associated with a particular PRT entry are complete, the core interface satisfies the corresponding application request and frees the PRT entry.
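
    Illustration: the flow described above can be modeled as grouping a thread group's addresses by memory segment, issuing one memory access request per distinct segment, and recording the outstanding count in a pending request table (PRT) entry. A minimal host-side model (the segment size, structure layout, and names are illustrative):

        #include <cstdint>
        #include <cstdio>
        #include <set>
        #include <vector>

        constexpr uint64_t SEGMENT_BYTES = 128;   // illustrative coalescing granularity

        // One PRT entry per application request: here it only tracks how many
        // memory access requests are still outstanding.
        struct PrtEntry {
            unsigned pending;
        };

        // Issue one memory access request per distinct segment touched by the
        // thread group and record the count in the PRT entry.
        PrtEntry coalesce(const std::vector<uint64_t>& thread_addresses) {
            std::set<uint64_t> segments;
            for (uint64_t addr : thread_addresses)
                segments.insert(addr / SEGMENT_BYTES);
            for (uint64_t seg : segments)
                printf("memory access request for segment 0x%llx\n",
                       (unsigned long long)(seg * SEGMENT_BYTES));
            return PrtEntry{(unsigned)segments.size()};
        }

        int main() {
            // 8 threads reading consecutive 4-byte words: one segment, one request.
            std::vector<uint64_t> unit_stride;
            for (int t = 0; t < 8; ++t) unit_stride.push_back(0x1000 + 4 * t);
            printf("PRT entry A: %u pending request(s)\n\n", coalesce(unit_stride).pending);

            // The same threads with a 256-byte stride: eight segments, eight requests.
            std::vector<uint64_t> strided;
            for (int t = 0; t < 8; ++t) strided.push_back(0x1000 + 256 * t);
            printf("PRT entry B: %u pending request(s)\n", coalesce(strided).pending);
            return 0;
        }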