Distributed stream output in a parallel processing unit
    2.
    Type: Granted invention patent
    Status: In force

    Publication number: US08817031B2

    Publication date: 2014-08-26

    Application number: US12894001

    Application date: 2010-09-29

    IPC classification: G06F15/80

    CPC classification: G06T1/00

    Abstract: A technique for performing stream output operations in a parallel processing system is disclosed. A stream synchronization unit is provided that enables the parallel processing unit to track batches of vertices being processed in a graphics processing pipeline. A plurality of stream output units is also provided, where each stream output unit writes vertex attribute data to one or more stream output buffers for a portion of the batches of vertices. A messaging protocol is implemented between the stream synchronization unit and the plurality of stream output units that ensures that each of the stream output units writes vertex attribute data for the particular batch of vertices distributed to that particular stream output unit in the same order in the stream output buffers as the order in which the batch of vertices was received from a device driver by the parallel processing unit.
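
    The ordering guarantee in this abstract can be illustrated with a small model. This is a minimal sketch under stated assumptions, not the protocol claimed in the patent: the class `StreamSyncUnit`, its methods `register_batch` and `report_size`, and the helper `run` are hypothetical names, and the "messages" are plain function calls. Each batch gets a sequence number on arrival, and a buffer offset is granted only once every earlier batch has reported its size, so the buffer ends up in arrival order even when batches finish out of order.

```python
from dataclasses import dataclass, field


@dataclass
class StreamSyncUnit:
    """Hypothetical model: grants buffer offsets in batch arrival order."""
    next_seq: int = 0       # sequence number for the next arriving batch
    next_grant: int = 0     # lowest sequence number not yet granted an offset
    write_offset: int = 0   # next free slot in the stream output buffer
    pending: dict = field(default_factory=dict)  # seq -> reported vertex count

    def register_batch(self) -> int:
        """Record a batch in arrival order and return its sequence number."""
        seq = self.next_seq
        self.next_seq += 1
        return seq

    def report_size(self, seq: int, count: int) -> dict:
        """A stream output unit reports the vertex count of batch `seq`.

        Returns {seq: offset} for every batch whose offset is now known,
        which may include later batches that were waiting on this one.
        """
        self.pending[seq] = count
        grants = {}
        while self.next_grant in self.pending:
            grants[self.next_grant] = self.write_offset
            self.write_offset += self.pending.pop(self.next_grant)
            self.next_grant += 1
        return grants


def run(batches, completion_order):
    """Write all batches to one buffer, completing in `completion_order`."""
    sync = StreamSyncUnit()
    seqs = [sync.register_batch() for _ in batches]
    buffer = [None] * sum(len(b) for b in batches)
    for idx in completion_order:
        grants = sync.report_size(seqs[idx], len(batches[idx]))
        for granted_seq, offset in grants.items():
            data = batches[granted_seq]
            buffer[offset:offset + len(data)] = data
    return buffer
```

    In this model a batch that completes early simply waits for its offset; the real hardware presumably overlaps more work than that.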

    DISTRIBUTED STREAM OUTPUT IN A PARALLEL PROCESSING UNIT
    3.
    Type: Invention patent application
    Status: In force

    Publication number: US20110141122A1

    Publication date: 2011-06-16

    Application number: US12894001

    Application date: 2010-09-29

    IPC classification: G06F15/80

    CPC classification: G06T1/00

    Abstract: A technique for performing stream output operations in a parallel processing system is disclosed. A stream synchronization unit is provided that enables the parallel processing unit to track batches of vertices being processed in a graphics processing pipeline. A plurality of stream output units is also provided, where each stream output unit writes vertex attribute data to one or more stream output buffers for a portion of the batches of vertices. A messaging protocol is implemented between the stream synchronization unit and the plurality of stream output units that ensures that each of the stream output units writes vertex attribute data for the particular batch of vertices distributed to that particular stream output unit in the same order in the stream output buffers as the order in which the batch of vertices was received from a device driver by the parallel processing unit.

    Distributing primitives to multiple rasterizers
    5.
    Type: Granted invention patent
    Status: In force

    Publication number: US09536341B1

    Publication date: 2017-01-03

    Application number: US12581746

    Application date: 2009-10-19

    IPC classification: G06F15/80 G06T15/00

    CPC classification: G06T15/005 G06T2210/52

    Abstract: One embodiment of the present invention sets forth a technique for parallel distribution of primitives to multiple rasterizers. Multiple, independent geometry units perform geometry processing concurrently on different graphics primitives. A primitive distribution scheme delivers primitives from the multiple geometry units concurrently to multiple rasterizers at rates of multiple primitives per clock. The multiple, independent rasterizer units perform rasterization concurrently on one or more graphics primitives, enabling the rendering of multiple primitives per system clock.

    Computing tessellation coordinates using dedicated hardware
    6.
    Type: Granted invention patent
    Status: In force

    Publication number: US08599202B1

    Publication date: 2013-12-03

    Application number: US12240390

    Application date: 2008-09-29

    IPC classification: G06T15/30

    CPC classification: G06T17/20 G06T15/005

    Abstract: A system and method for performing tessellation of three-dimensional surface patches performs some tessellation operations using programmable processing units and other tessellation operations using fixed function units with limited precision. (u,v) parameter coordinates for each vertex are computed using fixed function units to offload programmable processing engines. The (u,v) computation is a symmetric operation and is based on integer coordinates of the vertex, tessellation level of detail values, and a spacing mode.
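
    One way to read "symmetric operation" is that a vertex at index i from one end of an edge gets exactly the mirrored parameter of the vertex at index i from the other end. The sketch below illustrates that reading only. It is an assumption, not the patent's method: it handles equal spacing alone (the spacing mode is omitted), uses floating point rather than limited-precision fixed-function arithmetic, and the names `tess_coord` and `tess_uv` are hypothetical.

```python
def tess_coord(i: int, n: int) -> float:
    """Parameter in [0, 1] for integer index i on an edge cut into n equal segments.

    Indices past the midpoint are computed by mirroring from the far end, so
    tess_coord(n - i, n) is exactly 1.0 - tess_coord(i, n) for the lower half.
    """
    if n < 1 or not 0 <= i <= n:
        raise ValueError("require n >= 1 and 0 <= i <= n")
    if 2 * i <= n:
        return i / n
    return 1.0 - (n - i) / n


def tess_uv(ix: int, iy: int, lod_u: int, lod_v: int) -> tuple:
    """(u, v) for integer vertex coordinates and per-axis tessellation levels."""
    return tess_coord(ix, lod_u), tess_coord(iy, lod_v)
```

    Mirroring matters because two patches that share an edge may walk it in opposite directions; bit-identical parameters on both sides avoid cracks along the seam.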

    Coalescing memory barrier operations across multiple parallel threads
    7.
    Type: Granted invention patent
    Status: In force

    Publication number: US09223578B2

    Publication date: 2015-12-29

    Application number: US12887081

    Application date: 2010-09-21

    IPC classification: G06F9/46 G06F9/38 G06F9/30

    Abstract: One embodiment of the present invention sets forth a technique for coalescing memory barrier operations across multiple parallel threads. Memory barrier requests from a given parallel thread processing unit are coalesced to reduce the impact to the rest of the system. Additionally, memory barrier requests may specify a level of a set of threads with respect to which the memory transactions are committed. For example, a first type of memory barrier instruction may commit the memory transactions to a level of a set of cooperating threads that share an L1 (level one) cache. A second type of memory barrier instruction may commit the memory transactions to a level of a set of threads sharing a global memory. Finally, a third type of memory barrier instruction may commit the memory transactions to a system level of all threads sharing all system memories. The latency required to execute the memory barrier instruction varies based on the type of memory barrier instruction.
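
    The three barrier levels and the idea of coalescing can be modelled in a few lines. This is a sketch under assumptions, not the mechanism claimed in the patent: the `Scope` enum, the function `coalesce_barriers`, and the policy of merging all pending requests into a single request at the widest requested level are all hypothetical.

```python
from enum import IntEnum


class Scope(IntEnum):
    """Barrier levels from the abstract, ordered narrowest to widest."""
    L1_GROUP = 1   # cooperating threads that share an L1 cache
    GLOBAL = 2     # threads sharing a global memory
    SYSTEM = 3     # all threads sharing all system memories


def coalesce_barriers(requests):
    """Merge pending barrier requests from one processing unit.

    `requests` is a list of (thread_id, Scope) pairs. Returns a single
    (sorted_thread_ids, widest_scope) pair, or None if nothing is pending.
    Assumed policy: one outgoing request at the widest scope satisfies
    every narrower request merged into it.
    """
    if not requests:
        return None
    thread_ids = sorted(thread_id for thread_id, _ in requests)
    widest = max(scope for _, scope in requests)
    return thread_ids, widest
```

    For comparison, CUDA C++ exposes fences at three similar scopes: `__threadfence_block()`, `__threadfence()`, and `__threadfence_system()`.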

    GRID WALK SAMPLING
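    8.
    Type: Invention patent application
    Status: Under examination (published)

    Publication number: US20120280992A1

    Publication date: 2012-11-08

    Application number: US13461666

    Application date: 2012-05-01

    IPC classification: G06T17/00

    CPC classification: G06T11/40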

    Abstract: The grid walk sampling technique is an efficient sampling algorithm aimed at optimizing the cost of triangle rasterization for modern graphics workloads. Grid walk sampling is an iterative rasterization algorithm that intelligently tests the intersection of triangle edges with multi-cell grids, determining coverage for a grid cell while identifying other cells in the grid that are either fully covered or fully uncovered by the triangle. Grid walk sampling rasterizes triangles using fewer computations and simpler computations compared with conventional highly parallel rasterizers. Therefore, a rasterizer employing grid walk sampling may compute sample coverage of triangles more efficiently in terms of power and circuitry die area compared with conventional highly parallel rasterizers.
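
    The distinction between fully covered, fully uncovered, and partially covered cells can be shown with standard edge functions. The sketch below is not the grid walk itself, whose iteration order the abstract does not describe; it only classifies a single square cell, assuming counter-clockwise triangle winding. The names `edge` and `classify_cell` are hypothetical.

```python
def edge(a, b, p):
    """Signed edge function: positive when p lies to the left of the edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def classify_cell(tri, x, y, size=1.0):
    """Classify the square cell with lower-left corner (x, y).

    `tri` is three (x, y) vertices in counter-clockwise order.
    Returns "covered", "uncovered" or "partial". The test is conservative:
    "partial" may include cells that a finer test would reject.
    """
    corners = [(x, y), (x + size, y), (x, y + size), (x + size, y + size)]
    v0, v1, v2 = tri
    fully_inside = True
    for a, b in ((v0, v1), (v1, v2), (v2, v0)):
        values = [edge(a, b, c) for c in corners]
        if all(v < 0 for v in values):
            return "uncovered"      # every corner is outside this one edge
        if any(v < 0 for v in values):
            fully_inside = False    # the edge cuts through the cell
    return "covered" if fully_inside else "partial"
```

    Only cells classified as partial need per-sample tests, which is where the savings described in the abstract would come from.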

    Context switching using halt sequencing protocol
    9.
    Type: Granted invention patent
    Status: In force

    Publication number: US07512773B1

    Publication date: 2009-03-31

    Application number: US11252855

    Application date: 2005-10-18

    IPC classification: G06F9/46

    CPC classification: G06F9/485 G06F9/4881

    Abstract: A halt sequencing protocol permits a context switch to occur in a processing pipeline even before all units of the processing pipeline are idle. The context switch method based on the halt sequencing protocol includes the steps of issuing a halt request signal to the units of a processing pipeline, monitoring the status of each of the units, and freezing the states of all of the units when they are either idle or halted. Then, the states of the units, which pertain to the thread that has been halted, are dumped into memory, and the units are restored with states corresponding to a different thread that is to be executed after the context switch.
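
    The steps listed in this abstract (request halt, monitor, freeze, dump, restore) can be traced in a toy simulation. This is a minimal sketch under assumptions rather than the patented protocol: the `Unit` class, its `busy_cycles` field (cycles until the unit reaches a point where it can halt), and the function `context_switch` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """Hypothetical pipeline unit with per-thread state."""
    name: str
    state: dict = field(default_factory=dict)
    busy_cycles: int = 0          # cycles until the unit can halt
    halt_requested: bool = False
    status: str = "idle"          # "idle", "busy" or "halted"

    def tick(self):
        """Advance one cycle; a busy unit halts (or idles) when it can stop."""
        if self.status == "busy":
            self.busy_cycles -= 1
            if self.busy_cycles <= 0:
                self.status = "halted" if self.halt_requested else "idle"


def context_switch(units, memory, old_thread, new_thread):
    """Switch all units from old_thread to new_thread; return cycles waited."""
    for unit in units:                       # 1. issue the halt request
        unit.halt_requested = True
    cycles = 0
    while any(u.status == "busy" for u in units):   # 2. monitor unit status
        for unit in units:
            unit.tick()
        cycles += 1
    # 3. every unit is now idle or halted, so its state is frozen; dump it
    memory[old_thread] = {u.name: dict(u.state) for u in units}
    restored = memory.get(new_thread, {})
    for unit in units:                       # 4. restore the other thread
        unit.state = dict(restored.get(unit.name, {}))
        unit.halt_requested = False
        unit.status = "idle"
    return cycles
```

    The point of the halt request is that a busy unit stops at its next haltable point instead of draining all outstanding work, which is what lets the switch begin before the pipeline is idle.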

Superscalar processor with multiple register windows and speculative return address generation
    10.
    Type: Granted invention patent
    Status: Expired

    Publication number: US5896528A

    Publication date: 1999-04-20

    Application number: US522845

    Application date: 1995-09-01

    IPC classification: G06F9/32 G06F9/38 G06F9/42

    Abstract: A superscalar processor capable of executing multiple instructions concurrently. The processor includes a program counter which identifies instructions for execution by multiple execution units. Further included is a register file made up of multiple register windows. A current window pointer selects one of the multiple register windows. In response to the value of the current window pointer, a return prediction table provides a speculative program counter value, indicative of a return address of an instruction for a subroutine, corresponding to the selected register window. A watchpoint register stores the speculative program counter value. A fetch program counter, in response to the speculative program counter value, stores the instructions for execution after they have been identified by the program counter.
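
    The interaction between the current window pointer, the return prediction table, and the watchpoint register can be sketched as follows. This is an illustrative model only, built on assumptions the abstract does not state: one table entry per register window, the window pointer advancing on a call and stepping back on a return, and a simple equality check when the actual return address becomes known. The class `ReturnPredictor` and its method names are hypothetical.

```python
class ReturnPredictor:
    """Hypothetical return prediction table indexed by the current window pointer."""

    def __init__(self, num_windows: int = 8):
        self.num_windows = num_windows
        self.table = [None] * num_windows   # one predicted return PC per window
        self.cwp = 0                        # current window pointer
        self.watchpoint = None              # last speculative PC handed to fetch

    def call(self, return_address: int) -> None:
        """On a subroutine call, move to a new window and record the return PC."""
        self.cwp = (self.cwp + 1) % self.num_windows
        self.table[self.cwp] = return_address

    def predict_return(self):
        """On a return, supply the speculative PC for the current window."""
        self.watchpoint = self.table[self.cwp]
        self.cwp = (self.cwp - 1) % self.num_windows
        return self.watchpoint

    def resolve(self, actual_return_address: int) -> bool:
        """Check the speculation once the real return address is known."""
        return actual_return_address == self.watchpoint
```

    A false result from `resolve` would correspond to a misprediction, after which the speculatively fetched instructions would have to be discarded.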
