Abstract:
For ray tracing, methods, apparatus, and computer-readable media provide efficient transmission and/or storage of rays between ray emitters and an intersection testing resource. Ray emitters, during emission of a plurality of rays, identify an attribute shared by each ray of the plurality, and represent that attribute as shared ray data. The shared ray data, and other ray data sufficient to determine both an origin and a direction for each ray of the plurality, are transmitted. Functionality in the intersection testing resource receives the shared ray data and the other ray data, interprets them to determine an origin and a direction for each ray of the plurality, and provides those rays for intersection testing. Rays can be stored in the shared attribute format in the intersection testing resource, and data elements representing the rays can be constructed later. Programmable receiving functionality of the intersection testing resource can accommodate many ray types and other situations.
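As a minimal sketch of the shared-attribute idea, assume the shared attribute is a common origin (as with camera rays). The `RayPacket` layout and the `pack_rays` / `unpack_rays` names are hypothetical and chosen only for illustration; the abstract does not specify a format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Ray = Tuple[Vec3, Vec3]  # (origin, direction)


@dataclass
class RayPacket:
    """Hypothetical transfer format: one shared origin plus per-ray directions."""
    shared_origin: Vec3
    directions: List[Vec3]


def pack_rays(rays: List[Ray]) -> RayPacket:
    """Emitter side: factor out the origin that every ray has in common."""
    origins = {origin for origin, _ in rays}
    if len(origins) != 1:
        raise ValueError("rays do not share a single origin")
    return RayPacket(origins.pop(), [direction for _, direction in rays])


def unpack_rays(packet: RayPacket) -> List[Ray]:
    """Receiver side: rebuild a full (origin, direction) pair for each ray."""
    return [(packet.shared_origin, direction) for direction in packet.directions]
```

With this layout the origin is sent once per packet instead of once per ray, and the receiver can defer `unpack_rays` until the rays are actually needed for testing.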
Abstract:
Systems and methods include high-throughput and/or parallelized ray/geometric shape intersection testing using intersection testing resources that accept and operate on block floating point (BFP) data. Block floating point data sacrifices precision of scene location selectively, maintaining precision where it is more beneficial and allowing reduced precision where that is beneficial. In particular, rays, acceleration structures, and primitives can be represented in a variety of block floating point formats, such that the storage required for such data can be reduced. Hardware-accelerated intersection testing can be provided with reduced-size math units and reduced routing requirements. A driver for hardware accelerators can maintain full-precision versions of rays and primitives to allow reduced communication requirements for high-throughput intersection testing in loosely coupled systems. Embodiments can also include using BFP-formatted data in programmable test cells or more general-purpose processing elements.
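The general block floating point technique can be sketched as follows: a block of values shares one exponent, and each value keeps only a short signed integer mantissa. This is a generic illustration, not the specific formats of the disclosure; `bfp_encode`, `bfp_decode`, and the 8-bit default are assumptions.

```python
import math
from typing import List, Tuple


def bfp_encode(values: List[float], mantissa_bits: int = 8) -> Tuple[int, List[int]]:
    """Encode a block as (shared exponent, signed integer mantissas)."""
    largest = max((abs(v) for v in values), default=0.0)
    if largest == 0.0:
        return 0, [0] * len(values)
    # frexp returns (m, e) with largest == m * 2**e and 0.5 <= m < 1.
    _, e = math.frexp(largest)
    # Choose the exponent so the largest magnitude uses all mantissa bits.
    shared_exp = e - (mantissa_bits - 1)
    limit = (1 << (mantissa_bits - 1)) - 1
    mantissas = []
    for v in values:
        m = round(math.ldexp(v, -shared_exp))   # v / 2**shared_exp, rounded
        mantissas.append(max(-limit, min(limit, m)))  # clamp to signed range
    return shared_exp, mantissas


def bfp_decode(shared_exp: int, mantissas: List[int]) -> List[float]:
    """Reconstruct approximate values from the shared exponent and mantissas."""
    return [math.ldexp(m, shared_exp) for m in mantissas]
```

Values close in magnitude to the largest in the block keep most of their precision, while much smaller values lose low-order bits, which is the trade-off the abstract describes.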
Abstract:
Aspects include a multistage collector to receive outputs from plural processing elements. Processing elements may comprise (each or collectively) a plurality of clusters, with one or more ALUs that may perform SIMD operations on a data vector and produce outputs according to the instruction stream being used to configure the ALU(s). The multistage collector includes constituent components, each with at least one input queue, a memory, a packing unit, and an output queue; these components can be sized to process groups of input elements of a given size, and can have multiple input queues and a single output queue. Some components couple to receive outputs from the ALUs, and others receive outputs from other components. Ultimately, the multistage collector can output groupings of input elements. Each grouping of elements (e.g., at input queues, or stored in the memories of the components) can be formed based on matching of index elements.
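A software sketch of one collector component may help make the structure concrete. The `CollectorStage` class, its `step` method, and the `(index, payload)` item layout are hypothetical; real hardware would process queues concurrently rather than in a loop.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple

Item = Tuple[int, str]  # (index element, payload)


class CollectorStage:
    """One component: input queues, a memory, a packing step, one output queue."""

    def __init__(self, num_inputs: int, group_size: int) -> None:
        self.inputs: List[Deque[List[Item]]] = [deque() for _ in range(num_inputs)]
        self.memory: Dict[int, List[Item]] = defaultdict(list)
        self.group_size = group_size
        self.output: Deque[List[Item]] = deque()

    def step(self) -> None:
        """Drain the input queues, pack items by index, emit full groups."""
        for queue in self.inputs:
            while queue:
                for item in queue.popleft():
                    index = item[0]
                    bucket = self.memory[index]
                    bucket.append(item)
                    if len(bucket) == self.group_size:
                        # Group is full: move it from memory to the output queue.
                        self.output.append(self.memory.pop(index))
```

Stages can be chained by feeding one stage's `output` into an input queue of the next, so that later stages form larger groupings from the smaller ones produced upstream.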
Abstract:
In some aspects, finer-grained parallelism is achieved by segmenting programmatic workloads into smaller discretized portions, where a first element can be indicative both of a configuration or program to be executed and of a first data set to be used in such execution, while a second element can be indicative of a second data element or group. The discretized portions can cause programs to execute on distributed processors. Approaches to selecting processors, and to allocating local memory associated with those processors, are disclosed. In one example, discretized portions that share a program have an anti-affinity, to cause dispersion for initial execution assignment. Flags, such as programmer- and compiler-generated flags, can be used in determining such allocations. Workloads can be grouped according to compatibility of memory usage requirements.
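One way to picture the anti-affinity rule is a greedy placement that avoids putting two portions of the same program on the same processor while unused processors remain. The function name and the least-loaded tie-break are assumptions made for this sketch only.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Set, Tuple


def assign_with_anti_affinity(
    workloads: List[Tuple[str, Hashable]], num_processors: int
) -> List[int]:
    """Return a processor index for each (program, data) workload."""
    load = [0] * num_processors
    used: Dict[str, Set[int]] = defaultdict(set)  # program -> processors hosting it
    assignment = []
    for program, _data in workloads:
        # Prefer processors that do not yet run this program (anti-affinity).
        fresh = [p for p in range(num_processors) if p not in used[program]]
        candidates = fresh or list(range(num_processors))
        # Among the candidates, pick the least loaded, lowest index first.
        target = min(candidates, key=lambda p: (load[p], p))
        used[program].add(target)
        load[target] += 1
        assignment.append(target)
    return assignment
```

Once every processor already hosts the program, the sketch falls back to plain least-loaded placement.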
Abstract:
In one aspect, photon queries are answered using systems and methods that traverse collections of photon queries through an acceleration structure to identify photons meeting the specification of a given query. Such systems and methods can be extended to satisfying similarity queries in an n-dimensional parameter space. Queries can be associated with code (or pointers to code) that is run to achieve closure of that query. Queries can cause further queries to be emitted. Arbitrary data can be passed from one query to another; for example, parameters defined internally to the code modules themselves (i.e., the parameters do not need to have a definition or meaning to the systems or within the methods).
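The sketch below illustrates a batch of photon queries, each carrying its own closure code, using a uniform grid as a simple stand-in for the acceleration structure. The grid, the radius-based query specification, and the names `build_grid` and `run_queries` are assumptions; the abstract does not prescribe them.

```python
import math
from collections import defaultdict
from itertools import product
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float, float]
Query = Tuple[Point, float, Callable[[List[int]], None]]  # (center, radius, on_close)


def build_grid(photons: List[Point], cell: float) -> Dict[Tuple[int, ...], List[int]]:
    """Bin photon indices into uniform grid cells of edge length `cell`."""
    grid: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
    for i, p in enumerate(photons):
        grid[tuple(math.floor(c / cell) for c in p)].append(i)
    return grid


def run_queries(photons: List[Point], grid, cell: float, queries: List[Query]) -> None:
    """Answer each query, then run its closure code on the matching photon IDs."""
    for center, radius, on_close in queries:
        lo = [math.floor((c - radius) / cell) for c in center]
        hi = [math.floor((c + radius) / cell) for c in center]
        hits = []
        # Visit only the grid cells overlapped by the query's bounding cube.
        for key in product(*(range(a, b + 1) for a, b in zip(lo, hi))):
            for i in grid.get(key, ()):
                if math.dist(photons[i], center) <= radius:
                    hits.append(i)
        on_close(sorted(hits))
```

Because `on_close` is an arbitrary callable, it can carry data the query engine never interprets, and it could also append further queries to a work list.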
Abstract:
Systems and methods for producing an acceleration structure provide for subdividing a 3-D scene into a plurality of volumetric portions of different sizes, each addressable using a multipart address indicating the location and relative size of that volumetric portion. A stream of primitives is processed by characterizing each primitive according to one or more criteria, selecting a relative size of volumetric portions for use in bounding the primitive, and finding a set of volumetric portions of that relative size which bound the primitive. A primitive ID is stored in each location of a cache associated with each volumetric portion of the set of volumetric portions. A cache location is selected for eviction, responsive to each cache eviction decision made during the processing. An element of an acceleration structure is then generated according to the contents of the evicted cache location.
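The pipeline above can be sketched under several assumptions: a cubic scene, a multipart address of the form `(level, x, y, z)`, size selection by the primitive's largest bounding-box extent, and least-recently-used eviction. None of these specifics comes from the abstract; they only make the flow concrete.

```python
import math
from collections import OrderedDict
from itertools import product

SCENE_SIZE = 16.0  # assumed cubic scene spanning [0, 16) on each axis
MAX_LEVEL = 4      # level L has cells of edge SCENE_SIZE / 2**L


def pick_level(bbox_min, bbox_max):
    """Choose the finest level whose cells are at least as large as the primitive."""
    extent = max(hi - lo for lo, hi in zip(bbox_min, bbox_max))
    if extent <= 0:
        return MAX_LEVEL
    return max(0, min(MAX_LEVEL, math.floor(math.log2(SCENE_SIZE / extent))))


def bounding_cells(bbox_min, bbox_max):
    """Return the (level, x, y, z) addresses of cells overlapping the box."""
    level = pick_level(bbox_min, bbox_max)
    cell = SCENE_SIZE / 2 ** level
    last = 2 ** level - 1
    ranges = [range(int(lo // cell), min(int(hi // cell), last) + 1)
              for lo, hi in zip(bbox_min, bbox_max)]
    return [(level, x, y, z) for x, y, z in product(*ranges)]


class CellCache:
    """Small cache of address -> primitive IDs; evictions emit elements."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cells = OrderedDict()
        self.elements = []  # emitted acceleration structure elements

    def add(self, address, prim_id):
        if address in self.cells:
            self.cells.move_to_end(address)
        elif len(self.cells) >= self.capacity:
            # LRU eviction is an assumption; the evicted contents become an element.
            self.elements.append(self.cells.popitem(last=False))
        self.cells.setdefault(address, []).append(prim_id)
```

A driver loop would call `bounding_cells` for each streamed primitive and `add` its ID under every returned address; whatever remains cached at the end would be flushed into elements as well.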
Abstract:
In some aspects, systems and methods provide for forming groupings of a plurality of independently specified computation workloads, such as graphics processing workloads and, in a specific example, ray tracing workloads. The workloads include a scheduling key, which is one basis on which the groupings can be formed. Workloads grouped together can all execute from the same source of instructions, on one or more different private data elements. Such workloads can recursively instantiate other workloads that reference the same private data elements. In some examples, the scheduling key can be used to identify a data element to be used by all the workloads of a grouping. Memory conflicts on private data elements are handled by scheduling non-conflicted workloads or specific instructions and deferring conflicted workloads, instead of locking memory locations.
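A small sketch of grouping by scheduling key and deferring, rather than locking, on conflicts follows. Workloads are modeled as `(scheduling_key, private_data_id)` tuples, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Workload = Tuple[str, int]  # (scheduling key, private data element ID)


def form_groups(workloads: List[Workload]) -> Dict[str, List[Workload]]:
    """Collect workloads that share a scheduling key into one grouping."""
    groups: Dict[str, List[Workload]] = defaultdict(list)
    for workload in workloads:
        groups[workload[0]].append(workload)
    return dict(groups)


def split_conflicts(group: List[Workload]) -> Tuple[List[Workload], List[Workload]]:
    """Return (runnable, deferred), allowing one workload per private data element."""
    claimed = set()
    runnable, deferred = [], []
    for workload in group:
        data_id = workload[1]
        if data_id in claimed:
            deferred.append(workload)  # conflict: defer instead of locking
        else:
            claimed.add(data_id)
            runnable.append(workload)
    return runnable, deferred
```

Deferred workloads would simply be offered again in a later scheduling pass, once the workload holding their data element has completed.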
Abstract:
Aspects include computation systems that can identify computation instances that are not capable of being reentrant, or are not reentrant-capable on a target architecture, or are non-reentrant as a result of having a memory conflict in a particular execution situation. A system can have a plurality of computation units, each with an independently schedulable SIMD vector. Computation instances can be defined by a program module and one or more data elements that may be stored in a local cache for a particular computation unit. Each local cache does not maintain coherency controls for such data elements. During scheduling, a scheduler can maintain a list of running (or runnable) instances, and attempt to schedule new computation instances by determining whether any new computation instance conflicts with a running instance and, if so, deferring its scheduling. Memory conflict checks can be conditioned on a flag or other indication of the potential for non-reentrancy.
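The scheduling behavior described here can be sketched as a scheduler that keeps a list of running instances and defers a new instance only when its flag requests a conflict check and a running instance uses the same data element. The `Instance` fields and `Scheduler` methods are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instance:
    program: str           # program module defining the instance
    data_id: int           # data element the instance operates on
    check_conflicts: bool  # flag: instance may be non-reentrant


class Scheduler:
    def __init__(self) -> None:
        self.running: List[Instance] = []
        self.deferred: List[Instance] = []

    def submit(self, inst: Instance) -> bool:
        """Schedule inst unless it is flagged and conflicts with a running instance."""
        if inst.check_conflicts and any(r.data_id == inst.data_id for r in self.running):
            self.deferred.append(inst)
            return False
        self.running.append(inst)
        return True

    def retire(self, inst: Instance) -> None:
        """Remove a finished instance, then retry everything that was deferred."""
        self.running.remove(inst)
        pending, self.deferred = self.deferred, []
        for waiting in pending:
            self.submit(waiting)
```

Skipping the check for unflagged instances keeps the common, reentrant case cheap, which is the point of conditioning the check on a flag.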