Abstract:
Multi-processor core three-dimensional (3D) integrated circuits (ICs) (3DICs) and related methods are disclosed. In aspects disclosed herein, ICs are provided that include a central processing unit (CPU) having multiple processor cores ("cores") to improve performance. To further improve CPU performance, the multiple cores can also be designed to communicate with each other to offload workloads and/or share resources for parallel processing, but at a communication overhead associated with passing data through interconnects which have an associated latency. To mitigate this communication overhead inefficiency, aspects disclosed herein provide the CPU with its multiple cores in a 3DIC. Because 3DICs can overlap different IC tiers and/or align similar components in the same IC tier, the cores can be designed and located between or within different IC tiers in a 3DIC to reduce communication distance associated with processor core communication to share workload and/or resources, thus improving performance of the multi-processor CPU design.
Abstract:
Examples include techniques for coalescing doorbells in a request message. Example techniques include gathering doorbells to access a device. The gathered are combined in a cache line structure and the cache line structure is written to a cache or buffer for a central processing unit in a single write operation.
Abstract:
An apparatus and method are described for performing a distance test in a ray tracing system. For example, one embodiment of a graphics processing apparatus comprises: a ray tracing traversal/intersection unit to identify two or more ray-surface intersections, each of the ray-surface intersections being assigned a unique hit point identifier (ID); and a distance testing module to disambiguate the order of the ray-surface intersections using the hit point ID if the two or more of the ray-surface intersections share the same distance.
Abstract:
Techniques are described for metadata processing that can be used to encode an arbitrary number of security policies for code running on a processor. Metadata may be added to every word in the system and a metadata processing unit may be used that works in parallel with data flow to enforce an arbitrary set of policies. In one aspect, the metadata may be characterized as unbounded and software programmable to be applicable to a wide range of metadata processing policies. Techniques and policies have a wide range of uses including, for example, safety, security, and synchronization. Additionally, described are aspects and techniques in connection with metadata processing in an embodiment based on the RISC-V architecture.
Abstract:
A processor includes a front end to decode an instruction, a temporary destination, and an allocator to assign the instruction to an execution unit to execute the instruction to get a selected column of data into a destination register. The execution unit includes an element counter, a logic to determine an index from an index vector based on the element count, a logic to compute an address of the data, a row to be loaded into the temporary destination, and a data processing unit to copy a portion of the temporary destination into the element of the destination register.
Abstract:
A processor includes an execution unit to execute instructions to load indices from an array of indices, optionally perform scatters, and prefetch (to a specified cache) contents of target locations for future scatters from arbitrary locations in memory. The execution unit includes logic to load, for each target location of a scatter or prefetch operation, an index value to be used in computing the address in memory for the operation. The index value may be retrieved from an array of indices identified for the instruction. The execution unit includes logic to compute the addresses based on the sum of a base address specified for the instruction, the index value retrieved for the location, and a prefetch offset (for prefetch operations), with optional scaling. The execution unit includes logic to retrieve data elements from contiguous locations in a source vector register specified for the instruction to be scattered to the memory.
Abstract:
In one embodiment, a processor includes a core having a fetch logic to fetch instructions, a decode logic to decode a first persistent memory prefetch instruction and provide the decoded first persistent memory prefetch instruction to a control logic. In turn, the control logic is to enable prefetch of data requested by the first persistent memory prefetch instruction and storage of the data in a location external to the processor. Other embodiments are described and claimed.
Abstract:
Technology related to prefetching instruction blocks is disclosed. In one example of the disclosed technology, a processor comprises a block-based processor core for executing a program comprising a plurality of instruction blocks. The block-based processor core can include prefetch logic and a local buffer. The prefetch logic can be configured to receive a reference to a predicted instruction block and to determine a mapping of the predicted instruction block to one or more lines. The local buffer can be configured to selectively store portions of the predicted instruction block and to provide the stored portions of the predicted instruction block when control of the program passes along a predicted execution path to the predicted instruction block.
Abstract:
Technology related to prefetching data associated with predicated stores of programs in block-based processor architectures is disclosed. In one example of the disclosed technology, a processor includes a block-based processor core for executing an instruction block comprising a plurality of instructions. The block-based processor core includes decode logic and prefetch logic. The decode logic is configured to detect a predicated store instruction of the instruction block. The prefetch logic is configured to calculate a target address of the predicated store instruction and initiate a memory operation associated with the calculated target address before a predicate of the predicated store instruction is calculated.
Abstract:
Apparatus and methods are disclosed for controlling instruction flow in block-based processor architectures. In one example of the disclosed technology, an instruction block address register stores an index address to a memory storing a plurality of instructions for an instruction block, the indexed address being inaccessible when the processor is in one or more unprivileged operational modes, one or more execution units configured to execute instructions for the instruction block, and a control unit configured to fetch and decode two or more of the plurality of instructions from the memory based on the indexed address.