Abstract:
Apparatuses, systems, and techniques to execute CUDA programs. In at least one embodiment, an application programming interface is performed to cause memory to be shared between two or more groups of blocks of threads.
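A minimal CUDA sketch of the kind of interface this abstract describes, assuming a cluster-capable GPU and the cooperative groups cluster API; the kernel and variable names are illustrative, not taken from the abstract:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two blocks form a cluster; each block exposes its __shared__ buffer to the other.
__global__ void __cluster_dims__(2, 1, 1) exchange_kernel(int *out)
{
    __shared__ int buf[32];                        // assume blockDim.x == 32
    cg::cluster_group cluster = cg::this_cluster();

    buf[threadIdx.x] = (int)cluster.block_rank();  // each block fills its own shared buffer
    cluster.sync();                                // make the data visible cluster-wide

    // Map the other block's shared memory into this block's address space and read it.
    unsigned int other = cluster.block_rank() ^ 1u;
    int *remote = cluster.map_shared_rank(buf, other);
    out[blockIdx.x * blockDim.x + threadIdx.x] = remote[threadIdx.x];
}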
Abstract:
Apparatuses, systems, and techniques to execute CUDA programs. In at least one embodiment, an application programming interface is performed to indicate one or more limitations of one or more attributes of one or more groups of blocks of one or more threads.
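As a hedged sketch of how such limits on block-group attributes can be queried today, the CUDA runtime exposes occupancy queries for thread block clusters; the kernel name and launch shape below are assumptions for illustration:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void my_kernel() {}

int main()
{
    cudaLaunchConfig_t config = {};
    config.gridDim  = dim3(64, 1, 1);
    config.blockDim = dim3(128, 1, 1);

    // Largest cluster size (in blocks) the kernel could portably be launched with.
    int maxClusterSize = 0;
    cudaOccupancyMaxPotentialClusterSize(&maxClusterSize, my_kernel, &config);

    // With a concrete cluster shape attached, ask how many such clusters fit on the device.
    cudaLaunchAttribute attr = {};
    attr.id = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = 2;
    attr.val.clusterDim.y = 1;
    attr.val.clusterDim.z = 1;
    config.attrs    = &attr;
    config.numAttrs = 1;

    int maxActiveClusters = 0;
    cudaOccupancyMaxActiveClusters(&maxActiveClusters, my_kernel, &config);

    printf("max portable cluster size: %d blocks\n", maxClusterSize);
    printf("max co-resident clusters:  %d\n", maxActiveClusters);
    return 0;
}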
Abstract:
Apparatuses, systems, and techniques to execute CUDA programs. In at least one embodiment, an application programming interface is performed to indicate a maximum number of blocks of threads capable of being scheduled in parallel.
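A short sketch of querying such a maximum with an existing CUDA runtime call (cudaOccupancyMaxActiveBlocksPerMultiprocessor); the kernel and the 256-thread block size are assumptions for illustration:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void my_kernel() {}

int main()
{
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, my_kernel,
                                                  /*blockSize=*/256,
                                                  /*dynamicSMemSize=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Device-wide upper bound on blocks that can be scheduled in parallel.
    printf("max blocks per SM: %d, device-wide: %d\n",
           blocksPerSM, blocksPerSM * prop.multiProcessorCount);
    return 0;
}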
Abstract:
Apparatuses, systems, and techniques to execute CUDA programs. In at least one embodiment, an application programming interface is performed to determine a scheduling policy of one or more blocks of one or more threads.
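One way such a scheduling preference can be expressed with current CUDA launch attributes is sketched below, assuming the cluster scheduling-policy preference attribute; the kernel and launch shape are illustrative:

#include <cuda_runtime.h>

__global__ void my_kernel() {}

int main()
{
    cudaLaunchConfig_t config = {};
    config.gridDim  = dim3(64, 1, 1);
    config.blockDim = dim3(128, 1, 1);

    // Ask the scheduler to spread blocks across the device rather than pack them.
    cudaLaunchAttribute attr = {};
    attr.id = cudaLaunchAttributeClusterSchedulingPolicyPreference;
    attr.val.clusterSchedulingPolicyPreference = cudaClusterSchedulingPolicySpread;
    config.attrs    = &attr;
    config.numAttrs = 1;

    cudaLaunchKernelEx(&config, my_kernel);
    cudaDeviceSynchronize();
    return 0;
}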
Abstract:
Apparatuses, systems, and techniques to generate numbers. In at least one embodiment, one or more circuits are to cause one or more thirty-two bit floating point numbers to be truncated to generate one or more rounded numbers based, at least in part, on one or more rounding attributes.
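A brief device-code sketch of truncation-style rounding of 32-bit floats under an explicit rounding attribute, using CUDA's per-operation rounding-mode intrinsics; the kernel and array names are illustrative:

__global__ void truncate_kernel(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        out[i] = __fmul_rz(a[i], b[i]);   // multiply with round-toward-zero (truncation)
        // Other rounding attributes map to sibling intrinsics:
        // __fmul_rn (nearest even), __fmul_ru (toward +inf), __fmul_rd (toward -inf).
    }
}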
Abstract:
Embodiments of the present invention provide a novel solution that supports the separate compilation of host code and device code used within a heterogeneous programming environment. Embodiments of the present invention are operable to link device code embedded within multiple host object files using a separate device linking operation. Embodiments of the present invention may extract device code from the respective host object files and then link it together to form linked device code. This linked device code may then be embedded back into a host object generated by embodiments of the present invention, which may then be passed to a host linker to form a host executable file. As such, device code may be split into multiple files and then linked together to form a final executable file by embodiments of the present invention.
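A hedged sketch of the workflow this abstract describes, using nvcc's relocatable device code and device-link steps; the file names and build commands are illustrative:

//   nvcc -dc util.cu -o util.o                       (host object with embedded device code)
//   nvcc -dc main.cu -o main.o
//   nvcc -dlink util.o main.o -o device_link.o       (separate device linking operation)
//   g++ util.o main.o device_link.o -lcudart -o app  (host link into the executable)

// ---- util.cu ----
__device__ float scale(float x) { return 2.0f * x; }

// ---- main.cu ----
extern __device__ float scale(float x);   // defined in util.cu, resolved at device link time

__global__ void kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = scale(data[i]);
}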
Abstract:
Apparatuses, systems, and techniques to execute CUDA programs. In at least one embodiment, an application programming interface is performed to cause performance of one or more threads within a group of blocks of threads to stop at least until all threads within the group of blocks have performed a barrier instruction.
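A minimal CUDA sketch of a barrier across a group of thread blocks, assuming the cooperative groups cluster API; the kernel and buffer names are illustrative:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void __cluster_dims__(4, 1, 1) cluster_barrier_kernel(int *flags)
{
    cg::cluster_group cluster = cg::this_cluster();

    flags[cluster.block_rank()] = 1;   // each block publishes some work

    // No thread continues past this barrier until every thread in every block
    // of the cluster has performed it.
    cluster.sync();

    // The writes above are now visible to all blocks in the cluster.
}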
Abstract:
Apparatuses, systems, and techniques to execute CUDA programs. In at least one embodiment, an application programming interface is performed to indicate whether one or more threads within two or more blocks of threads have performed a barrier instruction.
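A hedged sketch of a split barrier that lets a block learn whether the threads of the other blocks have performed the barrier instruction, assuming the cooperative groups arrive/wait cluster barrier; names are illustrative:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void __cluster_dims__(2, 1, 1) arrive_wait_kernel(float *buf)
{
    cg::cluster_group cluster = cg::this_cluster();

    buf[cluster.block_rank()] = 1.0f;   // produce data for the other block

    cluster.barrier_arrive();           // record that this thread reached the barrier

    // Independent work that does not touch buf can overlap here.

    cluster.barrier_wait();             // returns once all threads in the cluster have arrived

    // It is now safe to consume the other block's data.
}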
Abstract:
Apparatuses, systems, and techniques to control operation of a memory cache. In at least one embodiment, cache guidance is specified within application source code by associating guidance with the declaration of a memory block, and then applying the specified guidance to source code statements that access said memory block.
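A sketch in that spirit using the existing CUDA L2 access-policy-window mechanism, which likewise ties cache guidance to a declared memory block so that later accesses follow it; the sizes and names are illustrative, and this is a stand-in for the annotation scheme the abstract describes rather than that scheme itself:

#include <cuda_runtime.h>

int main()
{
    float *data = nullptr;
    size_t bytes = 1 << 20;
    cudaMalloc(&data, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Associate guidance ("keep this window persisting in L2") with the memory block.
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = data;
    attr.accessPolicyWindow.num_bytes = bytes;
    attr.accessPolicyWindow.hitRatio  = 1.0f;
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);

    // Kernels launched on `stream` that touch `data` now follow the guidance.
    cudaStreamDestroy(stream);
    cudaFree(data);
    return 0;
}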
Abstract:
A system and method for processing source code for compilation. The method includes accessing a portion of host source code and determining whether the portion of the host source code comprises a device lambda expression. The method further includes, in response to the portion of the host source code comprising the device lambda expression, determining a unique placeholder type instantiation based on the device lambda expression and modifying the device lambda expression based on the unique placeholder type instantiation to produce modified host source code. The method further includes sending the modified host source code to a host compiler.
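A short sketch of the kind of device lambda the method processes: a lambda marked __device__ inside host code and handed to a kernel template (nvcc requires --extended-lambda for this); the names are illustrative:

#include <cuda_runtime.h>

template <typename F>
__global__ void apply_kernel(float *data, int n, F f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = f(data[i]);
}

void scale_on_device(float *d_data, int n, float factor)
{
    // This device lambda is what the described pass would rewrite in terms of a
    // unique placeholder type before the translation unit reaches the host compiler.
    auto scale = [=] __device__ (float x) { return factor * x; };
    apply_kernel<<<(n + 255) / 256, 256>>>(d_data, n, scale);
}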