Abstract:
A system and method for processing source code for compilation. The method includes accessing a portion of host source code and determining whether the portion of the host source code comprises a device lambda expression. The method further includes in response to the portion of host code comprising the device lambda expression, determining a unique placeholder type instantiation based on the device lambda expression and modifying the device lambda expression based on the unique placeholder type instantiation to produce modified host source code. The method further includes sending the modified host source code to a host compiler.
Abstract:
One embodiment of the present invention sets forth a method for causing thread convergence. The method includes determining that a control flow graph representing a first section of a program includes at least two non-overlapping paths that extend from a first divergent node to a candidate node. The method also includes determining that the first divergent node is not a dominator of the candidate node or that the candidate node is not a post-dominator of the first divergent node. The method further includes identifying an external node and inserting a first instruction configured to cause a predicate variable to be set to true for a first set of threads that is to execute the external node. The method additionally includes inserting into the program a second divergent node configured to cause various threads to execute or not execute a first control flow path associated with the external node.
Abstract:
Various disclosed embodiments are directed to methods and systems for reducing memory space in sequential computer-implemented operations. The method includes generating a directed acyclic graph (DAG) having a plurality of vertices and directed edges, wherein each edge connects a predecessor vertex to a successor vertex. Each vertex represents one of the computer-implemented operations and each directed edge represents output data generated by the operations. The method includes merging one of the predecessor vertex with one of the successor vertex by combining the operations of the predecessor vertex and the successor vertex if the predecessor and successor vertices are connected by a directed edge and there is only one directed edge originating from the predecessor vertex. The merger of the predecessor and the successor vertices reduces the number of directed edges in the DAG, resulting in a reduction of intermediate buffer memory required to store the output data.
Abstract:
A device compiler and linker within a parallel processing unit (PPU) is configured to optimize program code of a co-processor enabled application by rematerializing a subset of live-in variables for a particular block in a control flow graph generated for that program code. The device compiler and linker identifies the block of the control flow graph that has the greatest number of live-in variables, then selects a subset of the live-in variables associated with the identified block for which rematerializing confers the greatest estimated profitability. The profitability of rematerializing a given subset of live-in variables is determined based on the number of live-in variables reduced, the cost of rematerialization, and the potential risk of rematerialization.
Abstract:
A system and method are provided for inserting synchronization statements into a program file to mitigate race conditions. The method includes reading a program file and determining one or more convergent statements in the program file. The method also includes inserting one or more synchronization statements in the program file between the determined convergent statements. The method further includes removing one or more of the inserted synchronization statements and writing the modified program file. The method may include, after removing the inserted synchronization statements, identifying to a user any remaining inserted synchronization statements.
Abstract:
A system and method for compiling source code (e.g., with a compiler). The method includes accessing a portion of device source code and determining whether the portion of the device source code comprises a piece of work to be launched on a device from the device. The method further includes determining a plurality of application programming interface (API) calls based on the piece of work to be launched on the device and generating compiled code based on the plurality of API calls. The compiled code comprises a first portion operable to execute on a central processing unit (CPU) and a second portion operable to execute on the device (e.g., GPU).
Abstract:
Systems and methods for obtaining a set of instructions for executing a computer program and generating executable code for the computer program based, at least in part, on scheduling operations associated with the executable code according to a polyhedral representation of a directed acyclic graph. The set of instructions may be represented as a domain-specific language. The executable code may be executable code for a specific processor architecture.
Abstract:
Apparatuses, systems, and techniques to infer information from one or more sets of data. In at least one embodiment, a processor uses one or more neural networks to infer information from one or more sets of data based, at least in part, on one or more dynamically configurable dimensions of the one or more sets of data.
Abstract:
Apparatuses, systems, and techniques to combine operations. In at least one embodiment, a processor causes two or more operations in a graph to be combined based, at least in part, on another combination of two or more independent operations.
Abstract:
System and method of compiling a program having a mixture of host code and device code to enable code coverage data collection for device code execution. An exemplary integrated compiler can compile source code programmed to be executed by a host processor (e.g., CPU) and a co-processor (e.g., a GPU) concurrently. The compilation can generate an instrumented executable code which includes: coverage instrumentation counters for the device functions; mapping information that maps the counters with the instrumented source points; and instructions for the host processor to allocate and initialize device memory for the counters and to retrieve collected code coverage information from the device memory to the host memory. Execution of the instrumented executable can yield a coverage report on the device code functions.