Abstract:
Various embodiments are generally directed to optimizing dataflow in automated transformation frameworks (e.g., compiler, runtime, etc.) for spatial architectures (e.g., Configurable Spatial Accelerator) that translate high-level user code into forms that use “streams” (e.g., Latency Insensitive Channels, line buffers) to reduce overhead, eliminate or improve the efficiency of redundant memory accesses, and improve overall throughput.
Abstract:
Methods, apparatus, systems and articles of manufacture are disclosed to eliminate partial-redundant vector loads. An example apparatus includes a node group to associate a vector operation with a node group based on a load type of the vector operation. The example apparatus also includes a candidate identifier to identify a candidate in the node group, the candidate to include a subset of vector operations of the node group. The example apparatus also includes a code optimizer to determine replacement code based on a characteristic of the candidate, and to compare an estimated cost associated with executing the replacement code to a threshold cost relative to a cost of executing the candidate. The example apparatus also includes a code generator to generate machine code using the replacement code when the estimated cost of executing the replacement code satisfies the threshold cost.
Abstract:
Methods, apparatus, systems and articles of manufacture are disclosed to eliminate partial-redundant vector load operations. An example apparatus includes a node grouper to associate a vector operation with a node group, a candidate verifier to perform a dependencies test on a subset of the node group, and identify a subset of the node group as a candidate when the subset satisfies the dependencies test, and a code optimizer to determine replacement code based on a characteristic of the candidate in the node group and compare an estimated cost associated with executing the replacement code to a threshold. The example apparatus also includes a code generator to generate machine code using the replacement code when the estimated cost of executing the replacement code satisfies the threshold.
Abstract:
Logic may transform a target code to partition data automatically and/or autonomously based on a memory constraint associated with a resource such as a target device. Logic may identify a tag in the code to identify a task, wherein the task comprises at least one loop, the loop to process data elements in one or more arrays. Logic may automatically generate instructions to determine one or more partitions for the at least one loop to partition data elements, accessed by one or more memory access instructions for the one or more arrays within the at least one loop, based on a memory constraint, the memory constraint to identify an amount of memory available for allocation to process the task. Logic may determine one or more iteration space blocks for the parallel loops, determine memory windows for each block, copy data into and out of constrained memory, and transform array accesses.
Abstract:
Methods, apparatus, systems and articles of manufacture are disclosed to eliminate partial-redundant vector load operations. An example apparatus includes a node grouper to associate a vector operation with a node group, a candidate verifier to perform a dependencies test on a subset of the node group, and identify a subset of the node group as a candidate when the subset satisfies the dependencies test, and a code optimizer to determine replacement code based on a characteristic of the candidate in the node group and compare an estimated cost associated with executing the replacement code to a threshold. The example apparatus also includes a code generator to generate machine code using the replacement code when the estimated cost of executing the replacement code satisfies the threshold.
Abstract:
Methods, apparatus, systems and articles of manufacture (e.g., computer readable storage media) to perform automatic compiler optimization to enable streaming-store generation for unaligned contiguous write access are disclosed. Example apparatus disclosed herein are to mark a store instruction in source program code as a transformation candidate when the store instruction is associated with a group of memory accesses that are unaligned with respect to a size of a cache line in a cache. Disclosed apparatus are also to transform the store instruction that is marked as the transformation candidate to form transformed program code when a non-temporal property is satisfied, the transformed program code to replace the store instruction with (i) a write to a buffer in the cache and (ii) a streaming-store instruction that is to write contents of the buffer to memory.
Abstract:
Methods, apparatus, systems and articles of manufacture are disclosed to eliminate partial-redundant vector loads. An example apparatus includes a node group to associate a vector operation with a node group based on a load type of the vector operation. The example apparatus also includes a candidate identifier to identify a candidate in the node group, the candidate to include a subset of vector operations of the node group. The example apparatus also includes a code optimizer to determine replacement code based on a characteristic of the candidate, and to compare an estimated cost associated with executing the replacement code to a threshold cost relative to a cost of executing the candidate. The example apparatus also includes a code generator to generate machine code using the replacement code when the estimated cost of executing the replacement code satisfies the threshold cost.
Abstract:
Methods, apparatus, systems and articles of manufacture (e.g., computer readable storage media) to perform automatic compiler optimization to enable streaming-store generation for unaligned contiguous write access are disclosed. Example apparatus disclosed herein are to mark a store instruction in source program code as a transformation candidate when the store instruction is associated with a group of memory accesses that are unaligned with respect to a size of a cache line in a cache. Disclosed apparatus are also to transform the store instruction that is marked as the transformation candidate to form transformed program code when a non-temporal property is satisfied, the transformed program code to replace the store instruction with (i) a write to a buffer in the cache and (ii) a streaming-store instruction that is to write contents of the buffer to memory.
Abstract:
Systems, apparatuses and methods may provide for technology that detects one or more local variables in source code, wherein the local variable(s) lack dependencies across iterations of a loop in the source code, automatically generate pipeline execution code for the local variable(s), and incorporate the pipeline execution code into an output of a compiler. In one example, the pipeline execution code includes an initialization of a pool of buffer storage for the local variable(s).
Abstract:
Systems, apparatuses and methods may provide for technology that detects one or more local variables in source code, wherein the local variable(s) lack dependencies across iterations of a loop in the source code, automatically generate pipeline execution code for the local variable(s), and incorporate the pipeline execution code into an output of a compiler. In one example, the pipeline execution code includes an initialization of a pool of buffer storage for the local variable(s).