Abstract:
Systems and methods are disclosed for reusing fetched and decoded instructions in block-based processor architectures. In one example of the disclosed technology, a system includes a plurality of block-based processor cores and an instruction scheduler. A respective core is capable of executing one or more instruction blocks of a program. The instruction scheduler can be configured to identify a given instruction block of the program that is resident on a first processor core of the processor cores and is to be executed again. The instruction scheduler can be configured to adjust a mapping of instruction blocks in flight so that the given instruction block is re-executed on the first processor core without re-fetching the given instruction block.
Abstract:
Technologies for optimizing complex exponential calculations include a computing device with optimizing compiler. The compiler parses source code, optimizes the parsed representation of the source code, and generates output code. During optimization, the compiler identifies a loop in the source code including a call to the exponential function having an argument that is a loop-invariant complex number multiplied by the loop index variable. The compiler tiles the loop to generate a pair of nested loops. The compiler generates code to pre-compute the exponential function and store the resulting values in a pair of coefficient arrays. The size of each coefficient array may be equal to the square root of the number of loop iterations. The compiler applies rewrite rules to replace the exponential function call with a multiplicative expression of one element from each of the coefficient arrays. Other embodiments are described and claimed.
Abstract:
A system and method for efficiently processing instructions in hardware parallel execution lanes within a processor. In response to a given divergent point within an identified loop, a compiler arranges instructions within the identified loop into very large instruction words (VLIW's). At least one VLIW includes instructions intermingled from different basic blocks between the given divergence point and a corresponding convergence point. The compiler generates code wherein when executed assigns at runtime instructions within a given VLIW to multiple parallel execution lanes within a target processor. The target processor includes a single instruction multiple data (SIMD) micro-architecture. The assignment for a given lane is based on branch direction found at runtime for the given lane at the given divergent point. The target processor includes a vector register for storing indications indicating which given instruction within a fetched VLIW for an associated lane to execute.
Abstract:
According to one embodiment, a code optimizer is configured to receive first code having a program loop implemented with scalar instructions to store values of a first array to a second array based on values of a third array and to generate second code representing the program loop using at least one vector instruction. The second code include a shuffle instruction to shuffle elements of the first array based on the third array using a shuffle table in a vector manner, a blend instruction to blend the shuffled elements of the first array using a blend table in a vector manner, and a store instruction to store the blended elements of the first array in the second array.
Abstract:
A method of statically testing dependence in a dataflow program is provided, the method comprising receiving a dataflow program which provides parameters, including consumption rates, production rates on connections between actors in the program and a number of initial samples (delays) on the connections, generating from the parameters a model of a precedence graph for the dataflow program representing dependence constraints between distinct firings of the number of actors. For the model, determining a feedback distance between multiple firings of a same actor, determining sets of parallel regions comprising a given number of actor firings of a same actor, composing mutually independent component regions comprising at least a part of the sets of parallel regions, and composing one or more composite regions comprising one or more component regions and/or one or more sets of parallel regions, being composed so that a pre-determined criteria is satisfied.
Abstract:
Examples are described for a device to receive intermediate code that was generated from compiling source code of an application. The intermediate code includes information generated from the compiling that identifies a hierarchical structure of lower level sub-routines in higher level sub-routines, and the lower level sub-routines are defined in the source code of the application to execute more frequently than the higher level sub-routines that identify the lower level sub-routines. The device is configured to compile the intermediate code to generate object code based on the information that identifies lower level sub-routines in higher level sub-routines, and store the object code.
Abstract:
Aspects include computing devices, systems, and methods for task-based handling of nested repetitive processes in parallel. At least one processor of the computing device may be configured to partition iterations of an outer repetitive process and assign the partitions to initialized tasks to be executed in parallel by a plurality of processor cores. A shadow task may be initialized for each task to execute iterations of an inner repetitive process. Upon completing a task, divisible partitions of the outer repetitive process of ongoing tasks may be subpartitioned and assigned to the ongoing task, and the completed task and shadow task or a newly initialized task and shadow task. Upon completing all but one task and one iteration of the outer repetitive process, shadow tasks may be initialized to execute partitions of iterations of the inner repetitive process.
Abstract:
Methods and systems to convert scalar computer program loops having loop carried dependences to vector computer program loops are disclosed. One example method and system generates a first predicate set associated with a first conditionally executed statement. The first predicate set contains a first set of predicates that cause a variable to be defined in a scalar computer program loop at or before the variable is defined by the first conditionally executed statement. The method and system also generates a second predicate set associated with the first conditionally executed statement. The second predicate set contains a second set of predicates that cause the variable to be used in the scalar computer program loop at or before the variable is defined by the first conditionally executed statement. The method and system determines whether the second predicate set is a subset of the first predicate set and, based on the determination, propagates a vector value in an element of a vector of the variable to a subsequent element of the vector.