Abstract:
Methods and apparatus for parallel processing are provided. A multicore processor is described. The multicore processor may include a distributed memory unit with memory nodes coupled to the processor's cores. The cores may be configured to execute parallel threads, and at least one of the threads may be data-dependent on at least one of the other threads. The distributed memory unit may be configured to proactively send shared memory data from a thread that produces the shared memory data to one or more of the threads.
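This abstract describes hardware, so any code can only be a loose software analogue. The sketch below, in C++ and entirely an assumption, models the consumer-local memory node as a std::promise/std::future pair so the producing thread pushes the shared value forward as soon as it exists, instead of the data-dependent thread fetching it on demand:

// Software analogue of proactive forwarding: the producer writes into the
// consumer's "mailbox" (a promise) the moment the datum is produced.
#include <future>
#include <iostream>
#include <thread>

int main() {
    std::promise<int> mailbox;                      // consumer-local memory node
    std::future<int> forwarded = mailbox.get_future();

    std::thread producer([&mailbox] {
        int shared = 40 + 2;                        // produce the shared datum
        mailbox.set_value(shared);                  // proactively forward it
    });

    std::thread consumer([&forwarded] {
        // the data-dependent thread waits only for the forwarded copy to arrive
        std::cout << "received " << forwarded.get() << "\n";
    });

    producer.join();
    consumer.join();
}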
Abstract:
Loop vectorization methods and apparatus are disclosed. An example method includes generating a first control mask for a first set of iterations of a loop by evaluating a condition of the loop, wherein generating the first control mask includes setting a bit of the first control mask to a first value when the condition indicates that an operation of the loop is to be executed, and setting the bit of the first control mask to a second value when the condition indicates that the operation of the loop is to be bypassed. The example method also includes compressing indexes corresponding to the first set of iterations of the loop according to the first control mask.
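A scalar C++ sketch of the mask-then-compress idea follows; the input array and the "element is positive" condition are assumptions standing in for the patent's loop, and a real implementation would use SIMD mask and compress instructions rather than scalar bit tests:

// Step 1 builds the control mask from the loop condition; step 2 compresses
// the indexes of the iterations whose mask bit is set.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> a = {3, -1, 7, 0, -5, 9, 2, -8};

    // Step 1: evaluate the condition for each iteration and record it in the mask.
    std::uint8_t mask = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] > 0)                          // condition true -> operation executes
            mask |= std::uint8_t(1) << i;      // bit stays 0 when the operation is bypassed

    // Step 2: compress the indexes selected by the control mask.
    std::vector<std::size_t> packed;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (mask & (std::uint8_t(1) << i))
            packed.push_back(i);

    for (std::size_t i : packed)
        std::cout << i << ' ';                 // prints 0 2 5 6
    std::cout << '\n';
}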
Abstract:
The invention relates to a method for optimizing the parallel processing of data on a hardware platform comprising at least one compute unit that comprises a plurality of processing units capable of executing a plurality of executable tasks in parallel, in which the data set to be processed is decomposed into data subsets, the same sequence of operations being performed on each data subset. The method of the invention comprises obtaining (50, 52) the maximum number of data subsets to be processed by the same sequence of operations and the maximum number of tasks executable in parallel by a compute unit of the hardware platform; determining (54) at least two processing partitionings, each processing partitioning corresponding to the division of the data set into a number of data groups and to the assignment of at least one executable task, capable of executing said sequence of operations, to each data subset of said data group; and selecting (60, 62) the processing partitioning that yields an optimal measurement value according to a predetermined criterion. Program code instructions implementing said selected processing partitioning are then obtained. One use of the method of the invention is the selection of an optimal hardware platform according to a measure of execution performance.
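As a loose illustration only: the sketch below assumes the predetermined criterion is wall-clock time and that the candidate partitionings are simply different group counts. It times each way of splitting the data set across threads and keeps the fastest, mirroring the determine-then-select steps of the abstract without any claim to match the patented method:

#include <chrono>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 1 << 20;
    std::vector<double> data(n, 1.5);
    std::vector<unsigned> candidates = {1, 2, 4, 8};      // candidate group counts

    unsigned best = 1;
    double best_ms = 1e300;
    for (unsigned groups : candidates) {                   // each candidate partitioning
        std::vector<double> partial(groups, 0.0);
        std::size_t chunk = n / groups;
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (unsigned g = 0; g < groups; ++g) {
            std::size_t begin = g * chunk;
            std::size_t end = (g + 1 == groups) ? n : begin + chunk;
            pool.emplace_back([&, g, begin, end] {         // same operation per subset
                double s = 0.0;
                for (std::size_t i = begin; i < end; ++i) s += data[i] * data[i];
                partial[g] = s;
            });
        }
        for (auto& t : pool) t.join();
        double ms = std::chrono::duration<double, std::milli>(
                        std::chrono::steady_clock::now() - t0).count();
        if (ms < best_ms) { best_ms = ms; best = groups; } // keep the optimal split
    }
    std::cout << "selected partitioning: " << best << " groups\n";
}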
Abstract:
The invention comprises (i) a compilation method for automatically converting a single-threaded software program into an application-specific supercomputer, and (ii) the supercomputer system structure generated as a result of applying this method. The compilation method comprises: (a) Converting an arbitrary code fragment from the application into customized hardware whose execution is functionally equivalent to the software execution of the code fragment; and (b) Generating interfaces on the hardware and software parts of the application, which (i) Perform a software-to-hardware program state transfer at the entries of the code fragment; (ii) Perform a hardware-to-software program state transfer at the exits of the code fragment; and (iii) Maintain memory coherence between the software and hardware memories. If the resulting hardware design is large, it is divided into partitions such that each partition can fit into a single chip. Then, a single union chip is created which can realize any of the partitions.
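The method targets hardware generation, so only the chip-partitioning step lends itself to a small software sketch. The module names, the per-chip area budget, and the greedy first-fit packing below are all assumptions for illustration, not the patent's partitioning algorithm:

// Greedily pack hardware modules into partitions that each fit one chip.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct Module { std::string name; double area; };

int main() {
    const double chip_capacity = 100.0;                    // assumed per-chip budget
    std::vector<Module> design = {
        {"loop_nest_0", 60}, {"mem_coherence_unit", 35},
        {"loop_nest_1", 55}, {"state_transfer_if", 20}, {"loop_nest_2", 70}};

    std::vector<std::vector<Module>> partitions;
    std::vector<double> used;
    for (const Module& m : design) {
        bool placed = false;
        for (std::size_t p = 0; p < partitions.size() && !placed; ++p)
            if (used[p] + m.area <= chip_capacity) {
                partitions[p].push_back(m);
                used[p] += m.area;
                placed = true;
            }
        if (!placed) { partitions.push_back({m}); used.push_back(m.area); }
    }

    // A single "union chip" would then be designed to realize any one partition.
    for (std::size_t p = 0; p < partitions.size(); ++p) {
        std::cout << "chip partition " << p << ":";
        for (const Module& m : partitions[p]) std::cout << ' ' << m.name;
        std::cout << '\n';
    }
}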
Abstract:
The present invention, in one embodiment, is a method of applying prediction techniques to the execution time of programs or jobs. The invention describes a method or process for handling the complexity of nested iterations and conditional statements.
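The abstract gives no algorithm, so the following C++ cost model is only one plausible reading: program structure is represented as a tree of blocks, loops, and conditionals, a loop multiplies its body's predicted time by an expected trip count, and a conditional weights its two branches by an assumed taken-probability. All node names and numbers are illustrative:

#include <iostream>
#include <memory>
#include <vector>

struct Node {
    enum Kind { Block, Loop, Cond } kind = Block;
    double cost = 0;      // Block: direct cost in seconds
    double trips = 0;     // Loop: expected iteration count
    double p_taken = 0;   // Cond: probability that the then-branch runs
    std::vector<std::unique_ptr<Node>> children;
};

double predict(const Node& n) {
    switch (n.kind) {
        case Node::Block:
            return n.cost;
        case Node::Loop: {                         // nested loops multiply naturally
            double body = 0;
            for (const auto& c : n.children) body += predict(*c);
            return n.trips * body;
        }
        case Node::Cond:                           // weight the two branches
            return n.p_taken * predict(*n.children.at(0))
                 + (1 - n.p_taken) * predict(*n.children.at(1));
    }
    return 0;
}

int main() {
    // for 100 iterations: 1 ms of work, then an "if" taken 30% of the time (5 ms vs 1 ms)
    auto then_b = std::make_unique<Node>(); then_b->kind = Node::Block; then_b->cost = 0.005;
    auto else_b = std::make_unique<Node>(); else_b->kind = Node::Block; else_b->cost = 0.001;
    auto cond = std::make_unique<Node>();   cond->kind = Node::Cond;    cond->p_taken = 0.3;
    cond->children.push_back(std::move(then_b));
    cond->children.push_back(std::move(else_b));
    auto work = std::make_unique<Node>();   work->kind = Node::Block;   work->cost = 0.001;

    Node loop; loop.kind = Node::Loop; loop.trips = 100;
    loop.children.push_back(std::move(work));
    loop.children.push_back(std::move(cond));

    std::cout << "predicted seconds: " << predict(loop) << "\n";   // 0.32
}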
Abstract:
Various technologies and techniques are disclosed for transforming a sequential loop into a parallel loop for use with a transactional memory system. A transactional memory system is provided. A first section of code containing an original sequential loop is transformed into a second section of code containing a parallel loop that uses transactions to preserve an original input to output mapping. For example, the original sequential loop can be transformed into a parallel loop by taking each iteration of the original sequential loop and generating a separate transaction that follows a pre-determined commit order process. At least some of the separate transactions are executed in different threads. When an unhandled exception is detected that occurs in a particular transaction while the parallel loop is executing, state modifications made by the particular transaction and predecessor transactions are committed, and state modifications made by successor transactions are discarded.
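A minimal C++ sketch of the commit-ordering idea, not a real transactional memory system: each iteration runs speculatively in its own thread, buffers its write, and is allowed to commit only after every predecessor iteration has committed, so the output matches the original sequential loop. Exception handling and rollback are omitted:

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const int n = 8;
    std::vector<int> out(n, 0);
    std::atomic<int> next_to_commit{0};       // enforces the original iteration order

    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i)
        workers.emplace_back([&, i] {
            int buffered = i * i;                       // speculative work, buffered
            while (next_to_commit.load() != i)          // wait for all predecessors
                std::this_thread::yield();
            out[i] = buffered;                          // commit the state modification
            next_to_commit.store(i + 1);                // release the successor
        });
    for (auto& t : workers) t.join();

    for (int v : out) std::cout << v << ' ';            // same mapping as the sequential loop
    std::cout << '\n';
}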
Abstract:
A software engine for decomposing work to be done into tasks, and distributing the tasks to multiple, independent CPUs for execution is described. The engine utilizes dynamic code generation, with run-time specialization of variables, to achieve high performance. Problems are decomposed according to methods that enhance parallel CPU operation, and provide better opportunities for specialization and optimization of dynamically generated code. A specific application of this engine, a software three dimensional (3D) graphical image renderer, is described.
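A hedged sketch of the decompose-and-distribute part only; the dynamic code generation and run-time specialization are omitted, and the tile size, frame size, and "rendering" are placeholders. A frame is split into tile tasks that worker threads, one per hardware CPU, pull from a shared counter:

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const int width = 256, height = 256, tile = 64;
    std::vector<float> frame(width * height, 0.f);

    // Decompose the frame into independent tile tasks.
    struct Tile { int x, y; };
    std::vector<Tile> tasks;
    for (int y = 0; y < height; y += tile)
        for (int x = 0; x < width; x += tile)
            tasks.push_back({x, y});

    // Distribute the tasks across one worker per hardware CPU.
    std::atomic<std::size_t> next{0};
    auto worker = [&] {
        for (std::size_t i; (i = next.fetch_add(1)) < tasks.size();)
            for (int dy = 0; dy < tile; ++dy)            // "render" one tile
                for (int dx = 0; dx < tile; ++dx)
                    frame[(tasks[i].y + dy) * width + tasks[i].x + dx] = 1.f;
    };
    unsigned cpus = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned c = 0; c < cpus; ++c) pool.emplace_back(worker);
    for (auto& t : pool) t.join();

    std::cout << "tiles rendered: " << tasks.size() << "\n";
}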
Abstract:
The sorting relation between the dimensions of an array and the loops is determined first, the loop most appropriate as a distribution candidate is selected, and the distribution of the array is determined in accordance with the selected loop. Consequently, the time taken to determine the sorting relation is shortened. The likelihood that the optimum sorting relation is finally employed is increased by retaining a plurality of sorting-relation candidates when determining the sorting relation between the array dimensions and the loops.
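As an illustrative guess at the selection step (not the patented procedure), the sketch below scores each loop against each array dimension, here simply by matching trip counts to dimension extents, keeps every plausible pairing as a candidate, and then picks the best candidate to decide which dimension the array is distributed along:

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct Loop      { std::string name; int trips; };
struct Candidate { std::size_t loop, dim; int score; };

int main() {
    std::vector<Loop> loops = {{"i", 1024}, {"j", 64}, {"k", 8}};
    std::vector<int> dims = {1024, 64};            // extents of array A[1024][64]

    std::vector<Candidate> candidates;             // keep several plausible pairings
    for (std::size_t l = 0; l < loops.size(); ++l)
        for (std::size_t d = 0; d < dims.size(); ++d)
            if (loops[l].trips == dims[d])
                candidates.push_back({l, d, loops[l].trips});

    // Select the pairing with the most parallel work and distribute along it.
    const Candidate* best = nullptr;
    for (const Candidate& c : candidates)
        if (!best || c.score > best->score) best = &c;
    if (best)
        std::cout << "distribute A along dimension " << best->dim
                  << ", driven by loop " << loops[best->loop].name << "\n";
}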