摘要:
An apparatus and method for partitioning programs between a general purpose core and one or more accelerators are provided. With the apparatus and method, a compiler front end is provided for converting a program source code in a corresponding high level programming language into an intermediate code representation. This intermediate code representation is provided to an interprocedural optimizer which determines which core processor or accelerator each portion of the program should execute on and partitions the program into sub-programs based on this set of decisions. The interprocedural optimizer may further add instructions to the partitions to coordinate and synchronize the sub-programs as required. Each sub-program is compiled on an appropriate compiler backend for the instruction set architecture of the particular core processor or accelerator selected to execute the sub-program. The compiled sub-programs and then linked to thereby generate an executable program.
摘要:
The present invention provides for a method for computer program code optimization for a software managed cache in either a uni-processor or a multi-processor system. A single source file comprising a plurality of array references is received. The plurality of array references is analyzed to identify predictable accesses. The plurality of array references is analyzed to identify secondary predictable accesses. One or more of the plurality of array references is aggregated based on identified predictable accesses and identified secondary predictable accesses to generate aggregated references. The single source file is restructured based on the aggregated references to generate restructured code. Prefetch code is inserted in the restructured code based on the aggregated references. Software cache update code is inserted in the restructured code based on the aggregated references. Explicit cache lookup code is inserted for the remaining unpredictable accesses. Calls to a miss handler for misses in the explicit cache lookup code are inserted. A miss handler is included in the generated code for the program. In the miss handler, a line to evict is chosen based on recent usage and predictability. In the miss handler, appropriate DMA commands are issued for the evicted line and the missing line.
摘要:
In an embodiment, asynchronous conflict events are received during a previous rollback period. Each of the asynchronous conflict events represent conflicts encountered by speculative execution of a first plurality of work units and may be received out-of-order. During a current rollback period, a first work unit is determined whose speculative execution raised one of the asynchronous conflict events, and the first work unit is older than all other of the first plurality of work units. A second plurality of work units are determined, whose ages are equal to or older than the first work unit, wherein each of the second plurality of work units are assigned to respective executing threads. Rollbacks of the second plurality of work units are performed. After the rollbacks of the second plurality of work units are performed, speculative executions of the second plurality of work units are initiated in age order, from oldest to youngest.
摘要:
A method (and system) for emulating a target system's memory addressing using a virtual-to-real memory mapping mechanism of a host multiprocessor system's operating system, includes inputting a target virtual memory address into a simulated page table to obtain a host virtual memory address. The target system is oblivious to the software it is running on.
摘要:
In a host system, a method for using instruction scheduling to efficiently emulate the operation of a target computing syste includes preparing, on the host system, an instruction sequence to interpret an instruction written for execution on the target computing system. An instruction scheduling on the instruction sequence is performed, to achieve an efficient instruction level parallelism, for the host system. A separate and independent instruction sequence is inserted, which, when executed simultaneously with the instruction sequence, performs to copy to a separate location a minimum instruction sequence necessary to execute an intent of an interpreted target instruction, the interpreted target instruction being a translation; and modifies the interpreter code such that a next interpretation of the target instruction results in execution of the translated version, thereby removing execution of interpreter overhead.
摘要:
A method (and system) for executing a multiprocessor program written for a target instruction set architecture on a host computing system having a plurality of processors designed to process instructions of a second instruction set architecture, includes representing each portion of the program designed to run on a processor of the target computing system as one or more program threads to be executed on the host computing system.
摘要:
The present invention provides inserting and deleting a breakpoint in a parallel processing system. A breakpoint is inserted in a module loaded into the execution environment of an attached processor unit. The breakpoint can be inserted directly. Furthermore, the unloaded image of the module can also have a breakpoint associated with it. The breakpoint can be inserted directly into the module image, or a breakpoint request can be generated, and the breakpoint is inserted when the module is loaded into the execution environment of the attached processor unit.
摘要:
The present invention provides for creating and employing code and data partitions in a heterogeneous environment. This is achieved by separating source code and data into at least two partitioned sections and at least one unpartitioned section. Generally, a partitioned section is targeted for execution on an independent memory device, such as an attached processor unit. Then, at least two overlay sections are generated from at least one partition section. The plurality of partition sections are pre-bound to each other. A root module is also created, associated with both the pre-bound plurality of partitions and the overlay sections. The root module is employable to exchange the at least two overlay sections between the first and second execution environments. The pre-bound plurality of partition sections are then bound to the at least one unpartitioned section. The binding produces an integrated executable.
摘要:
A system and method for managing position independent code using a software framework is presented. A software framework provides the ability to cache multiple plug-in's which are loaded in a processor's local storage. A processor receives a command or data stream from another processor, which includes information corresponding to a particular plug-in. The processor uses the plug-in identifier to load the plug-in from shared memory into local memory before it is required in order to minimize latency. When the data stream requests the processor to use the plug-in, the processor retrieves a location offset corresponding to the plug-in and applies the plug-in to the data stream. A plug-in manager manages an entry point table that identifies memory locations corresponding to each plug-in and, therefore, plug-ins may be placed anywhere in a processor's local memory.
摘要:
A central processing unit (CPU) in a computer that permits speculative parallel execution of more than one instruction thread. The CPU uses Fork-Suspend instructions that are added to the instruction set of the CPU, and are inserted in a program prior to run-time to delineate potential future threads for parallel execution. The CPU has an instruction cache with one or more instruction cache ports, a bank of one or more program counters, a bank of one or more dispatchers, a thread management unit that handles inter-thread communications and discards future threads that violate dependencies, a set of architectural registers common to all threads, and a scheduler that schedules parallel execution of the instructions on one or more functional units in the CPU.