摘要:
A parallel processing system and method is disclosed, which provides an improved instruction distribution mechanism for a parallel processing array. The invention broadcasts a basic instruction to each of a plurality of processor elements. Each processor element decodes the same instruction by combining it with a unique offset value stored in each respective processor element, to produce a derived instruction that is unique to the processor element. A first type of basic instruction results in the processor element performing a logical or control operation. A second type of basic instruction results in the generation of a pointer address. The pointer address has a unique address value because it results from combining the basic instruction with the unique offset value stored at the processor element. The pointer address is used to access an alternative instruction from an alternative instruction storage, for execution in the processor element. The alternative instruction is a very long instruction word, whose length is, for example, an integral multiple of the length of the basic instruction and contains much more information than can be represented by the basic instruction. A very long instruction word such as this is useful for providing parallel control of a plurality of primitive execution units that reside within the processor element. In this manner, a high degree of flexibility and versatility is attained in the operation of processor elements of a parallel processing array.
摘要:
A massively parallel processor apparatus having an instruction set architecture for each of the N ² the PEs of the structure. The apparatus which we prefer will have a PE structure consisting of PEs that contain instruction and data storage units, receive instructions and data, and execute instructions. The N ² structure should contain "N" communicating ALU trees, "N" programmable root tree processor units, and an arrangement for communicating both instructions, data, and the root tree processor outputs back to the input processing elements by means of the communicating ALU trees. The apparatus can be structured as a bit-serial or word parallel system. The preferred structure contains N ² PEs, identified a s PE column,row, in a N root tree processor system, placed in the form of a N by N processor array that has been folded along the diagonal and made up of diagonal cells and general cells. The Diagonal-Cells are comprised of a single processing element identified as PE i,i of the folded N by N processor array and the General-Cells are comprised of two PEs merged together, identified as PE i,j and PE j,i of the folded N by N processor array. Matrix processing algorithms are discussed followed by a presentation of the Diagonal-Fold Tree Array Processor architecture. The Massively Parallel Diagonal-Fold Tree Array Processor supports completely connected root tree processors through the use of the array of PEs that are interconnected by folded communication ALU trees.
摘要:
Described is a scalable compound instruction set machine and method which provides for processing a set of instructions or program to be executed by a computer to determine statically which instructions may be combined into compound instructions which are executed in parallel by a scalar machine. Such processing looks for classes of instructions that can be executed in parallel without data-dependent or hardware-dependent interlocks. Without regard to their original sequence the individual instructions are combined with one or more other individual instructions to form a compound instruction which eliminates interlocks. Control information is appended to identify information relevant to the execution of the compound instructions. The result is a stream of scalar instructions compounded or grouped together before instruction decode time so that they are already flagged and identified for selective simultaneous parallel execution by execution units. The compounding does not change the object code results and existing programs realize performance improvements while maintaining compatibility with previously implemented systems for which the original set of instructions was provided.
摘要:
A digital computer system is described capable of processing two or more computer instructions in parallel and having a cache storage unit for temporarily storing machine-level computer instructions in their journey from a higher-level storage unit of the computer system to the functional units which process the instructions. The computer system includes an instruction compounding unit located intermediate to the higher-level storage unit and the cache storage unit for analyzing the instructions and adding to each instruction a tag field which indicates whether or not that instruction may be processed in parallel with one or more neighboring instructions in the instruction stream. These tagged instructions are then stored in the cache unit. The computer system further includes a plurality of functional instruction processing units which operate in parallel with one another. The instructions supplied to these functional units are obtained from the cache storage unit. At instruction issue time, the tag fields of the instructions are examined and those tagged for parallel processing are sent to different ones of the functional units in accordance with the codings of their operation code fields.
摘要:
Image processing for multimedia workstations is a computationally intensive task requiring special purpose hardware to meet the high speed requirements associated with the task. One type of specialized hardware that meets the computation high speed requirements is the mesh connected computer. Such a computer becomes a massively parallel machine when an array of computers interconnected by a network are replicated in a machine. The nearest neighbor mesh computer consists of an N x N square array of Processor Elements(PEs) where each PE is connected to the North, South, East and West PEs only. Assuming a single wire interface between PEs, there are a total of 2N² wires in the mesh structure. Under the assumtion of SIMD operation with uni-directional message and data transfers between the processing elements in the meah, for example all PES transferring data North, it is possible to reconfigure the array by placing the symmetric processing elements together and sharing the north-south wires with the east-west wires, thereby reducing the wiring complexity in half, i.e. N² without affecting performance. The resulting diagonal folded mesh array processor, which is called Oracle, allows the matrix transformation operation to be accomplished in one cycle by simple interchange of the data elements in the dual symmetric processor elements. The use of Oracle for a parallel 2-D convolution mechanish for image processing and multimedia applications and for a finite difference method of solving differential equations is presented, concentrating on the computational aspects of the algorithm.
摘要:
A mechanism is presented for detecting overflow in an interlock collapsing hardware apparatus that simultaneously executes two instructions. The overflow is determined as if the second instruction executes by itself using results from execution of the first instruction. Overflow detection is accomplished by using only values input into, and generated within, the interlock collapsing apparatus.
摘要:
A multi-function ALU for use in digital data processing is described, which facilitates the execution of instructions in parallel, thereby increasing processor performance. The proposed apparatus reduces the instruction execution latency that results from data dependency hazards in a pipelined machine. This latency reduction is accomplished by collapsing the interlocks due to these hazards. The proposed apparatus achieves performance improvement while maintaining compatibility with previous implementations designed using an identical architecture.
摘要:
A system for dividing a digital dividend operand N by a digital divisor operand D to obtain a quotient operand Q with minimal execution time and hardware calculates a value NP₀P₁...P m , where the value P₀P₁...P m has a magnitude such that NP₀P₁...P m converges to Q and DP₀P₁ converges to 1. The divider employs a one's complementation, multiplication and addition sequence to calculate the value NP₀P₁...P m .
摘要:
A mechanism is presented for detecting overflow in an interlock collapsing hardware apparatus that simultaneously executes two instructions. The overflow is determined as if the second instruction executes by itself using results from execution of the first instruction. Overflow detection is accomplished by using only values input into, and generated within, the interlock collapsing apparatus.
摘要:
A digital computer system is described capable of processing two or more computer instructions in parallel and having a cache storage unit for temporarily storing machine-level computer instructions in their journey from a higher-level storage unit of the computer system to the functional units which process the instructions. The computer system includes an instruction compounding unit located intermediate to the higher-level storage unit and the cache storage unit for analyzing the instructions and adding to each instruction a tag field which indicates whether or not that instruction may be processed in parallel with one or more neighboring instructions in the instruction stream. These tagged instructions are then stored in the cache unit. The computer system further includes a plurality of functional instruction processing units which operate in parallel with one another. The instructions supplied to these functional units are obtained from the cache storage unit. At instruction issue time, the tag fields of the instructions are examined and those tagged for parallel processing are sent to different ones of the functional units in accordance with the codings of their operation code fields.