摘要:
A neural network architecture consisting of input weight multiplications, product summation, neural state calculations, and complete connectivity among the neuron processing elements. Neural networks are modelled using a sequential pipelined neurocomputer producing high performance with minimum hardware by sequentially processing each neuron in the completely connected network model. An N neuron network is implemented using multipliers, a pipelined adder tree structure, and activation functions. The activation functions are provided by using one activation function module and sequentially passing the N input product summations sequentially through it. One bus provides N.times.N communications by sequentially providing N neuron values to the multiplier registers. The neuron values are ensured of reaching corresponding multipliers through a tag compare function. The neuron information includes a source tag and a valid signal. Higher performance is provided by connecting a number of the neurocomputers in a parallel.
摘要:
The architectures for a scalable neural processor (SNAP) and a Triangular Scalable Neural Array Processor (T-SNAP) are expanded to handle network simulations where the number of neurons to be modeled exceeds the number of physical neurons implemented. This virtual neural processing is described for three general virtual architectural approaches for handling the virtual neurons, one for SNAP and one for TSNAP, and a third approach applied to both SNAP and TSNAP.
摘要:
An Array Processor and Method for a Scalable Array Neural Processor (SNAP) permits computing as a dynamic and highly parallel computationally intensive system typically consisting of input weight multiplications, product summation, neural state calculations, and complete connectivity among the neurons. The Scalable Neural Array Processor (SNAP) uses a unique intercommunication scheme within an array structure that provides high performance for completely connected network models such as the Hopfield model. SNAP's packaging and expansion capabilities are addressed, demonstrating SNAP's scalability to larger networks. The array processor is scalable. It has an array of function elements and a plurality of orthogonal horizontal and vertical processing elements for communication, computation and reduction. This structure permits in a first computation state the generation of a set of output values and in the first communication state the processing elements produce, responsive to the output values, first reduction values. In a second computation state processing elements, responsive to the first reduction values, generate vertical output values, and in a second computation state the vertical output values are communicated back to the inputs of the function elements. Responsive to a third computation state responsive to the vertical output values, a second set of output values is generated by said function elements, and in a third communication state the horizontal processing elements produce second reduction values. In a fourth computation state the horizontal processing elements generate horizontal output values, and responsive to a fourth communication state the horizontal processing elements communicate the horizontal output values back to the inputs of the function elements.
摘要:
An input/output bus for a data processing system which has extended addressing capabilities and a variable length handshake which accommodates the difference delays associated with various sets of logic and a two part address field which allows a bus unit and channel to be identified. The various units can disconnect from the bus during internal processing to free the bus for other activity. The unit removes the busy signal prior to dropping the data lines to allow a bus arbitration sequence to occur without slowing down the bus.
摘要:
Hardware and software techniques for interrupt detection and response in a scalable pipelined array processor environment are described. Utilizing these techniques, a sequential program execution model with interrupts can be maintained in a highly parallel scalable pipelined array processing containing multiple processing elements and distributed memories and register files. When an interrupt occurs, interface signals are provided to all PEs to support independent interrupt operations in each PE dependent upon the local PE instruction sequence prior to the interrupt. Processing/element exception interrupts are supported and low latency interrupt processing is also provided for embedded systems where real time signal processing is to required. Further, a hierarchical interrupt structure is used allowing a generalized debug approach using debut interrupts and a dynamic debut monitor mechanism.
摘要:
An array processor includes processing elements arranged in clusters which are, in turn, combined in a rectangular array. Each cluster is formed of processing elements which preferably communicate with the processing elements of at least two other clusters. Additionally each inter-cluster communication path is mutually exclusive, that is, each path carries either north and west, south and east, north and east, or south and west communications. Due to the mutual exclusivity of the data paths, communications between the processing elements of each cluster may be combined in a single inter-cluster path. That is, communications from a cluster which communicates to the north and east with another cluster may be combined in one path, thus eliminating half the wiring required for the path. Additionally, the length of the longest communication path is not directly determined by the overall dimension of the array, as it is in conventional torus arrays. Rather, the longest communications path is limited only by the inter-cluster spacing. In one implementation, transpose elements of an N×N torus are combined in clusters and communicate with one another through intra-cluster communications paths. Since transpose elements have direct connections to one another, transpose operation latency is eliminated in this approach. Additionally, each PE may have a single transmit port and a single receive port. As a result, the individual PEs are decoupled from the topology of the array.
摘要:
The ManArray processor is a scalable indirect VLIW array processor that defines two preferred architectures for indirect VLIW memories. One approach treats the VIM as one composite block of memory using one common address interface to access any VLIW stored in the VIM. The second approach treats the VIM as made up of multiple smaller VIMs each individually associated with the functional units and each individually addressable for loading and reading during XV execution. The VIM memories, contained in each processing element (PE), are accessible by the same type of LV and XV Short Instruction Words (SIWs) as in a single processor instantiation of the indirect VLIW architecture. In the ManArray architecture, the control processor, also called a sequence processor (SP), fetches the instructions from the SIW memory and dispatches them to itself and the PEs. By using the LV instruction, VLIWs can be loaded into VIMs in the SP and the PEs. Since the LV instruction is supplied by the SP through the instruction stream, when VLIWs are being loaded into any VIM no other processing takes place. In addition, as defined in the ManArray architecture, when the SP is processing SIWs, such as control and other sequential code, the PE array is not executing any instructions. Techniques are provided herein to independently load the VIMs concurrent with SIW or iVLIW execution on the SP or on the PEs thereby allowing the load latency to be hidden by the computation.
摘要:
A pipelined data processing unit includes an instruction sequencer and n functional units capable of executing n operations in parallel. The instruction sequencer includes a random access memory for storing very-long-instruction-words (VLIWs) used in operations involving the execution of two or more functional units in parallel. Each VLIW comprises a plurality of short-instruction-words (SIWs) where each SIW corresponds to a unique type of instruction associated with a unique functional unit. VLIWs are composed in the VLIW memory by loading and concatenating SIWs in each address, or entry. VLIWs are executed via the execute-VLIW (XV) instruction. The iVLIWs can be compressed at a VLIW memory address by use of a mask field contained within the XV1 instruction which specifies which functional units are enabled, or disabled, during the execution of the VLIW. The mask can be changed each time the XV1 instruction is executed, effectively modifying the VLIW every time it is executed. The VLIW memory (VIM) can be further partitioned into separate memories each associated with a function decode-and-execute unit. With a second execute VLIW instruction XV2, each functional unit's VIM can be independently addressed thereby removing duplicate SIWs within the functional unit's VIM. This provides a further optimization of the VLIW storage thereby allowing the use of smaller VLIW memories in cost sensitive applications.
摘要:
An apparatus for concurrently executing controller single instruction single data (SISD) instructions and single instruction multiple data (SIMD) processing element instructions comprising a combined controller and processing element. At least first and second simplex instructions each comprise a mode of operation bit, said mode of operation bit in the first simplex instruction specifying a controller SISD operation for execution by the controller, and the mode of operation bit in the second simplex instruction specifying a procesing element SIMD operation for execution by the processsing element. A very long instruction word (VLIW) contains said at least first and second simplex instructions.
摘要:
A multi-processor array organization is dynamically configured by the inclusion of a configuration topology field in instructions broadcast to the processors in the array. Each of the processors in the array is capable of performing a customized data selection and storage, instruction execution, and result destination selection, by uniquely interpreting a broadcast instruction by using the identity of the processor executing the instruction. In this manner, processing elements in a large multi-processing array can be dynamically reconfigured and have their operations customized for each processor using broadcast instructions.