Abstract:
A method optimizes routing in a multiprocessor computer system by defining two types of virtual channels having virtual channel buffers for storing messages communicated between processing element nodes in the multiprocessor computer system. A dateline is associated with each type of virtual channel, and messages are restrained from crossing the dateline on the associated type of virtual channel to avoid deadlock. A cost function is defined which is correlated to imbalances in the utilization of the two types of virtual channels. The unrestrained messages are allocated between the two types of virtual channels to minimize the cost function by defining an initial virtual channel allocation, randomly modifying the virtual channel allocation, and accepting the random modification if it decreases the cost function, otherwise accepting the modification based on a probability that slowly decreases during the allocating step.
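The allocation loop described above is essentially a simulated-annealing search. The following is a minimal sketch of that loop in C, assuming a hypothetical cost() function that measures the utilization imbalance between the two virtual-channel types; the message count and cooling schedule are illustrative, not taken from the patent.

```c
/* Minimal sketch of the annealing-style allocation described above.
 * cost(), NUM_MSGS, and the 0/1 channel assignment are hypothetical
 * stand-ins for the patent's cost function and message set. */
#include <stdlib.h>
#include <math.h>

#define NUM_MSGS 1024

/* Hypothetical: returns a value correlated with the imbalance in
 * utilization of the two virtual-channel types under this assignment. */
extern double cost(const int assign[NUM_MSGS]);

void allocate_channels(int assign[NUM_MSGS])
{
    double temperature = 1.0;           /* acceptance probability decays with this */
    double current = cost(assign);

    for (int step = 0; step < 100000; step++) {
        int m = rand() % NUM_MSGS;      /* randomly modify the allocation */
        assign[m] ^= 1;                 /* flip message m to the other channel type */

        double trial = cost(assign);
        double delta = trial - current;

        if (delta <= 0.0 ||
            (double)rand() / RAND_MAX < exp(-delta / temperature)) {
            current = trial;            /* accept: cost decreased, or accepted by chance */
        } else {
            assign[m] ^= 1;             /* reject: undo the modification */
        }
        temperature *= 0.9999;          /* probability of accepting bad moves slowly decreases */
    }
}
```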
Abstract:
Method and apparatus for vector processing on a computer system. As the last element of a group of elements (called a "chunk") in a vector register is loaded from memory, the entire chunk is marked valid and thus made available for use by subsequent or pending operations. The vector processing apparatus comprises a plurality of vector registers, wherein each vector register holds a plurality of elements. For each of the vector registers, a validity indicator is provided, wherein each validity indicator indicates a subset of the elements in the corresponding vector register which are valid. A chunk-validation controller is coupled to the validity indicators and is operable to adjust the value of a validity indicator in response to a plurality of elements becoming valid. An arithmetic logical functional unit (ALFU) is coupled to the vector registers to execute functions specified by program instructions. A vector register controller is connected to control the vector registers in response to program instructions in order to cause valid elements of a selected vector register to be successively transmitted to the ALFU, so that elements are streamed through the ALFU at a speed determined by the availability of valid elements from the vector registers. The ALFU optionally comprises a processor pipeline to hold operand data for operations not yet completed while receiving operands for successive operations. The ALFU also optionally comprises an address pipeline to hold element addresses corresponding to the operands for operations not yet completed while receiving element addresses corresponding to the operands for successive operations.
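As a rough illustration of chunk-granularity validity, the sketch below assumes 64-element registers divided into 8-element chunks and a simple element counter as the validity indicator; the structure and names are illustrative, not the patented hardware.

```c
/* Illustrative sketch of chunk-granularity validity tracking for a vector
 * register: elements arrive one at a time, but validity is published a
 * whole chunk at a time once the chunk's last element has been loaded.
 * Sizes and names are assumptions, not taken from the patent. */
#include <stdint.h>
#include <stdbool.h>

#define VLEN        64      /* elements per vector register */
#define CHUNK_SIZE   8      /* elements per chunk */

typedef struct {
    uint64_t elements[VLEN];
    unsigned valid_count;   /* validity indicator: elements [0, valid_count) are valid */
    unsigned loaded;        /* elements received from memory so far (in order) */
} vector_register;

/* Called as each element returns from memory, in element order. */
void element_arrived(vector_register *vr, uint64_t value)
{
    vr->elements[vr->loaded++] = value;

    /* When the last element of a chunk lands, mark the whole chunk valid,
     * making it available to pending or subsequent operations at once. */
    if (vr->loaded % CHUNK_SIZE == 0)
        vr->valid_count = vr->loaded;
}

/* A consuming functional unit streams elements only as fast as they become valid. */
bool element_ready(const vector_register *vr, unsigned index)
{
    return index < vr->valid_count;
}
```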
Abstract:
A communications protocol having a plurality of signals, wherein the plurality of signals includes data packets, control packets, checksum packets, and sync symbols. One of the control packets is transmitted after a sync symbol is transmitted. One of the sync symbols, data packets, or checksum packets is transmitted after the control packet is transmitted. One of the sync symbols is transmitted after one of the checksum packets is transmitted, and one of the sync symbols or another of the data packets is transmitted after one of the data packets is transmitted.
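One way to read the ordering rules above is as a transition table over symbol types. The sketch below encodes exactly the successions stated in the abstract; the type names are illustrative.

```c
/* Sketch of the transmit-ordering rules stated above as a transition check.
 * Symbol names are illustrative; the table simply encodes which packet types
 * may legally follow which. */
#include <stdbool.h>

typedef enum { SYNC, CONTROL, DATA, CHECKSUM } symbol_t;

bool may_follow(symbol_t prev, symbol_t next)
{
    switch (prev) {
    case SYNC:     return next == CONTROL;                  /* a control packet follows a sync symbol */
    case CONTROL:  return next == SYNC || next == DATA ||
                          next == CHECKSUM;                 /* sync, data, or checksum follows control */
    case CHECKSUM: return next == SYNC;                     /* a sync symbol follows a checksum packet */
    case DATA:     return next == SYNC || next == DATA;     /* sync or more data follows data */
    }
    return false;
}
```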
Abstract:
A messaging facility is described that enables the passing of packets of data from one processing element to another in a globally addressable, distributed memory multiprocessor without having an explicit destination address in the target processing element's memory. The messaging facility can be used to accomplish a remote action by defining an opcode convention that permits one processor to send a message containing an opcode, an address, and arguments to another. The destination processor, upon receiving the message-arrival interrupt, can decode the opcode and perform the indicated action using the argument address and data. The messaging facility provides the primitives for the construction of an interprocessor communication protocol. Operating system communication and message-passing programming models can be accomplished using the messaging facility.
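A minimal sketch of such an opcode convention is shown below, assuming a hypothetical message layout and two illustrative opcodes; it is meant only to show how an arrival-interrupt handler could decode the opcode and act on the argument address and data.

```c
/* Sketch of the opcode convention described above: one PE sends a small
 * message carrying an opcode, a target address, and arguments; the receiving
 * PE decodes the opcode in its arrival-interrupt handler and performs the
 * action. Field widths, opcodes, and handler names are hypothetical. */
#include <stdint.h>

typedef struct {
    uint32_t opcode;     /* which remote action to perform */
    uint64_t address;    /* argument address in the destination's memory */
    uint64_t args[4];    /* argument data */
} message;

enum { OP_ATOMIC_ADD = 1, OP_ENQUEUE = 2 };   /* illustrative opcodes */

/* Invoked by the message-arrival interrupt on the destination PE. */
void message_arrival_handler(const message *m)
{
    switch (m->opcode) {
    case OP_ATOMIC_ADD:
        *(volatile uint64_t *)(uintptr_t)m->address += m->args[0];
        break;
    case OP_ENQUEUE:
        /* e.g. append the arguments to a software queue rooted at m->address */
        break;
    default:
        break;   /* unknown opcode: drop or signal an error */
    }
}
```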
Abstract:
A method of adjusting clock skew for a computer system, wherein the computer system includes a clock generator for generating a clock signal, at least one logic module, and a clock distribution network for carrying the clock signal from the clock generator to the logic modules, includes deskewing each of the logic modules and also deskewing the distribution network between the clock generator and the logic modules. Deskewing is performed by measuring a delay for the clock signal between a clock input and a test point on the logic module, comparing the measured delay to a desired delay, calculating an amount of adjustment needed to cause the measured delay to equal the desired delay, and programming a skew compensator on the logic module with the calculated amount of adjustment.
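The per-module deskew step reduces to measuring, comparing, and programming the difference. The sketch below assumes hypothetical helpers for the delay measurement and the skew-compensator write; it is not the patented hardware interface.

```c
/* Sketch of the deskew step described above, applied to one logic module:
 * measure the clock delay to the module's test point, compare it with the
 * desired delay, and program the module's skew compensator with the
 * difference. The measure/program helpers are hypothetical. */

/* Hypothetical hardware-access helpers. */
extern double measure_delay_ps(int module);                 /* clock input -> test point */
extern void   program_skew_compensator(int module, double adjustment_ps);

void deskew_module(int module, double desired_delay_ps)
{
    double measured   = measure_delay_ps(module);
    double adjustment = desired_delay_ps - measured;         /* amount needed so measured == desired */
    program_skew_compensator(module, adjustment);
}
```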
Abstract:
A barrier mechanism provides a low-latency method of synchronizing all or some of the processing elements (PEs) in a massively parallel processing system. The barrier mechanism is supported by several physical barrier synchronization circuits, each receiving an input from every PE in the processing system. Each PE has two associated barrier synchronization registers, in which each bit is used as an input to one of several logical barrier synchronization circuits. The hardware supports both a conventional barrier function and an alternative eureka function. Each bit in each of the barrier synchronization registers can be programmed to perform either a barrier or a eureka function, and all bits of the registers and all barrier synchronization circuits function independently. Partitioning among PEs is accomplished by a barrier mask and interrupt register which enables certain of the bits in the barrier synchronization registers for a defined group of PEs. Further partitioning is accomplished by providing bypass points in the physical barrier synchronization circuits to subdivide the physical barrier synchronization circuits into several types of PE barrier partitions of varying size and shape. The barrier mask and interrupt register and the bypass points are used in concert to accomplish flexible and scalable partitions corresponding to user-desired sizes and shapes, with a latency several orders of magnitude lower than that of existing software implementations.
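The barrier/eureka distinction amounts to an AND versus an OR across the participating PEs on a given register bit. The following sketch illustrates that per-bit rule, with an illustrative PE count and a membership mask standing in for the barrier mask; it is not the patented circuit.

```c
/* Sketch of the per-bit barrier/eureka distinction described above: for a
 * given register bit, a barrier completes when every participating PE has
 * set its input (AND), while a eureka fires as soon as any one PE has set
 * its input (OR). PE counts and names are illustrative. */
#include <stdbool.h>

#define NUM_PES 64

typedef enum { MODE_BARRIER, MODE_EUREKA } bit_mode;

/* inputs[pe] is the value PE 'pe' drives on this barrier-register bit;
 * member[pe] marks whether the PE belongs to the partition for this bit. */
bool bit_satisfied(bit_mode mode, const bool inputs[NUM_PES],
                   const bool member[NUM_PES])
{
    for (int pe = 0; pe < NUM_PES; pe++) {
        if (!member[pe])
            continue;                       /* masked out of this partition */
        if (mode == MODE_BARRIER && !inputs[pe])
            return false;                   /* barrier: all members must report */
        if (mode == MODE_EUREKA && inputs[pe])
            return true;                    /* eureka: any one member suffices */
    }
    return mode == MODE_BARRIER;            /* all members present (barrier) / none yet (eureka) */
}
```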
Abstract:
The present invention is an improved high-performance scalar/vector processor. In the preferred embodiment, the scalar/vector processor is used in a multiprocessor system. The scalar/vector processor comprises a scalar processor for operating on scalar and logical instructions, including a plurality of independent functional units operably connected to the scalar processor, a vector processor for operating on vector instructions, including a plurality of independent functional units operably connected to the vector processor, and an instruction control mechanism for fetching both the scalar and vector instructions from an instruction cache and controlling the operation of those instructions in both the scalar and vector processors. The instruction control mechanism is designed to enhance the performance of the scalar/vector processor by keeping a multiplicity of pipelines substantially filled with a minimum number of gaps.
Abstract:
A unified parallel processing architecture connects together an extendible number of clusters of multiple numbers of processors to create a high performance parallel processing computer system. Multiple processors are grouped together into four or more physically separable clusters, each cluster having a common cluster shared memory that is symmetrically accessible by all of the processors in that cluster; however, only some of the clusters are adjacently interconnected. Clusters are adjacently interconnected to form a floating shared memory if certain memory access conditions relating to relative memory latency and relative data locality can create an effective shared memory parallel programming environment. A shared memory model can be used with programs that can be executed in the cluster shared memory of a single cluster, or in the floating shared memory that is defined across an extended shared memory space comprised of the cluster shared memories of any set of adjacently interconnected clusters. A distributed memory model can be used with any programs that are to be executed in the cluster shared memories of any non-adjacently interconnected clusters. The adjacent interconnection of multiple clusters of processors to create a floating shared memory effectively combines all three types of memory models, pure shared memory, extended shared memory, and distributed shared memory, into a unified parallel processing architecture.
Abstract:
A digital optical serial communication system and encoding method comprises a transmitter responsive to an input of parallel information for parsing the information into 4-bit groups. The 4-bit groups are encoded into 5-bit codes having a 40/60 duty cycle and wherein no more than two consecutive bits are logical 1's or 0's on either end of the 5-bit code. The 5-bit codes are serially transmitted by an optical transmission medium for providing a conduit from the transmitter to a receiver. The receiver receives and decodes the serial information to 4-bit groups. The 4-bit groups are concatenated to form a parallel packet of information suitable for data processing. The encoding/decoding scheme has the advantages of (1) a worst case duty factor of 40/60%; (2) a maximum run of bits without transition equal to five; (3) an easily recaptured framing of packets due to a unique sync symbol; and (4) simple encoding and decoding of packets using combinational logic rather than lookup tables. In addition, data can be continuously sent via a communications protocol.
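The abstract does not list the 5-bit code table itself, but the selection constraints it states can be expressed as a predicate over candidate codes, as in the sketch below; this checks the stated properties only and is not the patent's code assignment.

```c
/* Sketch of the code-selection constraints stated above, as a predicate over
 * candidate 5-bit codes: each code must contain two or three 1 bits (a 40/60
 * duty cycle) and must not begin or end with a run of more than two identical
 * bits. This checks the stated properties; it is not the patent's code table. */
#include <stdbool.h>

static int ones(unsigned code5)           /* population count of the low 5 bits */
{
    int n = 0;
    for (int i = 0; i < 5; i++)
        n += (code5 >> i) & 1;
    return n;
}

bool valid_5bit_code(unsigned code5)
{
    code5 &= 0x1f;

    int n = ones(code5);
    if (n != 2 && n != 3)
        return false;                     /* duty cycle must be 40% or 60% */

    unsigned top3 = (code5 >> 2) & 0x7;   /* three most significant bits */
    unsigned low3 = code5 & 0x7;          /* three least significant bits */
    if (top3 == 0x7 || top3 == 0x0)
        return false;                     /* run of three identical bits at the leading end */
    if (low3 == 0x7 || low3 == 0x0)
        return false;                     /* run of three identical bits at the trailing end */

    return true;
}
```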
Abstract:
A completely shielded metallized connector block for use in multiple circuit modules of an electronic device. Electrical communication between the circuit boards is effected by an array of metallic pins which run through the blocks. The metallization on the nonconductive blocks can be held at ground or at a constant potential to increase the shielding between pins as well as to maintain voltage and ground planes at constant levels throughout the modules. The metallization is insulated from the pins and circuit boards by nonconductive bushings inserted in holes in the blocks. In one embodiment, the metallization consists of copper and solder plating and the blocks are constructed of liquid crystal polymer.