摘要:
A bi-endian multiprocessor system having multiple processing elements, each of which includes a processor core, a local memory and a memory flow controller. The memory flow controller transfers data between the local memory and data sources external to the processing element. If the processing element and the data source implement data representations having the same endian-ness, each multi-word line of data is stored in the local memory in the same word order as in the data source. If the processing element and the data source implement data representations having different endian-ness, the words of each multi-word line of data are transposed when data is transferred between local memory and the data source. The processing element may incorporate circuitry to add doublewords, wherein the circuitry can alternately carry bits from a first word to a second word or vice versa, depending upon whether the words in lines of data are transposed.
摘要:
A bi-endian multiprocessor system having multiple processing elements, each of which includes a processor core, a local memory and a memory flow controller. The memory flow controller transfers data between the local memory and data sources external to the processing element. If the processing element and the data source implement data representations having the same endian-ness, each multi-word line of data is stored in the local memory in the same word order as in the data source. If the processing element and the data source implement data representations having different endian-ness, the words of each multi-word line of data are transposed when data is transferred between local memory and the data source. The processing element may incorporate circuitry to add doublewords, wherein the circuitry can alternately carry bits from a first word to a second word or vice versa, depending upon whether the words in lines of data are transposed.
摘要:
A novel method for optimizing the design of digital circuits containing clock gated memory elements. The method unclock gates memory elements by adding necessary feedback loops. Logic functions of memory element outputs in the circuit are viewed as a whole, rather than as separate functions for each input. Detection of duplicate unclock gated memory elements is then effected by identifying identical canonical representations of said unclock gated memory elements. Identified duplicate clock gated memory elements can then be eliminated from the original digital circuit. Further optimization can be accomplished by applying standard logic optimization algorithms to all unclock gated memory elements in said digital circuit. The resulting optimized circuit is clock gated and replaces the original clock gated circuit in said digital circuit.
摘要:
The present invention provides a fully automatic method for obtaining a circuit having minimized power consumption due to clock-gating. A circuit design to be optimized is modified to a reduced power modified design and associated with a clock gating scheme. Verification tools compare the modified design with the original design to a predetermined trigger-events to determine if the modified design can be used. Further modifications may be made iteratively until an optimal design is achieved.
摘要:
A method of reducing latency in instruction processing in a system, includes calculating a result of a first execution unit, storing the result of the first execution unit in a register file, forwarding the result of the first execution unit, through the bypass unit, to a second execution unit, the second execution unit conducting an instruction dependent on the result, forwarding the result of the first execution unit, from the bypass unit, to a third execution unit, without accessing the register file, the third execution unit conducting an instruction dependent on the result, wherein the execution units can extract the result of the first execution unit through the bypass unit until the new result is calculated, wherein after the new result is calculated, the execution units can access the result of the first execution unit through the register file.
摘要:
An electronic computing circuit for implementing a method for reducing the bit width of two operands from a bit length N to a reduced bit length M, thus, M
摘要:
Instead of having a processor with an instruction set architecture (ISA) that includes fixed architected operands, an improved processor supports additional characteristic bits for computing instructions (e.g., a multiply-add, load/store instructions). Such additional bits for the certain instructions influence the processing of these instructions by the processor. Also, a new instruction is introduced for further usage of the proposed method. Typically these additional characteristic bits as well as the instruction can be automatically generated by compilers to provide relatively well-suited instruction sequences for the processor.
摘要:
An electronic computing circuit for implementing a method for reducing the bit width of two operands from a bit length N to a reduced bit length M, thus, M
摘要:
A method for performing equivalence checking on logic circuit designs is disclosed. Within a composite netlist of an original version and a modified version of a logic circuit design, all level-sensitive sequential elements sensitized by a clock=0 are converted into buffers, and all level-sensitive sequential elements sensitized by a clock=1 are converted into level-sensitive registers. A subset of edge-sensitive sequential elements are selectively transformed into level-sensitive sequential elements by removing edge detection logic from the subset of the edge-sensitive sequential elements. A clock to the resulting sequential elements is then set to a logical “1” to verify the sequential equivalence of the transformed netlist.
摘要:
A crossbar (20) circuit with multiplexer (22A, 22B) circuits implemented in a polygonal form on a chip. The crossbar can be used for implementing a permutation of input bits (24A, 24B) controlled by a bit vector (25). Horizontal and vertical wiring lengths in the crossbar (20) are reduced by stacking the operand latches (24A, 24B, 25) and horizontal or vertical multiplexers (22A, 22B). This implementation decreases the latency of the crossbar and avoids latches to store intermediated results, thus reducing area and power consumption.