摘要:
A technique that improves both processor performance and associated data bandwidth through user-defined interfaces that can be added to a configurable and extensible microprocessor core. These interfaces can be used to communicate status or control information and to achieve synchronization between the processor and any external device including other processors. These interfaces can also be used to achieve data transfer at the rate of one data element per interface in every clock cycle. This technique makes it possible to design multiprocessor SOC systems with high-speed data transfer between processors without using the memory subsystem. Such a system and design methodology offers a complete shift from the standard bus-based architecture and allows designers to treat processors more like true computational units, so that designers can more effectively utilize programmable solutions rather than design dedicated hardware. This can have dramatic effects not only in the performance and bandwidth achieved by designs, but also in the time to market and reuse of such designs.
摘要:
A technique that improves both processor performance and associated data bandwidth through user-defined interfaces that can be added to a configurable and extensible microprocessor core. These interfaces can be used to communicate status or control information and to achieve synchronization between the processor and any external device including other processors. These interfaces can also be used to achieve data transfer at the rate of one data element per interface in every clock cycle. This technique makes it possible to design multiprocessor SOC systems with high-speed data transfer between processors without using the memory subsystem. Such a system and design methodology offers a complete shift from the standard bus-based architecture and allows designers to treat processors more like true computational units, so that designers can more effectively utilize programmable solutions rather than design dedicated hardware. This can have dramatic effects not only in the performance and bandwidth achieved by designs, but also in the time to market and reuse of such designs.
摘要:
A processor can achieve high code density while allowing higher performance than existing architectures, particularly for Digital Signal Processing (DSP) applications. In accordance with one aspect, the processor supports three possible instruction sizes while maintaining the simplicity of programming and allowing efficient physical implementation. Most of the application code can be encoded using two sets of narrow size instructions to achieve high code density. Adding a third (and larger, i.e. VLIW) instruction size allows the architecture to encode multiple operations per instruction for the performance critical section of the code. Further, each operation of the VLIW format instruction can optionally be a SIMD operation that operates upon vector data. A scheme for the optimal utilization (highest achievable performance for the given amount of hardware) of multiply-accumulate (MAC) hardware is also provided.
摘要:
A processor can achieve high code density while allowing higher performance than existing architectures, particularly for Digital Signal Processing (DSP) applications. In accordance with one aspect, the processor supports three possible instruction sizes while maintaining the simplicity of programming and allowing efficient physical implementation. Most of the application code can be encoded using two sets of narrow size instructions to achieve high code density. Adding a third (and larger, i.e. VLIW) instruction size allows the architecture to encode multiple operations per instruction for the performance critical section of the code. Further, each operation of the VLIW format instruction can optionally be a SIMD operation that operates upon vector data. A scheme for the optimal utilization (highest achievable performance for the given amount of hardware) of multiply-accumulate (MAC) hardware is also provided.
摘要:
A method and system for managing memory in a multiprocessor system includes defining the plurality of processor coherence domains within a system coherence domain of the multiprocessor system. The processor coherence domains each include a plurality of processors and a processor memory. Shared access to data in the processor memory of each processor coherence domain is provided only to elements of the multiprocessor system within the processor coherence domain. Non-shared access to data in the processor memory of each processor coherence domain is provided to elements of the multiprocessor system within and outside of the processor coherence domain.
摘要:
A memory access arbitration scheme is provided where transactions to a Shared memory are stored in an arbitration queue. Prior to arbitration, the transactions are compared against the contents of cache memory, to determine which transactions will hit in cache, which will miss and which will be victims. Also prior to arbitration, the entries in the arbitration queue are grouped according to a transaction parameter, such as DRAM bank, Write to Bank, Read to Bank, etc. Arbitration is the performed among those groups which are ready for service. From the group winning arbitration, the oldest transaction is selected for servicing. Preferably, a collapsible queuing structure and method is used, such that once a transaction is serviced, higher order entries ripple down in the queue to make room for new entries while maintaining an oldest to newest relationship among the queue entries.
摘要:
A multiprocessor system and method includes a processing sub-system including a plurality of processors in a processor memory system. A network is operable to couple the processing sub-system to an input/output (I/O) sub-system. The I/O sub-system includes a plurality of I/O interfaces each operable to couple a peripheral device to the multiprocessor system. The I/O interfaces each include a local memory operable to store exclusive read-only copies of data from the processor memory system for use by a corresponding peripheral device.
摘要翻译:多处理器系统和方法包括在处理器存储器系统中包括多个处理器的处理子系统。 网络可操作以将处理子系统耦合到输入/输出(I / O)子系统。 I / O子系统包括多个I / O接口,每个I / O接口可操作以将外围设备耦合到多处理器系统。 I / O接口各自包括本地存储器,其可操作以存储来自处理器存储器系统的数据的专用只读副本以供对应的外围设备使用。
摘要:
The present invention provides alignment and ordering of vector elements for SIMD processing. In the alignment of vector elements for SIMD processing, one vector is loaded from a memory unit into a first register and another vector is loaded from the memory unit into a second register. The first vector contains a first byte of an aligned vector to be generated. Then, a starting byte specifying the first byte of an aligned vector is determined. Next, a vector is extracted from the first register and the second register beginning from the first bit in the first byte of the first register continuing through the bits in the second register. Finally, the extracted vector is replicated into a third register such that the third register contains a plurality of elements aligned for SIMD processing. In the ordering of vector elements for SIMD processing, a first vector is loaded from a memory unit into a first register and a second vector is loaded from the memory unit into a second register. Then, a subset of elements are selected from the first register and the second register. The elements from the subset are then replicated into the elements in the third register in a particular order suitable for subsequent SIMD vector processing.
摘要:
A method for preventing deadlock due to the need for data exclusivity when performing forced atomic instructions in a multi-level cache in a multi-processor system. The system determines whether an aligned multi-byte word in which the data of a forced atomic instruction, such as an integer store operation, is exclusive in a first level cache. If so, the forced atomic instruction is allowed to enter a second level cache pipeline. If not, the forced atomic instruction is prevented from entering the second level cache pipeline and a cache miss and fill operation is initiated to cause the aligned word to be exclusive in the first level cache.
摘要:
The computer system having a first circuit board with a processor for processing information and a slot for receiving an IC card. The slot includes multiple pins for connection to the IC card. The IC card includes a second processor coupled to a second circuit board, where the processor is contained within outer framing structure. An interface coupled to the circuit board may be coupled to the multiple pins in the slot, such that the second processor in the integrated circuit card is able to control the computer system.