Abstract:
A programmable multi-mode accelerator is disclosed for use with a programmable processor or microprocessor. The programmable multi-mode accelerator allows a programmable processor to execute specific algorithms, such as certain types of finite impulse response (FIR), correlation and Viterbi computations, that require low-precision operations at an extremely high rate. The accelerator extends the digital signal processor's performance into the required range for low-precision computations. The accelerator can be coupled with the main data path of a programmable processor or microprocessor and can directly read and write to the main register files of the programmable processor. In an illustrative implementation, the accelerator data path accesses its input values (source operands) directly from a main register file of the programmable processor and writes results back into a second main register file. The accelerator allows a plurality of low-precision algorithms requiring primarily addition or multiply-add computations, such as finite impulse response, correlation and Viterbi computations, to utilize the same adder cells. The accelerator includes a multi-mode adder that can be programmatically reconfigured to perform various addition computations. In a first mode, referred to as the “single-add mode,” the adder operates as a 17-input 16-bit adder. The single-add mode can be utilized to perform finite impulse response and correlation computations. The second mode, referred to as the “ACS mode,” can be utilized to perform Viterbi computations. The accelerator has a small instruction set and instruction memory and, once started by the main data path, the accelerator executes its own instruction stream. In addition, the accelerator includes a delay line having delays of $z^{-1}$ or $z^{-2}$.
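The two adder modes named in this abstract can be illustrated with a minimal software sketch. It is not the disclosed hardware: the function names `single_add` and `acs`, the wrap-around behavior on overflow, and the tie-breaking rule in the compare step are all assumptions made for illustration. The ACS function follows the generic add-compare-select step used in Viterbi decoding.

```python
def single_add(values, width=16):
    """Model of a 17-input adder: sum the inputs and keep `width` bits.

    Wrapping modulo 2**width on overflow is an assumption; the abstract
    does not say how overflow is handled.
    """
    if len(values) > 17:
        raise ValueError("single-add mode takes at most 17 inputs")
    return sum(values) & ((1 << width) - 1)


def acs(metric_a, branch_a, metric_b, branch_b):
    """Generic add-compare-select step of a Viterbi decoder.

    Adds a branch metric to each of two path metrics, compares the
    candidates, and returns (surviving metric, decision bit). Choosing
    path A on a tie is an assumption.
    """
    cand_a = metric_a + branch_a
    cand_b = metric_b + branch_b
    if cand_a <= cand_b:
        return cand_a, 0
    return cand_b, 1


# FIR-style use: with low-precision (+1/-1) coefficients, each tap is just
# an add or subtract, so one output sample is a single multi-input sum.
samples = [3, 5, 2, 7]
coeffs = [1, -1, 1, 1]
fir_output = single_add([c * x for c, x in zip(coeffs, samples)])
```

In this sketch a negative sum would come back in 16-bit two's-complement form, because the result is masked to `width` bits.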
Abstract:
A method and apparatus are disclosed for allocating a section of a cache memory to one or more tasks. A set index value that identifies a corresponding set in the cache memory is transformed to a mapped set index value that constrains a given task to the corresponding allocated section of the cache. The allocated section of the cache can be varied by selecting an appropriate map function. When the map function is embodied as a logical AND function, for example, individual sets can be included in an allocated section by setting a corresponding bit value to a binary value of one. A cache addressing scheme is also disclosed that permits a desired portion of a cache to be selectively allocated to one or more tasks. A desired location and size of the allocated section of sets of the cache memory may be specified.
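As a rough illustration of the AND-based map function described above, the sketch below masks a set index so that a task can only reach a subset of the sets. The function name `map_set_index`, the 16-set cache size, and the `base` offset used to position the section are hypothetical; the abstract only states that a location and size may be specified, not how.

```python
def map_set_index(set_index, mask, base=0):
    """Constrain a set index to an allocated section of the cache.

    mask: AND mask (the map function named in the abstract); the number
          of one bits fixes the size of the section.
    base: hypothetical OR offset that fixes where the section sits.
          It must not overlap the mask bits.
    """
    if mask & base:
        raise ValueError("base must not overlap mask bits")
    return (set_index & mask) | base


NUM_SETS = 16  # assumed cache geometry for the example

# Every possible set index of the task is folded into sets 8..11.
section = sorted({map_set_index(i, mask=0b0011, base=0b1000)
                  for i in range(NUM_SETS)})
```

Here a two-bit mask yields a four-set section, and changing `base` moves that section elsewhere in the cache without changing its size.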
Abstract:
A method and apparatus are disclosed for adaptively decreasing cache thrashing in a cache memory device. Cache performance is improved by automatically detecting thrashing of a set and then providing one or more augmentation frames as additional cache space. In one embodiment, the augmentation frames are obtained by mapping the blocks that map to a thrashed set to one or more additional, less utilized sets. The disclosed cache thrashing reduction system initially identifies a set that is likely to be experiencing thrashing, referred to herein as a thrashed set. Once thrashing is detected, the cache thrashing reduction system selects one or more additional sets to augment a thrashed set, referred to herein as the augmentation sets. In this manner, blocks of main memory that are mapped to a thrashed set are now mapped to an expanded group of sets (the thrashed set and the augmentation sets). Finally, when the augmentation sets are no longer likely to be needed to decrease thrashing, the augmentation set(s) are disassociated from the thrashed set(s).
Abstract:
A scheme for variable-delay instructions in a digital processor is disclosed that allows a variable delay for some instructions to increase performance at different frequencies. The variable-delay (VD) feature allows flag-modifying instructions to execute in a differing number (1 or 2) of clock cycles, depending on the application. In applications that clock the processor at less than maximum frequency, instructions that modify the flag are executed in one clock cycle. In applications that clock the processor at its maximum frequency, the instructions that modify the flag are executed in two clock cycles. If the critical path, and consequently the maximum frequency, of a processor is determined by a flag-modifying operation immediately followed by a flag-reading operation, then the VD scheme helps increase performance at either frequency. The performance increase is proportional to the difference in delays between the critical path associated with flag-modifying and other critical paths. At the lower frequency, a given application consumes slightly less energy and the cost of implementing the scheme is minimal.
Abstract:
A method and apparatus are disclosed for locking the most recently accessed frames in a cache memory. The most recently accessed frames in a cache memory are likely to be accessed by a task again in the near future. The most recently used frames may be locked at the beginning of a task switch or interrupt to improve the performance of the cache. The list of most recently used frames is updated as a task executes and may be embodied, for example, as a list of frame addresses or a flag associated with each frame. The list of most recently used frames may be separately maintained for each task if multiple tasks may interrupt each other. An adaptive frame unlocking mechanism is also disclosed that automatically unlocks frames that may cause a significant performance degradation for a task. The adaptive frame unlocking mechanism monitors a number of times a task experiences a frame miss and unlocks a given frame if the number of frame misses exceeds a predefined threshold.
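The locking and adaptive unlocking behavior described above might be modeled in software roughly as follows. This is a sketch under stated assumptions, not the disclosed mechanism: the class name `FrameLocker`, the fixed-size most-recently-used list, and the choice to count misses per locked frame are all hypothetical.

```python
from collections import OrderedDict


class FrameLocker:
    """Toy model of MRU frame locking with adaptive unlocking."""

    def __init__(self, mru_size=2, miss_threshold=3):
        self.mru = OrderedDict()      # frames, least recent first
        self.mru_size = mru_size      # assumed length of the MRU list
        self.miss_threshold = miss_threshold
        self.locked = {}              # locked frame -> miss count

    def access(self, frame):
        """Update the MRU list as the running task touches a frame."""
        self.mru.pop(frame, None)
        self.mru[frame] = None
        while len(self.mru) > self.mru_size:
            self.mru.popitem(last=False)  # drop the least recent frame

    def task_switch(self):
        """Lock the current MRU frames at a task switch or interrupt."""
        self.locked = {frame: 0 for frame in self.mru}

    def frame_miss(self, frame):
        """Record a miss charged to a locked frame (assumed accounting).

        Returns True if the frame was unlocked because its miss count
        exceeded the threshold.
        """
        if frame not in self.locked:
            return False
        self.locked[frame] += 1
        if self.locked[frame] > self.miss_threshold:
            del self.locked[frame]
            return True
        return False
```

A per-task instance of such a structure would correspond to the separately maintained lists mentioned for tasks that may interrupt each other.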
Abstract:
The present invention is a variable-delay division (VDD) scheme implementable in hardware to execute signed and unsigned integer division and remainder operations in a digital processor. The VDD scheme advantageously uses hardware utilized for multiplication to implement a 2-bits/cycle alignment step to iteratively align the divisor with the dividend. This speeds up the alignment phase of integer division. Quotient bits are produced at the rate of 1-bit/cycle using the well-known restoring scheme. For 32-bit 2's complement operands, the scheme has a delay less than a fixed-delay scheme for most operands.
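The two phases described above, alignment at 2 bits per cycle followed by restoring division at 1 quotient bit per cycle, can be sketched for unsigned operands as below. The function name `vdd_divide`, the exact alignment stopping condition, and the cycle-count model are assumptions for illustration; sign handling and the reuse of multiplier hardware are not modeled.

```python
def vdd_divide(dividend, divisor):
    """Unsigned restoring division with a 2-bits/cycle alignment phase.

    Returns (quotient, remainder, cycles). The cycle count is a simple
    model: one cycle per alignment step and one per quotient bit.
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    if dividend < 0 or divisor < 0:
        raise ValueError("this sketch handles unsigned operands only")

    # Alignment phase: shift the divisor left two bit positions per cycle
    # for as long as the shifted divisor still fits under the dividend.
    shift = 0
    cycles = 0
    while (divisor << (shift + 2)) <= dividend:
        shift += 2
        cycles += 1

    # Now divisor << (shift + 2) > dividend, so the quotient needs at
    # most shift + 2 bits. Produce them one per cycle, restoring style.
    quotient = 0
    remainder = dividend
    for bit in range(shift + 1, -1, -1):
        cycles += 1
        trial = remainder - (divisor << bit)
        if trial >= 0:
            remainder = trial          # subtraction succeeded: bit is 1
            quotient |= 1 << bit
        # otherwise the remainder is "restored" (left unchanged): bit is 0
    return quotient, remainder, cycles
```

Under this model the delay depends on the operands: small quotients finish in a few cycles, which is the sense in which the delay is variable rather than fixed.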
Abstract:
Most recently accessed frames are locked in a cache memory. The most recently accessed frames are likely to be accessed by a task again in the near future and may be locked at the beginning of a task switch or interrupt to improve cache performance. The list of most recently used frames is updated as a task executes and may be embodied as a list of frame addresses or a flag associated with each frame. The list of most recently used frames may be separately maintained for each task if multiple tasks may interrupt each other. An adaptive frame unlocking mechanism is also disclosed that automatically unlocks frames that may cause a significant performance degradation for a task. The adaptive frame unlocking mechanism monitors a number of times a task experiences a frame miss and unlocks a given frame if the number of frame misses exceeds a predefined threshold.
Abstract:
A method and apparatus are described for distributing multi-source/multi-sink control signals among nodes on a chip. Each node on the chip assists in returning the control signal to an inactive state at the start of each cycle. Thus, since all nodes contribute to returning the control signal to the inactive state, the control signal returns to the inactive state more quickly, near the start of a given cycle, and the remainder of the cycle remains available for a given node to drive the control signal. Each node includes an exemplary pulsed reset block that discharges the control signal network closest to it for a short interval, and over time the rest of the network, returning the network to an inactive state. Once the control signal network has been returned to an inactive state, the control signal may then be driven by a node during the remainder of the cycle.
Abstract:
A RAKE receiver for use in a CDMA system is implemented as a transverse correlator in the complex domain. The transverse topology results in the correlator comprising a plurality of serial stages, each stage formed as a canonical unit of a multiplier, adder and memory. When implemented in the complex domain, the multiplier is replaced by multiplexers and the hardware may be significantly reduced by multiplexing between the I and Q components.
Abstract:
An integrated circuit having a digital processor, a decode stage for decoding an instruction from the instruction set, an execute stage coupled to the decode stage for executing the instruction, and event logic coupled to the decode stage operable to provide an event command to the decode stage to override the instruction. In one embodiment, an integrated circuit having a pipelined processor handles multiple precise events through the decode stage and execute stage through a process which includes the steps of detecting a plurality of events and issuing an event command, selecting a highest priority event from said plurality of events, providing an event vector and a link address for the highest priority event, and allowing the event vector and the link address to be modified for a higher priority event until the event command is issued to the execute stage.