Abstract:
In a method of operating a computer system, an instruction loop is executed by a processor in which each iteration of the instruction loop accesses a current data vector and an associated current vector predicate. The instruction loop is repeated when the current vector predicate indicates the current data vector contains at least one valid data element and the instruction loop is exited when the current vector predicate indicates the current data vector contains no valid data elements.
Abstract:
Disclosed embodiments relate to methods of using a processor to load and duplicate scalar data from a source into a destination register. The data may be duplicated in byte, half word, word or double word parts, according to a duplication pattern.
Abstract:
In one embodiment, a system includes a memory and a processor core. The processor core includes functional units and an instruction decode unit configured to determine whether an execute packet of instructions received by the processing core includes a first instruction that is designated for execution by a first functional unit of the functional units and a second instruction that is a condition code extension instruction that includes a plurality of sets of condition code bits, wherein each set of condition code bits corresponds to a different one of the functional units, and wherein the sets of condition code bits include a first set of condition code bits that corresponds to the first functional unit. When the execute packet includes the first and second instructions, the first functional unit is configured to execute the first instruction conditionally based upon the first set of condition code bits in the second instruction.
Abstract:
In one disclosed embodiment, a processor includes a first execution unit and a second execution unit, a register file, and a data path including a plurality of lanes. The data path and the register file are arranged so that writing to the register file by the first execution unit and by the second execution unit is allowed over the data path, reading from the register file by the first execution unit is allowed over the data path, and reading from the register file by the second execution unit is not allowed over the data path. The processor also includes a power control circuit configured to, when a transfer of data between the register file and either of the first and second execution units uses less than all of the lanes, power down the lanes of the data path not used for the transfer of the data.
Abstract:
The number of registers required is reduced by overlapping scalar and vector registers. This allows increased compiler flexibility when mixing scalar and vector instructions. Local register read ports are reduced by restricting read access. Dedicated predicate registers reduce requirements for general registers, and allows reduction of critical timing paths by allowing the predicate registers to be placed next to the predicate unit.
Abstract:
One embodiment of this invention provides two conditional execution auxiliary instructions directed to disparate subsets of the plural functional units. Depending on the conditional execution desired, only one of the two conditional execution auxiliary instructions may be required for a particular execute packet. Another embodiment of this invention employs only one of two possible register files for the condition registers. In a VLIW processor it may be advantageous to split the functional units into separate sets with corresponding register files. This limits the number of functional units that may simultaneously access the register files. In the preferred embodiment of this invention the functional units are divided into a scalar set which access scalar registers and a vector set which access vector registers. The data registers storing the conditions for both scalar and vector instructions are in the scalar data register file.
Abstract:
This invention addresses implements a range of interesting technologies into a single block. Each DSP CPU has a streaming engine. The streaming engines include: a SE to L2 interface that can request 512 bits/cycle from L2; a loose binding between SE and L2 interface, to allow a single stream to peak at 1024 bits/cycle; one-way coherence where the SE sees all earlier writes cached in system, but not writes that occur after stream opens; full protection against single-bit data errors within its internal storage via single-bit parity with semi-automatic restart on parity error.
Abstract:
This invention is a digital signal processor capable of performing correlation of data with pseudo noise for code division multiple access (CDMA) decoding using clusters. Each cluster includes plural multipliers. The multipliers multiply real and imaginary parts of packed data by corresponding pseudo noise data. Within a cluster the real parts and the imaginary parts of the products are summed separately. This forms plural complex number outputs equal in number to the number of clusters. The pseudo noise data is offset relative to the data input differing amounts for different clusters. The clusters are divided into first half clusters receiving data from even numbered slots and second half clusters receiving data from odd numbered slots. The correlation unit includes a mask input to selectively zero a multiplier product.
Abstract:
This invention is a digital signal processor capable of performing correlation of data with pseudo noise for code division multiple access (CDMA) decoding using clusters. Each cluster includes plural multipliers. The multipliers multiply real and imaginary parts of packed data by corresponding pseudo noise data. Within a cluster the real parts and the imaginary parts of the products are summed separately. This forms plural complex number outputs equal in number to the number of clusters. The pseudo noise data is offset relative to the data input differing amounts for different clusters. The clusters are divided into first half clusters receiving data from even numbered slots and second half clusters receiving data from odd numbered slots. The correlation unit includes a mask input to selectively zero a multiplier product.
Abstract:
This invention addresses implements a range of interesting technologies into a single block. Each DSP CPU has a streaming engine. The streaming engines include: a SE to L2 interface that can request 512 bits/cycle from L2; a loose binding between SE and L2 interface, to allow a single stream to peak at 1024 bits/cycle; one-way coherence where the SE sees all earlier writes cached in system, but not writes that occur after stream opens; full protection against single-bit data errors within its internal storage via single-bit parity with semi-automatic restart on parity error.