Abstract:
Techniques are disclosed for the use of fused vector processor instructions by a vector processor architecture. Each fused vector processor instruction may include a set of fields associated with individual vector processing instructions. The vector processor architecture may implement local buffers that allow a single vector processor instruction to execute each of the individual vector processing instructions without re-accessing the vector registers between them. The vector processor architecture reduces communication across the interconnection network, thereby increasing interconnection network bandwidth and the speed of computations, and decreasing power usage.
Abstract:
Techniques are disclosed for the use of local buffers integrated into the execution units of an array processor architecture. The use of local buffers reduces communication across the interconnection network implemented by the processors, increases interconnection network bandwidth, increases the speed of computations, and decreases power usage.
Abstract:
Techniques are disclosed for the implementation of a programmable processing array architecture that realizes vectorized processing operations for a variety of applications. Such vectorized processing operations may include digital front end (DFE) processing operations, which include finite impulse response (FIR) filter processing operations. The programmable processing array architecture provides a front-end interconnection network that generates specific data sliding time window patterns in accordance with the particular DFE processing operation to be executed. The architecture enables the processed data generated in accordance with these sliding time window patterns to be fed to a set of multipliers and adders to generate output data. The architecture allows a wide range of processing operations to be performed on a single programmable processing array platform by leveraging the programmable nature of the array and the use of instruction sets.
Abstract:
Techniques are disclosed for a vector processor architecture that enables data interpolation in accordance with multiple dimensions, such as one-, two-, and three-dimensional linear interpolation. The vector processor architecture includes a vector processor and accompanying vector addressable memory that enable a simultaneous retrieval of multiple entries in the vector addressable memory to facilitate linear interpolation calculations. The vector processor architecture vastly increases the speed at which such calculations may occur compared to conventional processing architectures. Example implementations include the calculation of digital pre-distortion (DPD) coefficients for use with radio frequency (RF) transmitter chains to support multi-band applications.
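As a rough illustration of why retrieving several table entries at once matters, the two-dimensional case can be sketched in Python. This is only a scalar software model under stated assumptions: the names `lerp` and `bilinear`, the integer-grid table layout, and the clamping rule are hypothetical and are not taken from the disclosure.

```python
def lerp(a, b, t):
    """One-dimensional linear interpolation between a and b, with t in [0, 1]."""
    return a + (b - a) * t


def bilinear(table, x, y):
    """Interpolate table (a list of rows sampled on an integer grid) at (x, y).

    Assumes 0 <= x <= len(table) - 1 and 0 <= y <= len(table[0]) - 1.
    """
    # Clamp the cell origin so that the i + 1 and j + 1 neighbours exist.
    i = min(int(x), len(table) - 2)
    j = min(int(y), len(table[0]) - 2)
    tx, ty = x - i, y - j
    # The four neighbouring entries. A vector addressable memory would fetch
    # these together; this sketch simply reads them one after another.
    v00, v01 = table[i][j], table[i][j + 1]
    v10, v11 = table[i + 1][j], table[i + 1][j + 1]
    # Interpolate along y on both rows, then along x between the rows.
    return lerp(lerp(v00, v01, ty), lerp(v10, v11, ty), tx)
```

Each output needs four table entries (eight in three dimensions), which is why a memory that returns them in a single access helps throughput.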
Abstract:
Systems, methods, and apparatuses relating to a user defined formatting instruction to configure multicast Benes network circuitry are described. In one embodiment, a processor includes a decoder to decode a single instruction into a decoded single instruction, the single instruction having fields that identify packed input data, packed control data, and a packed data destination; and an execution unit to execute the decoded single instruction to: send the packed control data to respective control inputs of a circuit that comprises an inverse butterfly circuit coupled in series to a butterfly circuit, wherein the inverse butterfly circuit comprises a first plurality of stages of multicast switches and the butterfly circuit comprises a second plurality of stages of multicast switches, read, once from storage separate from the circuit, each element of the packed input data as respective inputs of the circuit, route the packed input data through the circuit according to the packed control data, and store resultant packed data from the circuit into the packed data destination.
Abstract:
A digital processor is provided having an instruction set with a complex exponential function. The digital processor evaluates a complex exponential function for an input value, x, by obtaining a complex exponential software instruction having the input value, x, as an input; and in response to the complex exponential software instruction: invoking at least one complex exponential functional unit that implements complex exponential software instructions to apply the complex exponential function to the input value, x; and generating an output corresponding to the complex exponential of the input value, x. A complex exponential function for an input value, x, can be evaluated by wrapping the input value to maintain a given range; computing a coarse approximation angle using a look-up table; scaling the coarse approximation angle to obtain an angle from 0 to θ; and computing a fine corrective value using a polynomial approximation.
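The four evaluation steps listed above (wrap, coarse look-up, residual angle, polynomial correction) can be sketched in Python for exp(jx). The table size, the third-order polynomial, and all names here are assumptions chosen for illustration; the abstract does not specify them.

```python
import cmath
import math

LUT_SIZE = 64                      # hypothetical table size
THETA = 2 * math.pi / LUT_SIZE     # angular width of one table step
# Coarse table of exp(j*k*THETA). Hardware would hold precomputed constants;
# cmath.exp is used here only to build the table.
COARSE = [cmath.exp(1j * k * THETA) for k in range(LUT_SIZE)]


def complex_exp(x):
    """Approximate exp(j*x) for a real input x."""
    x = x % (2 * math.pi)                  # wrap the input into one period
    k = min(int(x / THETA), LUT_SIZE - 1)  # coarse index, guarded against rounding
    r = x - k * THETA                      # residual angle, roughly in [0, THETA]
    # Fine correction: third-order Taylor polynomial of exp(j*r).
    fine = complex(1 - r * r / 2, r - r ** 3 / 6)
    return COARSE[k] * fine
```

With these assumed parameters the residual angle stays below about 0.1 rad, so the truncated polynomial contributes an error on the order of 1e-5. A larger table or a higher-order polynomial would trade storage against arithmetic.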
Abstract:
Techniques are disclosed for a programmable processor architecture that enables data interpolation by iteratively processing portions of a look-up table (LUT) in accordance with a fused single instruction stream, multiple data streams (SIMD) instruction. The LUT may contain segment entries that each correspond to a result of evaluating a function at a corresponding index value, where the index values represent an independent variable of the function. The index values are used to map data sample values in a data array that is to be interpolated to the segment entries. By using an iterative process of mapping data samples to valid segment entries contained in each LUT portion, the architecture advantageously facilitates scaling to support larger LUTs and thus may be expanded to enable linear interpolation on multiple dimensions.
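The iterative mapping of samples to LUT portions can be sketched in Python as follows. This is a scalar model under assumptions: the function name, the `portion_size` parameter, and the convention that `lut[k]` holds the function value at integer index `k` are all hypothetical.

```python
def interpolate_in_portions(samples, lut, portion_size):
    """Linearly interpolate each sample against lut, one LUT portion at a time.

    Assumes lut[k] = f(k) for integer k and 0 <= sample <= len(lut) - 1.
    """
    results = [None] * len(samples)
    last = len(lut) - 1
    for start in range(0, last, portion_size):
        end = min(start + portion_size, last)
        # The portion covers segments start .. end - 1; it includes the right
        # endpoint of its final segment, hence end + 1.
        portion = lut[start:end + 1]
        for n, s in enumerate(samples):
            if results[n] is not None:
                continue  # already resolved in an earlier portion
            seg = min(int(s), last - 1)  # segment that contains the sample
            if start <= seg < end:       # only segments valid for this portion
                t = s - seg
                lo, hi = portion[seg - start], portion[seg - start + 1]
                results[n] = lo + (hi - lo) * t
    return results
```

Only `portion_size + 1` entries need to be resident at a time, so the table can grow without enlarging the working buffer, at the cost of one pass over the samples per portion.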
Abstract:
Systems, methods, and apparatuses relating to vector processor architecture having an array of identical circuit blocks are described. In one embodiment, a processor includes a single centralized circuit comprising an instruction decoder and a controller; and a plurality of circuit slices that each comprise an arithmetic logic unit, a multiplier, a register file, a local memory, and a same plurality of logic circuits and a packed data datapath in between, wherein each circuit slice includes a physical port that provides a unique identification value that identifies a circuit slice from the other circuit slices, and the controller is to broadcast a same configuration value to the plurality of circuit slices to cause a first circuit slice to enable a first logic circuit and enable a second logic circuit of the first circuit slice based on its unique identification value and the configuration value, and cause a second circuit slice to enable a same, first logic circuit and disable a same, second logic circuit of the second circuit slice based on its unique identification value and the configuration value.
Abstract:
A vector processor is provided having an instruction set with a vector convolution function. The disclosed vector processor performs a convolution function between an input signal and a filter impulse response by obtaining a vector comprising at least N1+N2-1 input samples; obtaining N2 time shifted versions of the vector (including a zero shifted version), wherein each time shifted version comprises N1 samples; performing a weighted sum of each of the time shifted versions of the vector by a vector of N1 coefficients; and producing an output vector comprising one output value for each of the weighted sums. The vector processor performs the method, for example, in response to one or more vector convolution software instructions having a vector input. The vector can comprise a plurality of real or complex input samples and the filter impulse response can be expressed using a plurality of coefficients that are real or complex.
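The shifted-window formulation can be sketched in Python as below. The function name and argument order are assumptions, and the loop is a sequential stand-in for what a vector instruction would do in parallel. Here `len(h)` plays the role of N1 and `n_out` the role of N2.

```python
def vector_convolve(x, h, n_out):
    """Return n_out convolution outputs of input x with impulse response h.

    Requires len(x) >= len(h) + n_out - 1. Samples and coefficients may be
    real or complex Python numbers.
    """
    n1 = len(h)
    if len(x) < n1 + n_out - 1:
        raise ValueError("need at least N1 + N2 - 1 input samples")
    # N2 time-shifted versions of the input, each N1 samples long
    # (shift 0 is the zero-shifted version).
    shifted = [x[s:s + n1] for s in range(n_out)]
    # Weighted sum of each window. The coefficients are reversed so that h[0]
    # weights the newest sample in the window, as in y[n] = sum_k h[k] x[n-k].
    return [sum(c * v for c, v in zip(reversed(h), win)) for win in shifted]
```

For example, `vector_convolve([1, 2, 3, 4], [1, 2], 3)` gives `[4, 7, 10]`, the fully overlapped part of the full convolution `[1, 4, 7, 10, 8]`.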
Abstract:
Techniques are disclosed for reducing or eliminating loop overhead caused by function calls in processors that form part of a pipeline architecture. The processors in the pipeline process data blocks in an iterative fashion, with each processor in the pipeline completing one of several iterations associated with a processing loop for a commonly-executed function. The described techniques leverage the use of message passing for pipelined processors to enable an upstream processor to signal to a downstream processor when processing has been completed, and thus a data block is ready for further processing in accordance with the next loop processing iteration. The described techniques facilitate a zero loop overhead architecture, enable continuous data block processing, and allow the processing pipeline to function indefinitely within the main body of the processing loop associated with the commonly-executed function where efficiency is greatest.
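A loose software analogy of the message-passing pipeline can be written with Python threads and queues. It models only the signalling pattern, not the hardware: `stage`, `run_pipeline`, and the `None` end-of-stream marker are hypothetical names and conventions introduced for this sketch.

```python
import queue
import threading


def stage(step, inbox, outbox):
    """One pipeline processor: apply one loop iteration to each incoming block."""
    while True:
        block = inbox.get()      # message from upstream: a block is ready
        if block is None:        # end-of-stream marker (assumed convention)
            outbox.put(None)
            return
        outbox.put(step(block))  # tell downstream the block is ready


def run_pipeline(blocks, steps):
    """Push blocks through one thread per step; return results in input order."""
    queues = [queue.Queue() for _ in range(len(steps) + 1)]
    threads = [
        threading.Thread(target=stage, args=(step, queues[i], queues[i + 1]))
        for i, step in enumerate(steps)
    ]
    for t in threads:
        t.start()
    for block in blocks:
        queues[0].put(block)
    queues[0].put(None)
    results = []
    while True:
        out = queues[-1].get()
        if out is None:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results
```

Each stage stays inside its `while` loop for the whole run and blocks only while waiting for the next message, which mirrors the idea of remaining in the main body of the processing loop instead of returning from and re-entering a function for every data block.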