Abstract:
An apparatus and method are provided for transferring a plurality of data structures between memory and a plurality of vector registers, each vector register being arranged to store a vector operand comprising a plurality of data elements. Access circuitry is used to perform access operations to move data elements of vector operands between the data structures in memory and specified vector registers, each data structure comprising multiple data elements stored at contiguous addresses in the memory. Decode circuitry is responsive to a single access instruction identifying a plurality of vector registers and a plurality of data structures that are located discontiguously with respect to each other in the memory, to generate control signals to control the access circuitry to perform a sequence of access operations to move the plurality of data structures between the memory and the plurality of vector registers such that the vector operand in each vector register holds a corresponding data element from each of the plurality of data structures. This provides a very efficient mechanism for performing complex access operations, resulting in an increase in execution speed, and potential reductions in power consumption.
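The effect of the single access instruction on a load can be sketched in software. This is a minimal model, not the claimed circuitry: the flat `memory` list, the function name `load_structures`, and the example addresses are all assumptions made for illustration.

```python
def load_structures(memory, struct_addrs, fields_per_struct):
    """Model a multi-structure load.

    Returns one list per vector register. Register r ends up holding
    field r of every structure, one structure per lane.
    """
    registers = [[] for _ in range(fields_per_struct)]
    for addr in struct_addrs:                   # structures may sit anywhere in memory
        for field in range(fields_per_struct):  # fields are contiguous within a structure
            registers[field].append(memory[addr + field])
    return registers


# Hypothetical flat memory in which each cell holds 100 + its own address.
memory = list(range(100, 132))
# Four 3-element structures at discontiguous, unordered addresses.
struct_addrs = [20, 4, 12, 28]
regs = load_structures(memory, struct_addrs, 3)
```

Here `regs[0]` collects the first element of each structure, `regs[1]` the second, and so on, which is the layout the abstract describes for the vector operands.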
Abstract:
In an embodiment, a method of vectorizing a collapsed multi-nested loop includes executing, in a vector unit of a processor, the collapsed loop to obtain a vector of offsets, including, for each of a plurality of iterations, calculating a scalar offset into a multi-dimensional data structure, storing the scalar offset in a data element of a first vector register, and updating a loop counter value of a multi-dimensional loop counter vector. In turn, a plurality of data elements are loaded from the multi-dimensional data structure using a base value and indexes from the vector of offsets, at least one computation is performed on the loaded plurality of data elements to obtain a plurality of results, and the plurality of results are stored into the multi-dimensional data structure using the base value and the indexes from the vector of offsets. Other embodiments are described and claimed.
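The offset-then-gather/scatter sequence can be illustrated with a scalar simulation. This is a sketch under stated assumptions: a vector length of 4, a row-major layout described by `strides`, a doubling operation as the computation, and hypothetical helper names `collapsed_offsets` and `run_collapsed`.

```python
VLEN = 4  # assumed number of lanes in the vector register


def collapsed_offsets(counter, bounds, strides, vlen):
    """Fill up to vlen lanes with scalar offsets, advancing the loop counter vector."""
    counter = list(counter)
    offsets = []
    done = False
    while len(offsets) < vlen and not done:
        # Scalar offset for the current multi-dimensional iteration.
        offsets.append(sum(c * s for c, s in zip(counter, strides)))
        # Increment the innermost counter and carry into outer dimensions.
        dim = len(counter) - 1
        while dim >= 0:
            counter[dim] += 1
            if counter[dim] < bounds[dim]:
                break
            counter[dim] = 0
            dim -= 1
        else:
            done = True  # every dimension wrapped: the loop nest is finished
    return offsets, counter, done


def run_collapsed(data, base, bounds, strides, vlen=VLEN):
    """Gather, compute and scatter one vector of offsets at a time."""
    counter = [0] * len(bounds)
    done = False
    while not done:
        offsets, counter, done = collapsed_offsets(counter, bounds, strides, vlen)
        loaded = [data[base + off] for off in offsets]  # gather with base + index
        results = [x * 2 for x in loaded]               # example computation
        for off, res in zip(offsets, results):          # scatter to the same places
            data[base + off] = res
    return data


# A 3x5 row-major array; process the 2x3 sub-block that starts at flat index 1.
data = run_collapsed(list(range(15)), base=1, bounds=[2, 3], strides=[5, 1])
```

The first pass fills all four lanes with offsets 0, 1, 2 and 5, crossing a row boundary inside one vector; the second pass fills only two lanes before the counter vector wraps.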
Abstract:
Included is an apparatus comprising a processor configured to identify a code segment in a program, analyze the code segment to determine a memory access pattern, if the memory access pattern is regular, turn on hardware prefetching for the code segment by setting a control register before the code segment, and turn off the hardware prefetching by resetting the control register after the code segment. Also included is a method comprising identifying a code segment in a program, analyzing the code segment to determine a memory access pattern, if the memory access pattern is regular, turning on hardware prefetching for the code segment by setting a control register before the code segment, and turning off the hardware prefetching by resetting the control register after the code segment.
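The analyze-then-bracket idea can be sketched as a small instrumentation pass. Everything here is an assumption made for illustration: "regular" is taken to mean a constant non-zero stride, the code segment is modeled as a list of opaque operation strings, and `SET_PREFETCH_CTRL` / `RESET_PREFETCH_CTRL` are hypothetical stand-ins for writes to the control register.

```python
def is_regular(addresses):
    """Treat an access pattern as regular when it has a constant non-zero stride."""
    if len(addresses) < 3:
        return False
    stride = addresses[1] - addresses[0]
    if stride == 0:
        return False
    return all(b - a == stride for a, b in zip(addresses, addresses[1:]))


def instrument(segment_ops, addresses):
    """Bracket a code segment with prefetch on/off operations if its accesses are regular."""
    if is_regular(addresses):
        return ["SET_PREFETCH_CTRL"] + list(segment_ops) + ["RESET_PREFETCH_CTRL"]
    return list(segment_ops)


strided = instrument(["load", "add", "store"], [0, 64, 128, 192])
irregular = instrument(["load", "add", "store"], [0, 8, 200, 24])
```

Only the strided segment is wrapped; the irregular one is returned unchanged, so hardware prefetching stays off where it would likely fetch unused lines.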
Abstract:
The present invention relates to a microprocessor device comprising a vector processor architecture with a functional vector processor unit comprising first memory means for storing plural index vectors and processing means, the functional vector processor unit being arranged to receive a processing instruction and at least one input vector to be processed, said first memory means being arranged to provide the processing means with one of said plural index vectors in accordance with the processing instruction, and the processing means being arranged to generate in response to said instruction at least one output vector having the elements of the at least one input vector rearranged in accordance with the one index vector provided. The functional vector processor unit further comprises pre-processing means arranged to receive a parameter and to process the elements of the one index vector dependent on said parameter before generating said at least one output vector in accordance with the processed index vector. The invention further relates to a method for processing vectors with such a functional vector processor unit.
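A rough software model of the index-vector rearrangement follows. The abstract does not say how the parameter modifies the index vector, so this sketch assumes one plausible choice: the parameter is added to every index modulo the vector length. The table contents and the names `INDEX_VECTORS` and `permute` are hypothetical.

```python
# Hypothetical stored index vectors, selected by the processing instruction.
INDEX_VECTORS = {
    "reverse": [3, 2, 1, 0],
    "swap_pairs": [1, 0, 3, 2],
}


def permute(instruction, vec, param=0):
    """Rearrange vec using the stored index vector after pre-processing it with param."""
    index = INDEX_VECTORS[instruction]
    n = len(vec)
    # Assumed pre-processing step: offset every index by param, wrapping at n.
    processed = [(i + param) % n for i in index]
    return [vec[i] for i in processed]


plain = permute("reverse", [10, 20, 30, 40])
shifted = permute("reverse", [10, 20, 30, 40], param=1)
```

With this assumption a single stored index vector yields a family of permutations, one per parameter value, without storing each variant separately.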
Abstract:
A reconfigurable processing system executes instructions and configurations in parallel. Initially, a first instruction loads configurations into configuration registers. The configuration field of a subsequently fetched instruction selects a configuration register. The instruction controls and the controls of the configuration in the selected configuration register are decoded and modified as specified by the instruction. The controls provide data operands to the execution units which process the operands and generate results. Scalar data, vector data, or a combination of scalar and vector data can be processed. The processing is controlled by instructions executed in parallel with configurations invoked by configuration fields within the instructions. Vectors are processed using a vector register file which stores vectors. A vector address unit identifies addresses of vector elements in the vector register file to be processed. For each vector, vector address units provide addresses which stride through each element of each vector.
Abstract:
A novel vector processor architecture, and hardware and processing features associated therewith, provide both vector processing and superscalar processing features.
Abstract:
A method of controlling the enabling of processor datapaths in a SIMD processor during a loop processing operation is described. The information used by the method includes an allocation between the data items and a memory (20), a size of the array, and a number of remaining parallel passes of the datapaths in the loop processing operation. A computer instruction (12) is also provided, which includes a loop handling instruction that specifies the enabling of one of a plurality of processor datapaths during processing of an array of data items. The instruction includes a count field that specifies the number of remaining parallel loop passes to process the array and a count field that specifies the number of serial loop passes to process the array. Different instructions can be used to handle different allocations of passes to parallel datapaths. The instruction also uses information about the total number of datapaths (18).
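How the enable decision might follow from those inputs can be sketched for one allocation. This is an illustration only: it assumes an interleaved allocation in which data item k is handled by datapath k mod N during parallel pass k div N, it counts the current pass among the remaining ones, and it does not model the serial pass count. The name `enable_mask` is hypothetical.

```python
def enable_mask(array_size, num_datapaths, remaining_passes):
    """Return one enable flag per datapath for the current parallel pass."""
    # Total parallel passes needed to cover the array (ceiling division).
    total_passes = -(-array_size // num_datapaths)
    current_pass = total_passes - remaining_passes
    first_item = current_pass * num_datapaths
    # A datapath is enabled only if its data item lies inside the array.
    return [first_item + d < array_size for d in range(num_datapaths)]


first = enable_mask(array_size=10, num_datapaths=4, remaining_passes=3)
last = enable_mask(array_size=10, num_datapaths=4, remaining_passes=1)
```

For a 10-item array on 4 datapaths, every datapath is enabled in the first pass, while the final pass enables only the two datapaths that still have items to process.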
Abstract:
A floating point unit having a register bank containing a plurality of registers supports vector operations that execute a specified operation a plurality of times upon a sequence of data values from different registers. The register bank is divided into subsets, with the sequence of registers used in a vector operation wrapping within a subset. The subsets comprise disjoint, contiguous ranges of register numbers. The wrapping within ranges allows compact and efficient code to be provided for performing DSP operations, such as FIR filtering and matrix transformations.
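The wrapping behaviour can be illustrated with a few lines of register arithmetic. The subset size of 8, the `stride` parameter, and the function name `vector_registers` are assumptions chosen for the example; the abstract does not fix them.

```python
def vector_registers(start, length, subset_size=8, stride=1):
    """List the register numbers touched by a vector operation.

    The sequence begins at `start` and wraps around inside the subset
    (a contiguous range of `subset_size` registers) that contains it.
    """
    base = (start // subset_size) * subset_size  # first register of the subset
    return [base + (start - base + k * stride) % subset_size for k in range(length)]


wrapped = vector_registers(14, 4)  # starts near the end of subset 8..15
simple = vector_registers(0, 3)    # fits without wrapping
```

Because the sequence wraps rather than spilling into the next subset, a subset can act as a small circular buffer, which is the kind of access pattern an FIR filter tap window needs.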
Abstract:
A floating point unit is provided with a register bank comprising 32 registers that may be used as either vector registers or scalar registers. A data processing instruction includes at least one register specifying field pointing to a register containing a data value to be used in that operation. An increase in the instruction bit space available to encode more opcodes or to allow for more registers is provided by encoding whether a register is to be treated as a vector or a scalar within the register field itself. Further, the register field for one register of the instruction may encode whether another register is a vector or a scalar. The registers can be initially accessed using the values within the register fields of the instruction independently of the opcode, allowing for easier decode.