Abstract:
Methods, apparatus, systems and articles of manufacture are disclosed to forward and invalidate inflight data in a store queue. An example apparatus includes a cache storage, a cache controller coupled to the cache storage and operable to receive a first memory operation, determine that the first memory operation corresponds to a read miss in the cache storage, determine a victim address in the cache storage to evict in response to the read miss, issue a read-invalidate command that specifies the victim address, compare the victim address to a set of addresses associated with a set of memory operations being processed by the cache controller, and in response to the victim address matching a first address of the set of addresses corresponding to a second memory operation of the set of memory operations, provide data associated with the second memory operation.
Abstract:
In described examples, a processor system includes a processor core that generates memory write requests, and a cache memory with a memory controller having a memory pipeline. The cache memory has cache lines of length L. The cache memory has a minimum write length that is less than a cache line length of the cache memory. The memory pipeline determines whether the data payload includes a first chunk and ECC syndrome that correspond to a partial write and are writable by a first cache write operation, and a second chunk and ECC syndrome that correspond to a full write operation that can be performed separately from the first cache write operation. The memory pipeline performs an RMW operation to store the first chunk and ECC syndrome in the cache memory, and performs the full write operation to store the second chunk and ECC syndrome in the cache memory.
Abstract:
A method is provided that includes receiving, in a permute network, a plurality of data elements for a vector instruction from a streaming engine, and mapping, by the permute network, the plurality of data elements to vector locations for execution of the vector instruction by a vector functional unit in a vector data path of a processor.
Abstract:
A processor is provided that includes a first multiplication unit in a first data path of the processor, the first multiplication unit configured to perform single issue multiply instructions, and a second multiplication unit in the first data path, the second multiplication unit configured to perform single issue multiply instructions, wherein the first multiplication unit and the second multiplication unit are configured to execute respective single issue multiply instructions in parallel.
Abstract:
In a very long instruction word (VLIW) central processing unit instructions are grouped into execute packets that execute in parallel. A constant may be specified or extended by bits in a constant extension instruction in the same execute packet. If an instruction includes an indication of constant extension, the decoder employs bits of a constant extension instruction to extend the constant of an immediate field. Two or more constant extension slots are permitted in each execute packet, each extending constants for a different predetermined subset of functional unit instructions. In an alternative embodiment, more than one functional unit may have constants extended from the same constant extension instruction employing the same extended bits. A long extended constant may be formed using the extension bits of two constant extension instructions.
Abstract:
Software instructions are executed on a processor within a computer system to configure a steaming engine with stream parameters to define a multidimensional array. The stream parameters define a size for each dimension of the multidimensional array and a specified width for two selected dimensions of the array. Data is fetched from a memory coupled to the streaming engine responsive to the stream parameters. A stream of vectors is formed for the multidimensional array responsive to the stream parameters from the data fetched from memory. When either selected dimension in the stream of vectors exceeds a respective specified width, the streaming engine inserts null elements into each portion of a respective vector for the selected dimension that exceeds the specified width in the stream of vectors. Stream vectors that are completely null are formed by the streaming engine without accessing the system memory for respective data.
Abstract:
Software instructions are executed on a processor within a computer system to configure a steaming engine with stream parameters to define a multidimensional array. The stream parameters define a size for each dimension of the multidimensional array and a specified width for a selected dimension of the array. Data is fetched from a memory coupled to the streaming engine responsive to the stream parameters. A stream of vectors is formed for the multidimensional array responsive to the stream parameters from the data fetched from memory. When the selected dimension in the stream of vectors exceeds the specified width, the streaming engine inserts null elements into each portion of a respective vector for the selected dimension that exceeds the specified width in the stream of vectors. Stream vectors that are completely null are formed by the streaming engine without accessing the system memory for respective data.
Abstract:
A stream of data is accessed from a memory system using a stream of addresses generated in a first mode of operating a streaming engine in response to executing a first stream instruction. A block cache management operation is performed on a cache in the memory using a block of addresses generated in a second mode of operating the streaming engine in response to executing a second stream instruction.
Abstract:
The vector data path is divided into smaller vector lanes. A register such as a memory mapped control register stores a vector lane number (VLX) indicating the number of vector lanes to be powered. A decoder converts this VLX into a vector lane control word, each bit controlling the ON of OFF state of the corresponding vector lane. This number of contiguous least significant vector lanes are powered. In the preferred embodiment the stored data VLX indicates that 2VLX contiguous least significant vector lanes are to be powered. Thus the number of vector lanes powered is limited to an integral power of 2. This manner of coding produces a very compact controlling bit field while obtaining substantially all the power saving advantage of individually controlling the power of all vector lanes.
Abstract:
The invention allows a processor to maintain a fixed instruction width regardless of the width of the constants needed. The constant extension solves the problem of having variable length opcodes to accommodate longer constants. The invention allows the architecture to have a fixed width, regardless of the width of the constants specified, which simplify instruction decoding. Constant widths can be variable and extend beyond the fixed processor instruction width.