摘要:
An apparatus includes a fuse array and a stores. The fuse array is programmed with compressed configuration data for a plurality of cores. The stores is coupled to the plurality of cores, and includes a plurality of sub-stores that each correspond to each of the plurality of cores, where one of the plurality of cores accesses the semiconductor fuse array upon power-up/reset to read and decompress the compressed configuration data, and to store a plurality of decompressed configuration data sets for one or more cache memories within the each of the plurality of cores in the plurality of sub-stores. Each of the plurality of cores has sleep logic. The sleep logic is configured to subsequently access a corresponding one of the each of the plurality of sub-stores to retrieve and employ the decompressed configuration data sets to initialize the one or more caches following a power gating event.
摘要:
A hardware data compressor includes a first hardware engine that scans an input block of characters to produce a stream of tokens, the stream of tokens comprising replacement back pointers to matched strings of characters of the input block and non-replaced characters of the input block. The hardware data compressor also includes a second hardware engine that receives the stream of tokens and maintains a sorted list of symbols associated with the tokens. The hardware data compressor also includes the second hardware engine concurrently maintains the sorted list of symbols by frequency of occurrence as the first hardware engine produces the tokens of the stream.
摘要:
An apparatus includes a semiconductor fuse array, a cache memory, and a plurality of cores. The semiconductor fuse array is disposed on a die, into which is programmed the configuration data. The semiconductor fuse array has a first plurality of semiconductor fuses that is configured to store compressed cache correction data. The cache memory is disposed on the die. The plurality of cores is disposed on the die, where each of the plurality of cores is coupled to the semiconductor fuse array and the cache memory, and is configured to access the semiconductor fuse array upon power-up/reset, to decompress the compressed cache correction data, and to distribute decompressed cached correction data to initialize the cache memory.
摘要:
An apparatus including first and second reservation stations. The first reservation station dispatches a load micro instruction, and indicates on a hold bus if the load micro instruction is a specified load micro instruction directed to retrieve an operand from a prescribed resource other than on-core cache memory. The second reservation station is coupled to the hold bus, and dispatches one or more younger micro instructions therein that depend on the load micro instruction for execution after a number of clock cycles following dispatch of the first load micro instruction, and if it is indicated on the hold bus that the load micro instruction is the specified load micro instruction, the second reservation station is configured to stall dispatch of the one or more younger micro instructions until the load micro instruction has retrieved the operand. The plurality of non-core resources includes a fuse array, configured to store the plurality of specified load instructions corresponding to the out-of-order processor which, upon initialization, accesses the fuse array to determine the plurality of specified load instructions.
摘要:
An apparatus is contemplated for storing and providing configuration data to an integrated circuit device, the apparatus has a fuse array and a plurality of cores. The fuse array is disposed on a die. The fuse array has a first plurality of semiconductor fuses and a second plurality of semiconductor fuses. The plurality of cores is disposed on the die, where each of the plurality of cores is coupled to the fuse array. The each of the plurality of cores includes array control, configured to access the first and second pluralities of fuses, and configured to process first states of the first plurality of semiconductor fuses and second states of the second plurality of semiconductor fuses according to contents of a configuration data register.
摘要:
First/second memories hold rows of N weight/data words. Each of N processing units (PU) of index J have a register, an accumulator having an output, an arithmetic unit that performs an operation thereon to accumulate a result, the first input receives the output of the accumulator, the second input receives a respective first memory weight word, the third input receives a respective data word output by the register, and multiplexing logic receives a respective second memory data word and a data word output by the register of PU J−1 and outputs a selected data word to the register. PU J−1 for PU 0 is PU N−1. The multiplexing logic of PU 0 also receives the data word output by the register of PU (N/2)−1. The multiplexing logic of PU N/2 also receives the data word output by the register of PU N−1.
摘要:
First/second memories hold rows of N weight/data words. Each of N processing units (PU) of index J have a register, an accumulator having an output, an arithmetic unit that performs an operation thereon to accumulate a result, the first input receives the output of the accumulator, the second input receives a respective first memory weight word, the third input receives a respective data word output by the register, and multiplexing logic receives a respective second memory data word and a data word output by the register of PU J−1 and outputs a selected data word to the register. PU J−1 for PU 0 is PU N−1. The multiplexing logic of PU N/4 also receives the data word output by the register of PU (3N/4)−1. The multiplexing logic of PU 3N/4 also receives the data word output by the register of PU (N/4)−1.
摘要:
A processor with an expandable instruction set architecture for dynamically configuring execution resources. The processor includes a programmable execution unit (PEU) that may be programmed to perform a user-defined function in response to a user-defined instruction (UDI). The PEU includes programmable logic elements and programmable interconnectors that are collectively programmed to perform at least one processing operation. A UDI loader is responsive to a UDI load instruction that specifies a UDI and a location of programming information that is used to program the PEU. The PEU may be programmed for one or more UDIs for one or more processes. An instruction table stores each UDI and corresponding information to identify the UDI and possibly to reprogram the PEU if necessary. A UDI handler consults the instruction table to identify a received UDI and to send corresponding information to the PEU to execute the corresponding user-defined function.
摘要:
A neural network unit including a register programmable with a representation of a reciprocal value of a divisor and a plurality of neural processing units (NPU). Each NPU has an ALU, an accumulator, and a reciprocal multiplier unit. The ALU performs arithmetic and logical operations on a sequence of operands to generate a sequence of results and accumulates the sequence of results as an accumulated value into the accumulator. The reciprocal multiplier unit receives the representation of the reciprocal value and the accumulated value and in response generates a result that is the quotient of the accumulated value and the divisor.
摘要:
A neural network unit (NNU) includes N neural processing units (NPU). Each NPU has an arithmetic unit and an accumulator. First and second multiplexed registers of the N NPUs collectively selectively operate as respective first and second N-word rotaters. First and second memories respectively hold rows of N weight/data words and provide the N weight/data words of a row to corresponding ones of the N NPUs. The NPUs selectively perform: multiply-accumulate operations on rows of N weight words and on a row of N data words, using the second N-word rotater; convolution operations on rows of N weight words, using the first N-word rotater, and on rows of N data words, the rows of weight words being a data matrix, and the rows of data words being elements of a convolution kernel; and pooling operations on rows of N weight words, using the first N-word rotater.