Abstract:
Systems and methods for performing convolution operations. An example processing system comprises: a processing core; and a convolver unit to apply a convolution filter to a plurality of input data elements represented by a two-dimensional array, the convolver unit comprising a plurality of multipliers coupled to two or more sets of latches, wherein each set of latches is to store a plurality of data elements of a respective one-dimensional section of the two-dimensional array.
Abstract:
A processor includes a front end, a decoder, an allocator, and a retirement unit. The decoder includes logic to identify an end-of-live-range (EOLR) indicator. The EOLR indicator specifies an architectural register and a location in code for which the architectural register is unused. The allocator includes logic to scan for a mapping of the architectural register to a physical register, based upon the EOLR indicator. The allocator also includes logic to generate a request to disassociate the architectural register from the physical register. The retirement unit includes logic to disassociate the architectural register from the physical register.
Abstract:
Systems and methods for performing convolution operations. An example processing system comprises: a processing core; and a convolver unit to apply a convolution filter to a plurality of input data elements represented by a two-dimensional array, the convolver unit comprising a plurality of multipliers coupled to two or more sets of latches, wherein each set of latches is to store a plurality of data elements of a respective one-dimensional section of the two-dimensional array.
Abstract:
An apparatus and method are described for distributed and cooperative computation in artificial neural networks. For example, one embodiment of an apparatus comprises: an input/output (I/O) interface; a plurality of processing units communicatively coupled to the I/O interface to receive data for input neurons and synaptic weights associated with each of the input neurons, each of the plurality of processing units to process at least a portion of the data for the input neurons and synaptic weights to generate partial results; and an interconnect communicatively coupling the plurality of processing units, each of the processing units to share the partial results with one or more other processing units over the interconnect, the other processing units using the partial results to generate additional partial results or final results. The processing units may share data including input neurons and weights over the shared input bus.
Abstract:
An apparatus and method are described for distributed and cooperative computation in artificial neural networks. For example, one embodiment of an apparatus comprises: an input/output (I/O) interface; a plurality of processing units communicatively coupled to the I/O interface to receive data for input neurons and synaptic weights associated with each of the input neurons, each of the plurality of processing units to process at least a portion of the data for the input neurons and synaptic weights to generate partial results; and an interconnect communicatively coupling the plurality of processing units, each of the processing units to share the partial results with one or more other processing units over the interconnect, the other processing units using the partial results to generate additional partial results or final results. The processing units may share data including input neurons and weights over the shared input bus.
Abstract:
A storage device and method are described for performing convolution operations. For example, one embodiment of an apparatus to perform convolution operations comprises a plurality of processing units to execute convolution operations on input data and partial results; a unified scratchpad memory comprising a plurality of memory banks communicatively coupled to the plurality of processing units through a plurality of read/write ports, each of the plurality of memory banks partitioned to store both the input data and partial results; a control unit to allocate the input data and partial results to the memory banks to ensure a minimum quality of service in accordance with the specified number of read/write ports and the specified convolution operation to be performed.
Abstract:
An error handling method includes identifying a code region eligible for cumulative multiply add (CMA) optimization and translating code region instructions into interpreter code instructions, which may include translating sequences of multiply add instructions in the code region instructions into fusion code including CMA instructions. Floating point (FP) exceptions generated by the fusion code may be monitored and at least a portion of the code region instructions may be re-translated to eliminate some or all fusion code if CMA intermediate rounding exceptions exceed a threshold.
Abstract:
A combination of hardware and software collect profile data for asynchronous events, at code region granularity. An exemplary embodiment is directed to collecting metrics for prefetching events, which are asynchronous in nature. Instructions that belong to a code region are identified using one of several alternative techniques, causing a profile bit to be set for the instruction, as a marker. Each line of a data block that is prefetched is similarly marked. Events corresponding to the profile data being collected and resulting from instructions within the code region are then identified. Each time that one of the different types of events is identified, a corresponding counter is incremented. Following execution of the instructions within the code region, the profile data accumulated in the counters are collected, and the counters are reset for use with a new code region.
Abstract:
Methods, apparatus, instructions and logic are disclosed providing double rounded combined floating-point multiply and add functionality as scalar or vector SIMD instructions or as fused micro-operations. Embodiments include detecting floating-point (FP) multiplication operations and subsequent FP operations specifying as source operands results of the FP multiplications. The FP multiplications and the subsequent FP operations are encoded as combined FP operations including rounding of the results of FP multiplication followed by the subsequent FP operations. The encoding of said combined FP operations may be stored and executed as part of an executable thread portion using fused-multiply-add hardware that includes overflow detection for the product of FP multipliers, first and second FP adders to add third operand addend mantissas and the products of the FP multipliers with different rounding inputs based on overflow, or no overflow, in the products of the FP multiplier. Final results are selected respectively using overflow detection.
Abstract:
Systems and methods for performing convolution operations. An example processing system comprises: a processing core; and a convolver unit to apply a convolution filter to a plurality of input data elements represented by a two-dimensional array, the convolver unit comprising a plurality of multipliers coupled to two or more sets of latches, wherein each set of latches is to store a plurality of data elements of a respective one-dimensional section of the two-dimensional array.