摘要:
Processors, systems and methods are provided for thread level parallel processing. A processor may comprise a plurality of processing elements (PEs) that each may comprise a configuration buffer, a sequencer coupled to the configuration buffer of each of the plurality of PEs and configured to distribute one or more PE configurations to the plurality of PEs, and a gasket memory coupled to the plurality of PEs and being configured to store at least one PE execution result to be used by at least one of the plurality of PEs during a next PE configuration.
摘要:
Processors, systems and methods are provided for thread level parallel processing. A processor may include a plurality of columns of vector processing units arranged in a two-dimensional column array with a plurality of column stacks placed side-by-side in a first direction and each column stack having two columns stacked in a second direction and a temporary storage buffer. Each column may include a processing element (PE) that has a vector Arithmetic Logic Unit (ALU) to perform arithmetic operations in parallel threads. At a first end of the column array in the first direction, two columns in the column stack are coupled to the temporary storage buffer for one-way data flow. At a second end of the column array in the first direction, two columns are coupled to each other for one-way data flow. The column array and the temporary storage buffer may form a one-way circular data path.
摘要:
A clock-less asynchronous processing circuit or system having a plurality of pipelined processing stages utilizes self-clocked generators to tune the delay needed in each of the processing stages to complete the processing cycle. Because different processing stages may require different amounts of time to complete processing or may require different delays depending on the processing required in a particular stage, the self-clocked generators may be tuned to each stage's necessary delay(s) or may be programmably configured.
摘要:
A reconfigurable vector processor is described that allows the size of its vector units to be changed in order to process vectors of different sizes. The reconfigurable vector processor comprises a plurality of processor units. Each of the processor units comprises a control unit for decoding instructions and generating control signals, a scalar unit for processing instructions on scalar data, and a vector unit for processing instructions on vector data under control of control signals. The reconfigurable vector processor architecture also comprises a vector control selector for selectively providing control signals generated by one processor unit of the plurality of processor units to the vector unit of a different processor unit of the plurality of processor units.
摘要:
An asynchronous instruction execution apparatus and method are provided. The asynchronous instruction execution apparatus includes a vector execution unit control VXUC module and n vector execution unit data VXUD modules, where n is a positive integer. The VXUC module is configured to perform instruction decoding and token management. The n VXUD modules are cascaded, separately connected to the VXUC module, and configured to invoke an external calculation resource to perform data calculation. A bit width of data processed by the asynchronous instruction execution apparatus is M, a bit width of each VXUD module is N, and n=M/N. The asynchronous instruction execution apparatus is divided into two parts: the VXUC and the VXUD.
摘要:
Embodiments include computing devices, apparatus, and methods implemented by the apparatus for accelerating machine learning on a computing device. Raw data may be received in the computing device from a raw data source device. The apparatus may identify key features as two dimensional matrices of the raw data such that the key features are mutually exclusive from each other. The key features may be translated into key feature vectors. The computing device may generate a feature vector from at least one of the key feature vectors. The computing device may receive a first partial output resulting from an execution of a basic linear algebra subprogram (BLAS) operation using the feature vector and a weight factor. The first partial output may be combined with a plurality of partial outputs to produce an output matrix. Receiving the raw data on the computing device may include receiving streaming raw data.
摘要:
A clock-less asynchronous processing circuit or system utilizes a self-clocked generator to adjust the processing delay (latency) needed/allowed to the processing cycle in the circuit/system. The timing of the self-clocked generator is dynamically adjustable depending on various parameters. These parameters may include processing instruction, opcode information, type of processing to be performed by the circuit/system, or overall desired processing performance. The latency may also be adjusted to change processing performance, including power consumption, speed etc.
摘要:
A clock-less asynchronous processing circuit or system is configured to operation in a plurality of modes. In an initialization mode (e.g., reset, initialization, boot up), a self-clocked generator associated with the asynchronous circuit is configured to generate an active complete signal (to latch output processed data) within a first period of time after receiving a trigger signal. In a normal mode, the self-clocked generator is configured to generate the active complete signal within a second period of time after receiving the trigger signal. In one embodiment, during the initialization mode, the asynchronous circuit latches the output slower than when in the normal mode.
摘要:
The present application discloses a computing device that can provide a low-power, highly capable computing platform for computational imaging. The computing device can include one or more processing units, for example one or more vector processors and one or more hardware accelerators, an intelligent memory fabric, a peripheral device, and a power management module. The computing device can communicate with external devices, such as one or more image sensors, an accelerometer, a gyroscope, or any other suitable sensor devices.
摘要:
Embodiments of systems, apparatuses, and methods for performing in a computer processor generation of a predicate mask based on vector comparison in response to a single instruction are described.