Abstract:
A device including processors configured to execute instructions and memories storing the instructions, which, when executed by the processors, configure the processors to perform an operation for training a transformer model having a plurality of encoders and a plurality of decoders by configuring the processors to divide batches of training data into a plurality of micro-batches, select layer pairs for the plurality of micro-batches, assemble a processing order of the layer pairs, determine resource information to be allocated to the layer pairs, and allocate resources to the layer pairs based on the determined resource information, dependent on the processing order of the layer pairs.
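The scheduling flow this abstract describes can be illustrated with a short sketch. The following Python is a minimal sketch under stated assumptions: the pairing of encoder layer i with decoder layer i, the sort key used for the processing order, and the placeholder per-pair resource estimates (`make_layer_pairs`, `schedule`, the `cores`/`memory_mb` fields) are all hypothetical, not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class LayerPair:
    encoder_layer: int
    decoder_layer: int
    micro_batch: int

def split_into_micro_batches(batch, num_micro_batches):
    """Divide one batch of training data into equal micro-batches."""
    size = len(batch) // num_micro_batches
    return [batch[i * size:(i + 1) * size] for i in range(num_micro_batches)]

def make_layer_pairs(num_layers, num_micro_batches):
    """Select layer pairs: here, encoder layer i is paired with decoder
    layer i for every micro-batch (an assumed pairing rule)."""
    return [LayerPair(i, i, m)
            for m in range(num_micro_batches)
            for i in range(num_layers)]

def schedule(pairs):
    """Assemble a processing order, then attach resource information to
    each pair; the fixed per-pair estimate is a placeholder."""
    ordered = sorted(pairs, key=lambda p: (p.micro_batch, p.encoder_layer))
    return [(p, {"cores": 1, "memory_mb": 64}) for p in ordered]

micro_batches = split_into_micro_batches(list(range(32)), num_micro_batches=4)
plan = schedule(make_layer_pairs(num_layers=6, num_micro_batches=4))
```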
Abstract:
A device and method with batch normalization are provided. An accelerator includes: core modules, each core module including a respective plurality of cores configured to perform a first convolution operation using feature map data and a weight; local reduction operation modules adjacent to the respective core modules, each including a respective plurality of local reduction operators configured to perform a first local operation that obtains first local statistical values of the corresponding core module; a global reduction operation module configured to perform a first global operation that generates first global statistical values based on the first local statistical values of the core modules; and a normalization operation module configured to perform a first normalization operation on the feature map data based on the first global statistical values.
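The two-level reduction can be shown numerically. Below is a minimal NumPy sketch, assuming the flattened feature map is split evenly across core modules and that one set of statistics covers the whole tensor; the patent does not specify the partitioning, so `local_stats` and the chunking are illustrative.

```python
import numpy as np

def local_stats(chunk):
    """First local operation: one core module's sum, sum of squares, count."""
    return chunk.sum(), (chunk ** 2).sum(), chunk.size

def batch_norm(feature_map, num_core_modules, eps=1e-5):
    # Assumed partitioning: split the flattened feature map evenly across
    # core modules.
    chunks = np.array_split(feature_map.ravel(), num_core_modules)
    locals_ = [local_stats(c) for c in chunks]     # local reduction operators
    # First global operation: combine the local statistics of all core modules.
    n = sum(l[2] for l in locals_)
    mean = sum(l[0] for l in locals_) / n
    var = sum(l[1] for l in locals_) / n - mean ** 2
    # First normalization operation on the feature map data.
    return (feature_map - mean) / np.sqrt(var + eps)

x = np.random.randn(4, 8, 8).astype(np.float32)
y = batch_norm(x, num_core_modules=4)
```

The point of the split is that each local operator only touches its own module's data; only the small per-module statistics cross the chip to the global reduction module.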
Abstract:
An accelerator, an operation method of the accelerator, and an accelerator apparatus including the accelerator are disclosed. The operation method includes receiving one or more workloads assigned by a main processor, performing at least one operation associated with the workloads in an internal memory of the accelerator or in a direct memory access (DMA) unit configured to control data input to or output from the internal memory, and providing a result of performing the at least one operation.
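A behavioral sketch of the idea follows: a simple operation is applied while data moves through the DMA stage into internal memory, instead of as a separate compute pass. The `Workload` structure, the `dma_in` method, and the choice of ReLU as the in-path operation are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    data: list
    op: str  # an element-wise operation simple enough to run in the data path

class Accelerator:
    def __init__(self):
        self.internal_memory = []

    def dma_in(self, workload):
        """Apply the operation while data streams through the DMA stage,
        rather than storing raw data and launching a separate compute pass."""
        if workload.op == "relu":
            self.internal_memory = [max(0.0, v) for v in workload.data]
        else:
            self.internal_memory = list(workload.data)
        return self.internal_memory  # result of performing the operation

acc = Accelerator()
result = acc.dma_in(Workload(data=[-1.0, 2.0, -3.0], op="relu"))  # [0.0, 2.0, 0.0]
```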
Abstract:
A device and method with transformer model implementation are provided. The electronic device includes a processor configured to perform an inference by implementing a transformer model including a plurality of encoders and a plurality of decoders, and a memory configured to store instructions to be executed by the processor. Each of the encoders and the decoders includes an attention block that determines an attention value. The processor is configured to perform a first sub-softmax tile-wise operation in the attention block, perform a reduction operation to determine an adjustment factor based on a resulting value of the first sub-softmax tile-wise operation, and perform a second sub-softmax tile-wise operation based on a resulting value of the reduction operation.
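This two-pass, tile-wise decomposition follows the familiar "online softmax" pattern, and a minimal NumPy sketch makes the adjustment factor concrete. The tile size and the exact form of the factor (exp of the gap between the tile maximum and the global maximum) are assumptions; the patent only names the three stages.

```python
import numpy as np

def tiled_softmax(scores, tile=4):
    tiles = np.array_split(scores, range(tile, scores.size, tile))
    # First sub-softmax operation: per-tile maximum and exponent sum.
    local = [(t.max(), np.exp(t - t.max()).sum()) for t in tiles]
    # Reduction operation: global maximum, then the total sum with each
    # tile's contribution rescaled by its adjustment factor exp(m - g_max).
    g_max = max(m for m, _ in local)
    total = sum(s * np.exp(m - g_max) for m, s in local)
    # Second sub-softmax operation: normalize each tile with the reduced values.
    return np.concatenate([np.exp(t - g_max) / total for t in tiles])

x = np.random.randn(10).astype(np.float32)
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(tiled_softmax(x), ref, atol=1e-6)
```

The benefit is that the first pass never needs the whole score row at once, so the attention block can work tile by tile and only exchange two scalars per tile.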
Abstract:
A memory device includes a first memory cell, a second memory cell, a precharge circuit, a sense amplifier, a switch circuit, and a controller. The first memory cell is connected to a first bit line, the second memory cell is connected to a second bit line, and the precharge circuit is connected between the first bit line and the second bit line. The sense amplifier includes a first input terminal and a second input terminal. The switch circuit is connected to the first bit line and the first input terminal and to the second bit line and the second input terminal and is configured to control a connection between the first bit line and the first input terminal and a connection between the second bit line and the second input terminal in response to a switch signal. The controller is configured to generate the switch signal in response to a command.
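A software model can show how the pieces interact, though this is purely behavioral: the voltage values, the "READ" command name, and the one-bit sense result are illustrative assumptions, not circuit details from the patent.

```python
class MemoryDevice:
    V_PRECHARGE = 0.5  # illustrative precharge voltage

    def __init__(self):
        self.bl1 = self.bl2 = self.V_PRECHARGE  # first / second bit line
        self.switch_on = False

    def precharge(self):
        """Precharge circuit: equalize the two bit lines."""
        self.bl1 = self.bl2 = self.V_PRECHARGE

    def command(self, cmd):
        """Controller: generate the switch signal in response to a command."""
        self.switch_on = (cmd == "READ")

    def sense(self):
        """Sense amplifier: compare its input terminals, which the switch
        circuit connects to the bit lines only while the switch signal is on."""
        if not self.switch_on:
            return None
        return 1 if self.bl1 > self.bl2 else 0

dev = MemoryDevice()
dev.precharge()
dev.bl1 += 0.05        # the first memory cell perturbs its bit line
dev.command("READ")
bit = dev.sense()      # -> 1
```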
Abstract:
An accelerator includes: a memory configured to store input data; a plurality of shift buffers each configured to shift input data received sequentially from the memory in each cycle, and in response to input data being stored in each of the internal elements of the shift buffer, output the stored input data to a processing element (PE) array; a plurality of backup buffers each configured to store input data received sequentially from the memory and transfer the stored input data to one of the shift buffers; and the PE array configured to perform an operation on input data received from one or more of the shift buffers and on a corresponding kernel.
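The data path reduces to a sliding window once modeled in software. The sketch below assumes a shift-buffer depth equal to the kernel length and a dot product as the PE operation; the `run` function and the single-buffer simplification are hypothetical.

```python
from collections import deque

def run(inputs, kernel):
    backup = deque(inputs)            # backup buffer staging data from memory
    sbuf = deque(maxlen=len(kernel))  # shift buffer; depth matches the kernel
    outputs = []
    while backup:
        sbuf.append(backup.popleft()) # transfer one value per cycle
        if len(sbuf) == sbuf.maxlen:  # every internal element holds data:
            # PE array: operate on the buffered window and the kernel.
            outputs.append(sum(a * k for a, k in zip(sbuf, kernel)))
    return outputs

print(run([1, 2, 3, 4, 5], kernel=[1, 0, -1]))  # [-2, -2, -2]
```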
Abstract:
A neural processor is provided. The neural processor includes a matrix device which is configured to generate an output feature map by processing a standard convolution operation and which has a systolic array architecture, and accelerators with an adder-tree structure which are configured to process depth-wise convolution operations for each of the elements of the output feature map corresponding to lanes of the matrix device.
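The division of labor can be sketched numerically: the matrix device handles the channel-mixing standard convolution (here reduced to a 1x1 convolution, i.e. a matmul), and each lane's depth-wise convolution is a per-channel window sum. Shapes, the 1x1 simplification, and the kernel sizes are illustrative assumptions.

```python
import numpy as np

def standard_conv_1x1(x, w):
    """Matrix device: a 1x1 standard convolution expressed as a matrix
    product over channels, as a systolic array would compute it."""
    return np.einsum('oc,chw->ohw', w, x)

def depthwise_conv(x, k):
    """Adder-tree accelerators: each channel (one lane of the matrix device)
    is convolved with its own kernel; the window sum stands in for the tree."""
    c, h, w = x.shape
    kh, kw = k.shape[1:]
    out = np.zeros((c, h - kh + 1, w - kw + 1))
    for ch in range(c):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[ch, i, j] = np.sum(x[ch, i:i + kh, j:j + kw] * k[ch])
    return out

x = np.random.randn(8, 6, 6)
fmap = standard_conv_1x1(x, np.random.randn(8, 8))  # output feature map
y = depthwise_conv(fmap, np.random.randn(8, 3, 3))
```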
Abstract:
A memory module includes a memory device, a command/address buffering device, and a processing data buffer. The memory device includes a memory cell array, a first set of input/output terminals, each terminal configured to receive first command/address bits, and a second set of input/output terminals, each terminal configured to receive both data bits and second command/address bits. The command/address buffering device is configured to output the first command/address bits to the first set of input/output terminals. The processing data buffer is configured to output the data bits and second command/address bits to the second set of input/output terminals. The memory device is configured such that the first command/address bits, second command/address bits, and data bits are all used to access the memory cell array.
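A behavioral sketch of the dual-path addressing follows: the first command/address bits arrive on dedicated terminals, while the shared terminals time-multiplex data bits with second command/address bits. The field widths and the row/column split of the two address paths are assumptions for illustration only.

```python
class MemoryDeviceModel:
    def __init__(self, rows, cols):
        self.cells = [[0] * cols for _ in range(rows)]  # memory cell array

    def access(self, first_ca, dq_word):
        """first_ca arrives on the dedicated terminals and selects the row;
        dq_word arrives on the shared terminals and carries both the second
        command/address bits (column) and the data bits."""
        col = dq_word >> 8      # upper bits: second command/address bits
        data = dq_word & 0xFF   # lower bits: data bits
        self.cells[first_ca][col] = data
        return self.cells[first_ca][col]

dev = MemoryDeviceModel(rows=16, cols=16)
dev.access(first_ca=3, dq_word=(5 << 8) | 0xAB)  # write 0xAB to row 3, col 5
```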