Abstract:
A multi-thread processor computes a function requiring only modular additions and multiplications. Memories store constants, multi-bit elements, and multiple instruction sets. A multiplier receives first and second multiplier operands and generates their product, which is fed to an adder as a first operand and added to a second adder operand, the sum being stored in an accumulator memory. Each instruction set is executed on a successive clock, and includes instructions for defining respective addresses in the memories from which constants, elements and sums are to be accessed. A scheduler maintains a schedule of threads executable by the processor in parallel, and is configured on each successive clock to cycle through the threads and initiate a first available thread. Selectors responsive to instructions received from the program memory select the required multiplier and adder operands. A multi-core system executes multiple parallel threads on multiple processors, allowing complex functions to be computed efficiently.
Abstract:
A multiplication and accumulation (MAC) operator includes a residue number generating circuit configured to generate a plurality of weight residue number data for weight data and a plurality of vector residue number data for the vector data by using a plurality of divisors, a multiplication circuit configured to generate a plurality of residue number multiplication data by performing a multiplication operation on the weight residue number data and the vector residue number data, an addition circuit configured to generate residue number multiplication addition data by performing an addition operation on the multiplication data, an accumulating circuit configured to generate residue number accumulation data by performing an accumulation operation on the residue number multiplication addition data and latch data, and a mixed radix conversion circuit configured to generate the MAC result data by using the divisors and the residue number accumulation data that is transmitted by the accumulating circuit.
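The data flow described above can be illustrated with a minimal software sketch. It is not the claimed circuit: the divisor set `DIVISORS`, the function names, and the use of Garner's algorithm for the mixed radix conversion are all assumptions made for illustration, and the multiplication, addition, and accumulation stages are folded into a single loop. The result is only correct while the true MAC value stays below the product of the divisors.

```python
# Hypothetical divisor set: pairwise coprime, product 7 * 11 * 13 = 1001.
DIVISORS = (7, 11, 13)

def to_residues(x, divisors=DIVISORS):
    """Residue number generation: one residue per divisor."""
    return [x % m for m in divisors]

def rns_mac(weights, vector, divisors=DIVISORS):
    """Multiply and accumulate independently in each residue channel."""
    acc = [0] * len(divisors)  # plays the role of the latch data
    for w, v in zip(weights, vector):
        wr = to_residues(w, divisors)
        vr = to_residues(v, divisors)
        acc = [(a + x * y) % m for a, x, y, m in zip(acc, wr, vr, divisors)]
    return acc

def mixed_radix_convert(residues, divisors=DIVISORS):
    """Recover the integer from its residues via mixed radix digits (Garner)."""
    digits = []
    for i, (r, m) in enumerate(zip(residues, divisors)):
        t = r
        for j in range(i):
            # Remove earlier digits, then divide by the earlier divisor mod m.
            t = (t - digits[j]) * pow(divisors[j], -1, m) % m
        digits.append(t)
    value, radix = 0, 1
    for d, m in zip(digits, divisors):
        value += d * radix
        radix *= m
    return value

weights, vector = [3, 5, 2], [4, 6, 7]
result = mixed_radix_convert(rns_mac(weights, vector))
```

Here `result` should equal the ordinary dot product of `weights` and `vector`, because that value is smaller than the product of the divisors. The three-argument `pow` with a negative exponent requires Python 3.8 or later.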
Abstract:
Methods and apparatus for optimization techniques for modular multiplication algorithms. The optimization techniques may be applied to variants of modular multiplication algorithms, including variants of Montgomery multiplication algorithms and Barrett multiplication algorithms. The optimization techniques reduce the number of serial steps in Montgomery reduction and Barrett reduction. Modular multiplication operations involving products of integer inputs A and B may be performed in parallel to obtain a value C that is reduced to a residual RES. Modular multiplication and modular reduction operations may be performed in parallel. The number of serial steps in the modular reductions is reduced to L, where w is the digit size in bits and $L=\lceil k/w\rceil$ is the number of digits of the operands.
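For orientation, the following is a minimal sketch of the textbook digit-serial Montgomery reduction, which processes one w-bit digit per serial step and therefore takes L steps for L-digit operands. It is a baseline illustration only, not the optimized or parallel variant the abstract refers to; the function names and the example modulus are assumptions.

```python
def montgomery_reduce(c, n, w, L):
    """Return c * 2^(-w*L) mod n, for odd n < 2^(w*L) and 0 <= c < n * 2^(w*L)."""
    base = 1 << w
    # n_prime = -n^(-1) mod 2^w, so that c + q*n is divisible by 2^w each step.
    n_prime = (-pow(n, -1, base)) % base
    for _ in range(L):                 # one serial step per digit
        q = ((c % base) * n_prime) % base
        c = (c + q * n) >> w           # exact division by 2^w
    if c >= n:                         # single conditional final subtraction
        c -= n
    return c

def montgomery_multiply(a, b, n, w, L):
    """Compute a * b * 2^(-w*L) mod n for 0 <= a, b < n."""
    return montgomery_reduce(a * b, n, w, L)

n = 65521                              # hypothetical odd modulus below 2^16
res = montgomery_multiply(12345, 54321, n, w=8, L=2)
```

The result carries the usual Montgomery factor of 2^(-w*L), so operands are normally converted into and out of Montgomery form around a chain of such multiplications.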
Abstract:
Embodiments are directed to homomorphic encryption for machine learning and neural networks using high-throughput Chinese remainder theorem (CRT) evaluation. An embodiment of an apparatus includes a hardware accelerator to receive a ciphertext generated by homomorphic encryption (HE) for evaluation, decompose coefficients of the ciphertext into a set of decomposed coefficients, multiply the decomposed coefficients using a set of smaller moduli determined based on a larger modulus, and convert results of the multiplying back to an original form corresponding to the larger modulus by performing a reverse Chinese remainder theorem (CRT) transform on the results of multiplying the decomposed coefficients.
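The decompose, multiply, and reverse-CRT sequence can be sketched on plain integers as follows. This is only an illustration of the arithmetic on a single coefficient: the small moduli, the function names, and the use of the classical CRT reconstruction formula are assumptions, and no homomorphic encryption is involved.

```python
from math import prod

def crt_decompose(x, moduli):
    """Split x into residues modulo each small modulus."""
    return [x % m for m in moduli]

def crt_reconstruct(residues, moduli):
    """Reverse CRT: rebuild the value modulo the product of the moduli."""
    big_q = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        q_i = big_q // m
        total += r * q_i * pow(q_i, -1, m)   # q_i * q_i^(-1) is 1 mod m
    return total % big_q

def crt_mul(a, b, moduli):
    """Multiply a and b modulo prod(moduli) using only small-modulus products."""
    ra, rb = crt_decompose(a, moduli), crt_decompose(b, moduli)
    products = [(x * y) % m for x, y, m in zip(ra, rb, moduli)]
    return crt_reconstruct(products, moduli)

MODULI = (97, 101, 103)   # hypothetical pairwise-coprime small moduli
c = crt_mul(123456, 654321, MODULI)
```

Each per-modulus product is independent of the others, which is what makes this form attractive for parallel hardware.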
Abstract:
An apparatus and method for modular multiplication. The modular multiplication apparatus includes a first operation unit for performing a first operation based on a structure of at least one of a serial multiplier and a serial squarer-based multiplier; a second operation unit for performing a second operation based on a structure of at least one of the serial multiplier and the serial squarer-based multiplier; an adder unit for outputting the sum of results of the first operation and the second operation, inputting an intermediate value stream to the first input unit, which calculates the product of the intermediate value stream and a zeta parameter, and outputting a High-Order Term as a result of Montgomery Modular Multiplication, wherein the first and second operation units output a result in digit-serial format in order from the least significant digit to the most significant digit.
Abstract:
A processor includes a decode unit to decode an instruction. The instruction indicates a first 64-bit source operand having a first 64-bit value, indicates a second 64-bit source operand having a second 64-bit value, indicates a third 64-bit source operand having a third 64-bit value, and indicates a fourth 64-bit source operand having a fourth 64-bit value. An execution unit is coupled with the decode unit. The execution unit is operable, in response to the instruction, to store a result. The result includes the first 64-bit value multiplied by the second 64-bit value added to the third 64-bit value added to the fourth 64-bit value. The execution unit may store a 64-bit least significant half of the result in a first 64-bit destination operand indicated by the instruction, and store a 64-bit most significant half of the result in a second 64-bit destination operand indicated by the instruction.
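The arithmetic of such an instruction can be modelled in a few lines. The sketch below uses Python's arbitrary-precision integers and a hypothetical function name. It also shows why two 64-bit destinations suffice: the largest possible result, (2^64 - 1)^2 + 2(2^64 - 1), equals 2^128 - 1, so the sum never overflows 128 bits.

```python
MASK64 = (1 << 64) - 1

def mul_add_add_64(a, b, c, d):
    """Model a*b + c + d on 64-bit unsigned inputs; return (low64, high64)."""
    for x in (a, b, c, d):
        if not 0 <= x <= MASK64:
            raise ValueError("operands must be unsigned 64-bit values")
    result = a * b + c + d
    return result & MASK64, result >> 64

# Worst case: every operand is 2^64 - 1; both halves come out all ones.
low, high = mul_add_add_64(MASK64, MASK64, MASK64, MASK64)
```

This no-overflow property is what makes the operation convenient as an inner step of multi-word multiplication, where one addend is a running partial sum and the other is a carry word.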
Abstract:
A calculating unit for reducing an input number with respect to a modulus, wherein the input number has input number portions of different significances, wherein the input number portions represent the input number with respect to a division number, wherein the modulus has modulus portions of different significances, and wherein the modulus portions represent the modulus with respect to the division number, includes a unit for estimating a result of an integer division of the input number by the modulus using a stored most significant portion of the number, a stored most significant portion of the modulus and the number, and for storing the estimated result in a memory of the calculating unit, and a unit for calculating a reduction result based on a subtraction of a product of the modulus and a value derived from the estimated result from the number.
Abstract:
A modular multiplication method implemented in an electronic digital processing system takes advantage of the case where one of the operands W is known in advance or used multiple times with different second operands V to speed calculation. The operands V and W and the modulus M may be integers or polynomials over a variable X. A possible choice for the type of polynomials can be polynomials of the binary finite field $GF(2^N)$. Once operand W is loaded into a data storage location, a value $P=\lfloor W\cdot X^{n+\delta}/M\rfloor$ is pre-computed by the processing system. Then when a second operand V is loaded, the quotient $\hat{q}$ for the product V·W being reduced modulo M is quickly estimated, $\hat{q}=\lfloor V\cdot P/X^{n+\delta}\rfloor$, optionally randomized, $q'=\hat{q}-E$, and can be used to obtain the remainder $r'=V\cdot W-q'\cdot M$, which is congruent to (V·W) mod M. A final reduction can be carried out, and the later steps repeated with other second operands V.
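A minimal sketch of these steps for the integer case follows, assuming X = 2 so that multiplying or dividing by a power of X becomes a shift. The function names, the example modulus, and the choice of n and δ are assumptions; the optional randomization value E is modelled as a plain non-negative integer argument that defaults to zero.

```python
def precompute(w_op, m, n, delta):
    """P = floor(W * 2^(n+delta) / M), computed once per known operand W."""
    return (w_op << (n + delta)) // m

def modmul_known_operand(v, w_op, m, p, n, delta, e=0):
    """Return (V * W) mod M using the precomputed P; e >= 0 models E."""
    q_hat = (v * p) >> (n + delta)   # quotient estimate, never too large
    q_prime = q_hat - e              # optional randomization
    r = v * w_op - q_prime * m       # congruent to V*W mod M, non-negative
    while r >= m:                    # final reduction
        r -= m
    return r

M = 1000003                          # hypothetical modulus, below 2^20
N_BITS, DELTA = 20, 2
W = 777777
P = precompute(W, M, N_BITS, DELTA)
r = modmul_known_operand(123456, W, M, P, N_BITS, DELTA)
```

Because P is rounded down, the estimated quotient never exceeds the true one, so the remainder stays non-negative and the final loop only ever subtracts M. For V below 2^(n+δ) and E = 0 that loop runs at most once.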
Abstract:
The subject invention relates to a method and apparatus for multiplication of numbers. In a specific embodiment, the subject invention can be used to perform sequential multiplication. The subject invention also pertains to a method and apparatus for modular reduction processing of a number or product of two numbers. In a specific embodiment, sequential multiplication can be incorporated to perform modular reduction processing. The subject method and apparatus can also be utilized for modular exponentiation of large numbers. In a specific embodiment, numbers larger than or equal to $2^{128}$ or even higher can be exponentiated. For example, the subject invention can be used for exponentiation of numbers as large as $2^{1024}$, $2^{2048}$, $2^{4096}$, or even larger.