Abstract:
A method of reducing computational complexity for a fixed point neural network operating in a system having a limited bit width in a multiplier-accumulator (MAC) includes reducing a number of bit shift operations when computing activations in the fixed point neural network. The method also includes balancing an amount of quantization error against an amount of overflow error when computing activations in the fixed point neural network.
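As a rough, non-authoritative sketch of the idea, the Python below performs a fixed point multiply-accumulate with a single re-alignment shift applied once per output rather than once per product, and exposes the trade-off the abstract describes: allocating more fractional bits to the output lowers quantization error but leaves less headroom before the accumulator saturates (overflow error). The function names, bit widths, and NumPy-based quantizer are illustrative assumptions, not the claimed method.

```python
import numpy as np

def quantize_to_fixed_point(x, frac_bits, total_bits):
    # Round x onto a fixed point grid with `frac_bits` fractional bits,
    # saturating to the signed range representable in `total_bits` bits.
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return np.clip(np.round(x * scale), lo, hi).astype(np.int64)

def fixed_point_mac(weights, activations, w_frac, a_frac, out_frac, acc_bits=32):
    # Accumulate raw integer products, then apply ONE shift at the end to
    # re-align the result, instead of shifting every individual product.
    assert out_frac <= w_frac + a_frac, "pick out_frac <= w_frac + a_frac"
    acc = 0
    for w, a in zip(weights, activations):
        acc += int(w) * int(a)  # each product carries w_frac + a_frac fractional bits
    acc >>= (w_frac + a_frac) - out_frac  # single re-alignment shift
    # Larger out_frac keeps more precision (less quantization error) but
    # brings the result closer to this saturation point (more overflow error).
    lo, hi = -(1 << (acc_bits - 1)), (1 << (acc_bits - 1)) - 1
    return max(lo, min(hi, acc))

# Example: dot product of two length-2 vectors in Q1.7 format.
w = quantize_to_fixed_point(np.array([0.5, -0.25]), frac_bits=7, total_bits=8)
a = quantize_to_fixed_point(np.array([0.3, 0.9]), frac_bits=7, total_bits=8)
print(fixed_point_mac(w, a, w_frac=7, a_frac=7, out_frac=7) / (1 << 7))  # ~ -0.075
```

In this sketch the choice of `out_frac` is the knob being balanced: it simultaneously sets how much precision the single shift discards and how much dynamic range remains before saturation.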
Abstract:
A method of quantizing a floating point machine learning network to obtain a fixed point machine learning network using a quantizer may include selecting at least one moment of an input distribution of the floating point machine learning network. The method may also include determining quantizer parameters for quantizing values of the floating point machine learning network, based at least in part on the at least one selected moment of the input distribution, to obtain corresponding values of the fixed point machine learning network.
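One plausible reading of this abstract, sketched below under assumptions of our own, is a uniform affine quantizer whose clipping range is derived from the first two moments of the observed distribution (mean and standard deviation) rather than from its minimum and maximum. The helper names, the 8-bit target, and the four-standard-deviation clipping window are hypothetical choices for illustration, not parameters taken from the abstract.

```python
import numpy as np

def moment_based_quantizer_params(values, num_bits=8, num_std=4.0):
    # Derive uniform quantizer parameters from the first two moments of the
    # input distribution: clip at mean +/- num_std standard deviations.
    mu = values.mean()      # first moment
    sigma = values.std()    # square root of the second central moment
    lo, hi = mu - num_std * sigma, mu + num_std * sigma
    step = (hi - lo) / ((1 << num_bits) - 1)
    zero_point = int(round(-lo / step))
    return step, zero_point

def quantize(values, step, zero_point, num_bits=8):
    # Map floating point values onto the fixed point grid, saturating the ends.
    q = np.round(values / step) + zero_point
    return np.clip(q, 0, (1 << num_bits) - 1).astype(np.uint8)

def dequantize(q, step, zero_point):
    # Recover approximate floating point values from the fixed point codes.
    return (q.astype(np.float32) - zero_point) * step

# Example: quantize Gaussian-like floating point weights to 8 bits.
weights = np.random.randn(1024).astype(np.float32)
step, zp = moment_based_quantizer_params(weights)
w_q = quantize(weights, step, zp)
print(np.abs(dequantize(w_q, step, zp) - weights).max())  # small reconstruction error
```

Compared with min/max calibration, a moment-based range is insensitive to rare outliers: a single extreme sample widens the min/max range (coarsening the step for every value) but barely moves the mean and standard deviation.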