Abstract:
An audio processing method and an audio processing apparatus are described. A mono-channel audio signal is transformed into a plurality of first subband signals. Proportions of a desired component and a noise component are estimated in each of the first subband signals. Second subband signals corresponding respectively to a plurality of channels are generated from each of the first subband signals. Each of the second subband signals comprises a first component and a second component obtained by assigning a spatial hearing property and a perceptual hearing property different from the spatial hearing property to the desired component and the noise component in the corresponding first subband signal respectively, based on a multi-dimensional auditory presentation method. The second subband signals are transformed into signals for rendering with the multi-dimensional auditory presentation method. By assigning different hearing properties to desired sound and noise, the intelligibility of the audio signal can be improved.
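A minimal sketch of the assignment described above, for one subband and a two-channel presentation. The function name, the proportion-based split, and the in-phase/out-of-phase scheme (desired sound panned to center, noise rendered diffuse) are illustrative assumptions, not the patented method itself.

```python
import numpy as np

def spatialize_subband(subband, desired_ratio):
    """Split a mono subband into desired and noise parts by an estimated
    proportion, then assign each part a different hearing property
    (hypothetical scheme): the desired part is placed in phase on both
    channels (front-center image), while the noise part is inverted on
    one channel so it is perceived as diffuse."""
    desired = desired_ratio * subband          # estimated desired component
    noise = (1.0 - desired_ratio) * subband    # estimated noise component
    left = desired + noise                     # noise in phase on the left ...
    right = desired - noise                    # ... and inverted on the right
    return np.stack([left, right])
```

Summing the two channels recovers (twice) the desired part, while the out-of-phase noise cancels, which is one simple way different spatial properties separate the two components perceptually.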
Abstract:
Embodiments are described for harmonicity estimation, audio classification, pitch determination and noise estimation. Measuring the harmonicity of an audio signal includes calculating a log amplitude spectrum of the audio signal. A first spectrum is derived by calculating each of its components as a sum of components of the log amplitude spectrum at frequencies that, on a linear frequency scale, are odd multiples of that component's frequency. A second spectrum is derived by calculating each of its components as a sum of components of the log amplitude spectrum at frequencies that, on a linear frequency scale, are even multiples of that component's frequency. A difference spectrum is derived by subtracting the first spectrum from the second spectrum. A measure of harmonicity is generated as a monotonically increasing function of the maximum component of the difference spectrum within a predetermined frequency range.
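The steps above can be sketched directly in NumPy. The frequency range, number of multiples, the averaging of the sums (to balance truncated paths), and the sigmoid used as the monotonically increasing function are all assumptions for illustration.

```python
import numpy as np

def harmonicity(x, fs, fmin=60.0, fmax=400.0, n_mult=8):
    """Sketch of the described harmonicity measure (parameters assumed).
    For each candidate bin k, sum the log amplitude spectrum at odd
    multiples of k (first spectrum) and at even multiples of k (second
    spectrum); the measure is a monotone function of the maximum of
    their difference within [fmin, fmax]."""
    spec = np.abs(np.fft.rfft(x))
    logspec = np.log(spec + 1e-12)             # log amplitude spectrum
    n = len(logspec)
    first = np.zeros(n)
    second = np.zeros(n)
    for k in range(1, n):
        odd = [m * k for m in range(1, 2 * n_mult, 2) if m * k < n]
        even = [m * k for m in range(2, 2 * n_mult + 1, 2) if m * k < n]
        first[k] = logspec[odd].sum() / max(len(odd), 1)
        second[k] = logspec[even].sum() / max(len(even), 1)
    diff = second - first                      # difference spectrum
    bins = np.arange(n) * fs / len(x)
    peak = diff[(bins >= fmin) & (bins <= fmax)].max()
    return 1.0 / (1.0 + np.exp(-peak))         # monotonically increasing map
```

For a harmonic signal with fundamental f0, the candidate at f0/2 has its even multiples landing on harmonics and its odd multiples landing between them, so the difference spectrum peaks there; for noise the two sums are statistically similar and the measure stays near the sigmoid midpoint.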
Abstract:
Apparatus and methods for controlling a jitter buffer are described. In one embodiment, the apparatus for controlling a jitter buffer includes an inter-talkspurt delay jitter estimator for estimating an offset value of the delay of a first frame in the current talkspurt with respect to the delay of a latest anchor frame in a previous talkspurt, and a jitter buffer controller for adjusting a length of the jitter buffer based on a long term length of the jitter buffer for each frame and the offset value.
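A toy sketch of the two cooperating pieces named above: the inter-talkspurt offset estimate and the buffer-length adjustment that combines it with a per-frame long-term length. The function names, clamping limits, and the additive combination rule are assumptions; the abstract does not specify how the two quantities are combined.

```python
def inter_talkspurt_offset(first_frame_delay, anchor_frame_delay):
    """Offset of the delay of the first frame of the current talkspurt
    with respect to the delay of the latest anchor frame of the
    previous talkspurt."""
    return first_frame_delay - anchor_frame_delay

def target_jitter_buffer(long_term_len, offset, min_len=1, max_len=20):
    """Sketch (combination rule and limits assumed): correct the
    long-term buffer length by the estimated offset, clamped to an
    allowed range of frames."""
    return min(max(long_term_len + offset, min_len), max_len)
```

For example, if the first frame of a new talkspurt arrives 20 ms later than the previous anchor frame predicted, the controller would grow the buffer by the corresponding number of frames, up to its maximum.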
Abstract:
Systems, methods, and computer program products for audio processing based on convolutional neural network (CNN) are described. A first CNN architecture may comprise a contracting path of a U-net, a multi-scale CNN, and an expansive path of a U-net. The contracting path may comprise a first encoding layer and may be configured to generate an output representation of the contracting path. The multi-scale CNN may be configured to generate, based on the output representation of the contracting path, an intermediate representation. The multi-scale CNN may comprise at least two parallel convolution paths. The expansive path may comprise a first decoding layer and may be configured to generate a final representation based on the intermediate representation generated by the multi-scale CNN. Within a second CNN architecture, the first encoding layer may comprise a first multi-scale CNN with at least two parallel convolution paths, and the first decoding layer may comprise a second multi-scale CNN with at least two parallel convolution paths.
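The core idea of a multi-scale CNN block with parallel convolution paths can be illustrated with a tiny NumPy stand-in: the same input is filtered at several receptive-field sizes and the path outputs are merged. The kernel sizes, the uniform (untrained) kernels, and averaging as the merge rule are assumptions; a real implementation would use learned weights in a deep-learning framework.

```python
import numpy as np

def multiscale_block(x, kernel_sizes=(3, 5, 7)):
    """Sketch of a multi-scale block: parallel convolution paths with
    different kernel sizes process the same input, and their outputs
    are combined (here by averaging)."""
    paths = []
    for k in kernel_sizes:
        kernel = np.ones(k) / k                 # stand-in for learned weights
        paths.append(np.convolve(x, kernel, mode="same"))
    return np.mean(paths, axis=0)               # merge the parallel paths
```

In the described architectures such a block would sit either between the contracting and expansive paths of the U-net, or inside the individual encoding/decoding layers.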
Abstract:
In some embodiments, virtualization methods generate a binaural signal in response to channels of a multi-channel audio signal by applying a binaural room impulse response (BRIR) to each channel, including using at least one feedback delay network (FDN) to apply a common late reverberation to a downmix of the channels. In some embodiments, the input signal channels are processed in a first processing path that applies to each channel the direct response and early reflection portion of a single-channel BRIR for that channel, and the downmix of the channels is processed in a second processing path including at least one FDN which applies the common late reverberation. Typically, the common late reverberation emulates collective macro attributes of the late reverberation portions of at least some of the single-channel BRIRs. Other aspects are headphone virtualizers configured to perform any embodiment of the method.
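A feedback delay network of the kind used for common late reverberation can be sketched as a few mutually coupled delay lines fed back through an orthogonal matrix. The delay lengths, the feedback gain, and the scaled Hadamard matrix below are assumptions for illustration, not the parameters of the described virtualizer.

```python
import numpy as np

def fdn_reverb(x, delays=(149, 211, 263, 293), gain=0.7):
    """Sketch of a feedback delay network (FDN) applying a common late
    reverberation to a (downmixed) mono signal. Four delay lines are
    read, mixed through an orthogonal feedback matrix, attenuated, and
    written back together with the input."""
    h = np.array([[1,  1,  1,  1],
                  [1, -1,  1, -1],
                  [1,  1, -1, -1],
                  [1, -1, -1,  1]]) / 2.0      # orthogonal (Hadamard/2)
    n = len(delays)
    bufs = [np.zeros(d) for d in delays]       # circular delay-line buffers
    idx = [0] * n
    out = np.zeros(len(x))
    for t, s in enumerate(x):
        taps = np.array([bufs[i][idx[i]] for i in range(n)])
        out[t] = taps.sum() / n                # simple output mix
        fb = gain * (h @ taps)                 # decaying feedback
        for i in range(n):
            bufs[i][idx[i]] = s + fb[i]
            idx[i] = (idx[i] + 1) % len(bufs[i])
    return out
```

Because the feedback matrix is orthogonal and scaled by a gain below one, the impulse response builds up an increasingly dense, exponentially decaying echo pattern, which is why a single FDN can emulate the shared macro attributes of many late reverberation tails.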
Abstract:
The present disclosure relates to reverberation generation for headphone virtualization. A method of generating one or more components of a binaural room impulse response (BRIR) for headphone virtualization is described. In the method, directionally-controlled reflections are generated, wherein directionally-controlled reflections impart a desired perceptual cue to an audio input signal corresponding to a sound source location. Then at least the generated reflections are combined to obtain the one or more components of the BRIR. Corresponding system and computer program products are described as well.
Abstract:
An audio signal with a temporal sequence of blocks or frames is received or accessed. Features are determined that aggregately characterize the audio blocks/frames processed recently, relative to the current time. This feature determination exceeds a specificity criterion and is delayed relative to the recently processed blocks/frames. Voice activity (VAD) is detected in the audio signal. The VAD is based on a decision that exceeds a preset sensitivity threshold, is computed over a brief time period relative to the block/frame duration, and relates to features of the current block/frame. The VAD and the recent feature determination are combined with state-related information, which is based on a history of previous feature determinations compiled from multiple features over a time prior to the recent feature determination period. Decisions to commence or terminate the audio signal, or related gains, are output based on the combination.
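One simple way such a combination can behave is as a hysteresis gate: the fast, sensitive VAD decision can open the gate, while the slower but more specific feature score controls when it closes. The thresholds and the particular open/close rule below are assumptions, not the claimed combination logic.

```python
def update_gate(is_open, vad_now, delayed_speech_score,
                open_thresh=0.6, close_thresh=0.3):
    """Sketch (thresholds assumed): combine a fast, sensitive VAD flag
    with a delayed but more specific aggregate feature score. The gate
    commences only when both agree, and terminates only when both the
    fast flag and the specific score have dropped, giving hysteresis."""
    if not is_open:
        # commence: require the fast decision AND a confident specific score
        return vad_now and delayed_speech_score > open_thresh
    # terminate: only when the fast flag is off AND the score is low
    return vad_now or delayed_speech_score > close_thresh
```

The asymmetric thresholds mean brief dips in either signal do not toggle the output, which is the usual reason for combining a sensitive short-term detector with delayed, more specific features.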
Abstract:
A method for steering binauralization of audio is provided. The method comprises the steps of: receiving (410) an audio input signal; calculating (430) a confidence value indicating the likelihood that a current audio frame of the audio input signal comprises binauralized audio; determining (450) a state signal based on the confidence value; determining (460) a steering signal based on the confidence value, the state signal and an energy value of the audio frame; and generating (470) an audio output signal with steered binauralization by processing the audio input signal according to the steering signal.
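A per-frame steering update combining the three inputs named above might look like the following. The state labels, smoothing constants, and energy floor are all assumptions; the point is only that the steering value tracks the confidence, is gated by the state, and ignores frames too quiet to classify.

```python
def steering_signal(confidence, state, energy, prev_steer,
                    attack=0.9, release=0.1, energy_floor=1e-6):
    """Sketch (constants and state labels assumed): derive the steering
    value for one frame from the binauralization confidence, the state
    signal, and the frame energy. Low-energy frames keep the previous
    value; attack/release smoothing avoids abrupt switching."""
    if energy < energy_floor:
        return prev_steer                      # too quiet to re-decide
    target = confidence if state == "binaural" else 0.0
    alpha = attack if target > prev_steer else release
    return alpha * target + (1.0 - alpha) * prev_steer
```

A fast attack and slow release make the processing engage quickly on detected binaural content but back off gradually, so the audible rendering does not flip frame by frame.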
Abstract:
The present invention relates to a method and device for processing first and second audio signals representing an input binaural audio signal acquired by a binaural recording device. The present invention further relates to a method for rendering a binaural audio signal on a speaker system. The method for processing a binaural signal comprises extracting audio information from the first audio signal, computing band gains for reducing noise in the first audio signal, and applying the band gains to respective frequency bands of the first audio signal in accordance with a dynamic scaling factor, to provide a first output audio signal. The dynamic scaling factor has a value between zero and one and is selected so as to reduce quality degradation of the first audio signal.
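The scaled application of band gains can be sketched as a blend between unity gain and the computed noise-reduction gains. Treating the scaling factor as a per-call scalar and blending linearly in the gain domain are assumptions; the abstract only requires the factor to lie in [0, 1] and to limit quality degradation.

```python
import numpy as np

def apply_scaled_gains(band_signals, band_gains, scaling):
    """Sketch: blend noise-reduction band gains toward unity by a
    dynamic scaling factor in [0, 1]. scaling = 1 applies the full
    gains; scaling = 0 leaves the signal untouched, so aggressive
    gains cannot degrade quality more than the factor allows."""
    assert 0.0 <= scaling <= 1.0
    effective = (1.0 - scaling) + scaling * np.asarray(band_gains)
    return [g * s for g, s in zip(effective, band_signals)]
```

Lowering the scaling factor is thus a single knob that trades residual noise against processing artifacts across all frequency bands at once.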