Abstract:
A complex number quantization-based audio signal encoding method may comprise: estimating a scale factor for each subband of an input audio signal; performing complex magnitude scaling for each subband based on the scale factor; and performing polar quantization on a complex frequency coefficient for each subband, wherein the performing the polar quantization for each subband comprises applying two or more different magnitude quantization techniques based on the magnitude of the complex frequency coefficient scaled for each subband.
Abstract:
A generative adversarial network-based audio signal generation model for generating a high quality audio signal may comprise: a generator generating an audio signal with an external input; a harmonic-percussive separation model separating the generated audio signal into a harmonic component signal and a percussive component signal; and at least one discriminator evaluating whether each of the harmonic component signal and the percussive component signal is real or fake.
Abstract:
A method, executed by a processor for compressing an audio signal in multiple layers, may comprise: (a) restoring, in a highest layer, an input audio signal as a first signal; (b) restoring, in at least one intermediate layer, a signal obtained by subtracting an upsampled signal, which is obtained by upsampling the audio signal restored in the highest layer or an immediately previous intermediate layer, from the input audio signal as a second signal; and (c) restoring, in a lowest layer, a signal obtained by subtracting an upsampled signal, which is obtained by upsampling the audio signal restored in an intermediate layer immediately before the lowest layer, from the input audio signal as a third signal, wherein the first signal, the second signal, and the third signal are combined to output a final restoration audio signal.
Abstract:
A method for encoding an input signal using N flow blocks (N is a natural number greater than or equal to 2) and (N−1) split block(s), which is performed by a processor, may comprise: transmitting, by a k-th flow block (k is a natural number greater than or equal to 1 and less than or equal to N−1) among the N flow blocks, a k-th transformation signal obtained by transforming a received signal into a latent representation to a k-th split block among the (N−1) split block(s); splitting, by the k-th split block, the k-th transformation signal by a predetermined ratio, into a first split signal and a second split signal; transmitting, by the k-th split block, the first split signal to a (k+1)-th flow block; and quantizing a signal transformed by an N-th flow block and the second split signals using a quantization block.
Abstract:
Disclosed is a method and apparatus for training a speech signal. A speech signal training apparatus of the present disclosure may include a target speaker speech database storing a target speaker speech signal; a multi-speaker speech database storing a multi-speaker speech signal; a target speaker acoustic parameter extracting unit extracting an acoustic parameter of a training subject speech signal from the target speaker speech signal; a similar speaker acoustic parameter determining unit extracting at least one similar speaker speech signal from the multi-speaker speech signals, and determining an auxiliary speech feature of the similar speaker speech signal; and an acoustic parameter model training unit determining an acoustic parameter model by performing model training for a relation between the acoustic parameter and text by using the acoustic parameter and the auxiliary speech feature, and setting mapping information of the relation between the acoustic parameter model and the text.
Abstract:
Provided is an apparatus for detecting a speech/non-speech section. The apparatus includes an acquisition unit which obtains inter-channel relation information of a stereo audio signal, a classification unit which classifies each element of the stereo audio signal into a center channel element and a surround element on the basis of the inter-channel relation information, a calculation unit which calculates an energy ratio value between a center channel signal composed of center channel elements and a surround channel signal composed of surround elements, for each frame, and an energy ratio value between the stereo audio signal and a mono signal generated on the basis of the stereo audio signal, and a judgment unit which determines a speech section and a non-speech section from the stereo audio signal by comparing the energy ratio values.