摘要:
[Object] To provide a sound source separation technology capable of improving the separation performance.[Solution] An information processing apparatus including: an acquisition section configured to acquire an observation signal obtained by observing a sound; and a sound source separation section configured to separate the observation signal acquired by the acquisition section into a plurality of separated signals corresponding to a plurality of assumed sound sources by applying a non-linear function to a matrix product of an input vector and a coefficient vector corresponding to each of the plurality of sound sources.
摘要:
There is provided a sound source separation method of carrying out sound source separation of an audio signal inputted from an input device by using a modeled sound source distribution, by an information processing apparatus provided with a processing device, a storage device, the input device, and an output device. In this method, as a condition followed by the model, sound sources are independent of one another, powers which the sound sources have are modeled for each of frequency bands obtained through band division, a relationship among the powers for the frequency bands different from each other is modeled by nonnegative matrix factorization, and components obtained through the division of the sound source follow a complex normal distribution.
摘要:
Methods and systems for audio source separation in real-time are described. In an embodiment, the present disclosure describes reading and decoding an audio source into PCM samples, fragmenting Pulse Code Modulation (PCM) samples into fragments, transforming fragments into spectrograms, performing audio source separation using a deep neural network (DNN) to generate an estimated magnitude spectrogram of the component(s) of the audio source, reconstructing the estimated time domain component signals, and streaming the component signals to a playback engine. In an embodiment, a semantic equalizer graphical user allows for real-time mixing of individual component signals.
摘要:
Provided are systems and methods for microphone signal fusion. An example method commences with receiving a first and second signal representing sounds captured, respectively, by external and internal microphones. The internal microphone is located inside an ear canal and sealed for isolation from outside acoustic signals. The external microphone is located outside the ear canal. The first signal comprises a voice component. The second signal comprises a voice component modified by at least human tissue. The first and second signals are processed to obtain noise estimates. The voice component of the second signal is aligned with the voice component of the first signal. The first signal and the aligned voice component of the second signal are blended, based on the noise estimates, to generate an enhanced voice signal. Prior to aligning, the voice component of the second signal may be processed to emphasize high frequency content, improving effective alignment bandwidth.
摘要:
Technologies are described for identifying familiar or interesting parts of music content by analyzing changes in vocal power using frequency spectrums. For example, a frequency spectrum can be generated from digitized audio. Using the frequency spectrum, the harmonic content and percussive content can be separated. The vocal content can then be separated from the harmonic and/or percussive content. The vocal content can then be processed to identify surge points in the digitized audio. In some implementations, the vocal content is included in the harmonic content during the separation procedure and is then separated from the harmonic content.
摘要:
Embodiments of the example embodiment relate to audio object extraction. A method for audio object extraction from audio content is disclosed. The method comprises determining a sub-band object probability for a sub-band of the audio signal in a frame of the audio content, the sub-band object probability indicating a probability of the sub-band of the audio signal containing an audio object. The method further comprises splitting the sub-band of the audio signal into an audio object portion and a residual audio portion based on the determined sub-band object probability. Corresponding system and computer program product are also disclosed.
摘要:
The disclosure provides a method and an apparatus for detecting a voice activity in an input audio signal composed of frames. A noise attribute of the input signal is determined based on a received frame of the input audio signal. A voice activity detection (VAD) parameter is derived based on the noise attribute of the input audio signal using an adaptive function. The derived VAD parameter is compared with a threshold value to provide a voice activity detection decision. The input audio signal is processed according to the voice activity detection decision.
摘要:
Sound processing techniques using recurrent neural networks are described. In one or more implementations, temporal dependencies are captured in sound data that are modeled through use of a recurrent neural network (RNN). The captured temporal dependencies are employed as part of feature extraction performed using nonnegative matrix factorization (NMF). One or more sound processing techniques are performed on the sound data based at least in part on the feature extraction.
摘要:
A device and a method for determining a speech segment with a high degree of accuracy from a sound signal in which different sounds coexist are provided. Directional points indicating the direction of arrival of the sound signal are connected in the temporal direction, and a speech segment is detected. In this configuration, pattern classification is performed in accordance with directional characteristics with respect to the direction of arrival, and a directionality pattern and a null beam pattern are generated from the classification results. Also, an average null beam pattern is also generated by calculating the average of the null beam patterns at a time when a non-speech-like signal is input. Further, a threshold that is set at a slightly lower value than the average null beam pattern is calculated as the threshold to be used in detecting the local minimum point corresponding to the direction of arrival from each null beam pattern, and a local minimum point equal to or lower than the threshold is determined to be the point corresponding to the direction of arrival.
摘要:
Provided are systems and methods for microphone signal fusion. An example method commences with receiving a first and second signal representing sounds captured, respectively, by internal and external microphones. The second signal includes at least a voice component. The first signal and the voice component are modified by at least human tissue. The first and second signals are processed to obtain noise estimates. The first signal is aligned with the second signal. The second signal and the aligned first signal are blended based on the noise estimates to generate an enhanced voice signal. The internal microphone is located inside an ear canal and sealed for isolation from acoustic signals outside the ear canal. The external microphone is located outside the ear canal. All of parts of the processing, blending and aligning of the systems and method may be performed on a subband basis in the frequency domain.