SEPARATING SPEECH BY SOURCE IN AUDIO RECORDINGS BY PREDICTING ISOLATED AUDIO SIGNALS CONDITIONED ON SPEAKER REPRESENTATIONS

    Publication No.: US20210249027A1

    Publication Date: 2021-08-12

    Application No.: US17170657

    Filing Date: 2021-02-08

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for performing speech separation. One of the methods includes obtaining a recording comprising speech from a plurality of speakers; processing the recording using a speaker neural network having speaker parameter values and configured to process the recording in accordance with the speaker parameter values to generate a plurality of per-recording speaker representations, each speaker representation representing features of a respective identified speaker in the recording; and processing the per-recording speaker representations and the recording using a separation neural network having separation parameter values and configured to process the recording and the speaker representations in accordance with the separation parameter values to generate, for each speaker representation, a respective predicted isolated audio signal that corresponds to speech of one of the speakers in the recording.

    SEMI-SUPERVISED TEXT-TO-SPEECH BY GENERATING SEMANTIC AND ACOUSTIC REPRESENTATIONS

    Publication No.: US20250157456A1

    Publication Date: 2025-05-15

    Application No.: US18832325

    Filing Date: 2024-01-26

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating an audio signal from input text. In one aspect, a method comprises receiving a request to convert input text into an audio signal, wherein the input text comprises multiple tokenized text inputs, generating, using a first generative neural network, a semantic representation of the tokenized text inputs comprising semantic tokens representing semantic content of the tokenized text inputs, each semantic token being selected from a vocabulary of semantic tokens, generating, using a second generative neural network and conditioned on at least the semantic representation, an acoustic representation of the semantic representation comprising one or more respective acoustic tokens representing acoustic properties of the audio signal, and processing the acoustic representation using a decoder neural network to generate the audio signal.

    COMPRESSING AUDIO WAVEFORMS USING A STRUCTURED LATENT SPACE

    Publication No.: US20250022477A1

    Publication Date: 2025-01-16

    Application No.: US18278746

    Filing Date: 2023-03-16

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training an encoder neural network and a decoder neural network. In one aspect, a method includes obtaining a first initial audio waveform and a first noisy audio waveform, obtaining a second initial audio waveform and a second noisy audio waveform, processing the first noisy audio waveform and the second noisy audio waveform using an encoder neural network, generating a blended embedding by concatenating: (i) clean feature dimensions from an embedding of the first noisy audio waveform, and (ii) noise feature dimensions from an embedding of the second noisy audio waveform, processing the blended embedding using a decoder neural network to generate a reconstructed audio waveform, determining gradients of an objective function; and updating parameter values of the encoder neural network and the decoder neural network using the gradients.
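
The blending step in this abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that an embedding is a flat list of floats whose first `num_clean_dims` entries are the "clean" feature dimensions and whose remaining entries are the "noise" feature dimensions; the abstract does not say how the dimensions are laid out, and the function name `blend_embeddings` is hypothetical.

```python
from typing import List, Sequence


def blend_embeddings(
    first_embedding: Sequence[float],
    second_embedding: Sequence[float],
    num_clean_dims: int,
) -> List[float]:
    """Concatenate clean dims of the first embedding with noise dims of the second.

    Assumption (not from the abstract): dims [0, num_clean_dims) are "clean"
    and dims [num_clean_dims, end) are "noise".
    """
    if len(first_embedding) != len(second_embedding):
        raise ValueError("embeddings must have the same length")
    if not 0 <= num_clean_dims <= len(first_embedding):
        raise ValueError("num_clean_dims is out of range")
    clean_part = list(first_embedding[:num_clean_dims])
    noise_part = list(second_embedding[num_clean_dims:])
    return clean_part + noise_part


# Toy values standing in for encoder outputs of two noisy waveforms.
blended = blend_embeddings([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], 2)
```

In the method as described, a decoder would then reconstruct a waveform from the blended embedding and the result would be scored by an objective function; neither of those steps is modeled here.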

    Generating coded data representations using neural networks and vector quantizers

    Publication No.: US12198710B2

    Publication Date: 2025-01-14

    Application No.: US18400992

    Filing Date: 2023-12-29

    Applicant: Google LLC

    Abstract: Methods, systems and apparatus, including computer programs encoded on computer storage media. According to one aspect, there is provided a method comprising: receiving a new input; processing the new input using an encoder neural network to generate a feature vector representing the new input; and generating a coded representation of the feature vector using a sequence of vector quantizers that are each associated with a respective codebook of code vectors, wherein the coded representation of the feature vector identifies a plurality of code vectors, including a respective code vector from the codebook of each vector quantizer, that define a quantized representation of the feature vector.
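
One common way to realize "a sequence of vector quantizers" is residual quantization, sketched below. This is an assumption rather than something the abstract states: each quantizer is taken to encode the residual left by the previous ones, and the quantized representation is taken to be the sum of the selected code vectors. The names `nearest_index` and `encode` are hypothetical.

```python
from typing import List, Sequence, Tuple

Codebook = Sequence[Sequence[float]]


def nearest_index(vector: Sequence[float], codebook: Codebook) -> int:
    """Return the index of the code vector closest to `vector` (squared L2)."""
    def sq_dist(code: Sequence[float]) -> float:
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def encode(
    feature: Sequence[float], codebooks: Sequence[Codebook]
) -> Tuple[List[int], List[float]]:
    """Pick one code vector per codebook; return (indices, quantized vector).

    Assumption: quantizer k sees the residual left after quantizers 0..k-1.
    """
    residual = list(feature)
    quantized = [0.0] * len(feature)
    indices: List[int] = []
    for codebook in codebooks:
        idx = nearest_index(residual, codebook)
        code = codebook[idx]
        indices.append(idx)
        quantized = [q + c for q, c in zip(quantized, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, quantized


# Two tiny hand-written codebooks; real codebooks would be learned.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
indices, quantized = encode([1.2, 0.8], codebooks)
```

Here `indices` plays the role of the coded representation (one index per quantizer), and `quantized` is the approximation of the feature vector that those indices define.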

    END-TO-END SPEECH DIARIZATION VIA ITERATIVE SPEAKER EMBEDDING

    Publication No.: US20240144957A1

    Publication Date: 2024-05-02

    Application No.: US18544647

    Filing Date: 2023-12-19

    Applicant: Google LLC

    Abstract: A method includes receiving an input audio signal corresponding to utterances spoken by multiple speakers. The method also includes encoding the input audio signal into a sequence of T temporal embeddings. During each of a plurality of iterations each corresponding to a respective speaker of the multiple speakers, the method includes selecting a respective speaker embedding for the respective speaker by determining a probability that the corresponding temporal embedding includes a presence of voice activity by a single new speaker for which a speaker embedding was not previously selected during a previous iteration and selecting the respective speaker embedding for the respective speaker as the temporal embedding. The method also includes, at each time step, predicting a respective voice activity indicator for each respective speaker of the multiple speakers based on the respective speaker embeddings selected during the plurality of iterations and the temporal embedding.
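
The iterative selection loop described here can be sketched as follows. The sketch assumes that each iteration picks the time step with the highest "single new speaker" probability, which the abstract does not state explicitly. The scoring function `toy_probability` is a hand-written stand-in for the learned model (it favors embeddings with large norm that are dissimilar to those already selected), and all names are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Embedding = Sequence[float]


def select_speaker_embeddings(
    temporal_embeddings: Sequence[Embedding],
    new_speaker_probability: Callable[[Embedding, List[Embedding]], float],
    num_speakers: int,
) -> List[Embedding]:
    """Run one iteration per speaker, selecting one temporal embedding each time."""
    selected: List[Embedding] = []
    for _ in range(num_speakers):
        probs = [new_speaker_probability(e, selected) for e in temporal_embeddings]
        # Assumption: take the most probable time step (ties go to the earliest).
        best_t = max(range(len(probs)), key=probs.__getitem__)
        selected.append(temporal_embeddings[best_t])
    return selected


def toy_probability(embedding: Embedding, selected: List[Embedding]) -> float:
    """Toy stand-in for a learned scorer; not the patent's model."""
    norm = math.sqrt(sum(x * x for x in embedding))
    if norm == 0.0:
        return 0.0  # treat a zero embedding as silence

    def cosine(other: Embedding) -> float:
        other_norm = math.sqrt(sum(x * x for x in other))
        dot = sum(a * b for a, b in zip(embedding, other))
        return dot / (norm * other_norm)

    similarity = max((cosine(s) for s in selected), default=0.0)
    return min(1.0, norm) * max(0.0, 1.0 - similarity)


# Four time steps: silence, speaker A, speaker A again (slightly off), speaker B.
frames = [[0.0, 0.0], [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
speakers = select_speaker_embeddings(frames, toy_probability, num_speakers=2)
```

The per-time-step voice activity prediction that follows in the abstract would take `speakers` and each temporal embedding as input; it is omitted because the abstract gives no detail on how that prediction is computed.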

    COMPRESSING AUDIO WAVEFORMS USING NEURAL NETWORKS AND VECTOR QUANTIZERS

    Publication No.: US20230186927A1

    Publication Date: 2023-06-15

    Application No.: US18106094

    Filing Date: 2023-02-06

    Applicant: Google LLC

    Abstract: Methods, systems and apparatus, including computer programs encoded on computer storage media. One of the methods includes receiving an audio waveform that includes a respective audio sample for each of a plurality of time steps, processing the audio waveform using an encoder neural network to generate a plurality of feature vectors representing the audio waveform, generating a respective coded representation of each of the plurality of feature vectors using a plurality of vector quantizers that are each associated with a respective codebook of code vectors, wherein the respective coded representation of each feature vector identifies a plurality of code vectors, including a respective code vector from the codebook of each vector quantizer, that define a quantized representation of the feature vector, and generating a compressed representation of the audio waveform by compressing the respective coded representation of each of the plurality of feature vectors.
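
As a rough guide to what such a coded representation costs before the final compression step, the arithmetic below counts the bits needed to store one code-vector index per quantizer with fixed-length codes. The codebook sizes and the feature-vector rate used in the example are illustrative values, not figures taken from the patent, and the function names are hypothetical.

```python
import math
from typing import Sequence


def bits_per_feature_vector(codebook_sizes: Sequence[int]) -> int:
    """Bits to store one index per quantizer using fixed-length codes."""
    if any(size < 1 for size in codebook_sizes):
        raise ValueError("each codebook must contain at least one code vector")
    return sum(math.ceil(math.log2(size)) for size in codebook_sizes)


def bitrate_bps(codebook_sizes: Sequence[int], vectors_per_second: float) -> float:
    """Uncompressed bitrate given how many feature vectors are coded per second."""
    return bits_per_feature_vector(codebook_sizes) * vectors_per_second


# Illustrative setup: 8 quantizers with 1024 code vectors each, 75 vectors/s.
sizes = [1024] * 8
frame_bits = bits_per_feature_vector(sizes)
rate = bitrate_bps(sizes, 75)
```

With these illustrative values each feature vector costs 80 bits, or 6000 bits per second; the compression step named in the abstract would aim to bring the stored size below this fixed-length baseline.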
