Two-level text-to-speech systems using synthetic training data

    Publication No.: US12260851B2

    Publication date: 2025-03-25

    Application No.: US17305809

    Filing date: 2021-07-14

    Applicant: Google LLC

    Abstract: A method includes obtaining training data including a plurality of training audio signals and corresponding transcripts. Each training audio signal is spoken by a target speaker in a first accent/dialect. For each training audio signal of the training data, the method includes generating a training synthesized speech representation spoken by the target speaker in a second accent/dialect different than the first accent/dialect and training a text-to-speech (TTS) system based on the corresponding transcript and the training synthesized speech representation. The method also includes receiving an input text utterance to be synthesized into speech in the second accent/dialect. The method also includes obtaining conditioning inputs that include a speaker embedding and an accent/dialect identifier that identifies the second accent/dialect. The method also includes generating an output audio waveform corresponding to a synthesized speech representation of the input text sequence that clones the voice of the target speaker in the second accent/dialect.

    Two-Level Speech Prosody Transfer

    Publication No.: US20230064749A1

    Publication date: 2023-03-02

    Application No.: US18054604

    Filing date: 2022-11-11

    Applicant: Google LLC

    Abstract: A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.

    Two-level speech prosody transfer

    Publication No.: US11514888B2

    Publication date: 2022-11-29

    Application No.: US16992410

    Filing date: 2020-08-13

    Applicant: Google LLC

    Abstract: A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.

    Two-Level Text-To-Speech Systems Using Synthetic Training Data

    Publication No.: US20230018384A1

    Publication date: 2023-01-19

    Application No.: US17305809

    Filing date: 2021-07-14

    Applicant: Google LLC

    Abstract: A method includes obtaining training data including a plurality of training audio signals and corresponding transcripts. Each training audio signal is spoken by a target speaker in a first accent/dialect. For each training audio signal of the training data, the method includes generating a training synthesized speech representation spoken by the target speaker in a second accent/dialect different than the first accent/dialect and training a text-to-speech (TTS) system based on the corresponding transcript and the training synthesized speech representation. The method also includes receiving an input text utterance to be synthesized into speech in the second accent/dialect. The method also includes obtaining conditioning inputs that include a speaker embedding and an accent/dialect identifier that identifies the second accent/dialect. The method also includes generating an output audio waveform corresponding to a synthesized speech representation of the input text sequence that clones the voice of the target speaker in the second accent/dialect.

    CONTEXT-BASED SPEAKER COUNTER FOR A SPEAKER DIARIZATION SYSTEM

    Publication No.: US20230103060A1

    Publication date: 2023-03-30

    Application No.: US17909879

    Filing date: 2020-03-13

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for determining the number of speakers in a video and a corresponding audio using visual context. In one aspect, a method includes detecting within the video multiple speakers, determining a bounding box for each detected speaker that includes the detected person and objects within a threshold distance of the detected person in an image frame, determining a unique descriptor for that person based in part on image information depicting the objects within the bounding box, determining a cardinality of unique speakers in the video, and providing to the speaker diarization system the cardinality of unique speakers.

    Weighted answer terms for scoring answer passages

    Publication No.: US10019513B1

    Publication date: 2018-07-10

    Application No.: US14824533

    Filing date: 2015-08-12

    Applicant: Google LLC

    CPC classification number: G06F16/3344

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating answer terms for scoring answer passages. In one aspect, a method includes accessing resource data describing a set of resources, identifying question phrases in the resources, for each identified question phrase in a resource, selecting in the resource a section of text subsequent to the question phrase as an answer, the answer having a plurality of terms, grouping the question phrases into groups of question phrases, and for each group: generating, from the terms of the answers for each question phrase in the group, answer terms and for each answer term, an answer term weight, and storing the answer terms and answer term weights in association with one or more queries.
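
    Illustrative sketch (not part of the patent record): the abstract above does not say how question phrases are grouped or how answer term weights are computed. The Python below is a minimal sketch under two loudly stated assumptions: question phrases are grouped by a normalized lowercase form, and each answer term's weight is its relative frequency among all answer terms in the group. The names `normalize_question` and `answer_term_weights` are hypothetical.

```python
import re
from collections import Counter, defaultdict


def normalize_question(question: str) -> str:
    """Hypothetical grouping key: lowercase alphanumeric tokens joined by spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", question.lower()))


def answer_term_weights(qa_pairs):
    """Map each question group to {answer term: weight}.

    qa_pairs is an iterable of (question_phrase, answer_text) tuples.
    Assumption: weight = term count / total term count within the group.
    """
    grouped = defaultdict(Counter)
    for question, answer in qa_pairs:
        terms = re.findall(r"[a-z0-9]+", answer.lower())
        grouped[normalize_question(question)].update(terms)

    weights = {}
    for key, counts in grouped.items():
        total = sum(counts.values())
        weights[key] = {term: n / total for term, n in counts.items()}
    return weights


# Two phrasings of the same question fall into one group.
pairs = [
    ("How far is the moon?", "About 384,400 km from Earth"),
    ("how far is the Moon", "Roughly 384,400 km away"),
]
weights = answer_term_weights(pairs)
```

    Terms that recur across the answers in a group (here `km`) receive a higher weight than terms that appear only once, which is the intuition behind storing weighted answer terms alongside a query.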

    Two-Level Text-To-Speech Systems Using Synthetic Training Data

    Publication No.: US20250078808A1

    Publication date: 2025-03-06

    Application No.: US18949095

    Filing date: 2024-11-15

    Applicant: Google LLC

    Abstract: A method includes obtaining training data including a plurality of training audio signals and corresponding transcripts. Each training audio signal is spoken by a target speaker in a first accent/dialect. For each training audio signal of the training data, the method includes generating a training synthesized speech representation spoken by the target speaker in a second accent/dialect different than the first accent/dialect and training a text-to-speech (TTS) system based on the corresponding transcript and the training synthesized speech representation. The method also includes receiving an input text utterance to be synthesized into speech in the second accent/dialect. The method also includes obtaining conditioning inputs that include a speaker embedding and an accent/dialect identifier that identifies the second accent/dialect. The method also includes generating an output audio waveform corresponding to a synthesized speech representation of the input text sequence that clones the voice of the target speaker in the second accent/dialect.
