PHONEME-BASED TEXT TRANSCRIPTION SEARCHING
    Invention Publication

    Publication No.: US20230386472A1

    Publication Date: 2023-11-30

    Application No.: US17804508

    Filing Date: 2022-05-27

    Inventor: Yuchen LI

    Abstract: A computer-implemented method is disclosed. A search query of a text transcription is received. The search query includes a word or words having a specified spelling. A sequence of search phonemes corresponding to the specified spelling is generated. A sequence of transcript phonemes is generated from the text transcription. A search alignment in which the sequence of search phonemes is aligned to a transcript phoneme fragment is generated. Based at least on the search alignment having a quality score exceeding a quality score threshold, the transcript phoneme fragment and an associated portion of the text transcription are determined to result from an utterance of the specified spelling in an audio session corresponding to the text transcription. A search result indicating that the transcript phoneme fragment and the associated portion of the text transcription resulted from the utterance is output.
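The windowed alignment the abstract describes can be sketched in Python. This is purely illustrative: `difflib`'s `ratio()` stands in for the patent's unspecified quality score, ARPAbet-style phoneme strings are assumed, and all names are hypothetical.

```python
from difflib import SequenceMatcher

def find_phoneme_matches(search_phonemes, transcript_phonemes, threshold=0.8):
    """Slide the search phoneme sequence over the transcript phonemes and
    keep each window (fragment) whose alignment score clears the threshold."""
    n = len(search_phonemes)
    matches = []
    for start in range(len(transcript_phonemes) - n + 1):
        fragment = transcript_phonemes[start:start + n]
        # SequenceMatcher.ratio() is a stand-in for the quality score
        score = SequenceMatcher(None, search_phonemes, fragment).ratio()
        if score > threshold:
            matches.append((start, fragment, score))
    return matches
```

A real implementation would also score fragments slightly shorter or longer than the query, to absorb phoneme insertions and deletions in the transcript.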

    UNIFIED SPEECH REPRESENTATION LEARNING
    Invention Publication

    Publication No.: US20230368782A1

    Publication Date: 2023-11-16

    Application No.: US18217888

    Filing Date: 2023-07-03

    Abstract: Systems and methods are provided for training a machine learning model to learn speech representations. Labeled speech data, or both labeled and unlabeled data sets, are applied to a feature extractor of the machine learning model to generate latent speech representations. The latent speech representations are applied to a quantizer to generate quantized latent speech representations and to a transformer context network to generate contextual representations. Each contextual representation is aligned with a phoneme label to generate phonetically aware contextual representations, and the quantized latent representations are aligned with phoneme labels to generate phonetically aware latent speech representations. During these alignments, a randomly chosen subset of the contextual representations is replaced with quantized latent speech representations, and the phonetically aware latent speech representations are aligned to the contextual representations using supervised learning.
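The random-replacement step can be illustrated with a small NumPy sketch; shapes, names, and the replacement probability are assumptions, and the real representations would come from the feature extractor, quantizer, and transformer context network rather than toy arrays.

```python
import numpy as np

def mix_in_quantized(contextual, quantized, swap_prob=0.15, seed=0):
    """Randomly replace a subset of the contextual representations (rows,
    one per time step) with the quantized latent representations."""
    rng = np.random.default_rng(seed)
    mask = rng.random(contextual.shape[0]) < swap_prob  # True => replace row
    mixed = np.where(mask[:, None], quantized, contextual)
    return mixed, mask
```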

    Entity resolution using acoustic data

    Publication No.: US11817090B1

    Publication Date: 2023-11-14

    Application No.: US16712394

    Filing Date: 2019-12-12

    Abstract: A phonetic search system may pass phonetic information from an automatic speech recognition (ASR) system to a natural language understanding (NLU) system for the latter to leverage when performing entity resolution in the presence of ambiguous interpretations. The ASR system may include an acoustic model and a language model. The acoustic model can process audio data to generate hypotheses that can be mapped to acoustic data; i.e., one or more acoustic units such as phonemes. The language model can process the acoustic units to generate text data representing possible transcriptions of the audio data. ASR/NLU systems may have difficulty interpreting speech when confronted with, for example, homographs, which are words that are spelled the same, but have different meanings. When uncertainty in the final transcription is high, the system can leverage the acoustic data to improve the accuracy of entity resolution.
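A toy sketch of phonetically informed entity resolution for homographs follows; `difflib`'s `ratio()` stands in for whatever acoustic comparison the system actually performs, and the candidate schema is hypothetical.

```python
from difflib import SequenceMatcher

def resolve_entity(acoustic_phonemes, candidates):
    """Among ambiguous interpretations (e.g. homographs), pick the entity
    whose stored pronunciation best matches the acoustic phonemes that the
    ASR system passed along.

    candidates: mapping of entity id -> phoneme list (hypothetical schema).
    """
    return max(
        candidates,
        key=lambda eid: SequenceMatcher(None, acoustic_phonemes,
                                        candidates[eid]).ratio(),
    )
```

For example, the homograph "bass" resolves differently depending on whether the utterance sounded like the fish or the instrument.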

    Music cover identification with lyrics for search, compliance, and licensing

    Publication No.: US11816151B2

    Publication Date: 2023-11-14

    Application No.: US16875927

    Filing Date: 2020-05-15

    Inventor: Erling Wold

    Abstract: Embodiments identify an unidentified media content item as a cover of a known media content item using lyrical content. In an example, a processing device receives an unidentified media content item and determines lyrical content associated with it. The processing device then determines a lyrical similarity between that lyrical content and additional lyrical content associated with a known media content item from a plurality of known media content items. The processing device then identifies the unidentified media content item as a cover of the known media content item based at least in part on the lyrical similarity, resulting in an identified cover media content item.
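The lyrical-similarity comparison might be sketched as below. The patent does not specify a measure, so Jaccard similarity over word sets is used purely for illustration, and all names and the threshold are hypothetical.

```python
def lyric_similarity(lyrics_a, lyrics_b):
    """Jaccard similarity over lowercase word sets (a stand-in for the
    unspecified lyrical-similarity measure)."""
    a, b = set(lyrics_a.lower().split()), set(lyrics_b.lower().split())
    return len(a & b) / len(a | b) if (a or b) else 0.0

def identify_cover(unknown_lyrics, known_items, threshold=0.6):
    """Return the id of the known item whose lyrics are most similar to the
    unknown item's lyrics, or None if nothing clears the threshold."""
    best_id = max(known_items,
                  key=lambda k: lyric_similarity(unknown_lyrics, known_items[k]))
    if lyric_similarity(unknown_lyrics, known_items[best_id]) >= threshold:
        return best_id
    return None
```

A production system would likely normalize the lyrics (punctuation, repeated choruses) and weight rare words more heavily than a plain set intersection does.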

    End-to-End Streaming Keyword Spotting
    Invention Publication

    Publication No.: US20230298576A1

    Publication Date: 2023-09-21

    Application No.: US18322207

    Filing Date: 2023-05-23

    Applicant: Google LLC

    Abstract: A method for training hotword detection includes receiving a training input audio sequence including a sequence of input frames that define a hotword that initiates a wake-up process on a device. The method also includes feeding the training input audio sequence into an encoder and a decoder of a memorized neural network. Each of the encoder and the decoder of the memorized neural network include sequentially-stacked single value decomposition filter (SVDF) layers. The method further includes generating a logit at each of the encoder and the decoder based on the training input audio sequence. For each of the encoder and the decoder, the method includes smoothing each respective logit generated from the training input audio sequence, determining a max pooling loss from a probability distribution based on each respective logit, and optimizing the encoder and the decoder based on all max pooling losses associated with the training input audio sequence.
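As a rough sketch of one SVDF stage's inference path, assuming each layer projects input frames to scalars with a feature filter and then filters those scalars over a finite memory (the smoothing, max pooling loss, and encoder/decoder stacking from the claim are omitted; names hypothetical):

```python
import numpy as np

def svdf_layer(frames, feature_filter, time_filter):
    """Rank-1 SVDF stage: project each input frame to a scalar with the
    feature filter, then filter the scalar sequence over a memory whose
    size is the length of the time filter."""
    projected = frames @ feature_filter              # one scalar per frame
    # reversing the kernel turns np.convolve into a sliding correlation
    return np.convolve(projected, time_filter[::-1], mode="valid")
```

The rank-1 factorization is what makes the layer cheap enough to run in a streaming, always-on wake-up path.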

    Scalable Model Specialization Framework for Speech Model Personalization

    Publication No.: US20230298574A1

    Publication Date: 2023-09-21

    Application No.: US18184630

    Filing Date: 2023-03-15

    Applicant: Google LLC

    CPC classification number: G10L15/16 G10L15/063 G10L15/02 G10L2015/025

    Abstract: A method for speech conversion includes obtaining a speech conversion model configured to convert input utterances of human speech directly into corresponding output utterances of synthesized speech. The method further includes receiving a speech conversion request including input audio data corresponding to an utterance spoken by a target speaker associated with atypical speech and a speaker identifier uniquely identifying the target speaker. The method includes activating, using the speaker identifier, a particular sub-model for biasing the speech conversion model to recognize a type of the atypical speech associated with the target speaker identified by the speaker identifier. The method includes converting, using the speech conversion model biased by the activated particular sub-model, the input audio data corresponding to the utterance spoken by the target speaker associated with atypical speech into output audio data corresponding to a synthesized canonical fluent speech representation of the utterance spoken by the target speaker.
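The speaker-keyed sub-model activation might look like the following dispatch sketch; the interfaces are entirely hypothetical, and the real sub-models would be learned biasing parameters for the conversion network rather than plain values.

```python
class BiasedSpeechConverter:
    """Toy dispatcher: activate a per-speaker sub-model by speaker id and
    use it to bias the base conversion model (interfaces hypothetical)."""

    def __init__(self, base_convert, sub_models):
        self.base_convert = base_convert  # callable: (audio, bias) -> audio
        self.sub_models = sub_models      # speaker id -> biasing parameters

    def convert(self, audio, speaker_id):
        bias = self.sub_models.get(speaker_id)
        if bias is None:
            raise KeyError(f"no sub-model registered for speaker {speaker_id!r}")
        return self.base_convert(audio, bias)
```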
