Unsupervised Parallel Tacotron Non-Autoregressive and Controllable Text-To-Speech

    Publication No.: US20220301543A1

    Publication date: 2022-09-22

    Application No.: US17326542

    Filing date: 2021-05-21

    Applicant: Google LLC

    IPC classification: G10L13/08, G10L13/04

    Abstract: A method for training a non-autoregressive TTS model includes obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding. The method also includes using a duration model network to predict a phoneme duration for each phoneme represented by the encoded text sequence. Based on the predicted phoneme durations, the method also includes learning an interval representation and an auxiliary attention context representation. The method also includes upsampling, using the interval representation and the auxiliary attention context representation, the sequence representation into an upsampled output specifying a number of frames. The method also includes generating, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence. The method also includes determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence and training the TTS model based on the final spectrogram loss.
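
    Illustrative sketch: the abstract describes a learned upsampling driven by an interval representation and an auxiliary attention context. The Python below is not that method. It swaps in the much simpler idea of repeating each token's vector for its predicted number of frames, only to show how per-phoneme durations map a token sequence onto a frame sequence. The names `token_intervals` and `upsample` are hypothetical, and durations are assumed to be already rounded to whole frames.

```python
from itertools import accumulate


def token_intervals(durations):
    """Return the (start_frame, end_frame) interval for each token.

    Assumption: durations are non-negative integers counted in frames.
    """
    ends = list(accumulate(durations))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


def upsample(sequence, durations):
    """Repeat each token representation for the frames in its interval.

    This is plain repetition, a stand-in for the learned upsampling
    named in the abstract, not an implementation of it.
    """
    if len(sequence) != len(durations):
        raise ValueError("need exactly one duration per token")
    frames = []
    for vector, (start, end) in zip(sequence, token_intervals(durations)):
        frames.extend([vector] * (end - start))
    return frames


# Three tokens with durations of 2, 1 and 3 frames give 6 output frames.
frames = upsample(["a", "b", "c"], [2, 1, 3])
```

    The output length equals the sum of the durations, which is the sense in which the upsampled output "specifies a number of frames".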

    Notification terminal with text-to-speech amplifier

    Publication No.: US11430305B2

    Publication date: 2022-08-30

    Application No.: US16945426

    Filing date: 2020-07-31

    Inventor: Scott Stogel

    Abstract: A mass notification terminal may have a data parser and decoder connected to a communications terminal. Announcements may be transmitted to the communications terminal in the form of linguistic symbols and commands using low bandwidth and low power protocol transmissions. Push transmissions conserve bandwidth. An abstraction of an audio announcement may be transmitted for use with a speech synthesizer. The abstraction may be linguistic symbols such as phonemes or text, or may identify pre-stored clips. The system may provide announcement confirmation. The system may take advantage of communication protocols that have message size limitations. The announcements may be sent in one or more message transmissions. When an announcement is composed of multiple messages, using message sequence numbers and announcement identifications may facilitate grouping and arranging of the messages that make up the announcement.
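
    Illustrative sketch: the grouping and arranging of multi-message announcements could look roughly like the Python below. The `Message` fields, the `total` count carried in each message, and the `assemble` function are assumptions made for illustration; the abstract names only sequence numbers and announcement identifications.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    announcement_id: str  # identifies the announcement this part belongs to
    sequence: int         # 1-based position within the announcement
    total: int            # assumed field: number of parts in the announcement
    payload: str          # e.g. text or phoneme symbols for the synthesizer


def assemble(messages):
    """Group parts by announcement id and join the complete ones in order."""
    groups = defaultdict(dict)
    for msg in messages:
        # A retransmitted part simply overwrites the earlier copy.
        groups[msg.announcement_id][msg.sequence] = msg

    complete = {}
    for ann_id, parts in groups.items():
        total = next(iter(parts.values())).total
        if all(i in parts for i in range(1, total + 1)):
            complete[ann_id] = "".join(
                parts[i].payload for i in range(1, total + 1)
            )
    return complete


# Parts arrive out of order; announcement "b7" is still missing part 2.
received = [
    Message("a1", 2, 2, "drill."),
    Message("b7", 1, 2, "Shelter "),
    Message("a1", 1, 2, "Fire "),
]
ready = assemble(received)
```

    Only announcements with every part present are returned, so an incomplete announcement is held back rather than passed to the speech synthesizer.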

    Low-latency intelligent automated assistant

    Publication No.: US11380310B2

    Publication date: 2022-07-05

    Application No.: US16998786

    Filing date: 2020-08-20

    Applicant: Apple Inc.

    Abstract: Systems and processes for operating a digital assistant are provided. In an example process, low-latency operation of a digital assistant is provided. In this example, natural language processing, task flow processing, dialogue flow processing, speech synthesis, or any combination thereof can be at least partially performed while awaiting detection of a speech end-point condition. Upon detection of a speech end-point condition, results obtained from performing the operations can be presented to the user. In another example, robust operation of a digital assistant is provided. In this example, task flow processing by the digital assistant can include selecting a candidate task flow from a plurality of candidate task flows based on determined task flow scores. The task flow scores can be based on speech recognition confidence scores, intent confidence scores, flow parameter scores, or any combination thereof. The selected candidate task flow is executed and the corresponding results are presented to the user.
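
    Illustrative sketch: selecting a candidate task flow by score might be written as below. The abstract does not say how the three score types are combined, so the weighted sum, the `CandidateTaskFlow` class, and the function names are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class CandidateTaskFlow:
    name: str
    asr_confidence: float     # speech recognition confidence score
    intent_confidence: float  # intent confidence score
    parameter_score: float    # flow parameter score


def task_flow_score(candidate, weights=(1.0, 1.0, 1.0)):
    """Combine the three scores with a weighted sum (an assumed rule)."""
    w_asr, w_intent, w_param = weights
    return (
        w_asr * candidate.asr_confidence
        + w_intent * candidate.intent_confidence
        + w_param * candidate.parameter_score
    )


def select_task_flow(candidates):
    """Return the candidate task flow with the highest combined score."""
    if not candidates:
        raise ValueError("no candidate task flows to choose from")
    return max(candidates, key=task_flow_score)


candidates = [
    CandidateTaskFlow("set_timer", 0.9, 0.8, 0.5),
    CandidateTaskFlow("play_music", 0.9, 0.6, 1.0),
]
chosen = select_task_flow(candidates)
```

    Here the second candidate wins because its higher flow parameter score outweighs its lower intent confidence under equal weights.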

    Method, apparatus, device and storage medium for switching voice role

    Publication No.: US11302302B2

    Publication date: 2022-04-12

    Application No.: US16038861

    Filing date: 2018-07-18

    Inventors: Yu Wang, Bo Xie

    Abstract: Embodiments of the present disclosure disclose a method, apparatus, device, and storage medium for switching a voice role. The method includes: recognizing an instruction of switching a voice role input by a user, and determining a target voice role corresponding to the instruction of switching the voice role; switching a current voice role of a smart terminal to the target voice role, different voice roles having different role attributes, and a role attribute including a role utterance attribute; generating interactive response information corresponding to an interactive voice, based on the interactive voice input by the user and a role utterance attribute of the target voice role; and providing a response voice corresponding to the interactive response information to the user. The embodiments of the present disclosure enable different voice roles to have different role utterance attributes, so that each voice role has a distinct sense of character.

    IMAGE FORMING APPARATUS AND CONTROL METHOD FOR IMAGE FORMING APPARATUS

    Publication No.: US20220086302A1

    Publication date: 2022-03-17

    Application No.: US17245712

    Filing date: 2021-04-30

    Inventor: Hiroyuki KATO

    Abstract: An image forming apparatus includes an image forming unit, a machine translation processing unit, a speech synthesis processing unit, and a processor. The image forming unit is configured to print image data. The machine translation processing unit is configured to acquire translated text data as a result of machine translation processing based on untranslated text data generated from a scanned document image. The speech synthesis processing unit is configured to generate speech data based on the translated text data. The processor determines whether to execute printing or generate speech data based on a preset output method setting. When executing printing, the processor generates translation document image data based on the translated text data and the document image and causes the image forming unit to print the translation document image data. When generating speech data, the processor causes the speech synthesis processing unit to execute speech synthesis processing.

    SYSTEM AND METHOD FOR SYNTHESIZING PHOTO-REALISTIC VIDEO OF A SPEECH

    Publication No.: US20220084273A1

    Publication date: 2022-03-17

    Application No.: US17019203

    Filing date: 2020-09-12

    Abstract: A system and a method for obtaining a photo-realistic video from a text. The method includes: providing the text and an image of a talking person; synthesizing a speech audio from the text; extracting an acoustic feature from the speech audio by an acoustic feature extractor; and generating the photo-realistic video from the acoustic feature and the image by a video generation neural network. The video generation neural network is pre-trained by: providing a training video and a training image; extracting a training acoustic feature from training audio of the training video by the acoustic feature extractor; generating video frames from the training image and the training acoustic feature by the video generation neural network; and comparing the generated video frames with ground truth video frames using a generative adversarial network (GAN). The ground truth video frames correspond to the training video frames.

    Artificial intelligence-based text-to-speech system and method

    Publication No.: US11244669B2

    Publication date: 2022-02-08

    Application No.: US16446833

    Filing date: 2019-06-20

    Abstract: A technique improves training and speech quality of a text-to-speech (TTS) system having an artificial intelligence, such as a neural network. The TTS system is organized as a front-end subsystem and a back-end subsystem. The front-end subsystem is configured to provide analysis and conversion of text into input vectors, each having at least a base frequency (f0), a phoneme duration, and a phoneme sequence that is processed by a signal generation unit of the back-end subsystem. The signal generation unit includes the neural network interacting with a pre-existing knowledgebase of phonemes to generate audible speech from the input vectors. The technique applies an error signal from the neural network to correct imperfections of the pre-existing knowledgebase of phonemes to generate audible speech signals. A back-end training system is configured to train the signal generation unit by applying psychoacoustic principles to improve quality of the generated audible speech signals.

    METHOD AND DEVICE FOR OUTPUTTING AN AUDIBLE VOICE MESSAGE IN AN ELEVATOR SYSTEM

    Publication No.: US20220036875A1

    Publication date: 2022-02-03

    Application No.: US17309436

    Filing date: 2019-11-06

    Applicant: Inventio AG

    Inventor: Stefano Carriero

    Abstract: A method and a device for outputting an audible voice message in an elevator system include at least the following steps: transmitting the content of the voice message as a text file to be output via the Internet to a web-based text-to-speech service provider; receiving an audio file from the service provider via the Internet, the audio file having been created based upon the transmitted text file to be output; and outputting the audio file in the elevator system as the audible voice message. If necessary, the text file to be output can be obtained by translating a source language text file into a target language beforehand with the aid of a translation service provider. The targeted use of online service providers allows the outlay needed to realize voice announcements in an elevator system at different use locations with different languages to be greatly reduced.
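
    Illustrative sketch: the flow of optional translation, remote synthesis, and playback can be outlined as below. The abstract names no specific provider or API, so `translate`, `synthesize`, and `play` are hypothetical callables supplied by the caller; the fakes in the usage example stand in for the online services and the elevator's audio output.

```python
def announce(text, synthesize, play, translate=None, target_language=None):
    """Optionally translate the text, fetch audio for it, then play it.

    Assumptions: `translate(text, language)` returns translated text,
    `synthesize(text)` returns audio bytes from a web-based TTS service,
    and `play(audio)` outputs the audio in the elevator.
    """
    if translate is not None and target_language is not None:
        text = translate(text, target_language)
    audio = synthesize(text)
    play(audio)
    return audio


# Fake services, for illustration only.
def fake_translate(text, language):
    return f"[{language}] {text}"


def fake_synthesize(text):
    return text.encode("utf-8")


played = []
audio = announce(
    "Third floor",
    synthesize=fake_synthesize,
    play=played.append,
    translate=fake_translate,
    target_language="de",
)
```

    Passing the services in as arguments keeps the elevator-side logic independent of whichever provider is used at a given location.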