-
Publication No.: US20220301543A1
Publication date: 2022-09-22
Application No.: US17326542
Filing date: 2021-05-21
Applicant: Google LLC
Inventors: Isaac Elias, Byungha Chun, Jonathan Shen, Ye Jia, Yu Zhang, Yonghui Wu
Abstract: A method for training a non-autoregressive TTS model includes obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding. The method also includes using a duration model network to predict a phoneme duration for each phoneme represented by the encoded text sequence. Based on the predicted phoneme durations, the method also includes learning an interval representation and an auxiliary attention context representation. The method also includes upsampling, using the interval representation and the auxiliary attention context representation, the sequence representation into an upsampled output specifying a number of frames. The method also includes generating, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence. The method also includes determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence and training the TTS model based on the final spectrogram loss.
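The upsampling step in the abstract above can be illustrated with a deliberately simplified stand-in. The abstract describes learned interval and auxiliary attention context representations; the sketch below instead repeats each phoneme-level vector for its predicted number of frames, which is an assumed simplification and not the claimed method. The function name `upsample_by_duration` and the integer frame counts are hypothetical.

```python
from typing import List, Sequence


def upsample_by_duration(
    states: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each phoneme-level vector once per predicted frame.

    Simplified illustration only: a learned upsampler would blend
    neighbouring phonemes rather than copy vectors verbatim.
    """
    if len(states) != len(durations):
        raise ValueError("need exactly one duration per phoneme vector")
    frames: List[List[float]] = []
    for vector, n_frames in zip(states, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector so that frames do not share list objects.
        frames.extend(list(vector) for _ in range(n_frames))
    return frames
```

The output length equals the sum of the durations, which corresponds to the "number of frames" that the upsampled output specifies.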
-
Publication No.: US11437017B2
Publication date: 2022-09-06
Application No.: US16937342
Filing date: 2020-07-23
Inventors: Jeffrey Owen Kephart, Hui Su, Maira Gatti de Bayser, Melina de Vasconcelos Alberio Guerra, Rahul Divekar, Matthew Peveler, Xiangyang Mou, Lisha Chen
Abstract: Human speech signals that are uttered within an environment are transcribed; the environment includes one or more avatars representing one or more software agents; the human speech signals are directed to at least one of the avatars. At least one non-speech behavioral trace is obtained within the environment; the trace is representative of non-speech behavior directed to the at least one of the avatars. The transcribed human speech signals and the at least one non-speech behavioral trace are forwarded to the one or more software agents. A proposed act is obtained from at least one of the agents; responsive thereto, a command is issued to cause the avatar corresponding to the software agent from which the proposed act is obtained to emit synthesized speech and to act visually in accordance with the proposed act.
-
Publication No.: US11430305B2
Publication date: 2022-08-30
Application No.: US16945426
Filing date: 2020-07-31
Inventor: Scott Stogel
Abstract: A mass notification terminal may have a data parser and decoder connected to a communications terminal. Announcements may be transmitted to the communications terminal in the form of linguistic symbols and commands using low bandwidth and low power protocol transmissions. Push transmissions conserve bandwidth. An abstraction of an audio announcement may be transmitted for use with a speech synthesizer. The abstraction may be linguistic symbols such as phonemes or text, or may identify pre-stored clips. The system may provide announcement confirmation. The system may take advantage of communication protocols that have message size limitations. The announcements may be sent in one or more message transmissions. When an announcement is composed of multiple messages, using message sequence numbers and announcement identifications may facilitate grouping and arranging of the messages that make up the announcement.
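The grouping and arranging of multi-message announcements described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Message` fields (`announcement_id`, `sequence`, `total`, `payload`), the 0-based sequence numbering, and the `assemble` function are all hypothetical and are not taken from the patent.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Message(NamedTuple):
    announcement_id: str  # hypothetical: identifies the announcement
    sequence: int         # hypothetical: 0-based position in the announcement
    total: int            # hypothetical: number of messages in the announcement
    payload: str          # text, phoneme symbols, or a clip identifier


def assemble(messages: Iterable[Message]) -> Dict[str, str]:
    """Group messages by announcement and join the complete ones in order."""
    parts: Dict[str, Dict[int, str]] = defaultdict(dict)
    totals: Dict[str, int] = {}
    for msg in messages:
        # Keying by sequence number makes duplicate deliveries harmless.
        parts[msg.announcement_id][msg.sequence] = msg.payload
        totals[msg.announcement_id] = msg.total
    complete: Dict[str, str] = {}
    for ann_id, by_seq in parts.items():
        if len(by_seq) == totals[ann_id]:  # every fragment has arrived
            complete[ann_id] = "".join(by_seq[i] for i in sorted(by_seq))
    return complete
```

Announcements with missing fragments are left out of the result, so a caller could hold them back until the remaining messages arrive.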
-
Publication No.: US11380310B2
Publication date: 2022-07-05
Application No.: US16998786
Filing date: 2020-08-20
Applicant: Apple Inc.
Inventors: Alejandro Acero, Hepeng Zhang
IPC classification: G10L15/18, G10L15/22, G10L15/30, G10L25/78, G10L13/04, G06F3/16, G10L15/183, G10L25/87
Abstract: Systems and processes for operating a digital assistant are provided. In an example process, low-latency operation of a digital assistant is provided. In this example, natural language processing, task flow processing, dialogue flow processing, speech synthesis, or any combination thereof can be at least partially performed while awaiting detection of a speech end-point condition. Upon detection of a speech end-point condition, results obtained from performing the operations can be presented to the user. In another example, robust operation of a digital assistant is provided. In this example, task flow processing by the digital assistant can include selecting a candidate task flow from a plurality of candidate task flows based on determined task flow scores. The task flow scores can be based on speech recognition confidence scores, intent confidence scores, flow parameter scores, or any combination thereof. The selected candidate task flow is executed and corresponding results are presented to the user.
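Selecting a candidate task flow by score, as described above, might look like the sketch below. The abstract says only that scores can be based on the three kinds of confidence values; the weighted sum, the default weights, and the names `CandidateFlow`, `flow_score`, and `select_flow` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class CandidateFlow:
    name: str
    asr_confidence: float     # speech recognition confidence score
    intent_confidence: float  # intent confidence score
    parameter_score: float    # flow parameter score


def flow_score(
    flow: CandidateFlow,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Combine the three scores; a weighted sum is an assumed choice."""
    w_asr, w_intent, w_param = weights
    return (
        w_asr * flow.asr_confidence
        + w_intent * flow.intent_confidence
        + w_param * flow.parameter_score
    )


def select_flow(candidates: Sequence[CandidateFlow]) -> CandidateFlow:
    """Return the candidate task flow with the highest combined score."""
    if not candidates:
        raise ValueError("at least one candidate task flow is required")
    return max(candidates, key=flow_score)
```

Keeping the combination rule in a separate function means the selection logic stays unchanged if a different scoring formula is substituted.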
-
Publication No.: US11302302B2
Publication date: 2022-04-12
Application No.: US16038861
Filing date: 2018-07-18
IPC classification: G10L13/033, G10L15/07, G10L15/22, G06F3/16, G10L13/04
Abstract: Embodiments of the present disclosure disclose a method, apparatus, device, and storage medium for switching a voice role. The method includes: recognizing an instruction of switching a voice role input by a user, and determining a target voice role corresponding to the instruction of switching the voice role; switching a current voice role of a smart terminal to the target voice role, different voice roles having different role attributes, and a role attribute including a role utterance attribute; generating interactive response information corresponding to an interactive voice, based on the interactive voice input by the user and a role utterance attribute of the target voice role; and providing a response voice corresponding to the interactive response information to the user. The embodiments of the present disclosure enable different voice roles to have different role utterance attributes, so that the voice role has a role sense.
-
Publication No.: US20220086302A1
Publication date: 2022-03-17
Application No.: US17245712
Filing date: 2021-04-30
Inventor: Hiroyuki KATO
Abstract: An image forming apparatus includes an image forming unit, a machine translation processing unit, a speech synthesis processing unit, and a processor. The image forming unit is configured to print image data. The machine translation processing unit is configured to acquire translated text data as a result of machine translation processing based on untranslated text data generated from a scanned document image. The speech synthesis processing unit is configured to generate speech data based on the translated text data. The processor determines whether to execute printing or generate speech data based on a preset output method setting. When executing printing, the processor generates translation document image data based on the translated text data and the document image and causes the image forming unit to print the translation document image data. When generating speech data, the processor causes the speech synthesis processing unit to execute speech synthesis processing.
-
Publication No.: US20220084273A1
Publication date: 2022-03-17
Application No.: US17019203
Filing date: 2020-09-12
Abstract: A system and a method for obtaining a photo-realistic video from a text. The method includes: providing the text and an image of a talking person; synthesizing a speech audio from the text; extracting an acoustic feature from the speech audio by an acoustic feature extractor; and generating the photo-realistic video from the acoustic feature and the image by a video generation neural network. The video generation neural network is pre-trained by: providing a training video and a training image; extracting a training acoustic feature from training audio of the training video by the acoustic feature extractor; generating video frames from the training image and the training acoustic feature by the video generation neural network; and comparing the generated video frames with ground truth video frames using a generative adversarial network (GAN). The ground truth video frames correspond to the training video frames.
-
Publication No.: US11249774B2
Publication date: 2022-02-15
Application No.: US17013394
Filing date: 2020-09-04
Applicant: Facebook, Inc.
IPC classification: H04L29/08, H04L12/26, G06F9/451, G10L15/18, G10L15/183, G10L15/22, G06F16/338, G06F16/332, G06F16/33, G06N20/00, G06F16/9535, G06Q50/00, G06F16/176, G10L15/06, G10L15/16, G06F3/01, G06F16/9032, G06F16/2457, H04L12/58, G06F3/16, G06K9/00, G06K9/62, G06N3/08, G10L15/26, G06F16/9038, G06F16/904, G06F40/30, G06F40/40, G06F16/22, G06F16/23, G06F7/14, H04L12/28, H04L12/24, H04W12/08, G10L15/07, G10L17/22, G06N3/00, G10L17/06, G06F16/248, G06F16/438, G06F16/951, G06F16/242, G06F16/2455, G10L15/02, G06F16/903, G06F40/205, G10L15/187, G06F16/28, G10L13/00, G10L13/04
Abstract: In one embodiment, a method includes initiating a communication session with a second client system associated with a second user via a communication network, wherein the communication session is initiated in a first modality, receiving a ping to the first client system from the communication network to evaluate available bandwidth on the communication network, estimating, by the first client system, an amount of bandwidth available on the communication network for use by the first client system, determining, by the first client system, that the amount of bandwidth available on the communication network for use by the first client system is insufficient for the first modality, and switching the communication session with the second client system to a second modality by the first client system, wherein the second modality uses less bandwidth than the first modality.
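The switch to a lower-bandwidth modality described above can be sketched as a simple lookup. The modality names, the bandwidth figures in `MODALITY_KBPS`, and the function `choose_modality` are illustrative assumptions; the abstract specifies none of them, and it does not say how the bandwidth estimate is obtained.

```python
# Assumed, illustrative bandwidth needs in kbit/s; not taken from the patent.
MODALITY_KBPS = {"video": 1500, "audio": 64, "text": 1}


def choose_modality(current: str, estimated_kbps: float) -> str:
    """Keep the current modality if it fits the estimate, else step down."""
    needed = MODALITY_KBPS[current]
    if needed <= estimated_kbps:
        return current  # enough bandwidth for the current modality
    # Modalities that are cheaper than the current one and fit the estimate.
    fitting = [
        name
        for name, kbps in MODALITY_KBPS.items()
        if kbps < needed and kbps <= estimated_kbps
    ]
    if fitting:
        # Prefer the richest modality that still fits.
        return max(fitting, key=MODALITY_KBPS.get)
    # Nothing fits: fall back to the cheapest modality overall.
    return min(MODALITY_KBPS, key=MODALITY_KBPS.get)
```

For instance, a session started in `"video"` with an estimate of 100 kbit/s would move to `"audio"` under these assumed figures.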
-
Publication No.: US11244669B2
Publication date: 2022-02-08
Application No.: US16446833
Filing date: 2019-06-20
Applicant: Telepathy Labs, Inc.
Inventors: Martin Reber, Vijeta Avijeet
IPC classification: G10L25/30, G10L13/08, G06K9/62, G06N5/02, G06N3/02, G10L19/00, G10L13/04, G06N3/04, G06N3/08
Abstract: A technique improves training and speech quality of a text-to-speech (TTS) system having an artificial intelligence, such as a neural network. The TTS system is organized as a front-end subsystem and a back-end subsystem. The front-end subsystem is configured to provide analysis and conversion of text into input vectors, each having at least a base frequency, f0, a phoneme duration, and a phoneme sequence that is processed by a signal generation unit of the back-end subsystem. The signal generation unit includes the neural network interacting with a pre-existing knowledgebase of phonemes to generate audible speech from the input vectors. The technique applies an error signal from the neural network to correct imperfections of the pre-existing knowledgebase of phonemes to generate audible speech signals. A back-end training system is configured to train the signal generation unit by applying psychoacoustic principles to improve quality of the generated audible speech signals.
-
Publication No.: US20220036875A1
Publication date: 2022-02-03
Application No.: US17309436
Filing date: 2019-11-06
Applicant: Inventio AG
Inventor: Stefano Carriero
Abstract: A method and a device for outputting an audible voice message in an elevator system include at least the following steps: transmitting the content of the voice message as a text file to be output via the Internet to a web-based text-to-speech service provider; receiving an audio file from the service provider via the Internet, the audio file having been created based upon the transmitted text file to be output; and outputting the audio file in the elevator system as the audible voice message. If necessary, the text file to be output can be obtained by translating a source language text file into a target language beforehand with the aid of a translation service provider. The targeted use of online service providers allows the outlay needed to realize voice announcements in an elevator system at different use locations with different languages to be greatly reduced.