Abstract:
Systems and methods are disclosed for providing non-lexical cues in synthesized speech. An example system includes processor circuitry to generate a breathing cue to enhance speech to be synthesized from text; determine a first insertion point of the breathing cue in the text, wherein the breathing cue is identified by a first tag of a markup language; generate a prosody cue to enhance speech to be synthesized from the text; determine a second insertion point of the prosody cue in the text, wherein the prosody cue is identified by a second tag of the markup language; insert the breathing cue at the first insertion point based on the first tag and the prosody cue at the second insertion point based on the second tag; and trigger a synthesis of the speech from the text, the breathing cue, and the prosody cue.
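The tag-based insertion described above can be sketched as follows. This is an illustrative assumption, not the patented implementation: the tag names (`<breath/>`, `<prosody>`), the offsets, and the helper `insert_cues` are all invented for the example, loosely following SSML-style markup.

```python
def insert_cues(text, cues):
    """Insert markup tags into text at character offsets.

    cues: list of (offset, tag) pairs. Tags are applied from the
    highest offset down so earlier offsets stay valid.
    """
    out = text
    for offset, tag in sorted(cues, reverse=True):
        out = out[:offset] + tag + out[offset:]
    return out

# A breathing cue (first tag) and a prosody cue (second tag, as an
# open/close pair) inserted at their determined insertion points:
augmented = insert_cues(
    "Well, let me think about that.",
    [(5, " <breath/>"),
     (13, '<prosody rate="slow">'),
     (29, "</prosody>")],
)
```

The augmented, tagged text would then be handed to a synthesizer that renders the breathing and prosody tags as audible cues.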
Abstract:
Systems and methods may provide non-lexical cues in synthesized speech. A system may generate response text and a response intent based on user input. Non-lexical cue insertion points are determined based on the characteristics of the text and/or the intent. One or more non-lexical cues are inserted at insertion points to generate augmented text. The augmented text is synthesized into speech using speech units associated with the response text and the inserted response intent.
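One way to read "speech units associated with the response text and the inserted response intent" is a unit database keyed by both word and intent. The sketch below is a minimal assumption of that idea; the database contents, unit IDs, and lookup rules are invented for illustration.

```python
# Assumed unit database keyed by (word, intent); unit IDs are invented.
UNIT_DB = {
    ("hello", "cheerful"): "unit_0412",
    ("hello", "neutral"): "unit_0017",
    ("there", "neutral"): "unit_0031",
}

def select_units(response_text, response_intent, fallback_intent="neutral"):
    """Look up a speech unit for each word, preferring units recorded
    for the response intent and falling back to neutral units."""
    units = []
    for word in response_text.lower().split():
        unit = (UNIT_DB.get((word, response_intent))
                or UNIT_DB.get((word, fallback_intent)))
        units.append(unit)
    return units
```

With this keying, the same response text can be rendered differently depending on the intent derived from the user input.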
Abstract:
The present disclosure describes dynamically adjusting linguistic models for automatic speech recognition based on biometric information to produce a more reliable speech recognition experience. Embodiments include receiving a speech signal, receiving a biometric signal from a biometric sensor implemented at least partially in hardware, determining a linguistic model based on the biometric signal, and processing the speech signal for speech recognition using the linguistic model based on the biometric signal.
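The biometric-driven model selection can be sketched as a simple mapping from a biometric reading to a linguistic model. The thresholds, model names, and the heart-rate choice of biometric are assumptions made for the example, not values from the disclosure.

```python
def choose_linguistic_model(heart_rate_bpm):
    """Map a biometric reading to a linguistic model for recognition.
    Thresholds and model names are illustrative."""
    if heart_rate_bpm >= 120:
        return "stressed_speech_model"   # fast, clipped speech expected
    if heart_rate_bpm >= 90:
        return "elevated_speech_model"
    return "baseline_speech_model"

def recognize(speech_signal, heart_rate_bpm):
    model = choose_linguistic_model(heart_rate_bpm)
    # A real recognizer would decode speech_signal with `model` here;
    # this sketch only reports which model would be used.
    return model
```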
Abstract:
Systems and methods are disclosed for providing non-lexical cues in synthesized speech. Original text is analyzed to determine characteristics of the text and/or to derive or augment an intent (e.g., an intent code). Non-lexical cue insertion points are determined based on the characteristics of the text and/or the intent. One or more non-lexical cues are inserted at insertion points to generate augmented text. The augmented text is synthesized into speech, including converting the non-lexical cues to speech output.
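The pipeline above (analyze text, determine insertion points, insert cues) can be sketched with simple, assumed rules: clause-like punctuation marks candidate boundaries, and the intent code narrows the selection. The `<uh/>` cue marker and the "hesitant" rule are invented for illustration.

```python
import re

def insertion_points(text, intent):
    """Candidate insertion points at clause-like boundaries; the intent
    code narrows the selection (rules are illustrative)."""
    points = [m.end() for m in re.finditer(r"[,;:]", text)]
    if intent == "hesitant":
        points = points[:1]   # a single hesitation cue near the start
    return points

def augment(text, intent, cue="<uh/>"):
    """Insert the cue at each determined point, highest offset first."""
    out = text
    for p in sorted(insertion_points(text, intent), reverse=True):
        out = out[:p] + " " + cue + out[p:]
    return out
```

The augmented text is then synthesized, with the cue markers converted to non-lexical speech output.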
Abstract:
Embodiments are directed to receiving a speech signal representative of audible speech, processing the speech signal to interpret the speech signal by a dialog system implemented at least partially in hardware, determining, by the dialog system, that the speech signal cannot be correctly interpreted, receiving a noise signal representative of audible background noise, identifying a noise level from the noise signal, determining, by the dialog system, that the noise level is too high for the speech signal to be correctly interpreted, and providing, by the dialog system, a message indicating that the noise level is too high for the speech signal to be correctly interpreted.
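The noise-gating behavior can be sketched as follows. The threshold value and the message wording are assumptions for the example; the disclosure only specifies that the system reports when noise is too high for correct interpretation.

```python
NOISE_THRESHOLD_DB = 70.0   # assumed threshold, not from the disclosure

def handle_utterance(interpretation, noise_level_db):
    """Return the interpretation, or explain why interpretation failed."""
    if interpretation is None and noise_level_db > NOISE_THRESHOLD_DB:
        return "The background noise is too high for me to understand you."
    if interpretation is None:
        return "Sorry, I didn't catch that."
    return interpretation
```

Reporting the noise problem, rather than silently failing, tells the user the issue is the environment and not their phrasing.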
Abstract:
Systems and methods are disclosed for providing non-lexical cues in synthesized speech. An example system includes one or more storage devices including instructions and a processor to execute the instructions. The processor is to execute the instructions to: determine a user tone of received user input; generate a response to the user input based on the user tone; and identify a response tone associated with the user tone. The example system also includes a transmitter to communicate the response and the response tone over a network.
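The tone-to-tone association can be sketched as a lookup table. The specific mapping, the `"neutral"` fallback, and the payload shape are assumptions for illustration.

```python
RESPONSE_TONE = {          # mapping is an assumption for illustration
    "angry": "calm",
    "sad": "gentle",
    "excited": "upbeat",
}

def identify_response_tone(user_tone):
    """Pick the response tone associated with a detected user tone."""
    return RESPONSE_TONE.get(user_tone, "neutral")

def build_payload(response_text, user_tone):
    """What a transmitter might send: the response plus its tone."""
    return {"response": response_text,
            "tone": identify_response_tone(user_tone)}
```

The receiving synthesizer would then render the response text using the transmitted tone.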