Abstract:
Disclosed herein are systems, methods, and computer-readable storage media for selecting a speech recognition model in a standardized speech recognition infrastructure. The system receives speech from a user and, if a user-specific supervised speech model associated with the user is available, retrieves the supervised speech model. If the user-specific supervised speech model is unavailable and an unsupervised speech model is available, the system retrieves the unsupervised speech model. If both the user-specific supervised speech model and the unsupervised speech model are unavailable, the system retrieves a generic speech model associated with the user. Next, the system recognizes the received speech from the user with the retrieved model. In one embodiment, the system trains a speech recognition model in a standardized speech recognition infrastructure. In another embodiment, the system handshakes with a remote application in a standardized speech recognition infrastructure.
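The selection logic amounts to a fixed fallback chain from the most user-specific model to the most generic. A minimal sketch in Python, assuming hypothetical in-memory model stores (the abstract does not name a storage API):

```python
# Hypothetical in-memory model stores; the patent does not specify how
# models are indexed or persisted.
SUPERVISED = {"alice": "alice-supervised-v3"}   # user_id -> supervised model
UNSUPERVISED = {"bob": "bob-unsupervised-v1"}   # user_id -> unsupervised model
GENERIC_MODEL = "generic-asr-v1"

def select_model(user_id):
    """Fallback chain from the abstract: supervised -> unsupervised -> generic."""
    if user_id in SUPERVISED:
        return SUPERVISED[user_id]
    if user_id in UNSUPERVISED:
        return UNSUPERVISED[user_id]
    return GENERIC_MODEL

print(select_model("alice"))  # alice-supervised-v3
print(select_model("bob"))    # bob-unsupervised-v1
print(select_model("carol"))  # generic-asr-v1
```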
Abstract:
A task-oriented dialog system determines an endpoint in a user utterance by receiving incremental portions of a user utterance that is provided in real time during a task-oriented communication session between a user and a virtual agent (VA). The task-oriented dialog system recognizes words in the incremental portions using an automated speech recognition (ASR) model and generates semantic information for the incremental portions of the utterance by applying a natural language processing (NLP) model to the recognized words. An acoustic-prosodic signature of the incremental portions of the utterance is generated using an acoustic-prosodic model. The task-oriented dialog system can generate a feature vector that represents the incrementally recognized words, the semantic information, the acoustic-prosodic signature, and corresponding confidence scores of the model outputs. A model is applied to the feature vector to identify a likely endpoint in the user utterance.
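One plausible reading of the feature-assembly step is a simple concatenation of the per-increment model outputs and their confidences. The sketch below is illustrative only; the field names, the toy intent encoding, and the placeholder decision rule are assumptions, not the patented endpoint model:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Increment:
    """Model outputs for one incremental portion of the utterance."""
    words: List[str]       # words recognized by the ASR model
    asr_conf: float        # ASR confidence score
    intent: str            # semantic label from the NLP model
    nlp_conf: float        # NLP confidence score
    prosody: List[float]   # acoustic-prosodic signature (e.g. pitch/energy stats)

def feature_vector(inc: Increment) -> List[float]:
    """Concatenate the signals the abstract lists into one vector."""
    intent_code = (hash(inc.intent) % 1000) / 1000.0  # toy categorical encoding
    return [float(len(inc.words)), inc.asr_conf, intent_code,
            inc.nlp_conf, *inc.prosody]

def is_endpoint(inc: Increment, threshold: float = 0.8) -> bool:
    """Placeholder for the trained model applied to the feature vector."""
    return min(inc.asr_conf, inc.nlp_conf) > threshold  # stand-in decision rule

inc = Increment(["book", "a", "table"], 0.92, "make_reservation", 0.85, [0.1, 0.4])
print(feature_vector(inc), is_endpoint(inc))
```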
Abstract:
A natural language understanding (NLU) system generates in-place annotations for natural language utterances or other types of time-based media based on stand-off annotations. The in-place annotations are associated with particular sub-sequences of an utterance, providing richer information than stand-off annotations, which are associated only with the utterance as a whole. To generate the in-place annotations for an utterance, the NLU system applies an encoder network and a decoder network to obtain attention weights for the various tokens within the utterance. The NLU system disqualifies tokens of the utterance based on their corresponding attention weights, and selects the highest-scoring contiguous sequences of tokens between the disqualified tokens. In-place annotations are associated with the selected sequences.
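The disqualify-and-select step can be read as splitting the token sequence at low-attention tokens and ranking the surviving runs by their summed weight. A minimal sketch under that reading (the threshold and scoring rule are illustrative assumptions):

```python
from typing import List, Tuple

def select_spans(attn: List[float],
                 disqualify_below: float = 0.1) -> List[Tuple[int, int]]:
    """Return contiguous token runs between disqualified tokens,
    highest-scoring first. Spans are half-open (start, end) indices."""
    runs, start = [], None
    for i, w in enumerate(attn):
        if w < disqualify_below:            # disqualified token ends a run
            if start is not None:
                runs.append((start, i))
                start = None
        elif start is None:
            start = i                        # a new run begins
    if start is not None:
        runs.append((start, len(attn)))
    return sorted(runs, key=lambda r: -sum(attn[r[0]:r[1]]))

tokens = ["book", "a", "table", "for", "two", "tonight"]
attn   = [0.30, 0.05, 0.25, 0.08, 0.20, 0.12]
spans = select_spans(attn)
print([(tokens[s:e], round(sum(attn[s:e]), 2)) for s, e in spans])
# [(['two', 'tonight'], 0.32), (['book'], 0.3), (['table'], 0.25)]
```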
Abstract:
Systems and methods are disclosed herein for discerning aspects of user speech to determine user intent and/or other acoustic features of a sound input without the use of an ASR engine. To this end, a processor may receive a sound signal comprising raw acoustic data from a client device and divide the data into acoustic units. The processor feeds the acoustic units through a first machine learning model to obtain a first output and determines a first mapping, using the first output, of each respective acoustic unit to a plurality of candidate representations of the respective acoustic unit. The processor feeds each candidate representation of the plurality through a second machine learning model to obtain a second output, determines a second mapping, using the second output, of each candidate representation to a known condition, and determines a label for the sound signal based on the second mapping.
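A toy end-to-end sketch of the two-stage mapping follows. The "models" here are trivial stand-ins (the abstract does not specify architectures), but the data flow matches the description: units to candidate representations, candidates to a known condition, then a label:

```python
from typing import List

def first_model(unit: List[float]) -> List[str]:
    """Toy stage 1: map an acoustic unit to candidate representations."""
    energy = sum(x * x for x in unit)
    return ["loud", "rising"] if energy > 1.0 else ["soft", "flat"]

def second_model(candidate: str) -> str:
    """Toy stage 2: map a candidate representation to a known condition."""
    return "urgent" if candidate in ("loud", "rising") else "calm"

def label_sound(acoustic_units: List[List[float]]) -> str:
    """Run both mappings and derive a label for the whole signal by vote."""
    votes = {}
    for unit in acoustic_units:
        for candidate in first_model(unit):
            condition = second_model(candidate)
            votes[condition] = votes.get(condition, 0) + 1
    return max(votes, key=votes.get)

print(label_sound([[1.2, 0.4], [0.1, 0.2], [0.9, 0.8]]))  # urgent
```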
Abstract:
A natural language processing system has a hierarchy of user intents related to a domain of interest, the hierarchy having specific intents corresponding to leaf nodes of the hierarchy, and more general intents corresponding to ancestor nodes of the leaf nodes. The system also has a trained understanding model that can classify natural language utterances according to user intent. When the understanding model cannot determine with sufficient confidence that a natural language utterance corresponds to one of the specific intents, the natural language processing system traverses the hierarchy of intents to find a more general user intent that is related to the most applicable specific intent of the utterance and for which there is sufficient confidence. The general intent can then be used to prompt the user with questions applicable to the general intent to obtain the missing information needed for a specific intent.
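One plausible implementation of the traversal aggregates confidence over an ancestor's descendants and climbs until the threshold is met. The hierarchy, threshold, and aggregation scheme below are illustrative assumptions:

```python
# Hypothetical intent hierarchy: child -> parent (None marks the root).
PARENT = {
    "book_flight_oneway": "book_flight",
    "book_flight_return": "book_flight",
    "book_flight": "travel",
    "travel": None,
}

def resolve_intent(scores, threshold=0.7):
    """Start at the best specific intent; climb toward the root until an
    intent's aggregated confidence clears the threshold."""
    intent, conf = max(scores.items(), key=lambda kv: kv[1])
    while intent is not None and conf < threshold:
        intent = PARENT.get(intent)
        # aggregate the confidence of the ancestor and its direct children
        conf = sum(s for i, s in scores.items()
                   if i == intent or PARENT.get(i) == intent)
    return intent

# Neither specific intent is confident alone, but their shared parent is.
print(resolve_intent({"book_flight_oneway": 0.40, "book_flight_return": 0.35}))
# book_flight -> the system can now prompt for the missing specifics
```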
Abstract:
A speech interpretation module interprets the audio of user utterances as sequences of words. To do so, the speech interpretation module parameterizes a literal corpus of expressions by identifying portions of the expressions that correspond to known concepts, and generates a parameterized statistical model from the resulting parameterized corpus. When speech is received, the speech interpretation module uses a hierarchical speech recognition decoder that draws on both the parameterized statistical model and language sub-models that specify how to recognize a sequence of words. The separation of the language sub-models from the statistical model beneficially reduces the size of the literal corpus needed for training, reduces the size of the resulting model, provides more fine-grained interpretation of concepts, and improves computational efficiency by allowing run-time incorporation of the language sub-models.
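Parameterization can be pictured as rewriting concept mentions in the corpus as placeholders that the language sub-models later expand. A small illustration, with hypothetical concept patterns:

```python
import re

# Hypothetical concept sub-models: placeholder -> surface forms it covers.
CONCEPTS = {
    "<DATE>": r"\b(today|tomorrow|monday)\b",
    "<CITY>": r"\b(boston|paris|tokyo)\b",
}

def parameterize(expression: str) -> str:
    """Replace known-concept mentions with placeholders, producing the
    parameterized corpus the statistical model is trained on."""
    out = expression.lower()
    for placeholder, pattern in CONCEPTS.items():
        out = re.sub(pattern, placeholder, out)
    return out

corpus = ["Fly to Boston tomorrow", "Weather in Paris today"]
print([parameterize(e) for e in corpus])
# ['fly to <CITY> <DATE>', 'weather in <CITY> <DATE>']
```

Because each concept is a separate sub-model, new surface forms (say, more city names) can be added at run time without retraining the statistical model, which is the efficiency gain the abstract highlights.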
Abstract:
A system and method are provided for combining active and unsupervised learning for automatic speech recognition. This process reduces the amount of human supervision required for training acoustic and language models and increases performance given the available transcribed and un-transcribed data.
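A common way to realize such a combination, shown below purely as an assumption about the approach rather than the patented procedure, is to route low-confidence recognizer output to human transcribers (active learning) and keep high-confidence output as machine-labeled training data (unsupervised learning):

```python
def split_for_training(utterances, confidence, low=0.4, high=0.9):
    """Route utterances by recognizer confidence; thresholds are illustrative."""
    to_transcribe, self_labeled = [], []
    for utt in utterances:
        c = confidence(utt)
        if c < low:
            to_transcribe.append(utt)   # active learning: ask a human
        elif c > high:
            self_labeled.append(utt)    # unsupervised: trust the ASR output
    return to_transcribe, self_labeled

conf = {"audio_001": 0.2, "audio_002": 0.95, "audio_003": 0.6}
print(split_for_training(list(conf), conf.get))
# (['audio_001'], ['audio_002'])  # audio_003 is left out of both sets
```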
Abstract:
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for approximating responses to a user speech query in voice-enabled search based on metadata that include demographic features of the speaker. A system practicing the method recognizes received speech from a speaker to generate recognized speech, identifies metadata about the speaker from the received speech, and feeds the recognized speech and the metadata to a question-answering engine. Identifying the metadata about the speaker is based on voice characteristics of the received speech. The demographic features can include age, gender, socio-economic group, nationality, and/or region. The metadata identified about the speaker from the received speech can be combined with or override self-reported speaker demographic information.
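The pipeline (recognize, extract voice-derived metadata, merge with self-reported demographics, query the QA engine) can be sketched as below. Every component here is a toy stand-in; the abstract does not specify the recognizer, the classifiers, or the merge policy beyond "combine with or override":

```python
def merge_demographics(self_reported, voice_derived):
    """Voice-derived metadata overrides the self-reported fields it covers."""
    return {**self_reported, **voice_derived}

def qa_engine(question, metadata):
    """Toy question-answering engine conditioned on demographic metadata."""
    if metadata.get("age_group") == "18-24":
        return "Student-oriented answer for: " + question
    return "General answer for: " + question

def answer_query(audio_bytes):
    recognized = "what is the best phone plan"  # stand-in ASR output
    voice_derived = {"age_group": "18-24", "region": "northeast"}  # from voice
    self_reported = {"age_group": "35-44", "gender": "f"}
    metadata = merge_demographics(self_reported, voice_derived)
    return qa_engine(recognized, metadata)

print(answer_query(b"..."))
# Student-oriented answer for: what is the best phone plan
```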
Abstract:
Systems, computer-implemented methods, and tangible computer-readable media for generating a pronunciation model. The method includes identifying a generic model of speech composed of phonemes, identifying a family of interchangeable phonemic alternatives for a phoneme in the generic model of speech, labeling the family of interchangeable phonemic alternatives as referring to the same phoneme, and generating a pronunciation model which substitutes each family for each respective phoneme. In one aspect, the generic model of speech is a vocal tract length normalized acoustic model. Interchangeable phonemic alternatives can represent a same phoneme for different dialectal classes. An interchangeable phonemic alternative can include a string of phonemes.
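The substitution step effectively turns each phoneme into a set of interchangeable alternatives, so one lexicon entry licenses several dialectal pronunciations. A minimal sketch with hypothetical families (note that an alternative may itself be a string of phonemes):

```python
# Hypothetical phoneme families labeled as the same underlying phoneme.
FAMILIES = {
    "aa": ["aa", "ao"],      # e.g. cot/caught variation
    "t":  ["t", "dx"],       # plain vs. flapped /t/
    "uw": ["uw", "y uw"],    # an alternative can be a string of phonemes
}

def expand_pronunciation(phonemes):
    """Substitute each family for its phoneme, enumerating the
    pronunciation variants the model would accept."""
    variants = [[]]
    for p in phonemes:
        alternatives = FAMILIES.get(p, [p])
        variants = [v + [a] for v in variants for a in alternatives]
    return [" ".join(v) for v in variants]

print(expand_pronunciation(["k", "aa", "t"]))
# ['k aa t', 'k aa dx', 'k ao t', 'k ao dx']
```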