Abstract:
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for advanced turn-taking in an interactive spoken dialog system. A system configured according to this disclosure can incrementally process speech prior to completion of the speech utterance, and can communicate partial speech recognition results upon finding particular conditions. A first condition which, if found, allows the system to communicate partial speech recognition results, is that the most recent word found in the partial results is statistically likely to be the termination of the utterance, also known as a terminal node. A second condition is the determination that all search paths within a speech lattice converge to a common node, also known as a pinch node, before branching out again. Upon finding either condition, the system can communicate the partial speech recognition results. Stability and correctness probabilities can also determine which partial results are communicated.
Abstract:
A system and method for integrating incremental speech recognition in dialog systems. An example system configured to practice the method receives incremental speech recognition results of user speech as part of a dialog with a user, and copies a dialog manager operating on the user speech to generate temporary instances of the dialog manager. Then the system evaluates actions the temporary instances of the dialog manager would take based on the incremental speech recognition results, and identifies an action that would advance the dialog and a corresponding temporary instance of the dialog manager. The system can then execute the action in the dialog and optionally replace the dialog manager with the corresponding temporary instance of the dialog manager. The action can include making a turn-taking decision in the dialog, such as whether, what, and when to speak or whether to be silent.
Abstract:
Methods, systems, devices, and media for creating a plan through multimodal search inputs are provided. A first search request comprises a first input received via a first input mode and a second input received via a different second input mode. The second input identifies a geographic area. First search results are displayed based on the first search request and corresponding to the geographic area. Each of the first search results is associated with a geographic location. A selection of one of the first search results is received and added to a plan. A second search request is received after the selection, and second search results are displayed in response to the second search request. The second search results are based on the second search request and correspond to the geographic location of the selected one of the first search results.
Abstract:
A system, method and computer-readable storage devices are disclosed for using targeted clarification (TC) questions in dialog systems in a multimodal virtual agent system (MVA) providing access to information about movies, restaurants, and musical events. In contrast with open-domain spoken systems, the MVA application covers a domain with a fixed set of concepts and uses a natural language understanding (NLU) component to mark concepts in automatically recognized speech. Instead of identifying an error segment, localized error detection (LED) identifies which of the concepts are likely to be present and correct using domain knowledge, automatic speech recognition (ASR), and NLU tags and scores. If at least concept is identified to be present but not correct, the TC component uses this information to generate a targeted clarification question. This approach computes probability distributions of concept presence and correctness for each user utterance, which can apply to automatic learning for clarification policies.
Abstract:
Methods, systems, devices, and media for creating a plan through multimodal search inputs are provided. A multimodal virtual assistant receives a first search request which comprises a geographic area. First search results are displayed in response to the first search request being received. The first search results are based on the first search request and correspond to the geographic area. Each of the first search results is associated with a geographic location. The multimodal virtual assistant receives a selection of one of the first search results, and adds the selected one of the first search results to a plan. A second search request is received after the selection, and second search results are displayed in response to the second search request being received. The second search results are based on the second search request and correspond to the geographic location of the selected one of the first search results.
Abstract:
Systems, methods, and computer-readable storage devices are for an event-driven multi-agent architecture improves via a semi-hierarchical multi-agent reinforcement learning approach. A system receives a user input during a speech dialog between a user and the system. The system then processes the user input, identifying an importance of the user input to the speech dialog based on a user classification and identifying a variable strength turn-taking signal inferred from the user input. An utterance selection agent selects an utterance for replying to the user input based on the importance of the user input, and a turn-taking agent determines whether to output the utterance based on the utterance, and the variable strength turn-taking signal. When the turn-taking agent indicates the utterance should be output, the system selects when to output the utterance.