-
Publication number: US20230410833A1
Publication date: 2023-12-21
Application number: US18131531
Application date: 2023-04-06
Applicant: Amazon Technologies, Inc.
Inventor: Shiva Kumar Sundaram , Chao Wang , Shiv Naga Prasad Vitaladevuni , Spyridon Matsoukas , Arindam Mandal
CPC classification number: G10L25/30 , G10L25/51 , G10L15/02 , G10L15/16 , G10L15/22 , G10L15/30 , G10L25/78 , G10L2015/088
Abstract: A speech-capture device can capture audio data during wakeword monitoring and use the audio data to determine if a user is present near the device, even if no wakeword is spoken. Audio such as speech, human-originating sounds (e.g., coughing, sneezing), or other human-related noises (e.g., footsteps, doors closing) can be used to detect human presence. Audio frames are individually scored as to whether a human presence is detected in the particular audio frames. The scores are then smoothed relative to nearby frames to create a decision for a particular frame. Presence information can then be sent according to a periodic schedule to a remote device to create a presence “heartbeat” that regularly identifies whether a user is detected proximate to the speech-capture device.
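The per-frame scoring and smoothing described in this abstract can be sketched as below. The window size, threshold, and score values are illustrative assumptions, not values from the patent, and the moving-average smoother stands in for whatever smoothing the actual system uses.

```python
def smooth_scores(frame_scores, window=3):
    """Smooth each frame's presence score with a moving average
    over its neighboring frames."""
    smoothed = []
    n = len(frame_scores)
    for i in range(n):
        lo = max(0, i - window)
        hi = min(n, i + window + 1)
        smoothed.append(sum(frame_scores[lo:hi]) / (hi - lo))
    return smoothed

def presence_decisions(frame_scores, threshold=0.5, window=3):
    """Binary human-presence decision per frame, made after smoothing
    so isolated noisy frames do not flip the result."""
    return [s >= threshold for s in smooth_scores(frame_scores, window)]
```

Smoothing first, then thresholding, is what turns noisy per-frame scores into a stable decision that can be reported on a periodic "heartbeat" schedule.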
-
Publication number: US20220189458A1
Publication date: 2022-06-16
Application number: US17584489
Application date: 2022-01-26
Applicant: Amazon Technologies, Inc.
Inventor: Spyridon Matsoukas , Aparna Khare , Vishwanathan Krishnamoorthy , Shamitha Somashekar , Arindam Mandal
Abstract: Systems, methods, and devices for verifying a user are disclosed. A speech-controlled device captures a spoken command, and sends audio data corresponding thereto to a server. The server performs ASR on the audio data to determine ASR confidence data. The server, in parallel, performs user verification on the audio data to determine user verification confidence data. The server may modify the user verification confidence data using the ASR confidence data. Additionally or alternatively, the server may modify the user verification confidence data using at least one of a location of the speech-controlled device within a building, a type of the speech-controlled device, or a geographic location of the speech-controlled device.
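One simple way to modify a verification confidence with an ASR confidence, as the abstract describes, is a weighted blend. This is a minimal sketch; the blending weight and the linear form are assumptions for illustration, not the patented formula.

```python
def adjust_verification_confidence(uv_conf, asr_conf, weight=0.3):
    """Blend user-verification confidence with ASR confidence, so a
    low-confidence transcription pulls the verification score down.
    Both inputs are assumed to lie in [0, 1]."""
    return (1 - weight) * uv_conf + weight * asr_conf
```

With this form, a confident verification paired with an uncertain transcription yields a noticeably lower final score than verification alone would.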
-
Publication number: US20220093101A1
Publication date: 2022-03-24
Application number: US17112520
Application date: 2020-12-04
Applicant: Amazon Technologies, Inc.
Inventor: Prakash Krishnan , Arindam Mandal , Siddhartha Reddy Jonnalagadda , Nikko Strom , Ariya Rastrow , Ying Shi , David Chi-Wai Tang , Nishtha Gupta , Aaron Challenner , Bonan Zheng , Angeliki Metallinou , Vincent Auvray , Minmin Shen
Abstract: A system that is capable of resolving anaphora using timing data received by a local device. A local device outputs audio representing a list of entries. The audio may represent synthesized speech of the list of entries. A user can interrupt the device to select an entry in the list, such as by saying “that one.” The local device can determine an offset time representing the time between when audio playback began and when the user interrupted. The local device sends the offset time and audio data representing the utterance to a speech processing system, which can then use the offset time and stored data to identify which entry on the list was most recently output by the local device when the user interrupted. The system can then resolve anaphora to match that entry and can perform additional processing based on the referred-to item.
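Mapping a barge-in offset back to the list entry being spoken can be sketched as a lookup over per-entry playback start times. The `(start_ms, entity)` representation here is a hypothetical stand-in for whatever stored data the speech processing system keeps.

```python
def entry_at_offset(entries, offset_ms):
    """Given (start_ms, entity) pairs in TTS playback order, return
    the entry whose audio had most recently started when the user
    barged in at offset_ms from the start of playback."""
    current = None
    for start_ms, entity in entries:
        if start_ms <= offset_ms:
            current = entity
        else:
            break
    return current
```

An utterance like "that one" at 2,000 ms would then resolve to whichever entry began playing most recently before that offset.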
-
Publication number: US11200885B1
Publication date: 2021-12-14
Application number: US16219228
Application date: 2018-12-13
Applicant: Amazon Technologies, Inc.
Inventor: Arindam Mandal , Nikko Strom , Angeliki Metallinou , Tagyoung Chung , Dilek Hakkani-Tur , Suranjit Adhikari , Sridhar Yadav Manoharan , Ankita De , Qing Liu , Raefer Christopher Gabriel , Rohit Prasad
IPC: G10L15/22 , G10L21/00 , G10L15/06 , G10L15/18 , G06F16/332
Abstract: A dialog manager receives text data corresponding to a dialog with a user. Entities represented in the text data are identified. Context data relating to the dialog is maintained, which may include prior dialog, prior API calls, user profile information, or other data. Using the text data and the context data, an N-best list of one or more dialog models is selected to process the text data. After processing the text data, the outputs of the N-best models are ranked and a top-scoring output is selected. The top-scoring output may be an API call and/or an audio prompt.
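The final step described above, ranking the N-best model outputs and taking the top-scoring one, reduces to a max over scored candidates. The `(score, output)` pair representation is an assumption made for this sketch.

```python
def select_response(model_outputs):
    """Rank the outputs produced by the N-best dialog models and
    return the top-scoring one. Each element is a (score, output)
    pair; the output might be an API call or an audio prompt."""
    return max(model_outputs, key=lambda pair: pair[0])[1]
```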
-
Publication number: US10964315B1
Publication date: 2021-03-30
Application number: US15639330
Application date: 2017-06-30
Applicant: Amazon Technologies, Inc.
Inventor: Minhua Wu , Sankaran Panchapagesan , Ming Sun , Shiv Naga Prasad Vitaladevuni , Bjorn Hoffmeister , Ryan Paul Thomas , Arindam Mandal
Abstract: An approach to wakeword detection uses an explicit representation of non-wakeword speech in the form of subword (e.g., phonetic monophone) units that do not necessarily occur in the wakeword and that broadly represent general speech. These subword units are arranged in a “background” model, which at runtime essentially competes with the wakeword model such that a wakeword is less likely to be declared as occurring when the input matches the background model well. An HMM may be used with the model to locate possible occurrences of the wakeword. Features are determined from portions of the input corresponding to subword units of the wakeword detected using the HMM. A secondary classifier is then used to process the features to yield a decision of whether the wakeword occurred.
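The two-stage decision in this abstract, model competition followed by a secondary classifier, can be sketched as below. The log-likelihood inputs, margin, and classifier are illustrative assumptions; the real system derives these scores from the HMM decode.

```python
def wakeword_decision(ww_loglik, bg_loglik, subword_features, classifier, margin=1.0):
    """Two-stage wakeword decision. Stage 1: the wakeword model must
    beat the competing background model by a margin, so input that the
    background model explains well is rejected. Stage 2: a secondary
    classifier examines features from the wakeword's subword segments."""
    if ww_loglik - bg_loglik <= margin:
        return False
    return classifier(subword_features)
```

A plausible toy classifier for the sketch is a mean-feature threshold, e.g. `lambda feats: sum(feats) / len(feats) > 0.5`.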
-
Publication number: US10726830B1
Publication date: 2020-07-28
Application number: US16143910
Application date: 2018-09-27
Applicant: Amazon Technologies, Inc.
Inventor: Arindam Mandal , Kenichi Kumatani , Nikko Strom , Minhua Wu , Shiva Sundaram , Bjorn Hoffmeister , Jeremie Lecomte
Abstract: Techniques for speech processing using a deep neural network (DNN) based acoustic model front-end are described. A new modeling approach directly models multi-channel audio data received from a microphone array using a first model (e.g., multi-channel DNN) that takes in raw signals and produces a first feature vector that may be used similarly to beamformed features generated by an acoustic beamformer. A second model (e.g., feature extraction DNN) processes the first feature vector and transforms it to a second feature vector having a lower dimensional representation. A third model (e.g., classification DNN) processes the second feature vector to perform acoustic unit classification and generate text data. These three models may be jointly optimized for speech processing (as opposed to individually optimized for signal enhancement), enabling improved performance despite a reduction in microphones and a reduction in bandwidth consumption during real-time processing.
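The three-model chain above is just function composition: raw multi-channel audio in, a beamformed-like feature out, a lower-dimensional feature next, then a classification. The bodies below are trivial stand-ins (channel averaging, truncation, a threshold) for the actual DNNs; only the pipeline shape is taken from the abstract.

```python
def multi_channel_dnn(raw_channels):
    """Stand-in for the multi-channel DNN: fuse per-channel samples
    into one feature vector (here, by averaging across channels)."""
    n = len(raw_channels)
    return [sum(samples) / n for samples in zip(*raw_channels)]

def feature_extraction_dnn(features, out_dim=2):
    """Stand-in for the feature-extraction DNN: map the fused feature
    to a lower-dimensional representation (here, truncation)."""
    return features[:out_dim]

def classification_dnn(features):
    """Stand-in for the classification DNN: label the feature vector
    (here, a simple threshold on the mean)."""
    return "speech" if sum(features) / len(features) > 0.0 else "non-speech"

def acoustic_front_end(raw_channels):
    """Jointly composed pipeline, mirroring the three-stage front-end."""
    return classification_dnn(feature_extraction_dnn(multi_channel_dnn(raw_channels)))
```

In the patented approach the three stages are trained jointly for the speech-processing objective, rather than each stage being tuned in isolation for signal enhancement.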
-
Publication number: US10679621B1
Publication date: 2020-06-09
Application number: US15927764
Application date: 2018-03-21
Applicant: Amazon Technologies, Inc.
Inventor: Shiva Kumar Sundaram , Minhua Wu , Anirudh Raju , Spyridon Matsoukas , Arindam Mandal , Kenichi Kumatani
IPC: G10L15/22 , G10L15/187 , G10L15/26 , G10L15/30 , H04R3/00 , G10L21/0208 , G06F40/40 , H04W4/02 , G10L21/0216 , G10L15/08
Abstract: Systems and methods for utilizing microphone array information for acoustic modeling are disclosed. Audio data may be received from a device having a microphone array configuration. Microphone configuration data may also be received that indicates the configuration of the microphone array. The microphone configuration data may be utilized as an input vector to an acoustic model, along with the audio data, to generate phoneme data. Additionally, the microphone configuration data may be utilized to train and/or generate acoustic models, select an acoustic model to perform speech recognition with, and/or to improve trigger sound detection.
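Using the microphone configuration as an input vector alongside the audio features can be sketched as a one-hot encoding concatenated onto the acoustic features. The configuration names and the one-hot encoding are assumptions for illustration; the abstract does not specify the encoding.

```python
def mic_config_vector(config, known_configs):
    """One-hot encode the microphone-array configuration against a
    list of known configurations (hypothetical examples)."""
    return [1.0 if config == c else 0.0 for c in known_configs]

def build_model_input(audio_features, config, known_configs):
    """Concatenate acoustic features with the configuration vector
    before feeding the combined input to the acoustic model."""
    return audio_features + mic_config_vector(config, known_configs)
```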
-
Publication number: US09875081B2
Publication date: 2018-01-23
Application number: US14860400
Application date: 2015-09-21
Applicant: Amazon Technologies, Inc.
Inventor: James David Meyers , Shah Samir Pravinchandra , Yue Liu , Arlen Dean , Daniel Miller , Arindam Mandal
IPC: G10L15/22 , G10L15/00 , G06F3/16 , G10L15/26 , G10L15/18 , G10L15/06 , G10L15/32 , G01L21/00 , G10L15/08
CPC classification number: G06F3/167 , G10L15/00 , G10L15/063 , G10L15/1815 , G10L15/22 , G10L15/222 , G10L15/26 , G10L15/32 , G10L2015/088 , G10L2015/223 , G10L2015/226
Abstract: A system may use multiple speech interface devices to interact with a user by speech. All or a portion of the speech interface devices may detect a user utterance and may initiate speech processing to determine a meaning or intent of the utterance. Within the speech processing, arbitration is employed to select one of the multiple speech interface devices to respond to the user utterance. Arbitration may be based in part on metadata that directly or indirectly indicates the proximity of the user to the devices, and the device that is deemed to be nearest the user may be selected to respond to the user utterance.
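Proximity-based arbitration as described above can be sketched as picking the device whose metadata indicates it is nearest the user. The metadata shape and the single `proximity` score (higher meaning closer, e.g. derived from captured signal energy) are assumptions for this sketch.

```python
def arbitrate(device_metadata):
    """Select which of several speech interface devices that heard the
    same utterance should respond: the one whose metadata indicates
    it is nearest the user."""
    return max(device_metadata, key=lambda d: d["proximity"])["device_id"]
```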
-
Publication number: US20170278514A1
Publication date: 2017-09-28
Application number: US15196540
Application date: 2016-06-29
Applicant: AMAZON TECHNOLOGIES, INC.
Inventor: Lambert Mathias , Thomas Kollar , Arindam Mandal , Angeliki Metallinou
CPC classification number: G10L15/22 , G06F17/277 , G06F17/279 , G06F17/30637 , G06F17/30654 , G06F17/30705 , G10L15/02 , G10L15/142 , G10L15/1815 , G10L15/26 , G10L2015/223
Abstract: A system capable of performing natural language understanding (NLU) without the concept of a domain that influences NLU results. The present system uses a hierarchical organization of intents/commands and entity types, and trained models associated with those hierarchies, so that commands and entity types may be determined for incoming text queries without necessarily determining a domain for the incoming text. The system thus operates in a domain-agnostic manner, in a departure from multi-domain architecture NLU processing, where a system determines NLU results for multiple domains simultaneously and then ranks them to determine which to select as the result.
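Classifying against an intent hierarchy without first committing to a domain can be sketched as a top-down walk, scoring the children at each level and descending into the best one. The hierarchy, intent labels, and word-overlap scorer below are all hypothetical illustrations, not the trained models of the patent.

```python
def classify_hierarchical(text, hierarchy, scorer):
    """Walk an intent hierarchy top-down, choosing the best-scoring
    child at each level, and return the label path. No domain is ever
    selected; the hierarchy itself organizes the intents."""
    node = hierarchy
    path = []
    while isinstance(node, dict):
        label = max(node, key=lambda child: scorer(text, child))
        path.append(label)
        node = node[label]
    return path
```

A toy scorer for the sketch counts query words appearing in the lowercased label, which is enough to route "play music please" to a music intent in a small example hierarchy.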
-
Publication number: US20170083285A1
Publication date: 2017-03-23
Application number: US14860400
Application date: 2015-09-21
Applicant: Amazon Technologies, Inc.
Inventor: James David Meyers , Shah Samir Pravinchandra , Yue Liu , Arlen Dean , Daniel Miller , Arindam Mandal
CPC classification number: G06F3/167 , G10L15/00 , G10L15/063 , G10L15/1815 , G10L15/22 , G10L15/222 , G10L15/26 , G10L15/32 , G10L2015/088 , G10L2015/223 , G10L2015/226
Abstract: A system may use multiple speech interface devices to interact with a user by speech. All or a portion of the speech interface devices may detect a user utterance and may initiate speech processing to determine a meaning or intent of the utterance. Within the speech processing, arbitration is employed to select one of the multiple speech interface devices to respond to the user utterance. Arbitration may be based in part on metadata that directly or indirectly indicates the proximity of the user to the devices, and the device that is deemed to be nearest the user may be selected to respond to the user utterance.