-
Publication No.: US20210012769A1
Publication Date: 2021-01-14
Application No.: US16509029
Filing Date: 2019-07-11
Applicant: SoundHound, Inc.
Inventor: Cristina Vasconcelos , Zili Li
IPC: G10L15/22 , G10L15/02 , G10L15/30 , G10L15/18 , G10L15/187 , G10L15/24 , G10L15/16 , G10L15/06 , G06K9/46 , G06K9/62 , G06K9/72 , G06K9/00
Abstract: Systems and methods for processing speech are described. In certain examples, image data is used to generate visual feature tensors and audio data is used to generate audio feature tensors. The visual feature tensors and the audio feature tensors are used by a linguistic model to determine linguistic features that are usable to parse an utterance of a user. The generation of the feature tensors may be jointly configured with the linguistic model. Systems may be provided in a client-server architecture.
-
Publication No.: US20210256386A1
Publication Date: 2021-08-19
Application No.: US16790643
Filing Date: 2020-02-13
Applicant: SoundHound, Inc.
Inventor: Maisy Wieman , Andrew Carl Spencer , Zìlì Li , Cristina Vasconcelos
Abstract: An audio processing system is described. The audio processing system uses a convolutional neural network architecture to process audio data, a recurrent neural network architecture to process at least data derived from an output of the convolutional neural network architecture, and a feed-forward neural network architecture to process at least data derived from an output of the recurrent neural network architecture. The feed-forward neural network architecture is configured to output classification scores for a plurality of sound units associated with speech. The classification scores indicate a presence of one or more sound units in the audio data. The convolutional neural network architecture has a plurality of convolutional groups arranged in series, where a convolutional group includes a combination of two data mappings arranged in parallel.
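Illustrative sketch: the data flow named in this abstract (convolutional groups in series, then a recurrent stage, then a feed-forward stage that emits per-frame scores for sound units) might look roughly like the NumPy code below. This is not the patented implementation. Every name, layer size, and weight here is hypothetical, and several choices are assumptions the abstract does not state: the two parallel mappings are combined by summation, the recurrent stage is a plain tanh cell, and the scores are softmax-normalized.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_same(x, w):
    """x: (T, C_in), w: (K, C_in, C_out) -> (T, C_out), zero-padded along time."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([np.einsum("kc,kco->o", xp[t:t + k], w)
                     for t in range(x.shape[0])])

def conv_group(x, w_main, w_skip):
    """One convolutional group: two mappings applied in parallel to the same input."""
    main = np.maximum(conv1d_same(x, w_main), 0.0)  # mapping 1: convolution + ReLU
    skip = x @ w_skip                               # mapping 2: per-frame linear projection
    return main + skip                              # assumption: combined by summation

def recurrent(x, w_in, w_rec):
    """Simple tanh recurrent cell run over the time axis (assumed cell type)."""
    h = np.zeros(w_rec.shape[0])
    states = []
    for frame in x:
        h = np.tanh(frame @ w_in + h @ w_rec)
        states.append(h)
    return np.stack(states)

def feed_forward(x, w, b):
    """Per-frame classification scores over sound units (softmax is an assumption)."""
    logits = x @ w + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def init_params(n_feat=8, n_hidden=16, n_units=5, n_groups=2, kernel=3):
    """Random, untrained weights; all sizes are hypothetical."""
    groups = [(rng.normal(0, 0.1, (kernel, n_feat, n_feat)),
               rng.normal(0, 0.1, (n_feat, n_feat)))
              for _ in range(n_groups)]
    return {
        "groups": groups,
        "w_in": rng.normal(0, 0.1, (n_feat, n_hidden)),
        "w_rec": rng.normal(0, 0.1, (n_hidden, n_hidden)),
        "w_out": rng.normal(0, 0.1, (n_hidden, n_units)),
        "b_out": np.zeros(n_units),
    }

def score_sound_units(audio_features, params):
    """audio_features: (T, n_feat) frames -> (T, n_units) scores."""
    x = audio_features
    for w_main, w_skip in params["groups"]:  # convolutional groups arranged in series
        x = conv_group(x, w_main, w_skip)
    x = recurrent(x, params["w_in"], params["w_rec"])
    return feed_forward(x, params["w_out"], params["b_out"])

params = init_params()
scores = score_sound_units(rng.normal(size=(20, 8)), params)
```

With these hypothetical sizes, `scores` holds one row per audio frame and one column per sound unit; because the weights are random, the values only demonstrate the shapes and the ordering of the three stages.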
-
Publication No.: US20220139393A1
Publication Date: 2022-05-05
Application No.: US17547917
Filing Date: 2021-12-10
Applicant: SoundHound, Inc.
Inventor: Zili Li , Cristina Vasconcelos
IPC: G10L15/22 , G10L15/02 , G10L15/30 , G10L15/18 , G10L15/187 , G10L15/24 , G10L15/06 , G06K9/62 , G10L15/16 , G06V10/40 , G06V10/70 , G06V20/40
Abstract: A driver interface for use within an automobile provides responses to voice commands issued for example by a driver of the automobile. The interface includes a camera and microphone for capturing image data such as gestures and audio data from the automobile driver. The image data and audio data are processed to extract image and linguistic features from the image and audio data, which image and linguistic features are processed to interpret and infer a meaning of the voice command.
-
Publication No.: US11257493B2
Publication Date: 2022-02-22
Application No.: US16509029
Filing Date: 2019-07-11
Applicant: SoundHound, Inc.
Inventor: Cristina Vasconcelos , Zili Li
IPC: G10L15/00 , G10L15/22 , G10L15/02 , G10L15/30 , G10L15/18 , G10L15/187 , G10L15/24 , G10L15/06 , G06K9/46 , G06K9/62 , G06K9/72 , G06K9/00 , G10L15/16 , G10L25/30
Abstract: Systems and methods for processing speech are described. In certain examples, image data is used to generate visual feature tensors and audio data is used to generate audio feature tensors. The visual feature tensors and the audio feature tensors are used by a linguistic model to determine linguistic features that are usable to parse an utterance of a user. The generation of the feature tensors may be jointly configured with the linguistic model. Systems may be provided in a client-server architecture.
-