-
Publication No.: US20230418380A1
Publication Date: 2023-12-28
Application No.: US18317058
Filing Date: 2023-05-13
Applicant: MindMaze Group SA
Inventor: Tej TADI , Robert LEEB , Nicolas BOURDAUD , Gangadhar GARIPELLI , Skander MENSI , Nicolas MERLINI , Yann LEBRUN
IPC: G06F3/01 , G06V40/16 , G06F18/245 , G06F18/2132 , G06F18/2453 , G06F18/2415
CPC classification number: G06F3/015 , G06V40/174 , G06V40/176 , G06F18/245 , G06F18/2132 , G06F18/2453 , G06F18/24155 , G10L2015/025
Abstract: A system, method and apparatus for detecting facial expressions according to EMG signals.
-
Publication No.: US20230386472A1
Publication Date: 2023-11-30
Application No.: US17804508
Filing Date: 2022-05-27
Applicant: Microsoft Technology Licensing, LLC
Inventor: Yuchen LI
IPC: G10L15/26 , G10L15/02 , G06F40/284 , G10L15/187
CPC classification number: G10L15/26 , G10L15/02 , G06F40/284 , G10L15/187 , G10L2015/025
Abstract: A computer-implemented method is disclosed. A search query of a text transcription is received. The search query includes a word or words having a specified spelling. A sequence of search phonemes corresponding to the specified spelling is generated. A sequence of transcript phonemes is generated from the text transcription. A search alignment in which the sequence of search phonemes is aligned to a transcript phoneme fragment is generated. Based at least on the search alignment having a quality score exceeding a quality score threshold, the transcript phoneme fragment and an associated portion of the text transcription are determined to result from an utterance of the specified spelling in an audio session corresponding to the text transcription. A search result indicating that the transcript phoneme fragment and the associated portion of the text transcription are determined to have resulted from the utterance is output.
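The abstract describes aligning a sequence of search phonemes to transcript phoneme fragments and accepting matches whose quality score clears a threshold. As a rough illustration only (not the patented method), a sliding-window alignment scored by normalized edit distance might look like this; the ARPABET-style tokens and the 0.9 threshold are assumptions:

```python
# Illustrative phoneme-level search alignment; all phoneme tokens and the
# quality-score threshold are hypothetical.

def edit_distance(a, b):
    """Classic Levenshtein distance between two token sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def best_alignment(search, transcript):
    """Slide the search phonemes over the transcript; return (score, start)
    for the window with the highest quality score in [0, 1]."""
    k = len(search)
    best = (0.0, -1)
    for start in range(len(transcript) - k + 1):
        window = transcript[start:start + k]
        score = 1.0 - edit_distance(search, window) / k
        best = max(best, (score, start))
    return best

search = ["K", "AE", "T"]
transcript = ["DH", "AH", "K", "AE", "T", "S", "AE", "T"]
score, start = best_alignment(search, transcript)
if score >= 0.9:  # quality score threshold (illustrative value)
    print(f"match at phoneme {start} with score {score:.2f}")
```

A production system would use phoneme-confusion-aware substitution costs rather than uniform edit costs, but the thresholded alignment score captures the gist of the quality test.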
-
Publication No.: US20230368782A1
Publication Date: 2023-11-16
Application No.: US18217888
Filing Date: 2023-07-03
Applicant: Microsoft Technology Licensing, LLC
Inventor: Yao QIAN , Yu WU , Kenichi KUMATANI , Shujie LIU , Furu WEI , Nanshan ZENG , Xuedong David HUANG , Chengyi WANG
IPC: G10L15/187 , G10L15/22 , G10L15/06 , G10L15/02 , G06N20/00
CPC classification number: G10L15/187 , G10L15/22 , G10L15/063 , G10L15/02 , G06N20/00 , G10L2015/025
Abstract: Systems and methods are provided for training a machine learning model to learn speech representations. Labeled speech data, or both labeled and unlabeled data sets, are applied to a feature extractor of a machine learning model to generate latent speech representations. The latent speech representations are applied to a quantizer to generate quantized latent speech representations and to a transformer context network to generate contextual representations. Each contextual representation is aligned with a phoneme label to generate phonetically aware contextual representations, and quantized latent representations are aligned with phoneme labels to generate phonetically aware latent speech representations. Systems and methods also include randomly replacing a subset of the contextual representations with quantized latent speech representations during their alignment to phoneme labels, and aligning the phonetically aware latent speech representations to the contextual representations using supervised learning.
-
Publication No.: US11817090B1
Publication Date: 2023-11-14
Application No.: US16712394
Filing Date: 2019-12-12
Applicant: Amazon Technologies, Inc.
Inventor: James Claiborne Moore , Majid Laali , Yasser Gonzalez Fernandez , Siyong Liang , Ameya Ashok Limaye
IPC: G10L15/187 , G10L15/22 , G10L15/02 , G10L15/18
CPC classification number: G10L15/187 , G10L15/02 , G10L15/1815 , G10L15/22 , G10L2015/025 , G10L2015/223
Abstract: A phonetic search system may pass phonetic information from an automatic speech recognition (ASR) system to a natural language understanding (NLU) system for the latter to leverage when performing entity resolution in the presence of ambiguous interpretations. The ASR system may include an acoustic model and a language model. The acoustic model can process audio data to generate hypotheses that can be mapped to acoustic data; i.e., one or more acoustic units such as phonemes. The language model can process the acoustic units to generate text data representing possible transcriptions of the audio data. ASR/NLU systems may have difficulty interpreting speech when confronted with, for example, homographs, which are words that are spelled the same, but have different meanings. When uncertainty in the final transcription is high, the system can leverage the acoustic data to improve the accuracy of entity resolution.
-
Publication No.: US11816151B2
Publication Date: 2023-11-14
Application No.: US16875927
Filing Date: 2020-05-15
Applicant: Audible Magic Corporation
Inventor: Erling Wold
IPC: G10L15/02 , G10L15/187 , G10L15/32 , G10L25/90 , G06F16/683
CPC classification number: G06F16/685 , G10L15/02 , G10L15/187 , G10L15/32 , G10L25/90 , G10L2015/025
Abstract: Embodiments cover identifying an unidentified media content item as a cover of a known media content item using lyrical contents. In an example, a processing device receives an unidentified media content item and determines lyrical content associated with the unidentified media content item. The processing device then determines a lyrical similarity between the lyrical content associated with the unidentified media content item and additional lyrical content associated with a known media content item of a plurality of known media content items. The processing device then identifies the unidentified media content item as a cover of the known media content item based at least in part on the lyrical similarity, resulting in an identified cover-media content item.
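The abstract's core operation is a lyrical-similarity comparison against a catalog of known items. As a rough sketch only (the patent does not specify the similarity measure), token-set Jaccard similarity with an assumed threshold could stand in for it:

```python
# Illustrative cover identification via lyric similarity. The Jaccard measure,
# catalog structure, and 0.6 threshold are assumptions, not the patented method.

def jaccard(a_tokens, b_tokens):
    """Jaccard similarity between two token collections, in [0, 1]."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def identify_cover(unknown_lyrics, known_catalog, threshold=0.6):
    """Return the known item whose lyrics best match the unknown item's
    lyrics, if the best similarity clears the threshold; else None."""
    unknown = unknown_lyrics.lower().split()
    best_item, best_score = None, 0.0
    for title, lyrics in known_catalog.items():
        score = jaccard(unknown, lyrics.lower().split())
        if score > best_score:
            best_item, best_score = title, score
    return best_item if best_score >= threshold else None
```

A real system would likely use order-sensitive alignment of ASR-derived lyrics rather than bag-of-words overlap, but the threshold-gated best-match search mirrors the abstract's flow.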
-
Publication No.: US11804228B2
Publication Date: 2023-10-31
Application No.: US17273542
Filing Date: 2019-08-09
Applicant: SAMSUNG ELECTRONICS CO., LTD.
Inventor: Chisang Jung
CPC classification number: G10L17/08 , G10L15/02 , G10L15/26 , G10L17/02 , G10L17/04 , G10L15/04 , G10L2015/025
Abstract: The present disclosure relates to a speaker model adaptation method and device for enhancing text-independent speaker recognition performance. Specifically, the disclosure relates to a method and a device whereby, for the adaptation of a speaker model pre-stored in an electronic device, text-independent speaker recognition performance is improved by considering variations in the amount of speaker characteristics information per phoneme unit.
-
Publication No.: US20230329630A1
Publication Date: 2023-10-19
Application No.: US18043271
Filing Date: 2021-08-30
Applicant: PFIZER INC.
Inventor: Shyamal Patel , Paul William Wacnik , Kara Chappie , Robert Mather , Brian Tracey , Maria del Mar Santamaria Serra
IPC: A61B5/00 , A61B5/08 , A61B7/00 , A61K31/675 , A61K38/06 , G10L15/02 , G10L25/66 , G10L15/26 , G10L15/22 , G16H10/60 , G16H40/20 , G16H20/10 , G16H50/30
CPC classification number: A61B5/4803 , A61B5/08 , A61B5/7275 , A61B5/4839 , A61B5/4848 , A61B5/4842 , A61B7/003 , A61B5/7278 , A61K31/675 , A61K38/06 , G10L15/02 , G10L25/66 , G10L15/26 , G10L15/22 , G16H10/60 , G16H40/20 , G16H20/10 , G16H50/30 , G10L2015/025
Abstract: Technology is disclosed for monitoring a user's respiratory condition and providing decision support by analyzing the user's audio data. Spoken phonemes may be detected within audio data, and acoustic features may be extracted for the phonemes. A distance metric may be computed to compare phoneme feature sets of a user. Based on the comparison, a determination about the user's respiratory condition, such as whether the user has a respiratory condition (e.g., an infection) and/or whether the condition is changing, may be made. Some aspects include predicting the user's respiratory condition in the future utilizing the phoneme feature sets. Decision support tools in the form of computer applications or services may utilize the detected or predicted respiratory condition information to initiate an action for treating a current condition or mitigating a future risk.
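The abstract's pivotal step is a distance metric comparing per-phoneme feature sets over time. A minimal sketch of that comparison, where the specific features (duration, jitter, shimmer) and the Euclidean metric are illustrative assumptions rather than what the application claims:

```python
# Hedged sketch of phoneme-feature-set comparison; feature choice and metric
# are hypothetical.
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def respiratory_change_score(baseline, current):
    """Average per-phoneme distance between a baseline feature set and a
    current one; a larger score suggests a changing condition."""
    shared = baseline.keys() & current.keys()
    if not shared:
        return 0.0
    return sum(euclidean(baseline[p], current[p]) for p in shared) / len(shared)

# Hypothetical per-phoneme features: (duration_ms, jitter, shimmer)
baseline = {"AA": (80.0, 0.01, 0.05), "S": (120.0, 0.02, 0.04)}
current = {"AA": (95.0, 0.03, 0.09), "S": (150.0, 0.05, 0.07)}
score = respiratory_change_score(baseline, current)
```

In practice the features would need normalization so that no single dimension (such as duration in milliseconds) dominates the distance, and a threshold on the score would trigger the decision-support action the abstract describes.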
-
Publication No.: US20230317079A1
Publication Date: 2023-10-05
Application No.: US18329787
Filing Date: 2023-06-06
Applicant: Promptu Systems Corporation
Inventor: Harry William Printz
CPC classification number: G10L15/22 , G10L15/1815 , G10L15/02 , G10L15/16 , G10L15/19 , G01C21/3608 , G10L15/32 , G06F40/295 , G10L2015/025 , G06F3/167
Abstract: Various embodiments contemplate systems and methods for performing automatic speech recognition (ASR) and natural language understanding (NLU) that enable high accuracy recognition and understanding of freely spoken utterances which may contain proper names and similar entities. The proper name entities may contain or be comprised wholly of words that are not present in the vocabularies of these systems as normally constituted. Recognition of the other words in the utterances in question, e.g. words that are not part of the proper name entities, may occur at regular, high recognition accuracy. Various embodiments provide as output not only accurately transcribed running text of the complete utterance, but also a symbolic representation of the meaning of the input, including appropriate symbolic representations of proper name entities, adequate to allow a computer system to respond appropriately to the spoken request without further analysis of the user's input.
-
Publication No.: US20230298576A1
Publication Date: 2023-09-21
Application No.: US18322207
Filing Date: 2023-05-23
Applicant: Google LLC
Inventor: Raziel Alvarez Guevara , Hyun Jin Park , Patrick Violette
CPC classification number: G10L15/16 , G10L15/02 , G10L15/063 , G10L15/22 , G10L2015/025 , G10L2015/088
Abstract: A method for training hotword detection includes receiving a training input audio sequence including a sequence of input frames that define a hotword that initiates a wake-up process on a device. The method also includes feeding the training input audio sequence into an encoder and a decoder of a memorized neural network. Each of the encoder and the decoder of the memorized neural network include sequentially-stacked single value decomposition filter (SVDF) layers. The method further includes generating a logit at each of the encoder and the decoder based on the training input audio sequence. For each of the encoder and the decoder, the method includes smoothing each respective logit generated from the training input audio sequence, determining a max pooling loss from a probability distribution based on each respective logit, and optimizing the encoder and the decoder based on all max pooling losses associated with the training input audio sequence.
-
Publication No.: US20230298574A1
Publication Date: 2023-09-21
Application No.: US18184630
Filing Date: 2023-03-15
Applicant: Google LLC
Inventor: Fadi Biadsy , Youzheng Chen , Xia Zhang , Oleg Rybakov , Andrew M. Rosenberg , Pedro J. Moreno Mengibar
CPC classification number: G10L15/16 , G10L15/063 , G10L15/02 , G10L2015/025
Abstract: A method for speech conversion includes obtaining a speech conversion model configured to convert input utterances of human speech directly into corresponding output utterances of synthesized speech. The method further includes receiving a speech conversion request including input audio data corresponding to an utterance spoken by a target speaker associated with atypical speech and a speaker identifier uniquely identifying the target speaker. The method includes activating, using the speaker identifier, a particular sub-model for biasing the speech conversion model to recognize a type of the atypical speech associated with the target speaker identified by the speaker identifier. The method includes converting, using the speech conversion model biased by the activated particular sub-model, the input audio data corresponding to the utterance spoken by the target speaker associated with atypical speech into output audio data corresponding to a synthesized canonical fluent speech representation of the utterance spoken by the target speaker.