-
Publication No.: US20200335119A1
Publication Date: 2020-10-22
Application No.: US16434537
Filing Date: 2019-06-07
Inventors: Xiong XIAO , Zhuo CHEN , Takuya YOSHIOKA , Changliang LIU , Hakan ERDOGAN , Dimitrios Basile DIMITRIADIS , Yifan GONG , James Garnet Droppo, III
IPC Classes: G10L21/028 , G10L21/0208
Abstract: Embodiments are associated with determination of a first plurality of multi-dimensional vectors, each of the first plurality of multi-dimensional vectors representing speech of a target speaker, determination of a multi-dimensional vector representing a speech signal of two or more speakers, determination of a weighted vector representing speech of the target speaker based on the first plurality of multi-dimensional vectors and on similarities between the multi-dimensional vector and each of the first plurality of multi-dimensional vectors, and extraction of speech of the target speaker from the speech signal based on the weighted vector and the speech signal.
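The similarity-weighted combination the abstract describes can be sketched in a few lines. This is a minimal pure-Python illustration, not the patent's specified formulation: the function names, the use of cosine similarity, and the softmax normalization of the weights are all assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def weighted_speaker_vector(enrollment_vectors, mixture_vector):
    """Weight each enrollment vector by its (normalized) similarity to
    the mixture embedding, then combine them into one target-speaker
    vector that can condition a speech-extraction network."""
    sims = [cosine(e, mixture_vector) for e in enrollment_vectors]
    weights = softmax(sims)
    dim = len(mixture_vector)
    return [sum(w * e[d] for w, e in zip(weights, enrollment_vectors))
            for d in range(dim)]
```

Enrollment vectors that resemble the mixture embedding dominate the weighted sum, so the conditioning vector tracks the target speaker actually present in the signal.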
-
Publication No.: US20200334538A1
Publication Date: 2020-10-22
Application No.: US16410741
Filing Date: 2019-05-13
Inventors: Zhong MENG , Jinyu LI , Yong ZHAO , Yifan GONG
IPC Classes: G06N3/08 , G10L15/16 , G06N3/04 , G10L15/183
Abstract: Embodiments are associated with conditional teacher-student model training. A trained teacher model configured to perform a task may be accessed and an untrained student model may be created. A model training platform may provide training data labeled with ground truths to the teacher model to produce teacher posteriors representing the training data. When it is determined that a teacher posterior matches the associated ground truth label, the platform may conditionally use the teacher posterior to train the student model. When it is determined that a teacher posterior does not match the associated ground truth label, the platform may conditionally use the ground truth label to train the student model. The models might be associated with, for example, automatic speech recognition (e.g., in connection with domain adaptation and/or speaker adaptation).
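The target-selection rule above can be sketched directly: per frame, keep the teacher's soft posterior when its top prediction agrees with the ground truth, and fall back to the hard label otherwise. The function name and list-of-lists representation are assumptions made for illustration.

```python
def conditional_ts_targets(teacher_posteriors, ground_truth_labels):
    """Choose a per-frame training target for the student: the teacher's
    soft posterior when its top prediction matches the ground truth,
    otherwise a one-hot vector built from the ground-truth label."""
    targets = []
    for posterior, label in zip(teacher_posteriors, ground_truth_labels):
        predicted = max(range(len(posterior)), key=lambda i: posterior[i])
        if predicted == label:
            # Teacher is correct here: distill from its soft output.
            targets.append(list(posterior))
        else:
            # Teacher is wrong here: fall back to the hard label.
            one_hot = [0.0] * len(posterior)
            one_hot[label] = 1.0
            targets.append(one_hot)
    return targets
```

The student would then be trained with, for example, cross-entropy against these mixed soft/hard targets.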
-
Publication No.: US20220159047A1
Publication Date: 2022-05-19
Application No.: US17345703
Filing Date: 2021-06-11
Inventors: Akash Alok MAHAJAN , Yifan GONG
Abstract: Systems are provided for managing and coordinating STT/TTS systems and the communications between these systems when they are connected in online meetings, and for mitigating connectivity issues that may arise during the online meetings, to provide a seamless and reliable meeting experience with live captions and/or rendered audio. Initially, online meeting communications are transmitted over a lossy connectionless type protocol/channel. Then, in response to detected connectivity problems with one or more systems involved in the online meeting, which can cause jitter or packet loss, for example, an instruction is dynamically generated and processed for causing one or more of the connected systems to transmit and/or process the online meeting content with a more reliable connection/protocol, such as a connection-oriented protocol. Codecs at the systems are used, when needed, to convert speech to text with related speech attribute information and to convert text to speech.
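The fallback decision can be illustrated with a simple quality check. This is a sketch only: the threshold values and the function name are hypothetical, and a real system would drive the switch from richer telemetry than two scalars.

```python
def choose_transport(packet_loss_rate, jitter_ms,
                     loss_threshold=0.05, jitter_threshold=100.0):
    """Stay on the low-latency connectionless channel while conditions
    are good; switch meeting content to a reliable connection-oriented
    channel once packet loss or jitter crosses a threshold."""
    degraded = packet_loss_rate > loss_threshold or jitter_ms > jitter_threshold
    return "connection-oriented" if degraded else "connectionless"
```

For example, `choose_transport(0.20, 20.0)` reports degraded conditions and selects the reliable channel, mirroring the dynamic instruction the abstract describes.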
-
Publication No.: US20220157324A1
Publication Date: 2022-05-19
Application No.: US17665862
Filing Date: 2022-02-07
Inventors: Yong ZHAO , Tianyan ZHOU , Jinyu LI , Yifan GONG , Jian WU , Zhuo CHEN
Abstract: Embodiments may include determination, for each of a plurality of speech frames associated with an acoustic feature, of a phonetic feature based on the associated acoustic feature, generation of one or more two-dimensional feature maps based on the plurality of phonetic features, input of the one or more two-dimensional feature maps to a trained neural network to generate a plurality of speaker embeddings, and aggregation of the plurality of speaker embeddings into a speaker embedding based on respective weights determined for each of the plurality of speaker embeddings, wherein the speaker embedding is associated with an identity of the speaker.
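The final aggregation step can be sketched as a weighted average of the per-segment embeddings. The softmax normalization of the weights and the function names are assumptions; the abstract only states that the embeddings are combined using respective weights.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_embeddings(embeddings, scores):
    """Collapse several per-segment speaker embeddings into a single
    utterance-level embedding using softmax-normalized weights, so more
    informative segments contribute more to the speaker identity."""
    weights = softmax(scores)
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings))
            for d in range(dim)]
```

With equal scores this reduces to a plain average; learned scores let the model emphasize cleaner or more speaker-discriminative segments.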
-
Publication No.: US20210065683A1
Publication Date: 2021-03-04
Application No.: US16675515
Filing Date: 2019-11-06
Inventors: Zhong MENG , Yashesh GAUR , Jinyu LI , Yifan GONG
IPC Classes: G10L15/065 , G10L15/22 , G10L19/00 , G10L15/06
Abstract: Embodiments are associated with a speaker-independent attention-based encoder-decoder model to classify output tokens based on input speech frames, the speaker-independent attention-based encoder-decoder model associated with a first output distribution, a speaker-dependent attention-based encoder-decoder model to classify output tokens based on input speech frames, the speaker-dependent attention-based encoder-decoder model associated with a second output distribution, training of the speaker-dependent attention-based encoder-decoder model to classify output tokens based on input speech frames of a target speaker while simultaneously training the speaker-dependent attention-based encoder-decoder model to maintain a similarity between the first output distribution and the second output distribution, and performing automatic speech recognition on speech frames of the target speaker using the trained speaker-dependent attention-based encoder-decoder model.
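A common way to "maintain similarity between the two output distributions" during adaptation is to interpolate the task loss with a KL-divergence term. The sketch below illustrates that idea for a single token position; the interpolation weight `rho`, the function name, and the exact loss form are assumptions, not the patent's claimed formulation.

```python
import math

def kld_adaptation_loss(sd_posterior, si_posterior, target_index, rho=0.5):
    """Interpolated adaptation loss: cross-entropy against the target
    token plus a KL term pulling the speaker-dependent (SD) output
    distribution toward the speaker-independent (SI) one."""
    cross_entropy = -math.log(sd_posterior[target_index])
    kl = sum(p * math.log(p / q)
             for p, q in zip(si_posterior, sd_posterior) if p > 0.0)
    return (1.0 - rho) * cross_entropy + rho * kl
```

The KL term vanishes when the SD model still matches the SI model, so it acts as a regularizer against overfitting to the target speaker's limited data.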
-
Publication No.: US20200334526A1
Publication Date: 2020-10-22
Application No.: US16410659
Filing Date: 2019-05-13
Inventors: Jinyu LI , Vadim MAZALOV , Changliang LIU , Liang LU , Yifan GONG
IPC Classes: G06N3/08 , G10L15/22 , G10L15/16 , G10L15/183
Abstract: According to some embodiments, a machine learning model may include an input layer to receive an input signal as a series of frames representing handwriting data, speech data, audio data, and/or textual data. A plurality of time layers may be provided, and each time layer may comprise a uni-directional recurrent neural network processing block. A depth processing block may scan hidden states of the recurrent neural network processing block of each time layer, and the depth processing block may be associated with a first frame and receive context frame information of a sequence of one or more future frames relative to the first frame. An output layer may output a final classification as a classified posterior vector of the input signal. For example, the depth processing block may receive the context frame information from an output of a time layer processing block or another depth processing block of the future frame.
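The "scan hidden states across layers" idea can be illustrated with a toy recurrent cell that runs over depth rather than time. This is a deliberately simplified sketch: the scalar weights `w_in` and `w_depth`, the `tanh` cell, and the function name are all hypothetical stand-ins for the gated depth-processing block the abstract describes.

```python
import math

def depth_scan(layer_states, w_in=0.5, w_depth=0.5):
    """Scan the hidden states emitted by each time layer at one frame
    (lowest layer first) with a tiny recurrent cell over depth,
    yielding a depth-processed vector for classification."""
    g = [0.0] * len(layer_states[0])  # depth-recurrent state
    for h in layer_states:
        g = [math.tanh(w_in * h_d + w_depth * g_d)
             for h_d, g_d in zip(h, g)]
    return g
```

In the full model, this depth scan at one frame could also take input from a depth block at a future frame, which is how the look-ahead context the abstract mentions would flow in.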
-
Publication No.: US20190341053A1
Publication Date: 2019-11-07
Application No.: US16019318
Filing Date: 2018-06-26
Inventors: Shixiong ZHANG , Lingfeng WU , Eyal KRUPKA , Xiong XIAO , Yifan GONG
Abstract: A computerized conference assistant includes a camera and a microphone. A face location machine of the computerized conference assistant finds a physical location of a human, based on a position of a candidate face in digital video captured by the camera. A beamforming machine of the computerized conference assistant outputs a beamformed signal isolating sounds originating from the physical location of the human. A diarization machine of the computerized conference assistant attributes information encoded in the beamformed signal to the human.
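The beamforming machine's role can be illustrated with the classic delay-and-sum technique: align the microphone channels toward one location, then average them so sound from that location is reinforced. This is a textbook sketch for intuition, not the patent's beamformer; the sample-delay representation and function name are assumptions.

```python
def delay_and_sum(channels, arrival_delays):
    """Minimal delay-and-sum beamformer: advance each microphone channel
    by its arrival delay (in samples) so sound from the target location
    lines up across microphones, then average the aligned channels."""
    length = len(channels[0])
    output = []
    for i in range(length):
        acc = 0.0
        for channel, delay in zip(channels, arrival_delays):
            j = i + delay
            acc += channel[j] if j < length else 0.0
        output.append(acc / len(channels))
    return output
```

Sounds arriving from other directions stay misaligned after the shifts and partially cancel in the average, which is what lets the downstream diarization machine attribute the isolated signal to the located human.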
-
Publication No.: US20190279614A1
Publication Date: 2019-09-12
Application No.: US15917082
Filing Date: 2018-03-09
Inventors: Guoli YE , James DROPPO , Jinyu LI , Rui ZHAO , Yifan GONG
IPC Classes: G10L15/187 , G10L15/16 , G10L15/06 , G10L15/22
Abstract: Non-limiting examples of the present disclosure describe advancements in acoustic-to-word modeling that improve accuracy in speech recognition processing through the replacement of out-of-vocabulary (OOV) tokens. During the decoding of speech signals, better accuracy in speech recognition processing is achieved through training and implementation of multiple different solutions that present enhanced speech recognition models. In one example, a hybrid neural network model for speech recognition processing combines a word-based neural network model as a primary model and a character-based neural network model as an auxiliary model. The primary word-based model emits a word sequence, and an output of the character-based auxiliary model is consulted at a segment where the word-based model emits an OOV token. In another example, a mixed unit speech recognition model is developed and trained to generate a mixed word and character sequence during decoding of a speech signal without requiring generation of OOV tokens.
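The consultation step of the hybrid model can be sketched as a simple substitution pass over the word model's output. The function name, the `<OOV>` token string, and the one-hypothesis-per-OOV-segment assumption are illustrative, not the patent's specified interface.

```python
def hybrid_decode(word_tokens, char_hypotheses, oov_token="<OOV>"):
    """Emit the word model's sequence, substituting the character
    model's hypothesis wherever the word model produced an OOV token."""
    result, next_seg = [], 0
    for token in word_tokens:
        if token == oov_token:
            result.append(char_hypotheses[next_seg])  # consult auxiliary model
            next_seg += 1
        else:
            result.append(token)
    return result
```

In-vocabulary words pass through untouched, so the character model only has to spell out the rare words the word model cannot emit.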