-
Publication No.: WO2023085191A1
Publication Date: 2023-05-19
Application No.: PCT/JP2022/041037
Application Date: 2022-11-02
Applicant: Sony Group Corporation (ソニーグループ株式会社)
Inventor: ONO, Junya (小野 淳也)
Abstract: An information processing device (100) includes: an acquisition unit (131) that acquires an utterance group containing a plurality of turns, a turn being an element obtained by segmenting utterances according to a predetermined condition; a preprocessing unit (132) that combines, into a single sentence, a current turn, which is the turn for which output is to be obtained in a predetermined task, together with a plurality of turns positioned before and after the current turn in the time series, inputs the sentence to a pre-trained model, and thereby outputs a feature corresponding to each turn; and an estimation unit (133) that obtains the output for the predetermined task using a neural network that receives the features output by the preprocessing unit and, in one of its intermediate layers, applies predetermined weighting to the score output by an attention mechanism for the feature corresponding to the current turn.
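A minimal sketch of the core idea, assuming a multiplicative boost applied to the current turn's attention score before the softmax (the patent only says "predetermined weighting"); function and variable names are illustrative:

```python
# Sketch of current-turn attention weighting (WO2023085191A1); the boost
# factor and where it is applied are assumptions, not the claimed design.
import torch
import torch.nn.functional as F

def current_turn_attention(turn_features, current_idx, boost=2.0):
    """turn_features: (num_turns, dim) per-turn features from a pre-trained
    encoder; current_idx: position of the current turn."""
    q = turn_features[current_idx]                               # query: current turn
    scores = turn_features @ q / turn_features.shape[-1] ** 0.5  # (num_turns,)
    weights = torch.ones_like(scores)
    weights[current_idx] = boost                 # "predetermined weighting"
    attn = F.softmax(scores * weights, dim=-1)
    return attn @ turn_features                  # context vector for the task
```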
-
Publication No.: WO2023060002A1
Publication Date: 2023-04-13
Application No.: PCT/US2022/077124
Application Date: 2022-09-27
Applicant: GOOGLE LLC
Inventor: LU, Zhiyun , DOUTRE, Thibault , PAN, Yanwei , CAO, Liangliang , PRABHAVALKAR, Rohit , STROHMAN, Trevor , ZHANG, Chao
IPC: G10L15/06 , G10L15/16 , G10L15/04 , G10L15/063 , G10L15/197 , G10L15/22
Abstract: A method (700) includes obtaining training samples (400), each training sample including a corresponding sequence of speech segments (405) corresponding to a training utterance and a corresponding sequence of ground-truth transcriptions (415) for the sequence of speech segments, and each ground-truth transcription including a start time (414) and an end time (416) of a corresponding speech segment. For each of the training samples, the method includes processing, using a speech recognition model (200), the corresponding sequence of speech segments to obtain one or more speech recognition hypotheses (522) for the training utterance; and, for each speech recognition hypothesis obtained for the training utterance, identifying a respective number of word errors relative to the corresponding sequence of ground-truth transcriptions. The method trains the speech recognition model to minimize word error rate based on the respective number of word errors identified for each speech recognition hypothesis obtained for the training utterance.
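The training objective described here is minimum word error rate (MWER) training. A hedged sketch of one common formulation, expected word errors over an N-best list with a baseline subtracted for variance reduction; the patent's exact loss may differ:

```python
# MWER-style loss sketch (WO2023060002A1 describes training to minimize word
# errors; this renormalized N-best formulation is a standard assumption).
import torch

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(ref) + 1)]
         for i in range(len(hyp) + 1)]
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]))
    return d[-1][-1]

def mwer_loss(log_probs, hypotheses, reference):
    """log_probs: (N,) model log-probabilities of the N-best hypotheses."""
    errors = torch.tensor([float(edit_distance(h, reference))
                           for h in hypotheses])
    probs = torch.softmax(log_probs, dim=0)      # renormalize over the N-best
    baseline = (probs * errors).sum().detach()   # variance reduction
    return (probs * (errors - baseline)).sum()
```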
-
Publication No.: WO2023059992A1
Publication Date: 2023-04-13
Application No.: PCT/US2022/076893
Application Date: 2022-09-22
Applicant: GOOGLE LLC
Inventor: LI, Bo , SAINATH, Tara N , PANG, Ruoming , CHANG, Shuo-yiin , XU, Qiumin , STROHMAN, Trevor , CHEN, Vince , LIANG, Qiao , LIU, Heguang , HE, Yanzhang , HAGHANI, Parisa , BIDICHANDANI, Sameer
IPC: G10L15/16 , G10L15/005 , G10L15/063 , G10L15/22 , G10L15/30 , G10L2015/226
Abstract: A method (500) includes receiving a sequence of acoustic frames (110) characterizing one or more utterances (106) as input to a multilingual automated speech recognition (ASR) model (200). The method also includes generating a higher order feature representation (204) for a corresponding acoustic frame. The method also includes generating a hidden representation (355) based on a sequence of non-blank symbols output (222) by a final softmax layer (240). The method also includes generating a probability distribution over possible speech recognition hypotheses based on the hidden representation and the higher order feature representation. The method also includes predicting an end of utterance (EOU) token (232) at an end of each utterance. The method also includes classifying each acoustic frame as either speech, initial silence, intermediate silence, or final silence.
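A small illustrative sketch of the frame-classification step, assuming a simple two-layer head over encoder frames and an EOU token emitted at the first frame labeled final silence; the head architecture and the emission rule are assumptions, not the claimed model:

```python
# Frame classification into speech / initial / intermediate / final silence,
# with EOU detection (WO2023059992A1); sizes and rule are illustrative.
import torch
import torch.nn as nn

FRAME_CLASSES = ["speech", "initial_silence", "intermediate_silence",
                 "final_silence"]

class FrameClassifier(nn.Module):
    def __init__(self, encoder_dim=256):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(encoder_dim, 128), nn.ReLU(),
                                  nn.Linear(128, len(FRAME_CLASSES)))

    def forward(self, frames):                  # frames: (T, encoder_dim)
        logits = self.head(frames)              # (T, 4)
        labels = logits.argmax(dim=-1)
        # Emit an EOU token at the first frame classified as final silence.
        final = (labels == FRAME_CLASSES.index("final_silence")).nonzero()
        eou_frame = int(final[0]) if len(final) else None
        return labels, eou_frame
```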
-
Publication No.: WO2023022323A1
Publication Date: 2023-02-23
Application No.: PCT/KR2022/005165
Application Date: 2022-04-11
Applicant: PARK, Bonglae (박봉래)
Inventor: PARK, Bonglae (박봉래)
IPC: G09B19/06 , G09B5/04 , G10L15/02 , G10L15/26 , G10L15/16 , G10L15/06 , G10L15/197 , G06Q50/20 , G06Q50/10 , H04N21/488
Abstract: Disclosed is a method for evaluating the listening difficulty of foreign-language speech, performed by a computing device, the method comprising: obtaining foreign-language speech; obtaining foreign-language text corresponding to the foreign-language speech; and evaluating the listening difficulty of the foreign-language speech based on the foreign-language speech and the foreign-language text.
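The abstract does not specify how difficulty is computed from the speech and the text. A hypothetical sketch of one plausible scoring, combining speaking rate from the audio with vocabulary rarity from the transcript; the features, weights, and thresholds are all assumptions:

```python
# Hypothetical listening-difficulty score (WO2023022323A1 leaves the formula
# unspecified); everything below is an illustrative assumption.
def listening_difficulty(duration_sec, transcript, word_freq, w_rate=0.5):
    """duration_sec: audio length; transcript: list of words;
    word_freq: dict mapping a word to its relative corpus frequency."""
    words_per_sec = len(transcript) / max(duration_sec, 1e-6)
    rate_score = min(words_per_sec / 4.0, 1.0)   # ~4 words/sec is very fast
    rarity = [1.0 - word_freq.get(w.lower(), 0.0) for w in transcript]
    vocab_score = sum(rarity) / max(len(rarity), 1)
    return w_rate * rate_score + (1 - w_rate) * vocab_score  # 0..1
```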
-
Publication No.: WO2023012994A1
Publication Date: 2023-02-09
Application No.: PCT/JP2021/029212
Application Date: 2021-08-05
Applicant: Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
IPC: G10L15/16
Abstract: A speech recognition device (1b) includes a label estimation unit (103), a trigger-fired label estimation unit (302), and an RNN-T trigger estimation unit (301). The label estimation unit (103) predicts a symbol sequence for speech data based on an intermediate acoustic feature sequence and an intermediate symbol feature sequence of the speech data, using a model trained by RNN-T. The trigger-fired label estimation unit (302) predicts the next symbol of the speech data using an attention mechanism, based on the intermediate acoustic feature sequence of the speech data. The RNN-T trigger estimation unit (301) computes, based on the symbol sequence predicted by the label estimation unit (103), the timings at which the probability of a non-blank symbol occurring in the speech data is maximal. The RNN-T trigger estimation unit (301) then outputs the computed timings as triggers that activate the trigger-fired label estimation unit (302).
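A minimal sketch of the trigger computation, assuming per-frame RNN-T posteriors and treating local maxima of the non-blank probability as firing points; the shapes, the 0.5 floor, and the peak-picking rule are illustrative assumptions:

```python
# Trigger frames from RNN-T posteriors (WO2023012994A1); the peak-picking
# heuristic below is an assumption, not the patented computation.
import torch

def rnnt_triggers(posteriors, blank_id=0):
    """posteriors: (T, V) per-frame symbol probabilities from an RNN-T model
    (one emission step per frame, an assumption for this sketch)."""
    non_blank = 1.0 - posteriors[:, blank_id]      # P(non-blank) per frame
    # Fire where P(non-blank) is a local maximum above a floor.
    left = torch.cat([non_blank.new_zeros(1), non_blank[:-1]])
    right = torch.cat([non_blank[1:], non_blank.new_zeros(1)])
    peaks = (non_blank > left) & (non_blank >= right) & (non_blank > 0.5)
    return peaks.nonzero(as_tuple=True)[0]         # trigger frame indices
```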
-
Publication No.: WO2023009020A1
Publication Date: 2023-02-02
Application No.: PCT/RU2021/000316
Application Date: 2021-07-26
Applicant: STC-INNOVATIONS LIMITED
Abstract: This invention relates to a method of training a neural network for emotion recognition in speech segments and to a system for segmenting speech and recognizing an emotion in said speech segments; more particularly, the invention is directed to selecting speech segments with a required emotion from long audio recordings. The presented method of training a neural network for emotion recognition in a speech segment includes the following steps: freezing a pre-trained OpenL3 convolutional neural network; forming a labeled utterances database containing utterances not exceeding 10 seconds in length, in which a corresponding emotion label or a noise label is attributed to each utterance by a group of assessors, excluding assessors that do not meet a Fleiss' kappa agreement level of 0.4; training a low-capacity recurrent neural network built on said pre-trained OpenL3 convolutional neural network using the formed labeled utterances database; and unfreezing the upper layers of said pre-trained OpenL3 convolutional neural network for further training of the neural network.
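A hedged PyTorch sketch of the freeze/train/unfreeze recipe; a tiny stand-in network replaces the actual OpenL3 encoder, and the layer split and sizes are assumptions:

```python
# Freeze -> train small head -> unfreeze upper layers (WO2023009020A1);
# the stand-in encoder and all sizes here are illustrative assumptions.
import torch.nn as nn

def set_trainable(module, trainable):
    for p in module.parameters():
        p.requires_grad = trainable

encoder = nn.Sequential(                 # stand-in for a pre-trained OpenL3
    nn.Conv1d(1, 32, 5), nn.ReLU(),
    nn.Conv1d(32, 64, 5), nn.ReLU(),     # "upper" layer to unfreeze later
)
head = nn.GRU(input_size=64, hidden_size=32, batch_first=True)  # low-capacity

set_trainable(encoder, False)            # step 1: freeze the encoder
# ... steps 2-3: build the labeled utterance set and train `head` ...
set_trainable(encoder[2], True)          # step 4: unfreeze upper layers
```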
-
Publication No.: WO2023278952A1
Publication Date: 2023-01-05
Application No.: PCT/US2022/073067
Application Date: 2022-06-21
Applicant: GOOGLE LLC
Inventor: CHEN, Zhehuai , RAMABHADRAN, Bhuvana , ROSENBERG, Andrew M. , ZHANG, Yu , MENGIBAR, Pedro J. Moreno
IPC: G10L15/06 , G10L15/16 , G10L13/047 , G10L13/08 , G10L15/063
Abstract: A method (500) includes receiving training data that includes unspoken text utterances (320) and un-transcribed non-synthetic speech utterances (306). Each unspoken text utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation (332) for each unspoken text utterance of the received training data using a text-to-speech model (330). The method also includes pre-training an audio encoder (210) on the synthetic speech representations generated for the unspoken text utterances and the un-transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.
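A minimal sketch of how the two unpaired data sources could be pooled for pre-training; `tts`, `audio_encoder`, and the self-supervised objective are placeholders, not the patent's actual models:

```python
# Pool TTS-synthesized speech with untranscribed real speech for encoder
# pre-training (WO2023278952A1); all callables here are placeholders.
def build_pretraining_batch(unspoken_texts, untranscribed_audio, tts):
    """Returns a mixed batch of synthetic and real speech inputs."""
    synthetic = [tts(text) for text in unspoken_texts]   # text -> waveform
    return synthetic + list(untranscribed_audio)         # jointly pre-train

# Assumed usage, with a hypothetical self-supervised objective:
# for batch in loader(build_pretraining_batch(texts, audio, tts)):
#     loss = self_supervised_objective(audio_encoder(batch))
```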
-
Publication No.: WO2023276251A1
Publication Date: 2023-01-05
Application No.: PCT/JP2022/006283
Application Date: 2022-02-04
Applicant: MITSUBISHI ELECTRIC CORPORATION
Inventor: MORITZ, Niko , HORI, Takaaki , LE ROUX, Jonathan
Abstract: The present disclosure provides an artificial intelligence (AI) system for sequence-to-sequence modeling with attention adapted for streaming applications. The AI system comprises at least one processor; and memory having instructions stored thereon that, when executed by the processor, cause the AI system to process each input frame in a sequence of input frames through layers of a deep neural network (DNN) to produce a sequence of outputs. At least some of the layers of the DNN include a dual self-attention module having a dual non-causal and causal architecture attending to non-causal frames and causal frames. Further, the AI system renders the sequence of outputs.
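A compact sketch of a dual causal/non-causal self-attention step, assuming a single unprojected head, a fixed lookahead window for the non-causal branch, and combination by summation; all three choices are assumptions for illustration:

```python
# Dual causal / non-causal self-attention (WO2023276251A1); the lookahead
# window and summation combine rule are illustrative assumptions.
import torch

def dual_self_attention(x, lookahead=2):
    """x: (T, d) frame sequence; returns (T, d) combined attention output."""
    T, d = x.shape
    scores = x @ x.T / d ** 0.5                    # (T, T) similarity scores
    idx = torch.arange(T)
    causal_mask = idx[None, :] <= idx[:, None]     # past and present only
    noncausal_mask = idx[None, :] <= idx[:, None] + lookahead

    def attend(mask):
        s = scores.masked_fill(~mask, float("-inf"))
        return torch.softmax(s, dim=-1) @ x

    return attend(causal_mask) + attend(noncausal_mask)
```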
-
Publication No.: WO2022271570A1
Publication Date: 2022-12-29
Application No.: PCT/US2022/034084
Application Date: 2022-06-17
Applicant: AMAZON TECHNOLOGIES, INC.
Inventor: KARLAPATI, Sri Vishnu Kumar , KARANASOU, Panagiota , JOLY, Arnaud Vincent Pierre Yves , MOINET, Alexis Pierre , DRUGMAN, Thomas Renaud , MAKAROV, Petr , BOLLEPALLI, Bajibabu , ABBAS, Syed Ammar , SLANGEN, Simon
IPC: G10L13/08 , G10L15/16 , G10L15/183
Abstract: Techniques for utilizing memory for a neural network are described. For example, some techniques use a plurality of memory types to respond to a query from a neural network: a short-term memory that stores fine-grained information for recent text of a document and returns a first value in response, an episodic long-term memory that stores information discarded from the short-term memory in a compressed form and returns a second value in response, and a semantic long-term memory that stores relevant facts per entity in the document.
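A toy sketch of the three-tier layout using plain containers in place of learned memories; the eviction, "compression" by truncation, and substring lookup are stand-ins for illustration only:

```python
# Three-tier memory layout (WO2022271570A1); dict/deque stores stand in for
# the learned short-term, episodic, and semantic memories.
from collections import deque

class TieredMemory:
    def __init__(self, short_term_size=8):
        self.short_term = deque(maxlen=short_term_size)  # recent, fine-grained
        self.episodic = []                               # compressed evictions
        self.semantic = {}                               # entity -> facts

    def write(self, text):
        if len(self.short_term) == self.short_term.maxlen:
            oldest = self.short_term[0]
            self.episodic.append(oldest[:32])  # crude "compression" stand-in
        self.short_term.append(text)

    def read(self, query, entity=None):
        first = [t for t in self.short_term if query in t]   # first value
        second = [t for t in self.episodic if query in t]    # second value
        facts = self.semantic.get(entity, [])                # per-entity facts
        return first, second, facts
```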
-
Publication No.: WO2022267380A1
Publication Date: 2022-12-29
Application No.: PCT/CN2021/137489
Application Date: 2021-12-13
Applicant: CloudMinds (Beijing) Technologies Co., Ltd. (达闼科技(北京)有限公司)
IPC: G06K9/00 , G06K9/62 , G06N3/04 , G10L15/08 , G10L15/16 , G06F18/214 , G06F18/22 , G06F18/241 , G06N3/045
Abstract: Embodiments of the present invention relate to the field of computer information technology and disclose a speech-driven facial action synthesis method, an electronic device, and a storage medium. A speech signal of a facial action to be recognized is processed to obtain an audio vector corresponding to the speech signal; the audio vector is input into a parameter recognition model for processing, which outputs facial muscle movement parameters corresponding to the facial action to be recognized; and the facial muscle movement parameters are used to control the movement of corner points on a plurality of elastic bodies in a face model, the elastic bodies being partitioned according to the distribution of facial muscles, thereby obtaining the facial action result. This scheme is broadly applicable to character models with varying numbers of corner points, and the output facial actions are rich, with natural expressions.
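A hedged sketch of the described pipeline, audio vector to muscle parameters to corner-point displacement; the MLP head, parameter count, and linear displacement rule are assumptions:

```python
# Audio vector -> muscle parameters -> corner-point motion (WO2022267380A1);
# the model and the linear displacement rule are illustrative assumptions.
import torch
import torch.nn as nn

NUM_MUSCLES = 12                        # assumed number of muscle regions

param_model = nn.Sequential(            # audio vector -> muscle parameters
    nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, NUM_MUSCLES), nn.Sigmoid())

def drive_face(audio_vec, corner_points, muscle_of_point, directions):
    """corner_points: (P, 3) vertices; muscle_of_point: (P,) muscle index per
    point; directions: (P, 3) unit displacement direction per point."""
    params = param_model(audio_vec)     # (NUM_MUSCLES,) activation in [0, 1]
    # Displace each corner point along its direction by its muscle's activation.
    return corner_points + directions * params[muscle_of_point].unsqueeze(-1)
```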
-