INFORMATION PROCESSING DEVICE, INFORMATION PROCESSING METHOD, AND INFORMATION PROCESSING PROGRAM

    Publication No.: WO2023085191A1

    Publication Date: 2023-05-19

    Application No.: PCT/JP2022/041037

    Application Date: 2022-11-02

    Inventor: 小野 淳也

    Abstract: An information processing device (100) includes: an acquisition unit (131) that acquires an utterance group containing a plurality of turns, each turn being a unit obtained by segmenting utterances under a predetermined condition; a preprocessing unit (132) that combines a current turn, i.e. the turn for which an output on a predetermined task is to be obtained, with a plurality of turns positioned before and after that current turn in the time series into a single sentence, inputs the sentence into a pre-trained model, and thereby outputs a feature for each turn; and an estimation unit (133) that obtains the output for the predetermined task using a neural network that takes the features output by the preprocessing unit as input and, in one of its intermediate layers, applies a predetermined weighting to the score output by an attention mechanism for the feature corresponding to the current turn.
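
    A minimal PyTorch sketch of the current-turn attention weighting the abstract describes; the module, parameter names, and the scalar weighting scheme are assumptions for illustration, not the patent's implementation:

```python
# Hypothetical sketch: self-attention over per-turn features where the
# attention scores pointing at the current turn get a predetermined weight.
import torch
import torch.nn.functional as F


class CurrentTurnAttention(torch.nn.Module):
    """Self-attention over per-turn features that re-weights the current turn's score."""

    def __init__(self, dim: int, current_weight: float = 2.0):
        super().__init__()
        self.q = torch.nn.Linear(dim, dim)
        self.k = torch.nn.Linear(dim, dim)
        self.v = torch.nn.Linear(dim, dim)
        self.current_weight = current_weight  # the "predetermined weighting" (assumed scalar)

    def forward(self, turn_feats: torch.Tensor, current_idx: int) -> torch.Tensor:
        # turn_feats: (num_turns, dim), e.g. one feature per turn from a pre-trained encoder
        scores = self.q(turn_feats) @ self.k(turn_feats).T / turn_feats.size(-1) ** 0.5
        scores[:, current_idx] = scores[:, current_idx] * self.current_weight
        return F.softmax(scores, dim=-1) @ self.v(turn_feats)


feats = torch.randn(5, 64)                        # current turn plus 4 context turns
out = CurrentTurnAttention(64)(feats, current_idx=2)
print(out.shape)                                  # torch.Size([5, 64])
```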

    TRAINING FOR LONG-FORM SPEECH RECOGNITION

    Publication No.: WO2023060002A1

    Publication Date: 2023-04-13

    Application No.: PCT/US2022/077124

    Application Date: 2022-09-27

    Applicant: GOOGLE LLC

    Abstract: A method (700) includes obtaining training samples (400), each training sample including a corresponding sequence of speech segments (405) corresponding to a training utterance and a corresponding sequence of ground-truth transcriptions (415) for the sequence of speech segments, and each ground-truth transcription including a start time (414) and an end time (416) of a corresponding speech segment. For each of the training samples, the method includes processing, using a speech recognition model (200), the corresponding sequence of speech segments to obtain one or more speech recognition hypotheses (522) for the training utterance; and, for each speech recognition hypothesis obtained for the training utterance, identifying a respective number of word errors relative to the corresponding sequence of ground-truth transcriptions. The method trains the speech recognition model to minimize word error rate based on the respective number of word errors identified for each speech recognition hypothesis obtained for the training utterance.
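
    A hedged sketch of the minimum word-error-rate objective the abstract outlines: each n-best hypothesis is scored by its word errors against the ground-truth transcription, and the model's normalized hypothesis probabilities are weighted by the baseline-subtracted error counts. Function and variable names are illustrative, not Google's code:

```python
# Illustrative MWER loss over an n-best list; not the patent's implementation.
import torch


def word_errors(hyp: str, ref: str) -> int:
    """Levenshtein distance over words (substitutions + insertions + deletions)."""
    h, r = hyp.split(), ref.split()
    d = [[i + j if i * j == 0 else 0 for j in range(len(r) + 1)] for i in range(len(h) + 1)]
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (h[i - 1] != r[j - 1]))
    return d[len(h)][len(r)]


def mwer_loss(hyp_log_probs: torch.Tensor, hyps: list[str], ref: str) -> torch.Tensor:
    # Normalize probabilities over the n-best list, then weight each hypothesis
    # by how many more (or fewer) word errors it has than the list average.
    errs = torch.tensor([float(word_errors(h, ref)) for h in hyps])
    probs = torch.softmax(hyp_log_probs, dim=0)
    return (probs * (errs - errs.mean())).sum()


ref = "hello world how are you"
hyps = ["hello world how are you", "hello word how are you", "hello world"]
loss = mwer_loss(torch.tensor([-1.0, -2.0, -3.0], requires_grad=True), hyps, ref)
loss.backward()  # gradients push probability mass toward low-error hypotheses
```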

    SPEECH RECOGNITION DEVICE, SPEECH RECOGNITION METHOD, AND SPEECH RECOGNITION PROGRAM

    Publication No.: WO2023012994A1

    Publication Date: 2023-02-09

    Application No.: PCT/JP2021/029212

    Application Date: 2021-08-05

    Abstract: A speech recognition device (1b) includes a label estimation unit (103), a trigger-fired label estimation unit (302), and an RNN-T trigger estimation unit (301). The label estimation unit (103) uses a model trained by RNN-T to predict a symbol sequence for speech data based on an intermediate acoustic feature sequence and an intermediate symbol feature sequence of the speech data. The trigger-fired label estimation unit (302) uses an attention mechanism to predict the next symbol of the speech data based on the intermediate acoustic feature sequence. The RNN-T trigger estimation unit (301) calculates, based on the symbol sequence predicted by the label estimation unit (103), the timing at which the probability of a non-blank symbol occurring in the speech data is maximized, and outputs that timing as the trigger that operates the trigger-fired label estimation unit (302).
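
    An illustrative sketch of the trigger computation, reduced to a single trigger for brevity: find the frame where the RNN-T non-blank probability peaks and fire the attention-based decoder there. All names are assumptions, not taken from the patent:

```python
# Toy trigger computation from per-frame RNN-T output probabilities.
import torch


def rnnt_trigger_frame(label_log_probs: torch.Tensor, blank_id: int = 0) -> torch.Tensor:
    """label_log_probs: (frames, vocab) per-frame RNN-T output distribution.
    Returns the frame where the non-blank probability is maximal."""
    non_blank = 1.0 - label_log_probs.exp()[:, blank_id]  # P(any symbol other than blank)
    return non_blank.argmax()                             # trigger timing


log_probs = torch.log_softmax(torch.randn(20, 30), dim=-1)  # 20 frames, 30 symbols
print(f"fire the attention decoder at frame {int(rnnt_trigger_frame(log_probs))}")
```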

    NEURAL NETWORK TRAINING AND SEGMENTING AN AUDIO RECORDING FOR EMOTION RECOGNITION

    Publication No.: WO2023009020A1

    Publication Date: 2023-02-02

    Application No.: PCT/RU2021/000316

    Application Date: 2021-07-26

    Abstract: This invention relates to a method of training a neural network for emotion recognition in speech segments and to a system for segmenting speech and recognizing an emotion in said speech segments; more particularly, the invention is directed to selecting speech segments with a required emotion from long audio recordings. The presented method of training a neural network for emotion recognition in a speech segment includes the following steps: freezing an OpenL3 convolutional neural network; forming a labeled utterances database containing utterances not exceeding 10 seconds in length, wherein a corresponding emotion label or a noise label is attributed to each utterance by a group of assessors from which any assessor failing to meet a Fleiss' kappa agreement level of 0.4 is excluded; training a low-capacity recurrent neural network built on said pre-trained OpenL3 convolutional neural network using the formed labeled utterances database; and unfreezing the upper layers of said pre-trained OpenL3 convolutional neural network for further training of the neural network.
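
    A minimal PyTorch sketch, assuming a stand-in module for the pre-trained OpenL3 backbone, of the freeze / train-head / unfreeze-upper-layers schedule in the steps above; the layer shapes and learning rates are placeholders:

```python
# Placeholder backbone and head; not the actual OpenL3 architecture.
import torch

backbone = torch.nn.Sequential(              # stands in for pre-trained OpenL3
    torch.nn.Conv1d(1, 16, 5), torch.nn.ReLU(),
    torch.nn.Conv1d(16, 32, 5), torch.nn.ReLU(),
)
head = torch.nn.GRU(input_size=32, hidden_size=8, batch_first=True)  # low-capacity RNN

# Step 1: freeze the convolutional backbone.
for p in backbone.parameters():
    p.requires_grad = False

# Step 2: train only the recurrent head on the labeled <=10 s utterances.
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

# Step 3: unfreeze the upper backbone layers and continue training them too.
for p in backbone[2].parameters():           # "upper" layers: the last conv block here
    p.requires_grad = True
opt.add_param_group({"params": backbone[2].parameters(), "lr": 1e-4})
```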

    INJECTING TEXT IN SELF-SUPERVISED SPEECH PRE-TRAINING

    Publication No.: WO2023278952A1

    Publication Date: 2023-01-05

    Application No.: PCT/US2022/073067

    Application Date: 2022-06-21

    Applicant: GOOGLE LLC

    Abstract: A method (500) includes receiving training data that includes unspoken text utterances (320) and un-transcribed non-synthetic speech utterances (306). Each unspoken text utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation (332) for each unspoken textual utterance of the received training data using a text-to-speech model (330). The method also includes pre-training an audio encoder (210) on the synthetic speech representations generated for the unspoken textual utterances and the un-transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.
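
    A hedged sketch of the data flow the abstract describes: a TTS model synthesizes speech for the unspoken text utterances, and a single audio encoder is then pre-trained on both the synthetic and the real un-transcribed speech. `TTS` and `encoder` here are toy placeholders, not the patent's models (330, 210):

```python
# Toy data flow for text injection into speech pre-training.
import torch


class TTS(torch.nn.Module):                  # placeholder text-to-speech model
    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return torch.randn(token_ids.size(0), 16000)  # fake 1 s waveforms


encoder = torch.nn.Sequential(torch.nn.Conv1d(1, 64, 10, stride=5), torch.nn.ReLU())
tts = TTS()

unspoken_text = torch.randint(0, 100, (4, 12))         # 4 text-only utterances
untranscribed_speech = torch.randn(4, 16000)           # 4 audio-only utterances

synthetic_speech = tts(unspoken_text)                  # synthetic representations
batch = torch.cat([synthetic_speech, untranscribed_speech]).unsqueeze(1)
features = encoder(batch)                              # shared speech/text representations
print(features.shape)
```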
