-
Publication No.: WO2022086589A1
Publication Date: 2022-04-28
Application No.: PCT/US2021/030364
Filing Date: 2021-05-01
Applicant: GOOGLE LLC
Inventor: NARAYANAN, Arun; SAINATH, Tara; CHIU, Chung-cheng; PANG, Ruoming; YU, Jiahui; VARIANI, Ehsan; STROHMAN, Trevor
Abstract: An automated speech recognition (ASR) model (200) includes a first encoder (210), a second encoder (220), and a decoder (204). The first encoder receives, as input, a sequence of acoustic frames (110), and generates, at each of a plurality of output steps, a first higher order feature representation (203) for a corresponding acoustic frame. The second encoder receives, as input, the first higher order feature representation generated by the first encoder at each of the plurality of output steps, and generates, at each of the plurality of output steps, a second higher order feature representation (205) for a corresponding first higher order feature frame. The decoder receives, as input, the second higher order feature representation generated by the second encoder at each of the plurality of output steps, and generates, at each of the plurality of time steps, a first probability distribution over possible speech recognition hypotheses.
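The data flow in this abstract (acoustic frames, then a first encoder, then a second encoder, then a decoder) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy weight matrices, the `tanh` layers, and the function names are hypothetical and are not taken from the publication, which does not specify the encoder internals here.

```python
import math
from typing import List

Vector = List[float]

def linear(x: Vector, weights: List[Vector]) -> Vector:
    """Multiply vector x by a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def softmax(logits: Vector) -> Vector:
    """Turn logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy weights (assumptions, not values from the publication).
W_FIRST = [[0.5, -0.2], [0.1, 0.3]]
W_SECOND = [[1.0, 0.5], [-0.5, 1.0]]
W_DECODER = [[0.2, 0.1], [0.4, -0.3], [-0.1, 0.6]]  # 3 candidate hypotheses

def first_encoder(frames: List[Vector]) -> List[Vector]:
    """One first higher order feature per acoustic frame."""
    return [[math.tanh(v) for v in linear(f, W_FIRST)] for f in frames]

def second_encoder(first_feats: List[Vector]) -> List[Vector]:
    """One second higher order feature per first higher order feature."""
    return [[math.tanh(v) for v in linear(h, W_SECOND)] for h in first_feats]

def decoder(second_feats: List[Vector]) -> List[Vector]:
    """One probability distribution over hypotheses per output step."""
    return [softmax(linear(h, W_DECODER)) for h in second_feats]

def recognize(frames: List[Vector]) -> List[Vector]:
    """Cascade: frames -> first encoder -> second encoder -> decoder."""
    return decoder(second_encoder(first_encoder(frames)))
```

Calling `recognize([[0.1, 0.2], [0.3, -0.4]])` returns one distribution per input frame, which mirrors the per-output-step structure described in the abstract.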
-
Publication No.: WO2022086640A1
Publication Date: 2022-04-28
Application No.: PCT/US2021/049738
Filing Date: 2021-09-09
Applicant: GOOGLE LLC
Inventor: YU, Jiahui; CHIU, Chung-Cheng; LI, Bo; CHANG, Shuo-Yiin; SAINATH, Tara N.; HAN, Wei; GULATI, Anmol; HE, Yanzhang; NARAYANAN, Arun; WU, Yonghui; PANG, Ruoming
Abstract: A computer-implemented method (400) of training a streaming speech recognition model (200) includes receiving, as input to the streaming speech recognition model, a sequence of acoustic frames (122). The streaming speech recognition model is configured to learn an alignment probability (206) between the sequence of acoustic frames and an output sequence of vocabulary tokens (204). The vocabulary tokens include a plurality of label tokens and a blank token. At each output step, the method includes determining a first probability (264) of emitting one of the label tokens and determining a second probability (266) of emitting the blank token. The method also includes generating the alignment probability at a sequence level based on the first probability and the second probability. The method also includes applying a tuning parameter (282) to the alignment probability at the sequence level to maximize the first probability of emitting one of the label tokens.
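One way to picture the interaction between the label probability, the blank probability, and the tuning parameter is the sketch below. It is heavily simplified and rests on stated assumptions: it scores a single alignment path, whereas a transducer-style model would sum over all valid paths, and the additive regularizer and the names `path_log_prob`, `regularized_loss`, and `tuning` are hypothetical illustrations rather than the formulation claimed in the publication.

```python
import math
from typing import List, Tuple

# One output step: (probability of the label token, probability of blank,
# whether this step emits the label). Values are illustrative only.
Step = Tuple[float, float, bool]

def path_log_prob(path: List[Step]) -> float:
    """Sequence-level log-probability of a single alignment path."""
    return sum(
        math.log(p_label if emits_label else p_blank)
        for p_label, p_blank, emits_label in path
    )

def regularized_loss(path: List[Step], tuning: float) -> float:
    """Negative log-likelihood plus an extra, tuning-weighted label term.

    Assumption: the tuning parameter up-weights steps that emit a label,
    so minimizing the loss pushes label probabilities up more strongly
    than blank probabilities.
    """
    label_term = sum(
        math.log(p_label) for p_label, _, emits_label in path if emits_label
    )
    return -(path_log_prob(path) + tuning * label_term)
```

With `tuning = 0.0` the loss reduces to the plain negative log-likelihood of the path; a positive value adds extra pressure on label-emitting steps.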
-
Publication No.: WO2021178916A1
Publication Date: 2021-09-10
Application No.: PCT/US2021/021234
Filing Date: 2021-03-05
Applicant: GOOGLE LLC
Inventor: YU, Jiahui; JIN, Pengchong; LIU, Hanxiao; BENDER, Gabriel Mintzer; KINDERMANS, Pieter-Jan; TAN, Mingxing; SONG, Xiaodan; PANG, Ruoming; LE, Quoc V.
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for selecting a neural network to perform a particular machine learning task while satisfying a set of constraints.
-
Publication No.: WO2023059699A1
Publication Date: 2023-04-13
Application No.: PCT/US2022/045756
Filing Date: 2022-10-05
Applicant: GOOGLE LLC
Inventor: YU, Jiahui; LI, Xin; ZHANG, Han; VASUDEVAN, Vijay; KU, Alexander Yeong-Shiuh; BALDRIDGE, Jason Michael; XU, Yuanzhong; KOH, Jing Yu; LUONG, Thang Minh; BAID, Gunjan; WANG, Zirui; WU, Yonghui
IPC: H04N19/94, H04N19/61, H04N19/46, G06N3/02, H04N19/12, H04N19/124, H04N19/17, H04N19/463, H04N19/467
Abstract: Systems and methods are provided for vector-quantized image modeling using vision transformers and improved codebook handling. In particular, the present disclosure provides a Vector-quantized Image Modeling (VIM) approach that involves pretraining a machine learning model (e.g., Transformer model) to predict rasterized image tokens autoregressively. The discrete image tokens can be encoded from a learned Vision-Transformer-based VQGAN (example implementations of which can be referred to as ViT-VQGAN). The present disclosure proposes multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional image generation, conditioned image generation (e.g., class-conditioned image generation), and unsupervised representation learning.
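The core step behind "discrete image tokens" can be sketched as a nearest-codebook lookup followed by raster-order flattening. This is a generic illustration of vector quantization under stated assumptions: the squared Euclidean distance, the tiny two-dimensional codebook, and the function names are hypothetical and do not reproduce the ViT-VQGAN codebook handling described in the publication.

```python
from typing import List

Vector = List[float]

def nearest_code(embedding: Vector, codebook: List[Vector]) -> int:
    """Index of the codebook entry closest to the embedding (squared L2)."""
    distances = [
        sum((a - b) ** 2 for a, b in zip(embedding, code)) for code in codebook
    ]
    return distances.index(min(distances))

def rasterize_tokens(grid: List[List[Vector]], codebook: List[Vector]) -> List[int]:
    """Quantize a 2D grid of patch embeddings and flatten it row by row."""
    return [nearest_code(patch, codebook) for row in grid for patch in row]

# Hypothetical 3-entry codebook and a 2x2 grid of patch embeddings.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
grid = [
    [[0.1, 0.0], [0.9, 1.1]],
    [[0.0, 0.8], [0.2, 0.1]],
]
tokens = rasterize_tokens(grid, codebook)  # [0, 1, 2, 0]
```

The resulting token sequence is what an autoregressive model would be trained to predict one token at a time, in raster order.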
-
Publication No.: WO2022072801A2
Publication Date: 2022-04-07
Application No.: PCT/US2021/053128
Filing Date: 2021-10-01
Applicant: GOOGLE LLC
Inventor: YU, Jiahui; PANG, Ruoming; HAN, Wei; GULATI, Anmol; CHIU, Chung-Cheng; LI, Bo; SAINATH, Tara N.; WU, Yonghui
Abstract: Systems and methods of the present disclosure are directed to a computing system, including one or more processors and a machine-learned multi-mode speech recognition model configured to operate in a streaming recognition mode or a contextual recognition mode. The computing system can perform operations including obtaining speech data and a ground truth label and processing the speech data using the contextual recognition mode to obtain contextual prediction data. The operations can include evaluating a difference between the contextual prediction data and the ground truth label and processing the speech data using the streaming recognition mode to obtain streaming prediction data. The operations can include evaluating a difference between the streaming prediction data and the ground truth label and the contextual and streaming prediction data. The operations can include adjusting parameters of the speech recognition model.
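The three comparisons named in this abstract (contextual prediction against the label, streaming prediction against the label, and streaming prediction against contextual prediction) can be combined into one training objective, sketched below. The specific choices are assumptions for illustration: cross-entropy for the label terms, KL divergence for the consistency term, and the `distill_weight` parameter are hypothetical and are not stated in the abstract.

```python
import math
from typing import List

def cross_entropy(pred: List[float], label_index: int) -> float:
    """Difference between a predicted distribution and the ground truth label."""
    return -math.log(pred[label_index])

def kl_divergence(teacher: List[float], student: List[float]) -> float:
    """Difference between two predicted distributions, KL(teacher || student)."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

def dual_mode_loss(
    contextual_pred: List[float],
    streaming_pred: List[float],
    label_index: int,
    distill_weight: float = 1.0,  # hypothetical weighting, not from the source
) -> float:
    """Sum of both label losses plus a streaming-vs-contextual term."""
    return (
        cross_entropy(contextual_pred, label_index)
        + cross_entropy(streaming_pred, label_index)
        + distill_weight * kl_divergence(contextual_pred, streaming_pred)
    )
```

When both modes produce the same distribution, the consistency term is zero and the loss is simply the sum of the two label terms; the parameter adjustment step would then minimize this combined value.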
-