Patent search ap:("GOOGLE LLC") AND inv:"YU Page Jiahui"

1.

Invention application
CASCADED ENCODERS FOR SIMPLIFIED STREAMING AND NON-STREAMING SPEECH RECOGNITION
Status: Under examination (published)

Publication No.: WO2022086589A1

Publication date: 2022-04-28

Application No.: PCT/US2021/030364

Filing date: 2021-05-01

Applicant: GOOGLE LLC

Inventor: NARAYANAN, Arun; SAINATH, Tara; CHIU, Chung-cheng; PANG, Ruoming; YU, Jiahui; VARIANI, Ehsan; STROHMAN, Trevor

IPC: G10L15/16, G10L15/32, G06N3/04

Abstract: An automated speech recognition (ASR) model (200) includes a first encoder (210), a second encoder (220), and a decoder (204). The first encoder receives, as input, a sequence of acoustic frames (110), and generates, at each of a plurality of output steps, a first higher order feature representation (203) for a corresponding acoustic frame. The second encoder receives, as input, the first higher order feature representation generated by the first encoder at each of the plurality of output steps, and generates, at each of the plurality of output steps, a second higher order feature representation (205) for a corresponding first higher order feature frame. The decoder receives, as input, the second higher order feature representation generated by the second encoder at each of the plurality of output steps, and generates, at each of the plurality of time steps, a first probability distribution over possible speech recognition hypotheses.
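
The data flow in this abstract (acoustic frames, then a first encoder, then a second encoder, then a decoder that emits one probability distribution per step) can be sketched as below. This is only an illustrative toy, not the patented model: the feature computations, the `toy_weights` matrix, and all function names are assumptions made for the example.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def first_encoder(acoustic_frames):
    # Toy stand-in: one "first higher order" feature vector per acoustic frame.
    return [[sum(f) / len(f), max(f)] for f in acoustic_frames]

def second_encoder(first_features):
    # Toy stand-in: consumes the first encoder's output, one vector per step.
    return [[a + b, a - b] for a, b in first_features]

def decoder(second_features, weights):
    # One probability distribution over hypotheses per output step.
    return [
        softmax([sum(w * x for w, x in zip(row, feat)) for row in weights])
        for feat in second_features
    ]

def cascaded_asr(acoustic_frames, weights):
    # The second encoder is cascaded on top of the first; the decoder reads
    # the second encoder's output.
    return decoder(second_encoder(first_encoder(acoustic_frames)), weights)

frames = [[0.1, 0.4, 0.2], [0.9, 0.3, 0.5]]          # hypothetical input frames
toy_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # hypothetical 3-way output
distributions = cascaded_asr(frames, toy_weights)
```

Each entry of `distributions` corresponds to one input frame and sums to 1.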

2.

Invention application
FAST EMIT LOW-LATENCY STREAMING ASR WITH SEQUENCE-LEVEL EMISSION REGULARIZATION
Status: Under examination (published)

Publication No.: WO2022086640A1

Publication date: 2022-04-28

Application No.: PCT/US2021/049738

Filing date: 2021-09-09

Applicant: GOOGLE LLC

Inventor: YU, Jiahui; CHIU, Chung-Cheng; LI, Bo; CHANG, Shuo-Yiin; SAINATH, Tara N.; HAN, Wei; GULATI, Anmol; HE, Yanzhang; NARAYANAN, Arun; WU, Yonghui; PANG, Ruoming

IPC: G10L15/06, G06N3/08

Abstract: A computer-implemented method (400) of training a streaming speech recognition model (200) includes receiving, as input to the streaming speech recognition model, a sequence of acoustic frames (122). The streaming speech recognition model is configured to learn an alignment probability (206) between the sequence of acoustic frames and an output sequence of vocabulary tokens (204). The vocabulary tokens include a plurality of label tokens and a blank token. At each output step, the method includes determining a first probability (264) of emitting one of the label tokens and determining a second probability (266) of emitting the blank token. The method also includes generating the alignment probability at a sequence level based on the first probability and the second probability. The method also includes applying a tuning parameter (282) to the alignment probability at the sequence level to maximize the first probability of emitting one of the label tokens.
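
One way a tuning parameter could favor label emission over blank emission is sketched below for a single output step. The abstract does not give the formula, so the additive form used here (base negative log-likelihood plus a weighted label-only term) is an assumption, and the names `regularized_step_loss`, `label_path_prob`, and `blank_path_prob` are hypothetical.

```python
import math

def regularized_step_loss(label_path_prob, blank_path_prob, tuning_parameter):
    # Alignment probability through this step: either a label or a blank is emitted.
    alignment_prob = label_path_prob + blank_path_prob
    base_loss = -math.log(alignment_prob)
    # Assumed regularizer: an extra penalty that shrinks as more probability
    # mass sits on the label path, so minimizing it encourages emitting labels.
    label_penalty = -math.log(label_path_prob)
    return base_loss + tuning_parameter * label_penalty

# Same total alignment probability (0.8), different label/blank split.
eager = regularized_step_loss(0.6, 0.2, tuning_parameter=0.01)
lazy = regularized_step_loss(0.2, 0.6, tuning_parameter=0.01)
unregularized = regularized_step_loss(0.2, 0.6, tuning_parameter=0.0)
```

With the tuning parameter at zero the loss reduces to the plain negative log alignment probability; with a positive value, the split that puts more mass on the label path gets the lower loss.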

3.

Invention application
SINGLE-STAGE MODEL TRAINING FOR NEURAL ARCHITECTURE SEARCH
Status: Under examination (published)

Publication No.: WO2021178916A1

Publication date: 2021-09-10

Application No.: PCT/US2021/021234

Filing date: 2021-03-05

Applicant: GOOGLE LLC

Inventor: YU, Jiahui; JIN, Pengchong; LIU, Hanxiao; BENDER, Gabriel Mintzer; KINDERMANS, Pieter-Jan; TAN, Mingxing; SONG, Xiaodan; PANG, Ruoming; LE, Quoc V.

IPC: G06N3/04, G06N3/08

Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for selecting a neural network to perform a particular machine learning task while satisfying a set of constraints.

4.

Invention application
VECTOR-QUANTIZED IMAGE MODELING
Status: Under examination (published)

Publication No.: WO2023059699A1

Publication date: 2023-04-13

Application No.: PCT/US2022/045756

Filing date: 2022-10-05

Applicant: GOOGLE LLC

Inventor: YU, Jiahui; LI, Xin; ZHANG, Han; VASUDEVAN, Vijay; KU, Alexander Yeong-Shiuh; BALDRIDGE, Jason Michael; XU, Yuanzhong; KOH, Jing Yu; LUONG, Thang Minh; BAID, Gunjan; WANG, Zirui; WU, Yonghui

IPC: H04N19/94, H04N19/61, H04N19/46, G06N3/02, H04N19/12, H04N19/124, H04N19/17, H04N19/463, H04N19/467

Abstract: Systems and methods are provided for vector-quantized image modeling using vision transformers and improved codebook handling. In particular, the present disclosure provides a Vector-quantized Image Modeling (VIM) approach that involves pretraining a machine learning model (e.g., Transformer model) to predict rasterized image tokens autoregressively. The discrete image tokens can be encoded from a learned Vision-Transformer-based VQGAN (example implementations of which can be referred to as ViT-VQGAN). The present disclosure proposes multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional image generation, conditioned image generation (e.g., class-conditioned image generation), and unsupervised representation learning.
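
The step of turning an image into rasterized discrete tokens can be illustrated with a plain nearest-neighbor codebook lookup. This is a generic vector-quantization sketch, not the ViT-VQGAN described in the application: the squared Euclidean distance, the tiny `codebook`, and the 2x2 `patch_grid` are all assumptions chosen for the example.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rasterize(grid):
    # Flatten a 2D grid of patch embeddings in row-major (raster) order.
    return [cell for row in grid for cell in row]

def quantize(vectors, codebook):
    # Map each vector to the index of its nearest codebook entry.
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(v, codebook[k]))
        for v in vectors
    ]

codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]   # hypothetical learned codes
patch_grid = [
    [[0.1, -0.1], [0.2, 0.9]],
    [[0.8, 1.1], [-0.2, 0.1]],
]                                                 # hypothetical encoder output
image_tokens = quantize(rasterize(patch_grid), codebook)
```

The resulting `image_tokens` list is the kind of discrete sequence an autoregressive model would then be trained to predict token by token.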

5.

Invention application
SYSTEMS AND METHODS FOR TRAINING DUAL-MODE MACHINE-LEARNED SPEECH RECOGNITION MODELS
Status: Under examination (published)

Publication No.: WO2022072801A2

Publication date: 2022-04-07

Application No.: PCT/US2021/053128

Filing date: 2021-10-01

Applicant: GOOGLE LLC

Inventor: YU, Jiahui; PANG, Ruoming; HAN, Wei; GULATI, Anmol; CHIU, Chung-Cheng; LI, Bo; SAINATH, Tara N.; WU, Yonghui

IPC: G10L15/16, G10L15/22

Abstract: Systems and methods of the present disclosure are directed to a computing system, including one or more processors and a machine-learned multi-mode speech recognition model configured to operate in a streaming recognition mode or a contextual recognition mode. The computing system can perform operations including obtaining speech data and a ground truth label and processing the speech data using the contextual recognition mode to obtain contextual prediction data. The operations can include evaluating a difference between the contextual prediction data and the ground truth label and processing the speech data using the streaming recognition mode to obtain streaming prediction data. The operations can include evaluating a difference between the streaming prediction data and the ground truth label and the contextual and streaming prediction data. The operations can include adjusting parameters of the speech recognition model.
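
The training signal described here can be read as three comparisons: contextual prediction against the label, streaming prediction against the label, and streaming prediction against contextual prediction. The sketch below combines them for a single prediction; the use of cross-entropy and KL divergence, the `distill_weight` parameter, and the function names are assumptions, since the abstract only speaks of "evaluating a difference".

```python
import math

def cross_entropy(prediction, label_index):
    # Difference between a predicted distribution and the ground truth label.
    return -math.log(prediction[label_index])

def kl_divergence(reference, prediction):
    # Difference between two predicted distributions.
    return sum(r * math.log(r / p) for r, p in zip(reference, prediction) if r > 0)

def dual_mode_loss(contextual_pred, streaming_pred, label_index, distill_weight=1.0):
    contextual_loss = cross_entropy(contextual_pred, label_index)
    # The streaming loss also compares streaming output with contextual output.
    streaming_loss = cross_entropy(streaming_pred, label_index) + (
        distill_weight * kl_divergence(contextual_pred, streaming_pred)
    )
    return contextual_loss + streaming_loss

matched = dual_mode_loss([0.7, 0.3], [0.7, 0.3], label_index=0)
mismatched = dual_mode_loss([0.7, 0.3], [0.5, 0.5], label_index=0)
```

In a real training loop the combined loss would drive a single parameter update for the shared model; here it is only evaluated on fixed toy distributions.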
