Invention Grant
- Patent Title: Extreme language model compression with optimal sub-words and shared projections
- Application No.: US16749570
- Application Date: 2020-01-22
- Publication No.: US11797862B2
- Publication Date: 2023-10-24
- Inventor: Yang Song, Raghav Gupta, Dengyong Zhou, Sanqiang Zhao
- Applicant: Google LLC
- Applicant Address: Mountain View, CA, US
- Assignee: GOOGLE LLC
- Current Assignee: GOOGLE LLC
- Current Assignee Address: Mountain View, CA, US
- Agency: Dority & Manning, P.A.
- Main IPC: G06N3/088
- IPC: G06N3/088; G06F40/284; G06N3/045

Abstract:
Provided is a knowledge distillation technique for training a student language model that, relative to a larger teacher language model, has a significantly smaller vocabulary, lower embedding dimensions, and/or lower hidden-state dimensions. Specifically, aspects of the present disclosure are directed to a dual-training mechanism that trains the teacher and student language models simultaneously to obtain optimal word embeddings for the student vocabulary. In some implementations, this approach can be combined with learning shared projection matrices that transfer layer-wise knowledge from the teacher language model to the student language model. Example experimental results have also demonstrated higher compression efficiency and accuracy when compared with other state-of-the-art compression techniques, including the ability to compress the BERT_BASE model by more than 60×, with only a minor drop in downstream task metrics, resulting in a language model with a footprint of under 7 MB.
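
The sketch below illustrates the shared-projection idea described in the abstract: two projection matrices, shared across all layers, map between the teacher's and student's hidden-state spaces so that a layer-wise loss can transfer knowledge despite the dimensionality mismatch. This is a minimal, hypothetical PyTorch sketch; the names (d_teacher, d_student, proj_down, proj_up) and the exact loss formulation are assumptions for illustration, not the patent's actual implementation, and the dual-training of the student vocabulary embeddings is not shown.

```python
# Hypothetical sketch of layer-wise knowledge transfer with shared projections.
# Assumed: teacher and student have the same number of layers; dimensions are illustrative.
import torch
import torch.nn as nn

d_teacher, d_student = 768, 192   # assumed hidden sizes (e.g., BERT_BASE teacher, small student)

# Two projection matrices shared across all layers:
# proj_down maps teacher hidden states into the student space,
# proj_up maps student hidden states into the teacher space.
proj_down = nn.Linear(d_teacher, d_student, bias=False)
proj_up = nn.Linear(d_student, d_teacher, bias=False)
mse = nn.MSELoss()

def layerwise_projection_loss(teacher_hidden, student_hidden):
    """teacher_hidden / student_hidden: per-layer lists of [batch, seq, dim] tensors."""
    loss = 0.0
    for h_t, h_s in zip(teacher_hidden, student_hidden):
        # Teacher states are detached so gradients only update the student and projections.
        loss = loss + mse(proj_down(h_t.detach()), h_s)  # student matches projected teacher
        loss = loss + mse(proj_up(h_s), h_t.detach())    # projected student matches teacher
    return loss / len(teacher_hidden)
```

In a full training setup, a loss of this form would typically be added to the student's primary objective (e.g., masked language modeling or output-level distillation), with the shared projection matrices trained jointly with the student parameters.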
Public/Granted literature
- US20210224660A1, Extreme Language Model Compression with Optimal Sub-Words and Shared Projections, Public/Granted day: 2021-07-22