-
Publication Number: US11797862B2
Publication Date: 2023-10-24
Application Number: US16749570
Filing Date: 2020-01-22
Applicant: Google LLC
Inventor: Yang Song, Raghav Gupta, Dengyong Zhou, Sanqiang Zhao
IPC: G06N3/088, G06F40/284, G06N3/045
CPC classification number: G06N3/088, G06F40/284, G06N3/045
Abstract: Provided is a knowledge distillation technique for training a student language model that, relative to a larger teacher language model, has a significantly smaller vocabulary, lower embedding dimensions, and/or hidden state dimensions. Specifically, aspects of the present disclosure are directed to a dual-training mechanism that trains the teacher and student language models simultaneously to obtain optimal word embeddings for the student vocabulary. In some implementations, this approach can be combined with learning shared projection matrices that transfer layer-wise knowledge from the teacher language model to the student language model. Example experimental results have also demonstrated higher compression efficiency and accuracy when compared with other state-of-the-art compression techniques, including the ability to compress the BERTBASE model by more than 60×, with only a minor drop in downstream task metrics, resulting in a language model with a footprint of under 7 MB.
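The following is a minimal, illustrative sketch of the two ideas the abstract names: jointly updating the teacher and student so the student-vocabulary embeddings are learned alongside the teacher, and a single shared projection matrix that maps teacher hidden states into the student's smaller hidden dimension for layer-wise matching. It is not the patented implementation; the module names, dimensions, loss terms, and hyperparameters below are assumptions for demonstration only.

```python
# Minimal sketch (assumed, not the patent's implementation) of joint
# teacher/student training with a shared layer-wise projection matrix.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_TEACHER, D_STUDENT = 768, 192      # hidden sizes: teacher much wider than student
V_TEACHER, V_STUDENT = 30522, 5000   # vocabulary sizes: student vocab is far smaller
N_LAYERS = 4                         # number of layers whose hidden states are aligned

class TinyEncoder(nn.Module):
    """Stand-in for a transformer encoder: an embedding table plus MLP blocks."""
    def __init__(self, vocab_size, dim, n_layers):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(n_layers)
        )

    def forward(self, token_ids):
        h = self.embed(token_ids)
        hidden_states = []
        for layer in self.layers:
            h = layer(h)
            hidden_states.append(h)
        return hidden_states  # one hidden-state tensor per layer

teacher = TinyEncoder(V_TEACHER, D_TEACHER, N_LAYERS)
student = TinyEncoder(V_STUDENT, D_STUDENT, N_LAYERS)

# A single projection matrix, shared across all layers, maps teacher hidden
# states (768-d) down into the student's hidden dimension (192-d).
shared_proj = nn.Linear(D_TEACHER, D_STUDENT, bias=False)

optimizer = torch.optim.Adam(
    list(teacher.parameters()) + list(student.parameters()) + list(shared_proj.parameters()),
    lr=1e-4,
)

def training_step(teacher_ids, student_ids):
    """One joint update. teacher_ids / student_ids stand for the same text
    tokenized with the teacher and student vocabularies respectively (faked
    here with random ids of matching length)."""
    t_hiddens = teacher(teacher_ids)
    s_hiddens = student(student_ids)

    # Layer-wise alignment: project each teacher layer down and match the
    # corresponding student layer. A full setup would also add the teacher's
    # own task loss and a distillation loss on output logits.
    align_loss = sum(
        F.mse_loss(shared_proj(t_h), s_h)
        for t_h, s_h in zip(t_hiddens, s_hiddens)
    ) / N_LAYERS

    optimizer.zero_grad()
    align_loss.backward()
    optimizer.step()
    return align_loss.item()

# Toy usage: a batch of 8 "sentences" of 16 tokens each.
teacher_batch = torch.randint(0, V_TEACHER, (8, 16))
student_batch = torch.randint(0, V_STUDENT, (8, 16))
print(training_step(teacher_batch, student_batch))
```

In this sketch, reusing one projection matrix across every layer (rather than learning a separate matrix per layer) keeps the number of added parameters negligible, which is what makes layer-wise transfer compatible with the aggressive size reduction described in the abstract.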
-
Publication Number: US20240013059A1
Publication Date: 2024-01-11
Application Number: US18471866
Filing Date: 2023-09-21
Applicant: Google LLC
Inventor: Yang Song, Raghav Gupta, Dengyong Zhou, Sanqiang Zhao
IPC: G06N3/0455, G06F40/40, G06N3/08
CPC classification number: G06N3/0455, G06F40/40, G06N3/08, G06F40/284
Abstract: Provided is a knowledge distillation technique for training a student language model that, relative to a larger teacher language model, has a significantly smaller vocabulary, lower embedding dimensions, and/or hidden state dimensions. Specifically, aspects of the present disclosure are directed to a dual-training mechanism that trains the teacher and student language models simultaneously to obtain optimal word embeddings for the student vocabulary. In some implementations, this approach can be combined with learning shared projection matrices that transfer layer-wise knowledge from the teacher language model to the student language model. Example experimental results have also demonstrated higher compression efficiency and accuracy when compared with other state-of-the-art compression techniques, including the ability to compress the BERTBASE model by more than 60×, with only a minor drop in downstream task metrics, resulting in a language model with a footprint of under 7 MB.
-
Publication Number: US20210224660A1
Publication Date: 2021-07-22
Application Number: US16749570
Filing Date: 2020-01-22
Applicant: Google LLC
Inventor: Yang Song, Raghav Gupta, Dengyong Zhou, Sanqiang Zhao
IPC: G06N3/08, G06N3/04, G06F40/284
Abstract: Provided is a knowledge distillation technique for training a student language model that, relative to a larger teacher language model, has a significantly smaller vocabulary, lower embedding dimensions, and/or hidden state dimensions. Specifically, aspects of the present disclosure are directed to a dual-training mechanism that trains the teacher and student language models simultaneously to obtain optimal word embeddings for the student vocabulary. In some implementations, this approach can be combined with learning shared projection matrices that transfer layer-wise knowledge from the teacher language model to the student language model. Example experimental results have also demonstrated higher compression efficiency and accuracy when compared with other state-of-the-art compression techniques, including the ability to compress the BERTBASE model by more than 60×, with only a minor drop in downstream task metrics, resulting in a language model with a footprint of under 7 MB.
-
Publication Number: US12260340B2
Publication Date: 2025-03-25
Application Number: US18471866
Filing Date: 2023-09-21
Applicant: Google LLC
Inventor: Yang Song, Raghav Gupta, Dengyong Zhou, Sanqiang Zhao
IPC: G06N3/088, G06F40/284, G06N3/045
Abstract: Provided is a knowledge distillation technique for training a student language model that, relative to a larger teacher language model, has a significantly smaller vocabulary, lower embedding dimensions, and/or hidden state dimensions. Specifically, aspects of the present disclosure are directed to a dual-training mechanism that trains the teacher and student language models simultaneously to obtain optimal word embeddings for the student vocabulary. In some implementations, this approach can be combined with learning shared projection matrices that transfer layer-wise knowledge from the teacher language model to the student language model. Example experimental results have also demonstrated higher compression efficiency and accuracy when compared with other state-of-the-art compression techniques, including the ability to compress the BERTBASE model by more than 60×, with only a minor drop in downstream task metrics, resulting in a language model with a footprint of under 7 MB.
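As a rough back-of-the-envelope check on the figures above (using commonly cited numbers for BERTBASE, not figures taken from the patent itself): BERTBASE has on the order of 110 million parameters, or roughly 440 MB at 32-bit precision, and 440 MB / 60 ≈ 7.3 MB, which is consistent with a compression factor of more than 60× yielding a model footprint of under 7 MB.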