Extreme language model compression with optimal sub-words and shared projections

    Publication Number: US12260340B2

    Publication Date: 2025-03-25

    Application Number: US18471866

    Application Date: 2023-09-21

    Applicant: Google LLC

    Abstract: Provided is a knowledge distillation technique for training a student language model that, relative to a larger teacher language model, has a significantly smaller vocabulary, lower embedding dimensions, and/or hidden state dimensions. Specifically, aspects of the present disclosure are directed to a dual-training mechanism that trains the teacher and student language models simultaneously to obtain optimal word embeddings for the student vocabulary. In some implementations, this approach can be combined with learning shared projection matrices that transfer layer-wise knowledge from the teacher language model to the student language model. Example experimental results have also demonstrated higher compression efficiency and accuracy when compared with other state-of-the-art compression techniques, including the ability to compress the BERTBASE model by more than 60×, with only a minor drop in downstream task metrics, resulting in a language model with a footprint of under 7 MB.
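
    Illustrative sketch (not the claimed implementation): the shared-projection idea can be read as a single matrix, reused across layers, that maps the student's lower-dimensional hidden states into the teacher's space so a layer-wise distillation loss can be computed. All dimensions and names below are assumptions, and the dual training of the teacher and student vocabularies described above is omitted.

        # Minimal, hedged sketch of layer-wise distillation through one shared projection.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        teacher_dim, student_dim = 768, 128                    # assumed dimensions

        # One projection shared by all layers maps student states into teacher space.
        shared_proj = nn.Linear(student_dim, teacher_dim, bias=False)

        def layerwise_distillation_loss(teacher_hidden, student_hidden):
            """Both arguments are lists of [batch, seq, dim] hidden states, one per layer."""
            loss = 0.0
            for t_h, s_h in zip(teacher_hidden, student_hidden):
                loss = loss + F.mse_loss(shared_proj(s_h), t_h)
            return loss / len(teacher_hidden)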

    Instruction Fine-Tuning Machine-Learned Models Using Intermediate Reasoning Steps

    Publication Number: US20240256965A1

    Publication Date: 2024-08-01

    Application Number: US18424624

    Application Date: 2024-01-26

    Applicant: Google LLC

    CPC classification number: G06N20/00

    Abstract: An example method for training a machine-learned sequence processing model includes obtaining a plurality of training examples for training the machine-learned sequence processing model. For each respective training example of the plurality of training examples, the example method includes: obtaining a respective query associated with the respective training example; inputting the respective query to the machine-learned sequence processing model; obtaining, from the machine-learned sequence processing model, a response to the respective query and a trace of intermediate states from the respective query to the response; evaluating the response using a ground truth response associated with the respective training example; evaluating the trace using a ground truth trace associated with the respective training example; and updating one or more parameters of the machine-learned sequence processing model based on the evaluation of the response and based on the evaluation of the trace.
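
    Illustrative sketch (assumed interfaces, not the claimed method): the abstract describes two evaluations per example, one against a ground truth response and one against a ground truth trace, both of which drive the parameter update. The model call below is a hypothetical API returning per-position vocabulary logits for the trace tokens followed by the response tokens.

        # Minimal sketch of a training step supervised on both the trace and the response.
        import torch
        import torch.nn.functional as F

        def training_step(model, optimizer, query_ids, trace_ids, response_ids):
            logits = model(query_ids)                          # [batch, seq, vocab], assumed API
            trace_len = trace_ids.size(1)
            trace_logits = logits[:, :trace_len, :]
            response_logits = logits[:, trace_len:trace_len + response_ids.size(1), :]

            # Evaluate the trace and the response against their ground truths.
            trace_loss = F.cross_entropy(
                trace_logits.reshape(-1, trace_logits.size(-1)), trace_ids.reshape(-1))
            response_loss = F.cross_entropy(
                response_logits.reshape(-1, response_logits.size(-1)), response_ids.reshape(-1))

            # Update parameters based on both evaluations.
            loss = trace_loss + response_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            return loss.item()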

    Extreme Language Model Compression with Optimal Sub-Words and Shared Projections

    Publication Number: US20210224660A1

    Publication Date: 2021-07-22

    Application Number: US16749570

    Application Date: 2020-01-22

    Applicant: Google LLC

    Abstract: Provided is a knowledge distillation technique for training a student language model that, relative to a larger teacher language model, has a significantly smaller vocabulary, lower embedding dimensions, and/or hidden state dimensions. Specifically, aspects of the present disclosure are directed to a dual-training mechanism that trains the teacher and student language models simultaneously to obtain optimal word embeddings for the student vocabulary. In some implementations, this approach can be combined with learning shared projection matrices that transfer layer-wise knowledge from the teacher language model to the student language model. Example experimental results have also demonstrated higher compression efficiency and accuracy when compared with other state-of-the-art compression techniques, including the ability to compress the BERTBASE model by more than 60×, with only a minor drop in downstream task metrics, resulting in a language model with a footprint of under 7 MB.

    Machine-Learned State Space Model for Joint Forecasting

    Publication Number: US20210065066A1

    Publication Date: 2021-03-04

    Application Number: US17008338

    Application Date: 2020-08-31

    Applicant: Google LLC

    Abstract: A deep state space generative model is augmented with intervention prediction. The state space model provides a principled way to capture the interactions among observations, interventions, critical event occurrences, true states, and associated uncertainty. The state space model can include a discrete-time hazard rate model that provides flexible fitting of general survival time distributions. The state space model can output a joint prediction of event risk, observation and intervention trajectories based on patterns in temporal progressions, and correlations between past measurements and interventions.
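
    Illustrative sketch (illustrative only, not the claimed model): one way to read the joint prediction is a recurrent state update that, at each time step, emits the next observation, the next intervention, and a discrete-time hazard giving the event risk. The GRU-based transition and layer sizes are assumptions.

        # Minimal sketch of a state step emitting observations, interventions, and event risk.
        import torch
        import torch.nn as nn

        class JointStateSpaceCell(nn.Module):
            def __init__(self, state_dim, obs_dim, intervention_dim):
                super().__init__()
                self.transition = nn.GRUCell(obs_dim + intervention_dim, state_dim)
                self.obs_head = nn.Linear(state_dim, obs_dim)                 # next observation
                self.intervention_head = nn.Linear(state_dim, intervention_dim)
                self.hazard_head = nn.Linear(state_dim, 1)                    # event-risk logit

            def forward(self, state, obs, intervention):
                state = self.transition(torch.cat([obs, intervention], dim=-1), state)
                hazard = torch.sigmoid(self.hazard_head(state))               # P(event at t | survival so far)
                return state, self.obs_head(state), self.intervention_head(state), hazard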

    Machine-learned state space model for joint forecasting

    Publication Number: US12217144B2

    Publication Date: 2025-02-04

    Application Number: US17008338

    Application Date: 2020-08-31

    Applicant: Google LLC

    Abstract: A deep state space generative model is augmented with intervention prediction. The state space model provides a principled way to capture the interactions among observations, interventions, critical event occurrences, true states, and associated uncertainty. The state space model can include a discrete-time hazard rate model that provides flexible fitting of general survival time distributions. The state space model can output a joint prediction of event risk, observation and intervention trajectories based on patterns in temporal progressions, and correlations between past measurements and interventions.

    Extreme Language Model Compression with Optimal Sub-Words and Shared Projections

    Publication Number: US20240013059A1

    Publication Date: 2024-01-11

    Application Number: US18471866

    Application Date: 2023-09-21

    Applicant: Google LLC

    CPC classification number: G06N3/0455 G06F40/40 G06N3/08 G06F40/284

    Abstract: Provided is a knowledge distillation technique for training a student language model that, relative to a larger teacher language model, has a significantly smaller vocabulary, lower embedding dimensions, and/or hidden state dimensions. Specifically, aspects of the present disclosure are directed to a dual-training mechanism that trains the teacher and student language models simultaneously to obtain optimal word embeddings for the student vocabulary. In some implementations, this approach can be combined with learning shared projection matrices that transfer layer-wise knowledge from the teacher language model to the student language model. Example experimental results have also demonstrated higher compression efficiency and accuracy when compared with other state-of-the-art compression techniques, including the ability to compress the BERTBASE model by more than 60×, with only a minor drop in downstream task metrics, resulting in a language model with a footprint of under 7 MB.

    Knowledge Graph Completion and Multi-Hop Reasoning in Knowledge Graphs at Scale

    Publication Number: US20230289626A1

    Publication Date: 2023-09-14

    Application Number: US18183410

    Application Date: 2023-03-14

    Applicant: Google LLC

    CPC classification number: G06N5/022 G06F16/2453

    Abstract: Provided are computing systems, methods, and platforms for negative sampling in knowledge graphs with improved efficiency. A knowledge graph comprising entities and links between the entities can be obtained. A query computation graph comprising nodes and edges can be generated based on the knowledge graph. The nodes of the query computation graph can include anchor nodes, a root node, and intermediate nodes positioned in paths between the anchor nodes and the root node. A node cut of a query of the query computation graph can be determined and can include at least one node that cuts at least one path between each anchor node and the root node of the query computation graph. Negative samples can be identified by bidirectionally traversing the query computation graph in a first direction from the anchor nodes to the node cut and in a second direction from the root node to the node cut.
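
    Illustrative sketch (hypothetical helpers, not the claimed system): negative candidates can be collected by traversing forward from the anchor nodes and backward from the root node, meeting at the node cut; entities reachable from the anchor side but not from the root side cannot answer the query and serve as negatives.

        # Minimal sketch of bidirectional traversal around a node cut.
        def bidirectional_negative_samples(query_graph, anchors, root, node_cut,
                                           forward_reachable, backward_reachable):
            """forward_reachable / backward_reachable are assumed traversal callables that
            return the set of entities reached at a given cut node from each direction."""
            negatives = set()
            for cut_node in node_cut:
                from_anchors = forward_reachable(query_graph, anchors, cut_node)
                from_root = backward_reachable(query_graph, root, cut_node)
                negatives |= from_anchors - from_root
            return negatives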

    Systems And Methods For Parameter Sharing To Reduce Computational Costs Of Training Machine-Learned Models

    Publication Number: US20220108221A1

    Publication Date: 2022-04-07

    Application Number: US17493442

    Application Date: 2021-10-04

    Applicant: Google LLC

    Abstract: Systems and methods of the present disclosure are directed to a computer-implemented method. The method can include obtaining a machine-learned model comprising a plurality of model units, wherein each model unit comprises a plurality of parameters that are tied to a shared plurality of parameters. The method can include performing a first plurality of training iterations with the machine-learned model to adjust parameters of the shared plurality of parameters. The method can include detecting, based on the first plurality of training iterations, an occurrence of an untying condition. The method can include untying the parameters of one or more model units from the shared plurality of parameters. The method can include performing a second plurality of training iterations with the machine-learned model to adjust parameters of the one or more model units independent of the shared plurality of parameters.
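
    Illustrative sketch (assumed training loop, not the claimed system): several model units alias one shared parameter set during a first training phase; when an untying condition is met (here, a simple step threshold stands in for it), each unit receives its own copy of the parameters and is trained independently thereafter.

        # Minimal sketch of tied-then-untied model units.
        import copy
        import torch
        import torch.nn as nn

        shared_unit = nn.Linear(64, 64)                        # shared plurality of parameters
        model_units = [shared_unit] * 4                        # all units alias the shared weights
        untie_step = 100                                       # assumed untying condition
        optimizer = torch.optim.Adam(shared_unit.parameters())

        for step in range(200):
            if step == untie_step:                             # untying condition detected
                model_units = [copy.deepcopy(shared_unit) for _ in model_units]
                optimizer = torch.optim.Adam(
                    [p for unit in model_units for p in unit.parameters()])
            x = torch.randn(8, 64)
            for unit in model_units:                           # apply each unit in sequence
                x = torch.relu(unit(x))
            loss = x.pow(2).mean()                             # placeholder objective
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()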
