Inference methods for word or wordpiece tokenization

    Publication No.: US11763083B2

    Publication date: 2023-09-19

    Application No.: US17798638

    Filing date: 2020-05-18

    Applicant: Google LLC

    CPC classification number: G06F40/284 G06F16/322 G06F40/40

    Abstract: Systems and methods for performing inference for word or wordpiece tokenization are disclosed using a left-to-right longest-match-first greedy process. In some examples, the vocabulary may be organized into a trie structure in which each node includes a precomputed token or token ID and a fail link, so that the tokenizer can parse the trie in a single pass to generate a list of only those tokens or token IDs that correspond to the longest matching vocabulary entries in the sample string, without the need for backtracking. In some examples, the vocabulary may be organized into a trie in which each node has a fail link, and any node that would share token(s) or token_ID(s) of a preceding node is instead given a prev_match link that points back to a chain of nodes with those token(s) or token_ID(s).
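The first variant in the abstract (a trie whose nodes carry precomputed tokens and a fail link) can be illustrated with a minimal Python sketch. This is not the patented implementation. The names `Node`, `pops`, `build_trie` and `tokenize_word`, the `##` continuation prefix, the `[UNK]` fallback, and the use of a separate root for continuation pieces are all assumptions made for the example. The `prev_match` variant is not shown.

```python
from collections import deque


class Node:
    """Trie node; field names are illustrative, not taken from the patent."""
    __slots__ = ("children", "token", "fail", "pops")

    def __init__(self):
        self.children = {}   # char -> Node
        self.token = None    # vocabulary entry that ends at this node, if any
        self.fail = None     # fail link: where to resume when matching fails
        self.pops = []       # precomputed tokens to emit when taking the fail link


def build_trie(vocab, prefix="##"):
    """Build word-initial and continuation tries and precompute fail links."""
    root, suffix_root = Node(), Node()
    for tok in vocab:
        is_suffix = tok.startswith(prefix)
        node = suffix_root if is_suffix else root
        for ch in (tok[len(prefix):] if is_suffix else tok):
            node = node.children.setdefault(ch, Node())
        node.token = tok

    # Breadth-first so that shallower nodes are finished before deeper ones.
    queue = deque([root, suffix_root])
    while queue:
        parent = queue.popleft()
        for ch, node in parent.children.items():
            if node.token is not None:
                # A complete entry: emit it, then continue with continuation pieces.
                node.fail, node.pops = suffix_root, [node.token]
            else:
                z, extra = parent.fail, []
                while z is not None and ch not in z.children:
                    extra += z.pops
                    z = z.fail
                if z is not None:
                    node.fail = z.children[ch]
                    node.pops = parent.pops + extra
                # Otherwise fail stays None: no valid tokenization through here.
            queue.append(node)
    return root, suffix_root


def tokenize_word(word, root, suffix_root, unk="[UNK]"):
    """Left-to-right longest-match-first tokenization in one pass, no backtracking."""
    tokens, node, i = [], root, 0
    while i < len(word):
        while word[i] not in node.children:
            if node.fail is None:
                return [unk]
            tokens += node.pops
            node = node.fail
        node = node.children[word[i]]
        i += 1
    # End of word: follow fail links until the pending match is fully emitted.
    while node is not root and node is not suffix_root:
        if node.fail is None:
            return [unk]
        tokens += node.pops
        node = node.fail
    return tokens


vocab = ["a", "abcdx", "##b", "##c", "##cdy", "##dz"]
root, suffix_root = build_trie(vocab)
print(tokenize_word("abcdz", root, suffix_root))  # ['a', '##b', '##c', '##dz']
```

In this sketch the input index `i` never moves backwards: when the match for `abcd` cannot be extended with `z`, the tokens already implied by the path are read from the precomputed `pops` lists rather than found by rescanning the string.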

    Inference Methods For Word Or Wordpiece Tokenization

    Publication No.: US20240054288A1

    Publication date: 2024-02-15

    Application No.: US18205609

    Filing date: 2023-06-05

    Applicant: Google LLC

    CPC classification number: G06F40/284 G06F16/322 G06F40/40

    Abstract: Systems and methods for performing inference for word or wordpiece tokenization are disclosed using a left-to-right longest-match-first greedy process. In some examples, the vocabulary may be organized into a trie structure in which each node includes a precomputed token or token_ID and a fail link, so that the tokenizer can parse the trie in a single pass to generate a list of only those tokens or token_IDs that correspond to the longest matching vocabulary entries in the sample string, without the need for backtracking. In some examples, the vocabulary may be organized into a trie in which each node has a fail link, and any node that would share token(s) or token_ID(s) of a preceding node is instead given a prev_match link that points back to a chain of nodes with those token(s) or token_ID(s).

    QUERY RESPONSE USING A CUSTOM CORPUS
    Invention publication

    Publication No.: US20240362093A1

    Publication date: 2024-10-31

    Application No.: US18231606

    Filing date: 2023-08-08

    Applicant: GOOGLE LLC

    CPC classification number: G06F9/547 G06F16/243

    Abstract: At least utilizing a custom corpus of documents to condition a large language model (LLM) when generating a response to a user query. In some implementations, a user query associated with a client device is received. An API query for an external application is generated by an LLM based on the user query. The external application has access to a custom corpus of documents comprising a plurality of documents. The external application is queried using the API query. Data representative of one or more documents in the custom corpus of documents is received from the external application in response to the API query. The LLM generates a response to the query that is conditioned on the data representing one or more of the documents in the custom corpus of documents received from the external application. The response to the user query is caused to be rendered on the client device.
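The flow described in this abstract can be sketched in a few lines of Python. The function name `answer_with_custom_corpus`, the two callables `llm` and `search_corpus`, and the prompt wording are all hypothetical stand-ins. The abstract does not specify any particular model interface, external application, or API format.

```python
from typing import Callable, List


def answer_with_custom_corpus(
    user_query: str,
    llm: Callable[[str], str],                  # hypothetical: prompt -> generated text
    search_corpus: Callable[[str], List[str]],  # hypothetical: API query -> document data
) -> str:
    # 1. The LLM generates an API query for the external application.
    api_query = llm(f"Write a search query for this request: {user_query}")
    # 2. The external application, which has access to the custom corpus, is queried.
    documents = search_corpus(api_query)
    # 3. The LLM generates a response conditioned on the returned document data.
    context = "\n".join(documents)
    return llm(f"Context:\n{context}\n\nQuestion: {user_query}")
```

The caller would then render the returned string on the client device. Passing the model and the corpus search in as plain callables keeps the sketch independent of any specific LLM service or document store.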
