Invention Grant
- Patent Title: Inference methods for word or wordpiece tokenization
- Application No.: US17798638
- Application Date: 2020-05-18
- Publication No.: US11763083B2
- Publication Date: 2023-09-19
- Inventor: Xinying Song, Yang Song
- Applicant: Google LLC
- Applicant Address: US CA Mountain View
- Assignee: Google LLC
- Current Assignee: Google LLC
- Current Assignee Address: US CA Mountain View
- Agency: Botos Churchill IP Law
- International Application: PCT/US2020/033419 (2020-05-18)
- International Announcement: WO2021/236052A (2021-11-25)
- National phase entry date: 2022-08-10
- Main IPC: G06F40/30
- IPC: G06F40/30; G06F40/284; G06F16/31; G06F40/40

Abstract:
Systems and methods are disclosed for performing inference for word or wordpiece tokenization using a left-to-right, longest-match-first greedy process. In some examples, the vocabulary may be organized into a trie structure in which each node includes a precomputed token or token ID and a fail link, so that the tokenizer can parse the trie in a single pass to generate a list of only those tokens or token IDs that correspond to the longest matching vocabulary entries in the sample string, without the need for backtracking. In some examples, the vocabulary may be organized into a trie in which each node has a fail link, and any node that would share token(s) or token_ID(s) of a preceding node is instead given a prev_match link that points back to a chain of nodes with those token(s) or token_ID(s).
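
The first variant described in the abstract (a trie whose nodes carry precomputed tokens and a fail link) can be illustrated with a minimal Python sketch. It is not taken from the patent text: the names `Node`, `build_trie`, `tokenize`, `pops`, and the `<unk>` placeholder are hypothetical, the vocabulary is a toy one, and the sketch assumes that any vocabulary entry may match at any position (no word-internal prefix marker such as `##`). The prev_match variant is not covered.

```python
from collections import deque


class Node:
    def __init__(self):
        self.children = {}  # char -> Node
        self.token = None   # vocabulary entry that ends at this node, if any
        self.fail = None    # fail link; None means there is no valid fallback
        self.pops = []      # precomputed tokens emitted when the fail link is taken


def build_trie(vocab):
    """Build the trie, then precompute fail links and tokens in BFS order."""
    root = Node()
    for word in vocab:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.token = word

    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for ch, node in parent.children.items():
            if node.token is not None:
                # The node is itself a vocabulary entry: emit it, restart at root.
                node.fail, node.pops = root, [node.token]
            else:
                # Reuse the parent's fallback, collecting tokens along the chain
                # until some fallback node can consume ch.
                z, pops = parent.fail, list(parent.pops)
                while z is not None and ch not in z.children:
                    pops.extend(z.pops)
                    z = z.fail
                if z is not None:
                    node.fail, node.pops = z.children[ch], pops
            queue.append(node)
    return root


def tokenize(root, text, unk="<unk>"):
    """Single left-to-right pass; the input position never moves backwards."""
    tokens, node = [], root
    for ch in text:
        while ch not in node.children:
            if node.fail is None:
                return [unk]  # no segmentation into vocabulary entries exists
            tokens.extend(node.pops)
            node = node.fail
        node = node.children[ch]
    # End of input: flush the tokens still pending along the fail chain.
    while node is not root:
        if node.fail is None:
            return [unk]
        tokens.extend(node.pops)
        node = node.fail
    return tokens


root = build_trie(["a", "abcdx", "b", "c", "cdy", "d"])
print(tokenize(root, "abcdy"))  # ['a', 'b', 'cdy']
```

In this toy run the walk reaches the node for "abcd" before the next character rules out "abcdx". Following the fail link emits the precomputed tokens "a" and "b" and resumes at the node for "cd", so the characters "bcd" are never re-read.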
Public/Granted literature
- US20230124402A1, Inference Methods For Word Or Wordpiece Tokenization (Public/Granted day: 2023-04-20)