Inference methods for word or wordpiece tokenization

Invention Grant

US11763083B2 Inference methods for word or wordpiece tokenization (Active)

Patent Title: Inference methods for word or wordpiece tokenization
Application No.: US17798638

Application Date: 2020-05-18
Publication No.: US11763083B2

Publication Date: 2023-09-19
Inventor: Xinying Song, Yang Song
Applicant: Google LLC
Applicant Address: US CA Mountain View
Assignee: Google LLC
Current Assignee: Google LLC
Current Assignee Address: US CA Mountain View
Agency: Botos Churchill IP Law
International Application: PCT/US2020/033419 (2020-05-18)
International Publication: WO2021/236052A (2021-11-25)
National Phase Entry Date: 2022-08-10
Main IPC: G06F40/30
IPC: G06F40/30; G06F40/284; G06F16/31; G06F40/40

Abstract:

Systems and methods for performing inference for word or wordpiece tokenization are disclosed using a left-to-right longest-match-first greedy process. In some examples, the vocabulary may be organized into a trie structure in which each node includes a precomputed token or token ID and a fail link, so that the tokenizer can parse the trie in a single pass to generate a list of only those tokens or token IDs that correspond to the longest matching vocabulary entries in the sample string, without the need for backtracking. In some examples, the vocabulary may be organized into a trie in which each node has a fail link, and any node that would share token(s) or token_ID(s) of a preceding node is instead given a prev_match link that points back to a chain of nodes with those token(s) or token_ID(s).
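The first arrangement in the abstract (a trie whose nodes carry precomputed tokens and a fail link) can be illustrated with a minimal Python sketch. This is not the patent's implementation: the names `Node`, `build_trie`, `tokenize`, and `fail_pops` are hypothetical, the sketch stores token strings rather than token IDs, and it assumes a plain vocabulary with no suffix indicator (such as the "##" prefix used by some wordpiece vocabularies) and returns `None` when greedy matching gets stuck.

```python
from collections import deque


class Node:
    """One trie node; `token` is set when the path from the root spells a vocabulary entry."""

    def __init__(self):
        self.children = {}
        self.token = None
        self.fail = None      # fail link: node to resume from on a mismatch
        self.fail_pops = []   # precomputed tokens to emit when the fail link is followed


def build_trie(vocab):
    """Build the trie, then precompute fail links and their tokens (hypothetical helper)."""
    root = Node()
    for word in vocab:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.token = word
    # Breadth-first pass, so a parent's fail data is ready before its children need it.
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for ch, node in parent.children.items():
            queue.append(node)
            if node.token is not None:
                # A complete vocabulary entry: emit it and restart at the root.
                node.fail, node.fail_pops = root, [node.token]
                continue
            # Otherwise walk the parent's fail chain until some node accepts `ch`.
            z, pops = parent.fail, list(parent.fail_pops)
            while z is not None and ch not in z.children:
                pops.extend(z.fail_pops)
                z = z.fail
            if z is not None:
                node.fail, node.fail_pops = z.children[ch], pops
    return root


def tokenize(text, root):
    """Left-to-right longest-match-first tokenization without re-reading input characters."""
    tokens, node = [], root
    for ch in text:
        while ch not in node.children:
            if node.fail is None:
                return None  # assumption: signal "no greedy segmentation" with None
            tokens.extend(node.fail_pops)
            node = node.fail
        node = node.children[ch]
    # Flush the tokens still pending at the final node.
    while node is not root:
        if node.fail is None:
            return None
        tokens.extend(node.fail_pops)
        node = node.fail
    return tokens


root = build_trie(["a", "abcdx", "b", "c", "cdy", "d", "dz"])
print(tokenize("abcdz", root))  # ['a', 'b', 'c', 'dz']
```

In this toy vocabulary, the input "abcdz" walks four nodes along the "abcdx" path before the mismatch on "z"; the fail links then emit "a", "b", and "c" and resume at the node for "d" without rescanning any earlier character.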

Published/Granted Literature

US20230124402A1 Inference Methods For Word Or Wordpiece Tokenization (published 2023-04-20)

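IPC Classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models: G06N)
G06F40/00	Handling natural language data (speech analysis or synthesis, speech recognition: G10L)
G06F40/30	.Semantic analysis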