PRE-TRAINING TECHNIQUES FOR ENTITY EXTRACTION IN LOW RESOURCE DOMAINS

Invention Publication

US20230153533A1 PRE-TRAINING TECHNIQUES FOR ENTITY EXTRACTION IN LOW RESOURCE DOMAINS (Status: under examination, published)

Patent Title: PRE-TRAINING TECHNIQUES FOR ENTITY EXTRACTION IN LOW RESOURCE DOMAINS
Application No.: US17525311

Application Date: 2021-11-12
Publication No.: US20230153533A1

Publication Date: 2023-05-18
Inventor: Aniruddha Mahapatra, Sharmila Reddy Nangi, Aparna Garimella, Anandha velu Natarajan
Applicant: ADOBE INC.
Applicant Address: US CA SAN JOSE
Assignee: ADOBE INC.
Current Assignee: ADOBE INC.
Current Assignee Address: US CA SAN JOSE
Main IPC: G06F40/289
IPC: G06F40/289; G06F40/211; G06F40/42

Abstract:

Embodiments of the present invention provide systems, methods, and computer storage media for pre-training entity extraction models to facilitate domain adaptation in resource-constrained domains. In an example embodiment, a first machine learning model is used to encode sentences of a source domain corpus and a target domain corpus into sentence embeddings. The sentence embeddings of the target domain corpus are combined into a target corpus embedding. Training sentences from the source domain corpus within a threshold of similarity to the target corpus embedding are selected. A second machine learning model is trained on the training sentences selected from the source domain corpus.
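
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the abstract does not name the encoder, how sentence embeddings are combined, or the similarity measure, so the sketch assumes a toy bag-of-words encoder in place of the first machine learning model, an element-wise mean as the combination, and cosine similarity against a hypothetical threshold. All function names are illustrative, and training the second model on the selected sentences is left out.

```python
import math
from collections import Counter


def embed(sentence, vocab):
    """Toy stand-in for the first model: word counts over a fixed vocabulary."""
    counts = Counter(sentence.lower().split())
    return [float(counts[word]) for word in vocab]


def mean_vector(vectors):
    """Combine sentence embeddings into one corpus embedding (assumed: mean)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_training_sentences(source, target, threshold):
    """Keep source sentences whose similarity to the target corpus
    embedding is at least `threshold` (a hypothetical cut-off)."""
    vocab = sorted({w for s in source + target for w in s.lower().split()})
    target_embedding = mean_vector([embed(s, vocab) for s in target])
    return [
        s for s in source
        if cosine(embed(s, vocab), target_embedding) >= threshold
    ]


if __name__ == "__main__":
    source_corpus = [
        "court denied motion",
        "striker scored goal",
        "judge granted appeal",
    ]
    target_corpus = ["court granted motion", "judge denied appeal"]
    for sentence in select_training_sentences(source_corpus, target_corpus, 0.5):
        print(sentence)
```

In practice the encoder would be a learned sentence-embedding model rather than word counts. The selection logic stays the same because it depends only on the vectors the encoder returns.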

Public/Granted literature

US12159109B2 Pre-training techniques for entity extraction in low resource domains (Publication/grant date: 2024-12-03)

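IPC classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models: G06N)
G06F40/00	Handling natural language data (speech analysis or synthesis, speech recognition: G10L)
G06F40/20	. Natural language analysis (semantic analysis of natural language: G06F40/30)
G06F40/279	.. Recognition of textual entities
G06F40/289	... Phrasal analysis, e.g. finite state techniques or chunking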