Generating synthetic code-switched data for training language models

Invention Grant

US12242820B2 Generating synthetic code-switched data for training language models (Active)


Patent Title: Generating synthetic code-switched data for training language models
Application No.: US17651555

Application Date: 2022-02-17
Publication No.: US12242820B2

Publication Date: 2025-03-04
Inventor: Cesa Salaam, Seunghyun Yoon, Trung Huu Bui, Franck Dernoncourt
Applicant: Adobe Inc.
Applicant Address: US CA San Jose
Assignee: Adobe Inc.
Current Assignee: Adobe Inc.
Current Assignee Address: US CA San Jose
Agency: Weaver Austin Villeneuve & Sampson LLP
Main IPC: G10L15/22
IPC: G10L15/22; G06F40/47; G06F40/58; G06N3/045; G06N3/08


Abstract:

Techniques for training a language model for code switching content are disclosed. Such techniques include, in some embodiments, generating a dataset, which includes identifying one or more portions within textual content in a first language, the identified one or more portions each including one or more of offensive content or non-offensive content; translating the identified one or more salient portions to a second language; and reintegrating the translated one or more portions into the textual content to generate code-switched textual content. In some cases, the textual content in the first language includes offensive content and non-offensive content, the identified one or more portions include the offensive content, and the translated one or more portions include a translated version of the offensive content. In some embodiments, the code-switched textual content is at least part of a synthetic dataset usable to train a language model, such as a multilingual classification model.
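The three steps named in the abstract (identify portions, translate them, reintegrate them) can be illustrated with a minimal Python sketch. This is not the patented implementation: the lexicon-based `identify_portions`, the toy dictionary `TOY_EN_ES`, and the function names are all hypothetical stand-ins chosen for illustration. A real pipeline would presumably use a trained model to pick the portions and a machine translation system to translate them.

```python
import re
from typing import Callable, List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive

# Hypothetical toy English-to-Spanish dictionary; stands in for a real translator.
TOY_EN_ES = {"awful": "horrible", "boring": "aburrida"}


def identify_portions(text: str, lexicon: Set[str]) -> List[Span]:
    """Step 1 (assumed approach): mark every word found in a lexicon."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"\w+", text)
        if m.group(0).lower() in lexicon
    ]


def translate_word(word: str) -> str:
    """Step 2 (toy): look the word up, falling back to the original."""
    return TOY_EN_ES.get(word.lower(), word)


def code_switch(text: str, spans: List[Span],
                translate: Callable[[str], str]) -> str:
    """Step 3: splice translated portions back into the untouched text."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        pieces.append(text[cursor:start])            # unchanged first-language text
        pieces.append(translate(text[start:end]))    # translated portion
        cursor = end
    pieces.append(text[cursor:])                     # trailing unchanged text
    return "".join(pieces)


sentence = "the movie was awful and boring"
spans = identify_portions(sentence, set(TOY_EN_ES))
mixed = code_switch(sentence, spans, translate_word)
```

Here `mixed` holds a sentence whose selected words are in the second language while the rest stays in the first, which is the kind of code-switched example the abstract describes collecting into a synthetic training dataset.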

Public/Granted literature

US20230259718A1 Generating synthetic code-switched data for training language models Public/Granted day: 2023-08-17


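IPC Classification:

G	Physics
G10	Musical instruments; acoustics
G10L	Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding
G10L15/00	Speech recognition (G10L17/00 takes precedence)
G10L15/22	. Procedures used during a speech recognition process, e.g. man-machine dialogue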