DATA AUGMENTATION AND BATCH BALANCING FOR TRAINING MULTI-LINGUAL MODEL

Invention Publication

US20240135116A1 DATA AUGMENTATION AND BATCH BALANCING FOR TRAINING MULTI-LINGUAL MODEL 审中-公开

Please log in to see more content

Patent Title: DATA AUGMENTATION AND BATCH BALANCING FOR TRAINING MULTI-LINGUAL MODEL
Application No.: US18485779

Application Date: 2023-10-12
Publication No.: US20240135116A1

Publication Date: 2024-04-25
Inventor: Duy Vu , Poorya Zaremoodi , Nagaraj N. Bhat , Srijon Sarkar , Varsha Kuppur Rajendra , Thanh Long Duong , Mark Edward Johnson , Pramir Sarkar , Shahid Reza
Applicant: Oracle International Corporation
Applicant Address: US CA Redwood Shores
Assignee: Oracle International Corporation
Current Assignee: Oracle International Corporation
Current Assignee Address: US CA Redwood Shores
Priority: IN 2241058581 2022.10.13
Main IPC: G06F40/58
IPC: G06F40/58 ; G06F40/20

DATA AUGMENTATION AND BATCH BALANCING FOR TRAINING MULTI-LINGUAL MODEL

Abstract:

A computer-implemented method includes: accessing a plurality of datasets, where each dataset of the plurality of datasets includes training examples; selecting datasets that include the training examples in a source language and a target language; and sampling, based on a sampling weight that is determined for each of the selected datasets, the training examples from the selected datasets to generate the training batches; training an ML model for performing at least a first task using the training examples of the training batches, by interleavingly inputting the training batches to the ML model; and outputting the trained ML model configured to perform the at least the first task on input utterances provided in at least one among the source language and the target language. The sampling weight is determined for each of the selected datasets based on one or more attributes common to the training examples of the selected dataset.

Information query

Global Dossier Espacenet

IPC分类:

G	物理
G06	计算；推算或计数
G06F	电数字数据处理（基于特定计算模型的计算机系统入G06N）
G06F40/00	处理自然语言数据（语音分析或综合，语音识别G10L）
G06F40/40	.自然语言的处理或翻译(自然语言分析入G06F40/20；语义分析入G06F40/30)
G06F40/58	..使用机器翻译，例如用于多语言检索，用于客户端设备的服务器端翻译或实时翻译。