Publication No.: US20250078814A1
Publication Date: 2025-03-06
Application No.: US18819280
Filing Date: 2024-08-29
Applicant: Lemon Inc., Beijing Zitiao Network Technology Co., Ltd.
Inventor: Dong Guo, Zihao He, Weituo Hao, Xuchen Song, Zongyu Yin, Jingsong Gao, Wei Tsung Lu, Junyu Dai
IPC: G10L15/06, G06F40/126, G10L25/30
Abstract: The present disclosure provides a multi-modal encoder processing method and apparatus, a computer device, and a storage medium. The method includes: acquiring a pair of mask samples to be processed, the pair of mask samples including a text sample and an audio sample associated with each other, at least one of the text sample and the audio sample being masked; generating, based on a multi-modal encoder, a text encoding feature of the text sample and an audio encoding feature of the audio sample, a linear spectrum feature of the audio sample being fused into the text encoding feature and a linear word feature of the text sample being fused into the audio encoding feature; and predicting the mask information that has been masked according to the text encoding feature and the audio encoding feature, and correcting the multi-modal encoder based on an accuracy of the predicted mask information.
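
The flow described in the abstract can be illustrated with a toy sketch. This is not the claimed implementation: the class `ToyMultiModalEncoder`, the helpers, `MASK_ID`, the mean pooling, the additive fusion, and all dimensions are assumptions made only for illustration, and the weights are random and untrained. The sketch masks part of a text sample, fuses a linear projection of the audio spectrum into the text features and a linear projection of the word features into the audio features, predicts the masked tokens, and measures the accuracy that a training step would then use to correct the encoder. The correction step itself (a parameter update) is omitted.

```python
import random

MASK_ID = 0  # hypothetical token id reserved for masked positions


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def rand_matrix(rng, rows, cols):
    """Small random weights; a real encoder would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


class ToyMultiModalEncoder:
    def __init__(self, vocab_size, n_bins, dim, seed=0):
        rng = random.Random(seed)
        self.embed = rand_matrix(rng, vocab_size, dim)     # token id -> word feature
        self.audio_in = rand_matrix(rng, dim, n_bins)      # spectrum frame -> audio feature
        self.spec_to_text = rand_matrix(rng, dim, n_bins)  # linear spectrum feature
        self.word_to_audio = rand_matrix(rng, dim, dim)    # linear word feature
        self.out = rand_matrix(rng, vocab_size, dim)       # fused feature -> token logits

    def encode(self, token_ids, frames):
        words = [self.embed[t] for t in token_ids]
        spec_feat = matvec(self.spec_to_text, mean_pool(frames))
        word_feat = matvec(self.word_to_audio, mean_pool(words))
        # Fuse the spectrum feature into every text position, and the
        # word feature into every audio frame (additive fusion is an assumption).
        text_enc = [add(w, spec_feat) for w in words]
        audio_enc = [add(matvec(self.audio_in, f), word_feat) for f in frames]
        return text_enc, audio_enc

    def predict_tokens(self, text_enc, audio_enc):
        # Predict one token per text position from both encoding features.
        audio_ctx = mean_pool(audio_enc)
        preds = []
        for feat in text_enc:
            logits = matvec(self.out, add(feat, audio_ctx))
            preds.append(max(range(len(logits)), key=logits.__getitem__))
        return preds


def mask_text(token_ids, positions):
    """Replace the tokens at the given positions with MASK_ID."""
    return [MASK_ID if i in positions else t for i, t in enumerate(token_ids)]


def mask_accuracy(preds, targets, positions):
    """Fraction of masked positions whose original token was recovered."""
    return sum(preds[i] == targets[i] for i in positions) / len(positions)


tokens = [5, 3, 7, 2]                        # toy text sample (token ids)
frames = [[0.1, 0.4, 0.2], [0.3, 0.0, 0.5]]  # toy audio sample (spectrum frames)
masked_positions = {1, 2}

encoder = ToyMultiModalEncoder(vocab_size=8, n_bins=3, dim=4)
text_enc, audio_enc = encoder.encode(mask_text(tokens, masked_positions), frames)
preds = encoder.predict_tokens(text_enc, audio_enc)
accuracy = mask_accuracy(preds, tokens, masked_positions)
print(f"masked-token accuracy: {accuracy:.2f}")
```

In a real training loop the accuracy (or a loss derived from it) would drive an update of the encoder weights. Here only the text side is masked, although the abstract also allows masking the audio sample.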