-
Publication Number: US20230121764A1
Publication Date: 2023-04-20
Application Number: US17502890
Application Date: 2021-10-15
Applicant: Lemon Inc.
Inventor: Ju-Chiang Wang, Jordan Smith, Wei Tsung Lu
Abstract: Devices, systems, and methods related to implementing supervised metric learning during training of a deep neural network model are disclosed herein. In examples, audio input may be received, where the audio input includes a plurality of song fragments from a plurality of songs. For each song fragment, an aligning function may be performed to center the song fragment based on determined beat information, thereby creating a plurality of aligned song fragments. For each song fragment of the plurality of song fragments, an embedding vector may be obtained from the deep neural network model. A batch of aligned song fragments may then be selected from the plurality of aligned song fragments, and a training tuple may be chosen from the batch. A loss metric may be generated based on the selected training tuple, and one or more weights of the deep neural network model may be updated based on the loss metric.
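The pipeline in this abstract (beat-centered alignment, embeddings, tuple selection, loss) can be illustrated with a minimal NumPy sketch. The median-beat centering heuristic and the triplet loss below are common metric-learning stand-ins chosen for illustration; the patent does not specify this particular aligning function or loss metric.

```python
import numpy as np

def align_fragment(fragment, beat_frames, target_len=256):
    """Center a song fragment using beat information (hypothetical heuristic:
    crop a window around the median detected beat frame, zero-padding if
    the fragment is too short)."""
    center = int(np.median(beat_frames)) if len(beat_frames) else len(fragment) // 2
    half = target_len // 2
    start = max(0, center - half)
    out = fragment[start:start + target_len]
    if len(out) < target_len:                       # zero-pad short fragments
        out = np.pad(out, (0, target_len - len(out)))
    return out

def triplet_loss(anchor, positive, negative, margin=0.2):
    """One common supervised metric-learning loss over embedding vectors:
    pull the positive closer than the negative by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```

In a full training loop, the loss computed over a tuple drawn from the aligned batch would be backpropagated to update the network's weights.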
-
Publication Number: US20230124006A1
Publication Date: 2023-04-20
Application Number: US17502863
Application Date: 2021-10-15
Applicant: Lemon Inc.
Inventor: Wei Tsung Lu, Ju-Chiang Wang, Minz Won, Keunwoo Choi, Xuchen Song
Abstract: Devices, systems, and methods related to causing an apparatus to generate music information of audio data using a transformer-based neural network model with a multilevel transformer for audio analysis, comprising a spectral transformer and a temporal transformer, are disclosed herein. The processor generates a time-frequency representation of obtained audio data to be applied as input to the transformer-based neural network model; determines spectral embeddings and first temporal embeddings of the audio data based on the time-frequency representation; determines each vector of a second frequency class token (FCT) by passing each vector of the first FCT in the spectral embeddings through the spectral transformer; determines second temporal embeddings by adding a linear projection of the second FCT to the first temporal embeddings; determines third temporal embeddings by passing the second temporal embeddings through the temporal transformer; and generates music information based on the third temporal embeddings.
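The staged data flow in this abstract (first FCT, spectral stage, linear projection into the temporal stream, temporal stage) can be sketched at the shape level in NumPy. The spectral and temporal transformer blocks are stubbed here with simple linear maps and averaging, so only the wiring between stages follows the abstract, not the actual model architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def multilevel_flow(tf_repr, d_model=8):
    """tf_repr: (n_frames, n_bins) time-frequency representation.
    Returns third temporal embeddings of shape (n_frames, d_model).
    All weights are random stand-ins; in a real model they are learned."""
    n_frames, n_bins = tf_repr.shape
    # Spectral embeddings: project each frame's spectrum; pair with a first FCT.
    w_spec = rng.standard_normal((n_bins, d_model)) * 0.1
    spectral = tf_repr @ w_spec                     # (n_frames, d_model)
    first_fct = np.zeros((n_frames, d_model))       # learnable token in practice
    # Spectral "transformer" (stub): each FCT absorbs its frame's spectral info.
    second_fct = first_fct + spectral               # (n_frames, d_model)
    # First temporal embeddings (stub): a separate projection of the frames.
    w_temp = rng.standard_normal((n_bins, d_model)) * 0.1
    first_temporal = tf_repr @ w_temp               # (n_frames, d_model)
    # Second temporal embeddings: add a linear projection of the second FCT.
    w_proj = rng.standard_normal((d_model, d_model)) * 0.1
    second_temporal = first_temporal + second_fct @ w_proj
    # Temporal "transformer" (stub): mix information along the time axis.
    third_temporal = (second_temporal + np.roll(second_temporal, 1, axis=0)) / 2
    return third_temporal
```

Music information (e.g. beat or chord predictions) would then be read out from the third temporal embeddings by a task-specific head.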
-
Publication Number: US20250078814A1
Publication Date: 2025-03-06
Application Number: US18819280
Application Date: 2024-08-29
Applicant: Lemon Inc., Beijing Zitiao Network Technology Co., Ltd.
Inventor: Dong Guo, Zihao He, Weituo Hao, Xuchen Song, Zongyu Yin, Jingsong Gao, Wei Tsung Lu, Junyu Dai
IPC: G10L15/06, G06F40/126, G10L25/30
Abstract: The present disclosure provides a multi-modal encoder processing method and apparatus, a computer device, and a storage medium. The method includes: acquiring a pair of mask samples to be processed, the pair of mask samples including a text sample and an audio sample associated with each other, where at least one of the text sample and the audio sample is masked; based on a multi-modal encoder, generating a text encoding feature of the text sample and an audio encoding feature of the audio sample, a linear spectrum feature of the audio sample being fused in the text encoding feature, and a linear word feature of the text sample being fused in the audio encoding feature; and predicting the masked information according to the text encoding feature and the audio encoding feature, and correcting the multi-modal encoder based on an accuracy of the predicted mask information.
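The masking and cross-modal fusion described here can be sketched at the shape level, assuming simple linear projections in place of the actual multi-modal encoder; the function name, weight matrices, and mean-pooling fusion below are hypothetical stand-ins for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_pair(text_feat, audio_feat, text_mask):
    """text_feat: (n_tokens, d) word features; audio_feat: (n_frames, d)
    spectral features; text_mask: boolean array over tokens. Masked tokens
    are zeroed, and each modality's encoding is fused with a linear
    projection of the other modality (pooled for simplicity)."""
    d = text_feat.shape[1]
    masked_text = text_feat.copy()
    masked_text[text_mask] = 0.0                    # apply the mask
    w_audio = rng.standard_normal((d, d)) * 0.1     # linear spectrum feature map
    w_text = rng.standard_normal((d, d)) * 0.1      # linear word feature map
    # Cross-modal fusion: each encoding absorbs a projection of the other.
    text_enc = masked_text + audio_feat.mean(axis=0) @ w_audio
    audio_enc = audio_feat + masked_text.mean(axis=0) @ w_text
    return text_enc, audio_enc
```

Training would then predict the zeroed content at masked positions from these fused encodings and update the encoder based on how accurately it recovers the masked information.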
-
Publication Number: US12106740B2
Publication Date: 2024-10-01
Application Number: US17502890
Application Date: 2021-10-15
Applicant: Lemon Inc.
Inventor: Ju-Chiang Wang, Jordan Smith, Wei Tsung Lu
CPC classification number: G10H1/0008, G06N3/08, G10H2210/076, G10H2250/311
Abstract: Devices, systems, and methods related to implementing supervised metric learning during training of a deep neural network model are disclosed herein. In examples, audio input may be received, where the audio input includes a plurality of song fragments from a plurality of songs. For each song fragment, an aligning function may be performed to center the song fragment based on determined beat information, thereby creating a plurality of aligned song fragments. For each song fragment of the plurality of song fragments, an embedding vector may be obtained from the deep neural network model. A batch of aligned song fragments may then be selected from the plurality of aligned song fragments, and a training tuple may be chosen from the batch. A loss metric may be generated based on the selected training tuple, and one or more weights of the deep neural network model may be updated based on the loss metric.
-
Publication Number: US11854558B2
Publication Date: 2023-12-26
Application Number: US17502863
Application Date: 2021-10-15
Applicant: Lemon Inc.
Inventor: Wei Tsung Lu, Ju-Chiang Wang, Minz Won, Keunwoo Choi, Xuchen Song
Abstract: Devices, systems, and methods related to causing an apparatus to generate music information of audio data using a transformer-based neural network model with a multilevel transformer for audio analysis, comprising a spectral transformer and a temporal transformer, are disclosed herein. The processor generates a time-frequency representation of obtained audio data to be applied as input to the transformer-based neural network model; determines spectral embeddings and first temporal embeddings of the audio data based on the time-frequency representation; determines each vector of a second frequency class token (FCT) by passing each vector of the first FCT in the spectral embeddings through the spectral transformer; determines second temporal embeddings by adding a linear projection of the second FCT to the first temporal embeddings; determines third temporal embeddings by passing the second temporal embeddings through the temporal transformer; and generates music information based on the third temporal embeddings.