-
Publication No.: US20250078857A1
Publication Date: 2025-03-06
Application No.: US18241138
Filing Date: 2023-08-31
Applicant: Lemon Inc.
Inventor: Wei Tsung LU , Ju-Chiang WANG
IPC: G10L21/0308 , G10L25/18 , G10L25/30
Abstract: The present disclosure describes techniques for implementing improved audio source separation. A complex spectrum X is split into a plurality of K bands along the frequency axis by applying band-split operations on the complex spectrum X. The complex spectrum is a time-frequency representation of audio signals. Each of the plurality of K bands is denoted as Xk, k=1, . . . , K. Each band Xk comprises one or more frequency bins. An individual multilayer perceptron is applied to each band Xk to extract latent representations and obtain outputs Hk0. A time-domain transformer and a frequency-domain transformer are applied on a stacked representation H0. The time-domain and frequency-domain transformers are applied repeatedly in an interleaved manner L times to obtain the output HL from the transformer blocks. The HL is input into a multi-band mask estimation sub-model. A complex ideal ratio mask is generated based on outputs from the multi-band mask estimation sub-model.
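The band-split stage described above can be illustrated with a minimal NumPy sketch. This is not the patented implementation: the dimensions, random weights, and the `band_mlp` helper are illustrative assumptions, and the interleaved transformer blocks and mask estimation are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the patent):
F, T, K, D = 64, 10, 4, 8      # freq bins, time frames, bands, latent dim
X = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))

# Band-split: divide the F frequency bins into K contiguous bands Xk.
bands = np.array_split(X, K, axis=0)

def band_mlp(Xk, W1, W2):
    """Hypothetical per-band MLP: flatten real/imag parts of each time
    frame, then project through one ReLU hidden layer to D dimensions."""
    feat = np.concatenate([Xk.real, Xk.imag], axis=0).T   # (T, 2*bins_k)
    return np.maximum(feat @ W1, 0.0) @ W2                # (T, D)

# Apply an individual MLP to each band Xk and stack into H0 of shape (K, T, D),
# the representation the interleaved time/frequency transformers would consume.
H0 = np.stack([
    band_mlp(Xk,
             rng.standard_normal((2 * Xk.shape[0], 16)) * 0.1,
             rng.standard_normal((16, D)) * 0.1)
    for Xk in bands
])

print(H0.shape)  # (4, 10, 8)
```

Stacking per-band outputs along a band axis is what lets a frequency-domain transformer attend across bands and a time-domain transformer attend across frames on the same tensor.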
-
Publication No.: US20240404494A1
Publication Date: 2024-12-05
Application No.: US18204855
Filing Date: 2023-06-01
Applicant: Lemon Inc.
Inventor: Wei Tsung LU , Ju-Chiang WANG , Yun-Ning HUNG
IPC: G10H1/00
Abstract: The present disclosure describes techniques for implementing automatic music audio transcription. A deep neural network model may be configured. The deep neural network model comprises a spectral cross-attention sub-model configured to project a spectral representation of each time step t, denoted as St, into a set of latent arrays at the time step t, denoted as θth, h representing an h-th iteration. The deep neural network model comprises a plurality of latent transformers configured to perform self-attention on the set of latent arrays θth. The deep neural network model further comprises a set of temporal transformers configured to enable communications between any pairs of latent arrays θth at different time steps. Training data may be augmented by randomly mixing a plurality of types of datasets comprising a vocal dataset and an instrument dataset. The deep neural network model may be trained using the augmented training data.
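The three attention stages in the abstract can be sketched with single-head, unparameterized attention in NumPy. This is a sketch under assumptions, not the patented model: the spectral input S is assumed to be pre-embedded to the latent dimension, learned projections and iteration over h are omitted, and all shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative dimensions (assumptions): time steps, spectral bins,
# latent arrays per step, latent dimension.
T, F, N, D = 6, 32, 4, 8

def attend(Q, K, V):
    """Scaled dot-product attention with a row-wise softmax."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

S = rng.standard_normal((T, F, D))   # spectral representation St (pre-embedded)
theta = rng.standard_normal((N, D))  # shared initial latent arrays

# Spectral cross-attention: the latents query the spectrum at each time step t,
# projecting St into per-step latent arrays theta_t of shape (T, N, D).
theta_t = np.stack([attend(theta, S[t], S[t]) for t in range(T)])

# Latent transformer (sketched as one self-attention pass over the N latents
# within each time step).
theta_t = np.stack([attend(theta_t[t], theta_t[t], theta_t[t])
                    for t in range(T)])

# Temporal transformer: each latent slot attends across all time steps,
# letting latent arrays at different steps communicate.
theta_t = np.stack([attend(theta_t[:, n], theta_t[:, n], theta_t[:, n])
                    for n in range(N)], axis=1)

print(theta_t.shape)  # (6, 4, 8)
```

Splitting attention into a within-step latent pass and an across-step temporal pass keeps the cost linear in T per latent slot, rather than attending over all T*N tokens jointly.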
-
Publication No.: US20240395231A1
Publication Date: 2024-11-28
Application No.: US18200924
Filing Date: 2023-05-23
Applicant: Lemon Inc.
Inventor: Yun-Ning HUNG , Ju-Chiang WANG , Mojtaba HEYDARI
IPC: G10H1/00
Abstract: The present disclosure describes techniques for tracking beats and downbeats of audio, such as human voices, in real time. Audio may be received in real time. The audio may be split into a sequence of segments. A sequence of audio features representing the sequence of segments of the audio may be extracted. A continuous sequence of activations indicative of probabilities of beats or downbeats occurring in the sequence of segments of the audio may be generated using a machine learning model with causal mechanisms. Timings of the beats or the downbeats occurring in the sequence of segments of the audio may be determined based on the continuous sequence of activations by fusing local rhythmic information for each current segment with information indicative of beats or downbeats in previous segments among the sequence of segments.
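The final step, turning a causal activation sequence into beat timings, can be illustrated with a minimal causal peak picker. This is an assumption-laden sketch, not the patented method: `pick_beats`, its threshold, and the frame rate are hypothetical, and each decision uses only the current and previous frames so it would work on a stream.

```python
def pick_beats(activations, fps=100, threshold=0.5):
    """Causal peak picking over a beat-activation sequence.

    A frame is reported as a beat when the activation stops rising and the
    previous frame exceeded the threshold, i.e. the previous frame was a
    local peak. Only past frames are consulted, so the procedure is causal.
    Returns beat times in seconds at the given frame rate (fps).
    """
    beats = []
    prev = 0.0
    rising = False
    for i, a in enumerate(activations):
        if a > prev:
            rising = True
        elif rising and prev >= threshold:
            beats.append((i - 1) / fps)   # previous frame was a local peak
            rising = False
        else:
            rising = False
        prev = a
    return beats

acts = [0.1, 0.8, 0.2, 0.1, 0.9, 0.3]
print(pick_beats(acts))  # -> [0.01, 0.04]
```

A real-time tracker would additionally fuse these local peaks with tempo information accumulated from previous segments, as the abstract describes; the one-frame lookback here is the simplest causal stand-in for that fusion.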
-