System and method for training a transformer-in-transformer-based neural network model for audio data
Abstract:
Devices, systems, and methods are disclosed for causing an apparatus to generate music information from audio data using a transformer-based neural network model. The model includes a multilevel transformer for audio analysis, comprising a spectral transformer and a temporal transformer. A processor generates a time-frequency representation of obtained audio data to be applied as input to the transformer-based neural network model; determines spectral embeddings and first temporal embeddings of the audio data based on the time-frequency representation; determines each vector of a second frequency class token (FCT) by passing each vector of a first FCT in the spectral embeddings through the spectral transformer; determines second temporal embeddings by adding a linear projection of the second FCT to the first temporal embeddings; determines third temporal embeddings by passing the second temporal embeddings through the temporal transformer; and generates the music information based on the third temporal embeddings.
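The data flow recited in the abstract can be sketched in plain Python. This is an illustrative sketch only, not the claimed implementation. The abstract does not give layer sizes, the embedding scheme, the transformer internals, or the form of the music information, so the following are assumptions: a parameter-free single-head self-attention stands in for each transformer, every frequency bin becomes one spectral token, the output is a vector of per-frame scores, and all names (`self_attention`, `forward`, `d_spec`, `d_temp`, `n_classes`) are hypothetical. The time-frequency representation is taken as given, as a list of frames with one magnitude per frequency bin.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def self_attention(tokens):
    """Stand-in for a transformer: parameter-free single-head
    self-attention with a residual connection (assumption)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        mixed = [sum(w * v[j] for w, v in zip(weights, tokens)) / total
                 for j in range(d)]
        out.append([a + b for a, b in zip(q, mixed)])
    return out


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def forward(spectrogram, d_spec=4, d_temp=8, n_classes=3, seed=0):
    """Map a time-frequency representation to per-frame scores.

    All weights are random placeholders; a trained model would learn them.
    """
    rng = random.Random(seed)
    n_frames, n_bins = len(spectrogram), len(spectrogram[0])

    w_bin = [rng.uniform(-0.1, 0.1) for _ in range(d_spec)]  # scalar bin -> spectral token
    W_frame = random_matrix(d_temp, n_bins, rng)    # frame -> first temporal embedding
    W_proj = random_matrix(d_temp, d_spec, rng)     # linear projection of the second FCT
    W_out = random_matrix(n_classes, d_temp, rng)   # output head (assumed form)
    first_fct = random_matrix(n_frames, d_spec, rng)  # one first-FCT vector per frame

    second_temporal = []
    for t, frame in enumerate(spectrogram):
        # Spectral embeddings: the first FCT vector followed by one token per bin.
        spectral = [first_fct[t]] + [[x * w for w in w_bin] for x in frame]
        # Spectral transformer: the output at the FCT position is the second FCT vector.
        second_fct = self_attention(spectral)[0]
        first_temporal = matvec(W_frame, frame)
        projected = matvec(W_proj, second_fct)
        # Second temporal embedding = first temporal embedding + projected second FCT.
        second_temporal.append([a + b for a, b in zip(first_temporal, projected)])

    # Temporal transformer over the sequence of frames.
    third_temporal = self_attention(second_temporal)
    # Music information, here assumed to be one score vector per frame.
    return [matvec(W_out, h) for h in third_temporal]


if __name__ == "__main__":
    spec = [[0.1 * (t + f) for f in range(6)] for t in range(5)]  # 5 frames, 6 bins
    scores = forward(spec)
    print(len(scores), len(scores[0]))
```

The two attention calls mirror the two levels described above: the inner call mixes information across frequency bins within a single frame, and the outer call mixes information across frames after the FCT summary has been added to each frame's temporal embedding.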