-
Publication No.: US20240386048A1
Publication date: 2024-11-21
Application No.: US18319202
Filing date: 2023-05-17
Applicant: Adobe Inc.
Inventor: Bryan RUSSELL, Justin SALAMON, Daniel McKEE, Josef SIVIC
IPC: G06F16/438, G06F16/432
Abstract: Embodiments are disclosed for an audio recommendation system trained to recommend music audio sequences for pairing with query video sequences using neural networks. In particular, in one or more embodiments, the disclosed systems and methods comprise receiving an input including a query video sequence and natural language text. The disclosed systems and methods further comprise generating a fused visual-text embedding based on a visual embedding and a text embedding corresponding to the input. The disclosed systems and methods further comprise comparing audio embeddings for music audio sequences of a music audio sequences database with the fused visual-text embedding. The disclosed systems and methods further comprise determining a music audio sequence from the music audio sequences database as the recommended music audio sequence for pairing with the query video sequence based on a similarity metric calculated between an audio embedding for the music audio sequence and the fused visual-text embedding.
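The retrieval step described in this abstract can be illustrated with a minimal sketch. It is not the patented implementation: the abstract does not say how the visual and text embeddings are fused or which similarity metric is used, so averaging the normalized embeddings and scoring by cosine similarity are assumptions made here for illustration. The names `fuse`, `cosine`, `recommend`, and `audio_db` are hypothetical, and the embeddings are plain lists of floats standing in for neural network outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def fuse(visual, text):
    """Combine a visual and a text embedding into one query embedding.

    ASSUMPTION: fusion is the mean of the two unit-length embeddings.
    The abstract only states that a fused embedding is generated.
    """
    if len(visual) != len(text):
        raise ValueError("embeddings must have the same dimension")
    v, t = l2_normalize(visual), l2_normalize(text)
    return l2_normalize([(a + b) / 2.0 for a, b in zip(v, t)])


def cosine(a, b):
    """Cosine similarity, used here as the (assumed) similarity metric."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def recommend(visual, text, audio_db):
    """Return the id of the audio embedding most similar to the fused query.

    audio_db is a hypothetical mapping of track id -> audio embedding.
    """
    query = fuse(visual, text)
    return max(audio_db, key=lambda track_id: cosine(query, audio_db[track_id]))


# Toy two-dimensional embeddings, chosen only to exercise the functions.
audio_db = {"calm": [1.0, 0.0], "upbeat": [0.0, 1.0]}
best = recommend(visual=[0.9, 0.1], text=[1.0, 0.0], audio_db=audio_db)
print(best)
```

With real encoders, the database embeddings would typically be computed once and stored, so that each query needs only one fusion step and one pass of similarity scoring.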
-
Publication No.: US20230368503A1
Publication date: 2023-11-16
Application No.: US17742322
Filing date: 2022-05-11
Applicant: Adobe Inc.
Inventor: Justin SALAMON, Bryan RUSSELL, Didac SURIS COLL-VINENT
IPC: G06V10/774, G06V20/40, G06V10/74, G10L25/57, G10L25/03
CPC classification number: G06V10/774, G06V20/49, G06V20/46, G06V10/761, G10L25/57, G10L25/03
Abstract: Embodiments are disclosed for correlating video sequences and audio sequences by a media recommendation system using a trained encoder network. In particular, in one or more embodiments, the disclosed systems and methods comprise receiving a training input including a media sequence, including a video sequence paired with an audio sequence, segmenting the media sequence into a set of video sequence segments and a set of audio sequence segments, extracting visual features for each video sequence segment and audio features for each audio sequence segment, generating, by transformer networks, contextualized visual features from the extracted visual features and contextualized audio features from the extracted audio features, the transformer networks including a visual transformer and an audio transformer, generating predicted video and audio sequence segment pairings based on the contextualized visual and audio features, and training the visual transformer and the audio transformer to generate the contextualized visual and audio features.
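The segmentation and pairing-prediction steps in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the method claimed in the patent: the abstract does not give the segment length, the feature extractors, the transformer design, or the scoring rule. Here the segments are fixed-length and non-overlapping, the feature vectors are hand-written stand-ins for the contextualized features, and a pairing is predicted by the highest dot product. The names `segment`, `predict_pairings`, and `pairing_accuracy` are hypothetical.

```python
def segment(sequence, segment_len):
    """Split a sequence into consecutive, non-overlapping segments.

    ASSUMPTION: fixed-length segments, with any trailing remainder dropped.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    count = len(sequence) // segment_len
    return [sequence[i * segment_len:(i + 1) * segment_len] for i in range(count)]


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def predict_pairings(visual_feats, audio_feats):
    """For each video segment, pick the index of the best-scoring audio segment.

    ASSUMPTION: the pairing score is a dot product between feature vectors.
    """
    return [
        max(range(len(audio_feats)), key=lambda j: dot(v, audio_feats[j]))
        for v in visual_feats
    ]


def pairing_accuracy(predicted, truth):
    """Fraction of video segments matched to their true audio segment."""
    correct = sum(1 for p, t in zip(predicted, truth) if p == t)
    return correct / len(truth)


# Seven frames cut into segments of three: the last frame is dropped.
video_segments = segment(list(range(7)), 3)

# Stand-in feature vectors; real ones would come from the two transformers.
visual_feats = [[1.0, 0.0], [0.0, 1.0]]
audio_feats = [[0.0, 1.0], [1.0, 0.0]]
predicted = predict_pairings(visual_feats, audio_feats)
accuracy = pairing_accuracy(predicted, truth=[1, 0])
print(video_segments, predicted, accuracy)
```

During training, a mismatch between the predicted and true pairings would supply the signal used to update the two transformers; that update step is left out of this sketch.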
-