Automated sound matching within an audio recording

    Publication Number: US11501102B2

    Publication Date: 2022-11-15

    Application Number: US16691205

    Application Date: 2019-11-21

    Applicant: Adobe Inc.

    Abstract: Certain embodiments involve techniques for automatically identifying sounds in an audio recording that match a selected sound. An audio search and editing system receives the audio recording and preprocesses it into audio portions. The audio portions are provided as a query to a neural network that includes a trained embedding model, which analyzes the audio portions in view of the selected sound to estimate feature vectors. The audio search and editing system compares the feature vector for each audio portion against the feature vector for the selected sound and the feature vectors for the negative samples to generate an audio score, a numerical representation of how similar the audio portion is to the selected sound. It then uses the audio scores to classify the audio portions into a first class of matching sounds and a second class of non-matching sounds.
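    As one illustration of the comparison step this abstract describes, the sketch below scores each audio portion by contrasting its similarity to the selected sound with its similarity to negative samples, then thresholds the scores into the two classes. The cosine metric, the positive-minus-negative scoring rule, and the threshold are assumptions for illustration; the patent does not commit to these specifics, and the trained embedding model is represented here only by precomputed vectors.

```python
# Sketch of the scoring and classification step. Embedding vectors are
# assumed to be precomputed by the trained embedding model; cosine
# similarity and the scoring rule below are illustrative assumptions.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def audio_score(portion_vec, selected_vec, negative_vecs):
    """Score a portion: similarity to the selected sound, discounted by
    its best similarity to any negative sample."""
    pos = cosine(portion_vec, selected_vec)
    neg = max(cosine(portion_vec, n) for n in negative_vecs)
    return pos - neg

def classify_portions(portion_vecs, selected_vec, negative_vecs, threshold=0.0):
    """Split portions into matching / non-matching classes by score."""
    matching, non_matching = [], []
    for idx, vec in enumerate(portion_vecs):
        score = audio_score(vec, selected_vec, negative_vecs)
        (matching if score >= threshold else non_matching).append((idx, score))
    return matching, non_matching
```

    Scoring each portion against both the selected sound and the negative samples, as here, makes the score a relative measure: a portion counts as a match only when it resembles the target more than it resembles known non-matches.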

    Representation learning from video with spatial audio

    Publication Number: US11308329B2

    Publication Date: 2022-04-19

    Application Number: US16868805

    Application Date: 2020-05-07

    Applicant: Adobe Inc.

    Abstract: A computer system is trained to understand audio-visual spatial correspondence using audio-visual clips with multi-channel audio. The computer system includes an audio subnetwork, a video subnetwork, and a pretext subnetwork. The audio subnetwork receives the two channels of audio from the audio-visual clips, and the video subnetwork receives the video frames. In a subset of the audio-visual clips, the audio-visual spatial relationship is misaligned, making the audio-visual spatial cues for the audio and video incorrect. The audio subnetwork outputs an audio feature vector for each audio-visual clip, and the video subnetwork outputs a video feature vector for each clip. The audio and video feature vectors for each clip are merged and provided to the pretext subnetwork, which classifies the merged vector as either having a misaligned audio-visual spatial relationship or not. The subnetworks are trained on the loss calculated from this classification.
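    A minimal PyTorch sketch of this pretext setup follows. The subnetwork architectures, feature dimensions, and the use of a left/right channel swap to create misaligned training examples are illustrative assumptions rather than details from the patent.

```python
# Sketch of the three-subnetwork pretext task described above.
# Architectures and the channel-swap misalignment are assumptions.
import torch
import torch.nn as nn

class AudioSubnetwork(nn.Module):
    """Maps two-channel audio (B, 2, T) to a feature vector (B, dim)."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2, 64, kernel_size=9, stride=4), nn.ReLU(),
            nn.Conv1d(64, dim, kernel_size=9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
    def forward(self, x):
        return self.net(x)

class VideoSubnetwork(nn.Module):
    """Maps video frames (B, 3, F, H, W) to a feature vector (B, dim)."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv3d(32, dim, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
    def forward(self, x):
        return self.net(x)

class PretextSubnetwork(nn.Module):
    """Classifies a merged audio+video vector as misaligned or not."""
    def __init__(self, dim=128):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(),
                                  nn.Linear(64, 1))
    def forward(self, a, v):
        return self.head(torch.cat([a, v], dim=1)).squeeze(1)

def training_step(audio, video, audio_net, video_net, pretext_net):
    # Misalign a random half of the batch by swapping L/R channels,
    # one plausible way to break the audio-visual spatial cues.
    labels = (torch.rand(audio.size(0)) < 0.5).float()
    flipped = audio.flip(dims=[1])
    audio = torch.where(labels.view(-1, 1, 1).bool(), flipped, audio)
    logits = pretext_net(audio_net(audio), video_net(video))
    return nn.functional.binary_cross_entropy_with_logits(logits, labels)
```

    Because the labels come from the deliberately introduced misalignment rather than human annotation, the pretext task is self-supervised: the learned audio and video features are the artifact of interest, and the classifier head can be discarded after training.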

    AUTOMATED SOUND MATCHING WITHIN AN AUDIO RECORDING

    Publication Number: US20210158086A1

    Publication Date: 2021-05-27

    Application Number: US16691205

    Application Date: 2019-11-21

    Applicant: Adobe Inc.

    Abstract: Certain embodiments involve techniques for automatically identifying sounds in an audio recording that match a selected sound. An audio search and editing system receives the audio recording and preprocesses it into audio portions. The audio portions are provided as a query to a neural network that includes a trained embedding model, which analyzes the audio portions in view of the selected sound to estimate feature vectors. The audio search and editing system compares the feature vector for each audio portion against the feature vector for the selected sound and the feature vectors for the negative samples to generate an audio score, a numerical representation of how similar the audio portion is to the selected sound. It then uses the audio scores to classify the audio portions into a first class of matching sounds and a second class of non-matching sounds.

    REPRESENTATION LEARNING FROM VIDEO WITH SPATIAL AUDIO

    Publication Number: US20210350135A1

    Publication Date: 2021-11-11

    Application Number: US16868805

    Application Date: 2020-05-07

    Applicant: Adobe Inc.

    Abstract: A computer system is trained to understand audio-visual spatial correspondence using audio-visual clips with multi-channel audio. The computer system includes an audio subnetwork, a video subnetwork, and a pretext subnetwork. The audio subnetwork receives the two channels of audio from the audio-visual clips, and the video subnetwork receives the video frames. In a subset of the audio-visual clips, the audio-visual spatial relationship is misaligned, making the audio-visual spatial cues for the audio and video incorrect. The audio subnetwork outputs an audio feature vector for each audio-visual clip, and the video subnetwork outputs a video feature vector for each clip. The audio and video feature vectors for each clip are merged and provided to the pretext subnetwork, which classifies the merged vector as either having a misaligned audio-visual spatial relationship or not. The subnetworks are trained on the loss calculated from this classification.
