Abstract:
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for selecting, from among a collection of videos, a set of candidate videos that (i) are identified as being associated with a particular song, and (ii) are classified as a cappella video recordings; extracting, from each of the candidate videos of the set, a monophonic melody line from an audio channel of the candidate video; selecting, from among the set of candidate videos, a subset of the candidate videos based on a similarity of the monophonic melody line of the candidate videos of the subset with each other; and providing, to a recognizer that recognizes songs from sounds produced by a human voice, (i) an identifier of the particular song, and (ii) one or more of the monophonic melody lines of the candidate videos of the subset.
Abstract:
Methods, including computer programs encoded on a computer storage medium, for collaborative language model biasing. In one aspect, a method includes receiving (i) data including a set of terms associated with a target user, and, (ii) from each of multiple other users, data including a set of terms associated with the other user, selecting a particular other user based at least on comparing the set of terms associated with the target user to the sets of terms associated with the other users, selecting one or more terms from the set of terms that is associated with the particular other user, obtaining, based on the selected terms that are associated with the particular other user, a biased language model, and providing the biased language model to an automated speech recognizer.
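As a rough illustration of the user-selection and term-selection steps, the sketch below compares term sets with Jaccard similarity, picks the most similar other user, and returns that user's terms that the target user does not already have. The similarity measure, the choice to keep only unseen terms, and all names are assumptions made for the example; the abstract does not specify them, and the construction of the biased language model itself is out of scope here.

```python
def jaccard(a, b):
    """Jaccard similarity between two term sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_bias_terms(target_terms, other_users_terms):
    """Return (most similar user id, sorted terms to bias toward).

    `other_users_terms` maps a hypothetical user id to that user's term set.
    """
    if not other_users_terms:
        return None, []
    # Iterate ids in sorted order so ties resolve deterministically.
    best = max(
        sorted(other_users_terms),
        key=lambda uid: jaccard(target_terms, other_users_terms[uid]),
    )
    # Assumed policy: bias only toward terms the target user lacks.
    new_terms = sorted(set(other_users_terms[best]) - set(target_terms))
    return best, new_terms


target = {"jazz", "saxophone", "trumpet"}
others = {
    "user_a": {"jazz", "saxophone", "bebop", "swing"},
    "user_b": {"soccer", "goalkeeper"},
}
best_user, bias_terms = select_bias_terms(target, others)
```

The returned terms would then be used to obtain the biased language model that is handed to the speech recognizer.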
Abstract:
In one example, a system comprises at least one processor configured to determine an indication of an audio portion of video content, determine, based at least in part on the indication, one or more candidate audio tracks, determine, based at least in part on the one or more candidate audio tracks, one or more search terms, and provide a search query that includes the search terms. The at least one processor may be further configured to, in response to the search query, receive a response that indicates a number of search results, wherein each one of the search results is associated with content that includes the one or more search terms, select, based at least in part on the response, a particular audio track of the one or more candidate audio tracks, and send a message that associates the video content with at least the particular audio track.
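The selection among candidate audio tracks can be sketched as follows. The example assumes each candidate is a dictionary with `title` and `artist` fields, that search terms are simply those two fields quoted, and that the track whose query reports the most search results is chosen. The `count_results` callable stands in for a real search backend and is hypothetical, as is the "largest count wins" policy.

```python
def build_query(track):
    """Build search terms from a candidate track (assumed fields: title, artist)."""
    return f'"{track["title"]}" "{track["artist"]}"'


def select_track(candidates, count_results):
    """Pick the candidate whose search query reports the most results.

    `count_results` is a hypothetical callable: query string -> number of results.
    """
    scored = [(count_results(build_query(t)), t) for t in candidates]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


candidates = [
    {"title": "Song A", "artist": "Artist X"},
    {"title": "Song A", "artist": "Artist Y"},
]
# Made-up result counts, used only to exercise the sketch.
fake_counts = {
    '"Song A" "Artist X"': 1200,
    '"Song A" "Artist Y"': 3,
}
chosen = select_track(candidates, lambda q: fake_counts.get(q, 0))
```

The chosen track would then be named in the message that associates it with the video content.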
Abstract:
Systems and methods are provided herein relating to real-time detection of inactive broadcasts during live stream ingestion. Both audio fingerprints and video fingerprints can be dynamically and continuously generated for a live stream ingestion. Sets of video fingerprints and sets of audio fingerprints can be continuously generated based on common successive overlapping time windows. A set of audio fingerprints and a set of video fingerprints can be associated with each time window. Video similarity scores and audio similarity scores can be generated for each time window to determine whether the stream is inactive or static during the time window. Only fingerprints relating to an active broadcast can be indexed in a fingerprint index.
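A minimal sketch of the windowing and scoring idea follows. It assumes fingerprints are comparable values (plain integers here), scores a window by the fraction of consecutive fingerprints that are identical, and flags a window as inactive only when both the video and the audio score reach a threshold. The window size, step, threshold, and the similarity measure are illustrative assumptions rather than details from the abstract.

```python
def overlapping_windows(n_items, size, step):
    """Yield (start, end) index pairs for successive overlapping windows."""
    if n_items <= 0:
        return
    for start in range(0, max(n_items - size, 0) + 1, step):
        yield start, min(start + size, n_items)


def window_similarity(fingerprints):
    """Fraction of consecutive fingerprints in the window that are identical."""
    if len(fingerprints) < 2:
        return 1.0
    same = sum(1 for a, b in zip(fingerprints, fingerprints[1:]) if a == b)
    return same / (len(fingerprints) - 1)


def inactive_windows(video_fps, audio_fps, size=4, step=2, threshold=0.9):
    """Return the windows in which both video and audio look static."""
    n = min(len(video_fps), len(audio_fps))
    flagged = []
    for start, end in overlapping_windows(n, size, step):
        v_score = window_similarity(video_fps[start:end])
        a_score = window_similarity(audio_fps[start:end])
        if v_score >= threshold and a_score >= threshold:
            flagged.append((start, end))
    return flagged


# Toy stream: four changing frames, then a frozen picture with silent audio.
video = [1, 2, 3, 4, 7, 7, 7, 7, 7, 7]
audio = [5, 6, 7, 8, 0, 0, 0, 0, 0, 0]
flagged = inactive_windows(video, audio)
```

Fingerprints falling inside a flagged window would be kept out of the fingerprint index, while the remaining windows would be indexed as active broadcast.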
Abstract:
A matching system receives probe audio samples for comparison against references in a data store. Comparisons are generated to determine a sufficient match for a portion, or first amount, of the probe sample, and ranking scores are assigned to the resulting match references. The match references are retained unless they meet a score threshold. Comparisons continue to be generated with second amounts of the probe sample, and the retained references are updated with further matching references that are assigned ranking scores. The retained results are merged and, once determined to satisfy a score threshold, released as output results for the matching references.
Abstract:
The present disclosure provides methods and apparatuses that enable an apparatus to identify sounds from short samples of audio. The apparatus may capture an audio sample and create several audio signals of different lengths, each containing audio from the captured audio sample. The apparatus may process the several audio signals in an attempt to identify features of the audio signal that indicate an identification of the captured sound. Because shorter audio samples can be analyzed more quickly, the system may first process the shortest audio samples in order to quickly identify features of the audio signal. Because longer audio samples contain more information, the system may be able to more accurately identify features in the audio signal in longer audio samples. However, analyzing longer audio signals takes more buffered audio than identifying features in shorter signals. Therefore, the present system attempts to identify features in the shortest audio signals first.
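The shortest-first strategy can be sketched in a few lines. The example assumes the signals of different lengths are prefixes of the captured sample and that a `classify` callable returns a label or `None`; both are assumptions for illustration, and the toy peak detector merely stands in for a real feature extractor.

```python
def make_signals(samples, lengths):
    """Create signals of increasing length, each a prefix of the captured samples."""
    return [samples[:n] for n in sorted(lengths) if n <= len(samples)]


def identify(samples, lengths, classify):
    """Try the shortest signal first and stop at the first identification.

    `classify` is a hypothetical callable: signal -> label or None.
    Returns (label, number of samples that were needed).
    """
    for signal in make_signals(samples, lengths):
        label = classify(signal)
        if label is not None:
            return label, len(signal)
    return None, 0


def toy_classifier(signal):
    """Stand-in detector: report a 'clap' when any sample exceeds 0.8."""
    return "clap" if any(abs(s) > 0.8 for s in signal) else None


# 16 captured samples with a single loud peak at index 6.
captured = [0.0] * 6 + [0.9] + [0.0] * 9
label, used = identify(captured, lengths=[4, 8, 16], classify=toy_classifier)
```

Here the 4-sample signal contains no peak, so the sketch falls through to the 8-sample signal and stops there without ever processing the full 16 samples.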
Abstract:
Systems and methods prevent or restrict the mining of content on a mobile device. For example, a method may include determining that content to be displayed on a screen includes content that matches a mining-restriction trigger, inserting a mining-restriction mark in the content that protects at least a portion of the content, and displaying the content with the mining-restriction mark on the screen. As another example, a method may include identifying, by a first application running on a mobile device, a mining-restriction mark in frame buffer data, the mining-restriction mark having been inserted by a second application, and determining whether the mining-restriction mark prevents mining of content. The method may also include preventing mining when the mining-restriction mark prevents mining and, when the mining-restriction mark does not prevent mining, determining a restriction for the data based on the mining-restriction mark and providing the restriction with the data for further processing.
Abstract:
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speaker verification. The methods, systems, and apparatus include actions of inputting speech data that corresponds to a particular utterance to a first neural network and determining an evaluation vector based on output at a hidden layer of the first neural network. Additional actions include obtaining a reference vector that corresponds to a past utterance of a particular speaker. Further actions include inputting the evaluation vector and the reference vector to a second neural network that is trained on a set of labeled pairs of feature vectors to identify whether speakers associated with the labeled pairs of feature vectors are the same speaker. More actions include determining, based on an output of the second neural network, whether the particular utterance was likely spoken by the particular speaker.
Abstract:
Systems and techniques for adding pitch shift resistance to an audio fingerprint are presented. In particular, an audio track for a media file is received. A first audio fingerprint for the audio track with a first pitch shift and an Nth audio fingerprint for the audio track with an Mth pitch shift are generated, where N is an integer greater than or equal to two and M is an integer greater than or equal to two. A combined audio fingerprint is generated from at least the first audio fingerprint and the Nth audio fingerprint.
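The idea of combining fingerprints computed at several pitch shifts can be shown with a toy model. The sketch assumes the audio track has already been reduced to one dominant log-frequency bin per frame, so that a pitch shift is simply an integer offset added to every bin, and it treats a fingerprint as the set of consecutive bin pairs. Real fingerprinting and pitch shifting are considerably more involved; the representation, function names, and shift values here are all assumptions.

```python
def fingerprint(peaks, shift=0):
    """Toy fingerprint: set of (bin, next_bin) pairs after shifting every bin."""
    shifted = [p + shift for p in peaks]
    return frozenset(zip(shifted, shifted[1:]))


def combined_fingerprint(peaks, shifts):
    """Union of the fingerprints generated at each pitch shift."""
    combined = set()
    for s in shifts:
        combined |= fingerprint(peaks, s)
    return frozenset(combined)


def overlap(query_fp, reference_fp):
    """Fraction of the query's hashes that are found in the reference."""
    if not query_fp:
        return 0.0
    return len(query_fp & reference_fp) / len(query_fp)


reference_peaks = [10, 14, 12, 17, 15, 20]
combined = combined_fingerprint(reference_peaks, shifts=[-1, 0, 1])
plain = fingerprint(reference_peaks)

# A query that is the same track shifted up by one bin.
query = fingerprint(reference_peaks, shift=1)
score_combined = overlap(query, combined)
score_plain = overlap(query, plain)
```

In this toy setting the shifted query matches the combined fingerprint completely but shares no hashes with the unshifted fingerprint alone, which is the kind of resistance the combined fingerprint is meant to provide.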