Abstract:
A method and system generate and compare fingerprints for videos in a video library. The video fingerprints provide a compact representation of the temporal locations of discontinuities in the video that can be used to quickly and efficiently identify video content. Discontinuities can be, for example, shot boundaries in the video frame sequence or silent points in the audio stream. Because the fingerprints are based on structural discontinuity characteristics rather than exact bit sequences, visual content of videos can be effectively compared even when there are small differences between the videos in compression factors, source resolutions, start and stop times, frame rates, and so on. Comparison of video fingerprints can be used, for example, to search for and remove copyright-protected videos from a video library. Furthermore, duplicate videos can be detected and discarded in order to preserve storage space.
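As a rough illustration of a discontinuity-based fingerprint, the sketch below marks shot boundaries where consecutive frame signatures differ sharply and then encodes the time intervals between boundaries. The function names, the per-frame signature format, the sum-of-absolute-differences test, and the rounding to 0.1 s are all assumptions made for this example and are not taken from the abstract.

```python
def shot_boundaries(frame_signatures, threshold):
    """Return frame indices whose signature differs sharply from the previous frame.

    frame_signatures: list of equal-length numeric lists (hypothetical per-frame
    summaries, e.g. coarse color histograms).
    """
    boundaries = []
    for i in range(1, len(frame_signatures)):
        prev, cur = frame_signatures[i - 1], frame_signatures[i]
        diff = sum(abs(a - b) for a, b in zip(prev, cur))
        if diff > threshold:
            boundaries.append(i)
    return boundaries


def fingerprint(boundaries, fps):
    """Encode the gaps (in seconds) between consecutive discontinuities.

    Using gaps rather than absolute positions makes the result independent of
    the start time, and converting frames to seconds removes the frame rate.
    """
    times = [b / fps for b in boundaries]
    return [round(t2 - t1, 1) for t1, t2 in zip(times, times[1:])]
```

For instance, boundaries at frames 20, 50, 90 of a 10 fps video and at frames 60, 120, 200 of a 20 fps copy that starts one second earlier give the same list of gaps.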
Abstract:
A computer-implemented method can include receiving training data that includes a set of non-matching pairs and a set of matching pairs. The method can further include calculating a non-matching collision probability for each non-matching pair of the set of non-matching pairs and a matching collision probability for each matching pair of the set of matching pairs. The method can also include generating a machine learning model that includes a first threshold and a second threshold. An unknown item and a particular known item are classified as not matching when their collision probability is less than the first threshold, and as matching when their collision probability is greater than the second threshold. The first threshold and the second threshold can be selected based on a minimization of errors in classification of matching and non-matching pairs in the training data, and a maximization of a retrieval efficiency metric.
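The two-threshold decision rule can be sketched as follows. The grid search and the cost function are assumptions for illustration: classification errors are counted directly, and the share of pairs left undecided between the thresholds is used as a simple stand-in for the retrieval efficiency metric, which the abstract does not define.

```python
def classify(prob, low, high):
    """Three-way decision on a collision probability."""
    if prob < low:
        return "non-match"
    if prob > high:
        return "match"
    return "undecided"


def select_thresholds(match_probs, nonmatch_probs, candidates, undecided_weight=0.5):
    """Grid-search (low, high) over candidate values.

    Cost = misclassified training pairs + undecided_weight * undecided pairs.
    undecided_weight is a hypothetical knob, not part of the source.
    """
    best = None
    all_probs = list(match_probs) + list(nonmatch_probs)
    for low in candidates:
        for high in candidates:
            if low > high:
                continue
            errors = sum(p < low for p in match_probs)
            errors += sum(p > high for p in nonmatch_probs)
            undecided = sum(low <= p <= high for p in all_probs)
            cost = errors + undecided_weight * undecided
            if best is None or cost < best[0]:
                best = (cost, low, high)
    return best[1], best[2]
```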
Abstract:
A video server receives an uploaded video and determines whether the video contains third-party content and which portions of the uploaded video match third-party content. The video server determines whether to degrade the matching portions and/or how (e.g., extent, type) to do so. The video server separates the matching portion from original portions in the uploaded video and generates a degraded version of the matching content by applying an effect such as compression, edge distortion, temporal distortion, noise addition, color distortion, or audio distortion. The video server combines the degraded portions with the original portions to output a degraded version of the uploaded video. The video server stores and/or distributes the degraded version of the uploaded video. The video server may offer the uploading user licensing terms with the content owner that the user may accept to reverse the degradation.
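A minimal sketch of the separate, degrade, and recombine flow is shown below. Frames are modeled as flat lists of 8-bit pixel values and the matching portions as half-open frame ranges; coarse quantization stands in for the color distortion effect. These representations and names are assumptions for the example only.

```python
def degrade_frame(frame, step=64):
    """Quantize pixel values to multiples of `step` (a crude color distortion)."""
    return [(px // step) * step for px in frame]


def degrade_video(frames, matching_ranges, step=64):
    """Degrade only frames inside a matching range; copy the rest unchanged.

    matching_ranges: list of (start, end) frame indices, end exclusive.
    """
    output = []
    for i, frame in enumerate(frames):
        in_match = any(start <= i < end for start, end in matching_ranges)
        output.append(degrade_frame(frame, step) if in_match else list(frame))
    return output
```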
Abstract:
A method and apparatus are provided for performing an image search based on a first search query having a portion P1 and a portion P2. Based on the first search query, a second search query is generated that includes a portion P3 and the portion P2, such that the second search query is broader in scope than the first search query while still retaining the portion P2 of the first query. A first image search is then performed for the first search query to obtain a first set of search results, and a second image search is performed for the second search query to obtain a second set of search results. An image from the first set of search results is then selected for presentation to a user, wherein the selection is based on content of the second set of search results.
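One way to picture the query broadening and selection steps is the sketch below. The abstract does not say how the second result set informs the selection, so the rule used here (prefer the first-set image that ranks highest in the broader result set) is purely an assumption, as are the function names and the string-based query model.

```python
def broaden(query, p1, p3):
    """Build the second query by swapping portion P1 for the broader P3.

    Everything else in the query (portion P2) is kept as-is.
    """
    return query.replace(p1, p3, 1)


def select_image(first_results, second_results):
    """Pick an image from the first result set using the second set.

    Assumed rule: among images present in both sets, choose the one ranked
    highest in the second (broader) set; otherwise fall back to the top
    result of the first set.
    """
    rank = {image: i for i, image in enumerate(second_results)}
    common = [image for image in first_results if image in rank]
    if common:
        return min(common, key=rank.get)
    return first_results[0] if first_results else None
```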
Abstract:
This disclosure relates to transformation invariant media matching. A fingerprinting component can generate a transformation invariant identifier for media content by adaptively encoding the relative ordering of signal markers in media content. The signal markers can be adaptively encoded via reference point geometry or ratio histograms. An identification component compares the identifier against a set of identifiers for known media content, and the media content can be matched or identified as a function of the comparison.
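To illustrate why encoding relative ordering or ratios can be invariant to common transformations, the sketch below shows two simple encodings. Both are assumptions chosen for the example, not the disclosed encoding: a rank signature of marker values, which is unchanged by any increasing rescaling such as a gain change, and ratios of consecutive gaps between marker times, which are unchanged by a time shift or uniform time stretch.

```python
def ordinal_signature(values):
    """Return the rank of each marker value (0 = smallest).

    Assumes the values are distinct; ties are broken by position.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return tuple(ranks)


def interval_ratios(times):
    """Return ratios of consecutive gaps between marker times.

    Assumes strictly increasing times, so no gap is zero.
    """
    gaps = [b - a for a, b in zip(times, times[1:])]
    return [round(g2 / g1, 3) for g1, g2 in zip(gaps, gaps[1:])]
```

A histogram over such ratios would then be one possible compact identifier to compare against known media.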
Abstract:
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for determining image search results. One of the methods includes generating a plurality of feature vectors for each image in a collection of images, wherein each feature vector is associated with an image tile of an image, wherein each feature vector corresponds to one of a plurality of predetermined visual words. All images in the collection of images that share at least a threshold number of matching visual words associated with matching image tiles are classified as near-duplicate images.
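The near-duplicate test can be sketched as below, assuming each image has already been reduced to one visual word ID per tile, listed in a fixed tile order. The dictionary layout, the function names, and the brute-force pairwise comparison are assumptions for illustration; a real system would likely use an index rather than comparing every pair.

```python
from itertools import combinations


def matching_tiles(words_a, words_b):
    """Count tile positions where both images have the same visual word."""
    return sum(1 for a, b in zip(words_a, words_b) if a == b)


def near_duplicates(collection, threshold):
    """Return pairs of image names sharing at least `threshold` matching tiles.

    collection: dict mapping image name -> list of visual word IDs, one per tile.
    """
    pairs = []
    for name_a, name_b in combinations(sorted(collection), 2):
        if matching_tiles(collection[name_a], collection[name_b]) >= threshold:
            pairs.append((name_a, name_b))
    return pairs
```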
Abstract:
The present disclosure provides systems and methods that enable distance metric learning using proxies. A machine-learned distance model can be trained in a proxy space in which a loss function compares an embedding provided for an anchor data point of a training dataset to a positive proxy and one or more negative proxies, where each of the positive proxy and the one or more negative proxies serves as a proxy for two or more data points included in the training dataset. Thus, each proxy can approximate a number of data points, enabling faster convergence (e.g., reduced training time). According to another aspect, the proxies of the proxy space can themselves be learned parameters, such that the proxies and the model are trained jointly. The present disclosure provides example experiments which demonstrate a new state of the art on several popular training datasets.
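A proxy-based loss of this kind can be sketched with an NCA-style form: the negative log of the ratio between the anchor's affinity to its positive proxy and its summed affinity to the negative proxies. The squared Euclidean distance and this exact form are assumptions for illustration; the abstract does not give the formula, and in practice the proxies would be trainable parameters updated together with the model.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def proxy_loss(anchor, positive_proxy, negative_proxies):
    """NCA-style loss for one anchor embedding.

    Equals -log( exp(-d(anchor, pos)) / sum_z exp(-d(anchor, neg_z)) ),
    rewritten as d(anchor, pos) + log(sum_z exp(-d(anchor, neg_z))).
    """
    positive_term = squared_distance(anchor, positive_proxy)
    negative_sum = sum(
        math.exp(-squared_distance(anchor, proxy)) for proxy in negative_proxies
    )
    return positive_term + math.log(negative_sum)
```

The loss falls as the anchor moves toward its positive proxy and away from the negative ones, and each anchor is compared with a handful of proxies rather than with many other data points.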
Abstract:
A neural network system that includes: multiple subnetworks that include: a first subnetwork including multiple first modules, each first module including: a pass-through convolutional layer configured to process the subnetwork input for the first subnetwork to generate a pass-through output; an average pooling stack of neural network layers configured to collectively process the subnetwork input for the first subnetwork to generate an average pooling output; a first stack of convolutional neural network layers configured to collectively process the subnetwork input for the first subnetwork to generate a first stack output; a second stack of convolutional neural network layers configured to collectively process the subnetwork input for the first subnetwork to generate a second stack output; and a concatenation layer configured to concatenate the pass-through output, the average pooling output, the first stack output, and the second stack output to generate a first module output for the first module.
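The four-branch structure of a first module can be pictured with the toy sketch below, which uses one-dimensional feature maps stored as lists of channels. The 1x1 convolution, the window-3 average pooling, the weights, and all function names are simplifying assumptions for this example; the abstract does not specify layer sizes or kernel shapes.

```python
def conv1x1(channels, weights):
    """Mix input channels at each position; weights[o][i] maps input i to output o."""
    length = len(channels[0])
    return [
        [sum(w * ch[p] for w, ch in zip(row, channels)) for p in range(length)]
        for row in weights
    ]


def avg_pool3(channels):
    """Window-3, stride-1 average pooling; windows are truncated at the edges."""
    pooled = []
    for ch in channels:
        n = len(ch)
        windows = [ch[max(0, p - 1):min(n, p + 2)] for p in range(n)]
        pooled.append([sum(w) / len(w) for w in windows])
    return pooled


def concat_module(channels, branches):
    """Run every branch on the same input and concatenate along the channel axis."""
    output = []
    for branch in branches:
        output.extend(branch(channels))
    return output


# Hypothetical branch definitions for a single-channel input.
example_branches = [
    lambda c: conv1x1(c, [[2.0]]),                              # pass-through
    lambda c: conv1x1(avg_pool3(c), [[1.0]]),                   # average pooling stack
    lambda c: conv1x1(conv1x1(c, [[1.0]]), [[3.0]]),            # first stack
    lambda c: conv1x1(conv1x1(conv1x1(c, [[1.0]]), [[1.0]]), [[0.5]]),  # second stack
]
```

Calling `concat_module([[1.0, 2.0, 3.0]], example_branches)` yields four output channels, one per branch, since each branch here produces a single channel.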