Abstract:
Font recognition and similarity determination techniques and systems are described. In a first example, localization techniques are described in which a model is trained, using machine learning (e.g., a convolutional neural network), on training images. The model is then used to localize text in a subsequently received image, and may do so automatically and without user intervention, e.g., without the user specifying any of the edges of a bounding box. In a second example, a deep neural network is directly learned as an embedding function of a model that is usable to determine font similarity. In a third example, techniques are described that leverage attributes described in metadata associated with fonts as part of font recognition and similarity determinations.
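Below is a minimal sketch of the text-localization step, assuming PyTorch; the model architecture, the box-regression loss, and the tensor shapes are illustrative placeholders rather than the claimed implementation.

```python
# Sketch (not the patented implementation): a convolutional model trained on
# training images to regress a text bounding box, then applied to a new image
# without any user-supplied box edges.
import torch
import torch.nn as nn

class TextLocalizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        # Regress normalized box coordinates (x1, y1, x2, y2) in [0, 1].
        self.box_head = nn.Sequential(nn.Flatten(), nn.Linear(64, 4), nn.Sigmoid())

    def forward(self, images):
        return self.box_head(self.features(images))

model = TextLocalizer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Training step: images paired with ground-truth text boxes (random placeholders here).
images = torch.rand(8, 3, 128, 128)
true_boxes = torch.rand(8, 4)
loss = nn.functional.smooth_l1_loss(model(images), true_boxes)
loss.backward()
optimizer.step()

# Inference: the model localizes text in a subsequently received image automatically,
# without the user specifying any edges of a bounding box.
with torch.no_grad():
    predicted_box = model(torch.rand(1, 3, 128, 128))
```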
Abstract:
Techniques and systems are described to generate a compact video feature representation for sequences of frames in a video. In one example, values of features are extracted from each frame of a plurality of frames of a video using machine learning, e.g., through use of a convolutional neural network. A video feature representation of the temporal order dynamics of the video is then generated, e.g., through use of a recurrent neural network. For example, the maximum value reached by each feature of the plurality of features across the plurality of frames of the video is maintained. A timestamp indicative of when that maximum value is reached is also maintained for each feature of the plurality of features. The video feature representation is then output as a basis to determine similarity of the video with at least one other video.
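Below is a minimal sketch of the max-plus-timestamp representation described above, assuming per-frame feature values are already available (random arrays stand in for convolutional-network outputs); the recurrent-network variant is not shown, and the cosine similarity used to compare two videos is an illustrative choice, not taken from the abstract.

```python
# Sketch: reduce per-frame feature values to (a) the maximum reached by each feature
# and (b) the normalized frame index (timestamp) at which that maximum occurs.
import numpy as np

def video_feature_representation(frame_features: np.ndarray) -> np.ndarray:
    """frame_features: (num_frames, num_features) array of per-frame feature values."""
    num_frames = frame_features.shape[0]
    max_values = frame_features.max(axis=0)                              # max reached per feature
    timestamps = frame_features.argmax(axis=0) / max(num_frames - 1, 1)  # when it was reached
    return np.concatenate([max_values, timestamps])

def similarity(rep_a: np.ndarray, rep_b: np.ndarray) -> float:
    # Illustrative similarity measure between two video feature representations.
    return float(rep_a @ rep_b / (np.linalg.norm(rep_a) * np.linalg.norm(rep_b) + 1e-8))

video_a = np.random.rand(120, 256)   # 120 frames, 256 features per frame
video_b = np.random.rand(90, 256)
print(similarity(video_feature_representation(video_a),
                 video_feature_representation(video_b)))
```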
Abstract:
Techniques for image captioning with weak supervision are described herein. In implementations, weak supervision data regarding a target image is obtained and utilized to provide detail information that supplements global image concepts derived for image captioning. Weak supervision data refers to noisy data that is not closely curated and may include errors. Given a target image, weak supervision data for visually similar images may be collected from sources of weakly annotated images, such as online social networks. Generally, images posted online include “weak” annotations in the form of tags, titles, labels, and short descriptions added by users. Weak supervision data for the target image is generated by extracting keywords for visually similar images discovered in the different sources. The keywords included in the weak supervision data are then employed to modulate weights applied for probabilistic classifications during image captioning analysis.
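Below is a minimal sketch of how keywords from weakly annotated, visually similar images might modulate classification weights, assuming concept probabilities from a captioning classifier are already available; the boost factor, the tag data, and the helper names are hypothetical.

```python
# Sketch: aggregate noisy tags from visually similar images into keyword counts,
# then up-weight matching concepts before renormalizing the classifier's probabilities.
from collections import Counter

def collect_weak_keywords(similar_image_tags):
    """Aggregate noisy tags/titles/labels from visually similar images into keyword counts."""
    counts = Counter()
    for tags in similar_image_tags:
        counts.update(tag.lower() for tag in tags)
    return counts

def modulate_concept_weights(concept_probs, keyword_counts, boost=0.5):
    """Up-weight concepts that appear in the weak supervision keywords, then renormalize."""
    weighted = {c: p * (1.0 + boost * keyword_counts.get(c, 0))
                for c, p in concept_probs.items()}
    total = sum(weighted.values())
    return {c: w / total for c, w in weighted.items()}

# Noisy annotations collected for visually similar images (e.g., from social networks).
weak_tags = [["beach", "sunset", "vacation"], ["sunset", "ocean"], ["beach", "surf"]]
concept_probs = {"beach": 0.30, "mountain": 0.25, "sunset": 0.20, "city": 0.25}
print(modulate_concept_weights(concept_probs, collect_weak_keywords(weak_tags)))
```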
Abstract:
Embodiments of the present invention are directed at providing a font similarity system. In one embodiment, a new font is detected on a computing device. In response to the detection of the new font, a pre-computed font list is checked to determine whether the new font is included therein. The pre-computed font list includes feature representations, generated independently of the computing device, for corresponding fonts. In response to a determination that the new font is absent from the pre-computed font list, a feature representation for the new font is generated. The generated feature representation is capable of being utilized for a similarity analysis of the new font. The feature representation is then stored in a supplemental font list to enable identification of one or more fonts installed on the computing device that are similar to the new font. Other embodiments may be described and/or claimed.
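Below is a minimal sketch of the lookup-then-generate flow, assuming a placeholder extract_font_features embedding function; the representation size, similarity measure, and font names are illustrative, not the claimed implementation.

```python
# Sketch: check the pre-computed font list on detection of a new font; if absent,
# generate a feature representation and store it in a supplemental font list.
import numpy as np

precomputed_font_list = {          # feature representations generated off-device
    "Helvetica": np.random.rand(128),
    "Garamond": np.random.rand(128),
}
supplemental_font_list = {}        # representations generated on this device

def extract_font_features(font_name: str) -> np.ndarray:
    """Placeholder embedding; a real system would render glyphs and run a model."""
    rng = np.random.default_rng(abs(hash(font_name)) % (2**32))
    return rng.random(128)

def on_new_font_detected(font_name: str) -> None:
    if font_name in precomputed_font_list or font_name in supplemental_font_list:
        return  # a representation is already available
    supplemental_font_list[font_name] = extract_font_features(font_name)

def similar_fonts(font_name: str, top_k: int = 3):
    candidates = {**precomputed_font_list, **supplemental_font_list}
    query = candidates[font_name]
    scores = {
        name: float(query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec)))
        for name, vec in candidates.items() if name != font_name
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

on_new_font_detected("MyCustomFont")
print(similar_fonts("MyCustomFont"))
```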
Abstract:
A system is configured to annotate an image with tags. As configured, the system accesses an image and generates a set of vectors for the image. The set of vectors may be generated by mathematically transforming the image, such as by applying a mathematical transform to predetermined regions of the image. The system may then query a database of tagged images by submitting the set of vectors as search criteria to a search engine. The querying of the database may obtain a set of tagged images. Next, the system may rank the obtained set of tagged images according to similarity scores that quantify degrees of similarity between the image and each tagged image obtained. Tags from a top-ranked subset of the tagged images may be extracted by the system, which may then annotate the image with these extracted tags.
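Below is a minimal sketch of the annotate-by-similar-images flow, with a toy per-region mean-color transform standing in for the mathematical transform and a brute-force nearest-neighbor search standing in for the search-engine query; all names and sizes are illustrative.

```python
# Sketch: transform predetermined image regions into vectors, retrieve tagged images,
# rank them by similarity, and annotate the image with tags from the top-ranked subset.
import numpy as np

def image_vectors(image: np.ndarray, grid: int = 4) -> np.ndarray:
    """Apply a simple transform (mean color) to each cell of a grid of regions."""
    h, w, _ = image.shape
    cells = []
    for i in range(grid):
        for j in range(grid):
            cell = image[i * h // grid:(i + 1) * h // grid,
                         j * w // grid:(j + 1) * w // grid]
            cells.append(cell.mean(axis=(0, 1)))
    return np.concatenate(cells)

def annotate(image, tagged_database, top_k=3):
    query = image_vectors(image)
    # Rank tagged images by a similarity score (negative Euclidean distance here).
    ranked = sorted(tagged_database,
                    key=lambda entry: -np.linalg.norm(query - entry["vectors"]))
    tags = set()
    for entry in ranked[:top_k]:            # top-ranked subset of tagged images
        tags.update(entry["tags"])
    return tags

database = [{"vectors": image_vectors(np.random.rand(64, 64, 3)),
             "tags": {"cat", "indoor"}} for _ in range(10)]
print(annotate(np.random.rand(64, 64, 3), database))
```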
Abstract:
A system and method for distributed similarity learning for high-dimensional image features are described. A set of data features is accessed. Subspaces from a space formed by the set of data features are determined using a set of projection matrices. Each subspace has a dimension lower than a dimension of the set of data features. Similarity functions are computed for the subspaces. Each similarity function is based on the dimension of the corresponding subspace. A linear combination of the similarity functions is performed to determine a similarity function for the set of data features.
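Below is a minimal sketch of combining per-subspace similarity functions, assuming the projection matrices, per-subspace bilinear metrics, and combination weights have already been learned (random placeholders here); the dimensions are illustrative.

```python
# Sketch: project high-dimensional features into lower-dimensional subspaces, compute
# a similarity function per subspace, and take a linear combination of those functions.
import numpy as np

feature_dim, subspace_dim, num_subspaces = 1024, 32, 8

# One projection matrix per subspace, each mapping features to a lower dimension.
projections = [np.random.randn(subspace_dim, feature_dim) for _ in range(num_subspaces)]
# One learned bilinear similarity matrix per subspace.
metrics = [np.random.randn(subspace_dim, subspace_dim) for _ in range(num_subspaces)]
# Weights for the linear combination of subspace similarities.
weights = np.random.rand(num_subspaces)

def subspace_similarity(x, y, P, M):
    """Similarity computed within the subspace defined by projection P."""
    return float((P @ x) @ M @ (P @ y))

def combined_similarity(x, y):
    """Linear combination of the subspace similarity functions."""
    return sum(w * subspace_similarity(x, y, P, M)
               for w, P, M in zip(weights, projections, metrics))

x, y = np.random.randn(feature_dim), np.random.randn(feature_dim)
print(combined_similarity(x, y))
```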
Abstract:
Photometric stabilization for time-compressed video is described. Initially, video content captured by a video capturing device is time-compressed by selecting a subset of frames from the video content according to a frame sampling technique. Photometric characteristics are then stabilized across the frames of the time-compressed video. This involves determining correspondences of pixels in adjacent frames of the time-compressed video. Photometric transformations are then determined that describe how photometric characteristics (e.g., one or both of luminance and chrominance) change between the adjacent frames, given movement of objects through the captured scene. Based on the determined photometric transformations, filters are computed for smoothing photometric characteristic changes across the time-compressed video. Photometrically stabilized time-compressed video is generated from the time-compressed video by using the filters to smooth the photometric characteristic changes.
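Below is a heavily simplified sketch of the pipeline, assuming uniform frame sampling, a per-frame gain/bias luminance model in place of pixel-correspondence-based photometric transformations, and a moving-average filter for smoothing; none of these simplifications are taken from the abstract.

```python
# Sketch: time-compress by sampling frames, estimate how luminance changes between
# adjacent frames, compute a smoothing filter, and apply it to stabilize the video.
import numpy as np

def time_compress(frames, step=8):
    """Select a subset of frames according to a (uniform) frame sampling technique."""
    return frames[::step]

def photometric_transforms(frames):
    """Estimate gain/bias describing how luminance changes between adjacent frames."""
    transforms = []
    for prev, curr in zip(frames, frames[1:]):
        gain = (curr.std() + 1e-6) / (prev.std() + 1e-6)
        bias = curr.mean() - gain * prev.mean()
        transforms.append((gain, bias))
    return transforms

def smooth(values, radius=2):
    """Moving-average filter used to smooth photometric changes across frames."""
    padded = np.pad(values, radius, mode="edge")
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)
    return np.convolve(padded, kernel, mode="valid")

def stabilize(frames):
    transforms = photometric_transforms(frames)
    gains = smooth(np.array([1.0] + [g for g, _ in transforms]))
    biases = smooth(np.array([0.0] + [b for _, b in transforms]))
    return [np.clip(g * f + b, 0.0, 1.0) for f, g, b in zip(frames, gains, biases)]

# Toy luminance-only video with a slow brightness drift.
video = [np.random.rand(120, 160) * (0.5 + 0.5 * np.sin(i / 10)) for i in range(200)]
stabilized = stabilize(time_compress(video))
```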
Abstract:
In some embodiments, techniques for synthesizing an image style based on a plurality of neural networks are described. A computer system selects a style image based on user input that identifies the style image. The computer system generates a synthesized image based on a generator neural network and a loss neural network. The generator neural network outputs the synthesized image based on a noise vector and the style image and is trained based on style features generated from the loss neural network. The loss neural network outputs the style features based on a training image. The training image and the style image have a same resolution. The style features are generated at different resolutions of the training image. The computer system provides the synthesized image to a user device in response to the user input.
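Below is a minimal sketch of the generator/loss-network pairing, assuming PyTorch and toy network sizes; Gram matrices over loss-network activations stand in for the style features, computed at two resolutions of the image, the style image doubles as the training image, and the generator here is conditioned only on the noise vector rather than on both the noise vector and the style image.

```python
# Sketch: a small generator trained against style features produced by a fixed loss
# network at multiple resolutions; real networks and training are far larger than this.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    """Maps a noise vector to a synthesized image."""
    def __init__(self, noise_dim=64, size=64):
        super().__init__()
        self.size = size
        self.net = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 3 * size * size), nn.Sigmoid())

    def forward(self, z):
        return self.net(z).view(-1, 3, self.size, self.size)

loss_network = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
loss_network.requires_grad_(False)             # fixed; only the generator is trained

def gram(feats):
    b, c, h, w = feats.shape
    f = feats.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_features(image, resolutions=(64, 32)):
    """Style features of the image extracted at different resolutions."""
    return [gram(loss_network(F.interpolate(image, size=(r, r)))) for r in resolutions]

generator = Generator()
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-3)
style_image = torch.rand(1, 3, 64, 64)         # selected via user input in the abstract
target_style = [g.detach() for g in style_features(style_image)]

for _ in range(10):                            # abbreviated training loop
    synthesized = generator(torch.randn(1, 64))
    loss = sum(F.mse_loss(g, t) for g, t in zip(style_features(synthesized), target_style))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```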