Abstract:
A convolutional neural network (CNN) is trained for font recognition and font similarity learning. In a training phase, text images with font labels are synthesized by introducing variances to minimize the gap between the training images and real-world text images. Training images are generated and input into the CNN. The output is fed into an N-way softmax function, where N is the number of fonts the CNN is being trained on, producing a distribution of classified text images over N class labels. In a testing phase, each test image is normalized in height and squeezed in aspect ratio, resulting in a plurality of test patches. The CNN averages the probabilities of each test patch belonging to a set of fonts to obtain a classification. Feature representations may be extracted and utilized to define font similarity between fonts, which may be utilized in font suggestion, font browsing, or font recognition applications.
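A minimal sketch of the test-time averaging step follows, assuming a hypothetical `cnn_forward` function that maps a height-normalized, aspect-squeezed patch to per-font logits (the trained CNN itself is not shown):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a vector of per-font logits.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def classify_test_image(patches, cnn_forward, n_fonts):
    # Average the per-patch font probabilities and take the most likely font.
    probs = np.zeros(n_fonts)
    for patch in patches:
        probs += softmax(cnn_forward(patch))
    probs /= len(patches)
    return int(np.argmax(probs)), probs

# Toy usage with a random stand-in for the trained CNN.
rng = np.random.default_rng(0)
dummy_cnn = lambda patch: rng.normal(size=50)      # 50 candidate fonts
dummy_patches = [np.zeros((105, 105))] * 5         # height-normalized, squeezed crops
label, avg_probs = classify_test_image(dummy_patches, dummy_cnn, n_fonts=50)
```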
Abstract:
Systems and methods are provided for delivering a stereoscopic six-degree-of-freedom viewing experience from a monoscopic 360-degree video. A monoscopic 360-degree video of a subject scene can be preprocessed by analyzing each frame to recover a three-dimensional geometric representation of the subject scene, and further recover a camera motion path that includes various parameters associated with the camera, such as orientation, translational movement, and the like, as evidenced by the recording. Utilizing the recovered three-dimensional geometric representation of the subject scene and the recovered camera motion path, a dense three-dimensional geometric representation of the subject scene is generated utilizing random assignment and propagation operations. Once preprocessing is complete, the processed video can be provided for stereoscopic display via a device, such as a head-mounted display. As user motion data is detected and received, novel viewpoints can be stereoscopically synthesized for presentation to the user in real time, so as to provide an immersive virtual reality experience based on the original monoscopic 360-degree video and the user's detected movement(s).
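The abstract describes the densification only at a high level; the following is a minimal PatchMatch-style sketch of random depth assignment followed by neighbor propagation, with a toy cost function standing in for the unspecified photometric cost:

```python
import numpy as np

def densify_depth(shape, cost, depth_range, iters=3, seed=0):
    # Random assignment: give every pixel a random depth hypothesis.
    h, w = shape
    lo, hi = depth_range
    rng = np.random.default_rng(seed)
    depth = rng.uniform(lo, hi, size=(h, w))
    # Propagation: sweep the grid and adopt a neighbor's depth whenever it
    # explains the pixel better under the (hypothetical) cost.
    for _ in range(iters):
        for y in range(h):
            for x in range(w):
                for ny, nx in ((y, x - 1), (y - 1, x)):
                    if ny >= 0 and nx >= 0:
                        cand = depth[ny, nx]
                        if cost(y, x, cand) < cost(y, x, depth[y, x]):
                            depth[y, x] = cand
    return depth

# Toy usage: a synthetic cost whose minimum is a smooth depth ramp.
true_depth = lambda y, x: 1.0 + 0.01 * (y + x)
toy_cost = lambda y, x, d: abs(d - true_depth(y, x))
dense = densify_depth((16, 16), toy_cost, depth_range=(1.0, 2.0))
```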
Abstract:
Example systems and methods for classifying visual patterns into a plurality of classes are presented. Using reference visual patterns of known classification, at least one image or visual pattern classifier is generated, which is then employed to classify a plurality of candidate visual patterns of unknown classification. The classification scheme employed may be hierarchical or nonhierarchical. The types of visual patterns may be fonts, human faces, or any other type of visual patterns or images subject to classification.
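The abstract does not fix the classifier type; as one nonhierarchical stand-in, a nearest-class-mean classifier built from reference visual patterns of known classification could look like this:

```python
import numpy as np

def train_nearest_mean_classifier(reference_features, labels):
    # One centroid per known class, computed from the reference patterns.
    classes = sorted(set(labels))
    centroids = np.stack([
        np.mean([f for f, l in zip(reference_features, labels) if l == c], axis=0)
        for c in classes
    ])
    return classes, centroids

def classify_candidates(candidate_features, classes, centroids):
    # Assign each candidate pattern of unknown class to the nearest centroid.
    c = np.asarray(candidate_features)
    dists = np.linalg.norm(c[:, None, :] - centroids[None, :, :], axis=-1)
    return [classes[i] for i in dists.argmin(axis=1)]
```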
Abstract:
A framework is provided for associating images with topics utilizing embedding learning. The framework is trained utilizing images, each having multiple visual characteristics and multiple keyword tags associated therewith. Visual features are computed from the visual characteristics utilizing a convolutional neural network and an image feature vector is generated therefrom. The keyword tags are utilized to generate a weighted word vector (or “soft topic feature vector”) for each image by calculating a weighted average of word vector representations that represent the keyword tags associated with the image. The image feature vector and the soft topic feature vector are aligned in a common embedding space and a relevancy score is computed for each of the keyword tags. Once trained, the framework can automatically tag images and a text-based search engine can rank image relevance with respect to queried keywords based upon predicted relevancy scores.
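A minimal sketch of the soft topic feature vector, along with an illustrative relevancy score (cosine similarity is an assumption; the abstract does not fix the scoring function):

```python
import numpy as np

def soft_topic_vector(tag_word_vectors, tag_weights):
    # Weighted average of the word vectors for an image's keyword tags
    # (the "soft topic feature vector").
    w = np.asarray(tag_weights, dtype=float)
    w = w / w.sum()
    return w @ np.asarray(tag_word_vectors)

def relevancy_scores(image_embedding, tag_word_vectors):
    # Cosine similarity between the image's embedded feature vector and each
    # tag's word vector, used here as an illustrative relevancy score.
    t = np.asarray(tag_word_vectors)
    num = t @ image_embedding
    den = np.linalg.norm(t, axis=1) * np.linalg.norm(image_embedding) + 1e-12
    return num / den
```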
Abstract:
Embodiments of the present invention relate to learning image representation by distilling from multi-task networks. In implementation, more than one single-task network is trained with heterogeneous labels. In some embodiments, each of the single-task networks is transformed into a Siamese structure with three branches of sub-networks so that a common triplet ranking loss can be applied to each branch. A distilling network is trained that approximates the single-task networks on a common ranking task. In some embodiments, the distilling network is a Siamese network whose ranking function is optimized to approximate an ensemble ranking of each of the single-task networks. The distilling network can be utilized to predict tags to associate with a test image or identify similar images to the test image.
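A minimal sketch of the triplet ranking loss applied to each branch, together with one way an ensemble ranking target could be formed from the single-task networks (the voting scheme is illustrative, not taken from the abstract):

```python
import numpy as np

def triplet_ranking_loss(anchor, positive, negative, margin=0.1):
    # Hinge-style triplet ranking loss: the anchor should be closer to the
    # positive than to the negative by at least `margin`.
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, margin + d_pos - d_neg)

def ensemble_ranking_target(single_task_embeddings, anchor_idx, i, j):
    # Majority vote over the single-task networks on whether image i should
    # rank above image j for the anchor; the distilling network would then be
    # trained to reproduce these orderings (illustrative only).
    votes = 0
    for emb in single_task_embeddings:      # one embedding matrix per network
        d_i = np.linalg.norm(emb[anchor_idx] - emb[i])
        d_j = np.linalg.norm(emb[anchor_idx] - emb[j])
        votes += 1 if d_i < d_j else -1
    return 1 if votes > 0 else -1
```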
Abstract:
A first set of attributes (e.g., style) is generated through pre-trained single-column neural networks and leveraged to regularize the training process of a regularized double-column convolutional neural network (RDCNN). Parameters of the first column (e.g., style) of the RDCNN are fixed during RDCNN training. Parameters of the second column (e.g., aesthetics) are fine-tuned while training the RDCNN, and the learning process is supervised by the label identified by the second column (e.g., aesthetics). Thus, features of the images may be leveraged to boost classification accuracy of other features by learning an RDCNN.
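A toy sketch of the two-column arrangement in PyTorch, with the pre-trained style column frozen and only the aesthetics column and classifier left trainable (layer sizes and the `style_column` interface are assumptions):

```python
import torch
import torch.nn as nn

class RDCNN(nn.Module):
    # Toy regularized double-column network: a pre-trained style column whose
    # parameters are fixed, and an aesthetics column that is fine-tuned; the
    # two feature vectors are concatenated before the aesthetics classifier.
    def __init__(self, style_column, feat_dim=128, n_classes=2):
        super().__init__()
        self.style_column = style_column            # assumed to emit feat_dim features
        for p in self.style_column.parameters():    # fix the style column
            p.requires_grad = False
        self.aesthetics_column = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, feat_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(feat_dim * 2, n_classes)

    def forward(self, x):
        style_feat = self.style_column(x)       # frozen, pre-trained features
        aes_feat = self.aesthetics_column(x)    # features being fine-tuned
        return self.classifier(torch.cat([style_feat, aes_feat], dim=1))

# Only the aesthetics column and classifier receive gradient updates, e.g.:
# optimizer = torch.optim.SGD(
#     [p for p in model.parameters() if p.requires_grad], lr=1e-3)
```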
Abstract:
A framework is provided for associating dense images with topics. The framework is trained utilizing images, each having multiple regions, multiple visual characteristics and multiple keyword tags associated therewith. For each region of each image, visual features are computed from the visual characteristics utilizing a convolutional neural network, and an image feature vector is generated from the visual features. The keyword tags are utilized to generate a weighted word vector for each image by calculating a weighted average of word vector representations representing keyword tags associated with the image. The image feature vector and the weighted word vector are aligned in a common embedding space and a heat map is computed for the image. Once trained, the framework can be utilized to automatically tag images and rank the relevance of images with respect to queried keywords based upon associated heat maps.
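One illustrative way to form the per-image heat map, using cosine similarity between each region's embedded feature vector and a query word vector (the abstract does not fix the exact map computation):

```python
import numpy as np

def region_heat_map(region_embeddings, query_word_vector, grid_shape):
    # Cosine similarity of each region embedding to the query word vector,
    # reshaped onto the image's region grid (grid_shape must match the
    # number of regions).
    r = np.asarray(region_embeddings)           # (n_regions, dim)
    q = np.asarray(query_word_vector)
    sims = (r @ q) / (np.linalg.norm(r, axis=1) * np.linalg.norm(q) + 1e-12)
    return sims.reshape(grid_shape)
```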
Abstract:
This disclosure involves personalizing user experiences with electronic content based on application usage data. For example, a user representation model that facilitates content recommendations is iteratively trained with action histories from a content manipulation application. Each iteration involves selecting, from an action history for a particular user, an action sequence including a target action. An initial output is computed in each iteration by applying a probability function to the selected action sequence and a user representation vector for the particular user. The user representation vector is adjusted to maximize the output generated by applying the probability function to the action sequence and the user representation vector. This iterative training process generates a user representation model, which includes a set of adjusted user representation vectors, that facilitates content recommendations corresponding to users' usage patterns in the content manipulation application.
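A minimal sketch of one training step, assuming a softmax probability function over action scores formed from the user vector plus context-action embeddings (the exact probability function is not specified in the abstract):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def update_user_vector(user_vec, context_action_ids, target_action_id,
                       action_embeddings, output_weights, lr=0.05):
    # Score all actions from the user vector plus the mean context-action
    # embedding, then nudge the user vector to increase the probability of
    # the observed target action (gradient ascent on log-probability).
    context = action_embeddings[context_action_ids].mean(axis=0)
    hidden = user_vec + context
    probs = softmax(output_weights @ hidden)
    grad = output_weights[target_action_id] - probs @ output_weights
    return user_vec + lr * grad
```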
Abstract:
Embodiments of the present invention relate to finding semantic parts in images. In implementation, a convolutional neural network (CNN) is applied to a set of images to extract features for each image. Each feature is defined by a feature vector that enables a subset of the set of images to be clustered in accordance with a similarity between feature vectors. Normalized cuts may be utilized to help preserve pose within each cluster. The images in the cluster are aligned, and part proposals are generated by sampling regions of various sizes across the aligned images. To determine which part proposal corresponds to a semantic part, a classifier is trained for each part proposal and semantic part, and the part proposal that best fits the correlation pattern given by the true semantic part is selected. In this way, semantic parts in images can be identified without any previous part annotations.
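A minimal sketch of generating part proposals by sliding square windows of several sizes across an aligned image (window sizes and stride are illustrative choices, not taken from the abstract):

```python
def sample_part_proposals(image_shape, sizes=(32, 64, 96), stride=16):
    # Enumerate candidate part proposals as (top, left, size) windows of
    # several sizes placed on a regular grid over the aligned image.
    h, w = image_shape
    proposals = []
    for s in sizes:
        for y in range(0, h - s + 1, stride):
            for x in range(0, w - s + 1, stride):
                proposals.append((y, x, s))
    return proposals

# Toy usage on a 128x128 aligned image.
proposals = sample_part_proposals((128, 128))
```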
Abstract:
Font graphs are defined as having a finite set of nodes representing fonts and a finite set of undirected edges denoting similarities between fonts. The font graphs enable users to browse and identify similar fonts. Indications corresponding to a degree of similarity between connected nodes may be provided. A selection of a desired font or characteristics associated with one or more attributes of the desired font is received from a user interacting with the font graph. The font graph is dynamically redefined based on the selection.
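A minimal sketch of building such a font graph by thresholding pairwise cosine similarity between font feature vectors (the threshold and the source of the feature vectors are assumptions):

```python
import numpy as np

def build_font_graph(font_names, font_features, threshold=0.8):
    # Nodes are fonts; an undirected edge joins two fonts whose feature-vector
    # cosine similarity exceeds the threshold, with the similarity kept as the
    # edge weight (the "degree of similarity" indication).
    f = np.asarray(font_features, dtype=float)
    f = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-12)
    sims = f @ f.T
    edges = {}
    for i in range(len(font_names)):
        for j in range(i + 1, len(font_names)):
            if sims[i, j] > threshold:
                edges[(font_names[i], font_names[j])] = float(sims[i, j])
    return {"nodes": list(font_names), "edges": edges}
```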