摘要:
A method and system for generating and using a combined model to identify whether a bid term is relevant to an advertisement is provided. A relevance system trains a combined model that includes an initial model and a decision tree model that are trained using features that represent relationships between bid terms and advertisements. The relevance system trains the initial model to map initial model features to a modeled relevance. The relevance system trains the decision tree model to map the decision tree features and the modeled relevance to a final relevance. The trained initial model and decision tree model represent the combined model. The relevance system then uses the combined model to determine the relevance of bid terms to advertisements.
摘要:
A scalable two-pass scalable probabilistic latent semantic analysis (PLSA) methodology is disclosed that may perform more efficiently, and in some cases more accurately, than traditional PLSA, especially where large and/or sparse data sets are provided for analysis. The improved methodology can greatly reduce the storage and/or computational costs of training a PLSA model. In the first pass of the two-pass methodology, objects are clustered into groups, and PLSA is performed on the groups instead of the original individual objects. In the second pass, the conditional probability of a latent class, given an object, is obtained. This may be done by extending the training results of the first pass. During the second pass, the most likely latent classes for each object are identified.
摘要:
An influential persons identification system and method for identifying a set of influential persons (or influencers) in a social network (such as an online social network). The influential persons set is generated such that by sending a message to the set the message will be propagated through the network at the greatest speed and coverage. A ranking of users is generated, and a pruning process is performed starting with the top-ranked user and working down the list. For each user on the list, the user is identified as an influencer and then the user and each of his friends are deleted from the social network users list. Next, the same process is performed for the second-ranked user, the third-ranked user, and so forth. The process terminates when the list of users of the social network is exhausted or the desired number of influencers on the influential person set is reached.
摘要:
Computer-readable media having computer-executable instructions and apparatuses provide a keyphrase navigation map (KNM) for a document page. Keyphrases are extracted from the document page. Keyphrase clusters are subsequently formed by a measure of relevancy, and a salient keyphrase is determined for each cluster. A thumbnail is formed with tags corresponding to the salient keyphrases. A selected tag is expanded with associated keyphrases. An associated keyphrase may be further selected in order to facilitate the navigation of the document page. The displayed tags on the thumbnail are positioned in accordance with locations of associated keyphrases in the document page.
摘要:
Embodiments of the invention relate to improvements to the support vector machine (SVM) classification model. When text data is significantly unbalanced (i.e., positive and negative labeled data are in disproportion), the classification quality of standard SVM deteriorates. Embodiments of the invention are directed to a weighted proximal SVM (WPSVM) model that achieves substantially the same accuracy as the traditional SVM model while requiring significantly less computational time. A weighted proximal SVM (WPSVM) model in accordance with embodiments of the invention may include a weight for each training error and a method for estimating the weights, which automatically solves the unbalanced data problem. And, instead of solving the optimization problem via the KKT (Karush-Kuhn-Tucker) conditions and the Sherman-Morrison-Woodbury formula, embodiments of the invention use an iterative algorithm to solve an unconstrained optimization problem, which makes WPSVM suitable for classifying relatively high dimensional data.
摘要:
Embodiments of the invention relate to improvements to the support vector machine (SVM) classification model. When text data is significantly unbalanced (i.e., positive and negative labeled data are in disproportion), the classification quality of standard SVM deteriorates. Embodiments of the invention are directed to a weighted proximal SVM (WPSVM) model that achieves substantially the same accuracy as the traditional SVM model while requiring significantly less computational time. A weighted proximal SVM (WPSVM) model in accordance with embodiments of the invention may include a weight for each training error and a method for estimating the weights, which automatically solves the unbalanced data problem. And, instead of solving the optimization problem via the KKT (Karush-Kuhn-Tucker) conditions and the Sherman-Morrison-Woodbury formula, embodiments of the invention use an iterative algorithm to solve an unconstrained optimization problem, which makes WPSVM suitable for classifying relatively high dimensional data.
摘要:
In an embodiment, a method of predicting an active user's rating for an item is disclosed. A database of users may be sorted into clusters. The data associated with the users in each cluster may be smoothed to filling in ratings for items that the users have not personally rated. An active user may then be compared to a set of users, where the set may be all or some portion of the database, to determine the K users that are most similar to the active user. The ratings of the K users regarding the item may be used to predict the active user's rating for the item. In an embodiment, the rating of each of the K users is assigned a confidence value associated with whether the user personally rated the item or if the rating was generated by the data smoothing process.
摘要:
Computer-readable media having computer-executable instructions and apparatuses categorize documents or corpus of documents. A Tensor Space Model (TSM), which models the text by a higher-order tensor, represents a document or a corpus of documents. Supported by techniques of multilinear algebra, TSM provides a framework for analyzing the multifactor structures. TSM is further supported by operations and presented tools, such as the High-Order Singular Value Decomposition (HOSVD) for a reduction of the dimensions of the higher-order tensor. The dimensionally reduced tensor is compared with tensors that represent possible categories. Consequently, a category is selected for the document or corpus of documents. Experimental results on the dataset for 20 Newsgroups suggest that TSM is advantageous to a Vector Space Model (VSM) for text classification.
摘要:
The invention provides a method of interactively crawling data records on a web page. Users may select various data records of interest on a web page to generate templates to search for similar data items on the same web page or on different web pages. A tree matching algorithm may be used to compare and extract data matching the generated template.
摘要:
An influential persons identification system and method for identifying a set of influential persons (or influencers) in a social network (such as an online social network). The influential persons set is generated such that by sending a message to the set the message will be propagated through the network at the greatest speed and coverage. A ranking of users is generated, and a pruning process is performed starting with the top-ranked user and working down the list. For each user on the list, the user is identified as an influencer and then the user and each of his friends are deleted from the social network users list. Next, the same process is performed for the second-ranked user, the third-ranked user, and so forth. The process terminates when the list of users of the social network is exhausted or the desired number of influencers on the influential person set is reached.