Abstract:
A system is disclosed for obtaining and aggregating opinions generated by multiple sources with respect to one or more objects. The disclosed system uses observed variables associated with an opinion and a probabilistic model to estimate latent properties of that opinion. With those latent properties, the disclosed system may enable publishers to reliably and comprehensively present object information to interested users.
Abstract:
Disclosed are methods and apparatus for extracting (or annotating) structured information from web content. Web content of interest from a particular domain is represented as one or more tree instances having a plurality of branching nodes that each correspond to a web object such that the tree instances correspond to one or more structured data instances. The particular domain is associated with domain knowledge that includes one or more presentation rulesets that each specifies a particular structure for a set of data instances, a domain-specific concept labeler, one or more specified properties of the web objects in the tree instances, and a concept schema that specifies a representation of the data to be extracted from the web content. A structured data instance that conforms to the concept schema is extracted from the one or more tree instances based on the domain knowledge for the particular domain. Extraction of the structured data instances is accomplished by (i) using the domain-specific concept labeler to annotate a subset of nodes of the tree instances; and (ii) using a locally adaptive concept annotator to extract the structured data instances based on the annotated segments and the local properties associated with such annotated segments. The extracted structured data instance is stored as structured output records in a database.
Abstract:
A system is disclosed for reconciling opinions generated by agents with respect to one or more predicates. The disclosed system may use observed variables and a probabilistic model including latent parameters to estimate a truth score associated with each of the predicates. The truth score, as well as one or more of the latent parameters of the probabilistic model, may be estimated based on the observed variables. The truth score generated by the disclosed system may enable publishers to reliably represent the truth of a predicate to interested users.
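To make the idea of jointly estimating truth scores and latent parameters concrete, the following is a minimal sketch and not the disclosed system's actual model. It assumes binary opinions, a single latent parameter per agent (a hypothetical `reliability`), and a simple alternating update; the function name `estimate_truth` and the default prior of 0.8 are illustrative assumptions.

```python
import math


def estimate_truth(opinions, iterations=20, prior_reliability=0.8):
    """Estimate a truth score per predicate from binary agent opinions.

    opinions: list of (agent, predicate, vote) tuples, vote is True/False.
    Returns (truth, reliability) dictionaries. Both are illustrative
    quantities, not the model described in the abstract.
    """
    agents = {a for a, _, _ in opinions}
    predicates = {p for _, p, _ in opinions}
    reliability = {a: prior_reliability for a in agents}  # latent parameter
    truth = {p: 0.5 for p in predicates}                  # truth score

    for _ in range(iterations):
        # Update truth scores: each vote counts with the log-odds of the
        # voting agent's current reliability.
        for p in predicates:
            log_odds = 0.0
            for a, q, vote in opinions:
                if q == p:
                    weight = math.log(reliability[a] / (1.0 - reliability[a]))
                    log_odds += weight if vote else -weight
            truth[p] = 1.0 / (1.0 + math.exp(-log_odds))

        # Update reliabilities: smoothed expected agreement with the current
        # truth scores (the +1/+2 smoothing keeps values strictly in (0, 1)).
        for a in agents:
            agree, total = 1.0, 2.0
            for b, p, vote in opinions:
                if b == a:
                    agree += truth[p] if vote else 1.0 - truth[p]
                    total += 1.0
            reliability[a] = agree / total
    return truth, reliability


opinions = [
    ("a", "p1", True), ("a", "p2", False), ("a", "p3", True),
    ("b", "p1", True), ("b", "p2", False), ("b", "p3", True),
    ("c", "p1", False), ("c", "p2", True), ("c", "p3", True),
]
truth, reliability = estimate_truth(opinions)
```

In this toy input, agents `a` and `b` agree on every predicate while `c` contradicts them on two, so the sketch ends up trusting `c` less and scoring `p1` as more likely true than false.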
Abstract:
Methods and apparatus for performing computer-implemented extraction of temporal information for business entities and events are disclosed. In one embodiment, a sequence of text is obtained. A label is assigned to one or more of a plurality of segments of the text such that each of the one or more of the plurality of segments of the text is classified as temporal data in one of a plurality of classes of temporal data. One or more rules are applied to the one or more segments of the text that have been classified as temporal data to generate a structured representation of the temporal data, where the rules include one or more schematic rules. Each of the schematic rules pertains to one or more of the plurality of classes of temporal data and indicates a structure in which temporal data in the corresponding one or more of the plurality of classes is to be stored.
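The two stages described above (labeling segments with a temporal class, then applying a schematic rule per class) can be sketched as follows. This is only an illustration under stated assumptions: the regular-expression labelers, the class names `day_range` and `time_range`, and the output field names are all hypothetical and are not taken from the disclosure.

```python
import re

DAY = r"(Mon|Tue|Wed|Thu|Fri|Sat|Sun)"

# Hypothetical labelers: one pattern per class of temporal data.
LABELERS = {
    "day_range": re.compile(rf"\b{DAY}-{DAY}\b"),
    "time_range": re.compile(r"\b(\d{1,2})(am|pm)-(\d{1,2})(am|pm)\b"),
}


def to_24h(hour, meridiem):
    """Convert a 12-hour clock value to a 24-hour integer hour."""
    hour = int(hour) % 12
    return hour + 12 if meridiem == "pm" else hour


# Hypothetical schematic rules: each maps a labeled segment of one class
# to the structure in which that class is stored.
SCHEMATIC_RULES = {
    "day_range": lambda m: {"start_day": m.group(1), "end_day": m.group(2)},
    "time_range": lambda m: {
        "opens": to_24h(m.group(1), m.group(2)),
        "closes": to_24h(m.group(3), m.group(4)),
    },
}


def label_segments(text):
    """Return (class, match) pairs for segments classified as temporal data."""
    found = []
    for cls, pattern in LABELERS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), cls, match))
    found.sort(key=lambda item: item[0])  # keep document order
    return [(cls, match) for _, cls, match in found]


def extract(text):
    """Apply the schematic rule for each labeled segment."""
    return [
        {"class": cls, **SCHEMATIC_RULES[cls](match)}
        for cls, match in label_segments(text)
    ]


records = extract("Open Mon-Fri 9am-5pm")
```

A production labeler would more plausibly be a trained sequence model than a set of regular expressions; the patterns here only stand in for the classification step.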
Abstract:
Disclosed are methods and apparatus for segmenting and labeling a collection of token sequences. A plurality of segments of one or more tokens in a token sequence collection are partially labeled with labels from a set of target labels using high precision domain-specific labelers so as to generate a partially labeled sequence collection having a plurality of labeled segments and a plurality of unlabeled segments. Any label conflicts in the partially labeled sequence collection are resolved. One or more of the labeled segments of the partially labeled sequence collection are expanded so as to cover one or more additional tokens of the partially labeled sequence collection. A statistical model, for labeling segments using local token and segment features of the sequence collection, is trained based on the partially labeled sequence collection. This trained model is then used to label the unlabeled segments and the labeled segments of the sequence collection so as to generate a labeled sequence collection. The labeled sequence collection is then stored as structured output records in a database.
Abstract:
A classifier development process seamlessly and intelligently integrates different forms of human feedback on instances and features into the data preparation, learning and evaluation stages. A query utility based active learning approach is applicable to different types of editorial feedback. A bi-clustering based technique may be used to further speed up the active learning process.
Abstract:
A method of predicting a response relationship between elements of two sets includes: specifying a dyadic response matrix; specifying covariates that measure additional dyadic relationships; specifying a number of row clusters and a number of column clusters for clustering the rows and columns of the response matrix; specifying a rank for cluster factors that model average interactions between row clusters and column clusters by products of cluster factors; and determining prediction parameters for predicting responses between elements of the first set and the second set by improving a likelihood value that relates the prediction parameters to the response matrix, the covariates, the observation weights, the row clusters and the column clusters. Determining the prediction parameters includes: updating the prediction parameters for fixed assignments of row clusters and column clusters, and updating assignments for row clusters and column clusters for fixed prediction parameters.
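The alternating structure of the final step (update parameters with cluster assignments fixed, then update assignments with parameters fixed) can be illustrated with a deliberately simplified sketch. It assumes a fully observed response matrix, omits covariates, observation weights, and the low-rank cluster factors, and uses squared error as a stand-in for the likelihood, so the "prediction parameters" reduce to one mean per block. The names `block_means` and `co_cluster` are hypothetical.

```python
def block_means(Y, rows, cols, k, l):
    """Mean response for each (row cluster, column cluster) block."""
    sums = [[0.0] * l for _ in range(k)]
    counts = [[0] * l for _ in range(k)]
    for i, row in enumerate(Y):
        for j, value in enumerate(row):
            sums[rows[i]][cols[j]] += value
            counts[rows[i]][cols[j]] += 1
    return [
        [sums[a][b] / counts[a][b] if counts[a][b] else 0.0 for b in range(l)]
        for a in range(k)
    ]


def co_cluster(Y, row_init, col_init, k, l, iterations=10):
    """Alternate between parameter updates and assignment updates."""
    n, m = len(Y), len(Y[0])
    rows, cols = list(row_init), list(col_init)
    for _ in range(iterations):
        # Fixed assignments: update the prediction parameters.
        mu = block_means(Y, rows, cols, k, l)
        # Fixed parameters: reassign each row to its best-fitting cluster.
        for i in range(n):
            rows[i] = min(
                range(k),
                key=lambda a: sum((Y[i][j] - mu[a][cols[j]]) ** 2 for j in range(m)),
            )
        # Fixed parameters: reassign each column likewise.
        for j in range(m):
            cols[j] = min(
                range(l),
                key=lambda b: sum((Y[i][j] - mu[rows[i]][b]) ** 2 for i in range(n)),
            )
    # Recompute the parameters so they match the final assignments.
    return rows, cols, block_means(Y, rows, cols, k, l)


Y = [
    [5, 5, 1, 1],
    [5, 5, 1, 1],
    [1, 1, 5, 5],
    [1, 1, 5, 5],
]
rows, cols, mu = co_cluster(Y, [0, 0, 0, 1], [0, 0, 0, 1], k=2, l=2)
```

The predicted response for row `i` and column `j` is then `mu[rows[i]][cols[j]]`. Like other alternating schemes, this sketch only finds a local optimum and is sensitive to the initial assignments.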
Abstract:
A method for managing user-generated questions and answers across multiple social media data sources can begin with receiving query parameters, including a user-entered question, via the user interface of a social media Q&A manager. Social media data sources can be queried for knowledge related to the user-entered question. When knowledge related to the user-entered question exists, the existing related knowledge can be organized and presented in the user interface according to a determined answer quality. When knowledge related to the user-entered question does not exist or is deemed unsatisfactory by a user, the user-entered question can be automatically submitted to applicable social media data sources by the social media Q&A manager on behalf of the user. A status of the submitted user-entered question can be monitored. When the status of the submitted user-entered question changes, the method can be re-executed at the querying step.