Abstract:
Techniques for high-precision web extraction using site knowledge are provided. Portions of repeating text are identified in unlabeled web pages from a particular web site. Based on the portions of repeating text, the unlabeled web pages are partitioned into a set of segments. Multiple labels are assigned to corresponding attributes in the set of segments by applying a classification model to each separate segment. One or more first labels that were erroneously assigned to one or more attributes in the set of segments are identified, and one or more second, correct labels for those attributes are determined. The first labels are then corrected by assigning the second labels to the one or more attributes.
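The pipeline described in this abstract (find repeating text, partition into segments, classify each segment, correct erroneous labels) can be sketched roughly as follows. This is a minimal illustration, not the patented method: pages are assumed to be lists of text lines, `classify` is a toy stand-in for the classification model, and the majority vote in `correct_labels` is an assumed correction strategy that the abstract does not specify. All function names are hypothetical.

```python
from collections import Counter


def find_repeating_text(pages, min_fraction=0.8):
    """Return the text lines that occur in at least min_fraction of the pages."""
    counts = Counter()
    for page in pages:
        counts.update(set(page))  # count each line once per page
    threshold = min_fraction * len(pages)
    return {line for line, n in counts.items() if n >= threshold}


def partition(page, repeating):
    """Split a page into (anchor, values) segments; each repeating line opens a segment."""
    segments = []
    for line in page:
        if line in repeating:
            segments.append((line, []))
        elif segments:
            segments[-1][1].append(line)
    return segments


def classify(values):
    """Toy stand-in for the classification model (an assumption, deliberately error-prone)."""
    text = " ".join(values)
    if "$" in text:
        return "price"
    if any(ch.isdigit() for ch in text):
        return "phone"
    return "name"


def label_pages(pages, repeating):
    """Apply the classifier to every segment of every page."""
    return [
        [(anchor, classify(values)) for anchor, values in partition(page, repeating)]
        for page in pages
    ]


def correct_labels(labeled_pages):
    """Assumed correction step: replace each label with the site-wide majority for its anchor."""
    votes = {}
    for page in labeled_pages:
        for anchor, label in page:
            votes.setdefault(anchor, Counter())[label] += 1
    majority = {anchor: c.most_common(1)[0][0] for anchor, c in votes.items()}
    return [[(anchor, majority[anchor]) for anchor, _ in page] for page in labeled_pages]


pages = [
    ["Name:", "Blue Cafe", "Phone:", "555-0101", "Price:", "$12"],
    ["Name:", "Cafe 21", "Phone:", "555-0102", "Price:", "$15"],
    ["Name:", "Green Diner", "Phone:", "555-0103", "Price:", "$9"],
]
repeating = find_repeating_text(pages)
labeled = label_pages(pages, repeating)   # "Cafe 21" is mislabeled as a phone number
corrected = correct_labels(labeled)       # site-level agreement restores "name"
```

In this toy data the digit in "Cafe 21" causes a per-segment error, and the site-level vote repairs it. That illustrates the general idea of using site knowledge to correct segment-level labels, under the assumptions stated above.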
Abstract:
A method and apparatus are provided for automatically extracting information from a large number of documents by applying machine learning techniques and exploiting structural similarities among the documents. A machine learning model is trained to at least 50% accuracy. The trained model is used to identify information attributes in a sample of pages from a cluster of structurally similar documents. A structure-specific model of the cluster is created by compiling, for each attribute, a list of the top-K locations at which the trained model identified that attribute in the sample. These top-K lists are used to extract information from the pages of the cluster from which the sample was taken.
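The top-K idea in this abstract can be illustrated with a short sketch. It assumes each page is a dictionary from a location string (written here as an XPath-like path) to the text found there, and it uses a toy `toy_model` function in place of the trained machine learning model. The page representation, function names, and tie-breaking are illustrative assumptions, not details taken from the abstract.

```python
from collections import Counter, defaultdict


def build_top_k(sample_pages, predict_attribute, k=2):
    """Compile, per attribute, the k locations where the model most often found it."""
    location_counts = defaultdict(Counter)
    for page in sample_pages:
        for location, text in page.items():
            attribute = predict_attribute(text)
            if attribute is not None:
                location_counts[attribute][location] += 1
    return {
        attribute: [location for location, _ in counts.most_common(k)]
        for attribute, counts in location_counts.items()
    }


def extract(page, top_k):
    """Read each attribute from the first of its top-K locations present on the page."""
    record = {}
    for attribute, locations in top_k.items():
        for location in locations:
            if location in page:
                record[attribute] = page[location]
                break
    return record


def toy_model(text):
    """Toy stand-in for the trained model (an assumption); it misses lowercase titles."""
    if text.startswith("$"):
        return "price"
    if text.istitle():
        return "title"
    return None


sample = [
    {"/html/body/h1": "Red Lamp", "/html/body/div[2]/span": "$20",
     "/html/body/p": "ships in two days"},
    {"/html/body/h1": "Oak Table", "/html/body/div[2]/span": "$150",
     "/html/body/p": "ships in a week"},
    {"/html/body/h1": "desk chair", "/html/body/div[3]/span": "$45"},
]
top_k = build_top_k(sample, toy_model, k=2)
record = extract(sample[2], top_k)
```

Here the toy model fails to recognize "desk chair" as a title, yet the extraction still recovers it because the location `/html/body/h1` was learned from the other sample pages. This suggests why a model of only moderate accuracy may suffice when the pages in a cluster share structure.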
Abstract:
Methods and apparatus are described for use with information extraction techniques based on sequential models. Additional statistics are maintained during inference and employed to boost the accuracy of the extraction algorithm and mitigate the effects of training bias.