摘要:
Disclosed is a computer-implemented method of determining smarty between first and second elements of an electronic document. The method uses a computer to calculate a plurality of measures of similarity between the first and second elements in at least two representations of the electronic document. A computer program product and system implementing this method are also disclosed.
摘要:
A computer-implemented method for obtaining the rendering co-ordinates of visible text elements on a web page is disclosed. The web page is represented by an input data structure comprising a plurality of text nodes, each of which represents a text element on the web page. The method comprises the following steps: a) using a computer device, wrapping each of the plurality of text nodes in a pair of mark-up language tags; b) using said computer device, obtaining the co-ordinates of a bounding rectangle for each text node using the mark-up language tags; c) using said computer device, attaching an attribute specifying the co-ordinates of the bounding rectangle to each text node; and d) using said computer device, determining whether each text node is invisible, and if it is, excluding it from an output data structure comprising the plurality of text nodes and attached attributes.
摘要:
A system and method are provided for extracting main content from a web page. Web page segmentation is performed on a web page to provide affinity-grouped segments. Descriptive features of at least one of the affinity-grouped segments are computed. At least one of the affinity-grouped segments is classified as a main body segment based on the computed descriptive features. Additional affinity-grouped segments are classified as to a document function based on the computed descriptive features. Classified affinity-grouped segments are assembled according to their classified document functions to provide the main content.
摘要:
Segmenting a web page (110) into coherent function blocks (705-1 to 705-8) includes parsing content from the web page (110) into multiple coherent, collectively exhaustive nodes (405-1 to 405-37); calculating at least one matrix (500, 600, 605-1 to 605-4) of affinity values between each of the nodes (405-1 to 405-37); and clustering the nodes (405-1 to 405-37) into functional blocks (705-1 to 705-8) based on the affinity values in the at least one matrix (500, 600, 605-1 to 605-4).
摘要:
A system and method for an adaptive threshold Web Page segmenting is disclosed. In one embodiment, a method performed by a physical computing system having one or more processors for segmenting a Web page including a plurality of nodes includes parsing content in the Web page into the plurality of nodes using the physical computing system, obtaining feature values between each pair of nodes using the physical computing system, estimating an adaptive threshold value using the obtained feature values using the physical computing system, and segmenting the Web page by comparing the feature values associated with each pair of nodes with the estimated adaptive threshold value.
摘要:
Semantically ranking content in a website (110) with a computerized ranking device (105) includes: parsing content from the website (110) into multiple autonomous content blocks (415-1 to 415-17) with the computerized ranking device (105) and assigning an importance ranking with said computerized ranking device (105) to each of the content blocks (415-1 to 415-17) based on a degree to which a substance of the content block (415-1 to 415-17) is relevant to one of a plurality of predefined categories.
摘要:
Semantically ranking content in a website (110) with a computerized ranking device (105) includes: parsing content from the website (110) into multiple autonomous content blocks (415-1 to 415-17) with the computerized ranking device (105) and assigning an importance ranking with said computerized ranking device (105) to each of the content blocks (415-1 to 415-17) based on a degree to which a substance of the content block (415-1 to 415-17) is relevant to one of a plurality of predefined categories.
摘要:
A method for producing web page content includes identifying blocks within a web page. The blocks are selectively assembled into sections. The sections are selectively assembled into article candidates. An article candidate that includes article content is distinguished from article candidates that do not include article content. Content is produced only from the article candidate distinguished as including article content.
摘要:
A method for producing web page content includes identifying blocks within a web page. The blocks are selectively assembled into sections. The sections are selectively assembled into article candidates. An article candidate that includes article content is distinguished from article candidates that do not include article content. Content is produced only from the article candidate distinguished as including article content.
摘要:
A system and method for selectively filtering web page contents are disclosed. In one example embodiment a document object model (DOM) structure and visual information of the web page contents are generated. The document object model (DOM) structure and the visual information are analyzed to determine multiple web page content attributes. One or more filtering parameters are selected from the multiple web page content attributes. The web page is filtered based on the one or more filtering parameters.