Obtaining Rendering Co-ordinates Of Visible Text Elements
    2.
    Type: Invention application
    Status: Under examination (published)

    Publication No.: US20130159889A1

    Publication date: 2013-06-20

    Application No.: US13808856

    Filing date: 2010-07-07

    IPC classification: G06F3/0481

    Abstract: A computer-implemented method for obtaining the rendering co-ordinates of visible text elements on a web page is disclosed. The web page is represented by an input data structure comprising a plurality of text nodes, each of which represents a text element on the web page. The method comprises the following steps: a) using a computer device, wrapping each of the plurality of text nodes in a pair of mark-up language tags; b) using said computer device, obtaining the co-ordinates of a bounding rectangle for each text node using the mark-up language tags; c) using said computer device, attaching an attribute specifying the co-ordinates of the bounding rectangle to each text node; and d) using said computer device, determining whether each text node is invisible, and if it is, excluding it from an output data structure comprising the plurality of text nodes and attached attributes.
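
Steps a) to d) of this abstract can be illustrated outside a browser with a small Python sketch. It is not the patented implementation: the `TextNode` class, the `measure` callback (standing in for whatever the rendering engine reports for the wrapper tags), the `bbox` attribute name, and the visibility rule (a hidden flag or a zero-area rectangle) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# (left, top, width, height); a hypothetical layout for a bounding rectangle
Rect = Tuple[float, float, float, float]


@dataclass
class TextNode:
    text: str
    hidden: bool = False            # hypothetical flag, e.g. derived from a style rule
    wrapper: Optional[str] = None   # id of the wrapping tag pair added in step a)
    attrs: Dict[str, Rect] = field(default_factory=dict)


def visible_text_nodes(
    nodes: List[TextNode], measure: Callable[[TextNode], Rect]
) -> List[TextNode]:
    """Return the nodes that pass the (assumed) visibility test, with co-ordinates attached."""
    output = []
    for index, node in enumerate(nodes):
        node.wrapper = f"wrap-{index}"      # a) wrap the text node in a tag pair
        rect = measure(node)                # b) bounding rectangle via the wrapper
        node.attrs["bbox"] = rect           # c) attach the co-ordinates as an attribute
        _, _, width, height = rect
        if node.hidden or width <= 0 or height <= 0:
            continue                        # d) exclude invisible nodes from the output
        output.append(node)
    return output
```

In a real browser the `measure` step would be supplied by the rendering engine; here any function mapping a wrapped node to a rectangle will do, which keeps the sketch testable.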

    Extraction of Content from a Web Page
    3.
    Type: Invention application
    Status: Under examination (published)

    Publication No.: US20130283148A1

    Publication date: 2013-10-24

    Application No.: US13817656

    Filing date: 2010-10-26

    IPC classification: G06F17/22

    CPC classification: G06F17/2247 G06F16/986

    Abstract: A system and method are provided for extracting main content from a web page. Web page segmentation is performed on a web page to provide affinity-grouped segments. Descriptive features of at least one of the affinity-grouped segments are computed. At least one of the affinity-grouped segments is classified as a main body segment based on the computed descriptive features. Additional affinity-grouped segments are classified as to a document function based on the computed descriptive features. Classified affinity-grouped segments are assembled according to their classified document functions to provide the main content.
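
The classify-then-assemble pipeline in this abstract can be sketched in Python as follows. Everything specific here is an assumption for illustration: the `Segment` fields, the three descriptive features, the 0.3 link-density and 100-character cut-offs, and the two document functions (`title`, `body`). The abstract does not say which features or classifier the application uses.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    text: str
    link_ratio: float   # fraction of characters inside links (hypothetical feature)
    top: float          # vertical position on the page (hypothetical feature)


def describe(segment: Segment) -> Dict[str, float]:
    """Compute descriptive features for one affinity-grouped segment."""
    return {"length": len(segment.text), "link_ratio": segment.link_ratio, "top": segment.top}


def extract_main_content(segments: List[Segment]) -> str:
    feats = [describe(s) for s in segments]
    # Link-heavy segments (navigation, footers) are never candidates.
    candidates = [i for i, f in enumerate(feats) if f["link_ratio"] < 0.3]
    if not candidates:
        return ""
    # Main body segment: the longest low-link segment (assumed rule).
    body = max(candidates, key=lambda i: feats[i]["length"])
    labels = {body: "body"}
    # Classify the remaining candidates by document function.
    for i in candidates:
        if i == body:
            continue
        if feats[i]["length"] >= 100:
            labels[i] = "body"      # further running text
        elif feats[i]["top"] < feats[body]["top"]:
            labels[i] = "title"     # short text above the body
    # Assemble: titles first, then body text, each in page order.
    rank = {"title": 0, "body": 1}
    ordered = sorted(labels, key=lambda i: (rank[labels[i]], feats[i]["top"]))
    return "\n\n".join(segments[i].text for i in ordered)
```

Segments that receive no label, such as a short "Share this" line below the article, are simply left out of the assembled content.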

    Segmenting a Web Page into Coherent Functional Blocks
    4.
    Type: Invention application
    Status: Under examination (published)

    Publication No.: US20130275854A1

    Publication date: 2013-10-17

    Application No.: US13635410

    Filing date: 2010-04-19

    IPC classification: G06F17/22

    CPC classification: G06F17/2247 G06F17/2705

    Abstract: Segmenting a web page (110) into coherent functional blocks (705-1 to 705-8) includes parsing content from the web page (110) into multiple coherent, collectively exhaustive nodes (405-1 to 405-37); calculating at least one matrix (500, 600, 605-1 to 605-4) of affinity values between each of the nodes (405-1 to 405-37); and clustering the nodes (405-1 to 405-37) into functional blocks (705-1 to 705-8) based on the affinity values in the at least one matrix (500, 600, 605-1 to 605-4).
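
A minimal Python sketch of the matrix-then-cluster idea is given below. The affinity formula (inverse of one plus Euclidean distance between node feature vectors) and the clustering rule (connected components of pairs whose affinity reaches a fixed threshold) are assumptions chosen for brevity; the abstract names neither.

```python
from itertools import combinations
from typing import Dict, List, Sequence


def affinity_matrix(nodes: Sequence[Sequence[float]]) -> List[List[float]]:
    """Symmetric matrix of affinities; 1.0 on the diagonal, smaller for distant nodes."""
    n = len(nodes)
    matrix = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist = sum((a - b) ** 2 for a, b in zip(nodes[i], nodes[j])) ** 0.5
        matrix[i][j] = matrix[j][i] = 1.0 / (1.0 + dist)
    return matrix


def cluster(matrix: List[List[float]], threshold: float) -> List[List[int]]:
    """Group node indices into blocks using union-find over high-affinity pairs."""
    parent = list(range(len(matrix)))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(len(matrix)), 2):
        if matrix[i][j] >= threshold:
            parent[find(i)] = find(j)

    blocks: Dict[int, List[int]] = {}
    for i in range(len(matrix)):
        blocks.setdefault(find(i), []).append(i)
    return sorted(blocks.values())
```

Each node would normally carry features such as position or font size; in the sketch a node is just a tuple of numbers, so any feature extraction can be plugged in front of `affinity_matrix`.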

    SYSTEM AND METHOD FOR WEB PAGE SEGMENTATION USING ADAPTIVE THRESHOLD COMPUTATION
    5.
    Type: Invention application
    Status: Under examination (published)

    Publication No.: US20130061132A1

    Publication date: 2013-03-07

    Application No.: US13696625

    Filing date: 2010-05-19

    IPC classification: G06F17/00

    Abstract: A system and method for adaptive-threshold Web page segmentation are disclosed. In one embodiment, a method performed by a physical computing system having one or more processors for segmenting a Web page including a plurality of nodes includes parsing content in the Web page into the plurality of nodes using the physical computing system, obtaining feature values between each pair of nodes using the physical computing system, estimating an adaptive threshold value using the obtained feature values using the physical computing system, and segmenting the Web page by comparing the feature values associated with each pair of nodes with the estimated adaptive threshold value.
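
The point of an adaptive threshold is that the cut-off is derived from the page's own feature values rather than fixed in advance. The Python sketch below illustrates this under loud assumptions: nodes are reduced to a single vertical position, the pairwise feature is the absolute distance between positions, the threshold is the mean of all pairwise values, and only consecutive pairs are compared when cutting. None of these choices comes from the abstract.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def pair_features(positions: Sequence[float]) -> Dict[Tuple[int, int], float]:
    """Feature value for every pair of nodes (here: distance between positions)."""
    return {
        (i, j): abs(positions[i] - positions[j])
        for i, j in combinations(range(len(positions)), 2)
    }


def adaptive_threshold(features: Dict[Tuple[int, int], float]) -> float:
    """Estimate the threshold from the observed values (assumed rule: their mean)."""
    return mean(features.values())


def segment(positions: Sequence[float]) -> List[List[int]]:
    """Split nodes, in document order, wherever a consecutive pair reaches the threshold."""
    if not positions:
        return []
    if len(positions) == 1:
        return [[0]]
    features = pair_features(positions)
    threshold = adaptive_threshold(features)
    segments = [[0]]
    for i in range(1, len(positions)):
        if features[(i - 1, i)] < threshold:
            segments[-1].append(i)      # close enough: same segment
        else:
            segments.append([i])        # gap at or above the threshold: new segment
    return segments
```

Because the threshold scales with the data, a page laid out at one tenth of the size produces the same segmentation, which a fixed pixel threshold would not guarantee.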

    SYSTEMS AND METHODS FOR FILTERING WEB PAGE CONTENTS
    10.
    Type: Invention application
    Status: Under examination (published)

    Publication No.: US20130145255A1

    Publication date: 2013-06-06

    Application No.: US13817366

    Filing date: 2010-08-20

    IPC classification: G06F17/21

    Abstract: A system and method for selectively filtering web page contents are disclosed. In one example embodiment, a document object model (DOM) structure and visual information of the web page contents are generated. The document object model (DOM) structure and the visual information are analyzed to determine multiple web page content attributes. One or more filtering parameters are selected from the multiple web page content attributes. The web page is filtered based on the one or more filtering parameters.
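
The flow from DOM and visual information to content attributes to selected filtering parameters can be sketched in Python as below. The `Element` fields, the three attributes (`tag`, `area`, `chars`), and the representation of a filtering parameter as a predicate on one attribute are hypothetical; the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Element:
    tag: str        # taken from the DOM structure
    text: str
    width: int      # visual information (hypothetical pixel sizes)
    height: int


def content_attributes(element: Element) -> Dict[str, object]:
    """Combine DOM and visual information into content attributes."""
    return {
        "tag": element.tag,
        "area": element.width * element.height,
        "chars": len(element.text),
    }


def filter_page(
    elements: List[Element], params: Dict[str, Callable[[object], bool]]
) -> List[Element]:
    """Keep the elements whose attributes satisfy every selected filtering parameter."""
    kept = []
    for element in elements:
        attrs = content_attributes(element)
        if all(check(attrs[name]) for name, check in params.items()):
            kept.append(element)
    return kept
```

A caller selects parameters by choosing which attribute names appear in `params`; for example, `{"tag": lambda t: t != "img", "area": lambda a: a >= 1000}` drops images and very small elements while ignoring the `chars` attribute.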
