SYSTEM AND METHOD FOR WEB PAGE SEGMENTATION USING ADAPTIVE THRESHOLD COMPUTATION
    2.
    Invention application
    SYSTEM AND METHOD FOR WEB PAGE SEGMENTATION USING ADAPTIVE THRESHOLD COMPUTATION (under examination, published)

    Publication No.: US20130061132A1

    Publication date: 2013-03-07

    Application No.: US13696625

    Filing date: 2010-05-19

    IPC classification: G06F17/00

    Abstract: A system and method for adaptive-threshold Web page segmentation are disclosed. In one embodiment, a method performed by a physical computing system having one or more processors for segmenting a Web page including a plurality of nodes includes parsing content in the Web page into the plurality of nodes using the physical computing system, obtaining feature values between each pair of nodes using the physical computing system, estimating an adaptive threshold value using the obtained feature values using the physical computing system, and segmenting the Web page by comparing the feature values associated with each pair of nodes with the estimated adaptive threshold value.
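
The parse / measure / threshold / compare sequence in this abstract can be illustrated with a minimal Python sketch. The abstract does not say how the threshold is estimated, so the rule used here (mean plus one population standard deviation of the observed values) is an assumption, as are the function names and the sample nodes. For brevity the sketch only compares adjacent nodes rather than every pair.

```python
from statistics import mean, pstdev

def adaptive_threshold(values):
    """Assumed estimator: mean plus one population standard deviation."""
    return mean(values) + pstdev(values)

def segment(nodes, distance):
    """Split an ordered node list wherever the gap between neighbours exceeds the threshold."""
    if len(nodes) < 2:
        return [list(nodes)]
    # Feature value for each adjacent pair of nodes.
    gaps = [distance(a, b) for a, b in zip(nodes, nodes[1:])]
    # The threshold is derived from this page's own values, not fixed in advance.
    threshold = adaptive_threshold(gaps)
    segments = [[nodes[0]]]
    for node, gap in zip(nodes[1:], gaps):
        if gap > threshold:
            segments.append([node])      # large gap: start a new segment
        else:
            segments[-1].append(node)    # small gap: same segment
    return segments

# Hypothetical nodes: (label, top y-coordinate in pixels).
nodes = [("title", 0), ("byline", 20), ("para1", 45), ("footer", 400), ("legal", 420)]
segments = segment(nodes, lambda a, b: abs(b[1] - a[1]))
```

With these sample coordinates the only gap above the estimated threshold is the one between `para1` and `footer`, so the page splits into two segments.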


    Detecting separator lines in a web page
    3.
    Granted invention patent
    Detecting separator lines in a web page (granted)

    Publication No.: US08867837B2

    Publication date: 2014-10-21

    Application No.: US13812421

    Filing date: 2010-07-30

    IPC classification: G06K9/34 C07D309/28 G06K9/00

    CPC classification: G06K9/00463 C07D309/28

    Abstract: A system and method of detecting separator lines in a web page may include determining coordinates of visible web elements on a web page, generating an edge image of the web page based on the coordinates of the web elements, filtering edges belonging to non-separator line elements within the edge image, detecting horizontal lines within the edge image, detecting vertical lines within the edge image, and filtering short lines within the edge image. A system for detecting separator lines in a web page may include a memory device, and a processor communicatively coupled to the memory, in which the processor determines coordinates of visible web elements on a web page, generates an edge image of the web page based on the coordinates of the web elements, filters edges belonging to non-separator line elements within the edge image, detects horizontal lines within the edge image, detects vertical lines within the edge image, and filters short lines within the edge image.
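
A rough Python sketch of the edge-image pipeline described in this abstract follows. It assumes element coordinates arrive as `(x0, y0, x1, y1)` pixel boxes, rasterises their outlines onto a binary grid, and keeps only horizontal and vertical runs of at least `min_length` pixels. The function names and sample boxes are hypothetical, and the step that filters edges of non-separator elements is omitted because the abstract does not say how it is done.

```python
def edge_image(boxes, width, height):
    """Rasterise the outline of each (x0, y0, x1, y1) box onto a binary grid."""
    img = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for x in range(x0, x1 + 1):
            img[y0][x] = img[y1][x] = 1   # top and bottom edges
        for y in range(y0, y1 + 1):
            img[y][x0] = img[y][x1] = 1   # left and right edges
    return img

def runs(line):
    """Yield (start, length) for each maximal run of 1s in a row or column."""
    start = None
    for i, value in enumerate(list(line) + [0]):   # trailing 0 closes a final run
        if value and start is None:
            start = i
        elif not value and start is not None:
            yield start, i - start
            start = None

def separator_lines(img, min_length):
    """Detect horizontal and vertical lines, discarding those shorter than min_length."""
    horizontal = [(y, s, n) for y, row in enumerate(img)
                  for s, n in runs(row) if n >= min_length]
    vertical = [(x, s, n) for x, col in enumerate(zip(*img))
                for s, n in runs(col) if n >= min_length]
    return horizontal, vertical

# Hypothetical page: a 10-pixel-wide rule on row 0 and a small 3x3 box.
boxes = [(0, 0, 9, 0), (2, 2, 4, 4)]
img = edge_image(boxes, width=10, height=6)
horizontal, vertical = separator_lines(img, min_length=5)
```

Here the rule survives as one horizontal line `(row, start, length)`, while the edges of the small box are filtered out as short lines.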


    METHOD OF EXTRACTING NAMED ENTITY
    4.
    Invention application
    METHOD OF EXTRACTING NAMED ENTITY (under examination, published)

    Publication No.: US20130204835A1

    Publication date: 2013-08-08

    Application No.: US13643925

    Filing date: 2010-04-27

    IPC classification: G06N5/04

    CPC classification: G06N5/048 G06F17/278

    Abstract: Presented is a method of extracting named entities from a large-scale document corpus. The method includes identifying named entities in the corpus and forming a set of seed entities manually or automatically using some existing resources, constructing a named entity graph to discover same-type probability between any given pair of named entities, expanding the set of seed entities, and performing a confidence propagation of the seed entities on the named entity graph.
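
One way to picture confidence propagation on a named entity graph is the Python sketch below. The abstract does not define the propagation rule; this sketch assumes a simple max-product scheme in which an entity's confidence is the best product of same-type probabilities along any path from a seed. The graph, its weights, and the function name are hypothetical.

```python
def propagate(graph, seeds):
    """graph: {entity: {neighbour: same_type_probability}}; seeds: set of seed entities."""
    # Seeds start fully trusted, every other entity starts at zero.
    conf = {e: (1.0 if e in seeds else 0.0) for e in graph}
    for _ in range(len(graph)):           # enough rounds to cover any simple path
        changed = False
        for entity, neighbours in graph.items():
            best = max((w * conf[n] for n, w in neighbours.items()), default=0.0)
            if best > conf[entity]:       # assumed rule: keep the strongest evidence
                conf[entity] = best
                changed = True
        if not changed:
            break
    return conf

# Hypothetical graph with symmetric same-type probabilities.
graph = {
    "Paris": {"London": 0.9},
    "London": {"Paris": 0.9, "Berlin": 0.8},
    "Berlin": {"London": 0.8, "banana": 0.1},
    "banana": {"Berlin": 0.1},
}
conf = propagate(graph, seeds={"Paris"})
```

Entities whose confidence ends up above some cut-off could then be added to the seed set; the cut-off itself is not specified in the abstract.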


    Obtaining Rendering Co-ordinates Of Visible Text Elements
    5.
    Invention application
    Obtaining Rendering Co-ordinates Of Visible Text Elements (under examination, published)

    Publication No.: US20130159889A1

    Publication date: 2013-06-20

    Application No.: US13808856

    Filing date: 2010-07-07

    IPC classification: G06F3/0481

    Abstract: A computer-implemented method for obtaining the rendering co-ordinates of visible text elements on a web page is disclosed. The web page is represented by an input data structure comprising a plurality of text nodes, each of which represents a text element on the web page. The method comprises the following steps: a) using a computer device, wrapping each of the plurality of text nodes in a pair of mark-up language tags; b) using said computer device, obtaining the co-ordinates of a bounding rectangle for each text node using the mark-up language tags; c) using said computer device, attaching an attribute specifying the co-ordinates of the bounding rectangle to each text node; and d) using said computer device, determining whether each text node is invisible, and if it is, excluding it from an output data structure comprising the plurality of text nodes and attached attributes.
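
Steps a) to d) can be mimicked outside a browser with the Python sketch below. It is only a simulation: the `measure` callback stands in for whatever layout query a real rendering engine would provide, the wrapper tags are reduced to generated identifiers, and treating a zero-area rectangle as "invisible" is an assumption. All names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class TextNode:
    text: str
    attrs: dict = field(default_factory=dict)

def annotate(nodes, measure):
    """measure(tag_id) -> (left, top, width, height); a stand-in for a layout engine."""
    output = []
    for i, node in enumerate(nodes):
        tag_id = f"wrap-{i}"                          # (a) wrap the node, identified by tag id
        node.attrs["wrapper"] = tag_id
        left, top, width, height = measure(tag_id)    # (b) bounding rectangle via the wrapper
        node.attrs["rect"] = (left, top, width, height)  # (c) attach the co-ordinates
        if width > 0 and height > 0:                  # (d) assumed test: zero area means invisible
            output.append(node)
    return output

# Hypothetical layout results keyed by wrapper id.
layout = {"wrap-0": (0, 0, 120, 18), "wrap-1": (0, 0, 0, 0)}
visible = annotate([TextNode("Hello"), TextNode("hidden")], layout.__getitem__)
```

Only the first node reaches the output structure, carrying its bounding rectangle as an attribute.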


    PRODUCT INFORMATION
    6.
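    Invention application
    PRODUCT INFORMATION (under examination, published)

    Publication No.: US20130159209A1

    Publication date: 2013-06-20

    Application No.: US13817361

    Filing date: 2010-08-18

    IPC classification: G06Q10/06

    CPC classification: G06Q10/067 G06Q30/02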

    Abstract: Disclosed is a method of generating a model representation of product information. The method obtains a list of products from a source of product information. A hierarchical tree is then constructed from the obtained list of products, wherein each hierarchical layer of the tree corresponds to a different category of product information.
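
The tree construction described here can be sketched in a few lines of Python. The sketch assumes each product is a flat record and that the caller chooses which fields form the layers; the field names and sample products are hypothetical, and nested dictionaries are just one possible tree representation.

```python
def build_tree(products, levels):
    """Nest products into dictionaries, one tree layer per field named in levels."""
    tree = {}
    for product in products:
        node = tree
        for level in levels:
            # Descend one layer, creating the branch the first time it is seen.
            node = node.setdefault(product[level], {})
    return tree

# Hypothetical product list.
products = [
    {"category": "Printers", "family": "Laser", "model": "L100"},
    {"category": "Printers", "family": "Laser", "model": "L200"},
    {"category": "Printers", "family": "Inkjet", "model": "J10"},
    {"category": "Laptops", "family": "Ultralight", "model": "U5"},
]
tree = build_tree(products, levels=("category", "family", "model"))
```

Each depth of `tree` then corresponds to one category of product information: top-level keys are categories, the next layer holds families, and the leaves are models.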


    VISUAL SEPARATOR DETECTION IN WEB PAGES USING CODE ANALYSIS
    7.
    Invention application
    VISUAL SEPARATOR DETECTION IN WEB PAGES USING CODE ANALYSIS (under examination, published)

    Publication No.: US20130124684A1

    Publication date: 2013-05-16

    Application No.: US13812092

    Filing date: 2010-07-30

    IPC classification: H04L29/08

    CPC classification: H04L29/0809 G06F17/272

    Abstract: A method for detection of visual separators in web pages using code analysis includes receiving a web page and its associated web code by a web page analysis device and analyzing the web code to detect visual separators in the web page. A web page analysis device for visual separator detection in web pages is also provided.
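
The abstract does not say which code patterns count as visual separators. As a loose illustration only, the Python sketch below uses the standard-library `html.parser` module and assumes two patterns: `<hr>` elements and inline `style` attributes that mention a border. The class name and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

class SeparatorFinder(HTMLParser):
    """Collect elements whose markup suggests a visual separator."""

    def __init__(self):
        super().__init__()
        self.separators = []

    def handle_starttag(self, tag, attrs):
        # A style attribute without a value is reported as None, hence the fallback.
        style = (dict(attrs).get("style") or "").lower()
        if tag == "hr":
            self.separators.append((tag, "horizontal rule"))
        elif "border" in style:
            # Naive substring check: it would also match "border: none".
            self.separators.append((tag, "inline border"))

html = '<div style="border-bottom: 1px solid #ccc">A</div><hr><p>B</p>'
finder = SeparatorFinder()
finder.feed(html)
separators = finder.separators
```

A fuller analysis would also need to consult external stylesheets and computed styles, which this sketch does not attempt.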


    Extraction of Content from a Web Page
    9.
    Invention application
    Extraction of Content from a Web Page (under examination, published)

    Publication No.: US20130283148A1

    Publication date: 2013-10-24

    Application No.: US13817656

    Filing date: 2010-10-26

    IPC classification: G06F17/22

    CPC classification: G06F17/2247 G06F16/986

    Abstract: A system and method are provided for extracting main content from a web page. Web page segmentation is performed on a web page to provide affinity-grouped segments. Descriptive features of at least one of the affinity-grouped segments are computed. At least one of the affinity-grouped segments is classified as a main body segment based on the computed descriptive features. Additional affinity-grouped segments are classified by document function based on the computed descriptive features. Classified affinity-grouped segments are assembled according to their classified document functions to provide the main content.
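
The feature / classify / assemble flow can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the two features (word count and link density), the scoring rule for the main body, the label names, and the thresholds. Segmentation itself is taken as given.

```python
def features(segment):
    """Descriptive features: word count and share of words that sit inside links."""
    words = len(segment["text"].split())
    link_density = segment["link_words"] / words if words else 0.0
    return {"words": words, "link_density": link_density}

def classify(segments):
    """Label one segment as the body, then label the rest by an assumed document function."""
    feats = [features(s) for s in segments]
    # Assumed rule: the body is the segment with the most non-link text.
    main = max(range(len(segments)),
               key=lambda i: feats[i]["words"] * (1 - feats[i]["link_density"]))
    labels = []
    for i, f in enumerate(feats):
        if i == main:
            labels.append("body")
        elif f["link_density"] > 0.5:
            labels.append("navigation")
        elif i < main and f["words"] <= 15:
            labels.append("title")
        else:
            labels.append("other")
    return labels

def assemble(segments, labels, keep=("title", "body")):
    """Join the segments whose function belongs in the main content, in page order."""
    return "\n".join(s["text"] for s, label in zip(segments, labels) if label in keep)

# Hypothetical segments from an already segmented page.
segments = [
    {"text": "Home News Sports", "link_words": 3},
    {"text": "Adaptive thresholds explained", "link_words": 0},
    {"text": "Page segmentation groups related nodes so that the main article "
             "can be separated from menus and advertisements.", "link_words": 0},
    {"text": "Contact About Privacy", "link_words": 3},
]
labels = classify(segments)
main_content = assemble(segments, labels)
```

The navigation bar and footer links are dropped, leaving the title followed by the body text.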


    SEED SET EXPANSION
    10.
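    Invention application
    SEED SET EXPANSION (under examination, published)

    Publication No.: US20130238607A1

    Publication date: 2013-09-12

    Application No.: US13883934

    Filing date: 2010-11-10

    IPC classification: G06F17/30

    CPC classification: G06F16/2453 G06F16/986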

    Abstract: Systems and methods for seed set expansion are provided. A context-based extractor (22) generates a set of context-based candidate members of a seed set from a set of web pages associated with an organization as words connected with a seed set member by a contextual pattern and a context confidence value for each candidate member. A list-based extractor (24) generates a set of list-based candidate members from elements within a plurality of lists in the set of web pages and a list confidence value associated with each candidate member. A confidence arbitrator (26) determines an intersection set of candidate members present in both sets of candidate members and determines a final confidence value for each of the intersection set of candidate members based on their respective context confidence value and list confidence value. A candidate selector (28) selects a candidate member for inclusion in a seed set (21).
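
The arbitration and selection stages can be sketched in Python as below, taking the two extractors' outputs as given dictionaries of candidate to confidence. The abstract says only that the final confidence is based on both values, so combining them by multiplication is an assumption, as is selecting by a fixed threshold. Function names and sample candidates are hypothetical.

```python
def arbitrate(context, listed):
    """Keep candidates found by both extractors and combine their confidences."""
    shared = context.keys() & listed.keys()           # intersection set
    return {c: context[c] * listed[c] for c in shared}  # assumed rule: product

def select(final, threshold=0.5):
    """Assumed selector: accept candidates whose final confidence meets the threshold."""
    return sorted(c for c, value in final.items() if value >= threshold)

# Hypothetical extractor outputs: candidate -> confidence.
context_candidates = {"Acme Labs": 0.9, "Widget": 0.4, "Foo": 0.8}
list_candidates = {"Acme Labs": 0.8, "Widget": 0.9, "Bar": 0.7}

final = arbitrate(context_candidates, list_candidates)
selected = select(final)
```

Candidates seen by only one extractor ("Foo", "Bar") never reach the selector, and "Widget" is rejected because its combined confidence falls below the threshold.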
