Assigning human-understandable labels to web pages
    2.
    Type: granted invention patent
    Status: in force

    Publication number: US08185528B2

    Publication date: 2012-05-22

    Application number: US12144036

    Filing date: 2008-06-23

    Applicant: Ashwin Tengli

    Inventor: Ashwin Tengli

    CPC classification number: G06F17/30861 G06F17/30616 G06F17/30696

    Abstract: Methods and systems that label a web page by collecting a set of inbound labels for the web page, estimating a language model for the web page, computing the likelihood of generating each inbound label given the language model and assigning a score to each inbound label based on this likelihood, and assigning a label to the web page based on the score assigned to each of the set of inbound labels. Inbound labels are preferably collected from the set of web documents linking to the web page. Labels assigned are useful in providing labeled links to web pages from top hosts in search results pages.
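
    The scoring step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the patented method: the unigram model, the add-alpha smoothing, the length normalization, and all function names are assumptions, since the abstract does not say how the language model is estimated or how likelihood maps to a score.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def estimate_language_model(page_text, alpha=1.0):
    """Return a word-probability function for the page.

    Assumption: a unigram model with add-alpha smoothing; the abstract
    does not name an estimator.
    """
    counts = Counter(tokenize(page_text))
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen words

    def prob(word):
        return (counts[word] + alpha) / (total + alpha * vocab_size)

    return prob


def score_label(label, prob):
    """Length-normalized log-likelihood of generating the label (assumed scoring)."""
    words = tokenize(label)
    if not words:
        return float("-inf")
    return sum(math.log(prob(w)) for w in words) / len(words)


def assign_label(page_text, inbound_labels):
    """Score every inbound label and return the best one with all scores."""
    prob = estimate_language_model(page_text)
    scores = {label: score_label(label, prob) for label in inbound_labels}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    page = ("Contact us. Our contact page lists phone and email "
            "contact details for support.")
    best, scores = assign_label(
        page, ["contact", "click here", "support contact details"])
    print(best, scores)
```

    Under these assumptions, generic anchor text such as "click here" scores low because its words never occur on the target page, while labels built from the page's own vocabulary score higher.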

    METHOD AND SYSTEM FOR WEB EXTRACTION
    3.
    Type: invention patent application
    Status: under examination (published)

    Publication number: US20120005207A1

    Publication date: 2012-01-05

    Application number: US12828305

    Filing date: 2010-07-01

    CPC classification number: G06F16/9535

    Abstract: A method includes generating a plurality of sets of pairs of records from a set of records, for each attribute-position pair in the set of records. Each attribute-position pair is indicative of a position of an attribute in a record. Further, the method includes forming, electronically, a plurality of groups, each group comprising two attribute-position pairs having different attributes. The method also includes determining, electronically for each group, the number of pairs of records that are common to the two attribute-position pairs of that group. Furthermore, the method includes extracting results based on a first group of the plurality of groups if the number of pairs of records common to the two attribute-position pairs of the first group is greater than a second threshold and is the highest among the plurality of groups, and if no group having three or more attribute-position pairs with different attributes is possible.
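
    One possible reading of the grouping step is sketched below. It assumes that a record is a list of attribute names indexed by position, and that the set of record pairs for an attribute-position pair contains every pair of records sharing that attribute at that position. Both are interpretations of the abstract, and the function names are hypothetical. The condition about groups of three or more attribute-position pairs is not modeled.

```python
from collections import defaultdict
from itertools import combinations


def record_pairs_by_attribute_position(records):
    """Map each (attribute, position) to the set of record-index pairs sharing it."""
    holders = defaultdict(list)
    for idx, record in enumerate(records):
        for position, attribute in enumerate(record):
            if attribute is not None:
                holders[(attribute, position)].append(idx)
    return {ap: set(combinations(ids, 2)) for ap, ids in holders.items()}


def best_group(records, threshold):
    """Return the two-member group with the most common record pairs.

    Only groups whose count exceeds `threshold` qualify; returns
    (None, 0) when no group does.
    """
    pair_sets = record_pairs_by_attribute_position(records)
    best, best_count = None, threshold
    for ap1, ap2 in combinations(sorted(pair_sets), 2):
        if ap1[0] == ap2[0]:
            continue  # a group needs two different attributes
        common = len(pair_sets[ap1] & pair_sets[ap2])
        if common > best_count:
            best, best_count = (ap1, ap2), common
    if best is None:
        return None, 0
    return best, best_count


if __name__ == "__main__":
    records = [
        ["name", "price"],
        ["name", "price"],
        ["name", "price"],
        ["name", "city"],
    ]
    print(best_group(records, threshold=1))
```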

    METHOD AND SYSTEM FOR DETERMINING SIMILARITY SCORE
    4.
    Type: invention patent application
    Status: in force

    Publication number: US20110225173A1

    Publication date: 2011-09-15

    Application number: US12721577

    Filing date: 2010-03-11

    CPC classification number: G06K9/3266 G06K9/723 G06K2209/01

    Abstract: A method includes generating, electronically, one or more matching patterns for one or more pairs of attribute values. Each pair includes two attribute values: a first attribute value from a first record and a second attribute value from a second record. The first attribute value and the second attribute value satisfy a first criterion. Further, the method includes identifying, electronically, a matching segment between the first attribute value and the second attribute value of a first pair. The method also includes repeating the identifying for each pair. Moreover, the method includes computing a similarity score for the first pair using one of the first pair and the matching segment based on the one or more matching patterns and matching segments of the one or more pairs satisfying a second criterion. The method also includes repeating the computing for each pair.
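
    As a rough illustration of identifying a matching segment and deriving a score from it, the sketch below substitutes a simple technique for the one claimed: the matching segment is taken to be the longest common substring (found with Python's difflib), and the score is the segment length divided by the longer value's length. The matching patterns and the two criteria from the abstract are not modeled, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def matching_segment(first, second):
    """Return the longest contiguous substring shared by both values."""
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    match = matcher.find_longest_match(0, len(first), 0, len(second))
    return first[match.a:match.a + match.size]


def similarity_score(first, second):
    """Score in [0, 1]: segment length over the longer value's length (assumed)."""
    if not first or not second:
        return 0.0
    segment = matching_segment(first, second)
    return len(segment) / max(len(first), len(second))


if __name__ == "__main__":
    pairs = [("Acme Corp.", "Acme Corporation"), ("abc", "xyz")]
    for first, second in pairs:  # repeat the computation for each pair
        print(first, second, similarity_score(first, second))
```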

    Assigning Human-Understandable Labels to Web Pages
    5.
    Type: invention patent application
    Status: in force

    Publication number: US20090319533A1

    Publication date: 2009-12-24

    Application number: US12144036

    Filing date: 2008-06-23

    Applicant: Ashwin Tengli

    Inventor: Ashwin Tengli

    CPC classification number: G06F17/30861 G06F17/30616 G06F17/30696

    Abstract: Methods and systems that label a web page collect a set of inbound labels for the web page, estimate a language model for the web page, compute the likelihood of generating each inbound label given the language model and assign a score to each inbound label based on this likelihood, and assign a label to the web page based on the score assigned to each of the set of inbound labels. Inbound labels are preferably collected from the set of web documents linking to the web page. Labels assigned are useful in providing labeled links to web pages from top hosts in search result pages.
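
    The collection step, in which inbound labels come from the documents linking to the page, might look like the sketch below. It assumes that an inbound label is the anchor text of a link whose resolved URL equals the target URL exactly. The class and function names are hypothetical, and a real crawler would also need URL normalization.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect the anchor text of links that point at one target URL."""

    def __init__(self, base_url, target_url):
        super().__init__()
        self.base_url = base_url
        self.target_url = target_url
        self.labels = []
        self._in_target_link = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Resolve relative links against the linking document's URL.
            if href and urljoin(self.base_url, href) == self.target_url:
                self._in_target_link = True
                self._buffer = []

    def handle_data(self, data):
        if self._in_target_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_target_link:
            label = " ".join("".join(self._buffer).split())
            if label:
                self.labels.append(label)
            self._in_target_link = False


def collect_inbound_labels(linking_docs, target_url):
    """Count anchor texts across (document_url, html) pairs linking to target_url."""
    counts = Counter()
    for base_url, html in linking_docs:
        parser = AnchorCollector(base_url, target_url)
        parser.feed(html)
        parser.close()
        counts.update(parser.labels)
    return counts


if __name__ == "__main__":
    docs = [("http://host.example/home",
             '<a href="/contact">Contact Us</a> <a href="/about">About</a>')]
    print(collect_inbound_labels(docs, "http://host.example/contact"))
```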

    Techniques for categorizing web pages
    7.
    Type: granted invention patent
    Status: in force

    Publication number: US08768926B2

    Publication date: 2014-07-01

    Application number: US12652624

    Filing date: 2010-01-05

    CPC classification number: G06F7/00 G06F17/30834 G06F17/30864

    Abstract: Web pages are efficiently categorized in a data processor without analyzing the content of the web pages. According to at least one embodiment, data is maintained that represents sample URLs grouped into a plurality of clusters. The sample URLs of a cluster are used to produce a URL regular expression pattern (“URL-regex”) that differentiates the sample URLs of the cluster from the sample URLs of other clusters and that covers at least a specified percentage of the sample URLs in the cluster. The process of producing a URL-regex is repeated for each of the clusters producing a URL-regex for each cluster. Web pages are then categorized into one of the clusters by determining which of the URL-regex patterns produced for the clusters match URLs that refer to the web pages. Thus, a web page may be categorized based on a URL that refers to the web page without having to obtain and analyze the content of the web page.
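
    The sketch below illustrates the two checks a URL-regex must pass, and the final categorization by URL alone. The patterns are written by hand for illustration; how the patent derives them from the sample URLs is not shown here. The cluster names, example URLs, and function names are all hypothetical.

```python
import re


def coverage(pattern, urls):
    """Fraction of the URLs that the pattern matches in full."""
    if not urls:
        return 0.0
    return sum(1 for url in urls if pattern.fullmatch(url)) / len(urls)


def is_acceptable(cluster, pattern, clusters, min_coverage):
    """Check that a pattern covers enough of its own cluster and no other cluster."""
    covers_own = coverage(pattern, clusters[cluster]) >= min_coverage
    leaks = any(
        pattern.fullmatch(url)
        for name, urls in clusters.items() if name != cluster
        for url in urls
    )
    return covers_own and not leaks


def categorize(url, patterns):
    """Return the first cluster whose URL-regex matches, without fetching the page."""
    for cluster, pattern in patterns.items():
        if pattern.fullmatch(url):
            return cluster
    return None


if __name__ == "__main__":
    clusters = {
        "product": ["http://shop.example/item/123",
                    "http://shop.example/item/77",
                    "http://shop.example/item/9/reviews"],
        "news": ["http://shop.example/news/2010/01/sale",
                 "http://shop.example/news/2009/12/launch"],
    }
    patterns = {
        "product": re.compile(r"http://shop\.example/item/\d+"),
        "news": re.compile(r"http://shop\.example/news/\d{4}/\d{2}/[\w-]+"),
    }
    print(is_acceptable("product", patterns["product"], clusters, 0.6))
    print(categorize("http://shop.example/item/555", patterns))
```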
