Methods and Apparatus for Assessing Web Page Decay

    Publication No.: US20080097977A1

    Publication date: 2008-04-24

    Application No.: US11955471

    Filing date: 2007-12-13

    IPC classification: G06F17/30

    CPC classification: G06F16/958

    Abstract: Systems and methods are herein disclosed for assessing the staleness of a web page. In particular, in one method of the present invention, the staleness of a web page is assessed by examining internal date references within the web page. In another method of the present invention, the staleness of a web page is assessed by examining the meta-data associated with the web page. In a further method of the present invention, the staleness of a hyperlinked web page is determined by examining the link status of the hyperlinks. If the web page has a relatively large number of dead links, it is assessed as being a stale web page. In a still further method of the present invention, the link status of web pages in the neighborhood of the web page being assessed is likewise examined.
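
As a rough illustration of the first method in the abstract (internal date references), the sketch below extracts four-digit years from a page's text and reports how many years separate the most recent one from a reference year. The function name `years_since_latest_date`, the year-only regular expression, and the rule of ignoring years after the reference year are assumptions made for illustration; the abstract does not say how dates are extracted or scored.

```python
import re

# Hypothetical pattern: matches plain four-digit years from 1900 to 2099 only.
YEAR_RE = re.compile(r"\b(19\d{2}|20\d{2})\b")

def years_since_latest_date(page_text, reference_year):
    """Return reference_year minus the latest year mentioned in the text.

    Returns None when the text contains no usable year.
    """
    years = [int(y) for y in YEAR_RE.findall(page_text)]
    # Assumption: years after the reference year are not evidence of freshness.
    years = [y for y in years if y <= reference_year]
    if not years:
        return None
    return reference_year - max(years)
```

A larger return value suggests a page whose own date references have not been refreshed for a long time; how that value maps to "stale" is left open here.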

    12. System and method for detecting matches of small edit distance
    Patent application; status: under examination (published)

    Publication No.: US20070085716A1

    Publication date: 2007-04-19

    Application No.: US11241468

    Filing date: 2005-09-30

    IPC classification: H03M7/30

    CPC classification: G06F16/90344

    Abstract: A system and method of approximating edit distance for a set of character strings in a database includes producing a representative sketch for each of the character strings; and approximating an edit distance between two selected character strings based only on the representative sketch for each of the selected character strings. The character strings may comprise text, wherein the method further comprises encoding positions of substrings in the text using anchors, wherein the anchors comprise identical substrings occurring in two input character strings at a nearby position. A set of anchors may be used in a correlated manner, wherein character strings with a sufficiently small edit distance are likely to use a same sequence of anchors. The character strings may be substantially non-repetitive. The representative sketch of a first character string is preferably constructed absent knowledge of a second character string. A size of the representative sketch may be constant.
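
For reference, the quantity that the sketches are meant to approximate can be computed exactly with the standard dynamic-programming recurrence. The code below is that textbook baseline (Levenshtein distance with unit costs), not the sketch-based method of the patent; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a, b):
    """Exact Levenshtein distance between strings a and b (unit-cost edits)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

This baseline needs both full strings and time proportional to the product of their lengths, which is the cost a constant-size sketch per string would avoid.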


    13. Methods and apparatus for assessing web page decay
    Patent application; status: under examination (published)

    Publication No.: US20060112089A1

    Publication date: 2006-05-25

    Application No.: US10995770

    Filing date: 2004-11-22

    IPC classification: G06F17/30

    CPC classification: G06F16/958

    Abstract: Systems and methods are herein disclosed for assessing the staleness of a web page. In particular, in one method of the present invention, the staleness of a web page is assessed by examining internal date references within the web page. In another method of the present invention, the staleness of a web page is assessed by examining the meta-data associated with the web page. In a further method of the present invention, the staleness of a hyperlinked web page is determined by examining the link status of the hyperlinks. If the web page has a relatively large number of dead links, it is assessed as being a stale web page. In a still further method of the present invention, the link status of web pages in the neighborhood of the web page being assessed is likewise examined.
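
The dead-link test described in the abstract can be sketched as a simple ratio. The code assumes the link status of each hyperlink has already been checked and is supplied as a mapping from URL to a boolean; the names `dead_link_fraction` and `is_stale` and the 0.5 threshold are illustrative assumptions, since the abstract only speaks of "a relatively large number of dead links".

```python
def dead_link_fraction(link_status):
    """link_status maps each outgoing URL to True (alive) or False (dead)."""
    if not link_status:
        return 0.0  # a page without links gives no dead-link evidence
    dead = sum(1 for alive in link_status.values() if not alive)
    return dead / len(link_status)

def is_stale(link_status, threshold=0.5):
    """Assumed rule: stale when at least `threshold` of the links are dead."""
    return dead_link_fraction(link_status) >= threshold
```

For example, a page with three dead links out of four has a dead-link fraction of 0.75 and is flagged as stale under the assumed threshold.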


    14. IDENTIFYING TOPICAL ENTITIES
    Patent application; status: under examination (published)

    Publication No.: US20150278366A1

    Publication date: 2015-10-01

    Application No.: US13153365

    Filing date: 2011-06-03

    IPC classification: G06F17/30

    CPC classification: G06F17/30867 G06F17/30958

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for identifying topical entities. In one aspect, a method includes obtaining a plurality of entities that are associated with a first resource; for one or more of the identified entities, receiving search results for a search query derived from the entity; determining that search results for a search query including a particular entity include a specific type of search results; and determining that the particular entity is a topical entity of the first resource based at least in part on the particular entity appearing in a title or a resource locator of the first resource, wherein the topical entity of the first resource represents a predominant topic of the first resource.
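
The two conditions named in the abstract (the entity's query returns a specific type of search result, and the entity appears in the resource's title or locator) can be combined as in the sketch below. Everything beyond those two conditions is assumed: `has_special_results` is a hypothetical callback standing in for a search backend, and the case-insensitive substring match, with spaces replaced by hyphens for the URL, is only one possible way to test "appearing in".

```python
def topical_entities(entities, title, url, has_special_results):
    """Return the entities that satisfy both conditions from the abstract.

    has_special_results(entity) -> bool is a hypothetical callback that says
    whether a query derived from the entity returned the specific result type.
    """
    title_l, url_l = title.lower(), url.lower()
    topical = []
    for entity in entities:
        key = entity.lower()
        # Assumed matching rule: substring in title, or hyphenated form in URL.
        in_title_or_url = key in title_l or key.replace(" ", "-") in url_l
        if in_title_or_url and has_special_results(entity):
            topical.append(entity)
    return topical
```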


    15. Counting unique search results
    Granted patent; status: lapsed

    Publication No.: US08065309B1

    Publication date: 2011-11-22

    Application No.: US12106860

    Filing date: 2008-04-21

    IPC classification: G06F17/30 G06F15/16

    CPC classification: G06F17/30979

    Abstract: The subject matter of this specification can be embodied in, among other things, a computer-implemented method for counting one or more unique search results within a plurality of search results. The method includes creating hash values for information in each of the search results using a first hash function. The first hash function has a predetermined hash value range size. The method further includes identifying a predetermined number of smallest hash values within the created hash values. The method further includes estimating a first number of unique search results based on the predetermined hash value range size, the predetermined number, and a largest hash value in the smallest hash values.
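
The three inputs named in the abstract (hash range size, the number k of smallest hash values kept, and the largest of those k values) are the ingredients of the well-known "k minimum values" distinct-count estimator. The sketch below is one such estimator under stated assumptions: a 32-bit range taken from SHA-256, and the (k - 1) * range / k-th-smallest form of the estimate. The abstract does not give the exact formula, so this should not be read as the patented computation.

```python
import hashlib

HASH_RANGE = 2 ** 32  # assumed hash value range size

def hash32(text):
    """Map a string to an integer in [0, HASH_RANGE) using SHA-256."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def estimate_unique(results, k=64):
    """Estimate the number of distinct strings from the k smallest hashes."""
    if k < 2:
        raise ValueError("k must be at least 2")
    hashes = sorted({hash32(r) for r in results})
    if len(hashes) < k:
        # Fewer than k distinct hashes: the count is known (up to collisions).
        return float(len(hashes))
    kth_smallest = hashes[k - 1]  # largest value among the k smallest
    return (k - 1) * HASH_RANGE / kth_smallest
```

The intuition is that if n distinct items hash uniformly over the range, the k-th smallest hash sits near k/n of the way through it, so a small k-th value signals many distinct items. Only the k smallest values need to be retained, although this sketch sorts all hashes for brevity.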


    16. Methods and Apparatus for Assessing Web Page Decay
    Patent application; status: under examination (published)

    Publication No.: US20080097978A1

    Publication date: 2008-04-24

    Application No.: US11955481

    Filing date: 2007-12-13

    IPC classification: G06F17/30

    CPC classification: G06F16/958

    Abstract: Systems and methods are herein disclosed for assessing the staleness of a web page. In particular, in one method of the present invention, the staleness of a web page is assessed by examining internal date references within the web page. In another method of the present invention, the staleness of a web page is assessed by examining the meta-data associated with the web page. In a further method of the present invention, the staleness of a hyperlinked web page is determined by examining the link status of the hyperlinks. If the web page has a relatively large number of dead links, it is assessed as being a stale web page. In a still further method of the present invention, the link status of web pages in the neighborhood of the web page being assessed is likewise examined.
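
The last method in the abstract looks beyond a page's own links to the pages in its neighborhood. One way to picture this is a score that propagates from dead pages back along links, weakening with distance. The scoring rule below (1.0 for a dead page, otherwise a damped average over linked pages, cut off at a fixed depth) is an assumption made for illustration and is not taken from the abstract; `out_links` and `is_dead` are hypothetical precomputed inputs.

```python
def neighborhood_decay(page, out_links, is_dead, depth=2, damping=0.5):
    """Score in [0, 1]: higher means more dead pages nearby.

    out_links maps a page to the list of pages it links to.
    is_dead maps a page to True when it could not be fetched.
    """
    if is_dead.get(page, False):
        return 1.0
    targets = out_links.get(page, [])
    if depth == 0 or not targets:
        return 0.0  # nothing further to inspect within the depth limit
    child_scores = [
        neighborhood_decay(t, out_links, is_dead, depth - 1, damping)
        for t in targets
    ]
    # Assumed rule: evidence one hop further away counts `damping` times as much.
    return damping * sum(child_scores) / len(child_scores)
```

The depth limit also keeps the recursion finite when the link graph contains cycles.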


    18. Method and system for improving data quality in large hyperlinked text databases using pagelets and templates
    Granted patent; status: in force

    Publication No.: US06968331B2

    Publication date: 2005-11-22

    Application No.: US10055586

    Filing date: 2002-01-22

    IPC classification: G06F17/30 G06F7/00

    Abstract: A computing system and method clean a set of hypertext documents to minimize violations of a Hypertext Information Retrieval (IR) rule set. Then, the system and method perform an information retrieval operation on the resulting cleaned data. The cleaning process includes decomposing each page of the set of hypertext documents into one or more pagelets; identifying possible templates; and eliminating the templates from the data. Traditional IR search and mining algorithms can then be used to search on the remaining pagelets, as opposed to the original pages, to provide cleaner, more precise results.
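
The cleaning steps in the abstract (decompose pages into pagelets, identify templates, eliminate them) can be illustrated with a deliberately simplified sketch. It assumes the decomposition has already been done and that a template is any pagelet whose exact text appears on at least `min_pages` pages; both the function name `remove_templates` and that criterion are assumptions, as the abstract does not describe how templates are recognized.

```python
from collections import Counter

def remove_templates(pages, min_pages=2):
    """pages maps a page id to its list of pagelet strings.

    Returns the same mapping with assumed template pagelets removed.
    """
    counts = Counter()
    for pagelets in pages.values():
        counts.update(set(pagelets))  # count each pagelet once per page
    templates = {p for p, n in counts.items() if n >= min_pages}
    return {
        pid: [p for p in pagelets if p not in templates]
        for pid, pagelets in pages.items()
    }
```

With two pages that share a navigation bar, only the page-specific pagelets survive, and a conventional search or mining step can then index those instead of the full pages.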
