SYSTEM AND METHOD FOR DETECTING DUPLICATE CONTENT ITEMS
    1.
    Patent application
    Status: pending (published)

    Publication No.: US20090125516A1

    Publication date: 2009-05-14

    Application No.: US11939834

    Filing date: 2007-11-14

    IPC classification: G06F17/30 G06F15/18

    CPC classification: G06F16/958

    Abstract: Generally, the present invention provides systems, methods and computer program products for detecting different content items with similar content by examining the anchortext of the link. A method of the present invention comprises selecting one of a plurality of websites, crawling the selected website to identify one or more content items, and downloading one or more content items of the selected website. A determination is then made as to the one or more linking relationships from the one or more content items of the selected website and one or more linking rules are learned based upon association rule mining of the one or more content items. The one or more linking rules are then applied to one or more content items of one or more websites in order to determine storage of the one or more content items based upon the one or more linking rules on a search provider's central server.

    Techniques for detecting duplicate web pages
    2.
    Granted patent
    Status: in force

    Publication No.: US07698317B2

    Publication date: 2010-04-13

    Application No.: US11788505

    Filing date: 2007-04-20

    IPC classification: G06F17/00

    CPC classification: G06F17/30864 G06F17/2211

    Abstract: Techniques are disclosed for detecting web pages with duplicate content. In one embodiment, a set of shingles is computed for each page of a group of pages. An aggregate set of shingles is determined based on the sets of shingles computed for the group of pages. A first subset from the aggregate set of shingles is determined by selecting, from the aggregate set, shingles whose frequencies in the aggregate set exceed a specified threshold. A modified set of shingles is generated for each page of the group of pages by removing, from the set of shingles for that page, any shingle included in the first subset. One or more duplicate pages in the group of pages are determined based at least in part on the modified sets of shingles generated for the group of pages.
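
The steps in this abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the abstract does not say how shingles are formed, how frequency is counted, or how the modified sets are compared. The sketch assumes word-level shingles of length `k`, counts frequency as the number of pages containing a shingle, and compares modified sets with Jaccard similarity. The names `shingles`, `jaccard`, `find_duplicates`, `freq_threshold` and `sim_threshold` are all hypothetical.

```python
from collections import Counter


def shingles(text, k=3):
    """Return the set of k-word shingles of a page (assumed shingle definition)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets are treated as dissimilar."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_duplicates(pages, k=3, freq_threshold=2, sim_threshold=0.8):
    """pages maps a page name to its text; returns pairs of likely duplicates."""
    # Step 1: one shingle set per page.
    per_page = {name: shingles(text, k) for name, text in pages.items()}

    # Step 2: aggregate set, here with a count of pages containing each shingle.
    freq = Counter(s for page_set in per_page.values() for s in page_set)

    # Step 3: first subset = shingles whose frequency exceeds the threshold
    # (typically shared boilerplate such as navigation text).
    common = {s for s, n in freq.items() if n > freq_threshold}

    # Step 4: modified set per page = page shingles minus the common ones.
    modified = {name: page_set - common for name, page_set in per_page.items()}

    # Step 5: compare modified sets pairwise (Jaccard is an assumption).
    names = sorted(modified)
    duplicates = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(modified[a], modified[b]) >= sim_threshold:
                duplicates.append((a, b))
    return duplicates
```

Removing the high-frequency shingles before comparison keeps a shared header or footer from inflating the similarity of pages whose main content differs. The pairwise loop is quadratic in the number of pages and is only meant for small groups.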

    Location input mistake correction
    3.
    Granted patent
    Status: in force

    Publication No.: US08370339B2

    Publication date: 2013-02-05

    Application No.: US11797819

    Filing date: 2007-05-08

    IPC classification: G06F17/30

    CPC classification: G06F17/30241 G01C21/20

    Abstract: A system for automatically correcting a mistaken geocoded location input. A wireless device such as a cell phone ranks possible location input based on edit distance, which is a ‘confidence factor’. If there is no perfect match, then a list of geocode options is returned, preferably sorted by the score. The ‘closeness’ is derived by looking at the edit distance to go from the input to the matched address. Edit distance is defined herein as the number of insertion/deletion/replacement operations to go from input location to the possible matched location. In one embodiment, an option list, or ‘pick list’, may be provided based on an edit distance scoring system. The edit distance scoring system is preferably based on a number of keystrokes difference between the input location name and the possible matched location name.
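
The edit distance described here, counting insertions, deletions and replacements, matches the standard Levenshtein distance. The Python sketch below shows how a pick list could be ranked with it. It is an illustration under assumptions, not the patented system: the function names `edit_distance` and `pick_list`, the `limit` parameter, the case-insensitive comparison and the alphabetical tie-break are all hypothetical choices.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and replacements."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # replace, or keep if equal
            ))
        prev = cur
    return prev[-1]


def pick_list(user_input, candidates, limit=5):
    """Return the exact match if one exists, else candidates sorted by distance."""
    query = user_input.strip().lower()
    scored = [(edit_distance(query, name.lower()), name) for name in candidates]

    # A perfect match needs no pick list.
    exact = [name for distance, name in scored if distance == 0]
    if exact:
        return exact[:1]

    # Otherwise offer the closest options first (ties broken alphabetically).
    scored.sort()
    return [name for _, name in scored[:limit]]
```

For example, the input "Seatle" is one insertion away from "Seattle", so "Seattle" would head the list returned for a candidate set that contains it.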

    Techniques for detecting duplicate web pages
    4.
    Patent application
    Status: in force

    Publication No.: US20080263026A1

    Publication date: 2008-10-23

    Application No.: US11788505

    Filing date: 2007-04-20

    IPC classification: G06F17/30

    CPC classification: G06F17/30864 G06F17/2211

    Abstract: Techniques are disclosed for detecting web pages with duplicate content. In one embodiment, a set of shingles is computed for each page of a group of pages. An aggregate set of shingles is determined based on the sets of shingles computed for the group of pages. A first subset from the aggregate set of shingles is determined by selecting, from the aggregate set, shingles whose frequencies in the aggregate set exceed a specified threshold. A modified set of shingles is generated for each page of the group of pages by removing, from the set of shingles for that page, any shingle included in the first subset. One or more duplicate pages in the group of pages are determined based at least in part on the modified sets of shingles generated for the group of pages.

    Location Input Mistake Correction
    5.
    Patent application
    Status: in force

    Publication No.: US20130151512A1

    Publication date: 2013-06-13

    Application No.: US13758701

    Filing date: 2013-02-04

    IPC classification: G06F17/30

    CPC classification: G06F17/30241 G01C21/20

    Abstract: A system for automatically correcting a mistaken geocoded location input. A wireless device such as a cell phone ranks possible location input based on edit distance, which is a ‘confidence factor’. If there is no perfect match, then a list of geocode options is returned, preferably sorted by the score. The ‘closeness’ is derived by looking at the edit distance to go from the input to the matched address. Edit distance is defined herein as the number of insertion/deletion/replacement operations to go from input location to the possible matched location. In one embodiment, an option list, or ‘pick list’, may be provided based on an edit distance scoring system. The edit distance scoring system is preferably based on a number of keystrokes difference between the input location name and the possible matched location name.

    SYSTEMS AND METHODS OF UNIVERSAL RESOURCE LOCATOR NORMALIZATION
    6.
    Patent application
    Status: pending (published)

    Publication No.: US20090164502A1

    Publication date: 2009-06-25

    Application No.: US11963925

    Filing date: 2007-12-24

    IPC classification: G06F17/30

    CPC classification: G06F16/9566

    Abstract: Disclosed herein are methods, systems and architectures for normalizing identifiers corresponding to resources using normalization rules that can be generalized for use with different resources. By way of a non-limiting example, an identifier can be a uniform resource locator (URL), and a normalization rule can be used to normalize URLs that correspond to different resources, e.g., content. A normalization rule can be generated by generalizing two or more normalization rules corresponding to different resources, such that a content determinative component is generalized. A normalization rule can be defined to include a context portion used to determine the rule's applicability to an identifier, and a transformation portion that identifies the transformations to be applied to an applicable identifier to yield a normalized form of the URL. A generalization of two or more normalization rules can include a normalization of one or both of the context and transformation portions.
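
A rule with a context portion and a transformation portion can be pictured with the Python sketch below. The abstract does not specify how either portion is represented, so everything here is an assumption made for illustration: the context is modeled as a tuple of host and path segments, the only transformation is dropping named query parameters, and generalization replaces differing segments with a wildcard. The names `NormalizationRule`, `generalize`, `WILDCARD` and `drop_params`, and the `example.com` URLs, are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

WILDCARD = "*"  # matches any single host or path segment


@dataclass(frozen=True)
class NormalizationRule:
    context: tuple          # context portion: (host, path segment, ...)
    drop_params: frozenset  # transformation portion: query parameters to remove

    def applies_to(self, url):
        """Check the context portion against the URL's host and path."""
        parts = urlsplit(url)
        segments = (parts.netloc.lower(),) + tuple(
            s for s in parts.path.split("/") if s)
        return len(segments) == len(self.context) and all(
            c == WILDCARD or c == s for c, s in zip(self.context, segments))

    def normalize(self, url):
        """Apply the transformation portion if the rule is applicable."""
        if not self.applies_to(url):
            return url
        parts = urlsplit(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query)
                if k not in self.drop_params]
        return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                           urlencode(kept), parts.fragment))


def generalize(a, b):
    """Merge two rules that differ only in some context segments, or return None."""
    if len(a.context) != len(b.context) or a.drop_params != b.drop_params:
        return None
    context = tuple(x if x == y else WILDCARD
                    for x, y in zip(a.context, b.context))
    return NormalizationRule(context, a.drop_params)
```

In this sketch, two rules learned for `example.com/item/42` and `example.com/item/77` that both drop a session parameter generalize to one rule for `example.com/item/*`, which then also applies to items that neither original rule covered.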

    Location input mistake correction
    7.
    Patent application
    Status: in force

    Publication No.: US20080063172A1

    Publication date: 2008-03-13

    Application No.: US11797819

    Filing date: 2007-05-08

    IPC classification: H04M3/42

    CPC classification: G06F17/30241 G01C21/20

    Abstract: A system for automatically correcting a mistaken geocoded location input. A wireless device such as a cell phone ranks possible location input based on edit distance, which is a ‘confidence factor’. If there is no perfect match, then a list of geocode options is returned, preferably sorted by the score. The ‘closeness’ is derived by looking at the edit distance to go from the input to the matched address. Edit distance is defined herein as the number of insertion/deletion/replacement operations to go from input location to the possible matched location. In one embodiment, an option list, or ‘pick list’, may be provided based on an edit distance scoring system. The edit distance scoring system is preferably based on a number of keystrokes difference between the input location name and the possible matched location name.

    DETERMINING THE GEOGRAPHIC SCOPE OF WEB RESOURCES USING USER CLICK DATA
    8.
    Patent application
    Status: pending (published)

    Publication No.: US20100325129A1

    Publication date: 2010-12-23

    Application No.: US12488134

    Filing date: 2009-06-19

    IPC classification: G06F17/30

    CPC classification: G06F16/9535 G06F16/9537

    Abstract: A geographic region is automatically determined for an Internet resource based on information that has been gathered over time through the automatic monitoring of certain “click” activities of Internet search engine-using users. Over time, the search engine collects information for each click. Using this click-related data, the search engine estimates the geographic region with which the resource ought to be associated. The fact that a significant proportion of clicks on a resource's hyperlink are clicks that “came through” a search engine portal that is associated with a geographic region tends to suggest that the resource ought to be associated with that geographic region. Similarly, the fact that a significant proportion of clicks on a resource's hyperlink are clicks that were made by users whose computers have IP addresses that are associated with a geographic region tends to suggest that the resource ought to be associated with that geographic region.
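
The two signals named in this abstract, the region of the search portal a click came through and the region of the clicking user's IP address, can be combined in many ways. The Python sketch below shows one simple possibility and is not taken from the patent. It assumes each click is already reduced to a `(portal_region, ip_region)` pair, treats a share of at least `min_share` as a "significant proportion", and declines to answer when the two signals point to different regions. The names `dominant_region`, `estimate_region` and `min_share` are hypothetical.

```python
from collections import Counter


def dominant_region(regions, min_share):
    """Return the most frequent region if its share of known values is large enough."""
    known = [r for r in regions if r is not None]
    if not known:
        return None
    region, count = Counter(known).most_common(1)[0]
    return region if count / len(known) >= min_share else None


def estimate_region(clicks, min_share=0.6):
    """clicks: iterable of (portal_region, ip_region); either value may be None."""
    clicks = list(clicks)
    by_portal = dominant_region([portal for portal, _ in clicks], min_share)
    by_ip = dominant_region([ip for _, ip in clicks], min_share)

    # Assumed policy: conflicting signals yield no estimate.
    if by_portal and by_ip and by_portal != by_ip:
        return None
    return by_portal or by_ip
```

With seven of ten clicks coming through a "uk" portal from "uk" IP addresses, both signals clear the 0.6 threshold and the resource would be associated with "uk". An even split between two regions produces no estimate.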

    Method and Apparatus for Identifying if Two Websites are Co-Owned
    9.
    Patent application
    Status: pending (published)

    Publication No.: US20090228438A1

    Publication date: 2009-09-10

    Application No.: US12044339

    Filing date: 2008-03-07

    IPC classification: G06F7/06

    CPC classification: G06F21/6218

    Abstract: A method and apparatus are provided for identifying if two websites are co-owned. In one example, the method includes obtaining redirect URL (uniform resource locator) pairs from the Internet, constructing a training set using the redirect URL pairs, constructing a feature set based on the training set, and learning co-ownership decisions based on the feature set and the training set.
