Near-duplicate document detection for web crawling
    1.
    Type: granted invention patent
    Status: in force

    Publication number: US08140505B1

    Publication date: 2012-03-20

    Application number: US11094791

    Filing date: 2005-03-31

    IPC classification: G06F17/30

    CPC classification: G06F17/30864 G06F17/30949

    Abstract: A system generates a hash value for a fetched document and compares the hash value with a set of stored hash values to identify those stored hash values in which a sequence of bit positions, less than all of the bit positions, matches the corresponding sequence of bit positions of the hash value. The system also determines whether any of the identified hash values are substantially similar to the hash value, and identifies the fetched document as a near-duplicate of another document when one of the identified hash values is substantially similar to the hash value.
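The two-stage comparison in this abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the 64-bit hash width, the 16-bit matching prefix, the use of Hamming distance as the "substantially similar" test, the 3-bit threshold, and the names `NearDuplicateIndex`, `prefix`, and `hamming` are all assumptions made for the example.

```python
from collections import defaultdict

HASH_BITS = 64      # assumed fingerprint width
PREFIX_BITS = 16    # assumed length of the bit sequence that must match exactly
MAX_DIFF_BITS = 3   # assumed threshold for "substantially similar"


def prefix(h: int) -> int:
    """Return the top PREFIX_BITS bits of a HASH_BITS-wide hash."""
    return h >> (HASH_BITS - PREFIX_BITS)


def hamming(a: int, b: int) -> int:
    """Count the bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


class NearDuplicateIndex:
    """Stored hashes grouped by prefix (hypothetical helper class)."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def add(self, h: int) -> None:
        self._buckets[prefix(h)].append(h)

    def find_near_duplicate(self, h: int):
        # Stage 1: only stored hashes whose prefix bits match are candidates.
        for candidate in self._buckets.get(prefix(h), []):
            # Stage 2: full comparison of the candidate against the new hash.
            if hamming(candidate, h) <= MAX_DIFF_BITS:
                return candidate
        return None
```

A single prefix lookup misses stored hashes that differ from the new hash inside the prefix itself. A fuller design would presumably repeat the lookup over several different bit sequences, which this sketch leaves out.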

    Near-duplicate document detection for web crawling
    2.
    Type: granted invention patent
    Status: in force

    Publication number: US08548972B1

    Publication date: 2013-10-01

    Application number: US13422130

    Filing date: 2012-03-16

    IPC classification: G06F17/30

    CPC classification: G06F17/30864 G06F17/30949

    Abstract: A system generates a hash value for a fetched document and compares the hash value with a set of stored hash values to identify those stored hash values in which a sequence of bit positions, less than all of the bit positions, matches the corresponding sequence of bit positions of the hash value. The system also determines whether any of the identified hash values are substantially similar to the hash value, and identifies the fetched document as a near-duplicate of another document when one of the identified hash values is substantially similar to the hash value.

    Highly compressed randomly accessed storage of large tables with arbitrary columns
    3.
    Type: granted invention patent
    Status: in force

    Publication number: US07496589B1

    Publication date: 2009-02-24

    Application number: US11178655

    Filing date: 2005-07-09

    Abstract: A table, such as a database table, can be partitioned into blocks that are conveniently sized for storage and retrieval. The storage space required, and the time taken to store and retrieve blocks, are proportional to the size of the blocks. Compressing the blocks therefore reduces the space required and increases speed. The columns in a table, and therefore the rows in a transposed block, tend to contain similar data. Compression algorithms can work more efficiently when sequential data items are similar. Therefore, transposing the blocks before compression, or compressing them in a column-wise manner, leads to better compression. Different compression algorithms can be used for each set of columnar data to yield even better compression.
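The transposition step can be illustrated with a short sketch. It assumes a block is a list of equal-length tuples of strings whose values contain no commas or newlines, and it uses `zlib` for every column purely for convenience. The abstract does not name a codec, and the function names are hypothetical.

```python
import zlib


def compress_row_wise(rows):
    """Serialize the block row by row, then compress it as one stream."""
    data = "\n".join(",".join(row) for row in rows).encode("utf-8")
    return zlib.compress(data)


def compress_column_wise(rows):
    """Transpose the block so each column is contiguous, then compress
    each column separately (a different codec per column could be used)."""
    columns = list(zip(*rows))  # transpose: rows -> columns
    return [zlib.compress("\n".join(col).encode("utf-8")) for col in columns]


def decompress_column_wise(blobs):
    """Invert compress_column_wise and rebuild the original rows."""
    columns = [zlib.decompress(b).decode("utf-8").split("\n") for b in blobs]
    return [tuple(row) for row in zip(*columns)]
```

Comparing `len(compress_row_wise(rows))` with `sum(len(b) for b in compress_column_wise(rows))` on a real block shows whether the column-wise layout pays off for that data. The result depends on how similar the values within each column are.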

    Pre-computed impression lists
    4.
    Type: granted invention patent
    Status: in force

    Publication number: US08521718B1

    Publication date: 2013-08-27

    Application number: US13474165

    Filing date: 2012-05-17

    IPC classification: G06F17/30

    Abstract: Systems, methods, and computer program products identify one or more web page impressions satisfying one or more simple queries, each of the one or more web page impressions being associated with a respective impression ID. Respective impression IDs of the one or more web pages satisfying the one or more simple queries are stored in an impression log. Subsequent to storing the respective impression IDs, a query is received from a client device, and a number of impression IDs for the one or more web pages satisfying the query are identified based on the identified one or more web page impressions satisfying the one or more simple queries.
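One way to picture the pre-computation is sketched below. The abstract does not say how a client query is combined from the stored lists. This example assumes the query is a conjunction of named simple queries and answers it by intersecting their stored impression ID sets. The names `precompute_impression_lists` and `count_matching_impressions`, and the shape of the impression records, are hypothetical.

```python
def precompute_impression_lists(impressions, simple_queries):
    """Build the stored lists: for each named simple query, the set of
    impression IDs whose attributes satisfy it.

    impressions: dict mapping impression ID -> attribute dict (assumed shape)
    simple_queries: dict mapping query name -> predicate over attributes
    """
    return {
        name: {imp_id for imp_id, attrs in impressions.items() if predicate(attrs)}
        for name, predicate in simple_queries.items()
    }


def count_matching_impressions(impression_lists, query_terms):
    """Answer a later client query from the stored lists alone.

    Assumption: the query is the AND of the named simple queries, so the
    matching impressions are the intersection of their ID sets.
    """
    id_sets = [impression_lists[term] for term in query_terms]
    if not id_sets:
        return 0
    return len(set.intersection(*id_sets))
```

Because the ID sets are computed ahead of time, answering the client query touches only the stored lists and never rescans the raw impressions.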

    Pre-computed impression lists
    5.
    Type: granted invention patent
    Status: in force

    Publication number: US08214350B1

    Publication date: 2012-07-03

    Application number: US12348170

    Filing date: 2009-01-02

    IPC classification: G06F7/00 G06F17/30

    Abstract: Systems, methods, and computer program products identify one or more web page impressions satisfying one or more simple queries, each of the one or more web page impressions being associated with a respective impression ID. Respective impression IDs of the one or more web pages satisfying the one or more simple queries are stored in an impression log. Subsequent to storing the respective impression IDs, a query is received from a client device, and a number of impression IDs for the one or more web pages satisfying the query are identified based on the identified one or more web page impressions satisfying the one or more simple queries.

    System and method for searching peer-to-peer computer networks
    6.
    Type: granted invention patent
    Status: in force

    Publication number: US07730178B2

    Publication date: 2010-06-01

    Application number: US11445080

    Filing date: 2006-05-31

    IPC classification: G06F15/173

    Abstract: A method and system for intelligently directing a search of a peer-to-peer network, in which a user performing a search is assisted in choosing a host which is likely to return fast, favorable results to the user. A host monitor monitors the peer-to-peer network and collects data on various characteristics of the hosts which make up the network. Thereafter, a host selector ranks the hosts using the data, and passes this information to the user. The user then selects one or more of the highly-ranked hosts as an entry point into the network. Additionally, a cache may collect a list of hosts based on the content on the hosts. In this way, a user may choose to connect to a host which is known to contain information relevant to the user's search. The host selector may be used to select from among the hosts listed in the cache.
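The monitor, selector, and cache roles described in this abstract can be sketched roughly as below. The abstract does not specify which host characteristics are collected or how they are weighted. The `HostStats` fields, the ordering rule (more shared files first, then lower latency), and both function names are assumptions chosen only to make the example concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostStats:
    """Data a host monitor might collect for one host (hypothetical fields)."""
    address: str
    latency_ms: float   # assumed characteristic: measured response time
    files_shared: int   # assumed characteristic: number of files offered


def rank_hosts(hosts, top_n=3):
    """Host selector: order hosts best-first and keep the top_n.

    Assumed rule: more shared files first, ties broken by lower latency.
    """
    ranked = sorted(hosts, key=lambda h: (-h.files_shared, h.latency_ms))
    return ranked[:top_n]


def rank_cached_hosts(content_cache, keyword, stats_by_address, top_n=3):
    """Apply the selector only to hosts the cache lists for a keyword.

    content_cache: dict mapping keyword -> set of host addresses
    stats_by_address: dict mapping address -> HostStats
    """
    addresses = content_cache.get(keyword, set())
    candidates = [stats_by_address[a] for a in addresses if a in stats_by_address]
    return rank_hosts(candidates, top_n)
```

The user would then pick an entry point from the returned list. The sketch stops at producing the ranking.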

    Calculating flight plans for reservation-based ad serving
    7.
    Type: granted invention patent
    Status: in force

    Publication number: US09053492B1

    Publication date: 2015-06-09

    Application number: US11551151

    Filing date: 2006-10-19

    IPC classification: G06Q30/00 G06Q30/02

    CPC classification: G06Q30/0224

    Abstract: The disclosure provides various embodiments of systems, methods, and software for supporting server-side product catalogs. Software for managing ad serving may comprise computer readable instructions embodied on media and be operable to identify a logically local directed graph representing a logically remote network property associated with a publisher. The network property is associated with at least one product catalog representing a package of network ad slots. The software may then generate an ad service flight plan for serving various ones of a plurality of ads associated with a first of the network ad slots using an iterative solution on the directed graph.

    Single pass space efficent system and method for generating approximate quantiles satisfying an apriori user-defined approximation error
    8.
    Type: granted invention patent
    Status: lapsed

    Publication number: US6108658A

    Publication date: 2000-08-22

    Application number: US50434

    Filing date: 1998-03-30

    IPC classification: G06F7/22 G06F17/30

    Abstract: A system and method for finding an ε-approximate φ-quantile data element of a data set with N data elements in a single pass over the data set. The ε-approximate φ-quantile data element is guaranteed to lie within a user-specified approximation error ε of the true φ-quantile data element being sought. Initially, b buffers, each having a capacity of k elements, are filled with sorted data elements from the data set, with the values of b and k depending on ε and N. The buffers are then collapsed into an output buffer, the remaining buffers are refilled with data elements and collapsed (along with the previous output buffer), and so on until the entire data set has been processed and a single output buffer remains. The data element of the output buffer corresponding to the ε-approximate φ-quantile is then output as the approximate φ-quantile data element. If desired, the system and method can be practiced with sampling to reduce even further the amount of space required to find a desired ε-approximate φ-quantile data element.
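The fill-and-collapse loop can be sketched as below. This is a simplified illustration under stated assumptions rather than the patented procedure. The caller supplies `b` and `k` directly, because the abstract does not give the formula that derives them from ε and N. All full buffers are collapsed at once whenever `b` of them are full, a trailing partial buffer is folded in only at output time, and the offset rule in `collapse` is a simplification. The function names are hypothetical.

```python
import math


def collapse(buffers, k):
    """Merge weighted sorted buffers into one buffer of k elements.

    Each buffer is (weight, sorted_items). Conceptually every item is
    repeated `weight` times, the copies are merged in order, and k
    evenly spaced positions are kept.
    """
    total_weight = sum(w for w, _ in buffers)
    merged = sorted((x, w) for w, items in buffers for x in items)
    offset = (total_weight + 1) // 2          # simplified offset choice
    targets = [j * total_weight + offset for j in range(k)]  # 1-indexed
    out, pos, t = [], 0, 0
    for x, w in merged:
        pos += w                              # last position covered by x
        while t < k and targets[t] <= pos:
            out.append(x)
            t += 1
    return (total_weight, out)


def approximate_quantile(stream, phi, b, k):
    """Single pass over `stream` using at most b buffers of k elements."""
    if b < 2:
        raise ValueError("this sketch needs at least two buffers")
    buffers, current = [], []
    for x in stream:
        current.append(x)
        if len(current) == k:                 # buffer full: sort and store
            buffers.append((1, sorted(current)))
            current = []
            if len(buffers) == b:             # out of buffers: collapse all
                buffers = [collapse(buffers, k)]
    if current:                               # trailing partial buffer
        buffers.append((1, sorted(current)))
    merged = sorted((x, w) for w, items in buffers for x in items)
    total = sum(w for _, w in merged)
    if total == 0:
        raise ValueError("empty stream")
    target = max(1, math.ceil(phi * total))   # weighted rank to report
    pos = 0
    for x, w in merged:
        pos += w
        if pos >= target:
            return x
```

Memory use stays at roughly `b * k` elements no matter how long the stream is. The price is that the returned element is only approximately at rank φN.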

    System and method for optimizing access to information in peer-to-peer computer networks
    9.
    Type: granted invention patent
    Status: in force

    Publication number: US07454480B2

    Publication date: 2008-11-18

    Application number: US11444648

    Filing date: 2006-05-31

    IPC classification: G06F15/16

    Abstract: A method and system for intelligently directing a search of a peer-to-peer network, in which a user performing a search is assisted in choosing a host which is likely to return fast, favorable results to the user. A host monitor monitors the peer-to-peer network and collects data on various characteristics of the hosts which make up the network. Thereafter, a host selector ranks the hosts using the data, and passes this information to the user. The user then selects one or more of the highly-ranked hosts as an entry point into the network. Additionally, a cache may collect a list of hosts based on the content on the hosts. In this way, a user may choose to connect to a host which is known to contain information relevant to the user's search. The host selector may be used to select from among the hosts listed in the cache.

    System and method for searching peer-to-peer computer networks by selecting a computer based on at least a number of files shared by the computer
    10.
    Type: granted invention patent
    Status: in force

    Publication number: US07089301B1

    Publication date: 2006-08-08

    Application number: US09635777

    Filing date: 2000-08-11

    IPC classification: G06F15/173

    Abstract: A method and system for intelligently directing a search of a peer-to-peer network, in which a user performing a search is assisted in choosing a host which is likely to return fast, favorable results to the user. A host monitor monitors the peer-to-peer network and collects data on various characteristics of the hosts which make up the network. Thereafter, a host selector ranks the hosts using the data, and passes this information to the user. The user then selects one or more of the highly-ranked hosts as an entry point into the network. Additionally, a cache may collect a list of hosts based on the content on the hosts. In this way, a user may choose to connect to a host which is known to contain information relevant to the user's search. The host selector may be used to select from among the hosts listed in the cache.
