System and method for data distribution
    5.
    Granted invention patent
    System and method for data distribution (in force)

    Publication number: US07568034B1

    Publication date: 2009-07-28

    Application number: US10613626

    Filing date: 2003-07-03

    IPC classification: G06F15/173

    CPC classification: G06F9/5033

    Abstract: A method of distributing files operates in a system having a master and a plurality of slaves, interconnected by a communications network. Each slave determines a current file length for each of a plurality of files and sends slave status information to the master, the slave status information including the current file length for each file. The master schedules copy operations based on the slave status information. The master stores bandwidth capability information indicating data transmission bandwidth capabilities for the resources required to transmit data between the slaves, and also stores bandwidth usage information indicating a total allocated bandwidth for each resource. For each scheduled copy operation, an amount of data transmission bandwidth is allocated and the stored bandwidth usage information is updated accordingly. The master only schedules copy operations that do not cause the total allocated bandwidth of any resource to exceed the bandwidth capability of that resource.
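
    The admission rule in this abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the class name `Master`, the method names, the Mbit/s unit, and the rule for picking copy candidates (a slave holding a shorter copy of a file receives data from a slave holding a longer one) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    """Hypothetical master that admits a copy only if every resource has headroom."""

    # Bandwidth capability of each resource (link, switch, ...), assumed in Mbit/s.
    capability: dict
    # Total bandwidth currently allocated on each resource.
    usage: dict = field(default_factory=dict)
    # Latest slave status: slave name -> {file name: current length in bytes}.
    status: dict = field(default_factory=dict)

    def report(self, slave, file_lengths):
        """Store the slave status information sent by one slave."""
        self.status[slave] = dict(file_lengths)

    def candidates(self):
        """Yield (source, target, file) where the target's copy is shorter (assumed rule)."""
        for target, files in self.status.items():
            for name, length in files.items():
                for source, other in self.status.items():
                    if other.get(name, 0) > length:
                        yield source, target, name

    def try_schedule(self, resources, bandwidth):
        """Allocate `bandwidth` on every resource the copy needs, or refuse the copy."""
        for r in resources:
            if self.usage.get(r, 0.0) + bandwidth > self.capability[r]:
                return False  # would exceed this resource's capability
        for r in resources:
            self.usage[r] = self.usage.get(r, 0.0) + bandwidth
        return True

    def release(self, resources, bandwidth):
        """Return bandwidth to the pool when a copy operation finishes."""
        for r in resources:
            self.usage[r] = self.usage.get(r, 0.0) - bandwidth
```

    The check runs over all resources before any allocation is recorded, so a refused copy leaves the stored usage information unchanged.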

    Assigning document identification tags
    6.
    Granted invention patent
    Assigning document identification tags (in force)

    Publication number: US09411889B2

    Publication date: 2016-08-09

    Application number: US13419349

    Filing date: 2012-03-13

    Abstract: Document identification tags are assigned to documents to be added to a collection of documents. Based on query-independent information about a new document, a document identification tag is assigned to the new document. The document identification tag so assigned is used in the indexing of the new document. When a list of document identification tags is produced by an index in response to a query, the list is approximately ordered with respect to a measure of query-independent relevance. In some embodiments, the measure of query-independent relevance is related to the connectivity matrix of the World Wide Web. In other embodiments, the measure is related to the recency of crawling. In still other embodiments, the measure is a mixture of these two. The provided systems and methods allow for real-time indexing of documents as they are crawled from a collection of documents.
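
    One way to obtain an approximately ordered tag list is to partition the tag space into tiers by score, as in the Python sketch below. The abstract does not say how tags are chosen, so the tiering scheme, the class name `TagAssigner`, and the parameters `boundaries` and `tier_size` are assumptions for illustration only.

```python
class TagAssigner:
    """Hypothetical assigner: a lower tag means a higher query-independent score tier."""

    def __init__(self, boundaries, tier_size):
        # Score thresholds separating the tiers, stored in descending order.
        self.boundaries = sorted(boundaries, reverse=True)
        # Number of tags reserved for each tier.
        self.tier_size = tier_size
        # Next unused offset inside each tier.
        self.next_offset = [0] * (len(self.boundaries) + 1)

    def assign(self, score):
        """Return the next free tag in the tier that `score` falls into."""
        # Tier 0 holds the highest scores; each threshold the score is below
        # pushes the document one tier down.
        tier = sum(1 for b in self.boundaries if score < b)
        offset = self.next_offset[tier]
        if offset >= self.tier_size:
            raise RuntimeError(f"tier {tier} is full")
        self.next_offset[tier] = offset + 1
        return tier * self.tier_size + offset
```

    Sorting documents by tag then groups them by tier but not by exact score within a tier, which is why the resulting order is only approximate. Because each tag is assigned when the document arrives, no global re-sort is needed as new documents are crawled.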

    Scheduler for search engine crawler
    7.
    Granted invention patent
    Scheduler for search engine crawler (in force)

    Publication number: US08707313B1

    Publication date: 2014-04-22

    Application number: US13031011

    Filing date: 2011-02-18

    IPC classification: G06F9/46 G06F7/00

    CPC classification: G06F17/30864

    Abstract: A search engine crawler includes a distributed set of schedulers that are associated with one or more segments of document identifiers (e.g., URLs) corresponding to documents on a network (e.g., WWW). Each scheduler handles the scheduling of document identifiers (for crawling) for a subset of the known document identifiers. Using a starting set of document identifiers, such as the document identifiers crawled (or scheduled for crawling) during the most recent completed crawl, the scheduler removes from the starting set those document identifiers that have been unreachable in each of the last X crawls. Other filtering mechanisms may also be used to filter out some of the document identifiers in the starting set. The resulting list of document identifiers is written to a scheduled output file for use in a next crawl cycle.
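
    The unreachable-in-each-of-the-last-X-crawls filter can be sketched in Python as follows. The function names, the shape of the `history` mapping, and the one-URL-per-line output format are assumptions for illustration; the abstract does not specify them.

```python
from pathlib import Path


def schedule_next_crawl(starting_set, history, x):
    """Drop URLs that were unreachable in each of the last `x` crawls.

    `history` maps a URL to a list of booleans, oldest crawl first,
    where True means the URL was reachable in that crawl (assumed layout).
    """
    if x <= 0:
        raise ValueError("x must be a positive number of crawls")
    scheduled = []
    for url in starting_set:
        recent = history.get(url, [])[-x:]
        # Remove only when there are x recorded crawls and all of them failed.
        if len(recent) == x and not any(recent):
            continue
        scheduled.append(url)
    return scheduled


def write_schedule(urls, path):
    """Write the scheduled URLs, one per line, for the next crawl cycle."""
    Path(path).write_text("".join(url + "\n" for url in urls), encoding="utf-8")
```

    A URL with fewer than X recorded crawls is kept in this sketch, since it cannot yet have failed X times in a row.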

    Document reuse in a search engine crawler
    8.
    Granted invention patent
    Document reuse in a search engine crawler (in force)

    Publication number: US08707312B1

    Publication date: 2014-04-22

    Application number: US10882955

    Filing date: 2004-06-30

    IPC classification: G06F9/46

    CPC classification: G06F17/30864

    Abstract: A search engine crawler includes a scheduler for determining which documents to download from their respective host servers. Some documents, known to be stable based on one or more records from prior crawls, are reused from a document repository. A reuse flag is set in a scheduler record that also contains a document identifier, the reuse flag indicating whether the document should be retrieved from a first database, such as the World Wide Web, or a second database, such as a document repository. A set of such scheduler records is used during a crawl by the search engine crawler to determine which database to use when retrieving the documents identified in the scheduler records.
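
    The scheduler record and its reuse flag can be sketched in Python as below. The stability test used here (an unchanged content checksum over the last few crawls), the names `SchedulerRecord`, `make_record`, and `retrieve`, and the default of three crawls are assumptions for illustration; the abstract only says stability is judged from records of prior crawls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchedulerRecord:
    url: str     # document identifier
    reuse: bool  # True: read from the repository; False: download from the web


def make_record(url, past_checksums, min_stable_crawls=3):
    """Set the reuse flag when the content was identical in recent crawls (assumed rule)."""
    recent = past_checksums[-min_stable_crawls:]
    stable = len(recent) == min_stable_crawls and len(set(recent)) == 1
    return SchedulerRecord(url=url, reuse=stable)


def retrieve(record, web_fetch, repository):
    """Pick the source named by the reuse flag and return the document."""
    if record.reuse:
        return repository[record.url]  # second database: document repository
    return web_fetch(record.url)       # first database: the World Wide Web
```

    Here `web_fetch` is any callable that downloads a URL and `repository` is any mapping from URL to stored content, so the sketch runs without network access.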

    Assigning document identification tags
    9.
    Granted invention patent
    Assigning document identification tags (in force)

    Publication number: US08136025B1

    Publication date: 2012-03-13

    Application number: US10613637

    Filing date: 2003-07-03

    IPC classification: G06F17/00

    Abstract: Document identification tags are assigned to documents to be added to a collection of documents. Based on query-independent information about a new document, a document identification tag is assigned to the new document. The document identification tag so assigned is used in the indexing of the new document. When a list of document identification tags is produced by an index in response to a query, the list is approximately ordered with respect to a measure of query-independent relevance. In some embodiments, the measure of query-independent relevance is related to the connectivity matrix of the World Wide Web. In other embodiments, the measure is related to the recency of crawling. In still other embodiments, the measure is a mixture of these two. The provided systems and methods allow for real-time indexing of documents as they are crawled from a collection of documents.

    Assigning Document Identification Tags
    10.
    Invention patent application
    Assigning Document Identification Tags (in force)

    Publication number: US20120173552A1

    Publication date: 2012-07-05

    Application number: US13419349

    Filing date: 2012-03-13

    IPC classification: G06F17/30

    Abstract: Document identification tags are assigned to documents to be added to a collection of documents. Based on query-independent information about a new document, a document identification tag is assigned to the new document. The document identification tag so assigned is used in the indexing of the new document. When a list of document identification tags is produced by an index in response to a query, the list is approximately ordered with respect to a measure of query-independent relevance. In some embodiments, the measure of query-independent relevance is related to the connectivity matrix of the World Wide Web. In other embodiments, the measure is related to the recency of crawling. In still other embodiments, the measure is a mixture of these two. The provided systems and methods allow for real-time indexing of documents as they are crawled from a collection of documents.
