-
Publication No.: US20120066576A1
Publication date: 2012-03-15
Application No.: US13300516
Filing date: 2011-11-18
IPC classification: G06F15/00
CPC classification: G06F17/30014, G06F17/2235, G06F17/241, G06F17/2705, G06F17/30321, G06F17/30864
Abstract: Provided is a method and system for indexing documents in a collection of linked documents. A link log, including one or more pairings of source documents and target documents, is accessed. A sorted anchor map, containing one or more target document to source document pairings, is generated. The pairings in the sorted anchor map are ordered based on target document identifiers.
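The inversion described in this abstract can be illustrated with a minimal sketch. It assumes the link log is a sequence of (source_id, target_id) integer pairs; the function name `build_sorted_anchor_map` and the in-memory grouping are illustrative assumptions, not details taken from the patent.

```python
from collections import defaultdict


def build_sorted_anchor_map(link_log):
    """Invert (source_id, target_id) pairs into target -> sources entries.

    Hypothetical helper: entries are ordered by target document identifier,
    as the abstract describes; ordering of sources within an entry is an
    added assumption for determinism.
    """
    by_target = defaultdict(list)
    for source_id, target_id in link_log:
        by_target[target_id].append(source_id)
    # Order entries by target identifier.
    return [(target, sorted(by_target[target])) for target in sorted(by_target)]


link_log = [(1, 7), (2, 3), (5, 7), (1, 3)]
anchor_map = build_sorted_anchor_map(link_log)
print(anchor_map)
```

Ordering by target identifier means all links pointing at one document sit in a single contiguous entry, so an indexer can read them in one pass.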
-
Publication No.: US07308643B1
Publication date: 2007-12-11
Application No.: US10614113
Filing date: 2003-07-03
CPC classification: G06F17/30014, G06F17/2235, G06F17/241, G06F17/2705, G06F17/30321, G06F17/30864
Abstract: Provided is a method and system for indexing documents in a collection of linked documents. A link log, including one or more pairings of source documents and target documents, is accessed. A sorted anchor map, containing one or more target document to source document pairings, is generated. The pairings in the sorted anchor map are ordered based on target document identifiers.
-
Publication No.: US09305091B2
Publication date: 2016-04-05
Application No.: US13300516
Filing date: 2011-11-18
CPC classification: G06F17/30014, G06F17/2235, G06F17/241, G06F17/2705, G06F17/30321, G06F17/30864
Abstract: Provided is a method and system for indexing documents in a collection of linked documents. A link log, including one or more pairings of source documents and target documents, is accessed. A sorted anchor map, containing one or more target document to source document pairings, is generated. The pairings in the sorted anchor map are ordered based on target document identifiers.
-
Publication No.: US08484548B1
Publication date: 2013-07-09
Application No.: US11936421
Filing date: 2007-11-07
IPC classification: G06F17/00
CPC classification: G06F17/30014, G06F17/2235, G06F17/241, G06F17/2705, G06F17/30321, G06F17/30864
Abstract: Provided is a method and system for indexing documents in a collection of linked documents. A link log, including one or more pairings of source documents and target documents, is accessed. A sorted anchor map, containing one or more target document to source document pairings, is generated. The pairings in the sorted anchor map are ordered based on target document identifiers.
-
Publication No.: US07568034B1
Publication date: 2009-07-28
Application No.: US10613626
Filing date: 2003-07-03
IPC classification: G06F15/173
CPC classification: G06F9/5033
Abstract: A method of distributing files operates in a system having a master and a plurality of slaves, interconnected by a communications network. Each slave determines a current file length for each of a plurality of files and sends slave status information to the master, the slave status information including the current file length for each file. The master schedules copy operations based on the slave status information. The master stores bandwidth capability information indicating data transmission bandwidth capabilities for the resources required to transmit data between the slaves, and also stores bandwidth usage information indicating a total allocated bandwidth for each resource. For each scheduled copy operation, an amount of data transmission bandwidth is allocated and the stored bandwidth usage information is updated accordingly. The master only schedules copy operations that do not cause the total allocated bandwidth of any resource to exceed the bandwidth capability of that resource.
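The admission rule in this abstract (schedule a copy only if no resource would exceed its bandwidth capability) can be sketched as follows. The class name `BandwidthScheduler`, the resource names, and the units are hypothetical; the abstract does not say how the master represents resources.

```python
class BandwidthScheduler:
    """Toy model of the master's bandwidth bookkeeping (illustrative only)."""

    def __init__(self, capacity):
        # capacity: resource name -> bandwidth capability (arbitrary units)
        self.capacity = dict(capacity)
        # Total bandwidth currently allocated on each resource.
        self.allocated = {resource: 0 for resource in capacity}

    def try_schedule(self, resources, bandwidth):
        """Allocate `bandwidth` on every listed resource, or on none of them."""
        if any(self.allocated[r] + bandwidth > self.capacity[r] for r in resources):
            return False  # at least one resource would be oversubscribed
        for r in resources:
            self.allocated[r] += bandwidth
        return True


master = BandwidthScheduler({"link_a": 100, "switch": 150})
print(master.try_schedule(["link_a", "switch"], 60))  # fits on both resources
print(master.try_schedule(["link_a", "switch"], 60))  # link_a would reach 120 > 100
print(master.try_schedule(["switch"], 90))            # switch reaches exactly 150
```

Checking every resource before updating any of them keeps the usage table consistent when a copy is rejected.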
-
Publication No.: US09411889B2
Publication date: 2016-08-09
Application No.: US13419349
Filing date: 2012-03-13
Applicant: Huican Zhu, Anurag Acharya
Inventor: Huican Zhu, Anurag Acharya
CPC classification: G06F17/30864, G06F17/30112, G06F17/303, G06F17/3053, G06F17/30867, G06Q30/02, G06Q30/0246, H04L29/06, H04L29/08072, H04L29/0809, H04L41/12, H04L41/22, H04L63/08, H04L63/102, H04L67/14, H04L67/2804
Abstract: Document identification tags are assigned to documents to be added to a collection of documents. Based on query-independent information about a new document, a document identification tag is assigned to the new document. The document identification tag so assigned is used in the indexing of the new document. When a list of document identification tags is produced by an index in response to a query, the list is approximately ordered with respect to a measure of query-independent relevance. In some embodiments, the measure of query-independent relevance is related to the connectivity matrix of the World Wide Web. In other embodiments, the measure is related to the recency of crawling. In still other embodiments, the measure is a mixture of these two. The provided systems and methods allow for real-time indexing of documents as they are crawled from a collection of documents.
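One way to obtain tags whose numeric order approximates query-independent relevance is to reserve a block of the tag space for each relevance tier. The sketch below assumes this tiered scheme; the abstract does not specify the assignment mechanism, and `TagAssigner`, `tier_bounds`, and `tier_size` are hypothetical names.

```python
class TagAssigner:
    """Assign integer tags so that lower tags go to higher-scoring documents."""

    def __init__(self, tier_bounds, tier_size):
        # tier_bounds: descending score thresholds, e.g. [0.8, 0.5] gives 3 tiers.
        self.tier_bounds = tier_bounds
        self.tier_size = tier_size
        # Next free offset inside each tier's block of tags.
        self.next_offset = [0] * (len(tier_bounds) + 1)

    def assign(self, score):
        # Tier index = number of thresholds the score falls below.
        tier = sum(1 for bound in self.tier_bounds if score < bound)
        offset = self.next_offset[tier]
        if offset >= self.tier_size:
            raise OverflowError(f"tier {tier} is full")
        self.next_offset[tier] += 1
        return tier * self.tier_size + offset


assigner = TagAssigner(tier_bounds=[0.8, 0.5], tier_size=1000)
tags = [assigner.assign(score) for score in (0.6, 0.9, 0.1, 0.95)]
print(tags)
```

Because a tag is chosen the moment a document arrives, documents can be indexed as they are crawled, and a posting list sorted by tag is ordered by tier (only approximately by score, since order within a tier follows arrival).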
-
Publication No.: US08707313B1
Publication date: 2014-04-22
Application No.: US13031011
Filing date: 2011-02-18
CPC classification: G06F17/30864
Abstract: A search engine crawler includes a distributed set of schedulers that are associated with one or more segments of document identifiers (e.g., URLs) corresponding to documents on a network (e.g., WWW). Each scheduler handles the scheduling of document identifiers (for crawling) for a subset of the known document identifiers. Using a starting set of document identifiers, such as the document identifiers crawled (or scheduled for crawling) during the most recent completed crawl, the scheduler removes from the starting set those document identifiers that have been unreachable in each of the last X crawls. Other filtering mechanisms may also be used to filter out some of the document identifiers in the starting set. The resulting list of document identifiers is written to a scheduled output file for use in a next crawl cycle.
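The unreachability filter in this abstract can be sketched as below. The function name `filter_starting_set` and the shape of `reach_history` (URL mapped to a list of booleans, oldest crawl first) are assumptions for illustration; URLs with fewer than X recorded crawls are kept, which is also an assumption.

```python
def filter_starting_set(starting_set, reach_history, x):
    """Drop URLs that were unreachable in each of the last `x` crawls."""
    kept = []
    for url in starting_set:
        recent = reach_history.get(url, [])[-x:]
        # Remove only when there are x recent crawls and all of them failed.
        if len(recent) == x and not any(recent):
            continue
        kept.append(url)
    return kept


history = {
    "a": [True, False, False],   # reachable once within the last three crawls
    "b": [False, False, False],  # unreachable in all of the last three crawls
    "c": [False],                # too little history to decide
}
print(filter_starting_set(["a", "b", "c"], history, x=3))
```

The returned list plays the role of the scheduled output for the next crawl cycle; writing it to a file is omitted here.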
-
Publication No.: US08707312B1
Publication date: 2014-04-22
Application No.: US10882955
Filing date: 2004-06-30
IPC classification: G06F9/46
CPC classification: G06F17/30864
Abstract: A search engine crawler includes a scheduler for determining which documents to download from their respective host servers. Some documents, known to be stable based on one or more records from prior crawls, are reused from a document repository. A reuse flag is set in a scheduler record that also contains a document identifier, the reuse flag indicating whether the document should be retrieved from a first database, such as the World Wide Web, or a second database, such as a document repository. A set of such scheduler records is used during a crawl by the search engine crawler to determine which database to use when retrieving the documents identified in the scheduler records.
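A scheduler record with a reuse flag might be modelled as in the sketch below. `SchedulerRecord`, `fetch`, and the dictionary-backed repository are hypothetical; falling back to a download when the repository lacks the document is an added assumption, not something the abstract states.

```python
from dataclasses import dataclass


@dataclass
class SchedulerRecord:
    url: str     # document identifier
    reuse: bool  # True: read from the repository; False: download from the web


def fetch(record, repository, download):
    """Return the document content from the source selected by the reuse flag."""
    if record.reuse and record.url in repository:
        return repository[record.url]
    # Assumed fallback: download when not flagged for reuse or not stored.
    return download(record.url)


repository = {"http://example.com/stable": "cached copy"}


def fake_download(url):
    # Stand-in for a real HTTP fetch.
    return "fresh copy of " + url


print(fetch(SchedulerRecord("http://example.com/stable", True), repository, fake_download))
print(fetch(SchedulerRecord("http://example.com/stable", False), repository, fake_download))
```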
-
Publication No.: US08136025B1
Publication date: 2012-03-13
Application No.: US10613637
Filing date: 2003-07-03
Applicant: Huican Zhu, Anurag Acharya
Inventor: Huican Zhu, Anurag Acharya
IPC classification: G06F17/00
CPC classification: G06F17/30864, G06F17/30112, G06F17/303, G06F17/3053, G06F17/30867, G06Q30/02, G06Q30/0246, H04L29/06, H04L29/08072, H04L29/0809, H04L41/12, H04L41/22, H04L63/08, H04L63/102, H04L67/14, H04L67/2804
Abstract: Document identification tags are assigned to documents to be added to a collection of documents. Based on query-independent information about a new document, a document identification tag is assigned to the new document. The document identification tag so assigned is used in the indexing of the new document. When a list of document identification tags is produced by an index in response to a query, the list is approximately ordered with respect to a measure of query-independent relevance. In some embodiments, the measure of query-independent relevance is related to the connectivity matrix of the World Wide Web. In other embodiments, the measure is related to the recency of crawling. In still other embodiments, the measure is a mixture of these two. The provided systems and methods allow for real-time indexing of documents as they are crawled from a collection of documents.
-
Publication No.: US20120173552A1
Publication date: 2012-07-05
Application No.: US13419349
Filing date: 2012-03-13
Applicant: Huican Zhu, Anurag Acharya
Inventor: Huican Zhu, Anurag Acharya
IPC classification: G06F17/30
CPC classification: G06F17/30864, G06F17/30112, G06F17/303, G06F17/3053, G06F17/30867, G06Q30/02, G06Q30/0246, H04L29/06, H04L29/08072, H04L29/0809, H04L41/12, H04L41/22, H04L63/08, H04L63/102, H04L67/14, H04L67/2804
Abstract: Document identification tags are assigned to documents to be added to a collection of documents. Based on query-independent information about a new document, a document identification tag is assigned to the new document. The document identification tag so assigned is used in the indexing of the new document. When a list of document identification tags is produced by an index in response to a query, the list is approximately ordered with respect to a measure of query-independent relevance. In some embodiments, the measure of query-independent relevance is related to the connectivity matrix of the World Wide Web. In other embodiments, the measure is related to the recency of crawling. In still other embodiments, the measure is a mixture of these two. The provided systems and methods allow for real-time indexing of documents as they are crawled from a collection of documents.