Representative document selection for a set of duplicate documents
    2.
    Granted patent (in force)

    Publication number: US08868559B2

    Publication date: 2014-10-21

    Application number: US13599707

    Filing date: 2012-08-30

    IPC classification: G06F7/00 G06F17/30

    Abstract: Systems and methods for indexing a representative document from a set of duplicate documents are disclosed. Disclosed systems and methods comprise selecting a first document in a plurality of documents on the basis that the first document is associated with a query-independent score. Each respective document in the plurality of documents has a fingerprint that indicates that the respective document has substantially identical content to every other document in the plurality of documents. Disclosed systems and methods further comprise indexing, in accordance with the query-independent score, the first document, thereby producing an indexed first document. With respect to the plurality of documents, only the indexed first document is included in a document index.
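
The selection step in this abstract can be illustrated with a minimal sketch. It assumes that documents with equal fingerprints form one duplicate set and that the highest query-independent score wins; the abstract only says the score is the basis for selection, so that rule, the `Doc` type, and both function names are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Doc(NamedTuple):
    url: str
    fingerprint: str  # hypothetical content fingerprint; equal values mean duplicate content
    score: float      # hypothetical query-independent score (for example a link-based rank)


def select_representatives(docs):
    """Return one document per fingerprint: the one with the highest score."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.fingerprint].append(doc)
    # Assumption: highest score wins; ties are broken by URL so the choice is deterministic.
    return {fp: max(group, key=lambda d: (d.score, d.url)) for fp, group in groups.items()}


def build_index(docs):
    """Index only the representative of each duplicate set, keyed by URL."""
    return {rep.url: rep for rep in select_representatives(docs).values()}
```

Calling `build_index` on a list that contains two documents with the same fingerprint yields an index with only the higher-scoring of the two.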

    REPRESENTATIVE DOCUMENT SELECTION FOR A SET OF DUPLICATE DOCUMENTS
    3.
    Patent application (in force)

    Publication number: US20120323896A1

    Publication date: 2012-12-20

    Application number: US13599707

    Filing date: 2012-08-30

    IPC classification: G06F17/30

    Abstract: Systems and methods for indexing a representative document from a set of duplicate documents are disclosed. Disclosed systems and methods comprise selecting a first document in a plurality of documents on the basis that the first document is associated with a query-independent score. Each respective document in the plurality of documents has a fingerprint that indicates that the respective document has substantially identical content to every other document in the plurality of documents. Disclosed systems and methods further comprise indexing, in accordance with the query-independent score, the first document, thereby producing an indexed first document. With respect to the plurality of documents, only the indexed first document is included in a document index.

    Duplicate document detection in a web crawler system
    7.
    Granted patent (in force)

    Publication number: US07627613B1

    Publication date: 2009-12-01

    Application number: US10614111

    Filing date: 2003-07-03

    IPC classification: G06F12/00 G06F17/30

    Abstract: Duplicate documents are detected in a web crawler system. Upon receiving a newly crawled document, a set of documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents. Duplicate documents are included in or excluded from the new set of documents based on a query-independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.
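
The merge-and-select flow in this abstract might look roughly like the following sketch. It is an illustration under stated assumptions, not the patented method: content identity is approximated with a SHA-256 hash, the "predefined conditions" are reduced to "highest score wins", and the `DupServer` class and its `max_set_size` cap are hypothetical.

```python
import hashlib


class DupServer:
    """Toy in-memory duplicate tracker; all names here are illustrative."""

    def __init__(self, max_set_size=3):
        self.sets = {}  # content hash -> list of (url, score), best first
        self.max_set_size = max_set_size

    def add(self, url, content, score):
        """Merge a newly crawled page into its duplicate set; return the representative URL."""
        key = hashlib.sha256(content.encode("utf-8")).hexdigest()
        # Start from the existing set (if any), dropping a stale entry for the same URL.
        members = [m for m in self.sets.get(key, []) if m[0] != url]
        members.append((url, score))
        # Assumption: members are ranked by score and low-scoring duplicates are excluded.
        members.sort(key=lambda m: m[1], reverse=True)
        self.sets[key] = members[: self.max_set_size]
        return self.sets[key][0][0]
```

Each call to `add` returns the current representative of the set the page landed in, so a caller could decide whether the new page or an earlier duplicate should be indexed.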

    System and method for data distribution
    8.
    Granted patent (in force)

    Publication number: US07568034B1

    Publication date: 2009-07-28

    Application number: US10613626

    Filing date: 2003-07-03

    IPC classification: G06F15/173

    CPC classification: G06F9/5033

    Abstract: A method of distributing files operates in a system having a master and a plurality of slaves, interconnected by a communications network. Each slave determines a current file length for each of a plurality of files and sends slave status information to the master, the slave status information including the current file length for each file. The master schedules copy operations based on the slave status information. The master stores bandwidth capability information indicating data transmission bandwidth capabilities for the resources required to transmit data between the slaves, and also stores bandwidth usage information indicating a total allocated bandwidth for each resource. For each scheduled copy operation, an amount of data transmission bandwidth is allocated and the stored bandwidth usage information is updated accordingly. The master only schedules copy operations that do not cause the total allocated bandwidth of any resource to exceed the bandwidth capability of that resource.
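
The admission rule in the last sentence of this abstract can be sketched as a simple greedy check. The function name, the request tuple layout, and the resource names are assumptions made for illustration; the abstract does not say in what order the master considers candidate copy operations.

```python
def schedule_copies(requests, capacity, usage=None):
    """Greedily admit copy operations that fit within every resource's bandwidth capability.

    requests: list of (name, bandwidth, resources) tuples, where resources are the
              links or switches the transfer would cross (hypothetical names).
    capacity: dict mapping resource -> bandwidth capability.
    usage:    dict mapping resource -> bandwidth already allocated (updated in place).
    """
    usage = {} if usage is None else usage
    scheduled = []
    for name, bandwidth, resources in requests:
        # Admit the copy only if no resource on its path would exceed its capability.
        fits = all(usage.get(r, 0) + bandwidth <= capacity[r] for r in resources)
        if fits:
            for r in resources:
                usage[r] = usage.get(r, 0) + bandwidth
            scheduled.append(name)
    return scheduled, usage
```

A request that would push any one resource over its capability is skipped in this pass and leaves the usage table unchanged.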

    Using text surrounding hypertext links when indexing and generating page summaries
    9.
    Granted patent (in force)

    Publication number: US08495483B1

    Publication date: 2013-07-23

    Application number: US10386110

    Filing date: 2003-03-12

    IPC classification: G06F17/00 G06F17/30

    CPC classification: G06F17/30864

    Abstract: Web quotes are gathered from web pages that link to a web page of interest. A web quote may include text from the paragraphs that contain the hypertext links to the page of interest as well as text from other portions of the linked web page, such as text from a nearby header. The obtained web quotes may be ranked based on quality or relevance and may then be incorporated into a search engine's document index or into summary information returned to users in response to a search query.
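
The gathering step in this abstract can be approximated with the standard-library HTML parser. This sketch covers only the first part: it collects the text of each explicitly closed `<p>` element that contains a link to the page of interest. It does not capture nearby header text and does not rank the quotes. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class QuoteExtractor(HTMLParser):
    """Collect the text of each <p> that contains a link to target_url."""

    def __init__(self, target_url):
        super().__init__()
        self.target_url = target_url
        self.quotes = []
        self._in_p = False
        self._text = []
        self._has_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p, self._text, self._has_link = True, [], False
        elif tag == "a" and self._in_p and dict(attrs).get("href") == self.target_url:
            self._has_link = True

    def handle_data(self, data):
        if self._in_p:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            if self._has_link:
                # Collapse runs of whitespace so the quote reads as one line.
                self.quotes.append(" ".join("".join(self._text).split()))
            self._in_p = False


def web_quotes(html, target_url):
    """Return the paragraphs of a linking page that point at target_url."""
    parser = QuoteExtractor(target_url)
    parser.feed(html)
    parser.close()
    return parser.quotes
```

Matching the `href` by exact string comparison is a simplification; a real crawler would normalize URLs before comparing them.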

    Distributed crawling of hyperlinked documents
    10.
    Granted patent (in force)

    Publication number: US08812478B1

    Publication date: 2014-08-19

    Application number: US13608598

    Filing date: 2012-09-10

    IPC classification: G06F17/30

    CPC classification: G06F17/30864

    Abstract: Techniques for crawling hyperlinked documents are provided. Hyperlinked documents to be crawled are grouped by host, and the host to be crawled next is selected according to a stall time of the host. The stall time can indicate the earliest time that the host should be crawled, and the stall times can be a predetermined amount of time, vary by host, and be adjusted according to actual retrieval times from the host.
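
Stall-time host selection as described in this abstract maps naturally onto a priority queue. The sketch below is one possible reading: the stall time is the earliest moment a host may be crawled again, and it is pushed back in proportion to the last retrieval time. The `HostScheduler` class, the `default_delay` floor, and the `factor` multiplier are assumptions, not values from the patent.

```python
import heapq


class HostScheduler:
    """Pick the next host to crawl by earliest stall time (illustrative names)."""

    def __init__(self, default_delay=1.0, factor=10.0):
        self.default_delay = default_delay  # assumed minimum delay between fetches, in seconds
        self.factor = factor                # assumed multiplier on the observed retrieval time
        self._heap = []                     # entries are (stall_time, host)

    def add_host(self, host, now=0.0):
        heapq.heappush(self._heap, (now, host))

    def next_host(self, now):
        """Return a host whose stall time has passed, or None if every host is stalled."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[1]
        return None

    def done(self, host, now, retrieval_time):
        """Re-queue the host with a stall time scaled to how long the fetch took."""
        delay = max(self.default_delay, self.factor * retrieval_time)
        heapq.heappush(self._heap, (now + delay, host))
```

A slow host is automatically visited less often under this rule, because a long retrieval time yields a later stall time.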
