-
Publication No.: US07676553B1
Publication date: 2010-03-09
Application No.: US10750011
Filing date: 2003-12-31
IPC classification: G06F15/16
CPC classification: G06F17/30864
Abstract: A system and method facilitating incremental web crawl(s) using chunk(s) is provided. The system can be employed, for example, to facilitate a web-crawling system that crawls (e.g., continuously) the Internet for information (e.g., data) and indexes the information so that it can be used as part of a web search engine. The system facilitates incremental re-crawls and/or selective updating of information (e.g., documents) using a structure called a chunk to simplify the process of an incremental crawl. A chunk is a set of documents that can be manipulated as a set (e.g., of up to 65,536 (64K) documents). “Document” refers to a corpus of data that is stored at a particular URL (e.g., HTML, PDF, PS, PPT, XLS, and/or DOC files, etc.). A chunk is created by an indexer. The indexer can place into a chunk documents that have similar property(ies). These property(ies) include but are not limited to: average time between change and average importance. These property(ies) can be stored at the chunk level in a chunk map. The chunk map can then be employed (e.g., on a daily basis) to determine which chunk(s) should be re-crawled.
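The chunk map described in the abstract can be illustrated with a minimal Python sketch. The abstract only states that chunk-level properties (average time between change, average importance) are stored in a chunk map and consulted to decide which chunks to re-crawl. The selection rule below (a chunk is due once its age reaches its average change interval, with due chunks ranked by importance), the `days_since_crawl` field, and the `budget` parameter are all assumptions made for illustration, not the patented method.

```python
from dataclasses import dataclass

# Upper bound on chunk size quoted in the abstract (64K documents).
MAX_CHUNK_DOCS = 65_536


@dataclass
class ChunkInfo:
    """One chunk-map entry; field names are hypothetical."""
    chunk_id: int
    avg_days_between_change: float  # average time between changes of its documents
    avg_importance: float           # average importance of its documents
    days_since_crawl: float         # assumed bookkeeping field, not from the abstract


def chunks_to_recrawl(chunk_map, budget):
    """Return up to `budget` chunk ids that look due for a re-crawl.

    Assumed policy: a chunk is due when it has gone at least its average
    change interval without a crawl; due chunks are ranked by importance.
    """
    due = [c for c in chunk_map if c.days_since_crawl >= c.avg_days_between_change]
    due.sort(key=lambda c: c.avg_importance, reverse=True)
    return [c.chunk_id for c in due[:budget]]


chunk_map = [
    ChunkInfo(1, 1.0, 0.90, 2.0),
    ChunkInfo(2, 30.0, 0.50, 3.0),
    ChunkInfo(3, 7.0, 0.95, 8.0),
    ChunkInfo(4, 2.0, 0.10, 5.0),
]
print(chunks_to_recrawl(chunk_map, budget=2))  # [3, 1]
```

Keeping the properties at chunk level means the daily decision touches one record per chunk rather than one per document, which is the simplification the abstract emphasizes.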
-
Publication No.: US07536417B2
Publication date: 2009-05-19
Application No.: US11439341
Filing date: 2006-05-24
Applicant: James E. Walsh, Jonathan Forbes
Inventor: James E. Walsh, Jonathan Forbes
IPC classification: G06F17/00
CPC classification: G06F17/30861, Y10S707/99942, Y10S707/99943, Y10S707/99944, Y10S707/99945, Y10S707/99948
Abstract: A system and method are presented for monitoring user browsing information. Such information can include, but is not limited to, the web pages visited by users, search queries submitted by users, the manner in which users browse the Internet and search for content, as well as any demographic information and interests of the corresponding users. Once a particular type of user browsing information has reached a certain threshold of activity by users, the invention can be configured to detect activity that reaches the threshold and then can increase the monitoring of the information.
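The threshold behaviour in this abstract can be pictured with a small Python sketch. The abstract does not say how activity is counted or what increased monitoring involves, so the per-key counter, the `ActivityMonitor` class, and the `elevated` set are hypothetical stand-ins chosen only to show the control flow.

```python
from collections import Counter


class ActivityMonitor:
    """Counts events per key and flags keys whose count reaches a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()
        self.elevated = set()  # keys that now receive increased monitoring

    def record(self, key):
        """Record one event; return True if `key` is under increased monitoring."""
        self.counts[key] += 1
        if self.counts[key] >= self.threshold:
            self.elevated.add(key)
        return key in self.elevated


monitor = ActivityMonitor(threshold=3)
flags = [monitor.record("query:example") for _ in range(3)]
print(flags)  # [False, False, True]
```

A real system would presumably count over a time window and per information type; the sketch leaves that out.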
-
Publication No.: US08819017B2
Publication date: 2014-08-26
Application No.: US12905464
Filing date: 2010-10-15
CPC classification: G06F17/30982
Abstract: Embodiments of the present invention relate to systems, methods, and computer-storage media for affinitizing datasets based on efficient query processing. In one embodiment, a plurality of datasets within a data stream is received. The data stream is partitioned based on efficient query processing. Once the data stream is partitioned, an affinity identifier is assigned to datasets based on the partitioning of the dataset. Further, when datasets are broken into extents, the affinity identifier of the parent dataset is retained in the resulting extent. The affinity identifier of each extent is then referenced to preferentially store extents having common affinity identifiers within close proximity of one another across a data center.
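The two steps the abstract names, extents inheriting the parent dataset's affinity identifier and placement that keeps equal identifiers close together, can be sketched as follows. The `Extent` type, the round-robin choice of rack per identifier, and the use of a rack as the unit of proximity are assumptions for illustration. The abstract does not specify how proximity is measured or how identifiers are derived from the partitioning.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    dataset: str
    index: int
    affinity_id: int  # copied from the parent dataset


def split_into_extents(dataset, affinity_id, n_extents):
    """Break a dataset into extents that retain the parent's affinity id."""
    return [Extent(dataset, i, affinity_id) for i in range(n_extents)]


def place(extents, racks):
    """Store extents with the same affinity id on the same rack (assumed policy)."""
    rack_for = {}                  # affinity id -> chosen rack
    placement = defaultdict(list)  # rack -> extents stored there
    for extent in extents:
        if extent.affinity_id not in rack_for:
            # New affinity id: hand out racks round-robin.
            rack_for[extent.affinity_id] = racks[len(rack_for) % len(racks)]
        placement[rack_for[extent.affinity_id]].append(extent)
    return dict(placement)


extents = (
    split_into_extents("a", affinity_id=7, n_extents=2)
    + split_into_extents("b", affinity_id=9, n_extents=1)
    + split_into_extents("c", affinity_id=7, n_extents=1)
)
placement = place(extents, racks=["rack0", "rack1"])
```

Here datasets `a` and `c` share affinity identifier 7, so all of their extents land on `rack0`, while the extent of `b` goes to `rack1`.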
-
Publication No.: US06226628B1
Publication date: 2001-05-01
Application No.: US09104162
Filing date: 1998-06-24
Applicant: Jonathan Forbes
Inventor: Jonathan Forbes
IPC classification: G06F17/30
CPC classification: G06F17/30952, H03M7/3084, Y10S707/918, Y10S707/99931, Y10S707/99932, Y10S707/99936, Y10S707/99942
Abstract: A method of providing data files includes compressing the files using a cross-file compression technique. The technique makes use of ancillary files that are stored along with the data files. The ancillary files include lookup tables and indexes. A lookup table for a data file indicates the position of the last occurrence of individual data values within the data file. Each displacement index for a data file indicates displacements from respective data elements to prior strings of a particular match length that match strings of the particular match length begun by the respective data elements. Indexes corresponding to different match lengths are provided. In response to client requests for subsets of available data files, a server compresses each subset of data files using a pattern-matching compression scheme that attempts to represent given strings by referencing prior matching strings across file boundaries. To find a prior matching string for a string begun by a current data element in a current data file, the server finds a previous matching string in the current data file by referencing the displacement indexes associated with the current data file, and then searches for a larger matching string in previous data files by referencing the lookup tables and displacement indexes associated with the previous data files.
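The two ancillary structures named in the abstract can be sketched in Python for a single file. This is a simplified reading: data values are taken to be bytes, the displacement points to the most recent earlier match, and 0 is used to mean "no prior match". All three are assumptions for illustration, as are the function names. The cross-file search that combines these structures is not shown.

```python
def build_lookup_table(data: bytes) -> dict:
    """Map each byte value to the position of its last occurrence in `data`."""
    # Later positions overwrite earlier ones, so the last occurrence wins.
    return {value: pos for pos, value in enumerate(data)}


def build_displacement_index(data: bytes, match_len: int) -> list:
    """For each position, the distance back to the most recent earlier position
    where the same `match_len`-byte string begins, or 0 if there is none."""
    last_seen = {}            # substring -> most recent start position
    index = [0] * len(data)
    for i in range(len(data) - match_len + 1):
        key = data[i:i + match_len]
        if key in last_seen:
            index[i] = i - last_seen[key]
        last_seen[key] = i
    return index


data = b"abcabcab"
lookup = build_lookup_table(data)
index3 = build_displacement_index(data, match_len=3)
print(lookup[ord("a")], lookup[ord("b")], lookup[ord("c")])  # 6 7 5
print(index3)  # [0, 0, 0, 3, 3, 3, 0, 0]
```

Building one index per match length, as the abstract describes, would amount to calling `build_displacement_index` with each length of interest and storing the results alongside the data file.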
-