-
Publication number: US09424340B1
Publication date: 2016-08-23
Application number: US14521078
Application date: 2014-10-22
Applicant: Google Inc.
Inventor: Rupesh Kapoor, David Michael Proudfoot, Joachim Kupke
IPC: G06F17/30
CPC classification number: G06F17/30613, G06F17/3053, G06F17/3071, G06F17/30864
Abstract: A system may identify a set of first documents associated with an organization, and identify clusters to which the first documents belong. Each of a number of the identified clusters may include a group of documents that includes one of the first documents and one or more second documents associated with one or more different organizations. The system may determine a quality score for each of the documents in each of the identified clusters, and determine, for each of the number of the identified clusters, whether the quality score of the one of the first documents in the identified cluster is higher than the quality score of the one or more second documents in the identified cluster. The system may generate a proxy pad score based on the determinations, and store the proxy pad score.
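The scoring flow in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not say how the per-cluster determinations are combined, so the sketch assumes the proxy pad score is the fraction of shared clusters in which the organization's document does not outscore the best document from another organization. The `Document` type, its `quality` field, and the `proxy_pad_score` function are hypothetical names, not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    url: str
    organization: str
    quality: float  # hypothetical precomputed quality score


def proxy_pad_score(organization, clusters):
    """Assumed aggregation: share of clusters where the organization's
    document is NOT higher in quality than the other organizations' documents."""
    considered = 0
    outscored = 0
    for cluster in clusters:
        own = [d for d in cluster if d.organization == organization]
        others = [d for d in cluster if d.organization != organization]
        if not own or not others:
            continue  # only clusters shared with other organizations count
        considered += 1
        # Per-cluster determination: is the organization's score the higher one?
        if max(d.quality for d in own) <= max(d.quality for d in others):
            outscored += 1
    return outscored / considered if considered else 0.0


clusters = [
    [Document("a.example/1", "A", 0.2), Document("b.example/1", "B", 0.9)],
    [Document("a.example/2", "A", 0.8), Document("b.example/2", "B", 0.5)],
    [Document("a.example/3", "A", 0.1), Document("c.example/3", "C", 0.4),
     Document("b.example/3", "B", 0.3)],
    [Document("b.example/4", "B", 0.6), Document("c.example/4", "C", 0.7)],
]
score_a = proxy_pad_score("A", clusters)
```

Under this assumed aggregation, a score near 1.0 means the organization's documents are almost always outscored by the other documents in their clusters.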
-
Publication number: US20150379014A1
Publication date: 2015-12-31
Application number: US14521206
Application date: 2014-10-22
Applicant: GOOGLE INC.
Inventor: Hui Xu, Rupesh Kapoor, Erik Arjan Hendriks, Hao Fang, Cristian Tapus
CPC classification number: G06F17/3053, G06F17/211, G06F17/2229, G06F17/2247, G06F17/248, G06F17/30887, G06F17/30899
Abstract: Implementations include a batch-optimized render and fetch architecture. An example method performed by the architecture includes receiving a request from a batch process to render a web page and initializing a virtual clock and a task list for rendering the web page. The virtual clock stands still when a request for an embedded item is outstanding and when a task is ready to run. The method may also include generating a rendering result for the web page when the virtual clock matches a run time for a stop task in the task list, and providing the rendering result to the batch process. Another example method includes receiving a request from a batch process to render a web page, identifying an embedded item in the web page, and determining, based on a rewrite rule, that the embedded item has content that is duplicative of content for a previously fetched embedded item.
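The virtual-clock behaviour described in this abstract can be illustrated with a small simulation. It is a sketch under assumptions, not the patented implementation: the class name `VirtualClockRenderer`, the millisecond units, and the modelling of fetches as a simple queue that resolves one item per loop iteration are all hypothetical. What it shows is the stated rule: the clock does not advance while a fetch is outstanding or a task is ready, and the rendering result is produced when the clock reaches the stop task's run time.

```python
import heapq
from collections import deque
from itertools import count


class VirtualClockRenderer:
    def __init__(self, stop_time):
        self.now = 0                 # virtual time, assumed milliseconds
        self._tasks = []             # heap of (run_time, seq, name)
        self._seq = count()          # tie-breaker for equal run times
        self.log = []
        self.schedule(stop_time, "stop")

    def schedule(self, delay, name):
        heapq.heappush(self._tasks, (self.now + delay, next(self._seq), name))

    def run(self, embedded_items):
        pending = deque(embedded_items)
        while True:
            if pending:
                # A fetch is outstanding: resolve it while the clock stands still.
                self.log.append((self.now, f"fetched {pending.popleft()}"))
                continue
            run_time, _, name = self._tasks[0]
            if run_time > self.now:
                # Nothing outstanding and nothing ready: jump to the next task.
                self.now = run_time
                continue
            heapq.heappop(self._tasks)
            if name == "stop":
                # The clock matches the stop task's run time: emit the result.
                return {"finished_at": self.now, "log": self.log}
            # A task is ready: run it without advancing the clock.
            self.log.append((self.now, f"ran {name}"))


renderer = VirtualClockRenderer(stop_time=5000)
renderer.schedule(100, "timer-a")
renderer.schedule(7000, "timer-late")
result = renderer.run(["style.css", "app.js"])
```

Because the clock jumps directly between task run times, a batch render in this sketch finishes as fast as the work allows rather than waiting in real time, and tasks scheduled after the stop time (`timer-late` here) never run.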
-
Publication number: US09984130B2
Publication date: 2018-05-29
Application number: US14521206
Application date: 2014-10-22
Applicant: GOOGLE INC.
Inventor: Hui Xu, Rupesh Kapoor, Erik Arjan Hendriks, Hao Fang, Cristian Tapus
CPC classification number: G06F17/3053, G06F17/211, G06F17/2229, G06F17/2247, G06F17/248, G06F17/30887, G06F17/30899
Abstract: Implementations include a batch-optimized render and fetch architecture. An example method performed by the architecture includes receiving a request from a batch process to render a web page and initializing a virtual clock and a task list for rendering the web page. The virtual clock stands still when a request for an embedded item is outstanding and when a task is ready to run. The method may also include generating a rendering result for the web page when the virtual clock matches a run time for a stop task in the task list, and providing the rendering result to the batch process. Another example method includes receiving a request from a batch process to render a web page, identifying an embedded item in the web page, and determining, based on a rewrite rule, that the embedded item has content that is duplicative of content for a previously fetched embedded item.
-
Publication number: US20130117252A1
Publication date: 2013-05-09
Application number: US13644297
Application date: 2012-10-04
Applicant: Google Inc.
Inventor: Sumitro Samaddar, Rupesh Kapoor, Pawel Alexander Fedorynski
IPC: G06F17/30
CPC classification number: G06F16/951
Abstract: System and method for fetching embedded object content as part of a batch crawl. A fetch server receives a request on a request thread to retrieve content for objects embedded in a document, such as a web page. The fetch server attempts to locate the content of the object in cache first and in disk storage next. If the content is not located in the cache the fetch server may switch the request to a worker thread. If the content is not located in the disk storage, the fetch server may schedule a request to retrieve the content of the embedded object through a batch web crawl. Scheduling a request may include determining that a request to crawl the content of the object has already been scheduled or inserting a request into a scheduling queue.
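The lookup order in this abstract (cache, then disk, then a deduplicated crawl request) can be sketched as follows. This is a simplified illustration, not the patented design: `FetchServer` and its method names are hypothetical, the cache and disk are modelled as plain dictionaries, and the switch from the request thread to a worker thread is represented by an ordinary method call rather than real threading.

```python
from collections import deque


class FetchServer:
    def __init__(self, cache, disk):
        self.cache = cache            # fast lookup, checked on the request thread
        self.disk = disk              # slower lookup, checked on the worker path
        self.crawl_queue = deque()    # scheduling queue for the batch crawl
        self._scheduled = set()       # URLs that already have a crawl request

    def fetch(self, url):
        content = self.cache.get(url)
        if content is not None:
            return "cache", content
        # Cache miss: a real server would hand the request to a worker thread here.
        return self._fetch_on_worker(url)

    def _fetch_on_worker(self, url):
        content = self.disk.get(url)
        if content is not None:
            return "disk", content
        # Not on disk: schedule a crawl unless one is already scheduled.
        if url not in self._scheduled:
            self._scheduled.add(url)
            self.crawl_queue.append(url)
        return "scheduled", None


server = FetchServer(cache={"a.css": b"A"}, disk={"b.js": b"B"})
hit_cache = server.fetch("a.css")
hit_disk = server.fetch("b.js")
first_miss = server.fetch("c.png")
second_miss = server.fetch("c.png")   # must not enqueue a duplicate request
```

Tracking scheduled URLs in a set keeps the duplicate check constant-time, which is one plausible way to implement the "already been scheduled" determination the abstract mentions.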
-