-
Publication No.: US09043306B2
Publication date: 2015-05-26
Application No.: US12861788
Filing date: 2010-08-23
Applicant: Fabrice Canel, Junaid Ahmed, Thomas Francis McElroy, Walter Sun, Kumar Chellapilla, Abhishek Singh, Vishnu Challam
Inventor: Fabrice Canel, Junaid Ahmed, Thomas Francis McElroy, Walter Sun, Kumar Chellapilla, Abhishek Singh, Vishnu Challam
IPC classification: G06F17/30
CPC classification: G06F17/30864, G06F17/30109, G06F17/30336, G06F17/30867, G06F17/30899
Abstract: A client application installed on end user computers generates metadata from the content of web pages visited by end users and provides the metadata to a search engine. When an end user visits a web page, the end user's computer downloads and displays the web page to the end user. The client application may simultaneously access the web page content and generate this metadata in the form of a content signature of the web page from the web page content. The client application then provides the content signature to a search engine. The search engine may employ content signatures to identify new web pages to crawl and index. Additionally, the search engine may employ content signatures to identify changes to web pages and determine the crawl frequency of web pages.
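The content-signature idea in this abstract can be illustrated with a minimal sketch. The abstract does not say how the signature is computed, so the SHA-256 hash, the whitespace and case normalization, and the `SignatureStore` class below are all assumptions made for illustration, not the patented method.

```python
import hashlib
import re


def content_signature(page_text: str) -> str:
    """Return a hex digest of the page text (assumed scheme: SHA-256
    over lowercased text with runs of whitespace collapsed)."""
    normalized = re.sub(r"\s+", " ", page_text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SignatureStore:
    """Hypothetical search-engine-side store: URL -> last reported signature."""

    def __init__(self):
        self._seen = {}

    def report(self, url: str, signature: str) -> str:
        """Classify a client report as 'new', 'changed', or 'unchanged'."""
        previous = self._seen.get(url)
        self._seen[url] = signature
        if previous is None:
            return "new"          # candidate page to crawl and index
        if previous == signature:
            return "unchanged"    # no recrawl needed on this evidence
        return "changed"          # evidence for a higher crawl frequency
```

A search engine could count how often a URL is reported as changed and use that count as one input when setting its crawl frequency.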
-
Publication No.: US09104960B2
Publication date: 2015-08-11
Application No.: US13163857
Filing date: 2011-06-20
CPC classification: G06N7/005, G06Q30/0242
Abstract: Methods, systems, and computer-storage media having computer-usable instructions embodied thereon for calculating event probabilities are provided. The event may be a click probability. Event probabilities are calculated using a system optimized for runtime model accuracy with an operable learning algorithm. Bin counting techniques are used to calculate event probabilities based on a count of event occurrences and non-event occurrences. Linear parameters, such as counts of clicks and non-clicks, may also be used in the system to allow for runtime adjustments.
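Bin counting as described here can be sketched as a per-bin tally of clicks and non-clicks that is turned into a probability. The `BinCounter` class and the additive prior counts (used so that unseen bins get a defined estimate) are assumptions. The abstract only states that probabilities are derived from counts of event and non-event occurrences.

```python
from collections import defaultdict


class BinCounter:
    """Estimate click probability per bin from click / non-click counts."""

    def __init__(self, prior_clicks: float = 1.0, prior_non_clicks: float = 1.0):
        # Assumed smoothing: pseudo-counts added to every bin.
        self.prior_clicks = prior_clicks
        self.prior_non_clicks = prior_non_clicks
        self.counts = defaultdict(lambda: [0, 0])  # bin -> [clicks, non_clicks]

    def observe(self, bin_key, clicked: bool) -> None:
        """Record one impression in the given bin."""
        self.counts[bin_key][0 if clicked else 1] += 1

    def probability(self, bin_key) -> float:
        """Smoothed ratio clicks / (clicks + non_clicks) for the bin."""
        clicks, non_clicks = self.counts.get(bin_key, (0, 0))
        numerator = clicks + self.prior_clicks
        denominator = clicks + non_clicks + self.prior_clicks + self.prior_non_clicks
        return numerator / denominator
```

Because the estimate depends only on two running counts per bin, it can be updated at runtime by incrementing a counter, with no retraining step.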
-
Publication No.: US08244752B2
Publication date: 2012-08-14
Application No.: US12106857
Filing date: 2008-04-21
Applicant: Greg Buehrer, Kumar Chellapilla, Jack W. Stokes
Inventor: Greg Buehrer, Kumar Chellapilla, Jack W. Stokes
CPC classification: H04L47/10
Abstract: A method for classifying search query traffic can involve receiving a plurality of labeled sample search query traffic and generating a feature set partitioned into human physical limit features and query stream behavioral features. A model can be generated using the plurality of labeled sample search query traffic and the feature set. Search query traffic can be received and the model can be utilized to classify the received search query traffic as generated by a human or automatically generated.
-
Publication No.: US07743013B2
Publication date: 2010-06-22
Application No.: US11811619
Filing date: 2007-06-11
Applicant: Anton Mityagin, Kumar Chellapilla, Denis Charles
Inventor: Anton Mityagin, Kumar Chellapilla, Denis Charles
IPC classification: G06F17/30
CPC classification: G06F17/30011
Abstract: Multiple Bloom filters are generated to partition data between first and second disjoint data sets of elements. Each element in the first data set is assigned to a bucket of a first set of buckets, and each element in the second data set is assigned to a bucket of a second set of buckets. A Bloom filter is generated for each bucket of the first set of buckets. The Bloom filter generated for a bucket indicates that each element assigned to that bucket is part of the first data set, and that each element assigned to a corresponding bucket of the second set of buckets is not part of the first data set. Additionally, a Bloom filter corresponding to a subsequently received element can be determined and used to identify whether that subsequently received element is part of the first data set or the second data set.
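One way to read the per-bucket construction is sketched below: elements are hashed into buckets, and each bucket gets its own Bloom filter that contains the bucket's first-set elements and rejects the second-set elements of the corresponding bucket. The abstract does not say how that rejection is guaranteed. Retrying with a different hash seed until no second-set element is a false positive is an assumption made for this sketch, as are all function names and parameter values.

```python
import hashlib


def _hash(item: str, seed: int, modulus: int) -> int:
    """Deterministic seeded hash of a string into range(modulus)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % modulus


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int, seed: int = 0):
        self.num_bits, self.num_hashes, self.seed = num_bits, num_hashes, seed
        self.bits = 0  # Python int used as a bit array

    def _positions(self, item: str):
        return [_hash(item, self.seed * 1000 + i, self.num_bits)
                for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> p) & 1 for p in self._positions(item))


def bucket_of(item: str, num_buckets: int) -> int:
    return _hash(item, -1, num_buckets)


def build_partition(first_set, second_set, num_buckets=4,
                    num_bits=256, num_hashes=3, max_tries=1000):
    """Build one filter per bucket; retry seeds until the second-set
    elements of the corresponding bucket are all rejected (assumed scheme)."""
    filters = []
    for b in range(num_buckets):
        members = [x for x in first_set if bucket_of(x, num_buckets) == b]
        others = [y for y in second_set if bucket_of(y, num_buckets) == b]
        for seed in range(max_tries):
            bf = BloomFilter(num_bits, num_hashes, seed)
            for x in members:
                bf.add(x)
            if not any(y in bf for y in others):
                filters.append(bf)
                break
        else:
            raise RuntimeError(f"no suitable seed found for bucket {b}")
    return filters


def in_first_set(item: str, filters) -> bool:
    """Look up the filter for the item's bucket and test membership."""
    return item in filters[bucket_of(item, len(filters))]
```

For elements that belong to neither set the answer is still only probabilistic, as with any Bloom filter. The exact separation holds only for the two sets the filters were built from.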
-
Publication No.: US08200596B2
Publication date: 2012-06-12
Application No.: US12473428
Filing date: 2009-05-28
IPC classification: G06F17/00
CPC classification: G06F17/10
Abstract: Classes of web graph algorithms are extended to run directly on virtual node-type compressed web graphs where a reduction in runtime of the extended algorithms is realized which is approximately proportional to the compression ratio applied to the original (i.e., uncompressed) graph. In the virtual node compression technique, a succinct representation of a web graph is constructed by replacing dense subgraphs by sparse ones so that the resulting compressed graph has significantly fewer edges and a relatively small number of additional nodes.
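A small sketch can make the virtual-node idea concrete. When every node in a set of sources links to every node in a set of targets, the |S| x |T| edges can be replaced by |S| + |T| edges through one virtual node. The `propagate` function below is a hypothetical example of an algorithm extended to run on the compressed graph: a single unnormalized score push, assuming one level of virtual nodes (no virtual node points to another). Neither function is taken from the patent.

```python
def compress_biclique(graph, sources, targets, virtual_name):
    """Replace the edges sources x targets with edges via one virtual node."""
    compressed = {u: set(vs) for u, vs in graph.items()}
    for s in sources:
        compressed[s] -= set(targets)
        compressed[s].add(virtual_name)
    compressed[virtual_name] = set(targets)
    return compressed


def edge_count(graph):
    return sum(len(vs) for vs in graph.values())


def propagate(graph, scores, is_virtual):
    """One push step: each real node sends its score to its out-neighbors.
    Virtual nodes accumulate what they receive and forward it unchanged,
    so the work is proportional to the number of compressed edges."""
    received = {u: 0.0 for u in graph}
    for u, vs in graph.items():
        if not is_virtual(u):
            for v in vs:
                received[v] += scores[u]
    for u, vs in graph.items():
        if is_virtual(u):
            for v in vs:
                received[v] += received[u]
    return {u: r for u, r in received.items() if not is_virtual(u)}
```

With three sources fully linked to three targets, the compressed graph has 6 edges instead of 9, and the push step yields the same per-node result on both graphs.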
-
Publication No.: US20070239450A1
Publication date: 2007-10-11
Application No.: US11278949
Filing date: 2006-04-06
Applicant: Wolf Kienzle, Kumar Chellapilla
Inventor: Wolf Kienzle, Kumar Chellapilla
IPC classification: G10L15/06
CPC classification: G10L15/07
Abstract: The subject disclosure pertains to systems and methods for personalization of a recognizer. In general, recognizers can be used to classify input data. During personalization, a recognizer is provided with samples specific to a user, entity or format to improve performance for the specific user, entity or format. Biased regularization can be utilized during personalization to maintain recognizer performance for non-user specific input. In one aspect, regularization can be biased to the original parameters of the recognizer, such that the recognizer is not modified excessively during personalization.
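Biased regularization can be illustrated with a penalty that pulls the personalized parameters toward the original ones rather than toward zero. The sketch below minimizes the mean squared error on user samples plus `lam * ||w - w0||^2` by plain gradient descent. The linear model, the squared loss, and the names `personalize`, `lam`, `lr`, and `steps` are assumptions chosen for brevity, since the abstract does not fix a model or a loss.

```python
def personalize(w0, samples, lam, lr=0.01, steps=2000):
    """Fit a linear model y = w . x to user samples while penalizing
    distance from the original parameters w0 (biased regularization).

    samples: list of (x, y) pairs, x being a list of floats.
    lam: regularization strength; larger values keep w closer to w0.
    Note: lr must be small relative to lam for the descent to converge.
    """
    w = list(w0)
    n = len(samples)
    for _ in range(steps):
        # Gradient of lam * ||w - w0||^2
        grad = [2.0 * lam * (wi - w0i) for wi, w0i in zip(w, w0)]
        # Gradient of the mean squared error on the user samples
        for x, y in samples:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w
```

With `w0 = [1.0]` and a single user sample `([1.0], 3.0)`, the minimizer is `(3 + lam) / (1 + lam)`: it moves fully to 3 when `lam = 0` and stays closer to the original value 1 as `lam` grows.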
-
Publication No.: US20070133883A1
Publication date: 2007-06-14
Application No.: US11299873
Filing date: 2005-12-12
Applicant: Kumar Chellapilla, Patrice Simard
Inventor: Kumar Chellapilla, Patrice Simard
IPC classification: G06K9/62
CPC classification: G06K9/80
Abstract: A method and system for implementing character recognition is described herein. An input character is received. The input character is composed of one or more logical structures in a particular layout. The layout of the one or more logical structures is identified. One or more of a plurality of classifiers are selected based on the layout of the one or more logical structures in the input character. The entire character is input into the selected classifiers. The selected classifiers classify the logical structures. The outputs from the selected classifiers are then combined to form an output character vector.
-
Publication No.: US08296327B2
Publication date: 2012-10-23
Application No.: US12473706
Filing date: 2009-05-28
IPC classification: G06F17/30
CPC classification: G06F17/30958, G06F17/30861
Abstract: Short paths are found with a small query time in scale-free directed graphs using a two-phase process by which data structures comprising shortest path trees are first pre-computed for a group of central vertices called “hubs” that have short paths to most other vertices in the graph. In a query time phase, a short path between two vertices of interest in the graph is found by looking up the path to the root in each of the shortest path trees.
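The two-phase process can be sketched for an unweighted directed graph. In the pre-computation phase, each hub gets two breadth-first shortest path trees: one over reversed edges (paths into the hub) and one over original edges (paths out of the hub). At query time, a path from `u` to `v` is assembled by walking from `u` to the root in the first tree and from the root to `v` in the second. How hubs are chosen, and all function names, are assumptions. The result is a short path through some hub, not necessarily the shortest path.

```python
from collections import deque


def bfs_parents(adj, root):
    """Shortest path tree from root as a child -> parent map."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent


def reverse(adj):
    rev = {}
    for u, vs in adj.items():
        for v in vs:
            rev.setdefault(v, []).append(u)
    return rev


def path_to_root(parent, node):
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path  # node, ..., root


def precompute(adj, hubs):
    """Phase 1: per hub, a tree of paths into it and a tree of paths out of it."""
    rev = reverse(adj)
    return {h: (bfs_parents(rev, h), bfs_parents(adj, h)) for h in hubs}


def short_path(trees, u, v):
    """Phase 2: best u -> hub -> v path over all hubs, or None."""
    best = None
    for to_hub, from_hub in trees.values():
        if u in to_hub and v in from_hub:
            first = path_to_root(to_hub, u)            # u ... hub
            second = path_to_root(from_hub, v)[::-1]   # hub ... v
            candidate = first + second[1:]
            if best is None or len(candidate) < len(best):
                best = candidate
    return best
```

Each query costs two tree walks per hub, independent of the size of the graph apart from the path lengths themselves.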
-
Publication No.: US20080307189A1
Publication date: 2008-12-11
Application No.: US11811619
Filing date: 2007-06-11
Applicant: Anton Mityagin, Kumar Chellapilla, Denis Charles
Inventor: Anton Mityagin, Kumar Chellapilla, Denis Charles
IPC classification: G06F12/00
CPC classification: G06F17/30011
Abstract: Multiple Bloom filters are generated to partition data between first and second disjoint data sets of elements. Each element in the first data set is assigned to a bucket of a first set of buckets, and each element in the second data set is assigned to a bucket of a second set of buckets. A Bloom filter is generated for each bucket of the first set of buckets. The Bloom filter generated for a bucket indicates that each element assigned to that bucket is part of the first data set, and that each element assigned to a corresponding bucket of the second set of buckets is not part of the first data set. Additionally, a Bloom filter corresponding to a subsequently received element can be determined and used to identify whether that subsequently received element is part of the first data set or the second data set.
-
Publication No.: US20080270549A1
Publication date: 2008-10-30
Application No.: US11789997
Filing date: 2007-04-26
Applicant: Kumar Chellapilla, Baoning Wu
Inventor: Kumar Chellapilla, Baoning Wu
CPC classification: G06Q10/107, G06F16/951
Abstract: Architecture for extracting link spam communities when given one or more members of the community. A link spam extraction algorithm is provided that takes as input link spam seeds and extracts other nearby link spam through a biased local random walk around the seed(s). The seed set is provided by a user (or an automated algorithm scrubbed by a human) which the algorithm uses to simulate a random walk on a web graph. The random walk can be biased to explore a local neighborhood around the seed set through use of decay probabilities. Truncation can be used to retain only the most frequently visited nodes. After termination, the nodes are sorted in decreasing order of final probabilities and presented to the user. Human judges need only make decisions at the spam community level, thereby limiting involvement, and human input can be scaled by several orders of magnitude.
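The biased local walk can be approximated by propagating probability mass outward from the seeds instead of sampling individual walks. In the sketch below, each step forwards a `decay` fraction of a node's mass evenly to its out-neighbors, truncation keeps only the `keep` heaviest nodes per step, and the accumulated mass is sorted in decreasing order. Treating "decay probabilities" as a per-step continuation factor, and every name and default value here, are assumptions for illustration.

```python
def extract_community(graph, seeds, steps=3, decay=0.5, keep=10):
    """Rank nodes near the seeds by accumulated walk probability.

    graph: dict mapping node -> list of out-neighbors.
    Returns a list of (node, score) sorted by decreasing score.
    """
    current = {s: 1.0 / len(seeds) for s in seeds}
    total = dict(current)
    for _ in range(steps):
        nxt = {}
        for u, mass in current.items():
            neighbors = graph.get(u, ())
            if not neighbors:
                continue
            share = decay * mass / len(neighbors)  # decay biases the walk locally
            for v in neighbors:
                nxt[v] = nxt.get(v, 0.0) + share
        # Truncation: retain only the most heavily visited nodes.
        ranked = sorted(nxt.items(), key=lambda kv: (-kv[1], kv[0]))
        current = dict(ranked[:keep])
        for v, m in current.items():
            total[v] = total.get(v, 0.0) + m
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
```

Nodes that only link into the seed set, without being reachable from it, never receive mass and so do not appear in the ranking presented to the reviewer.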
-