Duplicate document detection
    1.
    Granted invention patent
    Legal status: in force

    Publication number: US08768940B2

    Publication date: 2014-07-01

    Application number: US13612840

    Filing date: 2012-09-13

    Abstract: In a single-signature duplicate document system, a secondary set of attributes is used in addition to a primary set of attributes so as to improve the precision of the system. When the projection of a document onto the primary set of attributes is below a threshold, then a secondary set of attributes is used to supplement the primary lexicon so that the projection is above the threshold.
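
    The fallback described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented method: the function name `signature`, the whitespace tokenizer, the `min_terms` threshold, and the use of SHA-1 are all assumptions made for the example. The sketch also adds every matching secondary term at once, which does not guarantee that the supplemented projection reaches the threshold.

```python
import hashlib


def signature(text, primary, secondary, min_terms=3):
    """Return one hash signature for `text` (hypothetical helper).

    `primary` and `secondary` are sets of lexicon terms. `min_terms` is an
    assumed threshold on the size of the projection.
    """
    tokens = set(text.lower().split())

    # Projection of the document onto the primary attribute set.
    projection = tokens & primary

    # If the projection is below the threshold, supplement it with
    # whatever secondary-lexicon terms the document contains.
    if len(projection) < min_terms:
        projection |= tokens & secondary

    # Hash the sorted terms so that word order does not change the signature.
    joined = " ".join(sorted(projection))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


primary = {"invoice", "payment", "account", "overdue"}
secondary = {"please", "remit", "today"}

sig_a = signature("Please remit payment today", primary, secondary)
sig_b = signature("today please REMIT payment", primary, secondary)
print(sig_a == sig_b)
```

    Two documents count as duplicates when their signatures match. Without the secondary lexicon, every short document that mentions only "payment" would collapse to the same signature, which is the precision problem the abstract refers to.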

    Detecting spam from metafeatures of an email message
    3.
    Granted invention patent
    Legal status: in force

    Publication number: US08370930B2

    Publication date: 2013-02-05

    Application number: US12039727

    Filing date: 2008-02-28

    CPC classification number: H04L51/12

    Abstract: Detecting spam from metafeatures of an email message. As a part of detecting spam, the email message is accessed and a distribution of numerical values is accorded to a set of features of the email message. It is determined whether the distribution of numerical values accorded the set of features of the email message is consistent with that of spam. Access is provided to the determination of whether the email message has a distribution of numerical values accorded the set of features that is consistent with that of spam.

    Filtering system for providing personalized information in the absence of negative data
    4.
    Granted invention patent
    Legal status: in force

    Publication number: US08060507B2

    Publication date: 2011-11-15

    Application number: US12987046

    Filing date: 2011-01-07

    Abstract: Systems and methods are provided for personalizing advertising for a user. In accordance with certain implementations, information is accessed indicating which documents were selected by a user and which documents were not selected by a user. At least one positive word vector is generated using words contained in at least one of the selected documents, and at least one negative word vector is generated using words contained in at least one of the unselected documents. Document word vectors are generated, and a document rank order is established based on a vector space relationship analysis. Categories associated with the documents are ranked based on the document rank order, and the ranked categories are sent to an ad server. Advertising material associated with the ranked categories may then be received from the ad server in a selected context.
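
    One way to picture the ranking step is the sketch below. The abstract only says "vector space relationship analysis"; scoring each document as its cosine similarity to the positive vector minus its cosine similarity to the negative vector is an assumption made here, as are the helper names `word_vector`, `cosine`, `rank_documents`, and `rank_categories` and the rule that a category takes the rank of its best document.

```python
import math
from collections import Counter


def word_vector(docs):
    """Bag-of-words count vector over a list of document strings."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_documents(candidates, selected, unselected):
    """Order candidates by (similarity to selected) - (similarity to unselected)."""
    pos = word_vector(selected)      # positive word vector
    neg = word_vector(unselected)    # negative word vector
    scored = []
    for doc in candidates:
        vec = word_vector([doc])     # document word vector
        scored.append((cosine(vec, pos) - cosine(vec, neg), doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


def rank_categories(ranked_docs, category_of):
    """Rank categories by the position of their best-ranked document."""
    ranked = []
    for doc in ranked_docs:
        category = category_of[doc]
        if category not in ranked:
            ranked.append(category)
    return ranked


selected = ["cheap flights to paris", "paris hotel deals"]
unselected = ["football scores tonight"]
candidates = [
    "football match tonight",
    "flights and hotel in paris",
    "stock market news",
]
category_of = dict(zip(candidates, ["sports", "travel", "finance"]))

ranked = rank_documents(candidates, selected, unselected)
print(rank_categories(ranked, category_of))
```

    The ranked category list is what would be sent to the ad server in the flow the abstract describes.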

    Classifier tuning based on data similarities
    6.
    Granted invention patent
    Legal status: in force

    Publication number: US07089241B1

    Publication date: 2006-08-08

    Application number: US10740821

    Filing date: 2003-12-22

    CPC classification number: G06Q10/107 H04L51/12 Y10S707/99937 Y10S707/99945

    Abstract: A probabilistic classifier is used to classify data items in a data stream. The probabilistic classifier is trained, and an initial classification threshold is set, using unique training and evaluation data sets (i.e., data sets that do not contain duplicate data items). Unique data sets are used for training and in setting the initial classification threshold so as to prevent the classifier from being improperly biased as a result of similarity rates in the training and evaluation data sets that do not reflect similarity rates encountered during operation. During operation, information regarding the actual similarity rates of data items in the data stream is obtained and used to adjust the classification threshold such that misclassification costs are minimized given the actual similarity rates.
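
    The threshold adjustment can be illustrated with a small cost calculation. The sketch below assumes a duplicate-free evaluation set in which each item carries a classifier score, a true label, and a weight standing for how often similar items were observed in the live stream. The function name `best_threshold`, the weighting scheme, and the cost parameters are illustrative assumptions, not details taken from the patent.

```python
def best_threshold(scores, labels, weights, cost_fp=1.0, cost_fn=1.0):
    """Pick the score threshold with the lowest weighted misclassification cost.

    scores  -- classifier outputs for the evaluation items
    labels  -- True if the item belongs to the positive class (e.g. spam)
    weights -- assumed similarity rate of each item in the live data stream
    """
    # Try every observed score as a threshold, plus "classify nothing as positive".
    candidates = sorted(set(scores)) + [float("inf")]
    best_t, best_cost = None, float("inf")
    for t in candidates:
        cost = 0.0
        for score, is_positive, weight in zip(scores, labels, weights):
            predicted_positive = score >= t
            if predicted_positive and not is_positive:
                cost += cost_fp * weight   # false positive, scaled by similarity rate
            elif not predicted_positive and is_positive:
                cost += cost_fn * weight   # false negative, scaled by similarity rate
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t


scores = [0.2, 0.6, 0.5, 0.9]
labels = [False, False, True, True]

# Initial threshold: every evaluation item counted once.
print(best_threshold(scores, labels, [1, 1, 1, 1]))

# During operation the negative item scored 0.6 turns out to be heavily
# duplicated, so the threshold moves up to avoid misclassifying it.
print(best_threshold(scores, labels, [1, 5, 1, 1]))
```

    Training on unique data and applying the observed similarity rates only at this stage keeps the classifier itself free of duplicate-induced bias while still letting the decision threshold track the actual stream.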

    META-MODEL DISTRIBUTED QUERY CLASSIFICATION
    9.
    Invention patent application
    Legal status: under examination (published)

    Publication number: US20130091131A1

    Publication date: 2013-04-11

    Application number: US13267163

    Filing date: 2011-10-06

    CPC classification number: G06F16/353

    Abstract: Systems and methods are provided for classifying a search query. A first group of query classifiers can be used to evaluate a query relative to various subject matter domains. The evaluation results from the first group of domain classifiers can then be used by a second group of meta-classifiers. The meta-classifiers are associated with meta-classifier categories that may correspond to a domain or that may correspond to a plurality of domains. The assigned meta-classifier category for a query can be used in any convenient manner, such as by triggering additional uses of the search query to match images or other alternative types of documents, or such as by allowing a subject matter domain to be assigned to the query.
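
    The two-stage arrangement can be sketched as below. Everything concrete in it is a stand-in chosen for illustration: the keyword-overlap domain scorers, the rule that a meta-category takes the highest score among its domains, the `min_score` cutoff, and the names `domain_scores` and `meta_classify`. A real system would presumably use trained classifiers at both stages.

```python
def domain_scores(query, domain_keywords):
    """First stage: score the query against each subject-matter domain.

    Each domain scorer here is a simple keyword-overlap ratio (an assumption).
    """
    tokens = set(query.lower().split())
    return {
        domain: len(tokens & keywords) / len(keywords)
        for domain, keywords in domain_keywords.items()
    }


def meta_classify(scores, meta_categories, min_score=0.2):
    """Second stage: map first-stage scores to a meta-classifier category.

    A meta-category may cover one domain or several; its score is the best
    score among its domains. Returns None if nothing clears `min_score`.
    """
    combined = {
        category: max(scores[d] for d in domains)
        for category, domains in meta_categories.items()
    }
    best = max(combined, key=combined.get)
    return best if combined[best] >= min_score else None


domain_keywords = {
    "movies": {"film", "trailer", "actor"},
    "music": {"song", "album", "lyrics"},
    "recipes": {"recipe", "bake", "cook"},
}
meta_categories = {
    "entertainment": ["movies", "music"],   # corresponds to a plurality of domains
    "food": ["recipes"],                    # corresponds to a single domain
}

query = "new film trailer"
print(meta_classify(domain_scores(query, domain_keywords), meta_categories))
```

    The returned category could then trigger a follow-up action such as an image search, or be used to assign a subject-matter domain to the query, as the abstract suggests.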
