Learning Discriminative Projections for Text Similarity Measures
    1.
    Patent application
    Status: under examination (published)

    Publication No.: US20120323968A1

    Publication date: 2012-12-20

    Application No.: US13160485

    Filing date: 2011-06-14

    IPC classification: G06F17/30

    CPC classification: G06F16/31

    Abstract: A model for mapping the raw text representation of a text object to a vector space is disclosed. A function is defined for computing a similarity score given two output vectors. A loss function is defined for computing an error based on the similarity scores and the labels of pairs of vectors. The parameters of the model are tuned to minimize the loss function. The label of two vectors indicates the degree of similarity of the corresponding objects. The label may be a binary number or a real-valued number. The function for computing similarity scores may be cosine similarity, the Jaccard coefficient, or any differentiable function. The loss function may compare pairs of vectors to their labels. Each element of the output vector is a linear or non-linear function of the terms of an input vector. The text objects may be different types of documents, and two different models may be trained concurrently.
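
The abstract's pipeline (project term vectors, score the pair, compare against the label, tune the parameters) can be sketched as follows. This is a minimal illustration and not the patented method: the linear projection matrix `A`, the squared-error loss, the finite-difference gradient, and the toy data are all assumptions chosen to keep the example short.

```python
import math
import random

def project(A, x):
    """Map a raw term vector x (length d) to a k-dimensional vector A @ x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def loss(A, pairs):
    """Mean squared error between projected-pair similarity and the pair label."""
    return sum((cosine(project(A, x), project(A, y)) - label) ** 2
               for x, y, label in pairs) / len(pairs)

def numeric_grad(A, pairs, eps=1e-5):
    """Central finite-difference gradient of the loss for each entry of A."""
    grad = [[0.0] * len(A[0]) for _ in A]
    for i in range(len(A)):
        for j in range(len(A[0])):
            old = A[i][j]
            A[i][j] = old + eps
            up = loss(A, pairs)
            A[i][j] = old - eps
            down = loss(A, pairs)
            A[i][j] = old  # restore the original entry
            grad[i][j] = (up - down) / (2 * eps)
    return grad

def train(A, pairs, steps=50, lr=0.5):
    """Gradient descent with step halving: a step is kept only if it lowers the loss."""
    current = loss(A, pairs)
    for _ in range(steps):
        g = numeric_grad(A, pairs)
        step = lr
        while step > 1e-8:
            trial = [[a - step * gij for a, gij in zip(row, grow)]
                     for row, grow in zip(A, g)]
            trial_loss = loss(trial, pairs)
            if trial_loss < current:
                A, current = trial, trial_loss
                break
            step /= 2
        else:
            break  # no improving step found; stop early
    return A, current

# Toy data (hypothetical): 4-term vocabulary, labels 1.0 = similar, 0.0 = dissimilar.
pairs = [
    ([1, 1, 0, 0], [1, 0, 1, 0], 1.0),
    ([1, 1, 0, 0], [0, 0, 0, 1], 0.0),
    ([0, 0, 1, 1], [0, 1, 1, 0], 1.0),
]
random.seed(0)
A = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
initial = loss(A, pairs)
A_trained, final = train(A, pairs)
print(f"loss before: {initial:.4f}, after: {final:.4f}")
```

Because a step is accepted only when it lowers the loss, the final loss can never exceed the initial one. A practical system would use an analytic gradient rather than finite differences.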

    Classification using a cascade approach
    2.
    Granted patent
    Status: lapsed

    Publication No.: US07693806B2

    Publication date: 2010-04-06

    Application No.: US11766434

    Filing date: 2007-06-21

    IPC classification: G06F15/18 G06N3/08

    Abstract: A system and method are disclosed for optimizing a classifier for greater performance in a specific region of classification that is of interest, such as a low false positive rate or a low false negative rate. A two-stage classification model can be trained and employed, where the first stage classifier is optimized over the entire classification region and the second stage classifier is optimized for the specific region of interest. During training, the entire set of training data is employed by the first stage classifier. Only data that is classified by the first stage classifier, or by cross validation, to fall within the region of interest is used to train the second stage classifier. During classification, data that the first stage classifier does not place within the region of interest is given the first stage classifier's classification value; otherwise the classification value from the second stage classifier is used.
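
The two-stage routing can be sketched as below. The sketch assumes the reading in which the second stage scores instances that the first stage places in the region of interest, which matches training the second stage only on such data. The scoring functions, the 0.8 threshold, and the toy examples are hypothetical stand-ins, not details from the patent.

```python
def select_second_stage_data(examples, first_stage, in_region):
    """Keep only the training examples that stage one places in the region of interest."""
    return [(x, y) for x, y in examples if in_region(first_stage(x))]

def cascade_predict(x, first_stage, second_stage, in_region):
    """Use the stage-two score inside the region of interest, the stage-one score elsewhere."""
    s1 = first_stage(x)
    return second_stage(x) if in_region(s1) else s1

# Hypothetical stand-ins for trained classifiers; each returns a score in [0, 1].
def first_stage(x):
    return x[0]

def second_stage(x):
    return 0.5 * x[0] + 0.5 * x[1]

def in_region(score):
    # Assumed region of interest: high stage-one scores, where false positives matter most.
    return score >= 0.8

examples = [((0.9, 0.1), 0), ((0.3, 0.9), 0), ((0.95, 0.9), 1)]
stage_two_training = select_second_stage_data(examples, first_stage, in_region)

inside = cascade_predict((0.9, 0.1), first_stage, second_stage, in_region)
outside = cascade_predict((0.3, 0.9), first_stage, second_stage, in_region)
print(len(stage_two_training), inside, outside)
```

Only the two examples with a stage-one score of at least 0.8 reach the second stage's training set, so that classifier can specialize on the narrow region where errors are most costly.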

    CLICKTHROUGH-BASED LATENT SEMANTIC MODEL
    3.
    Patent application
    Status: in force

    Publication No.: US20130159320A1

    Publication date: 2013-06-20

    Application No.: US13329345

    Filing date: 2011-12-19

    IPC classification: G06F17/30

    CPC classification: G06F17/30867

    Abstract: A computer-implemented method and system for ranking documents are provided. The method includes identifying a number of query-document pairs based on clickthrough data for a number of documents. The method also includes building a latent semantic model based on the query-document pairs and ranking the documents for a search based on the latent semantic model.

    Web document keyword and phrase extraction
    4.
    Granted patent
    Status: in force

    Publication No.: US08135728B2

    Publication date: 2012-03-13

    Application No.: US11619230

    Filing date: 2007-01-03

    IPC classification: G06F7/00 G06F17/30 G06F13/14

    Abstract: Extraction analysis techniques biased, in part, by query frequency information from a query log file and/or search engine cache are employed along with machine learning processes to determine candidate keywords and/or phrases of web documents. Web-oriented features associated with the candidate keywords and/or phrases are also utilized to analyze the web documents. A keyword and/or phrase extraction mechanism can be utilized to score keywords and/or phrases in a web document and to estimate the likelihood that the keywords and/or phrases are relevant, for example, in an advertising system and the like.
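
A rough sketch of query-frequency-biased phrase scoring follows. The feature set, the logistic scoring form, the hand-picked weights, and the sample data are all assumptions made for illustration; the abstract does not specify them, and a real system would learn the weights from labeled data.

```python
import math
from collections import Counter

def candidates(text, max_len=2):
    """Count every word n-gram of up to max_len words as a candidate phrase."""
    words = text.lower().split()
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

def relevance(phrase, count, title, query_freq, weights):
    """Logistic score in (0, 1) computed from three illustrative features."""
    features = {
        "tf": math.log1p(count),                                  # frequency in the document
        "in_title": 1.0 if phrase in title.lower() else 0.0,      # substring test on the title
        "query_freq": math.log1p(query_freq.get(phrase, 0)),      # frequency in the query log
    }
    z = weights["bias"] + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical inputs: page text, page title, query-log counts, and weights.
text = "cheap flights to paris book cheap flights today"
title = "Cheap flights"
query_freq = {"cheap flights": 5000}
weights = {"bias": -2.0, "tf": 1.0, "in_title": 1.0, "query_freq": 0.5}

scores = {phrase: relevance(phrase, count, title, query_freq, weights)
          for phrase, count in candidates(text).items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

The query-log feature is what separates "cheap flights" from the equally frequent single words "cheap" and "flights": all three appear twice and match the title, but only the full phrase appears in the query log.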

    Using IP address and domain for email spam filtering
    5.
    Granted patent
    Status: in force

    Publication No.: US07689652B2

    Publication date: 2010-03-30

    Application No.: US11031672

    Filing date: 2005-01-07

    IPC classification: G06F15/16 G06F15/173

    Abstract: Email spam filtering is performed based on a combination of IP address and domain. When an email message is received, an IP address and a domain associated with the email message are determined. A cross product of the IP address (or portions of the IP address) and the domain (or portions of the domain) is calculated. If the email message is known to be either spam or non-spam, then a spam score based on the known spam status is stored in association with each (IP address, domain) pair element of the cross product. If the spam status of the email message is not known, then the (IP address, domain) pair elements of the cross product are used to look up previously determined spam scores. A combination of the previously determined spam scores is used to determine whether or not to treat the received email message as spam.
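
The cross-product bookkeeping can be sketched as follows. Several details are assumptions made for illustration and are not taken from the patent: IPv4 addresses only, prefixes of 4, 3, and 2 octets as the "portions" of an IP address, domain suffixes as the "portions" of a domain, a smoothed spam rate as the stored score, and a plain average as the combination rule.

```python
from collections import defaultdict
from itertools import product

def ip_portions(ip):
    """Full IPv4 address plus its 3-octet and 2-octet prefixes."""
    octets = ip.split(".")
    return [".".join(octets[:n]) for n in (4, 3, 2)]

def domain_portions(domain):
    """The domain and its parent suffixes, excluding the bare top-level label."""
    labels = domain.lower().split(".")
    return [".".join(labels[i:]) for i in range(max(1, len(labels) - 1))]

class PairScores:
    def __init__(self):
        # (ip portion, domain portion) -> [spam count, total count]
        self.counts = defaultdict(lambda: [0, 0])

    def pairs(self, ip, domain):
        """Cross product of IP portions and domain portions."""
        return list(product(ip_portions(ip), domain_portions(domain)))

    def update(self, ip, domain, is_spam):
        """Record a message whose spam status is known."""
        for pair in self.pairs(ip, domain):
            self.counts[pair][0] += int(is_spam)
            self.counts[pair][1] += 1

    def score(self, ip, domain):
        """Average smoothed spam rate over previously seen pairs; 0.5 if none are known."""
        known = [self.counts[p] for p in self.pairs(ip, domain) if p in self.counts]
        if not known:
            return 0.5
        rates = [(spam + 1) / (total + 2) for spam, total in known]
        return sum(rates) / len(rates)

# Toy data using documentation-only IP ranges and the reserved .example domain.
table = PairScores()
for _ in range(3):
    table.update("203.0.113.7", "mail.bad.example", is_spam=True)
    table.update("198.51.100.9", "good.example", is_spam=False)

spammy = table.score("203.0.113.99", "x.bad.example")
clean = table.score("198.51.100.9", "good.example")
unknown = table.score("192.0.2.1", "new.example")
print(spammy, clean, unknown)
```

The first lookup shows why portions help: neither the exact address nor the exact host name was seen in training, yet the shared address prefix and parent domain still yield a high score.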

    Clickthrough-based latent semantic model
    7.
    Granted patent
    Status: in force

    Publication No.: US09009148B2

    Publication date: 2015-04-14

    Application No.: US13329345

    Filing date: 2011-12-19

    IPC classification: G06F17/30

    CPC classification: G06F17/30867

    Abstract: A computer-implemented method and system for ranking documents are provided. The method includes identifying a number of query-document pairs based on clickthrough data for a number of documents. The method also includes building a latent semantic model based on the query-document pairs and ranking the documents for a search based on the latent semantic model.

    Consistent phrase relevance measures
    8.
    Granted patent
    Status: in force

    Publication No.: US08996515B2

    Publication date: 2015-03-31

    Application No.: US13609257

    Filing date: 2012-09-11

    IPC classification: G06F7/00 G06F17/30 G06Q30/02

    CPC classification: G06F17/30687 G06Q30/02

    Abstract: Two methods for measuring keyword-document relevance are described. The methods receive a keyword and a document as input and output a probability value for the keyword. The first method is a similarity-based approach which uses techniques for measuring similarity between two short-text segments to measure relevance between the keyword and the document. The second method is a regression-based approach based on the assumption that if an out-of-document phrase (the keyword) is semantically similar to an in-document phrase, then the relevance scores of the in-document and out-of-document phrases should be close to each other.
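
The assumption behind the second method can be illustrated with a similarity-weighted average of in-document phrase scores. This is only a sketch: word-level Jaccard overlap is a crude stand-in for a semantic similarity measure, the weighted-average form is an assumed simplification of the regression approach, and the phrase scores are invented.

```python
def word_jaccard(a, b):
    """Jaccard overlap of the word sets of two phrases (0.0 if either is empty)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def out_of_document_relevance(keyword, in_doc_scores, similarity=word_jaccard):
    """Similarity-weighted average of the relevance scores of in-document phrases."""
    weighted = [(similarity(keyword, phrase), score)
                for phrase, score in in_doc_scores.items()]
    total = sum(weight for weight, _ in weighted)
    if total == 0.0:
        return 0.0  # nothing in the document resembles the keyword
    return sum(weight * score for weight, score in weighted) / total

# Hypothetical relevance scores for phrases that occur in the document.
in_doc_scores = {"digital camera": 0.9, "memory card": 0.4, "shipping policy": 0.1}

close = out_of_document_relevance("camera lens", in_doc_scores)
far = out_of_document_relevance("return form", in_doc_scores)
print(close, far)
```

Here "camera lens" overlaps only with "digital camera" and therefore inherits a score near 0.9, while a keyword sharing no words with any in-document phrase falls back to 0.0.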

    CONSISTENT PHRASE RELEVANCE MEASURES
    9.
    Patent application
    Status: in force

    Publication No.: US20120330978A1

    Publication date: 2012-12-27

    Application No.: US13609257

    Filing date: 2012-09-11

    IPC classification: G06F17/30

    CPC classification: G06F17/30687 G06Q30/02

    Abstract: Two methods for measuring keyword-document relevance are described. The methods receive a keyword and a document as input and output a probability value for the keyword. The first method is a similarity-based approach which uses techniques for measuring similarity between two short-text segments to measure relevance between the keyword and the document. The second method is a regression-based approach based on the assumption that if an out-of-document phrase (the keyword) is semantically similar to an in-document phrase, then the relevance scores of the in-document and out-of-document phrases should be close to each other.

    PERSONALIZED EMAIL FILTERING
    10.
    Patent application
    Status: under examination (published)

    Publication No.: US20100211641A1

    Publication date: 2010-08-19

    Application No.: US12371695

    Filing date: 2009-02-16

    IPC classification: G06F15/16

    CPC classification: G06F15/16 G06Q10/107

    Abstract: Techniques and systems are described that utilize a scalable, “light-weight” user model, which can be combined with a traditional global email spam filter, to determine whether an email message sent to a target user is a desired email. A global email model is trained with a set of email messages to detect desired emails, and a user email model is also trained to detect desired emails. Training the user email model may comprise one or more of: using labeled training emails; using target user-based information; and using information from the global email model. Global and user model scores for an email sent to a target user can be combined to produce an email score. The email score can be compared with a desired-email threshold to determine whether the email message sent to the target user is desired.
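
The final scoring step can be sketched as below. The linear interpolation, the `user_weight` parameter, and the 0.6 threshold are assumptions made for illustration; the abstract states only that the two scores are combined and compared with a threshold.

```python
def combine_scores(global_score, user_score, user_weight=0.5):
    """Linear interpolation between the global and per-user 'desired email' scores."""
    if not 0.0 <= user_weight <= 1.0:
        raise ValueError("user_weight must be in [0, 1]")
    return (1.0 - user_weight) * global_score + user_weight * user_score

def is_desired(global_score, user_score, threshold=0.6, user_weight=0.5):
    """Treat the message as desired when the combined score reaches the threshold."""
    return combine_scores(global_score, user_score, user_weight) >= threshold

# Hypothetical scores: the global filter is doubtful, but this user usually wants such mail.
combined = combine_scores(0.4, 0.9)
kept = is_desired(0.4, 0.9)
dropped = is_desired(0.4, 0.9, user_weight=0.0)  # ignoring the user model
print(combined, kept, dropped)
```

With the user model ignored, the same message falls below the threshold, which is the kind of case personalization is meant to correct.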
