-
Publication No.: US07930353B2
Publication date: 2011-04-19
Application No.: US11193691
Filing date: 2005-07-29
Applicant: David M. Chickering, Geoffrey J. Hulten, Robert L. Rounthwaite, Christopher A. Meek, David E. Heckerman, Joshua T. Goodman
Inventor: David M. Chickering, Geoffrey J. Hulten, Robert L. Rounthwaite, Christopher A. Meek, David E. Heckerman, Joshua T. Goodman
IPC classification: G06F15/16
CPC classification: H04L51/12
Abstract: Decision trees populated with classifier models are leveraged to provide enhanced spam detection utilizing separate email classifiers for each feature of an email. This provides a higher probability of spam detection by tailoring each classifier model so that spam is determined more accurately on a feature-by-feature basis. Classifiers can be constructed based on linear models such as, for example, logistic-regression models and/or support vector machines (SVM) and the like. The classifiers can also be constructed based on decision trees. “Compound features” based on internal and/or external nodes of a decision tree can be utilized to provide linear classifier models as well. Smoothing of the spam detection results can be achieved by utilizing classifier models from other nodes within the decision tree if training data is sparse. This forms a base model for branches of a decision tree that may not have received substantial training data.
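The smoothing idea in the abstract above can be illustrated with a minimal sketch. It assumes one logistic-regression model per tree node and blends each node's prediction with the prediction carried down from its ancestors, trusting a node less when few training messages reached it. The `Node` class, the `prior_strength` parameter, and the blending formula are hypothetical choices made for illustration; the abstract does not specify them.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class Node:
    weights: Dict[str, float]            # linear model stored at this node
    bias: float = 0.0
    n_train: int = 0                     # training messages that reached this node
    split: Optional[Callable[[Dict[str, float]], bool]] = None  # None marks a leaf
    yes: Optional["Node"] = None
    no: Optional["Node"] = None

    def prob(self, x: Dict[str, float]) -> float:
        """Logistic-regression estimate of P(spam | x) from this node's model."""
        z = self.bias + sum(w * x.get(k, 0.0) for k, w in self.weights.items())
        return 1.0 / (1.0 + math.exp(-z))

def spam_probability(root: Node, x: Dict[str, float], prior_strength: float = 50.0) -> float:
    """Walk from root to leaf, shrinking each node's estimate toward its ancestors'.

    Assumed smoothing rule: weight lam = n / (n + prior_strength), so a node
    with no training data simply inherits the estimate from above.
    """
    node, p = root, root.prob(x)
    while node.split is not None:
        node = node.yes if node.split(x) else node.no
        lam = node.n_train / (node.n_train + prior_strength)
        p = lam * node.prob(x) + (1.0 - lam) * p
    return p
```

With this rule, a branch that received no training data falls back entirely on the model of the nearest ancestor that did, which is one way to read the "base model" described in the abstract.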
-
Publication No.: US08140569B2
Publication date: 2012-03-20
Application No.: US10447462
Filing date: 2003-05-29
CPC classification: G06F17/30687, G06F17/30536, G06N7/005, G06Q30/00
Abstract: A dependency network is created from a training data set utilizing a scalable method. A statistical model (or pattern), such as, for example, a Bayesian network, is then constructed to allow more convenient inference. The model (or pattern) is employed in lieu of the training data set for data access. The computational complexity of the method that produces the model (or pattern) is independent of the size of the original data set. The dependency network directly returns data that is explicitly encoded in its conditional probability distributions. Non-explicitly encoded data is generated via Gibbs sampling, approximated, or ignored.
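The distinction between explicitly and non-explicitly encoded data can be sketched on a toy case. The example assumes a dependency network over two binary variables, A and B, each with a conditional probability table given the other. The table values, the `lookup` and `gibbs_marginal` names, and the sample counts are illustrative assumptions, not taken from the patent.

```python
import random

# Conditional probability tables of a hypothetical two-variable dependency network.
COND = {
    "A": {0: 0.2, 1: 0.8},   # P(A=1 | B=b), indexed by b
    "B": {0: 0.2, 1: 0.8},   # P(B=1 | A=a), indexed by a
}

def lookup(var: str, other_value: int) -> float:
    """Explicitly encoded query: read the conditional straight from the table."""
    return COND[var][other_value]

def gibbs_marginal(var: str, n_samples: int = 20000, burn_in: int = 500, seed: int = 0) -> float:
    """Non-explicit query P(var=1): estimate it by Gibbs sampling.

    Each sweep resamples A given the current B, then B given the new A.
    """
    rng = random.Random(seed)
    state = {"A": 0, "B": 0}
    hits = 0
    for i in range(burn_in + n_samples):
        state["A"] = int(rng.random() < COND["A"][state["B"]])
        state["B"] = int(rng.random() < COND["B"][state["A"]])
        if i >= burn_in:
            hits += state[var]
    return hits / n_samples
```

Here `lookup("A", 1)` answers P(A=1 | B=1) directly, while the marginal P(A=1) is stored nowhere in the tables and has to be estimated; for these symmetric tables the estimate should settle near 0.5.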
-
Publication No.: US07831627B2
Publication date: 2010-11-09
Application No.: US11324960
Filing date: 2006-01-03
IPC classification: G06F17/30
CPC classification: G06F17/30687, G06F17/30536, G06N7/005, G06Q30/00
Abstract: A dependency network is created from a training data set utilizing a scalable method. A statistical model (or pattern), such as, for example, a Bayesian network, is then constructed to allow more convenient inference. The model (or pattern) is employed in lieu of the training data set for data access. The computational complexity of the method that produces the model (or pattern) is independent of the size of the original data set. The dependency network directly returns data that is explicitly encoded in its conditional probability distributions. Non-explicitly encoded data is generated via Gibbs sampling, approximated, or ignored.
-
Publication No.: US08533270B2
Publication date: 2013-09-10
Application No.: US10601741
Filing date: 2003-06-23
Applicant: Bryan T. Starbuck, Robert L. Rounthwaite, David E. Heckerman, Joshua T. Goodman, Eliot C. Gillum, Nathan D. Howell, Kenneth R. Aldinger
Inventor: Bryan T. Starbuck, Robert L. Rounthwaite, David E. Heckerman, Joshua T. Goodman, Eliot C. Gillum, Nathan D. Howell, Kenneth R. Aldinger
IPC classification: G06F15/16
CPC classification: G06F17/3061, G06Q10/107, H04L51/12
Abstract: The subject invention provides for an advanced and robust system and method that facilitates detecting spam. The system and method include components as well as other operations which enhance or promote finding characteristics that are difficult for the spammer to avoid and finding characteristics in non-spam that are difficult for spammers to duplicate. Exemplary characteristics include examining origination features in pairs, analyzing character and/or number sequences, strings, and sub-strings, detecting various entropy levels of one or more character sequences, strings and/or sub-strings, as well as analyzing message and/or feature sizes.
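One of the characteristics named above, the entropy level of a character sequence, can be illustrated with Shannon entropy over character frequencies. This is a minimal sketch; the abstract does not say which entropy measure is used, and the `char_entropy` and `string_features` names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict

def char_entropy(s: str) -> float:
    """Shannon entropy, in bits per character, of the characters in s."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

def string_features(s: str) -> Dict[str, float]:
    """Assumed feature set: entropy plus size, as a filter input."""
    return {"entropy": char_entropy(s), "length": float(len(s))}
```

A random-looking token such as `xq7zk2vb` scores higher than a repetitive one such as `aaaaaaaa`, which is the kind of signal a filter could use for sender names or subject lines.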
-
Publication No.: US07483947B2
Publication date: 2009-01-27
Application No.: US10428649
Filing date: 2003-05-02
IPC classification: G06F15/16
CPC classification: G06Q10/107, H04L51/12
Abstract: Architecture for detecting and removing obfuscating clutter from the subject and/or body of a message, e.g., e-mail, prior to filtering of the message, to identify junk messages commonly referred to as spam. The technique utilizes the powerful features built into an HTML rendering engine to strip the HTML instructions for all non-substantive aspects of the message. Pre-processing includes pre-rendering the message into a final format, namely the format that the rendering engine displays to the user. The final-format message is then converted to a text-only format to remove graphics, color, non-text decoration, and spacing that cannot be rendered as ASCII-style or Unicode-style characters. The result is essentially to reduce each message to its common denominator essentials so that the junk mail filter can view each message on an equal basis.
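The pre-processing step can be approximated with a short sketch. It uses Python's standard `html.parser` as a stand-in for the HTML rendering engine named in the abstract, which is a simplifying assumption: a real rendering engine would also resolve styling, such as text hidden by CSS, and this parser does not. The `TextOnly` class and `to_plain_text` function are hypothetical names.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collects only the text a reader would see, dropping tags and comments."""
    SKIP = {"script", "style"}           # element content that is never displayed

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def to_plain_text(html: str) -> str:
    """Reduce an HTML message body to whitespace-normalized plain text."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

For example, a word split by a comment and an empty tag, `V<!-- x -->iag<b></b>ra`, comes out as the single token `Viagra`, so a downstream filter sees the same text as the recipient.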
-
Publication No.: US20100088380A1
Publication date: 2010-04-08
Application No.: US12359126
Filing date: 2009-01-23
CPC classification: G06Q10/107, H04L51/12
Abstract: Architecture for detecting and removing obfuscating clutter from the subject and/or body of a message, e.g., e-mail, prior to filtering of the message, to identify junk messages commonly referred to as spam. The technique utilizes the powerful features built into an HTML rendering engine to strip the HTML instructions for all non-substantive aspects of the message. Pre-processing includes pre-rendering the message into a final format, namely the format that the rendering engine displays to the user. The final-format message is then converted to a text-only format to remove graphics, color, non-text decoration, and spacing that cannot be rendered as ASCII-style or Unicode-style characters. The result is essentially to reduce each message to its common denominator essentials so that the junk mail filter can view each message on an equal basis.
-
Publication No.: US07640313B2
Publication date: 2009-12-29
Application No.: US11779263
Filing date: 2007-07-17
IPC classification: G06F15/16, G06F15/173
CPC classification: G06Q10/107, H04L51/12
Abstract: The invention relates to a system for filtering messages. The system includes a seed filter having associated therewith a false positive rate and a false negative rate. A new filter is also provided for filtering the messages. The new filter is evaluated against the false positive rate and the false negative rate of the seed filter: the data used to determine the seed filter's false positive rate and false negative rate are utilized to determine a new false positive rate and a new false negative rate of the new filter as a function of threshold. The new filter is employed in lieu of the seed filter if a threshold exists for the new filter such that the new false positive rate and new false negative rate are together considered better than the false positive rate and the false negative rate of the seed filter.
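The replacement test described above can be sketched as a threshold search over the same labeled data that produced the seed filter's rates. The abstract leaves "together considered better" undefined; this sketch assumes it means the new filter is no worse on both rates and strictly better on at least one. The function names are hypothetical, and the data is assumed to contain both spam and legitimate messages.

```python
from typing import List, Optional, Tuple

def rates(scores: List[float], labels: List[int], threshold: float) -> Tuple[float, float]:
    """Return (false positive rate, false negative rate) at a threshold.

    labels: 1 = spam, 0 = legitimate. A message is flagged when score >= threshold.
    Assumes both classes are present, so neither denominator is zero.
    """
    n_good = sum(1 for y in labels if y == 0)
    n_spam = len(labels) - n_good
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    return fp / n_good, fn / n_spam

def find_better_threshold(new_scores: List[float], labels: List[int],
                          seed_fp: float, seed_fn: float) -> Optional[float]:
    """Lowest threshold at which the new filter beats the seed filter, else None."""
    for t in sorted(set(new_scores)):
        fp, fn = rates(new_scores, labels, t)
        no_worse = fp <= seed_fp and fn <= seed_fn
        strictly_better = fp < seed_fp or fn < seed_fn
        if no_worse and strictly_better:
            return t
    return None
```

If `find_better_threshold` returns a value, the new filter would be deployed at that threshold; if it returns `None`, the seed filter stays in place.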
-
Publication No.: US07249162B2
Publication date: 2007-07-24
Application No.: US10374005
Filing date: 2003-02-25
IPC classification: G06F15/16
CPC classification: G06Q10/107, H04L51/12
Abstract: The invention relates to a system for filtering messages. The system includes a seed filter having associated therewith a false positive rate and a false negative rate. A new filter is also provided for filtering the messages. The new filter is evaluated against the false positive rate and the false negative rate of the seed filter: the data used to determine the seed filter's false positive rate and false negative rate are utilized to determine a new false positive rate and a new false negative rate of the new filter as a function of threshold. The new filter is employed in lieu of the seed filter if a threshold exists for the new filter such that the new false positive rate and new false negative rate are together considered better than the false positive rate and the false negative rate of the seed filter.
-
Publication No.: US08250159B2
Publication date: 2012-08-21
Application No.: US12359126
Filing date: 2009-01-23
CPC classification: G06Q10/107, H04L51/12
Abstract: Architecture for detecting and removing obfuscating clutter from the subject and/or body of a message, e.g., e-mail, prior to filtering of the message, to identify junk messages commonly referred to as spam. The technique utilizes the powerful features built into an HTML rendering engine to strip the HTML instructions for all non-substantive aspects of the message. Pre-processing includes pre-rendering the message into a final format, namely the format that the rendering engine displays to the user. The final-format message is then converted to a text-only format to remove graphics, color, non-text decoration, and spacing that cannot be rendered as ASCII-style or Unicode-style characters. The result is essentially to reduce each message to its common denominator essentials so that the junk mail filter can view each message on an equal basis.
-
Publication No.: US07558832B2
Publication date: 2009-07-07
Application No.: US11743466
Filing date: 2007-05-02
Applicant: Robert L. Rounthwaite, Joshua T. Goodman, David E. Heckerman, John D. Mehr, Nathan D. Howell, Micah C. Rupersburg, Dean A. Slawson
Inventor: Robert L. Rounthwaite, Joshua T. Goodman, David E. Heckerman, John D. Mehr, Nathan D. Howell, Micah C. Rupersburg, Dean A. Slawson
IPC classification: G06F15/16
CPC classification: H04L51/12, G06Q10/107
Abstract: The subject invention provides for a feedback loop system and method that facilitate classifying items in connection with spam prevention in server and/or client-based architectures. The invention makes use of a machine-learning approach as applied to spam filters, and in particular, randomly samples incoming email messages so that examples of both legitimate and junk/spam mail are obtained to generate sets of training data. Users who are identified as spam-fighters are asked to vote on whether each message in a selection of their incoming email is legitimate mail or junk mail. A database stores the properties for each mail and voting transaction, such as user information, message properties and content summary, and polling results for each message, to generate training data for machine learning systems. The machine learning systems facilitate creating improved spam filter(s) that are trained to recognize both legitimate mail and spam mail and to distinguish between them.
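The sampling and voting steps of the feedback loop can be sketched as follows. The sketch assumes messages arrive as `(user, message_id, body)` tuples and that sampling happens before any filtering, so that both legitimate and junk mail reach the poll. The `PollRecord` fields, the 80-character summary, and the function names are hypothetical stand-ins for the database properties the abstract mentions.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class PollRecord:
    user: str
    message_id: str
    summary: str                      # short content summary kept for training
    vote: Optional[bool] = None       # True = junk, False = legitimate, None = no answer yet

def select_for_polling(messages: List[Tuple[str, str, str]],
                       sample_rate: float,
                       rng: random.Random) -> List[PollRecord]:
    """Randomly pick incoming messages to send to spam-fighter users for a vote."""
    return [PollRecord(user, mid, body[:80])
            for user, mid, body in messages
            if rng.random() < sample_rate]

def training_set(records: List[PollRecord]) -> List[Tuple[str, bool]]:
    """Keep only answered polls as (summary, is_junk) training examples."""
    return [(r.summary, r.vote) for r in records if r.vote is not None]
```

The resulting `(summary, is_junk)` pairs are what a machine learning system would consume when training an improved filter.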
-