Architecture for automated data analysis
    2.
    Granted invention patent
    Status: in force

    Publication No.: US06330563B1

    Publication date: 2001-12-11

    Application No.: US09298717

    Filing date: 1999-04-23

    IPC classification: G06F17/30

    Abstract: An architecture for automated data analysis. In one embodiment, a computerized system comprises an automated problem formulation layer, a first learning engine, and a second learning engine. The automated problem formulation layer receives a data set. The data set has a plurality of records, where each record has a value for each of a plurality of raw transactional variables. The layer abstracts the raw transactional variables into cooked transactional variables. The first learning engine generates a model for the cooked transactional variables, while the second learning engine generates a model for the raw transactional variables.
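
    The data flow in this abstract can be sketched as follows. This is a minimal illustration only: the abstract does not say how raw variables are abstracted or what the learning engines compute, so the category mapping `abstraction`, the summing of values per category, and the frequency-based stand-in engine `learn_frequencies` are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def formulate(records, abstraction):
    """Hypothetical problem formulation layer.

    Maps every raw transactional variable to a coarser "cooked" variable
    (assumed here to be a category) and sums the raw values per category.
    """
    cooked_records = []
    for record in records:
        cooked = defaultdict(int)
        for raw_var, value in record.items():
            cooked[abstraction[raw_var]] += value
        cooked_records.append(dict(cooked))
    return cooked_records

def learn_frequencies(records):
    """Stand-in learning engine: fraction of records in which each variable is nonzero."""
    counts = Counter()
    for record in records:
        for var, value in record.items():
            if value:
                counts[var] += 1
    return {var: count / len(records) for var, count in counts.items()}

# Each record has a value for each raw transactional variable.
records = [
    {"milk": 1, "cheese": 0, "bread": 0},
    {"milk": 0, "cheese": 2, "bread": 1},
    {"milk": 0, "cheese": 0, "bread": 1},
]
abstraction = {"milk": "dairy", "cheese": "dairy", "bread": "bakery"}

cooked = formulate(records, abstraction)
cooked_model = learn_frequencies(cooked)   # first engine: cooked variables
raw_model = learn_frequencies(records)     # second engine: raw variables
```

    The point of the sketch is only the split: one model is built over the abstracted variables and a separate one over the raw variables, from the same data set.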

    Methods and apparatus for performing speech recognition using acoustic models which are improved through an interactive process
    3.
    Granted invention patent
    Status: in force

    Publication No.: US06263308B1

    Publication date: 2001-07-17

    Application No.: US09531055

    Filing date: 2000-03-20

    IPC classification: G10L15/02

    CPC classification: G10L15/063

    Abstract: Automated methods and apparatus for synchronizing audio and text data, e.g., in the form of electronic files, representing audio and text expressions of the same work or information are described. Also described are automated methods of detecting errors and other discrepancies between the audio and text versions of the same work. A speech recognition operation is performed on the audio data, initially using a speaker independent acoustic model. The speech recognition operation produces the recognized text along with audio time stamps. The recognized text is compared to the text in the text data to identify correctly recognized words. The acoustic model is then retrained using the correctly recognized text and the corresponding audio segments from the audio data, transforming the initial acoustic model into a speaker trained acoustic model. The retrained acoustic model is then used to perform an additional speech recognition operation on the audio data. The audio and text data are synchronized using the results of the updated acoustic model. In addition, one or more error reports based on the final recognition results are generated, showing discrepancies between the recognized words and the words included in the text. By retraining the acoustic model in the above-described manner, improved accuracy is achieved.
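
    The comparison step (recognized text against the text data) might look like the sketch below. It assumes the recognizer output is a list of `(word, start_sec, end_sec)` tuples and uses Python's standard `difflib.SequenceMatcher` for the alignment; the patent abstract does not name an alignment method, and the function name `align` is hypothetical.

```python
from difflib import SequenceMatcher

def align(recognized, reference):
    """Align recognizer output with the reference text.

    recognized: list of (word, start_sec, end_sec) tuples (assumed format).
    reference:  list of words from the text version of the work.
    Returns (matches, discrepancies).
    """
    rec_words = [word.lower() for word, _, _ in recognized]
    ref_words = [word.lower() for word in reference]
    matcher = SequenceMatcher(a=rec_words, b=ref_words, autojunk=False)

    matches, discrepancies = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Correctly recognized words keep their audio time stamps, so they
            # can serve both as retraining data and as synchronization points.
            for offset in range(i2 - i1):
                matches.append((j1 + offset, recognized[i1 + offset]))
        else:
            # Everything else is a candidate entry for the error report.
            discrepancies.append((rec_words[i1:i2], ref_words[j1:j2]))
    return matches, discrepancies

recognized = [("the", 0.0, 0.2), ("quick", 0.2, 0.5),
              ("brown", 0.5, 0.8), ("box", 0.8, 1.1)]
reference = ["The", "quick", "brown", "fox"]
matches, discrepancies = align(recognized, reference)
```

    In the loop described by the abstract, the matched words and their audio segments would feed retraining, and the alignment would be repeated on the output of the retrained model.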

    Advanced spam detection techniques
    4.
    Granted invention patent
    Status: in force

    Publication No.: US08533270B2

    Publication date: 2013-09-10

    Application No.: US10601741

    Filing date: 2003-06-23

    IPC classification: G06F15/16

    Abstract: The subject invention provides for an advanced and robust system and method that facilitates detecting spam. The system and method include components as well as other operations which enhance or promote finding characteristics that are difficult for the spammer to avoid and finding characteristics in non-spam that are difficult for spammers to duplicate. Exemplary characteristics include examining origination features in pairs, analyzing character and/or number sequences, strings, and sub-strings, detecting various entropy levels of one or more character sequences, strings and/or sub-strings as well as analyzing message and/or feature sizes.
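
    One of the characteristics listed, the entropy level of a character sequence, can be illustrated with ordinary Shannon entropy over character frequencies. This is a generic sketch, not the specific measure claimed in the patent; the function name `char_entropy` is an assumption.

```python
import math
from collections import Counter

def char_entropy(text):
    """Shannon entropy, in bits per character, of the characters in `text`."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Random-looking strings score higher than ordinary words.
random_like = char_entropy("xk7qz9")
ordinary = char_entropy("hello")
```

    A filter could use such a value as one feature among many, for example to flag subject lines padded with random characters.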

    Trees of classifiers for detecting email spam
    5.
    Granted invention patent
    Status: in force

    Publication No.: US07930353B2

    Publication date: 2011-04-19

    Application No.: US11193691

    Filing date: 2005-07-29

    IPC classification: G06F15/16

    CPC classification: H04L51/12

    Abstract: Decision trees populated with classifier models are leveraged to provide enhanced spam detection utilizing separate email classifiers for each feature of an email. This provides a higher probability of spam detection through tailoring of each classifier model to facilitate more accurate determination of spam on a feature-by-feature basis. Classifiers can be constructed based on linear models such as, for example, logistic-regression models and/or support vector machines (SVM) and the like. The classifiers can also be constructed based on decision trees. “Compound features” based on internal and/or external nodes of a decision tree can be utilized to provide linear classifier models as well. Smoothing of the spam detection results can be achieved by utilizing classifier models from other nodes within the decision tree if training data is sparse. This forms a base model for branches of a decision tree that may not have received substantial training data.
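
    The idea of falling back on models from other nodes when training data is sparse can be sketched as a simple back-off: each node of the tree stores a logistic-regression model, and a message is scored by the deepest node on its path that saw enough training data. The `Node` layout, the `min_train` cutoff, and the hard back-off rule are assumptions for illustration; the abstract does not specify how the smoothing is done.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class Node:
    weights: Dict[str, float]                       # linear model stored at this node
    bias: float = 0.0
    n_train: int = 0                                # training messages that reached this node
    test: Optional[Callable[[dict], bool]] = None   # None marks a leaf
    yes: Optional["Node"] = None
    no: Optional["Node"] = None

def score(node, features):
    """Logistic-regression probability from the node's linear model."""
    z = node.bias + sum(w * features.get(name, 0.0) for name, w in node.weights.items())
    return 1.0 / (1.0 + math.exp(-z))

def spam_probability(root, features, min_train=50):
    """Walk to a leaf, then back off toward the root until a node has enough data."""
    path = [root]
    node = root
    while node.test is not None:
        node = node.yes if node.test(features) else node.no
        path.append(node)
    for candidate in reversed(path):
        if candidate.n_train >= min_train:
            return score(candidate, features)
    return score(root, features)  # nothing qualified: use the base model

root = Node(
    weights={"free": 1.0}, n_train=1000,
    test=lambda f: f.get("has_html", 0.0) > 0,
    yes=Node(weights={"free": 2.0}, n_train=500),   # well-trained leaf
    no=Node(weights={"free": 3.0}, n_train=5),      # sparse leaf
)
p_html = spam_probability(root, {"free": 1.0, "has_html": 1.0})  # scored by the leaf
p_plain = spam_probability(root, {"free": 1.0})                  # backs off to the root
```

    A softer scheme, such as interpolating between leaf and ancestor scores, would fit the same description.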

    Message rendering for identification of content features
    6.
    Granted invention patent
    Status: in force

    Publication No.: US07483947B2

    Publication date: 2009-01-27

    Application No.: US10428649

    Filing date: 2003-05-02

    IPC classification: G06F15/16

    CPC classification: G06Q10/107 H04L51/12

    Abstract: Architecture for detecting and removing obfuscating clutter from the subject and/or body of a message, e.g., e-mail, prior to filtering of the message, to identify junk messages commonly referred to as spam. The technique utilizes the powerful features built into an HTML rendering engine to strip the HTML instructions for all non-substantive aspects of the message. Pre-processing includes pre-rendering of the message into a final format, which is the format that the rendering engine displays to the user. The final format message is then converted to a text-only format to remove graphics, color, non-text decoration, and spacing that cannot be rendered as ASCII-style or Unicode-style characters. The result is essentially to reduce each message to its common denominator essentials so that the junk mail filter can view each message on an equal basis.
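
    A rough approximation of the text-only conversion is sketched below using Python's standard `html.parser`. Note the assumption: this parser is not an HTML rendering engine, so it only strips markup, comments, and script/style content; it does not account for layout, invisible text, or other effects a real renderer would resolve. The names `TextOnly` and `to_text` are hypothetical.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collects character data, skipping markup, comments, and script/style bodies."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def to_text(html):
    """Return the message text with tags removed and whitespace collapsed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())

# Words broken up by comments and empty tags come back together.
cleaned = to_text("<p>Fr<!-- junk -->ee <b>mo</b>ney</p>")
```

    The cleaned text, rather than the raw HTML, would then be passed to the junk mail filter.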

    MESSAGE RENDERING FOR IDENTIFICATION OF CONTENT FEATURES
    8.
    Invention patent application
    Status: in force

    Publication No.: US20100088380A1

    Publication date: 2010-04-08

    Application No.: US12359126

    Filing date: 2009-01-23

    IPC classification: G06F21/00 G06F15/16

    CPC classification: G06Q10/107 H04L51/12

    Abstract: Architecture for detecting and removing obfuscating clutter from the subject and/or body of a message, e.g., e-mail, prior to filtering of the message, to identify junk messages commonly referred to as spam. The technique utilizes the powerful features built into an HTML rendering engine to strip the HTML instructions for all non-substantive aspects of the message. Pre-processing includes pre-rendering of the message into a final format, which is the format that the rendering engine displays to the user. The final format message is then converted to a text-only format to remove graphics, color, non-text decoration, and spacing that cannot be rendered as ASCII-style or Unicode-style characters. The result is essentially to reduce each message to its common denominator essentials so that the junk mail filter can view each message on an equal basis.

    Adaptive junk message filtering system
    9.
    Granted invention patent
    Status: in force

    Publication No.: US07640313B2

    Publication date: 2009-12-29

    Application No.: US11779263

    Filing date: 2007-07-17

    IPC classification: G06F15/16 G06F15/173

    CPC classification: G06Q10/107 H04L51/12

    Abstract: The invention relates to a system for filtering messages. The system includes a seed filter having associated therewith a false positive rate and a false negative rate. A new filter is also provided for filtering the messages. The new filter is evaluated according to the false positive rate and the false negative rate of the seed filter: the data used to determine the false positive rate and the false negative rate of the seed filter are utilized to determine a new false positive rate and a new false negative rate of the new filter as a function of threshold. The new filter is employed in lieu of the seed filter if a threshold exists for the new filter such that the new false positive rate and new false negative rate are together considered better than the false positive rate and the false negative rate of the seed filter.
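
    The replacement test can be sketched as a threshold search over the same labeled data used to rate the seed filter. The abstract leaves "together considered better" undefined; the sketch assumes it means that neither rate is worse and at least one is strictly lower. The function names, scores, and candidate thresholds are made up for the example.

```python
def rates(scores, labels, threshold):
    """Return (false positive rate, false negative rate) at `threshold`.

    labels[i] is True for junk. A message is flagged when score >= threshold.
    Assumes the data contains at least one junk and one good message.
    """
    fp = sum(1 for s, junk in zip(scores, labels) if not junk and s >= threshold)
    fn = sum(1 for s, junk in zip(scores, labels) if junk and s < threshold)
    n_good = sum(1 for junk in labels if not junk)
    n_junk = sum(1 for junk in labels if junk)
    return fp / n_good, fn / n_junk

def find_better_threshold(new_scores, labels, seed_fp, seed_fn, thresholds):
    """Return the first threshold at which the new filter beats the seed, else None."""
    for t in thresholds:
        fp, fn = rates(new_scores, labels, t)
        # Assumed criterion: no worse on both rates, strictly better on one.
        if fp <= seed_fp and fn <= seed_fn and (fp < seed_fp or fn < seed_fn):
            return t
    return None

labels      = [True, True, True, True, False, False, False, False]
seed_scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.1]
new_scores  = [0.9, 0.7, 0.6, 0.3, 0.4, 0.2, 0.1, 0.1]

seed_fp, seed_fn = rates(seed_scores, labels, 0.5)
chosen = find_better_threshold(new_scores, labels, seed_fp, seed_fn,
                               thresholds=[0.1, 0.3, 0.5, 0.7, 0.9])
```

    If `chosen` is `None`, the seed filter stays in place; otherwise the new filter would be deployed at that threshold.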

    Adaptive junk message filtering system
    10.
    Granted invention patent
    Status: in force

    Publication No.: US07249162B2

    Publication date: 2007-07-24

    Application No.: US10374005

    Filing date: 2003-02-25

    IPC classification: G06F15/16

    CPC classification: G06Q10/107 H04L51/12

    Abstract: The invention relates to a system for filtering messages. The system includes a seed filter having associated therewith a false positive rate and a false negative rate. A new filter is also provided for filtering the messages. The new filter is evaluated according to the false positive rate and the false negative rate of the seed filter: the data used to determine the false positive rate and the false negative rate of the seed filter are utilized to determine a new false positive rate and a new false negative rate of the new filter as a function of threshold. The new filter is employed in lieu of the seed filter if a threshold exists for the new filter such that the new false positive rate and new false negative rate are together considered better than the false positive rate and the false negative rate of the seed filter.
