Selection of features predictive of biological conditions using protein mass spectrographic data
    1.
    Granted patent
    Status: Lapsed

    Publication No.: US07676442B2

    Publication date: 2010-03-09

    Application No.: US11929169

    Filing date: 2007-10-30

    IPC classification: G06N5/00

    Abstract: Support vector machines are used to classify data contained within a structured dataset such as a plurality of signals generated by a spectral analyzer. The signals are pre-processed to ensure alignment of peaks across the spectra. Similarity measures are constructed to provide a basis for comparison of pairs of samples of the signal. A support vector machine is trained to discriminate between different classes of the samples and to identify the most predictive features within the spectra. In a preferred embodiment, feature selection is performed to reduce the number of features that must be considered.

    Kernels and kernel methods for spectral data
    2.
    Granted patent
    Status: In force

    Publication No.: US07617163B2

    Publication date: 2009-11-10

    Application No.: US10267977

    Filing date: 2002-10-09

    IPC classification: G06F15/18

    Abstract: Support vector machines are used to classify data contained within a structured dataset such as a plurality of signals generated by a spectral analyzer. The signals are pre-processed to ensure alignment of peaks across the spectra. Similarity measures are constructed to provide a basis for comparison of pairs of samples of the signal. A support vector machine is trained to discriminate between different classes of the samples and to identify the most predictive features within the spectra. In a preferred embodiment, feature selection is performed to reduce the number of features that must be considered.
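
    The abstract does not say how the similarity measures between pairs of spectra are built. As a minimal sketch of one possibility, the Python below smooths each spectrum over a small window and takes the cosine of the smoothed vectors, so that peaks misaligned by a bin or two still count as similar. The function names, the window size and the toy spectra are all assumptions made for illustration and are not taken from the patent.

```python
import math


def smooth(spectrum, half_width):
    """Replace each bin by the sum of the bins within half_width of it."""
    n = len(spectrum)
    return [
        sum(spectrum[max(0, i - half_width):min(n, i + half_width + 1)])
        for i in range(n)
    ]


def dot(a, b):
    """Plain inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def shift_tolerant_similarity(x, y, half_width=1):
    """Cosine similarity of the two smoothed spectra.

    It is an inner product of transformed inputs, so it is a valid
    (positive semi-definite) kernel for a support vector machine.
    """
    sx, sy = smooth(x, half_width), smooth(y, half_width)
    norm = math.sqrt(dot(sx, sx) * dot(sy, sy))
    return dot(sx, sy) / norm if norm > 0 else 0.0


# Hypothetical toy spectra: one peak each, in bin 2, bin 3 and bin 6.
a = [0, 0, 5, 0, 0, 0, 0, 0]
b = [0, 0, 0, 5, 0, 0, 0, 0]
c = [0, 0, 0, 0, 0, 0, 5, 0]

raw_ab = dot(a, b)                        # plain dot product misses the shifted peak
sim_ab = shift_tolerant_similarity(a, b)  # neighbouring peaks overlap after smoothing
sim_ac = shift_tolerant_similarity(a, c)  # distant peaks stay dissimilar
sim_aa = shift_tolerant_similarity(a, a)
print(raw_ab, round(sim_ab, 3), sim_ac, sim_aa)
```

    Here the plain dot product of `a` and `b` is zero although their peaks are one bin apart, while the smoothed similarity is 2/3. A wider `half_width` tolerates larger residual misalignment at the cost of resolution.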

    METHOD FOR FEATURE SELECTION AND FOR EVALUATING FEATURES IDENTIFIED AS SIGNIFICANT FOR CLASSIFYING DATA
    3.
    Patent application
    Status: In force

    Publication No.: US20110078099A1

    Publication date: 2011-03-31

    Application No.: US12890705

    Filing date: 2010-09-26

    IPC classification: G06F15/18

    Abstract: A group of features that has been identified as “significant” in being able to separate data into classes is evaluated using a support vector machine which separates the dataset into classes one feature at a time. After separation, an extremal margin value is assigned to each feature based on the distance between the lowest feature value in the first class and the highest feature value in the second class. Separately, extremal margin values are calculated for a normal distribution within a large number of randomly drawn example sets for the two classes to determine the number of examples within the normal distribution that would have a specified extremal margin value. Using p-values calculated for the normal distribution, a desired p-value is selected. The specified extremal margin value corresponding to the selected p-value is compared to the calculated extremal margin values for the group of features. The features in the group that have a calculated extremal margin value less than the specified margin value are labeled as falsely significant.
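
    The procedure in this abstract can be sketched roughly as follows, assuming feature values on a standard-normal scale and a Monte Carlo estimate of the null distribution. The function names, sample sizes, trial count and example features are hypothetical; the patent may define the p-value computation differently.

```python
import random


def extremal_margin(class_a, class_b):
    """Lowest value in class_a minus highest value in class_b.

    Negative when the two classes overlap on this feature.
    """
    return min(class_a) - max(class_b)


def null_margins(n_a, n_b, trials, rng):
    """Extremal margins of randomly drawn standard-normal example sets, sorted."""
    margins = []
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_a)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_b)]
        margins.append(extremal_margin(a, b))
    return sorted(margins)


def margin_threshold(sorted_null, p_value):
    """Null margin at the (1 - p_value) quantile.

    Roughly a p_value fraction of the random draws exceed this margin.
    """
    index = min(len(sorted_null) - 1, int((1.0 - p_value) * len(sorted_null)))
    return sorted_null[index]


rng = random.Random(0)
null = null_margins(n_a=5, n_b=5, trials=2000, rng=rng)
threshold = margin_threshold(null, p_value=0.01)

# Hypothetical features: (values in first class, values in second class).
features = {
    "separating": ([2.1, 1.8, 2.5, 1.9, 2.2], [-2.0, -1.7, -2.3, -1.9, -2.4]),
    "overlapping": ([1.5, -1.2, 0.4, -0.3, 0.9], [1.1, -0.8, 0.2, -1.4, 0.6]),
}

# Features whose margin falls below the threshold are labelled falsely significant.
falsely_significant = [
    name for name, (a, b) in features.items()
    if extremal_margin(a, b) < threshold
]
print(falsely_significant)
```

    A smaller `p_value` raises the threshold, so more of the candidate features are rejected as falsely significant.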

    Methods for feature selection in a learning machine
    4.
    Granted patent
    Status: In force

    Publication No.: US07318051B2

    Publication date: 2008-01-08

    Application No.: US10478192

    Filing date: 2002-05-20

    IPC classification: G06F15/18 G06E1/00 G06E3/00

    Abstract: In a pre-processing step prior to training a learning machine, pre-processing includes reducing the quantity of features to be processed using feature selection methods selected from the group consisting of recursive feature elimination (RFE), minimizing the number of non-zero parameters of the system (l0-norm minimization), evaluation of a cost function to identify a subset of features that are compatible with constraints imposed by the learning set, unbalanced correlation score, and transductive feature selection. The features remaining after feature selection are then used to train a learning machine for purposes of pattern classification, regression, clustering and/or novelty detection. (FIG. 3, 300, 301, 302, 304, 306, 308, 309, 310, 311, 312, 314)
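
    Of the methods listed, recursive feature elimination is the easiest to illustrate. The sketch below assumes a linear SVM without a bias term, trained by plain full-batch subgradient descent on the regularised hinge loss, and removes one feature per round by smallest squared weight. The trainer, its hyperparameters and the toy data are stand-ins chosen for illustration, not the patent's implementation.

```python
def train_linear_svm(rows, labels, epochs=200, lr=0.1, lam=0.01):
    """Full-batch subgradient descent on the L2-regularised hinge loss (no bias)."""
    n, d = len(rows), len(rows[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]
        for x, y in zip(rows, labels):
            # Only examples inside the margin contribute to the hinge subgradient.
            if y * sum(wj * xj for wj, xj in zip(w, x)) < 1.0:
                for j in range(d):
                    grad[j] -= y * x[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def recursive_feature_elimination(rows, labels, keep):
    """Retrain, drop the feature with the smallest squared weight, repeat."""
    active = list(range(len(rows[0])))
    eliminated = []
    while len(active) > keep:
        sub = [[row[j] for j in active] for row in rows]
        w = train_linear_svm(sub, labels)
        worst = min(range(len(active)), key=lambda j: w[j] ** 2)
        eliminated.append(active.pop(worst))
    return active, eliminated


# Hypothetical data: features 0 and 1 track the label, features 2 and 3 do not.
rows = [
    [2.0, 1.0, 0.5, 0.3],
    [1.8, 0.5, -0.5, 0.3],
    [2.2, 0.8, 0.5, -0.3],
    [2.1, 1.2, -0.5, -0.3],
    [-2.0, -0.9, 0.5, 0.3],
    [-1.9, -0.6, -0.5, 0.3],
    [-2.2, -1.1, 0.5, -0.3],
    [-1.8, -0.7, -0.5, -0.3],
]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

selected, eliminated = recursive_feature_elimination(rows, labels, keep=2)
print(selected, eliminated)
```

    Retraining after every removal is what distinguishes RFE from a one-shot ranking: the weight of a remaining feature can change once a correlated feature is gone. Because the criterion uses raw weights, features are assumed to be on comparable scales.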

    Identification of Co-Regulation Patterns By Unsupervised Cluster Analysis of Gene Expression Data
    5.
    Patent application
    Status: Lapsed

    Publication No.: US20110125683A1

    Publication date: 2011-05-26

    Application No.: US13019585

    Filing date: 2011-02-02

    IPC classification: G06N3/12

    Abstract: A method is provided for unsupervised clustering of gene expression data to identify co-regulation patterns. A clustering algorithm randomly divides the data into k different subsets and measures the similarity between pairs of datapoints within the subsets, assigning a score to the pairs based on similarity, with the greatest similarity giving the highest correlation score. A distribution of the scores is plotted for each k. The highest value of k that has a distribution that remains concentrated near the highest correlation score corresponds to the number of co-regulation patterns.

    Model selection for cluster data analysis
    6.
    Granted patent
    Status: Lapsed

    Publication No.: US07890445B2

    Publication date: 2011-02-15

    Application No.: US11929522

    Filing date: 2007-10-30

    IPC classification: G06F17/00 G06N5/00

    Abstract: A model selection method is provided for choosing the number of clusters, or more generally the parameters of a clustering algorithm. The algorithm is based on comparing the similarity between pairs of clustering runs on sub-samples or other perturbations of the data. High pairwise similarities show that the clustering represents a stable pattern in the data. The method is applicable to any clustering algorithm, and can also detect lack of structure. We show results on artificial and real data using a hierarchical clustering algorithm.
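
    The stability idea can be sketched on one-dimensional toy data. The code below assumes single-linkage hierarchical clustering (which in one dimension amounts to cutting the sorted points at the widest gaps), 80% sub-samples, and a Jaccard index over co-clustered pairs as the similarity between two runs. These choices, the function names and the data are illustrative assumptions; the abstract does not fix any of them.

```python
import random
from itertools import combinations


def single_linkage_1d(points, k):
    """Cut sorted 1-D points at the k-1 widest gaps; returns {point: label}."""
    pts = sorted(points)
    by_gap = sorted(range(1, len(pts)),
                    key=lambda i: pts[i] - pts[i - 1], reverse=True)
    cuts = set(by_gap[:k - 1])
    labels, current = {}, 0
    for i, p in enumerate(pts):
        if i in cuts:
            current += 1
        labels[p] = current
    return labels


def jaccard_similarity(labels_a, labels_b):
    """Agreement of two clusterings on the point pairs they share."""
    shared = sorted(set(labels_a) & set(labels_b))
    both = either = 0
    for p, q in combinations(shared, 2):
        same_a = labels_a[p] == labels_a[q]
        same_b = labels_b[p] == labels_b[q]
        both += same_a and same_b
        either += same_a or same_b
    return both / either if either else 1.0


def stability(data, k, pairs=20, fraction=0.8, seed=0):
    """Mean similarity between clusterings of random pairs of sub-samples."""
    rng = random.Random(seed)
    m = int(fraction * len(data))
    scores = []
    for _ in range(pairs):
        run_a = single_linkage_1d(rng.sample(data, m), k)
        run_b = single_linkage_1d(rng.sample(data, m), k)
        scores.append(jaccard_similarity(run_a, run_b))
    return sum(scores) / len(scores)


# Hypothetical data: three well-separated groups of ten distinct points each.
data = [centre + i / 10 for centre in (0.0, 10.0, 25.0) for i in range(10)]
scores = {k: stability(data, k) for k in (2, 3, 4, 5)}
print(scores)
```

    On this toy data the score stays at 1.0 for k = 2 and k = 3 and should drop for larger k, because any further cut falls inside a group at a position that depends on which points were sub-sampled. If the scores were low for every k, that would point to a lack of cluster structure.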

    Kernels and methods for selecting kernels for use in learning machines
    8.
    Granted patent
    Status: Lapsed

    Publication No.: US07788193B2

    Publication date: 2010-08-31

    Application No.: US11929354

    Filing date: 2007-10-30

    IPC classification: G06F15/18 G06F17/00 G06N5/00

    Abstract: Learning machines, such as support vector machines, are used to analyze datasets to recognize patterns within the dataset using kernels that are selected according to the nature of the data to be analyzed. Where the dataset possesses structural characteristics, locational kernels can be utilized to provide measures of similarity among data points within the dataset. The locational kernels are then combined to generate a decision function, or kernel, that can be used to analyze the dataset. Where an invariance transformation or noise is present, tangent vectors are defined to identify relationships between the invariance or noise and the data points. A covariance matrix is formed using the tangent vectors, then used in generation of the kernel.

    Method for feature selection in a support vector machine using feature ranking
    9.
    Granted patent
    Status: Lapsed

    Publication No.: US07805388B2

    Publication date: 2010-09-28

    Application No.: US11928784

    Filing date: 2007-10-30

    IPC classification: G06N7/00

    Abstract: In a pre-processing step prior to training a learning machine, pre-processing includes reducing the quantity of features to be processed using feature selection methods selected from the group consisting of recursive feature elimination (RFE), minimizing the number of non-zero parameters of the system (l0-norm minimization), evaluation of a cost function to identify a subset of features that are compatible with constraints imposed by the learning set, unbalanced correlation score, transductive feature selection, and single feature using margin-based ranking. The features remaining after feature selection are then used to train a learning machine for purposes of pattern classification, regression, clustering and/or novelty detection.

    Pre-processed feature ranking for a support vector machine
    10.
    Granted patent
    Status: Lapsed

    Publication No.: US07475048B2

    Publication date: 2009-01-06

    Application No.: US10494876

    Filing date: 2002-11-07

    IPC classification: G06F15/18

    Abstract: A computer-implemented method is provided for ranking features within a large dataset containing a large number of features according to each feature's ability to separate data into classes. For each feature, a support vector machine separates the dataset into two classes and determines the margins between extremal points in the two classes. The margins for all of the features are compared and the features are ranked based upon the size of the margin, with the highest ranked features corresponding to the largest margins. A subset of features for classifying the dataset is selected from a group of the highest ranked features. In one embodiment, the method is used to identify the best genes for disease prediction and diagnosis using gene expression data from micro-arrays.
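
    For a single feature, the margin between the extremal points of two classes can be computed directly, without training a classifier. The sketch below ranks features that way, assuming labels of +1 and -1 and allowing either class to lie on the high side. The function names and the small expression matrix are hypothetical and stand in for the patent's SVM-based procedure.

```python
def feature_margin(pos, neg):
    """Gap between the closest extremal points of the two classes on one feature.

    Either class may lie above the other; the value is negative if they overlap.
    """
    return max(min(pos) - max(neg), min(neg) - max(pos))


def rank_features(samples, labels):
    """Return (margin, feature_index) pairs, largest margin first."""
    ranked = []
    for j in range(len(samples[0])):
        pos = [row[j] for row, y in zip(samples, labels) if y == 1]
        neg = [row[j] for row, y in zip(samples, labels) if y == -1]
        ranked.append((feature_margin(pos, neg), j))
    return sorted(ranked, reverse=True)


# Hypothetical expression values: rows are samples, columns are genes.
samples = [
    [5.0, 1.0, 2.0],
    [6.0, 3.0, 2.5],
    [5.5, 2.0, 1.5],
    [1.0, 2.5, 3.5],
    [2.0, 1.5, 4.0],
    [1.5, 0.5, 3.0],
]
labels = [1, 1, 1, -1, -1, -1]

ranking = rank_features(samples, labels)
top_two = [j for _, j in ranking[:2]]
print(ranking)   # [(3.0, 0), (0.5, 2), (-1.5, 1)]
print(top_two)   # [0, 2]
```

    Gene 0 separates the classes with the widest gap, gene 2 separates them narrowly in the opposite direction, and gene 1 overlaps, so it ranks last with a negative margin.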
