Method for identifying outliers in large data sets
    1.
    Granted invention patent
    Legal status: no longer in force

    Publication number: US06643629B2

    Publication date: 2003-11-04

    Application number: US09442912

    Filing date: 1999-11-18

    IPC classification: G06F17/00

    Abstract: A new method for identifying a predetermined number of data points of interest in a large data set. The data points of interest are ranked in relation to the distance to their neighboring points. The method employs partition-based detection algorithms to partition the data points and then compute upper and lower bounds for each partition. These bounds are then used to eliminate those partitions that do not contain the predetermined number of data points of interest. The data points of interest are then computed from the remaining partitions that were not eliminated. The present method eliminates a significant number of data points from consideration as the points of interest, thereby resulting in substantial savings in computational expense compared to conventional methods employed to identify such points.
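
    The ranking described in the abstract above can be illustrated with a minimal sketch. It assumes that "distance to their neighboring points" means the distance to the k-th nearest neighbor, and it uses a brute-force scan rather than the partition-based pruning the abstract claims. The function names and sample data are hypothetical.

```python
import heapq
import math


def kth_neighbor_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor (brute force)."""
    dists = sorted(
        math.dist(points[i], q) for j, q in enumerate(points) if j != i
    )
    return dists[k - 1]


def top_n_outliers(points, k, n):
    """Indices of the n points with the largest k-th neighbor distance."""
    scored = [
        (kth_neighbor_distance(points, i, k), i) for i in range(len(points))
    ]
    return [i for _, i in heapq.nlargest(n, scored)]


# Hypothetical data: four clustered points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
outliers = top_n_outliers(points, k=2, n=1)
print(outliers)
```

    This baseline compares every pair of points. A partition-based variant would first bound the k-th neighbor distance per partition and skip partitions whose upper bound cannot reach the current top n.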

    Methods of imaging based on wavelet retrieval of scenes
    2.
    Granted invention patent
    Legal status: in force

    Publication number: US06751363B1

    Publication date: 2004-06-15

    Application number: US09371112

    Filing date: 1999-08-10

    IPC classification: G06K9/54

    Abstract: Methods of imaging objects based on wavelet retrieval of scenes utilize wavelet transformation of plural defined regions of a query image. By increasing the granularity of the query image to greater than one region, accurate feature vectors are obtained that allow for robust extraction of corresponding regions from a database of target images. The methods further include the use of sliding windows to decompose the query and target images into regions, and the clustering of the regions utilizing a novel similarity metric that ensures robust image matching in low response times.
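
    As a rough illustration of the sliding-window step in the abstract above, the sketch below computes a one-level 2-D Haar wavelet signature for every 2x2 window of a small grayscale image. The window size, the choice of the Haar wavelet, and all names are assumptions; the abstract does not specify them, and the clustering and similarity metric are not shown.

```python
def haar_2x2(block):
    """One-level 2-D Haar transform of a 2x2 block [[a, b], [c, d]]."""
    (a, b), (c, d) = block
    return (
        (a + b + c + d) / 4,  # average (low-frequency) coefficient
        (a - b + c - d) / 4,  # left column minus right column
        (a + b - c - d) / 4,  # top row minus bottom row
        (a - b - c + d) / 4,  # diagonal difference
    )


def window_signatures(image, stride=1):
    """Slide a 2x2 window over a grayscale image (a list of rows) and
    return {(row, col): signature} for every window position."""
    rows, cols = len(image), len(image[0])
    sigs = {}
    for r in range(0, rows - 1, stride):
        for c in range(0, cols - 1, stride):
            block = [image[r][c:c + 2], image[r + 1][c:c + 2]]
            sigs[(r, c)] = haar_2x2(block)
    return sigs


# Hypothetical 3x3 image: a bright 2x2 patch in the top-left corner.
image = [
    [8, 8, 0],
    [8, 8, 0],
    [0, 0, 0],
]
sigs = window_signatures(image)
print(sigs[(0, 0)], sigs[(0, 1)])
```

    Each signature acts as a small feature vector for one region, so a query image yields several vectors instead of one global descriptor.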

    System and method for constraint based sequential pattern mining
    3.
    Granted invention patent
    Legal status: in force

    Publication number: US06473757B1

    Publication date: 2002-10-29

    Application number: US09537082

    Filing date: 2000-03-28

    IPC classification: G06F17/30

    Abstract: The present invention provides a method and system for sequential pattern mining with a given constraint. A Regular Expression (RE) is used for identifying the family of interesting frequent patterns. A family of methods that enforce the RE constraint to different degrees within the generating and pruning of candidate patterns during the mining process is utilized. This is accomplished by employing different relaxations of the RE constraint in the mining loop. Those sequences which satisfy the given constraint are thus identified most expeditiously.
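
    The idea in the abstract above can be sketched at its simplest extreme: mine all frequent sequences level by level, then apply the regular expression only to the output. This is the weakest possible use of the constraint, not one of the patented relaxations, which push the RE into candidate generation and pruning. Items are assumed to be single characters, and all names and data are hypothetical.

```python
import re


def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def support(pattern, database):
    """Number of database sequences that contain pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def mine_with_regex(database, min_support, regex, max_len=4):
    """Level-wise mining of frequent sequences; the regular expression
    is checked only on the final output."""
    alphabet = sorted({item for seq in database for item in seq})
    constraint = re.compile(regex)
    frequent, level = [], [""]
    for _ in range(max_len):
        # Extend each frequent pattern by one item and keep frequent ones.
        level = [
            p + item
            for p in level
            for item in alphabet
            if support(p + item, database) >= min_support
        ]
        if not level:
            break
        frequent.extend(level)
    return [p for p in frequent if constraint.fullmatch(p)]


# Hypothetical sequence database and constraint.
database = ["abcd", "acd", "abd", "bcd"]
patterns = mine_with_regex(database, min_support=3, regex=r"a(b|c)*d")
print(patterns)
```

    Here seven sequences are frequent but only one satisfies the expression, which shows why enforcing the constraint earlier in the loop can save work.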

    Method for mining association rules in data
    4.
    Granted invention patent
    Legal status: no longer in force

    Publication number: US06185549B2

    Publication date: 2001-02-06

    Application number: US09069135

    Filing date: 1998-04-29

    IPC classification: G06F17/00

    CPC classification: G06F17/30539 G06Q30/0201

    Abstract: An electronic data mining process for mining from an electronic database using an electronic digital computer a listing of commercially useful information of the type known in the art as an association rule containing at least one uninstantiated condition. For example, the commercially useful information may be information useful for sales promotion, such as promotion of telephone usage. The computer retrieves from the database a plurality of stored parameters from which measures of the uninstantiated condition can be determined. The computer uses a dynamic programming algorithm and iterates over intervals or sub-ranges of the parameters to obtain what is called an at least partially optimized association rule, as optimized intervals or sub-ranges of at least some of the retrieved parameters, for example, time intervals of high usage of certain types of telephone connections. These optimized intervals are provided as the listed commercially useful information. The amount of needed iteration is reduced in some cases by using so-called bucketing and divide-and-conquer techniques. Extension of the process for a plurality of uninstantiated conditions is described.
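
    To make the notion of an optimized interval concrete, here is a simplified sketch. It assumes a single uninstantiated numeric attribute already split into buckets, and it assumes the goal is the contiguous bucket range with the largest support whose confidence meets a threshold. Support and confidence are not named in the abstract, and the exhaustive scan stands in for the dynamic programming, bucketing and divide-and-conquer techniques it mentions. All names and figures are hypothetical.

```python
def best_interval(buckets, min_conf):
    """buckets[i] = (n_total, n_hits) for the i-th sub-range of the attribute.

    Return (start, end, support) for the contiguous bucket range with the
    largest total count whose confidence hits/total is at least min_conf,
    or None if no range qualifies. Exhaustive O(b^2) scan.
    """
    best = None
    for start in range(len(buckets)):
        total = hits = 0
        for end in range(start, len(buckets)):
            total += buckets[end][0]
            hits += buckets[end][1]
            if total and hits / total >= min_conf:
                if best is None or total > best[2]:
                    best = (start, end, total)
    return best


# Hypothetical data: calls per 4-hour slot and how many were long-distance.
buckets = [(100, 10), (80, 60), (120, 90), (90, 30), (50, 5), (60, 6)]
result = best_interval(buckets, min_conf=0.6)
print(result)
```

    With these figures the best range covers buckets 1 to 3, which would correspond to the time interval of highest qualifying usage in the telephone example.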

    Decision tree classifier with integrated building and pruning phases
    5.
    Granted invention patent
    Legal status: in force

    Publication number: US06247016B1

    Publication date: 2001-06-12

    Application number: US09189257

    Filing date: 1998-11-10

    IPC classification: G06F17/30

    Abstract: A method of data classification using a decision tree having nodes is disclosed, along with an apparatus for performing the method. Periodically or after a certain number of nodes of the tree are split, the partially built tree is pruned. During the building phase the minimum cost of subtrees rooted at leaf nodes that can still be expanded (“yet to be expanded nodes”) is computed. With the computation of the minimum subtree cost at nodes, the nodes pruned are a subset of those that would have been pruned anyway during the pruning phase, and they are pruned while the tree is still being built.

    Programmed medium for clustering large databases
    6.
    Granted invention patent
    Legal status: no longer in force

    Publication number: US6092072A

    Publication date: 2000-07-18

    Application number: US55941

    Filing date: 1998-04-07

    IPC classification: G06F17/30

    Abstract: The present invention relates to a computer method, apparatus and programmed medium for clustering large databases. The present invention represents each cluster to be merged by a constant number of well scattered points that capture the shape and extent of the cluster. The chosen scattered points are shrunk towards the mean of the cluster by a shrinking fraction to form a representative set of data points that efficiently represent the cluster. The clusters with the closest pair of representative points are merged to form a new cluster. The use of an efficient representation of the clusters allows the present invention to obtain improved clustering while efficiently eliminating outliers.
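
    The representative-point step in the abstract above can be sketched as follows. The farthest-point heuristic used to pick "well scattered" points is an assumption, as are the function names, the parameter names `c` (number of representatives) and `alpha` (shrinking fraction), and the sample data. The sketch also assumes the points in a cluster are distinct.

```python
import math


def mean(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def scattered_points(points, c):
    """Pick up to c well-scattered points with a farthest-point heuristic."""
    centroid = mean(points)
    chosen = [max(points, key=lambda p: math.dist(p, centroid))]
    while len(chosen) < min(c, len(points)):
        # Next pick: the point farthest from everything chosen so far.
        chosen.append(
            max(
                (p for p in points if p not in chosen),
                key=lambda p: min(math.dist(p, q) for q in chosen),
            )
        )
    return chosen


def representatives(points, c, alpha):
    """Shrink the scattered points towards the cluster mean by alpha."""
    centroid = mean(points)
    return [
        tuple(x + alpha * (m - x) for x, m in zip(p, centroid))
        for p in scattered_points(points, c)
    ]


def cluster_distance(reps_a, reps_b):
    """Distance between clusters: closest pair of representative points."""
    return min(math.dist(a, b) for a in reps_a for b in reps_b)


# Hypothetical cluster: four corners of a square plus its centre.
cluster = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0), (2.0, 2.0)]
reps = representatives(cluster, c=4, alpha=0.5)
print(sorted(reps))
```

    Shrinking pulls the representatives away from the cluster boundary, which dampens the influence of outlying points when `cluster_distance` decides which pair of clusters to merge.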

    Document descriptor extraction method
    7.
    Granted invention patent
    Legal status: in force

    Publication number: US07080314B1

    Publication date: 2006-07-18

    Application number: US09595719

    Filing date: 2000-06-16

    IPC classification: G06F15/00

    CPC classification: G06F17/2247

    Abstract: The present invention discloses a document descriptor extraction method and system. The document descriptor extraction method and system creates a document descriptor by generalizing input sequences within a document; factoring the input sequences and generalized input sequences; and selecting a document descriptor from the input sequences, generalized sequences, and factored sequences, preferably using minimum descriptor length (MDL) principles. Novel algorithms are employed to perform the generalizing, factoring, and selecting.

    Technique for effectively instantiating attributes in association rules
    8.
    Granted invention patent
    Legal status: no longer in force

    Publication number: US5946683A

    Publication date: 1999-08-31

    Application number: US977878

    Filing date: 1997-11-25

    IPC classification: G06F17/30

    Abstract: In a data processing system, association rules are used to determine correlations of attributes of collected data, thereby extracting insightful information therefrom. In solving an optimized association rule problem where multiple instantiations for at least one uninstantiated attribute are required, unlike prior art, not all possible instantiations are considered to realize an optimized set of instantiations. Rather, using inventive pruning techniques, only selected instantiations need to be considered to realize same. In accordance with the invention, instantiations are assigned weights and are subject to pruning in an order dependent upon their weight. The weighted instantiations are tested based on selected criteria to identify, for example, those instantiations, consideration of which for the optimized set would be redundant in view of other instantiations to be considered. The identified instantiations are disregarded to increase the efficiency of determining the optimized set.

    Method, apparatus and programmed medium for clustering databases with categorical attributes
    10.
    Granted invention patent
    Legal status: no longer in force

    Publication number: US6049797A

    Publication date: 2000-04-11

    Application number: US55940

    Filing date: 1998-04-07

    IPC classification: G06F17/30 G06K9/62

    Abstract: The present invention relates to a computer method, apparatus and programmed medium for clustering databases containing data with categorical attributes. The present invention assigns a pair of points to be neighbors if their similarity exceeds a certain threshold. The similarity value for pairs of points can be based on non-metric information. The present invention determines a total number of links between each cluster and every other cluster based upon the neighbors of the clusters. A goodness measure between each cluster and every other cluster based upon the total number of links between each cluster and every other cluster and the total number of points within each cluster and every other cluster is then calculated. The present invention merges the two clusters with the best goodness measure. Thus, clustering is performed accurately and efficiently by merging data based on the number of links between the data to be clustered.
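
    The neighbor and link computation in the abstract above can be sketched as below. Several details are assumptions rather than statements from the abstract: Jaccard similarity on item sets as the similarity measure, a point not counting as its own neighbor, and the normalisation inside `goodness`, which is borrowed from the published ROCK clustering algorithm. All names and data are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two non-empty item sets."""
    return len(a & b) / len(a | b)


def neighbor_sets(points, theta):
    """nbrs[i] = indices j != i whose similarity to point i exceeds theta."""
    nbrs = {i: set() for i in range(len(points))}
    for i, j in combinations(range(len(points)), 2):
        if jaccard(points[i], points[j]) > theta:
            nbrs[i].add(j)
            nbrs[j].add(i)
    return nbrs


def link(nbrs, i, j):
    """Number of common neighbors of points i and j."""
    return len(nbrs[i] & nbrs[j])


def cluster_links(nbrs, c1, c2):
    """Total number of links between two clusters (lists of indices)."""
    return sum(link(nbrs, i, j) for i in c1 for j in c2)


def goodness(nbrs, c1, c2, theta):
    """Links between c1 and c2, normalised by cluster sizes.
    ASSUMPTION: normalisation taken from the published ROCK algorithm."""
    f = (1 - theta) / (1 + theta)
    e = 1 + 2 * f
    n1, n2 = len(c1), len(c2)
    return cluster_links(nbrs, c1, c2) / ((n1 + n2) ** e - n1 ** e - n2 ** e)


# Hypothetical market-basket records forming two natural groups.
points = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {6, 7, 8}, {6, 7, 9}]
nbrs = neighbor_sets(points, theta=0.4)
print(cluster_links(nbrs, [0, 1], [2]), cluster_links(nbrs, [0, 1, 2], [3, 4]))
```

    A merge step would evaluate `goodness` for every pair of clusters and merge the pair with the highest value, so records are grouped by shared neighbors rather than by direct similarity alone.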
