Data mining method and system for generating a decision tree classifier for data records based on a minimum description length (MDL) and presorting of records
    1.
    Granted patent
    Status: Expired

    Publication No.: US5787274A

    Publication date: 1998-07-28

    Application No.: US564694

    Filing date: 1995-11-29

    IPC classification: G06F17/30

    Abstract: A method and apparatus are disclosed for generating a decision tree classifier from a training set of records. The method comprises the steps of: pre-sorting the records based on each numeric record attribute, creating a decision tree breadth-first, and pruning the tree based on the MDL principle. Preferably, the pre-sorting includes generating a class list and attribute lists, and independently sorting the numeric attribute lists. The growing of the tree includes evaluating possible splitting criteria and selecting a splitting test for each leaf node, based on a splitting index, and updating the class list to reflect new leaf nodes. In a preferred embodiment, the splitting index is a gini index. The pruning preferably includes encoding the decision tree and splitting tests in an MDL-based code, and determining whether to convert a node into a leaf node, prune its child nodes, or leave the node intact, based on the code length of the node.
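    The split evaluation described in this abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function names `gini` and `best_numeric_split` are hypothetical, the weighted form of the gini index is the standard textbook one (the abstract does not give a formula), and the class list, breadth-first growth, and MDL pruning are omitted. The sketch sorts one numeric attribute once, then scans the sorted list while updating class counts on each side of every candidate threshold.

```python
from collections import Counter


def gini(counts, total):
    """Gini index of a class distribution: 1 - sum of squared class fractions."""
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def best_numeric_split(values, labels):
    """Return (threshold, weighted_gini) for the best test `value <= threshold`.

    Assumption: the split quality is the size-weighted gini of the two sides.
    """
    n = len(values)
    # Sort record indices once by attribute value (the sorted "attribute list").
    order = sorted(range(n), key=lambda i: values[i])
    below = Counter()        # class counts for records at or below the threshold
    above = Counter(labels)  # class counts for records above the threshold
    best = (None, float("inf"))
    for pos, i in enumerate(order[:-1]):
        below[labels[i]] += 1
        above[labels[i]] -= 1
        nxt = order[pos + 1]
        if values[i] == values[nxt]:
            continue  # no threshold can separate two equal values
        n_below = pos + 1
        n_above = n - n_below
        score = (n_below / n) * gini(below, n_below) + (n_above / n) * gini(above, n_above)
        if score < best[1]:
            # Use the midpoint between neighbouring values as the threshold.
            best = ((values[i] + values[nxt]) / 2, score)
    return best


ages = [23, 17, 43, 68, 32, 20]
risk = ["A", "A", "B", "B", "B", "A"]
threshold, score = best_numeric_split(ages, risk)
```

    Because the list is sorted only once, each candidate threshold costs a constant-time count update rather than a rescan of the records.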

    Method and system for generating a decision-tree classifier independent of system memory size
    2.
    Granted patent
    Status: Expired

    Publication No.: US5799311A

    Publication date: 1998-08-25

    Application No.: US646893

    Filing date: 1996-05-08

    IPC classification: G06F17/30

    Abstract: A method and system are disclosed for generating a decision-tree classifier from a training set of records, independent of the system memory size. The method comprises the steps of: generating an attribute list for each attribute of the records, sorting the attribute lists for numeric attributes, and generating a decision tree by repeatedly partitioning the records using the attribute lists. For each node, split points are evaluated to determine the best split test for partitioning the records at the node. Preferably, a gini index and class histograms are used in determining the best splits. The gini index indicates how well a split point separates the records while the class histograms reflect the class distribution of the records at the node. Also, a hash table is built as the attribute list of the split attribute is divided among the child nodes, which is then used for splitting the remaining attribute lists of the node. The created tree is further pruned based on the MDL principle, which encodes the tree and split tests in an MDL-based code, and determines whether to prune and how to prune each node based on the code length of the node.
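    The hash-table step in this abstract can be sketched as follows. This is an illustration under assumptions, not the patented method: the entry layout `(value, class_label, record_id)`, the function name `split_attribute_lists`, and the binary `value <= threshold` test are all hypothetical choices. Dividing the split attribute's list records which child each record id goes to; that table then routes the entries of every other attribute list without re-sorting them.

```python
def split_attribute_lists(attr_lists, split_attr, threshold):
    """Partition every attribute list of a node between two children.

    attr_lists maps an attribute name to a list of
    (value, class_label, record_id) tuples, each list pre-sorted by value.
    """
    # Pass 1: walk the split attribute's list and build the hash table
    # record_id -> child.
    rid_to_child = {}
    for value, _label, rid in attr_lists[split_attr]:
        rid_to_child[rid] = "left" if value <= threshold else "right"

    # Pass 2: route the entries of every list by looking up the record id.
    # Appending in scan order keeps each child's lists sorted.
    children = {"left": {}, "right": {}}
    for name, entries in attr_lists.items():
        for side in children:
            children[side][name] = []
        for entry in entries:
            children[rid_to_child[entry[2]]][name].append(entry)
    return children


lists = {
    "age": [(17, "A", 1), (20, "A", 5), (23, "A", 0),
            (32, "B", 4), (43, "B", 2), (68, "B", 3)],
    "salary": [(15, "B", 2), (40, "A", 0), (60, "A", 5),
               (65, "B", 4), (75, "A", 1), (100, "B", 3)],
}
kids = split_attribute_lists(lists, "age", 27.5)
```

    In this toy run the `salary` list of each child is still in ascending order, which is the property that lets the next level of the tree be evaluated without another sort.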

    Method and system for generating a decision-tree classifier in parallel in a multi-processor system
    3.
    Granted patent
    Status: In force

    Publication No.: US6138115A

    Publication date: 2000-10-24

    Application No.: US245765

    Filing date: 1999-02-05

    IPC classification: G06F17/30

    Abstract: A method and system are disclosed for generating a decision-tree classifier in parallel in a multi-processor system, from a training set of records. The method comprises the steps of: partitioning the records among the processors, each processor generating an attribute list for each attribute, and the processors cooperatively generating a decision tree by repeatedly partitioning the records using the attribute lists. For each node, each processor determines its best split test and, along with other processors, selects the best overall split for the records at that node. Preferably, the gini-index and class histograms are used in determining the best splits. Also, each processor builds a hash table using the attribute list of the split attribute and shares it with other processors. The hash tables are used for splitting the remaining attribute lists. The created tree is then pruned based on the MDL principle, which encodes the tree and split tests in an MDL-based code, and determines whether to prune and how to prune each node based on the code length of the node.

    Determining query intent
    5.
    Granted patent
    Status: In force

    Publication No.: US08612432B2

    Publication date: 2013-12-17

    Application No.: US12816389

    Filing date: 2010-06-16

    IPC classification: G06F7/00 G06F17/30 G06F15/18

    CPC classification: G06F17/30979

    Abstract: A tree structure has a node associated with each category of a hierarchy of item categories. Child nodes of the tree are associated with sub-categories of the categories associated with parent nodes. Training data including received queries and indicators of a selected item category for each received query is combined with the tree structure by associating each query with the node corresponding to the selected category of the query. When a query is received, a classifier is applied to the nodes to generate a probability that the query is intended to match an item of the category associated with the node. The classifier is applied until the probability is below a threshold. One or more categories associated with the nodes that are closest to the intent of the received query are selected and indicators of items of those categories that match the received query are output.
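    One possible reading of the threshold-limited descent in this abstract is sketched below. Everything concrete here is an assumption: the node layout (`category`, `children`), the function name `select_categories`, and the rule that a node is selected when none of its children reaches the threshold. The classifier is passed in as a plain function, and the lookup table used as a stand-in in the example is not a trained model.

```python
def select_categories(root, query, classify, threshold):
    """Descend the category tree while the classifier stays confident.

    classify(query, node) must return a probability in [0, 1]. A node is
    selected when no child scores at or above the threshold (assumption).
    """
    selected = []

    def visit(node):
        confident = [c for c in node["children"] if classify(query, c) >= threshold]
        if not confident:
            selected.append(node["category"])
            return
        for child in confident:
            visit(child)

    visit(root)
    return selected


def make_node(category, *children):
    """Build a hypothetical tree node."""
    return {"category": category, "children": list(children)}


tree = make_node(
    "All",
    make_node("Electronics", make_node("Cameras"), make_node("Phones")),
    make_node("Books"),
)

# Stand-in classifier: fixed probabilities for one example query.
toy_probs = {"Electronics": 0.9, "Cameras": 0.8, "Phones": 0.1, "Books": 0.05}
picked = select_categories(tree, "dslr lens", lambda q, n: toy_probs[n["category"]], 0.5)
```

    With the toy probabilities above, the descent follows Electronics, then Cameras, and stops there because Cameras has no children.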

    Methods and systems for visually distinguishing user attribute similarities and differences
    6.
    Granted patent
    Status: In force

    Publication No.: US08413060B1

    Publication date: 2013-04-02

    Application No.: US12000846

    Filing date: 2007-12-18

    Applicant: Rakesh Agrawal

    Inventor: Rakesh Agrawal

    IPC classification: G06F3/00 G06F15/16

    CPC classification: H04L51/04

    Abstract: Methods, computer-readable storage media, and systems are provided to facilitate visually distinguishing common attributes of users of an electronic communication network or messaging service. In particular, user profile attributes are compared between a first and second user, and similar attributes are visually highlighted by assigning, for example, a distinct font, font size, color, font effect, and/or other visual effect to the user's screen name to designate which attributes are similar. In addition, or alternatively, when the first user views a user profile of the second user, common user attributes are visually highlighted. In one embodiment, the font, font size, color, and/or font effect assigned to the highlighted attribute indicates a degree of similarity of the attribute. Such implementations may allow users to more easily recognize and interact with others who have similar interests and attributes.

    Middleware for query processing across a network of RFID databases
    7.
    Granted patent
    Status: Expired

    Publication No.: US08244747B2

    Publication date: 2012-08-14

    Application No.: US11566931

    Filing date: 2006-12-05

    IPC classification: G06F17/30

    CPC classification: G06F17/30448 G06F17/30557

    Abstract: An implementation is addressed in which RFID data is shared across independent organizations. RFID data is usually spread across different parties, e.g., enterprises in a supply chain, and thus efficient query processing across all parties is required. Traceability is emerging as one of the key applications of RFID technology. A generic data model is introduced for querying RFID data across a network of independently operated data sources. The model can be used to facilitate traceability query processing and to give a set of representative traceability queries. A newly designed process-and-forward approach is implemented for executing traceability queries.

    OBJECT CLASSIFICATION USING TAXONOMIES
    8.
    Patent application
    Status: In force

    Publication No.: US20100185577A1

    Publication date: 2010-07-22

    Application No.: US12414065

    Filing date: 2009-03-30

    IPC classification: G06N5/02

    CPC classification: G06N99/005

    Abstract: As provided herein, objects from a source catalog, such as a provider's catalog, can be added to a target catalog, such as an enterprise master catalog, in a scalable manner utilizing catalog taxonomies. A baseline classifier determines probabilities for source objects to target catalog classes. Source objects can be assigned to those classes with probabilities that meet a desired threshold and meet a desired rate. A classification cost for target classes can be determined for respective unassigned source objects, which can comprise determining an assignment cost and separation cost for the source objects for respective desired target classes. The separation and assignment costs can be combined to determine the classification cost, and the unassigned source objects can be assigned to those classes having a desired classification cost.

    CUSTOMIZED SEARCH
    9.
    Patent application
    Status: In force

    Publication No.: US20100114925A1

    Publication date: 2010-05-06

    Application No.: US12253658

    Filing date: 2008-10-17

    IPC classification: G06F17/30

    CPC classification: G06F17/30864 G06F17/30477

    Abstract: Techniques are disclosed herein for providing a custom search engine. In one aspect, a first search query is received from a requester. First search results containing search result items that match the first search query are obtained. At least one sub-query is generated from the first search results. The generating is based on rules for a particular custom search engine. Second search results that match the sub-query are then obtained. A search result set is formed from a corpus that includes the first search results and the second search results. The generating of the search result set is based on the rules for the particular custom search engine. The search result set is provided to the requester. In one aspect, an interface for designing a custom search engine is provided. The interface allows the designer to specify the layout of a search results page.
