摘要:
Binning of predictor values used for generating a data mining model provides useful reduction in memory footprint and computation during the computationally dominant decision tree build phase, but reduces the information loss of the model and reduces the introduction of false information artifacts. A method of binning data in a database for data mining modeling in a database system, the data stored in a database table in the database system, the data mining modeling having selected at least one predictor and one target for the data, the data including a plurality of values of the predictor and a plurality of values of the target, the method comprises constructing a binary tree for the predictor that splits the values of the predictor into a plurality of portions, pruning the binary tree, and defining as bins of the predictor leaves of the tree that remain after pruning, each leaf of the tree representing a portion of the values of the predictor.
摘要:
A computer-implemented method of creating a data mining model in a database management system comprises accepting a database language statement at the database management system, the database language statement indicating a dataset and a data mining model to be created from the dataset, and creating, in the database management system, the indicated data mining model using the indicated dataset, wherein creation and application of the data mining model does not require moving data to a separate data mining engine.
摘要:
Decision trees are efficiently represented in a relational database. A computer-implemented method of representing a decision tree model in relational form comprises providing a directed acyclic graph comprising a plurality of nodes and a plurality of links, each link connecting a plurality of nodes, encoding a tree structure by including in each node a parent-child relationship of the node with other nodes, encoding in each node information relating to a split represented by the node, the split information including a splitting predictor and a split value, and encoding in each node a target histogram.
摘要:
A method, system, and computer program product for counting predictor-target pairs for a decision tree model provides the capability to generate count tables that is quicker and more efficient than previous techniques. A method of counting predictor-target pairs for a decision tree model, the decision tree model based on data stored in a database, the data comprising a plurality of rows of data, at least one predictor and at least one target, comprises generating a bitmap for each split node of data stored in a database system by intersecting a parent node bitmap and a bitmap of a predictor that satisfies a condition of the node, intersecting each split node bitmap with each predictor bitmap and with each target bitmap to form intersected bitmaps, and counting bits of each intersected bitmap to generate a count of predictor-target pairs.
摘要:
Binning of predictor values used for generating a data mining model provides useful reduction in memory footprint and computation during the computationally dominant decision tree build phase, but reduces the information loss of the model and reduces the introduction of false information artifacts. A method of binning data in a database for data mining modeling in a database system, the data stored in a database table in the database system, the data mining modeling having selected at least one predictor and one target for the data, the data including a plurality of values of the predictor and a plurality of values of the target, the method comprises constructing a binary tree for the predictor that splits the values of the predictor into a plurality of portions, pruning the binary tree, and defining as bins of the predictor leaves of the tree that remain after pruning, each leaf of the tree representing a portion of the values of the predictor.
摘要:
A method, system, and computer program product for counting predictor-target pairs for a decision tree model provides the capability to generate count tables that is quicker and more efficient than previous techniques. A method of counting predictor-target pairs for a decision tree model, the decision tree model based on data stored in a database, the data comprising a plurality of rows of data, at least one predictor and at least one target, comprises generating a bitmap for each split node of data stored in a database system by intersecting a parent node bitmap and a bitmap of a predictor that satisfies a condition of the node, intersecting each split node bitmap with each predictor bitmap and with each target bitmap to form intersected bitmaps, and counting bits of each intersected bitmap to generate a count of predictor-target pairs.
摘要:
An implementation of NMF functionality integrated into a relational database management system provides the capability to apply NMF to relational datasets and to sparse datasets. A database management system comprises a multi-dimensional data table operable to store data and a processing unit operable to perform non-negative matrix factorization on data stored in the multi-dimensional data table and to generate a plurality of data tables, each data table being smaller than the multi-dimensional data table and having reduced dimensionality relative to the multi-dimensional data table. The multi-dimensional data table may be a relational data table.
摘要:
Decision trees are efficiently represented in a relational database. A computer-implemented method of representing a decision tree model in relational form comprises providing a directed acyclic graph comprising a plurality of nodes and a plurality of links, each link connecting a plurality of nodes, encoding a tree structure by including in each node a parent-child relationship of the node with other nodes, encoding in each node information relating to a split represented by the node, the split information including a splitting predictor and a split value, and encoding in each node a target histogram.
摘要:
An implementation of SVM functionality integrated into a relational database management system (RDBMS) improves efficiency, time consumption, and data security, reduces the parameter tuning challenges presented to the inexperienced user, and reduces the computational costs of building SVM models. A database management system comprises data stored in the database management system and a processing unit comprising a client application programming interface operable to provide an interface to client software, a build unit operable to build a support vector machine model on at least a portion of the data stored in the database management system, and an apply unit operable to apply the support vector machine model using the data stored in the database management system. The database management system may be a relational database management system.
摘要:
An implementation of SVM functionality improves efficiency, time consumption, and data security, reduces the parameter tuning challenges presented to the inexperienced user, and reduces the computational costs of building SVM models. A system for support vector machine processing comprises data stored in the system, a client application programming interface operable to provide an interface to client software, a build unit operable to build a support vector machine model on at least a portion of the data stored in the system, based on a plurality of model-building parameters, a parameter estimation unit operable to estimate values for at least some of the model-building parameters, and an apply unit operable to apply the support vector machine model using the data stored in the system.