Generating a predictive model from multiple data sources
    1.
    Granted patent
    Generating a predictive model from multiple data sources (legal status: in force)

    Publication No.: US08996452B2

    Publication date: 2015-03-31

    Application No.: US13545817

    Filing date: 2012-07-10

    IPC classification: G06F7/00 G06F17/00 G06Q10/06

    CPC classification: G06Q10/06

    Abstract: Techniques are disclosed for generating an ensemble model from multiple data sources. In one embodiment, the ensemble model is generated using a global validation sample, a global holdout sample and base models generated from the multiple data sources. An accuracy value may be determined for each base model, on the basis of the global validation dataset. The ensemble model may be generated from a subset of the base models, where the subset is selected on the basis of the determined accuracy values.
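
The selection step described in the abstract above can be illustrated with a minimal Python sketch. It is not the patented method: the threshold "base models", the majority-vote combiner, the `keep` parameter, and the use of the holdout sample for a final accuracy check are all assumptions made for illustration, as are all names.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Model = Callable[[float], int]        # hypothetical: maps one feature to a class label
Sample = Sequence[Tuple[float, int]]  # (feature, true label) pairs


def accuracy(model: Model, sample: Sample) -> float:
    """Fraction of records in `sample` that `model` labels correctly."""
    return sum(model(x) == y for x, y in sample) / len(sample)


def select_base_models(models: List[Model], validation: Sample, keep: int) -> List[Model]:
    """Rank base models by accuracy on the global validation sample; keep the top `keep`."""
    ranked = sorted(models, key=lambda m: accuracy(m, validation), reverse=True)
    return ranked[:keep]


def make_ensemble(members: List[Model]) -> Model:
    """Combine the selected base models by majority vote (an assumed combiner)."""
    def predict(x: float) -> int:
        votes = Counter(m(x) for m in members)
        return votes.most_common(1)[0][0]
    return predict


# Toy base models, one per hypothetical data source.
base_models: List[Model] = [
    lambda x: int(x > 0.5),
    lambda x: int(x > 0.4),
    lambda x: int(x > 0.6),
    lambda x: int(x < 0.5),  # deliberately poor model
]
validation = [(0.1, 0), (0.3, 0), (0.45, 0), (0.55, 1), (0.7, 1), (0.9, 1)]
holdout = [(0.2, 0), (0.35, 0), (0.65, 1), (0.8, 1)]

selected = select_base_models(base_models, validation, keep=3)
ensemble = make_ensemble(selected)
holdout_accuracy = accuracy(ensemble, holdout)
print(f"kept {len(selected)} base models, holdout accuracy {holdout_accuracy:.2f}")
```

In this toy run the deliberately poor model scores lowest on the validation sample and is excluded from the ensemble.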

    GENERATING A PREDICTIVE MODEL FROM MULTIPLE DATA SOURCES
    3.
    Patent application
    GENERATING A PREDICTIVE MODEL FROM MULTIPLE DATA SOURCES (legal status: in force)

    Publication No.: US20120239613A1

    Publication date: 2012-09-20

    Application No.: US13048536

    Filing date: 2011-03-15

    IPC classification: G06F7/00 G06F17/00 G06F17/30

    CPC classification: G06Q10/06

    Abstract: Techniques are disclosed for generating an ensemble model from multiple data sources. In one embodiment, the ensemble model is generated using a global validation sample, a global holdout sample and base models generated from the multiple data sources. An accuracy value may be determined for each base model, on the basis of the global validation dataset. The ensemble model may be generated from a subset of the base models, where the subset is selected on the basis of the determined accuracy values.

    Computing and applying order statistics for data preparation
    5.
    Granted patent
    Computing and applying order statistics for data preparation (legal status: in force)

    Publication No.: US08868573B2

    Publication date: 2014-10-21

    Application No.: US13444718

    Filing date: 2012-04-11

    IPC classification: G06F7/00

    Abstract: Provided are techniques for generating order statistics and error bounds. For each of multiple, distributed data sources, a finite number of data bins are created for each field in that data source. Data values in each of the multiple, distributed data sources are processed to generate basic summaries for each of the data bins in a single pass of the data values. The data bins from each of the multiple, distributed data sources are sorted. One or more approximate order statistics are computed for a data set by accumulating counts from a number of the sorted data bins. Lower and upper error bounds are provided for each of the computed one or more approximate order statistics, wherein the lower and upper error bounds are values delimiting an interval containing a true value of an order statistic.
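
The binning idea in the abstract above can be sketched in Python as follows. This is a simplified illustration, not the patented procedure: it assumes every data source uses the same equal-width bin edges (so bins never overlap), keeps only count, minimum, and maximum as the per-bin summary, and reports the bin midpoint as the estimate. All names and parameters are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class BinSummary:
    """Basic per-bin summary: number of values and observed min/max."""
    count: int = 0
    lo: float = float("inf")
    hi: float = float("-inf")

    def add(self, value: float) -> None:
        self.count += 1
        self.lo = min(self.lo, value)
        self.hi = max(self.hi, value)

    def merge(self, other: "BinSummary") -> None:
        self.count += other.count
        self.lo = min(self.lo, other.lo)
        self.hi = max(self.hi, other.hi)


def summarize(values: Iterable[float], start: float, width: float,
              n_bins: int) -> Dict[int, BinSummary]:
    """Single pass over one data source, filling a fixed number of bins."""
    bins: Dict[int, BinSummary] = {}
    for v in values:
        # Out-of-range values are clamped into the first or last bin.
        idx = min(max(int((v - start) // width), 0), n_bins - 1)
        bins.setdefault(idx, BinSummary()).add(v)
    return bins


def approximate_order_statistic(sources: List[Dict[int, BinSummary]],
                                rank: int) -> Tuple[float, float, float]:
    """Return (estimate, lower bound, upper bound) for the 1-based `rank`."""
    merged: Dict[int, BinSummary] = {}
    for bins in sources:
        for idx, summary in bins.items():
            merged.setdefault(idx, BinSummary()).merge(summary)
    seen = 0
    for idx in sorted(merged):  # bins in ascending value order
        summary = merged[idx]
        seen += summary.count
        if seen >= rank:
            # The true order statistic lies inside this bin.
            return (summary.lo + summary.hi) / 2, summary.lo, summary.hi
    raise ValueError("rank exceeds the number of values")


source_a = [1, 7, 3, 9, 5]
source_b = [2, 8, 4, 6, 10]
summaries = [summarize(s, start=0.0, width=2.5, n_bins=4) for s in (source_a, source_b)]
estimate, lower, upper = approximate_order_statistic(summaries, rank=5)
print(estimate, lower, upper)
```

Because the bins share edges, the interval from `lower` to `upper` is guaranteed to contain the exact order statistic; narrower bins tighten the interval at the cost of more summaries to exchange.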

    COMPUTING AND APPLYING ORDER STATISTICS FOR DATA PREPARATION
    6.
    Patent application
    COMPUTING AND APPLYING ORDER STATISTICS FOR DATA PREPARATION (legal status: under examination, published)

    Publication No.: US20130218908A1

    Publication date: 2013-08-22

    Application No.: US13399838

    Filing date: 2012-02-17

    IPC classification: G06F17/30

    Abstract: Provided are techniques for generating order statistics and error bounds. For each of multiple, distributed data sources, a finite number of data bins are created for each field in that data source. Data values in each of the multiple, distributed data sources are processed to generate basic summaries for each of the data bins in a single pass of the data values. The data bins from each of the multiple, distributed data sources are sorted. One or more approximate order statistics are computed for a data set by accumulating counts from a number of the sorted data bins. Lower and upper error bounds are provided for each of the computed one or more approximate order statistics, wherein the lower and upper error bounds are values delimiting an interval containing a true value of an order statistic.

    INTERESTINGNESS OF DATA
    9.
    Patent application
    INTERESTINGNESS OF DATA (legal status: in force)

    Publication No.: US20130006998A1

    Publication date: 2013-01-03

    Application No.: US13172707

    Filing date: 2011-06-29

    IPC classification: G06F17/30

    CPC classification: G06F17/30321

    Abstract: Provided are techniques for analyzing fields. Statistical metrics for each field in a data set are received. A general interestingness index is generated for each field using one or more combination functions that aggregate standardized interestingness sub-indexes. One or more fields are identified as interesting for further analysis using the general interestingness index. One or more expert recommendations for field transformations are constructed for the identified one or more fields.
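
As a rough Python illustration of the flow in the abstract above: per-field metrics are mapped to standardized sub-indexes, a combination function aggregates them into a general index, and fields above a cutoff receive transformation suggestions. The abstract does not specify the metrics, scalings, combination function, cutoff, or recommendation rules, so every one of those choices below is an assumption, and all names are hypothetical.

```python
from typing import Callable, Dict, List

FieldMetrics = Dict[str, float]


def standardized_sub_indexes(metrics: FieldMetrics) -> Dict[str, float]:
    """Map raw metrics to sub-indexes in [0, 1] (scalings are illustrative assumptions)."""
    return {
        "skewness": min(abs(metrics["skewness"]) / 3.0, 1.0),
        "missing": min(metrics["missing_fraction"], 1.0),
        "outliers": min(metrics["outlier_fraction"] * 10.0, 1.0),
    }


def general_index(sub_indexes: Dict[str, float],
                  combine: Callable[[List[float]], float] = max) -> float:
    """Aggregate the sub-indexes with a combination function (max is assumed here)."""
    return combine(list(sub_indexes.values()))


def recommend(sub_indexes: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Build simple transformation suggestions from the sub-indexes that stand out."""
    tips: List[str] = []
    if sub_indexes["skewness"] >= threshold:
        tips.append("apply a log or power transform")
    if sub_indexes["missing"] >= threshold:
        tips.append("impute or flag missing values")
    if sub_indexes["outliers"] >= threshold:
        tips.append("trim or cap outliers")
    return tips


# Hypothetical metrics received for two fields of a data set.
fields: Dict[str, FieldMetrics] = {
    "income": {"skewness": 2.4, "missing_fraction": 0.02, "outlier_fraction": 0.01},
    "age": {"skewness": 0.3, "missing_fraction": 0.0, "outlier_fraction": 0.0},
}

sub = {name: standardized_sub_indexes(m) for name, m in fields.items()}
index = {name: general_index(s) for name, s in sub.items()}
interesting = [name for name, value in index.items() if value >= 0.5]
recommendations = {name: recommend(sub[name]) for name in interesting}
print(interesting, recommendations)
```

Swapping `max` for a weighted mean changes how strongly a single extreme sub-index drives the general index, which is the kind of choice a combination function encodes.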

    Interestingness of data
    10.
    Granted patent
    Interestingness of data (legal status: in force)

    Publication No.: US08843498B2

    Publication date: 2014-09-23

    Application No.: US13614335

    Filing date: 2012-09-13

    IPC classification: G06F7/00 G06F17/30

    CPC classification: G06F17/30321

    Abstract: Provided are techniques for analyzing fields. Statistical metrics for each field in a data set are received. A general interestingness index is generated for each field using one or more combination functions that aggregate standardized interestingness sub-indexes. One or more fields are identified as interesting for further analysis using the general interestingness index. One or more expert recommendations for field transformations are constructed for the identified one or more fields.
