Single pass space efficient system and method for generating an approximate quantile in a data set having an unknown size
    Granted invention patent (status: lapsed)

    Publication No.: US06343288B1

    Publication date: 2002-01-29

    Application No.: US09268089

    Filing date: 1999-03-12

    IPC classification: G06F 17/30

    Abstract: A space-efficient system and method for generating an approximate φ-quantile data element of a data set in a single pass over the data set, without a priori knowledge of the size of the data set. The approximate φ-quantile is guaranteed to lie within a user-specified approximation error ε of the true quantile being sought with a probability of at least 1−δ, where δ is a user-defined probability of failure. Initially, b buffers, each having a capacity of k elements, are filled with elements from the data set, with the values of b and k depending on the approximation error ε and the probability δ. The buffers are then collapsed into an output buffer; the remaining buffers are then refilled with elements and collapsed (along with the previous output buffer), and so on until the entire data set has been processed and a single output buffer remains. The element of that buffer at the position corresponding to the sought quantile is then output as the approximate quantile. In later iterations (when the height of the tree of collapse operations is at least equal to a predetermined height that depends on δ and ε), the data is sampled non-uniformly to populate the buffers and achieve the desired performance. Parallel processors can be used, with the final output buffers of the processors being sent to a collecting processor P0 as input buffers to the collecting processor P0.
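
    The collapse step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: the `(weight, sorted elements)` buffer layout, the function names `collapse` and `weighted_quantile`, and the evenly spaced selection rule with a fixed offset are all assumptions made for illustration. The sketch also leaves out the choice of b and k from ε and δ, the scheduling of collapses, and the non-uniform sampling used in later iterations.

```python
import math
from typing import List, Tuple

# Assumed layout: (weight, sorted elements). The weight says how many
# original data elements each stored element stands for.
Buffer = Tuple[int, List[float]]


def collapse(buffers: List[Buffer]) -> Buffer:
    """Merge equal-capacity weighted buffers into one buffer of capacity k.

    Conceptually, every element is repeated `weight` times, the copies are
    sorted, and k evenly spaced elements are kept. The output weight is the
    sum of the input weights.
    """
    k = len(buffers[0][1])
    total_weight = sum(w for w, _ in buffers)
    merged = sorted((x, w) for w, xs in buffers for x in xs)

    # 1-based positions in the conceptual weighted sequence (assumed rule).
    offset = (total_weight + 1) // 2
    targets = [j * total_weight + offset for j in range(k)]

    out: List[float] = []
    seen = 0  # number of conceptual copies consumed so far
    t = 0     # index of the next target position
    for value, w in merged:
        seen += w
        while t < k and targets[t] <= seen:
            out.append(value)
            t += 1
    return total_weight, out


def weighted_quantile(buffers: List[Buffer], phi: float) -> float:
    """Return the element at weighted rank ceil(phi * N) over the buffers."""
    merged = sorted((x, w) for w, xs in buffers for x in xs)
    total = sum(w for _, w in merged)
    target = max(1, math.ceil(phi * total))
    seen = 0
    for value, w in merged:
        seen += w
        if seen >= target:
            return value
    return merged[-1][0]


# Three full buffers of weight 1 holding the values 1..9.
summary = collapse([(1, [1, 4, 7]), (1, [2, 5, 8]), (1, [3, 6, 9])])
median = weighted_quantile([summary], 0.5)
```

    In this toy run the nine input values are summarized by three elements of weight 3, and the query for φ = 0.5 is answered from the summary alone, which is the source of the space saving.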
