LOW MEMORY SAMPLING-BASED ESTIMATION OF DISTINCT ELEMENTS AND DEDUPLICATION

    Publication No.: US20190005099A1

    Publication date: 2019-01-03

    Application No.: US16121696

    Filing date: 2018-09-05

    Abstract: Methods, computing systems and computer program products implement embodiments of the present invention that include partitioning a dataset into a full set of logical data units, and selecting a sample subset of the full set, the sample subset including a random sample of the full set based on a sampling ratio. A set of target hash values is selected from a full range of hash values, and, using a hash function, a respective unit hash value is calculated for each of the logical data units in the sample subset. A histogram is computed that indicates a duplication count of each of the unit hash values that matches a given target hash value, and, based on the histogram, a number of distinct logical data units in the full set is estimated.
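    The sampling and histogram steps described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the patented method: the function names, the fixed-size partitioning, the use of SHA-256 truncated to 64 bits, and the choice of "all hashes below a threshold" as the target set are assumptions made for the example. The abstract does not say how the full-set estimate is derived from the histogram, so the sketch stops at the histogram and a simple scale-up to the number of distinct units in the sample only.

```python
import hashlib
import random
from collections import Counter


def partition(data: bytes, unit_size: int) -> list[bytes]:
    """Split a dataset into fixed-size logical data units (assumed layout)."""
    return [data[i:i + unit_size] for i in range(0, len(data), unit_size)]


def unit_hash(unit: bytes) -> int:
    """64-bit hash of a unit, taken from the first 8 bytes of SHA-256."""
    return int.from_bytes(hashlib.sha256(unit).digest()[:8], "big")


def duplication_histogram(units, sampling_ratio, target_fraction, seed=0):
    """Return (histogram, sample_size).

    histogram[k] is the number of target hash values that were seen
    exactly k times in the random sample. Only hashes inside the target
    set are tracked, which is what keeps memory use low.
    """
    rng = random.Random(seed)
    sample_size = min(len(units), max(1, round(len(units) * sampling_ratio)))
    sample = rng.sample(units, sample_size)

    # Assumed target set: every hash value below this threshold.
    threshold = int(target_fraction * 2**64)

    counts = Counter()
    for unit in sample:
        h = unit_hash(unit)
        if h < threshold:
            counts[h] += 1

    return Counter(counts.values()), sample_size


def distinct_in_sample_estimate(histogram, target_fraction):
    """Scale the matched distinct hashes up to the whole sample.

    This estimates distinct units in the SAMPLE only; extrapolating to
    the full set needs an estimator the abstract does not specify.
    """
    return sum(histogram.values()) / target_fraction
```

    With `target_fraction=1.0` every hash is tracked and the histogram is exact for the sample; smaller fractions trade accuracy for memory.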

    2. ESTIMATION OF DATA REDUCTION RATE IN A DATA STORAGE SYSTEM
    Type: patent application; status: lapsed

    Publication No.: US20140052699A1

    Publication date: 2014-02-20

    Application No.: US13589197

    Filing date: 2012-08-20

    Abstract: Systems and methods for estimating a data reduction ratio for a data set are provided. The method comprises selecting a plurality of m elements from a data set comprising a plurality of N elements; associating an identifier h_i with each of the plurality of m elements; associating an identifier h_e with each of the plurality of elements in the data set; tracking the number of times an element i appears in a base set that includes the plurality of m elements selected from the data set; calculating a value count_i that indicates the number of times an identifier h_e matches an identifier h_i; and estimating the data reduction ratio for the plurality of N elements in the data set, based on the number m of elements selected from the data set and the value count_i.
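    One way to read the steps in this abstract is as a two-pass procedure: draw a base set of m elements, then scan all N elements and count matches against the base set's identifiers. The abstract does not give the final formula, so the sketch below assumes a standard estimator, the average of 1/count_i over the m sampled elements, which estimates the fraction of elements that remain after deduplication (distinct / N). The function names, the use of SHA-256 as the identifier, and sampling with replacement are assumptions for illustration only.

```python
import hashlib
import random
from collections import Counter


def identifier(element: bytes) -> bytes:
    """Identifier for an element (assumed here to be its SHA-256 digest)."""
    return hashlib.sha256(element).digest()


def estimate_reduction_ratio(elements, m, seed=0):
    """Estimate distinct/N from a base set of m sampled elements.

    Assumed estimator: (1/m) * sum over sampled elements of 1/count_i,
    where count_i is how often the sampled element occurs in the full
    data set.
    """
    rng = random.Random(seed)

    # Pass 1: pick the base set (with replacement) and compute each h_i,
    # tracking how many times each identifier appears in the base set.
    base = [identifier(e) for e in rng.choices(elements, k=m)]
    base_multiplicity = Counter(base)

    # Pass 2: compute h_e for every element of the data set and count
    # matches against the base-set identifiers (count_i).
    count = dict.fromkeys(base_multiplicity, 0)
    for e in elements:
        h_e = identifier(e)
        if h_e in count:
            count[h_e] += 1

    # Each sampled occurrence contributes 1/count_i to the average.
    total = sum(mult / count[h] for h, mult in base_multiplicity.items())
    return total / m
```

    Only the m base-set identifiers are held in memory during the full scan, so the memory cost depends on m rather than on N.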


    3. Estimation of data reduction rate in a data storage system
    Type: granted patent; status: lapsed

    Publication No.: US08650163B1

    Publication date: 2014-02-11

    Application No.: US13589197

    Filing date: 2012-08-20

    Abstract: Systems and methods for estimating a data reduction ratio for a data set are provided. The method comprises selecting a plurality of m elements from a data set comprising a plurality of N elements; associating an identifier h_i with each of the plurality of m elements; associating an identifier h_e with each of the plurality of elements in the data set; tracking the number of times an element i appears in a base set that includes the plurality of m elements selected from the data set; calculating a value count_i that indicates the number of times an identifier h_e matches an identifier h_i; and estimating the data reduction ratio for the plurality of N elements in the data set, based on the number m of elements selected from the data set and the value count_i.

