Compressing, storing and searching sequence data

    Publication No.: US10777304B2

    Publication date: 2020-09-15

    Application No.: US15657359

    Filing date: 2017-07-24

    IPC classification: G16B50/00 H03M7/30

    Abstract: The redundancy in genomic sequence data is exploited by compressing sequence data in such a way as to allow direct computation on the compressed data using methods that are referred to herein as “compressive” algorithms. This approach reduces the task of computing on many similar genomes to only slightly more than that of operating on just one. In this approach, the redundancy among genomes is translated into computational acceleration by storing genomes in a compressed format that respects the structure of similarities and differences important to analysis. Specifically, these differences are the nucleotide substitutions, insertions, deletions, and rearrangements introduced by evolution. Once such a compressed library has been created, analysis is performed on it in time proportional to its compressed size, rather than having to reconstruct the full data set every time one wishes to query it.

    Compressing, storing and searching sequence data
    Type: Invention application
    Status: Under examination (published)

    Publication No.: US20130191351A1

    Publication date: 2013-07-25

    Application No.: US13722121

    Filing date: 2012-12-20

    IPC classification: G06F19/28

    CPC classification: G06F19/28 H03M7/3062

    Abstract: The redundancy in genomic sequence data is exploited by compressing sequence data in such a way as to allow direct computation on the compressed data using methods that are referred to herein as “compressive” algorithms. This approach reduces the task of computing on many similar genomes to only slightly more than that of operating on just one. In this approach, the redundancy among genomes is translated into computational acceleration by storing genomes in a compressed format that respects the structure of similarities and differences important to analysis. Specifically, these differences are the nucleotide substitutions, insertions, deletions, and rearrangements introduced by evolution. Once such a compressed library has been created, analysis is performed on it in time proportional to its compressed size, rather than having to reconstruct the full data set every time one wishes to query it.
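
The idea in the abstract can be illustrated with a small Python sketch. It is not the patented method: it is a toy model that assumes every genome has the same length as a shared reference and differs from it only by substitutions (insertions, deletions, and rearrangements are left out for brevity). The names `compress`, `find_all`, and `search` are hypothetical. The reference is scanned once per query; each genome then costs only a re-scan of a short window around each stored difference, so the work grows with the number of differences rather than with the total uncompressed length.

```python
from typing import Dict, List

# A genome is stored as {position: substituted base} relative to the reference.
Diffs = Dict[int, str]


def compress(reference: str, genome: str) -> Diffs:
    """Keep only the positions where the genome differs from the reference."""
    if len(genome) != len(reference):
        raise ValueError("this sketch only handles equal-length sequences")
    return {i: b for i, (a, b) in enumerate(zip(reference, genome)) if a != b}


def find_all(text: str, pattern: str, offset: int = 0) -> List[int]:
    """Return every (possibly overlapping) match position, shifted by offset."""
    hits = []
    i = text.find(pattern)
    while i != -1:
        hits.append(i + offset)
        i = text.find(pattern, i + 1)
    return hits


def search(reference: str, library: Dict[str, Diffs], pattern: str) -> Dict[str, List[int]]:
    """Exact-match search over all genomes without decompressing them."""
    m = len(pattern)
    ref_hits = find_all(reference, pattern)  # computed once, shared by all genomes
    results = {}
    for name, diffs in library.items():
        # A reference hit is still valid if no substitution falls inside it.
        hits = {h for h in ref_hits if not any(h <= p < h + m for p in diffs)}
        # Any other hit must overlap a substitution, so re-scan only the
        # window of width 2m - 1 around each one.
        for p in diffs:
            lo = max(0, p - m + 1)
            hi = min(len(reference), p + m)
            window = "".join(diffs.get(i, reference[i]) for i in range(lo, hi))
            hits.update(find_all(window, pattern, lo))
        results[name] = sorted(hits)
    return results


reference = "ACGTACGTAC"
library = {
    "g1": compress(reference, "ACGTACCTAC"),  # one substitution at position 6
    "g2": compress(reference, "ACGAACGTAC"),  # one substitution at position 3
}
print(search(reference, library, "ACG"))  # {'g1': [0], 'g2': [0, 4]}
```

The per-hit check against every difference is deliberately naive; a real implementation would index the differences, and would need coordinate mapping to support insertions and deletions.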
