BASECALLER FOR DNA SEQUENCING USING MACHINE LEARNING
    1.
    Invention application
    BASECALLER FOR DNA SEQUENCING USING MACHINE LEARNING (status: pending, published)

    Publication No.: US20150169824A1

    Publication date: 2015-06-18

    Application No.: US14571022

    Filing date: 2014-12-15

    IPC classification: G06F19/24

    CPC classification: G06F19/24 G06F19/22

    Abstract: Methods, systems, and apparatuses are provided for creating and using a machine-learning model to call a base at a position of a nucleic acid based on intensity values measured during a production sequencing run. The model can be trained using training data from training sequencing runs performed earlier. The model is trained using intensity values and assumed sequences that are determined as the correct output. The training data can be filtered to improve accuracy. The training data can be selected in a specific manner to be representative of the type of organism to be sequenced. The model can be trained to use intensity signals from multiple cycles and from neighboring nucleic acids to improve accuracy in the base calls.
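
The training-then-calling flow described in this abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration: a nearest-centroid classifier stands in for the unspecified machine-learning model, intensities are taken to be four channel values per cycle in A, C, G, T order, and the names `window_features`, `train`, and `call_bases` are hypothetical. Only the multi-cycle window is modeled; signals from neighboring nucleic acids are left out.

```python
from collections import defaultdict
from math import dist

BASES = "ACGT"  # assumed channel order, one intensity value per base


def window_features(intensities, cycle, flank=1):
    """Concatenate the channel intensities of `cycle` and its flanking cycles.

    Cycles that fall outside the read are padded with zeros.
    """
    feats = []
    for c in range(cycle - flank, cycle + flank + 1):
        if 0 <= c < len(intensities):
            feats.extend(intensities[c])
        else:
            feats.extend([0.0] * len(BASES))
    return feats


def train(training_runs, flank=1):
    """Build one mean feature vector (centroid) per base.

    training_runs: iterable of (intensities, assumed_sequence) pairs, where the
    assumed sequence is treated as the correct output for each cycle.
    """
    sums = {}
    counts = defaultdict(int)
    for intensities, assumed in training_runs:
        for cycle, base in enumerate(assumed):
            feats = window_features(intensities, cycle, flank)
            if base not in sums:
                sums[base] = [0.0] * len(feats)
            sums[base] = [s + x for s, x in zip(sums[base], feats)]
            counts[base] += 1
    return {b: [s / counts[b] for s in sums[b]] for b in sums}


def call_bases(model, intensities, flank=1):
    """Call each cycle as the base whose centroid is closest to its features."""
    return "".join(
        min(model, key=lambda b: dist(model[b], window_features(intensities, c, flank)))
        for c in range(len(intensities))
    )
```

A production model would presumably be far richer; the sketch only shows how intensity values paired with assumed sequences become training examples, and how the same feature construction is reused at calling time.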

    USING DOUBLET INFORMATION IN GENOME MAPPING AND ASSEMBLY
    3.
    Invention application
    USING DOUBLET INFORMATION IN GENOME MAPPING AND ASSEMBLY (status: pending, published)

    Publication No.: US20150317433A1

    Publication date: 2015-11-05

    Application No.: US14701248

    Filing date: 2015-04-30

    IPC classification: G06F19/22

    CPC classification: G16B30/00

    Abstract: Systems, methods, and apparatuses are provided for determining a sequence of a heteropolymer molecule. For example, all or part of a chromosome or a protein can be determined using sequence data from a plurality of heteropolymer fragments corresponding to the heteropolymer molecule. As one example, a position in the sequence read of a DNA fragment can be identified where a single base call is not clear. A multiplet base call can then be used, where the multiplet base call includes two or more bases at the position, along with a score for each base. The scores can be carried through mapping and assembly procedures, where the scores can be used to determine a final base call for the position in a chromosome of a genome of an organism. Other examples can be used for other monomer units besides bases.
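
One way to picture a multiplet base call is as a small map from candidate bases to scores that travels with the read through mapping. The sketch below is an illustration under stated assumptions, not the method claimed in the application: scores are simply summed per reference position, the highest total wins, and the name `final_base_calls` and the input layout are hypothetical.

```python
from collections import defaultdict


def final_base_calls(mapped_reads):
    """Combine per-read base scores into one final call per reference position.

    mapped_reads: list of (start, calls) pairs. `start` is the mapped reference
    coordinate of the read and `calls` holds one {base: score} dict per read
    position. A clear call has a single entry; a multiplet call has two or more.
    Assumption: scores are additive across reads.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for start, calls in mapped_reads:
        for offset, call in enumerate(calls):
            for base, score in call.items():
                totals[start + offset][base] += score
    return {pos: max(scores, key=scores.get) for pos, scores in sorted(totals.items())}
```

For instance, a read with an unclear second position, `{"C": 0.6, "T": 0.4}`, can be outvoted by an overlapping read that scores `T` at 0.9 for the same coordinate, because the per-base scores are kept rather than collapsed to a single early call.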

    PHASING AND LINKING PROCESSES TO IDENTIFY VARIATIONS IN A GENOME
    4.
    Invention application
    PHASING AND LINKING PROCESSES TO IDENTIFY VARIATIONS IN A GENOME (status: pending, published)

    Publication No.: US20150094961A1

    Publication date: 2015-04-02

    Application No.: US14503872

    Filing date: 2014-10-01

    IPC classification: G06F19/22

    CPC classification: G16B30/00 G16B20/00

    Abstract: Long fragment read techniques can be used to identify deletions and resolve base calls by utilizing shared labels (e.g., shared aliquots) of a read with any reads corresponding to heterozygous loci (hets) of a haplotype. For example, the linking of a locus to a haplotype of multiple hets can increase the reads available at the locus for determining a base call for a particular haplotype. For a hemizygous deletion, a region can be linked to one or more hets, and the labels for a particular haplotype can be used to identify which reads in the region correspond to which haplotype. In this manner, since the reads for a particular haplotype can be identified, a hemizygous deletion can be determined. Further, a phasing rate of pulses can be used to identify large deletions. A deletion can be identified when the phasing rate is sufficiently low, and other criteria can be used.
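
The label-sharing idea can be sketched as follows. This is a simplified illustration with assumed inputs: each haplotype is represented by the set of labels (for example, aliquot identifiers) seen on reads that support its het alleles, each read in the region of interest carries one label, and a haplotype with no reads while another haplotype is well covered is flagged as a possible hemizygous deletion. The function names and the `min_reads` threshold are hypothetical, and the phasing-rate criterion mentioned in the abstract is not modeled.

```python
def assign_reads_to_haplotypes(het_labels, region_reads):
    """Assign each read in a region to a haplotype via shared labels.

    het_labels: {haplotype_id: set of labels observed on that haplotype's hets}
    region_reads: list of (read_id, label) pairs for reads mapped to the region
    Reads whose label matches no haplotype, or more than one, stay unassigned.
    """
    assignment = {hap: [] for hap in het_labels}
    assignment["unassigned"] = []
    for read_id, label in region_reads:
        matches = [hap for hap, labels in het_labels.items() if label in labels]
        key = matches[0] if len(matches) == 1 else "unassigned"
        assignment[key].append(read_id)
    return assignment


def hemizygous_deletion_candidates(assignment, min_reads=3):
    """Flag haplotypes with no reads in the region while another haplotype
    has at least `min_reads` reads (an assumed, illustrative criterion)."""
    haps = [h for h in assignment if h != "unassigned"]
    return [
        h for h in haps
        if not assignment[h]
        and any(len(assignment[o]) >= min_reads for o in haps if o != h)
    ]
```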

    Identification of DNA fragments and structural variations
    5.
    Granted invention patent
    Identification of DNA fragments and structural variations (status: in force)

    Publication No.: US09514272B2

    Publication date: 2016-12-06

    Application No.: US13649966

    Filing date: 2012-10-11

    IPC classification: G01N33/50 G06F19/22

    CPC classification: G06F19/22

    Abstract: Various short reads can be grouped and identified as coming from a same long DNA fragment (e.g., by using wells with a relatively low concentration of DNA). A histogram of the genomic coverage of a group of short reads can provide the edges of the corresponding long fragment (pulse). The knowledge of these pulses can provide an ability to determine the haploid genome and to identify structural variations.
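
As a rough illustration of reading pulse edges off a coverage histogram, the sketch below bins the mapped start coordinates of the short reads from one well and reports each run of covered bins as a pulse. The bin size, the minimum read count per bin, the tolerated gap, and the name `find_pulses` are all assumptions chosen for the example rather than values from the patent.

```python
from collections import Counter


def find_pulses(read_starts, bin_size=1000, min_reads=1, max_gap_bins=2):
    """Return (start, end) genomic intervals of pulses for one well.

    read_starts: mapped start coordinates of short reads on one chromosome.
    A bin counts as covered when it holds at least `min_reads` reads; covered
    bins separated by at most `max_gap_bins` empty bins join the same pulse.
    """
    histogram = Counter(pos // bin_size for pos in read_starts)
    covered = sorted(b for b, n in histogram.items() if n >= min_reads)
    pulses = []
    if not covered:
        return pulses
    start = prev = covered[0]
    for b in covered[1:]:
        if b - prev - 1 > max_gap_bins:
            # Gap too wide: close the current pulse and open a new one.
            pulses.append((start * bin_size, (prev + 1) * bin_size))
            start = b
        prev = b
    pulses.append((start * bin_size, (prev + 1) * bin_size))
    return pulses
```

The edges are only resolved to the bin size, so a smaller `bin_size` gives sharper edges at the cost of more spurious gaps when coverage is sparse.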

    LONG FRAGMENT DE NOVO ASSEMBLY USING SHORT READS
    6.
    Invention application
    LONG FRAGMENT DE NOVO ASSEMBLY USING SHORT READS (status: pending, published)

    Publication No.: US20150057947A1

    Publication date: 2015-02-26

    Application No.: US14467797

    Filing date: 2014-08-25

    IPC classification: G06F19/20

    Abstract: Techniques perform de novo assembly. The assembly can use labels that indicate origins of the nucleic acid molecules. For example, a representative set of labels identified from initial reads that overlap with a seed can be used. Mate pair information can be used. A sequence read that aligns to an end of a contig can lead to using the other sequence read of a mate pair, and the other sequence read can be used to determine which branch to use to extend, e.g., in an external cloud or helper contig. A kmer index can include labels indicating an origin of each of the nucleic acid molecules that include each kmer, memory addresses of the reads that correspond to each kmer in the index, and a position in each of the mate pairs that includes the kmer. Haploid seeds can also be determined using polymorphic loci identified in a population.
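
The kmer index described in this abstract can be pictured with a small in-memory sketch. It is an illustration only: a list index stands in for the memory address of a read, the position within a mate pair is recorded as an arm number plus an offset, and the names `KmerHit`, `build_kmer_index`, and `labels_overlapping_seed` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class KmerHit(NamedTuple):
    label: str       # origin of the nucleic acid molecule (e.g. a well id)
    read_index: int  # stand-in for the memory address of the mate pair
    arm: int         # which read of the mate pair contains the kmer (0 or 1)
    offset: int      # position of the kmer within that read


def build_kmer_index(mate_pairs, k):
    """Index every kmer of every mate-pair read.

    mate_pairs: list of (label, (read1, read2)) entries.
    """
    index = defaultdict(list)
    for i, (label, arms) in enumerate(mate_pairs):
        for arm, seq in enumerate(arms):
            for off in range(len(seq) - k + 1):
                index[seq[off:off + k]].append(KmerHit(label, i, arm, off))
    return index


def labels_overlapping_seed(index, seed, k):
    """Collect the labels of all reads that share at least one kmer with `seed`."""
    labels = set()
    for off in range(len(seed) - k + 1):
        for hit in index.get(seed[off:off + k], ()):
            labels.add(hit.label)
    return labels
```

Storing the arm alongside the offset makes it cheap to jump from a read that aligns to the end of a contig to the other read of the same mate pair, which is the lookup the abstract relies on when choosing a branch to extend.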

    Phasing and linking processes to identify variations in a genome

    Publication No.: US10468121B2

    Publication date: 2019-11-05

    Application No.: US14503872

    Filing date: 2014-10-01

    IPC classification: G16B30/00 G16B20/00 G06F19/10

    Abstract: Long fragment read techniques can be used to identify deletions and resolve base calls by utilizing shared labels (e.g., shared aliquots) of a read with any reads corresponding to heterozygous loci (hets) of a haplotype. For example, the linking of a locus to a haplotype of multiple hets can increase the reads available at the locus for determining a base call for a particular haplotype. For a hemizygous deletion, a region can be linked to one or more hets, and the labels for a particular haplotype can be used to identify which reads in the region correspond to which haplotype. In this manner, since the reads for a particular haplotype can be identified, a hemizygous deletion can be determined. Further, a phasing rate of pulses can be used to identify large deletions. A deletion can be identified when the phasing rate is sufficiently low, and other criteria can be used.

    IDENTIFICATION OF DNA FRAGMENTS AND STRUCTURAL VARIATIONS
    8.
    Invention application
    IDENTIFICATION OF DNA FRAGMENTS AND STRUCTURAL VARIATIONS (status: in force)

    Publication No.: US20130096841A1

    Publication date: 2013-04-18

    Application No.: US13649966

    Filing date: 2012-10-11

    IPC classification: G06F17/18

    CPC classification: G06F19/22

    Abstract: Various short reads can be grouped and identified as coming from a same long DNA fragment (e.g., by using wells with a relatively low concentration of DNA). A histogram of the genomic coverage of a group of short reads can provide the edges of the corresponding long fragment (pulse). The knowledge of these pulses can provide an ability to determine the haploid genome and to identify structural variations.

    Long fragment de novo assembly using short reads

    Publication No.: US10726942B2

    Publication date: 2020-07-28

    Application No.: US14467797

    Filing date: 2014-08-25

    IPC classification: G16B30/00 G16B30/20

    Abstract: Techniques perform de novo assembly. The assembly can use labels that indicate origins of the nucleic acid molecules. For example, a representative set of labels identified from initial reads that overlap with a seed can be used. Mate pair information can be used. A sequence read that aligns to an end of a contig can lead to using the other sequence read of a mate pair, and the other sequence read can be used to determine which branch to use to extend, e.g., in an external cloud or helper contig. A kmer index can include labels indicating an origin of each of the nucleic acid molecules that include each kmer, memory addresses of the reads that correspond to each kmer in the index, and a position in each of the mate pairs that includes the kmer. Haploid seeds can also be determined using polymorphic loci identified in a population.

    Basecaller for DNA sequencing using machine learning

    Publication No.: US10068053B2

    Publication date: 2018-09-04

    Application No.: US14571022

    Filing date: 2014-12-15

    Abstract: Methods, systems, and apparatuses are provided for creating and using a machine-learning model to call a base at a position of a nucleic acid based on intensity values measured during a production sequencing run. The model can be trained using training data from training sequencing runs performed earlier. The model is trained using intensity values and assumed sequences that are determined as the correct output. The training data can be filtered to improve accuracy. The training data can be selected in a specific manner to be representative of the type of organism to be sequenced. The model can be trained to use intensity signals from multiple cycles and from neighboring nucleic acids to improve accuracy in the base calls.