Abstract:
Methods, systems, and apparatuses are provided for creating and using a machine-learning model to call a base at a position of a nucleic acid based on intensity values measured during a production sequencing run. The model can be trained using training data from training sequencing runs performed earlier. The model is trained using intensity values and assumed sequences that are treated as the correct output. The training data can be filtered to improve accuracy. The training data can be selected in a specific manner to be representative of the type of organism to be sequenced. The model can be trained to use intensity signals from multiple cycles and from neighboring nucleic acids to improve accuracy in the base calls.
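As a rough illustration of training on intensity values paired with assumed sequences, the sketch below uses a nearest-centroid classifier. This is a deliberately simple stand-in chosen for illustration, not the model described in the abstract. The function names and the four-channel intensity layout are assumptions; a feature vector could equally concatenate intensities from several cycles or from neighboring nucleic acids.

```python
from collections import defaultdict


def train_centroids(intensities, assumed_bases):
    """Average the intensity vectors observed for each assumed base.

    intensities: list of equal-length numeric vectors (hypothetical layout:
                 one value per detection channel).
    assumed_bases: list of bases taken as the correct output for each vector.
    """
    sums = {}
    counts = defaultdict(int)
    for vec, base in zip(intensities, assumed_bases):
        if base not in sums:
            sums[base] = [0.0] * len(vec)
        sums[base] = [s + v for s, v in zip(sums[base], vec)]
        counts[base] += 1
    return {b: [s / counts[b] for s in sums[b]] for b in sums}


def call_base(centroids, vec):
    """Call the base whose centroid is closest (squared Euclidean distance)."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, vec))
    return min(centroids, key=lambda b: dist2(centroids[b]))


# Toy training run: intensities with the bases assumed to be correct.
train_x = [
    [1.0, 0.0, 0.0, 0.0], [0.8, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.1],
    [0.0, 0.0, 0.0, 1.0],
]
train_y = ["A", "A", "C", "G", "T"]
model = train_centroids(train_x, train_y)

# Production run: call a base from newly measured intensities.
called = call_base(model, [0.9, 0.05, 0.0, 0.0])
```

Filtering the training data, as the abstract mentions, would amount to dropping entries from `train_x` and `train_y` before calling `train_centroids`.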
Abstract:
The present invention is directed to logic for analysis of nucleic acid sequence data that employs algorithms that lead to a substantial improvement in sequence accuracy and that can be used to phase sequence variations, e.g., in connection with the use of the long fragment read (LFR) process.
Abstract:
Systems, methods, and apparatuses are provided for determining a sequence of a heteropolymer molecule. For example, the sequence of all or part of a chromosome or a protein can be determined using sequence data from a plurality of heteropolymer fragments corresponding to the heteropolymer molecule. As one example, a position in the sequence read of a DNA fragment can be identified where a single base call is not clear. A multiplet base call can then be used, where the multiplet base call includes two or more bases at the position, along with a score for each base. The scores can be carried through mapping and assembly procedures, where the scores can be used to determine a final base call for the position in a chromosome of a genome of an organism. Other examples can be used for monomer units other than bases.
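One way to picture carrying multiplet scores through to a final call is sketched below. It assumes each read covering a position contributes a mapping from base to score, and that scores are combined by simple addition. Both the representation and the additive rule are assumptions made for illustration; the abstract does not specify how scores are combined.

```python
from collections import defaultdict


def final_base_call(multiplet_calls):
    """Combine per-read multiplet calls at one position into a final call.

    multiplet_calls: list of dicts mapping base -> score. A dict with one
    entry is an ordinary single base call; two or more entries form a
    multiplet call.
    Returns the base with the highest summed score and the score totals.
    """
    totals = defaultdict(float)
    for call in multiplet_calls:
        for base, score in call.items():
            totals[base] += score
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Three reads cover the same position; two of them are ambiguous.
calls_at_position = [
    {"A": 0.6, "G": 0.4},  # multiplet call
    {"A": 0.9},            # clear single call
    {"G": 0.7, "A": 0.3},  # multiplet call
]
best_base, score_totals = final_base_call(calls_at_position)
```

Keeping the per-base totals alongside the final call lets a later assembly step revisit the decision if additional reads arrive.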
Abstract:
Long fragment read techniques can be used to identify deletions and resolve base calls by utilizing shared labels (e.g., shared aliquots) of a read with any reads corresponding to heterozygous loci (hets) of a haplotype. For example, the linking of a locus to a haplotype of multiple hets can increase the reads available at the locus for determining a base call for a particular haplotype. For a hemizygous deletion, a region can be linked to one or more hets, and the labels for a particular haplotype can be used to identify which reads in the region correspond to which haplotype. In this manner, since the reads for a particular haplotype can be identified, a hemizygous deletion can be determined. Further, a phasing rate of pulses can be used to identify large deletions. A deletion can be identified when the phasing rate is sufficiently low, and other criteria can be used.
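The label-sharing idea can be sketched as follows: reads in a region are assigned to a haplotype when their label (e.g., aliquot) is among the labels seen at that haplotype's hets, and a region where one haplotype is well covered while the other has no reads is flagged as a candidate hemizygous deletion. The function names, the data layout, and the `min_reads` threshold are hypothetical; the abstract notes that other criteria can also be used.

```python
def assign_reads_to_haplotypes(read_labels, hap_labels):
    """Count reads in a region per haplotype using shared labels.

    read_labels: label of each read in the region.
    hap_labels: dict mapping haplotype name -> set of labels observed at
                the hets of that haplotype.
    Reads whose label matches no haplotype, or more than one, are skipped.
    """
    counts = {hap: 0 for hap in hap_labels}
    for label in read_labels:
        hits = [hap for hap, labels in hap_labels.items() if label in labels]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts


def looks_hemizygous_deleted(counts, min_reads=5):
    """Flag a region covered on exactly one of two haplotypes.

    min_reads is an illustrative threshold, not a value from the source.
    """
    covered = [h for h, n in counts.items() if n >= min_reads]
    empty = [h for h, n in counts.items() if n == 0]
    return len(counts) == 2 and len(covered) == 1 and len(empty) == 1


hap_labels = {"hap1": {"w1", "w2", "w3"}, "hap2": {"w4", "w5"}}
region_reads = ["w1", "w2", "w1", "w3", "w2", "w9"]  # "w9" is unlinked
region_counts = assign_reads_to_haplotypes(region_reads, hap_labels)
is_candidate = looks_hemizygous_deleted(region_counts)
```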
Abstract:
Various short reads can be grouped and identified as coming from the same long DNA fragment (e.g., by using wells with a relatively low concentration of DNA). A histogram of the genomic coverage of a group of short reads can provide the edges of the corresponding long fragment (pulse). The knowledge of these pulses can provide an ability to determine the haploid genome and to identify structural variations.
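A minimal sketch of finding pulse edges from a coverage histogram is shown below. It assumes the short reads from one well have already been mapped and are given as genomic start positions on a single chromosome. The bin size and the allowed gap between occupied bins are illustrative parameters, not values from the source.

```python
from collections import Counter


def find_pulses(read_starts, bin_size=1000, max_gap_bins=1):
    """Estimate long-fragment (pulse) edges from binned read coverage.

    read_starts: mapped start positions of the short reads from one well.
    Consecutive occupied bins whose indices differ by at most max_gap_bins
    are merged into one pulse. Returns (start, end) tuples rounded to bin
    boundaries.
    """
    hist = Counter(pos // bin_size for pos in read_starts)
    pulses = []
    start = prev = None
    for b in sorted(hist):
        if start is None:
            start = prev = b
        elif b - prev <= max_gap_bins:
            prev = b
        else:
            pulses.append((start * bin_size, (prev + 1) * bin_size))
            start = prev = b
    if start is not None:
        pulses.append((start * bin_size, (prev + 1) * bin_size))
    return pulses


well_reads = [100, 900, 1500, 2100, 50000, 50500, 51200]
pulses = find_pulses(well_reads)
```

With these toy positions the reads fall into two clusters, so two pulses are reported. A larger `max_gap_bins` tolerates sparser coverage within a fragment at the cost of merging nearby fragments.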
Abstract:
Techniques are provided for performing de novo assembly. The assembly can use labels that indicate origins of the nucleic acid molecules. For example, a representative set of labels identified from initial reads that overlap with a seed can be used. Mate pair information can also be used. A sequence read that aligns to an end of a contig can lead to using the other sequence read of a mate pair, and the other sequence read can be used to determine which branch to use to extend, e.g., in an external cloud or helper contig. A kmer index can include labels indicating an origin of each of the nucleic acid molecules that include each kmer, memory addresses of the reads that correspond to each kmer in the index, and a position in each of the mate pairs that includes the kmer. Haploid seeds can also be determined using polymorphic loci identified in a population.
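The kmer index described above can be pictured with the following sketch. Each kmer maps to a list of entries holding the label of the originating molecule, a reference to the read, and the offset of the kmer within that read. Using a list index in place of a memory address, and treating each read independently rather than as part of a mate pair, are simplifications assumed here for illustration.

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Index every kmer of every read.

    reads: list of (label, sequence) tuples, where label indicates the
           origin (e.g., well) of the molecule the read came from.
    Returns a dict mapping kmer -> list of (label, read_id, offset).
    """
    index = defaultdict(list)
    for read_id, (label, seq) in enumerate(reads):
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((label, read_id, offset))
    return index


def labels_for_kmer(index, kmer):
    """Return the set of origin labels of reads that contain the kmer."""
    return {label for label, _, _ in index.get(kmer, [])}


reads = [("w1", "ACGTA"), ("w2", "CGTAC")]
index = build_kmer_index(reads, k=3)
shared = labels_for_kmer(index, "CGT")
```

When extending a contig, a lookup like `labels_for_kmer` could be used to prefer branches whose reads share labels with the seed, which is the role the abstract gives to the representative set of labels.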