Abstract:
A method of identifying co-expressed coding and noncoding genes is disclosed. The method may include receiving genetic sequences, mapping the genetic sequences to known coding and noncoding genes, correlating the mapped genes, and generating a co-expression network. A system for generating a co-expression network and providing the co-expression network to a user on a display is also disclosed. The system may include a memory, one or more processors, one or more databases, and a display.
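The correlation and network-generation steps can be sketched as follows. This is a minimal illustration, not the disclosed method: it assumes reads have already been mapped and summarized as per-gene expression vectors, it uses Pearson correlation with a hypothetical cutoff of 0.9, and the gene names and values are invented for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    # A constant vector has no defined correlation; treat it as uncorrelated.
    return cov / (sx * sy) if sx and sy else 0.0


def coexpression_network(expression, threshold=0.9):
    """Return {(gene1, gene2): r} for every pair with |r| >= threshold."""
    edges = {}
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= threshold:
            edges[(g1, g2)] = r
    return edges


# Hypothetical expression values across four samples (assumed, not from the source).
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],   # coding gene
    "LNC_B": [2.0, 4.1, 6.0, 8.2],    # noncoding gene tracking GENE_A
    "GENE_C": [5.0, 1.0, 4.0, 2.0],   # unrelated gene
}
network = coexpression_network(expression)
```

Here `network` holds a single edge between `GENE_A` and `LNC_B`; the threshold and the choice of Pearson over rank-based measures are design decisions the abstract leaves open.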
Abstract:
A data-driven, generalized regression-based framework is disclosed that supports the transformation of measurements, including but not limited to gene expression values, from one platform to another over a wide dynamic range, using selected summary statistics or feature values as predictors for the model parameters. The framework consists of primary model training and transformation, followed by additional levels of categorical regression and transformation processes.
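As a rough illustration of the primary training and transformation step only, the sketch below fits an ordinary least-squares line to paired measurements from two platforms and applies it to a new value. The choice of a simple linear model, the function names, and the data are all assumptions; the abstract does not specify the regression form, and the categorical levels are not shown.

```python
def fit_linear(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


def transform(values, slope, intercept):
    """Map source-platform measurements onto the target platform's scale."""
    return [slope * v + intercept for v in values]


# Hypothetical paired training measurements for one gene on two platforms.
platform_a = [1.0, 2.0, 3.0, 4.0]
platform_b = [2.5, 4.5, 6.5, 8.5]

slope, intercept = fit_linear(platform_a, platform_b)
predicted = transform([5.0], slope, intercept)
```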
Abstract:
Some embodiments are directed to a data structure. The data structure includes multiple genomic blocks and part of a first hash tree. The first hash tree is computed from multiple hash values of the multiple genomic blocks. The part of the first hash tree includes at least the two highest levels of the first hash tree but excludes one or more lower levels of the first hash tree.
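One way to picture such a structure is a binary Merkle-style tree built over the block hashes, of which only the top levels are stored. The sketch below assumes SHA-256, pairwise hashing, and duplication of the last hash on odd-sized levels; none of these choices is stated in the abstract, and the block contents are invented.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_hash_tree(blocks):
    """Return the tree as a list of levels, root level first, leaf hashes last."""
    level = [sha256(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # assumed padding rule for odd levels
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels[::-1]


def top_levels(levels, keep=2):
    """Keep only the highest `keep` levels; lower levels are omitted."""
    return levels[:keep]


# Hypothetical genomic blocks.
blocks = [b"ACGT", b"TTGA", b"CCGT", b"GATT"]
levels = build_hash_tree(blocks)
partial = top_levels(levels)  # root plus its two children; leaf hashes excluded
```

The omitted lower levels can be recomputed from the blocks themselves, so storing only the top levels trades a little recomputation for a smaller structure.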
Abstract:
A method (100) for compressing and decompressing a data file, comprising: (i) receiving (120) a data file for compression comprising a plurality of different attributes; (ii) identifying (130) a first attribute of the plurality of different attributes; (iii) selecting (140) a plurality of compression types and/or configurations; (iv) compressing (150) at least some of the data from the received data file for the identified first attribute using each of the selected plurality of compression types and/or configurations; (v) determining (160) which one of the selected plurality of compression types and/or configurations is most suitable for compression; (vi) generating (170) a compression parameter data structure comprising an identification of the selected plurality of compression types and/or configurations; (vii) compressing (180) the data from the received data file for the first attribute to generate a compressed data file; and (viii) storing (190) the compression parameter data structure and the compressed data file.
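Steps (iii) through (viii) can be sketched with Python's standard compressors. This is an illustration under assumptions: "most suitable" is taken to mean smallest output on a leading sample of the attribute's data, the candidate set is zlib, bz2, and lzma, and the parameter structure is a plain dictionary with hypothetical keys.

```python
import bz2
import lzma
import zlib

# Candidate compression types: name -> (compress, decompress).
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def choose_codec(sample, candidates):
    """Compress the sample with every candidate and return the smallest one."""
    sizes = {name: len(CODECS[name][0](sample)) for name in candidates}
    return min(sizes, key=sizes.get), sizes


def compress_attribute(data, candidates=("zlib", "bz2", "lzma"), sample_size=4096):
    """Return (parameter structure, compressed payload) for one attribute."""
    best, _ = choose_codec(data[:sample_size], candidates)
    params = {"candidates": list(candidates), "selected": best}
    return params, CODECS[best][0](data)


def decompress_attribute(params, payload):
    """Use the stored parameter structure to pick the matching decompressor."""
    return CODECS[params["selected"]][1](payload)


# Hypothetical attribute data, e.g. a column of quality strings.
data = b"IIIIHHHHGGGGFFFF" * 500
params, payload = compress_attribute(data)
restored = decompress_attribute(params, payload)
```

Storing the parameter structure alongside the payload is what lets the decompressor work without re-running the selection.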
Abstract:
A system for characterizing intercellular communication and heterogeneity in cancer tumors, and more particularly a method for detecting sub-populations and receptor-ligand states to provide predictive information in relation to cancer and cancer treatment, is disclosed. The method comprises the steps of: obtaining, from an NGS sequencer, single-cell RNA-seq data for a plurality of cells within a tumor; correlating the data with a plurality of data sets from a curated gene list of receptor-ligand pairs; normalizing the transcript abundance data; assigning a state (e.g., 0, 1, 2, 3) to each curated receptor-ligand pair in each cell (e.g., depending on {L:R} = {0:0, 0:1, 1:0, 1:1}), thereby forming a matrix of receptor-ligand states; extracting sub-groups from the matrix that are not invariant and applying unsupervised clustering methods to identify sub-clusters; identifying sub-populations within the set based on pair-wise distances between individual cells and similarity of cellular transcriptomes; identifying expressed ligands and receptors across the sub-populations; cross-referencing against the curated set of receptor-ligand pairs; and providing, by a mapping module, a visual display of the results for the clinician. The method can be used to study intercellular communication to elucidate the etiology of diseases, and to measure the disruption of intercellular communication in order to diagnose similarly disrupted disease patterns across patients.
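The state-assignment and invariant-filtering steps can be sketched as follows. The encoding `state = 2 * L + R`, which maps {0:0, 0:1, 1:0, 1:1} to 0, 1, 2, 3 in that order, is an assumption, as are the expression threshold, the gene names, and the values; clustering and visualization are not shown.

```python
def pair_state(ligand_on, receptor_on):
    """Encode {L:R} as 0..3: {0:0}->0, {0:1}->1, {1:0}->2, {1:1}->3 (assumed order)."""
    return 2 * int(ligand_on) + int(receptor_on)


def state_matrix(expression, pairs, threshold=0.0):
    """Build {cell: [state per receptor-ligand pair]} from normalized abundances."""
    matrix = {}
    for cell, genes in expression.items():
        matrix[cell] = [
            pair_state(genes.get(lig, 0.0) > threshold, genes.get(rec, 0.0) > threshold)
            for lig, rec in pairs
        ]
    return matrix


def variable_pairs(matrix, pairs):
    """Keep only the pairs whose state differs in at least one cell."""
    columns = zip(*matrix.values())
    return [pair for pair, col in zip(pairs, columns) if len(set(col)) > 1]


# Hypothetical curated (ligand, receptor) pairs and per-cell abundances.
pairs = [("LIG1", "REC1"), ("LIG2", "REC2")]
expression = {
    "cell1": {"LIG1": 5.0, "REC1": 0.0, "LIG2": 1.0, "REC2": 1.0},
    "cell2": {"LIG1": 0.0, "REC1": 3.0, "LIG2": 2.0, "REC2": 4.0},
}
matrix = state_matrix(expression, pairs)
informative = variable_pairs(matrix, pairs)
```

Only the first pair survives the filter here, because the second pair is in state 3 in both cells and so carries no information for clustering.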
Abstract:
A method for storing, by a processor, a genome graph representing a plurality of individual genomes, including: storing a linear representation of a reference genome in a data storage; receiving a first genome; identifying variations in the first genome from the reference genome; generating graph edges for each variation in the first genome from the reference genome; generating for each generated graph edge: an edge identifier that uniquely identifies the current edge in the genome graph; a start edge identifier that identifies the edge from which the current edge branches out; a start position that indicates the position on the start edge that serves as an anchoring point for the current edge; an end edge identifier that identifies the edge into which the current edge joins; an end position that indicates the position on the end edge that serves as an anchoring point for the current edge; and a sequence indicating the nucleotide sequence of the current edge; and storing the edge identifier, start edge identifier, start position, end edge identifier, end position, and sequence for each generated graph edge in the data storage. Based on this genome graph data structure, we further propose a scheme for specifying a path, which may traverse one or more edges, and ways to extend existing genomic data formats such as SAM, VCF and MPEG-G to support the use of a genome graph reference using our proposed coordinate system.
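The six fields stored per edge map naturally onto a small record type. The sketch below is illustrative only: the anchoring convention (the start position is the last reference base before the variation and the end position is the first reference base after it, both 0-based), the edge naming, and the variant tuple layout are assumptions not taken from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphEdge:
    edge_id: str      # uniquely identifies this edge in the genome graph
    start_edge: str   # edge from which this edge branches out
    start_pos: int    # anchoring position on the start edge
    end_edge: str     # edge into which this edge joins
    end_pos: int      # anchoring position on the end edge
    sequence: str     # nucleotide sequence carried by this edge


def edges_from_variants(variants, ref_edge="ref"):
    """Turn (position, reference_length, alternate_sequence) tuples into edges."""
    return [
        GraphEdge(f"e{i}", ref_edge, pos - 1, ref_edge, pos + ref_len, alt)
        for i, (pos, ref_len, alt) in enumerate(variants, start=1)
    ]


def apply_edge(reference, edge):
    """Spell the sequence of the path that leaves and rejoins the reference via one edge."""
    return reference[: edge.start_pos + 1] + edge.sequence + reference[edge.end_pos :]


# Hypothetical reference and variations: a T>G substitution and a 2-base deletion.
reference = "ACGTACGT"
edges = edges_from_variants([(3, 1, "G"), (3, 2, "")])
substituted = apply_edge(reference, edges[0])
deleted = apply_edge(reference, edges[1])
```

Because each edge names its start and end edges explicitly, the same record can describe a variation nested on another variant edge simply by pointing at that edge's identifier instead of `ref`.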
Abstract:
A method (200) for evaluating nucleic acid sequencing data using a quality control analysis system (300), comprising: receiving (210) a plurality of reads of a nucleic acid sequence; extracting (220) a plurality of k-mers from the plurality of reads; identifying (230), using the plurality of extracted k-mers, one or more of a plurality of annotated k-mers found in the plurality of reads, wherein the plurality of annotated k-mers are stored in an annotation database (350), and further wherein the annotated k-mers are annotated with annotation information about the one or more nucleic acid sequences from which the annotated k-mers are generated; gathering (240), based on the identified annotated k-mers found in the plurality of reads, annotation information about the plurality of reads; and determining (250), based on the gathered annotation information, a quality control metric for at least some of the plurality of reads.
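The extract, identify, gather, and determine steps can be sketched with a dictionary standing in for the annotation database. The database contents, the labels, and the choice of metric (the fraction of reads containing at least one k-mer with a given label) are assumptions made for illustration.

```python
from collections import Counter


def kmers(read, k):
    """All overlapping substrings of length k in the read."""
    return [read[i : i + k] for i in range(len(read) - k + 1)]


def annotate_reads(reads, annotation_db, k):
    """Count, per annotation label, how many reads contain a matching k-mer."""
    counts = Counter()
    for read in reads:
        labels = {annotation_db[km] for km in kmers(read, k) if km in annotation_db}
        counts.update(labels)  # each label counted at most once per read
    return counts


def fraction_with_label(reads, annotation_db, k, label):
    """Quality control metric: share of reads carrying the given label."""
    return annotate_reads(reads, annotation_db, k)[label] / len(reads)


# Hypothetical annotation database: k-mer -> source annotation.
annotation_db = {"ACGT": "human", "GGGG": "adapter"}
reads = ["TTACGTAA", "CCGGGGCC", "ATATATAT", "ACGTGGGG"]

counts = annotate_reads(reads, annotation_db, k=4)
adapter_fraction = fraction_with_label(reads, annotation_db, k=4, label="adapter")
```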
Abstract:
In patient cohort identification, clustering (30) of patients is performed using a patient comparison metric dependent on a set of features (24). Information is displayed on sample patients who are similar or dissimilar to a query patient according to the clustering. User-inputted comparison values are received comparing the sample patients with the query patient. The set of features and/or feature weights are adjusted to generate an adjusted patient comparison metric having improved agreement with the user-inputted comparison values. The clustering is repeated using the adjusted patient comparison metric. A patient cohort is identified from a cluster (34) containing the query patient produced by the last clustering repetition. The information on the sample patients may be shown by simultaneously displaying two or more graphical modality representations (70, 72, 74), each plotting the sample patients and the query patient against two or more features of the modality.
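The metric-adjustment step can be illustrated with a weighted Euclidean distance whose feature weights are nudged by user feedback. The update rule below (multiplicative, exponential in the per-feature difference, then renormalized) is a hypothetical heuristic chosen for brevity, not the disclosed adjustment; the clustering itself is not shown, and the feature values are invented.

```python
from math import exp, sqrt


def weighted_distance(a, b, weights):
    """Patient comparison metric: weighted Euclidean distance over features."""
    return sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def adjust_weights(weights, query, feedback, rate=1.0):
    """Shrink weights on features that separate patients the user calls similar,
    and grow weights on features that separate patients the user calls dissimilar."""
    new = list(weights)
    for sample, similar in feedback:
        sign = -1.0 if similar else 1.0
        for i, (q, s) in enumerate(zip(query, sample)):
            new[i] *= exp(sign * rate * abs(q - s))
    total = sum(new)
    return [w / total for w in new]  # renormalize so the weights sum to 1


# Hypothetical normalized features for a query patient and one sample patient.
query = [0.5, 0.5]
sample = [0.5, 0.9]
weights = [0.5, 0.5]

# The user judges the sample patient to be similar to the query patient.
adjusted = adjust_weights(weights, query, [(sample, True)])
```

After the update, the second feature, which was the only one separating the two patients, carries less weight, so the adjusted metric places the sample closer to the query before clustering is repeated.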