Gappy Total ReCaller: Efficient algorithms and data structures for accurate transcriptomics

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Understanding complex mammalian biology depends crucially on our ability to define a precise map of all the transcripts encoded in a genome and to measure their relative abundances. A promising assay relies on RNA-Seq approaches, which build on next-generation sequencing pipelines capable of interrogating cDNAs extracted from a cell. The underlying pipeline starts with base-calling, collects the sequence reads, and interprets the raw reads in terms of transcripts that are grouped with respect to the different splice-variant isoforms of a messenger RNA. We address a very basic problem involved in all of these pipelines, namely accurate Bayesian base-calling, which can combine the analog intensity data with suitable underlying priors on base composition in the transcripts. In the context of sequencing genomic DNA, a powerful approach for base-calling has been developed in the TotalReCaller pipeline. To derive its priors, it uses a suitable reference whole-genome sequence in a compressed self-indexed format. However, TotalReCaller faces many new challenges in the transcriptomic domain, especially since we still lack a fully annotated library of all possible transcripts, and hence a sufficiently good prior. There are many possible solutions, similar to the ones developed for TotalReCaller in applications addressing de novo sequencing and assembly, where partial contigs or string-graphs could be used to bootstrap the Bayesian priors on base composition. A similar approach would be applicable here too: partial assemblies of transcripts can be used to characterize the splicing junctions or organize them in incompatibility graphs, which are then provided as priors for TotalReCaller. The key algorithmic techniques for this purpose have been addressed in a forthcoming paper on Stringomics.
Here, we address a related but fundamental problem by assuming that we only have a reference genome, with certain intervals marked as candidate regions for ORFs (open reading frames), but not necessarily complete annotations regarding the 5’ or 3’ termini of a gene or its exon-intron structure. The algorithms we describe find the most accurate base-calls of a cDNA with the best possible segmentation, all mapped to the genome appropriately.
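
As an illustration of the Bayesian base-calling idea mentioned in the abstract (combining analog intensity data with a prior on base composition), here is a minimal Python sketch. It is not the TotalReCaller implementation: the Gaussian noise model, the `sigma` value, and the function names `gaussian_log_likelihood`, `posterior`, and `call_base` are all assumptions made for this example, and the prior is a hand-written dictionary rather than one derived from a reference genome.

```python
import math

BASES = "ACGT"

def gaussian_log_likelihood(intensities, base, sigma=0.25):
    """Log-likelihood (up to a constant) of four channel intensities given `base`.

    Toy noise model (an assumption, not taken from the paper): the channel of
    the true base has mean 1.0, the other three have mean 0.0, and each
    channel carries independent Gaussian noise with standard deviation sigma.
    """
    total = 0.0
    for channel, observed in zip(BASES, intensities):
        expected = 1.0 if channel == base else 0.0
        total -= (observed - expected) ** 2 / (2.0 * sigma ** 2)
    return total

def posterior(intensities, prior, sigma=0.25):
    """Posterior P(base | intensities), proportional to likelihood times prior."""
    log_post = {}
    for b in BASES:
        if prior[b] > 0.0:  # bases with zero prior mass keep zero posterior
            log_post[b] = gaussian_log_likelihood(intensities, b, sigma) + math.log(prior[b])
    # Normalize in log space (log-sum-exp) for numerical stability.
    peak = max(log_post.values())
    norm = sum(math.exp(v - peak) for v in log_post.values())
    return {b: (math.exp(log_post[b] - peak) / norm if b in log_post else 0.0)
            for b in BASES}

def call_base(intensities, prior, sigma=0.25):
    """Return the maximum a posteriori base and its posterior probability."""
    post = posterior(intensities, prior, sigma)
    best = max(BASES, key=lambda b: post[b])
    return best, post[best]

if __name__ == "__main__":
    ambiguous = (0.55, 0.50, 0.0, 0.0)  # A and C channels nearly tied
    uniform = {b: 0.25 for b in BASES}
    skewed = {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05}  # hypothetical prior
    print(call_base(ambiguous, uniform))
    print(call_base(ambiguous, skewed))
```

With a uniform prior the call follows the slightly stronger A channel; with a prior that strongly favors C, the same intensities are called as C. This is the role a reference-derived prior would play when the raw signal is ambiguous.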

Original language: English (US)
Title of host publication: Distributed Computing and Internet Technology - 11th International Conference, ICDCIT 2015, Proceedings
Editors: Raja Natarajan, Gautam Barua, Manas Ranjan Patra
Publisher: Springer Verlag
Pages: 150-161
Number of pages: 12
ISBN (Electronic): 9783319149769
DOIs: https://doi.org/10.1007/978-3-319-14977-6_9
State: Published - 2015
Event: 11th International Conference on Distributed Computing and Internet Technology, ICDCIT 2015 - Bhubaneswar, India
Duration: Feb 5 2015 - Feb 8 2015

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8956
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 11th International Conference on Distributed Computing and Internet Technology, ICDCIT 2015
Country: India
City: Bhubaneswar
Period: 2/5/15 - 2/8/15

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

  • Cite this

    Mishra, B. (2015). Gappy Total ReCaller: Efficient algorithms and data structures for accurate transcriptomics. In R. Natarajan, G. Barua, & M. R. Patra (Eds.), Distributed Computing and Internet Technology - 11th International Conference, ICDCIT 2015, Proceedings (pp. 150-161). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 8956). Springer Verlag. https://doi.org/10.1007/978-3-319-14977-6_9