Prof. Johann-Christoph Freytag DBIS, Humboldt-Universität zu Berlin Processing Genome Data using Scalable Database Technology Johann Christoph Freytag, Ph.D. [email protected] http://www.dbis.informatik.hu-berlin.de Stanford University, February 2004 © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag Processing Genome Data using Scalable Database Technology My Background Professor of DBIS @HUB PhD @ Harvard Univ. ERCR (European Computer Industry Research Centre), München (87-89) Visiting Scientist, Microsoft Res. (2002) DEC’s Database Technology Center, München (90-93) • Starburst project, IBM Almaden Research Center (85-87) • Visiting Scientist, Almaden Research Center (97/98) • Visiting Scientist, IBM SVL (2001) © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag © Prof. J.C. Freytag, Verwendung nur zum nicht-kommerziellen Gebrauch Processing Genome Data using Scalable Database Technology 1 Prof. Johann-Christoph Freytag DBIS, Humboldt-Universität zu Berlin “What’s the Meaning of Life” DNA RNA Transcription Protein Translation ? Genomic „Transmitter“ ? Messenger Gene product Replication ! © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag ? Processing Genome Data using Scalable Database Technology Overview • (Biological) Motivation/Problems • Using Database Technology – – – – Gene-EYe Integration-Platform Data Cleansing BLAST-Integration into GDB In-and-Out-the-Database: Using Workflow for “dry” Experiments • Summary © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag © Prof. J.C. Freytag, Verwendung nur zum nicht-kommerziellen Gebrauch Processing Genome Data using Scalable Database Technology 2 Prof. Johann-Christoph Freytag DBIS, Humboldt-Universität zu Berlin View of Biological Areas Environment Diseases Experiments Pathways Life Evolution DNA Genome RNA Transcriptome Amino Acids Proteome © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag Processing Genome Data using Scalable Database Technology Biological Motivation View of Data Data Source Environment Pathways Brenda KEGG DNA EMBL Genome RefSeq LocusLink © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag © Prof. J.C. Freytag, Verwendung nur zum nicht-kommerziellen Gebrauch Diseases OMIM Life Gene Ontology RNA Transcriptome EMBL (EST) Processing Genome Data using Scalable Database Technology Experiments Express Evolution Taxonomy Amino Acids SWISS-PROT Proteome Interpro Biological Motivation 3 Prof. Johann-Christoph Freytag DBIS, Humboldt-Universität zu Berlin Complex Relationships A graph depicting the relationships between 400+ biological data sources served by the EBI via SRS Database Growth of EMBL (# of records) More than 400 Data Sources on the WEB © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag Source: http://www3.ebi.ac.uk/Services/DBStats/ Processing Genome Data using Scalable Database Technology DBIS (our) Approach Database SwissProt Model of the Biological World EMBL ESEMBLE ... © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag © Prof. J.C. Freytag, Verwendung nur zum nicht-kommerziellen Gebrauch KABAT Processing Genome Data using Scalable Database Technology 4 Prof. Johann-Christoph Freytag DBIS, Humboldt-Universität zu Berlin • (Biological) Motivation/Problems • Using Scalable Database Technology – – – – Gene-EYe Integration-Platform Data Cleansing BLAST-Integration into GDB In-and-Out-the-Database: Using Workflow Concepts for “dry” Experiments • Summary © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag Processing Genome Data using Scalable Database Technology Overview Gene-EYe Integration-Platform Vision • Provide mechanisms for – – – – unified handling of different data sources data source integration change management user defined data preparation • Provide – relevant tools for sequence manipulation and retrieval – work flow support for operation and administration © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag © Prof. J.C. Freytag, Verwendung nur zum nicht-kommerziellen Gebrauch Processing Genome Data using Scalable Database Technology Gen-Eye Vision 5 Prof. Johann-Christoph Freytag DBIS, Humboldt-Universität zu Berlin Gene-EYe Integration-Platform The Big Picture KNOWLEDGE Genome Data Warehouse Layer (GDW Schema) Biological Entities -> Biological Concepts (e.g. Life Cycle) CONTENT Genome DataBase Layer (GDB Schema) Relational Entities -> Biological Entities (e.g. Gene) DATA Genome Data Store Layer (GDS Schema) Flat File Data -> Relational Entities (e.g. EMBL) Processing Genome Data using Scalable Database Technology © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag Design GDS: From Flat File to Database Data Storage Data Cleansing Update/Admin Genome Data Store Layer (GDS Schema) GDS Load Tools ENSEMBL DDL InterPro DDL TAXO DDL SWALL DDL EMBL DDL © Prof. J.C. Freytag, Verwendung nur zum nicht-kommerziellen Gebrauch ENSEMBL scanner InterPro scanner TAXO scanner SWALL scanner EMBL scanner © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag GDS Admin Tools Processing Genome Data using Scalable Database Technology 6 Prof. Johann-Christoph Freytag DBIS, Humboldt-Universität zu Berlin The Data Import Pipeline - Revisited Data File Load Files Scanner Summary Loader CLOB Files Instance Spec. Load Spec. Gene-EYe GDS Format Spec. DDL-Gen. DDL Script Controller Content Spec. Phase 1: Property Files Phase 2: GEM1 Repository Perl scripts de.hui.dbis.geneeye.* (Java) Hand crafted Autogenerate from Metadata 1: CWM compliant GeneEYe Metadata Repository © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag Processing Genome Data using Scalable Database Technology Modeling the Maintenance Process © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag © Prof. J.C. Freytag, Verwendung nur zum nicht-kommerziellen Gebrauch Processing Genome Data using Scalable Database Technology 7 Prof. Johann-Christoph Freytag DBIS, Humboldt-Universität zu Berlin GDB-Layer: From Data to Biology Data Integration Data Cleansing (Sem.) Queries Genome Database Layer (GDB Schema) Defined by and in cooperation w/ domain experts GDB Mapper (IBM Clio) [Definition] Data Storage Data Cleansing (Syn.) Update/Admin Genome Data Store Layer (GDS Schema) © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag Variant Tissue ENSEMBL InterPro TAXO SWALL EMBL [Data] Gene GDB Builder (IBM Clio?) Transcript Schema Protein Data Processing Genome Data using Scalable Database Technology Schema Mapping with Clio with permission of Dr. Felix Naumann IBM Almaden Research Center Clio Source Schema DB © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag © Prof. J.C. Freytag, Verwendung nur zum nicht-kommerziellen Gebrauch User mapping Clio Target Schema SQL or XQuery DB Processing Genome Data using Scalable Database Technology 8 Prof. Johann-Christoph Freytag DBIS, Humboldt-Universität zu Berlin Clio Features with permission of Dr. Felix Naumann IBM Almaden Research Center • Schema Viewer – Visual mapping between schema elements • Attribute Matcher – Intelligent suggestions of likely mappings • Data Viewer – Data examples for mapping queries • Queries – SQL, XSLT, Xquery – Use and adhere to source and target schema constraints Processing Genome Data using Scalable Database Technology © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag GDW: Providing Facts for Research Data Mining Ontology Mapping Process Simulation Genome Data Warehouse Layer (GDW Schema) Ontology GDW Miner GDB Explorer © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag © Prof. J.C. Freytag, Verwendung nur zum nicht-kommerziellen Gebrauch Variant Tissue Transcript Protein Gene Variant Tissue Transcript Protein Gene Genome Database Layer (GDB Schema) Data Integration Data Cleansing (Sem.) Queries Processing Genome Data using Scalable Database Technology 9 Prof. Johann-Christoph Freytag DBIS, Humboldt-Universität zu Berlin • (Biological) Motivation/Problems • Using Scalable Database Technology – – – – Gene-EYe Integration-Platform Data Cleansing BLAST-Integration into GDB In-and-Out-the-Database: Using Workflow Concepts for “dry” Experiments • Summary © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag Processing Genome Data using Scalable Database Technology Overview Errors in Genome Data DNA, RNA Sequence DNA Sequence Determination Classes of errors in data production 0,23% agagattagcgcgctagatcgatatgataga gctatatcatccgagatagcagatagctcta – genome gcacactattacacgagcagcgaccttatat 2,58% Genome Experimental errors DNA Feature Analysis Annotation errors Gene mRNA Transformation errors Structure Annotation 5% - 30% Protein Sequence Protein Propagated errors MDDREDLVYQAKLAEQAERYDEMVESMKKVD Sequence Stale data Determination AGMDVELTVEERNLLSVAYKNVIGARRASWY RIISSIEQKEENKGGEDKLKMIREYRQMVER [Müller, Naumann, Freytag, ICIQ, 2003] only 1.3 % difference Protein FUNCTION Function SUBUNIT The main difference are transcript Annotation DISEASE copy numbers in the brain © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag © Prof. J.C. Freytag, Verwendung nur zum nicht-kommerziellen Gebrauch Function Annotation MAY ACT AS INTRACELLULAR SIGNALING COMPONENT ... 5% - 40% BINDS DIRECTLY TO ZO-1 INVOLVED IN ACUTE LEUKEMIAS Processing Genome Data using Scalable Database Technology 10 Prof. Johann-Christoph Freytag DBIS, Humboldt-Universität zu Berlin Reliability-based Merging (cont.) • Domain expert identifies reliable parts for merging • Definition of a set of views for integration r1 r2 ID 1 2 3 4 5 A1 A A B B C A2 1.2 2.3 3.1 3.4 5.6 A3 1900 1900 1900 2000 2002 ID 1 2 3 4 5 A1 B B B B D A2 1 2 3.1 3.4 5.6 2000 2000 2000 2000 2002 Current work: Which are the relevant r ID mismatch patterns ? 1 1 2 How to Merge assess their 34 A relevance & importance? 5 3 © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag A1 A A B B C A2 1.2 2.3 3.1 3.4 5.6 A3 2000 2000 2000 2000 2002 e.g. MIN() Processing Genome Data using Scalable Database Technology • (Biological) Motivation/Problems • Using Scalable Database Technology – Gene-EYe Integration-Platform – BLAST-Integration into GDB – In-and-Out-the-Database: Using Workflow Concepts for “dry” Experiments • Summary © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag © Prof. J.C. Freytag, Verwendung nur zum nicht-kommerziellen Gebrauch Processing Genome Data using Scalable Database Technology Overview 11 Prof. Johann-Christoph Freytag DBIS, Humboldt-Universität zu Berlin BLAST: General Introduction DC-File • • • Algorithm/Package: Similarity Search Devloped by Altschul et al. (1990) Three Steps: 1. 2. 3. Search for Word Pairs (Iseq, DSeq) of Length L on the Data Collection of Sequences above Threshhold T Expansion of each Word Pair until the Value V of their Alignment is ∆ away from the local maximum Output of complete alignment (Highscoring Segment Pair, HSP), if Value(Alignment) > S Preprocessing FormatDB Index Sequences Features Query sequence BLAST BLAST Call Report Output: Powerset of Alignments © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag Processing Genome Data using Scalable Database Technology BLAST UDF Implementation • Goal: Using BLAST in SQL-statements • How? – BLAST-UDF implemented as Table Function – Use in SQL Query SELECT * FROM TABLE( BLAST(<Parameter>, <Query Sequence>, <Comparison Sequence> )) • Each call returns a set of alignments over Sequences in the Database © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag © Prof. J.C. Freytag, Verwendung nur zum nicht-kommerziellen Gebrauch Processing Genome Data using Scalable Database Technology 12 Prof. Johann-Christoph Freytag DBIS, Humboldt-Universität zu Berlin Structure of UDB Table Function UDF BLAST • Implementation: – Mapping of program into calling structure for table functions – Communiaction between the different calls via scratchpad – scratchpad: Storage area which remains intact and unchanged between UDF calls • Storage of data structures for different steps • especially for output from postprocessing: SeqAlign © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag Initialize FIRST For all SEQUENCES Alignment without gaps OPEN Postprocessing For all ALIGNMENTS Output of the results FETCH Release data structures related with sequences CLOSE Release global data structures FINAL Processing Genome Data using Scalable Database Technology • (Biological) Motivation/Problems • Using Scalable Database Technology – Gene-EYe Integration-Platform – BLAST-Integration into GDB – In-and-Out-the-Database: Using Workflow Concepts for “dry” Experiments • Summary © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag © Prof. J.C. Freytag, Verwendung nur zum nicht-kommerziellen Gebrauch Processing Genome Data using Scalable Database Technology Overview 13 Prof. Johann-Christoph Freytag DBIS, Humboldt-Universität zu Berlin The Challenge: Exon Skipping Gene Protein One Gen with 100 Exons ⇒2100 ~ 1030 Variations n Exons within one Gene © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag linearly combined (splicing) Used as Pattern for Protein Generation Processing Genome Data using Scalable Database Technology Challenge: Exon Skipping Do alternative fusion points new funktional (i.e. biologigacl meaningful) „patterns“? © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag © Prof. J.C. Freytag, Verwendung nur zum nicht-kommerziellen Gebrauch Processing Genome Data using Scalable Database Technology 14 Prof. Johann-Christoph Freytag DBIS, Humboldt-Universität zu Berlin Functional Genomics: Gain of New Insight First Horizon: Simple Exon Skipping 110110000111111 New Functionality! © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag Processing Genome Data using Scalable Database Technology Flow of Processing Steps Generate Exon Sequence Local Database (automatic) Find Similarity Search Remote Tool (Web Based) Supported by local DB Check for Biological Validity © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag © Prof. J.C. Freytag, Verwendung nur zum nicht-kommerziellen Gebrauch Processing Genome Data using Scalable Database Technology 15 Prof. Johann-Christoph Freytag DBIS, Humboldt-Universität zu Berlin Implementation ¾ Some facts… facts… ¾ 60 days 100% load ¾ One splice form per minute ¾ So far: ca. 90.000 splice forms ¾ First biolog. meaningful results © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag Processing Genome Data using Scalable Database Technology Cooperations • Cooperation with – Univ. of Jena (Rolf Backofen) – Berlin Center of Bioinformatics (BCB) • Charite, FU, Max-Planck-Institut (M. Vingron) – Industry: IBM, small companies, … Patrick Chappatte, Switzerland © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag © Prof. J.C. Freytag, Verwendung nur zum nicht-kommerziellen Gebrauch Processing Genome Data using Scalable Database Technology 16 Prof. Johann-Christoph Freytag DBIS, Humboldt-Universität zu Berlin Database Environment IBM p-Server (sponsored by IBM) CPU CPU CPU CPU CPU CPU CPU CPU 2.3 TByte © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag DB2 Processing Genome Data using Scalable Database Technology Summary • Future Work • Lesson learnt – Query processing – Highly Dynamic Environment • Include domain knowledge • Data: changes frequently • User: changes frequently – Provide a framework for • • • • • Date integration Data processing Data changes Data dependencies….. Meta data management © 2004 DBIS, Humboldt Universität zu Berlin Prof. Johann Christoph Freytag © Prof. J.C. Freytag, Verwendung nur zum nicht-kommerziellen Gebrauch – Data cleansing – Set of UDFs for biological data processing – Visualization of Data Processing Genome Data using Scalable Database Technology Summary 17