Processing Genome Data using Scalable Database Technology My

Werbung
Prof. Johann-Christoph Freytag
DBIS, Humboldt-Universität zu Berlin
Processing Genome Data
using Scalable Database Technology
Johann Christoph Freytag, Ph.D.
[email protected]
http://www.dbis.informatik.hu-berlin.de
Stanford University, February 2004
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
Processing Genome Data
using Scalable Database Technology
My Background
Professor of DBIS
@HUB
PhD @ Harvard Univ.
ERCR (European
Computer Industry
Research Centre),
München (87-89)
Visiting Scientist,
Microsoft Res. (2002)
DEC’s Database
Technology Center,
München (90-93)
• Starburst project, IBM Almaden Research Center (85-87)
• Visiting Scientist, Almaden Research Center (97/98)
• Visiting Scientist, IBM SVL (2001)
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
© Prof. J.C. Freytag, Verwendung
nur zum nicht-kommerziellen Gebrauch
Processing Genome Data
using Scalable Database Technology
1
Prof. Johann-Christoph Freytag
DBIS, Humboldt-Universität zu Berlin
“What’s the Meaning of Life”
DNA
RNA
Transcription
Protein
Translation
?
Genomic
„Transmitter“
?
Messenger
Gene product
Replication
!
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
?
Processing Genome Data
using Scalable Database Technology
Overview
• (Biological) Motivation/Problems
• Using Database Technology
–
–
–
–
Gene-EYe Integration-Platform
Data Cleansing
BLAST-Integration into GDB
In-and-Out-the-Database: Using Workflow for
“dry” Experiments
• Summary
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
© Prof. J.C. Freytag, Verwendung
nur zum nicht-kommerziellen Gebrauch
Processing Genome Data
using Scalable Database Technology
2
Prof. Johann-Christoph Freytag
DBIS, Humboldt-Universität zu Berlin
View of Biological Areas
Environment
Diseases
Experiments
Pathways
Life
Evolution
DNA
Genome
RNA
Transcriptome
Amino Acids
Proteome
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
Processing Genome Data
using Scalable Database Technology
Biological
Motivation
View of Data Data Source
Environment
Pathways
Brenda
KEGG
DNA
EMBL
Genome
RefSeq
LocusLink
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
© Prof. J.C. Freytag, Verwendung
nur zum nicht-kommerziellen Gebrauch
Diseases
OMIM
Life
Gene Ontology
RNA
Transcriptome
EMBL
(EST)
Processing Genome Data
using Scalable Database Technology
Experiments
Express
Evolution
Taxonomy
Amino Acids
SWISS-PROT
Proteome
Interpro
Biological
Motivation
3
Prof. Johann-Christoph Freytag
DBIS, Humboldt-Universität zu Berlin
Complex Relationships
A graph depicting the relationships
between 400+ biological data
sources served by the EBI via SRS
Database Growth of
EMBL (# of records)
More than 400 Data Sources
on the WEB
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
Source: http://www3.ebi.ac.uk/Services/DBStats/
Processing Genome Data
using Scalable Database Technology
DBIS (our) Approach
Database
SwissProt
Model of the
Biological World
EMBL
ESEMBLE
...
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
© Prof. J.C. Freytag, Verwendung
nur zum nicht-kommerziellen Gebrauch
KABAT
Processing Genome Data
using Scalable Database Technology
4
Prof. Johann-Christoph Freytag
DBIS, Humboldt-Universität zu Berlin
• (Biological) Motivation/Problems
• Using Scalable Database Technology
–
–
–
–
Gene-EYe Integration-Platform
Data Cleansing
BLAST-Integration into GDB
In-and-Out-the-Database: Using Workflow
Concepts for “dry” Experiments
• Summary
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
Processing Genome Data
using Scalable Database Technology
Overview
Gene-EYe Integration-Platform
Vision
• Provide mechanisms for
–
–
–
–
unified handling of different data sources
data source integration
change management
user defined data preparation
• Provide
– relevant tools for sequence manipulation and
retrieval
– work flow support for operation and administration
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
© Prof. J.C. Freytag, Verwendung
nur zum nicht-kommerziellen Gebrauch
Processing Genome Data
using Scalable Database Technology
Gen-Eye Vision
5
Prof. Johann-Christoph Freytag
DBIS, Humboldt-Universität zu Berlin
Gene-EYe Integration-Platform
The Big Picture
KNOWLEDGE
Genome Data Warehouse Layer (GDW Schema)
Biological Entities -> Biological Concepts (e.g. Life Cycle)
CONTENT
Genome DataBase Layer (GDB Schema)
Relational Entities -> Biological Entities (e.g. Gene)
DATA
Genome Data Store Layer (GDS Schema)
Flat File Data -> Relational Entities (e.g. EMBL)
Processing Genome Data
using Scalable Database Technology
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
Design
GDS: From Flat File to Database
Data Storage
Data Cleansing
Update/Admin
Genome Data Store Layer (GDS Schema)
GDS Load Tools
ENSEMBL DDL
InterPro DDL
TAXO DDL
SWALL DDL
EMBL DDL
© Prof. J.C. Freytag, Verwendung
nur zum nicht-kommerziellen Gebrauch
ENSEMBL scanner
InterPro scanner
TAXO scanner
SWALL scanner
EMBL scanner
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
GDS Admin Tools
Processing Genome Data
using Scalable Database Technology
6
Prof. Johann-Christoph Freytag
DBIS, Humboldt-Universität zu Berlin
The Data Import Pipeline - Revisited
Data
File
Load
Files
Scanner
Summary
Loader
CLOB
Files
Instance
Spec.
Load Spec.
Gene-EYe
GDS
Format
Spec.
DDL-Gen.
DDL Script
Controller
Content
Spec.
Phase 1: Property Files
Phase 2: GEM1 Repository
Perl scripts
de.hui.dbis.geneeye.* (Java)
Hand crafted
Autogenerate from Metadata
1: CWM compliant GeneEYe Metadata Repository
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
Processing Genome Data
using Scalable Database Technology
Modeling the Maintenance Process
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
© Prof. J.C. Freytag, Verwendung
nur zum nicht-kommerziellen Gebrauch
Processing Genome Data
using Scalable Database Technology
7
Prof. Johann-Christoph Freytag
DBIS, Humboldt-Universität zu Berlin
GDB-Layer: From Data to Biology
Data Integration
Data Cleansing (Sem.)
Queries
Genome Database Layer (GDB Schema)
Defined by and
in cooperation w/
domain experts
GDB Mapper (IBM Clio)
[Definition]
Data Storage
Data Cleansing (Syn.)
Update/Admin
Genome Data Store Layer (GDS Schema)
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
Variant
Tissue
ENSEMBL
InterPro
TAXO
SWALL
EMBL
[Data]
Gene
GDB Builder (IBM Clio?)
Transcript
Schema
Protein
Data
Processing Genome Data
using Scalable Database Technology
Schema Mapping with Clio
with permission of Dr. Felix Naumann
IBM Almaden Research Center
Clio
Source
Schema
DB
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
© Prof. J.C. Freytag, Verwendung
nur zum nicht-kommerziellen Gebrauch
User mapping
Clio
Target
Schema
SQL or XQuery
DB
Processing Genome Data
using Scalable Database Technology
8
Prof. Johann-Christoph Freytag
DBIS, Humboldt-Universität zu Berlin
Clio Features
with permission of Dr. Felix Naumann
IBM Almaden Research Center
• Schema Viewer
– Visual mapping between schema elements
• Attribute Matcher
– Intelligent suggestions of likely mappings
• Data Viewer
– Data examples for mapping queries
• Queries
– SQL, XSLT, Xquery
– Use and adhere to source and target schema
constraints
Processing Genome Data
using Scalable Database Technology
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
GDW: Providing Facts for Research
Data Mining
Ontology Mapping
Process Simulation
Genome Data Warehouse Layer (GDW Schema)
Ontology
GDW Miner
GDB Explorer
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
© Prof. J.C. Freytag, Verwendung
nur zum nicht-kommerziellen Gebrauch
Variant
Tissue
Transcript
Protein
Gene
Variant
Tissue
Transcript
Protein
Gene
Genome Database Layer (GDB Schema)
Data Integration
Data Cleansing (Sem.)
Queries
Processing Genome Data
using Scalable Database Technology
9
Prof. Johann-Christoph Freytag
DBIS, Humboldt-Universität zu Berlin
• (Biological) Motivation/Problems
• Using Scalable Database Technology
–
–
–
–
Gene-EYe Integration-Platform
Data Cleansing
BLAST-Integration into GDB
In-and-Out-the-Database: Using Workflow
Concepts for “dry” Experiments
• Summary
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
Processing Genome Data
using Scalable Database Technology
Overview
Errors in Genome Data
DNA, RNA Sequence
DNA Sequence
Determination
ƒ Classes of errors in
data production
0,23%
agagattagcgcgctagatcgatatgataga
gctatatcatccgagatagcagatagctcta
–
genome
gcacactattacacgagcagcgaccttatat
2,58%
ƒGenome
Experimental errors
DNA
Feature
ƒ Analysis
Annotation
errors
Gene
mRNA
ƒ Transformation errors
Structure Annotation
5% - 30%
Protein Sequence
ƒProtein
Propagated errors MDDREDLVYQAKLAEQAERYDEMVESMKKVD
Sequence
ƒ Stale data
Determination
AGMDVELTVEERNLLSVAYKNVIGARRASWY
RIISSIEQKEENKGGEDKLKMIREYRQMVER
[Müller, Naumann, Freytag, ICIQ, 2003]
only 1.3 % difference
Protein
FUNCTION
Function
SUBUNIT
The main difference are transcript
Annotation
DISEASE
copy numbers in the brain
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
© Prof. J.C. Freytag, Verwendung
nur zum nicht-kommerziellen Gebrauch
Function Annotation
MAY ACT AS INTRACELLULAR
SIGNALING COMPONENT ...
5% - 40%
BINDS DIRECTLY TO ZO-1
INVOLVED IN ACUTE LEUKEMIAS
Processing Genome Data
using Scalable Database Technology
10
Prof. Johann-Christoph Freytag
DBIS, Humboldt-Universität zu Berlin
Reliability-based Merging (cont.)
• Domain expert identifies reliable parts for merging
• Definition of a set of views for integration
r1
r2
ID
1
2
3
4
5
A1
A
A
B
B
C
A2
1.2
2.3
3.1
3.4
5.6
A3
1900
1900
1900
2000
2002
ID
1
2
3
4
5
A1
B
B
B
B
D
A2
1
2
3.1
3.4
5.6
2000
2000
2000
2000
2002
Current work:
Which are the relevant
r
ID
mismatch patterns ? 1
1
2
How to Merge
assess their 34
A
relevance & importance?
5
3
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
A1
A
A
B
B
C
A2
1.2
2.3
3.1
3.4
5.6
A3
2000
2000
2000
2000
2002
e.g. MIN()
Processing Genome Data
using Scalable Database Technology
• (Biological) Motivation/Problems
• Using Scalable Database Technology
– Gene-EYe Integration-Platform
– BLAST-Integration into GDB
– In-and-Out-the-Database: Using Workflow
Concepts for “dry” Experiments
• Summary
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
© Prof. J.C. Freytag, Verwendung
nur zum nicht-kommerziellen Gebrauch
Processing Genome Data
using Scalable Database Technology
Overview
11
Prof. Johann-Christoph Freytag
DBIS, Humboldt-Universität zu Berlin
BLAST: General Introduction
DC-File
•
•
•
Algorithm/Package: Similarity Search
Devloped by Altschul et al. (1990)
Three Steps:
1.
2.
3.
Search for Word Pairs (Iseq, DSeq) of
Length L on the Data Collection of
Sequences above Threshhold T
Expansion of each Word Pair until the
Value V of their Alignment is ∆ away
from the local maximum
Output of complete alignment (Highscoring Segment Pair, HSP), if
Value(Alignment) > S
Preprocessing
FormatDB
Index
Sequences
Features
Query sequence
BLAST
BLAST Call
Report
Output: Powerset of Alignments
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
Processing Genome Data
using Scalable Database Technology
BLAST UDF Implementation
• Goal: Using BLAST in SQL-statements
• How?
– BLAST-UDF implemented as Table Function
– Use in SQL Query
SELECT *
FROM TABLE( BLAST(<Parameter>, <Query Sequence>,
<Comparison Sequence> ))
• Each call returns a set of alignments
over Sequences in the Database
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
© Prof. J.C. Freytag, Verwendung
nur zum nicht-kommerziellen Gebrauch
Processing Genome Data
using Scalable Database Technology
12
Prof. Johann-Christoph Freytag
DBIS, Humboldt-Universität zu Berlin
Structure of UDB Table Function
UDF BLAST
• Implementation:
– Mapping of program into
calling structure for table
functions
– Communiaction between the
different calls via scratchpad
– scratchpad: Storage area
which remains intact and
unchanged between UDF calls
• Storage of data structures for
different steps
• especially for output from
postprocessing: SeqAlign
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
Initialize
FIRST
For all SEQUENCES
Alignment without gaps
OPEN
Postprocessing
For all ALIGNMENTS
Output of the
results
FETCH
Release data structures
related with sequences
CLOSE
Release global data structures
FINAL
Processing Genome Data
using Scalable Database Technology
• (Biological) Motivation/Problems
• Using Scalable Database Technology
– Gene-EYe Integration-Platform
– BLAST-Integration into GDB
– In-and-Out-the-Database: Using Workflow
Concepts for “dry” Experiments
• Summary
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
© Prof. J.C. Freytag, Verwendung
nur zum nicht-kommerziellen Gebrauch
Processing Genome Data
using Scalable Database Technology
Overview
13
Prof. Johann-Christoph Freytag
DBIS, Humboldt-Universität zu Berlin
The Challenge: Exon Skipping
Gene
Protein
One Gen with 100 Exons ⇒2100 ~ 1030 Variations
n Exons within one Gene
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
linearly combined
(splicing)
Used as Pattern
for Protein Generation
Processing Genome Data
using Scalable Database Technology
Challenge: Exon Skipping
Do alternative fusion points new
funktional (i.e. biologigacl meaningful) „patterns“?
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
© Prof. J.C. Freytag, Verwendung
nur zum nicht-kommerziellen Gebrauch
Processing Genome Data
using Scalable Database Technology
14
Prof. Johann-Christoph Freytag
DBIS, Humboldt-Universität zu Berlin
Functional Genomics: Gain of New Insight
First Horizon: Simple Exon Skipping
110110000111111
New Functionality!
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
Processing Genome Data
using Scalable Database Technology
Flow of Processing Steps
Generate
Exon
Sequence
Local Database
(automatic)
Find
Similarity
Search
Remote Tool
(Web Based)
Supported by
local DB
Check
for Biological
Validity
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
© Prof. J.C. Freytag, Verwendung
nur zum nicht-kommerziellen Gebrauch
Processing Genome Data
using Scalable Database Technology
15
Prof. Johann-Christoph Freytag
DBIS, Humboldt-Universität zu Berlin
Implementation
¾ Some facts…
facts…
¾ 60 days 100% load
¾ One splice form per minute
¾ So far: ca. 90.000 splice forms
¾ First biolog. meaningful results
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
Processing Genome Data
using Scalable Database Technology
Cooperations
• Cooperation with
– Univ. of Jena (Rolf Backofen)
– Berlin Center of Bioinformatics (BCB)
• Charite, FU, Max-Planck-Institut (M.
Vingron)
– Industry: IBM, small companies, …
Patrick Chappatte, Switzerland
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
© Prof. J.C. Freytag, Verwendung
nur zum nicht-kommerziellen Gebrauch
Processing Genome Data
using Scalable Database Technology
16
Prof. Johann-Christoph Freytag
DBIS, Humboldt-Universität zu Berlin
Database Environment
IBM p-Server
(sponsored by IBM)
CPU
CPU
CPU
CPU
CPU
CPU
CPU
CPU
2.3 TByte
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
DB2
Processing Genome Data
using Scalable Database Technology
Summary
• Future Work
• Lesson learnt
– Query processing
– Highly Dynamic
Environment
• Include domain knowledge
• Data: changes frequently
• User: changes frequently
– Provide a framework for
•
•
•
•
•
Date integration
Data processing
Data changes
Data dependencies…..
Meta data management
© 2004 DBIS, Humboldt Universität zu Berlin
Prof. Johann Christoph Freytag
© Prof. J.C. Freytag, Verwendung
nur zum nicht-kommerziellen Gebrauch
– Data cleansing
– Set of UDFs for biological
data processing
– Visualization of Data
Processing Genome Data
using Scalable Database Technology
Summary
17
Herunterladen