Praktikum Information Integration

Information Integration
Praktikum
Ulf Leser
Pre-Requisites
• You need to …
– Have finished your Vorstudium (or have special permission)
– Be somewhat experienced in Java
– Work well in a team and contribute to your group's solutions
– Be willing to invest considerable time
– Be prepared to present and discuss implemented solutions
Ulf Leser: Information Integration, Praktikum, WS 2008/2009
Organization
• The Praktikum consists of 5-6 assignments
– Each assignment requires implementing a certain program etc.
– There are 2-3 weeks for each assignment
– Solutions must be sent in by mail the day before the next assignment
– Solutions will be presented; usually 1-4 solutions per assignment
• We will build groups (today)
– Each group has 2-3 students
– A group fails or passes in its entirety
– All exercises must be solved to pass the Praktikum
• Schedule
– A typical Praktikum consists of
• Presentation and discussion of solutions to previous assignment
• Presentation of new Assignment
– All classes with presentations / new assignments are mandatory
– Other classes are voluntary; I'll be here to answer questions
The Web
• All material will be linked on the web after each class
• All data files will be downloadable from the Web site
• Students' presentations will not be put online
– So don’t worry about your layout
Your Presentations
• Each group must be prepared to present its solution for
every assignment (there are only 5)
– We select groups at random at each class
– The group may send any student to present
– Before a student presents twice, all other group members must
have presented at least once
• Presentations need a certain level of quality
– Present key ideas, key insights, key problems
– Critical code snippets are good
– Putting up 100 LOC per slide is not good
– Draw conclusions
• What did you learn?
• What went wrong?
• What would you do differently next time?
Tentative Schedule
• Certain dates
– 21.10.2008: Group formation; Assignment 1
– 4.11.2008: Solutions to assignment 1; Assignment 2
• Tentative further dates
– 18.11.2008: Solutions to assignment 2; assignment 3
– 09.12.2008: Solutions to assignment 3; assignment 4
– 06.01.2009: Solutions to assignment 4; assignment 5
– 27.01.2009: Solutions to assignment 5; assignment 6
– 10.02.2009: Solutions to assignment 6; Wrap-up
Contest ???
Content
• Each group will eventually have built an integrated
information system of human genes, chromosomes,
diseases, and gene function
• Based on a relational database
• Some sources are integrated physically
– Downloads, flat-files, XML
• Some sources are integrated virtually
– JAVA-RPC-API
– Web wrapping (screen scraping)
• Data quality is an issue
– Consistency of duplicate information
• Biological Background
– Some slides from Silke Trissl
Cells
• Approx. 75 trillion cells in the human body
• Approx. 250 different types: nerve, skin, muscle, ...
DeoxyriboNucleicAcid
• DNA: deoxyribonucleic acid
• Carrier of the inherited information: the genome
• All life uses DNA (RNA) built from the same 4 (5) molecules
The human genome
• ... AGGCTGATGGATTAGAGACC ...
• 23 chromosome pairs
• ~ 3,000,000,000 letters
• ~ 50% consists of 4 "parasites"
• ~ 100,000 genes
• ~ 56,000 genes
• ~ 30,000 genes
• ~ 24,000 genes
• ~ 20,000 genes
What is a gene?
[Figure: chromosome, DNA, RNA, protein]
From DNA to protein
• Transcription
– Copying
– DNA → m(essenger)RNA
– DNA ↔ RNA
• Double-stranded ↔ single-stranded
• Thymine (T) ↔ Uracil (U)
• Translation
– Translating
– mRNA → protein
– RNA ↔ protein
• 3 bases → 1 amino acid
"Central Dogma of Molecular Biology"
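The transcription step above can be sketched in a few lines of Python. This is only an illustration, not part of the assignments: the function names and the example sequence are assumptions, and the DNA is taken to be the coding strand, so transcription reduces to replacing T with U.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a DNA coding strand (T is replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def codons(mrna: str) -> list[str]:
    """Split an mRNA into complete triplets; leftover bases are dropped."""
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]


mrna = transcribe("CTTAGTGACTACGGTAAA")  # assumed example sequence
print(mrna)          # CUUAGUGACUACGGUAAA
print(codons(mrna))  # ['CUU', 'AGU', 'GAC', 'UAC', 'GGU', 'AAA']
```

Each of the six triplets is later translated into one amino acid.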
Gene
• Section of the genome
– Template for producing a functional product
• RNA only, or further on to protein
– All sequences directly involved
• 5' UTR, 3' UTR (untranslated region)
• Only the section between start and stop codon is translated into protein
[Figure: gene structure from 5' to 3' with enhancer, promoter (TATA), template for the primary transcript, 5' UTR, start codon, exons, introns, stop codon, 3' UTR]
Translation
• Translation of
– the nucleotide sequence of the mRNA
– into the amino acid sequence of the proteins
• Every 3 bases (codon) code for 1 amino acid
• How many possible combinations?
– Triplet → 3 positions
– 4 possible letters (A, T (U), G, C)
– 4^3 = 64 possible combinations
– But only 20 amino acids
• Redundancy in the genetic code
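The counting argument above can be checked with a short Python sketch. The 64-letter string is the standard genetic code written in T, C, A, G order with one-letter amino acid codes; it is not taken from the slides, and all variable names are illustrative assumptions.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Enumerate all triplets over 4 letters: 4^3 = 64 codons.
all_codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
codon_table = dict(zip(all_codons, AMINO_ACIDS))

# Count how many codons map to each amino acid (the redundancy).
codons_per_amino_acid = Counter(codon_table.values())
stop_codons = codons_per_amino_acid.pop("*")

print(len(all_codons))             # 64
print(len(codons_per_amino_acid))  # 20
print(stop_codons)                 # 3
print(codons_per_amino_acid["L"])  # 6 codons for leucine
```

Since 61 codons are shared among 20 amino acids, most amino acids are encoded by more than one codon.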
Tertiary structure
• Tertiary structure: spatial arrangement of the secondary structure elements
1b71 from the Protein Data Bank
Mutations in DNA: effects
• Wild type
– DNA: CTTAGTGACTACGGTAAA
– Protein: Leu Ser Asp Tyr Gly Lys
• Fatal mutations
– DNA: CTTAGTGACTAGGGTAAA
– Protein: Leu Ser Asp Stop codon
• Frameshift mutations
– DNA: CTTAGTGAACTACGGTAAA
– Protein: Leu Ser Glu Leu Arg Stop codon
• Neutral mutations
– DNA: CTTAGCGACTACGGTAAA
– Protein: Leu Ser Asp Tyr Gly Lys
• Functional mutations
– DNA: CTTAGTGAATACGGTAAA
– Protein: Leu Ser Glu Tyr Gly Lys
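The effect of each mutation can be reproduced by translating the DNA sequences codon by codon. The following Python sketch assumes the standard genetic code (one-letter amino acid codes, '*' for stop) and reads each sequence as the coding strand starting at its first base; the function and variable names are illustrative only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip(("".join(t) for t in product(BASES, repeat=3)), AMINO_ACIDS)
)


def translate(dna: str) -> str:
    """Translate a coding strand triplet by triplet, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        protein.append(amino_acid)
        if amino_acid == "*":
            break
    return "".join(protein)


sequences = {
    "wild type": "CTTAGTGACTACGGTAAA",
    "fatal": "CTTAGTGACTAGGGTAAA",
    "frameshift": "CTTAGTGAACTACGGTAAA",
    "neutral": "CTTAGCGACTACGGTAAA",
    "functional": "CTTAGTGAATACGGTAAA",
}
for label, dna in sequences.items():
    print(f"{label:11} {dna:20} {translate(dna)}")
```

With the standard code, the wild type and the neutral mutation both give LSDYGK, the fatal mutation truncates the protein to LSD*, the frameshift gives LSELR*, and the functional mutation gives LSEYGK.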
Human genes
• ~ 25,000 genes
– Nobody knows the exact number yet
• Length between 100 bp and 2 Mbp (introns + exons)
• Average length of the coding region: 1400 bp
– Average protein length: 447 amino acids
• An average gene has 9 exons
• Only a few percent of the human genome is coding
– Rest: "junk"?
– Many repeats, transposons
– Regulatory elements
– Pseudogenes
– Chromosomal structure: centromeres and telomeres
Protein – already known
• Name: Phenylalanine-4-hydroxylase
• Length: 452 amino acids
• EC number: 1.14.16.1
• Catalyzed reaction: Phenylalanine + Tetrahydrobiopterin + O2 → Tyrosine + 4a-Hydroxytetrahydrobiopterin
PROSITE – conserved regions
• Pattern
– P - D - x(2) - H - [DE] - [LIVF] - [LIVMFY] - G - H - [LIVMC] - [PA]
• Where is it in the protein sequence?
MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKL
RLFEENDVNLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIG
ATVHELSRDKKKDTVPWFPRTIQELDRFANQILSYGAELDADHPGFKDPVY
RARRKQFADIAYNYRHGQPIPRVEYMEEEKKTWGTVFKTLKSLYKTHACYE
YNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGFRLRPVAGLLSSRDFLG
GLAFRVFHCTQYIRHGSKPMYTPQPDICHELLGHVPLFSDRSFAQFSQEIG
LASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCL
SEKPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRP
FSVRYDPYTQRIEVLDNTQQLKILADSINSEIGILCSALQKIK
– Match at positions 281-292: PDICHELLGHVP
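One way to locate such a pattern is to convert it into a regular expression. The Python sketch below is an assumption-laden illustration, not PROSITE's own tooling: the helper `prosite_to_regex` is hypothetical, handles only the basic pattern syntax, and is applied to a single line of the sequence above, so the reported offset is relative to that fragment.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE pattern into a Python regular expression.

    Supports single residues, x (any residue), [..] alternatives,
    {..} exclusions and (n) or (n,m) repetitions.
    """
    parts = []
    for element in pattern.split("-"):
        element = element.strip()
        match = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."  # any amino acid
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # any amino acid except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


PATTERN = "P - D - x(2) - H - [DE] - [LIVF] - [LIVMFY] - G - H - [LIVMC] - [PA]"
FRAGMENT = "GLAFRVFHCTQYIRHGSKPMYTPQPDICHELLGHVPLFSDRSFAQFSQEIG"

regex = prosite_to_regex(PATTERN)
hit = re.search(regex, FRAGMENT)
print(regex)        # PD.{2}H[DE][LIVF][LIVMFY]GH[LIVMC][PA]
print(hit.group())  # PDICHELLGHVP
print(hit.start())  # 24 (0-based offset within the fragment)
```

Running the same search over the complete, concatenated sequence would give the position within the whole protein.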
Gene Location
• PAH gene: long (q) arm of chromosome 12 between
positions 22 and 24.2
• Base pair 101,756,233 to 101,835,510
Source: http://ghr.nlm.nih.gov
Breast Cancer: BRCA1 and BRCA2
• An estimated 5 percent to 10 percent of all breast cancers are hereditary…
• Variations of the BRCA1, BRCA2, CDH1, PTEN, STK11, and TP53 genes increase the risk of developing breast cancer
• The AR, ATM, BARD1, BRIP1, CHEK2, DIRAS3, ERBB2, NBN, PALB2, RAD50, and RAD51 genes are associated with breast cancer
Source: http://ghr.nlm.nih.gov
Sickle Cell Disease
What we want to do
• A practical project in information integration
• Which diseases (functions) are located on which
chromosomes?
• Requires information about
– Genes
– Gene locations
– Gene – disease associations
– Gene – function associations
– …
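Once genes, locations, and associations sit in a relational database, the question above becomes a join. The following Python sketch uses an in-memory SQLite database purely for illustration. The table and column names are hypothetical and not the schema used in the assignments. Apart from PAH on chromosome 12 and the BRCA1/BRCA2 link to breast cancer, the rows are well-known facts added here as sample data (BRCA1 on chromosome 17, BRCA2 on chromosome 13, PAH associated with phenylketonuria).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical two-table schema: gene locations and gene-disease associations.
conn.executescript("""
    CREATE TABLE gene (symbol TEXT PRIMARY KEY, chromosome TEXT);
    CREATE TABLE gene_disease (
        symbol  TEXT REFERENCES gene(symbol),
        disease TEXT
    );
""")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?)",
    [("PAH", "12"), ("BRCA1", "17"), ("BRCA2", "13")],  # sample data
)
conn.executemany(
    "INSERT INTO gene_disease VALUES (?, ?)",
    [
        ("PAH", "phenylketonuria"),  # sample data, not from the slides
        ("BRCA1", "breast cancer"),
        ("BRCA2", "breast cancer"),
    ],
)

# Which diseases are located on which chromosomes?
rows = conn.execute("""
    SELECT d.disease, g.chromosome, g.symbol
    FROM gene AS g
    JOIN gene_disease AS d ON d.symbol = g.symbol
    ORDER BY d.disease, CAST(g.chromosome AS INTEGER)
""").fetchall()
for row in rows:
    print(row)
```

The chromosome is stored as text because human chromosomes also include X and Y; the cast is only used to order the numbered chromosomes.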
Group Formation - Tuesday
• Group 1: Schröder, Bicking, Rinck
• Group 2: Wagner, Zahn, Warmuth
• Group 3: Menger, Kuhlmeyer, Bauersfeld
• Group 4: Herzog, Redlin, Haddenhorst
• Group 5: Arzt, Freund, Dautcourt
• Group 6: Henke, Gehrels, Scheunemann
• Group 7: Hamann, Kalleske, Przewozny
• Please build groups in GOYA asap
– Using group name „GroupX“
Group Formation - Thursday
• Group 10: Jacob, Schuh, Münnich
• Group 11: Starlinger, Rheinländer, Eicher
• Group 12: Konnegen, Dinh Viet, Afanasyeva
• Group 13: Preuß, Bachmann
• Group 14: Serediouk, Pöthig, Schulze
• Group 15: Quade, Zimmermann, Ermakova
• Group 16: Schrepfer, Wolf
• Please build groups in GOYA asap
– Using group name „GroupX“
Questions?