Information Integration Praktikum
Ulf Leser

Pre-Requisites
• You need to …
  – Have finished your Vorstudium (or have special permission)
  – Be somewhat experienced in Java
  – Behave well and contribute to your group's solutions
  – Be willing to invest considerable time
  – Be prepared to present and discuss implemented solutions

Ulf Leser: Information Integration, Praktikum, WS 2008/2009

Organization
• The Praktikum consists of 5-6 assignments
  – Each assignment requires implementing a certain program etc.
  – There are 2-3 weeks for each assignment
  – Solutions must be sent in by mail the day before the next assignment
  – Solutions will be presented; usually 1-4 solutions per assignment
• We will build groups (today)
  – Each group has 2-3 students
  – A group fails or passes in its entirety
  – All exercises must be solved to pass the Praktikum
• Schedule
  – A typical Praktikum session consists of
    • Presentation and discussion of solutions to the previous assignment
    • Presentation of the new assignment
  – All classes with presentations / new assignments are mandatory
  – Other classes are voluntary
  – I'll be here to answer questions

The Web
• All material will be linked on the web after each class
• All data files will be downloadable from the web site
• Students' presentations will not be put online
  – So don't worry about your layout

Your Presentations
• Each group must be prepared to present its solution for every assignment (there are only 5)
  – We select groups at random at each class
  – The group may send any student to present
  – Before a student presents twice, all other group members must have presented at least once
• Presentations need a certain level of quality
  – Present key ideas, key insights, key problems
  – Critical code snippets are good
  – Putting up 100 LOC per slide is not good
  – Draw conclusions
    • What did you learn?
    • What went wrong?
    • What would you do differently next time?

Tentative Schedule
• Certain dates
  – 21.10.2008: Group formation; Assignment 1
  – 4.11.2008: Solutions to assignment 1; Assignment 2
• Tentative further dates
  – 18.11.2008: Solutions to assignment 2; Assignment 3
  – 09.12.2008: Solutions to assignment 3; Assignment 4
  – 06.01.2009: Solutions to assignment 4; Assignment 5
  – 27.01.2009: Solutions to assignment 5; Assignment 6
  – 10.02.2009: Solutions to assignment 6; Wrap-up
• Contest ???

Content
• Each group will eventually have built an integrated information system of human genes, chromosomes, diseases, and gene function
• Based on a relational database
• Some sources are integrated physically
  – Downloads, flat files, XML
• Some sources are integrated virtually
  – Java RPC API
  – Web wrapping (screen scraping)
• Data quality is an issue
  – Consistency of duplicate information

• Biological Background
  – Some slides from Silke Trissl

Cells
• Approx. 75 trillion cells in the human body
• Approx. 250 different types: nerve, skin, muscle, ...

Deoxyribonucleic Acid
• DNA: deoxyribonucleic acid (German: Desoxyribonukleinsäure)
• Carrier of the inherited information
  – Genome
• All life uses DNA (RNA) built from the same 4 (5) molecules

The Human Genome
• ... AGGCTGATGGATTAGAGACC ...
• 23 chromosome pairs
• ~ 3,000,000,000 letters
• ~ 50% consists of 4 "parasites"
• ~ 100,000 genes
• ~ 56,000 genes
• ~ 30,000 genes
• ~ 24,000 genes
• ~ 20,000 genes

What Is a Gene?
[Figure: chromosome, DNA (A C G T T G A T G A C C A G A G C T T G T), RNA (A C G T T G A C A G A G C T T G T), protein]

From DNA to Protein
• Transcription
  – Copying
  – DNA → m(essenger)RNA
  – DNA ↔ RNA
    • Double-stranded ↔ single-stranded
    • Thymine (T) ↔ Uracil (U)
• Translation
  – Translating
  – mRNA → protein
  – RNA ↔ protein
    • 3 bases → 1 amino acid
"Central Dogma of Molecular Biology"

Gene
• Section of the genome
  – Template for producing a functional product
    • RNA only, or further on to protein
  – All sequences directly involved in this
    • 5' UTR, 3' UTR (untranslated region)
    • Only the section between start and stop codon is translated into protein; the UTRs are transcribed into mRNA but not translated
[Figure: gene structure on the double strand (5' to 3' and 3' to 5') showing enhancer, promoter with TATA box, 5' UTR, start codon, exons, introns, stop codon, and 3' UTR; the template for the primary transcript]

Translation
• Converts
  – the nucleotide sequence of the mRNA
  – into the amino acid sequence of the protein
• Every 3 bases (one codon) code for 1 amino acid
• How many possible combinations?
  – Triplet → 3 positions
  – 4 possible letters (A, T (U), G, C)
  – 4³ = 64 possible combinations
  – But only 20 amino acids
• Redundancy in the genetic code

Tertiary Structure
• Tertiary structure: spatial arrangement of the secondary structure elements
[Figure: structure 1b71 from the Protein Data Bank]

Mutations in DNA: Effects
• Wild type
  – DNA: CTTAGTGACTACGGTAAA
  – Protein: Leu Ser Asp Tyr Gly Lys
• Fatal mutation
  – DNA: CTTAGTGACTAGGGTAAA
  – Protein: Leu Ser Asp (stop codon)
• Frameshift mutation
  – DNA: CTTAGTGAACTACGGTAAA
  – Protein: Leu Ser Glu Leu Arg (stop codon)
• Neutral mutation
  – DNA: CTTAGCGACTACGGTAAA
  – Protein: Leu Ser Asp Tyr Gly Lys
• Functional mutation
  – DNA: CTTAGTGAATACGGTAAA
  – Protein: Leu Ser Glu Tyr Gly Lys

Human Genes
• ~ 25,000 genes
  – Nobody knows the exact number yet
• Length between 100 bp and 2 Mbp (introns + exons)
• Average length of the coding region: 1,400 bp
  – Average protein length: 447 amino acids
• An average gene has 9 exons
• Only a few percent of the human genome is coding
  – Rest: "junk"?
  – Many repeats, transposons
  – Regulatory elements
  – Pseudogenes
  – Chromosomal structure: centromeres and telomeres

Protein: Already Known
• Name: Phenylalanine-4-hydroxylase
• Length: 452 amino acids
• EC number: 1.14.16.1
• Catalyzed reaction: phenylalanine + tetrahydrobiopterin + O2 → tyrosine + 4a-hydroxytetrahydrobiopterin

PROSITE: Conserved Regions
• Pattern
  – P - D - x(2) - H - [DE] - [LIVF] - [LIVMFY] - G - H - [LIVMC] - [PA]
• Where is it in the protein sequence?
MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKL
RLFEENDVNLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIG
ATVHELSRDKKKDTVPWFPRTIQELDRFANQILSYGAELDADHPGFKDPVY
RARRKQFADIAYNYRHGQPIPRVEYMEEEKKTWGTVFKTLKSLYKTHACYE
YNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGFRLRPVAGLLSSRDFLG
GLAFRVFHCTQYIRHGSKPMYTPQPDICHELLGHVPLFSDRSFAQFSQEIG
LASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCL
SEKPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRP
FSVRYDPYTQRIEVLDNTQQLKILADSINSEIGILCSALQKIK
(pattern match marked on the slide at positions 281-292)

Gene Location
• PAH gene: long (q) arm of chromosome 12 between positions 22 and 24.2
• Base pair 101,756,233 to 101,835,510
Source: http://ghr.nlm.nih.gov

Breast Cancer: BRCA1 and BRCA2
• An estimated 5 percent to 10 percent of all breast cancers are hereditary…
• Variations of the BRCA1, BRCA2, CDH1, PTEN, STK11, and TP53 genes increase the risk of developing breast cancer
• The AR, ATM, BARD1, BRIP1, CHEK2, DIRAS3, ERBB2, NBN, PALB2, RAD50, and RAD51 genes are associated with breast cancer
Source: http://ghr.nlm.nih.gov

Sickle Cell Disease

What we want to do
• A practical project in information integration
• Which diseases (functions) are located on which chromosomes?
• Requires information about
  – Genes
  – Gene locations
  – Gene – disease associations
  – Gene – function associations
  – …

Group Formation - Tuesday
• Group 1: Schröder, Bicking, Rinck
• Group 2: Wagner, Zahn, Warmuth
• Group 3: Menger, Kuhlmeyer, Bauersfeld
• Group 4: Herzog, Redlin, Haddenhorst
• Group 5: Arzt, Freund, Dautcourt
• Group 6: Henke, Gehrels, Scheunemann
• Group 7: Hamann, Kalleske, Przewozny
• Please build groups in GOYA asap
  – Using group name "GroupX"

Group Formation - Thursday
• Group 10: Jacob, Schuh, Münnich
• Group 11: Starlinger, Rheinländer, Eicher
• Group 12: Konnegen, Dinh Viet, Afanasyeva
• Group 13: Preuß, Bachmann
• Group 14: Serediouk, Pöthig, Schulze
• Group 15: Quade, Zimmermann, Ermakova
• Group 16: Schrepfer, Wolf
• Please build groups in GOYA asap
  – Using group name "GroupX"

Questions?
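
Follow-up to the PROSITE slide: a pattern such as P - D - x(2) - H - [DE] - [LIVF] - [LIVMFY] - G - H - [LIVMC] - [PA] can be located in a protein sequence by rewriting it as a regular expression. The Java sketch below is only an illustration under stated assumptions. The class name PrositeScan and the methods toRegex and findFirst are hypothetical and not part of any PROSITE tool. Only the basic pattern elements are handled (single residues, x, [..], {..}, and repetition counts). The input is a short excerpt of the PAH sequence rather than the full protein.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any PROSITE tool.
final class PrositeScan {

    /** Translates a basic PROSITE pattern into a Java regular expression. */
    static String toRegex(String prosite) {
        StringBuilder regex = new StringBuilder();
        // Elements are separated by '-'; blanks around them carry no meaning.
        for (String element : prosite.replace(" ", "").split("-")) {
            String repetition = "";
            int paren = element.indexOf('(');
            if (paren >= 0) {
                // x(2) becomes .{2}; x(2,4) becomes .{2,4}
                repetition = "{" + element.substring(paren + 1, element.length() - 1) + "}";
                element = element.substring(0, paren);
            }
            if (element.equals("x")) {
                regex.append('.');            // any amino acid
            } else if (element.startsWith("{")) {
                // {AB} means any residue except A or B
                regex.append("[^").append(element, 1, element.length() - 1).append(']');
            } else {
                regex.append(element);        // single residue or [..] class
            }
            regex.append(repetition);
        }
        return regex.toString();
    }

    /** Returns the 0-based start of the first match, or -1 if there is none. */
    static int findFirst(String regex, String sequence) {
        Matcher m = Pattern.compile(regex).matcher(sequence);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String pattern = "P - D - x(2) - H - [DE] - [LIVF] - [LIVMFY] - G - H - [LIVMC] - [PA]";
        String excerpt = "TPQPDICHELLGHVPLFSD"; // excerpt of the PAH sequence from the slide
        String regex = toRegex(pattern);
        System.out.println(regex);                     // PD.{2}H[DE][LIVF][LIVMFY]GH[LIVMC][PA]
        System.out.println(findFirst(regex, excerpt)); // 3
    }
}
```

The returned index is 0-based and relative to the string that was searched. To compare it with residue numbers as used on the slide, add 1 and, when searching an excerpt, the offset of the excerpt within the full sequence.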