First European i2b2 Academic User Meeting

IDRT: Unlocking Research Data Sources with ETL for Use in a Structured Research Database

The IDRT Team (in alphabetical order): Christian Bauer (presenter), Benjamin Baum, Jan Christoph, Igor Engel, Thomas Ganslandt, Matthias Löbe, Sebastian Mate, Daniel Plog, Hans-Ulrich Prokosch, Matthias Quade, Ulrich Sax, Sebastian Stäubert, Lars Voitel, Alfred Winter

TMF - Technologie und Methodenplattform für die vernetzte medizinische Forschung e. V.
1. European i2b2 Academic User Meeting – 25/03/13 - Erlangen

IDRT Architecture

[Figure: IDRT architecture diagram. Recoverable labels, grouped by stage:
 Source Systems: Operative Systems (CDMS, EHR, §21); Standard Terminologies (ICD, LOINC, ...)
 Extraction (Talend Open Studio, Java): configurable extractors for each type of datasource (CSV, SQL, ODM, ...); individual extractors for each terminology (CSV, OWL, ...); PIDgen
 Staging (Oracle Database): Source Fact Data, Source Metadata, Standard Metadata
 Transformation (Talend Open Studio): Mapping, Modelling, Target Metadata
 Loading (Talend Open Studio) into i2b2 (Oracle Database): Observation Fact Table, Patient Dimension Table, Concept Dimension Table]

IDRT - Goal

Create tools for a simple and easy import of medical data into the i2b2 database.

Challenges:
- How do we get the data into the database?
- How do we get i2b2 ontologies for the data?

ETL

[Figure: the IDRT architecture diagram from the "IDRT Architecture" slide, repeated.]
ETL

Which data formats do we need?
- CSV
- SQL
- CDISC-ODM

How can we import the data into the i2b2 database?
- Create generic ETL jobs for the data formats.

How can we get i2b2 ontologies for the patient data?
- Use configuration files to get some user input.

How will we create the ETL?
- Use Talend Open Studio.

ETL / Talend Open Studio

- Open source data integration program
- Used for the creation of ETL (extract, transform, load) jobs
- Graphical code generator (Java)

CSV/SQL-ETL

- The user creates a configuration file: nice names, data types, patient IDs.
- The job loads the patient data from files or a database.
- The job creates the i2b2 ontology and the patients based on the configuration and the patient data.
- The job loads the i2b2-specific data and the patient data into the i2b2 database.
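The CSV/SQL-ETL steps can be illustrated with a minimal sketch. It is written in Python for brevity, not in the Java that Talend Open Studio generates, and it is not the IDRT implementation. The configuration layout, the column names (pid, sex, hb), and the simplified output records are assumptions made for illustration; real i2b2 tables have more columns.

```python
import csv
import io

# Hypothetical configuration: nice names, data types, and the patient ID column.
CONFIG = {
    "patient_id_column": "pid",
    "columns": {
        "sex": {"nice_name": "Sex", "data_type": "string"},
        "hb": {"nice_name": "Hemoglobin", "data_type": "numeric"},
    },
}


def build_ontology(config):
    """Create one simplified ontology entry per configured column."""
    return [
        {"concept": column, "name": meta["nice_name"], "data_type": meta["data_type"]}
        for column, meta in config["columns"].items()
    ]


def build_facts(config, rows):
    """Turn every non-empty cell into one simplified observation fact."""
    pid_column = config["patient_id_column"]
    facts = []
    for row in rows:
        for column, meta in config["columns"].items():
            raw = row.get(column) or ""
            if raw == "":
                continue  # no value recorded for this patient
            value = float(raw) if meta["data_type"] == "numeric" else raw
            facts.append(
                {"patient_id": row[pid_column], "concept": column, "value": value}
            )
    return facts


def run_etl(config, csv_text):
    """Load the CSV, then derive ontology, patients, and facts from it."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    patients = sorted({row[config["patient_id_column"]] for row in rows})
    return build_ontology(config), patients, build_facts(config, rows)


# Made-up sample data; the second patient has no hemoglobin value.
SAMPLE_CSV = "pid,sex,hb\np1,f,13.5\np2,m,\n"
ontology, patients, facts = run_etl(CONFIG, SAMPLE_CSV)
```

In this sketch the final step, writing the three results into the i2b2 database, is left out; only the derivation of ontology, patients, and facts from configuration plus patient data is shown.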
CSV/SQL-ETL

- ETL jobs can be run inside Talend Open Studio, as Java code, or in a GUI.
- The GUI creates configurations automatically.
- The GUI offers easy editing and automatic saving of configurations.
- The GUI has a server browser.

ODM-ETL

- CDISC XML standard
- Represents a paper-based trial study (study -> events -> forms -> item groups -> items)
- No configuration needed!

ETL

[Figure: overview of the ETL sub-processes (Extraction, Transformation, Loading) by work step, tool, and data format. Recoverable labels: datasources (file/db, ODM); tool: TOS; data formats: TOS-internal, IDRT data structure, TOS-internal; target: i2b2 database (i2b2-Oracle).]

- All ETL jobs have two sections.
- Writing to the i2b2 database is the same job for all the IDRT imports.

Security: integration of a patient pseudonymization service

- Patient data with personal information
- Talend Open Studio sub job
- Patient data with pseudonym
- Answer from the patient identification: "Success:ZLTGHE3D:Es wurde ein passender Fall gefunden." (in English: "A matching case was found.")

Discussion & Outlook

IDRT ETL provides an easy-to-use package for importing patient data into i2b2.
But …
- CSV/SQL i2b2 ontologies are unstructured and often not nice to look at.
- No complex i2b2 ontologies
- We need sub data elements (1 patient -> n biomaterial specimens).

IDRT 2 solutions:
- Expand the ETL to incorporate sub data elements.
- Create an editor to manipulate and create ontologies.

IDRT2 - Editor

[Figure: editor with a source ontology on one side and a target ontology on the other. Nesting below is reconstructed from the flattened labels.]

Source ontology
> terminologies
  > ICD10
> patient data
  > diagnosis
    > date
    > diagnosis
  > medication intake
    > quantity
    > unit
  > drinking habit
    > 0,2l Rotwein pro Tag (0.2 l red wine per day)
    > 200ml Wein pro Tag (200 ml wine per day)
    > 2 kleine Glas Rotwein tgl (2 small glasses of red wine daily)

Target ontology
> Ontology
  > diagnosis
    > A00-B99
    > C00-D48
    > …
  > medication intake
    > quantity
    > unit
  > drinking habit
    > 200 ml Wein/Tag (200 ml wine per day)

Callouts in the figure:
- 1: mapping to terminology
- 2: bundle concepts
- 3: mapping of more than one item
- mapping of a date

[Figure: IDRT2 architecture diagram. Recoverable labels, grouped by stage:
 Source systems: Operative Systems (CDMS, EHR, §21); Standard Terminologies (ICD, §21, …)
 Extraction (Talend Open Studio, Java): configurable extractors for each type of datasource (ODM, SQL, CSV); individual extractors for each terminology (ClaML, CSV, …); PIDgen
 Staging (Oracle Database): Source Fact Data, Source Metadata, Standard Metadata
 Transformation (Talend Open Studio, Oracle Database): Mapping, Modelling, Target Metadata; i2b2-Ontology-Editor (Java/RCP)
 i2b2 (Oracle Database): Queries, Ontology Tables (Concepts, Modifier), Patient Tables, Observation Fact Table]

Source of the diagram: IDRT2 proposal / TMF ITQM, 25.01.2012 / Christian Bauer
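The "mapping of more than one item" case from the IDRT2 editor slide can be sketched as a lookup from several source items to one target concept. This is a Python illustration only, not the editor's data model: the dictionary layout and the function name map_items are assumptions, and the unmapped item "kein Alkohol" is invented to show what happens to values without a mapping. The three wine strings and the target concept are taken from the slide.

```python
# Hypothetical mapping table: several free-text source items point at one
# target concept, as in the drinking habit example on the editor slide.
SOURCE_TO_TARGET = {
    "0,2l Rotwein pro Tag": "200 ml Wein/Tag",
    "200ml Wein pro Tag": "200 ml Wein/Tag",
    "2 kleine Glas Rotwein tgl": "200 ml Wein/Tag",
}


def map_items(source_items, mapping):
    """Group source items by target concept and collect items with no mapping."""
    mapped = {}
    unmapped = []
    for item in source_items:
        target = mapping.get(item)
        if target is None:
            unmapped.append(item)  # left for manual mapping in an editor
        else:
            mapped.setdefault(target, []).append(item)
    return mapped, unmapped


# "kein Alkohol" is a made-up source item that has no mapping yet.
mapped, unmapped = map_items(
    ["200ml Wein pro Tag", "0,2l Rotwein pro Tag", "kein Alkohol"],
    SOURCE_TO_TARGET,
)
```

Keeping the unmapped items separate, rather than dropping them, reflects the stated motivation for an editor: someone still has to decide where such values belong in the target ontology.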