IDRT: Unlocking Research Data Sources with ETL

First European i2b2 Academic User Meeting
IDRT: Unlocking Research Data Sources with ETL
for use in a Structured Research Database
The IDRT Team (in alphabetical order):
Christian Bauer (presenter), Benjamin Baum, Jan Christoph,
Igor Engel, Thomas Ganslandt, Matthias Löbe, Sebastian Mate,
Daniel Plog, Hans-Ulrich Prokosch, Matthias Quade, Ulrich Sax,
Sebastian Stäubert, Lars Voitel, Alfred Winter
TMF - Technologie und Methodenplattform für die vernetzte medizinische Forschung e. V.
IDRT Architecture
[Figure: IDRT architecture, shown as layers with these labels.
Source systems: operative systems (CDMS, EHR, §21) and standard terminologies (ICD, LOINC, ...).
Extraction (Talend Open Studio, Java): configurable extractors for each type of datasource (ODM, SQL, CSV); individual extractors for each terminology (CSV, OWL, ...); PIDgen.
Staging (Oracle Database): source fact data, source metadata, standard metadata.
Transformation (Talend Open Studio): mapping, modelling, target metadata.
Loading (i2b2, Oracle Database): observation fact table, concept dimension table, patient dimension table.]
1. European i2b2 Academic User Meeting – 25/03/13 - Erlangen
IDRT - Goal
- Create tools for the simple import of medical data into the i2b2 database.
- Challenges:
  - How do we get the data into the database?
  - How do we get i2b2 ontologies for the data?
ETL
- Which data formats do we need?
  - CSV
  - SQL
  - CDISC ODM
- How can we import the data into the i2b2 database?
  - Create generic ETL jobs for each data format.
- How can we get i2b2 ontologies for the patient data?
  - Use configuration files to collect the required user input.
- How will we create the ETL jobs?
  - Use Talend Open Studio.
ETL / Talend Open Studio
- Open source data integration program
  - used to create ETL (extract, transform, load) jobs
  - graphical code generator (Java)
CSV/SQL-ETL
- The user creates a configuration file.
  - display names, data types, patient IDs
- The job loads the patient data from files or a database.
- The job creates the i2b2 ontology and the patients from the configuration and the patient data.
- The job loads the i2b2-specific data and the patient data into the i2b2 database.
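The flow on this slide can be sketched in a few lines of Python. This is only an illustration, not the IDRT Talend job: the configuration keys (`source_name`, `patient_id_column`, `columns`, `nice_name`, `data_type`), the ontology path layout and the function name `run_csv_etl` are assumptions made for the example. The fact keys `patient_num`, `concept_cd`, `valtype_cd`, `nval_num` and `tval_char` are borrowed from the i2b2 observation fact table, but `patient_num` holds the source ID here for brevity.

```python
import csv
import io

# Hypothetical configuration, as a user might write it for one CSV file.
CONFIG = {
    "source_name": "lab",
    "patient_id_column": "pid",
    "columns": {
        "hb": {"nice_name": "Hemoglobin", "data_type": "numeric"},
        "sex": {"nice_name": "Sex", "data_type": "string"},
    },
}

CSV_TEXT = "pid,hb,sex\nP1,13.5,f\nP2,,m\n"


def run_csv_etl(csv_text, config):
    """Return (ontology, patients, facts) built from CSV text and a config."""
    source = config["source_name"]
    # One ontology entry per configured column, named by its display name.
    ontology = [
        {
            "path": f"\\{source}\\{col}\\",
            "name": spec["nice_name"],
            "concept_cd": f"{source}:{col}",
        }
        for col, spec in config["columns"].items()
    ]
    patients, facts = [], []
    # Load the patient data row by row.
    for row in csv.DictReader(io.StringIO(csv_text)):
        pid = row[config["patient_id_column"]]
        if pid not in patients:
            patients.append(pid)
        for col, spec in config["columns"].items():
            value = row.get(col, "")
            if value == "":
                continue  # empty cells produce no fact
            fact = {"patient_num": pid, "concept_cd": f"{source}:{col}"}
            if spec["data_type"] == "numeric":
                fact.update(valtype_cd="N", nval_num=float(value))
            else:
                fact.update(valtype_cd="T", tval_char=value)
            facts.append(fact)
    return ontology, patients, facts
```

Calling `run_csv_etl(CSV_TEXT, CONFIG)` yields two ontology entries, two patients and three facts; writing them to the database would be a separate loading step.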
CSV/SQL-ETL
- ETL jobs can be run inside Talend Open Studio, as Java code, or from a GUI.
- The GUI creates configurations automatically.
- The GUI supports easy editing and automatic saving of configurations.
- The GUI includes a server browser.
ODM-ETL
- CDISC XML standard
- Represents a paper-based clinical study (study -> events -> forms -> item groups -> items)
- No configuration needed!
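Because ODM already carries the study hierarchy, an ontology can be derived from the file itself. The sketch below illustrates that idea with Python's standard XML parser. It is not the IDRT job: the sample document, the function name `odm_ontology_paths` and the backslash-separated path layout are assumptions, and only the ODM 1.3 element and attribute names (`StudyEventDef`, `FormRef`, `ItemGroupRef`, `ItemRef`, ...) come from the CDISC standard.

```python
import xml.etree.ElementTree as ET

NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}

# Tiny hand-written ODM metadata document, for illustration only.
ODM_XML = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3">
  <Study OID="S1">
    <GlobalVariables><StudyName>Demo</StudyName></GlobalVariables>
    <MetaDataVersion OID="v1" Name="v1">
      <StudyEventDef OID="SE1" Name="Baseline" Repeating="No" Type="Scheduled">
        <FormRef FormOID="F1" Mandatory="Yes"/>
      </StudyEventDef>
      <FormDef OID="F1" Name="Vitals" Repeating="No">
        <ItemGroupRef ItemGroupOID="IG1" Mandatory="Yes"/>
      </FormDef>
      <ItemGroupDef OID="IG1" Name="Blood pressure" Repeating="No">
        <ItemRef ItemOID="I1" Mandatory="Yes"/>
        <ItemRef ItemOID="I2" Mandatory="Yes"/>
      </ItemGroupDef>
      <ItemDef OID="I1" Name="Systolic" DataType="integer"/>
      <ItemDef OID="I2" Name="Diastolic" DataType="integer"/>
    </MetaDataVersion>
  </Study>
</ODM>"""


def odm_ontology_paths(xml_text):
    """Return one study\\event\\form\\item group\\item path per ODM item."""
    root = ET.fromstring(xml_text)
    paths = []
    for study in root.findall("odm:Study", NS):
        name = study.findtext("odm:GlobalVariables/odm:StudyName",
                              default=study.get("OID"), namespaces=NS)
        mdv = study.find("odm:MetaDataVersion", NS)
        if mdv is None:
            continue

        def by_oid(tag):
            # Index the definitions of one kind by their OID.
            return {e.get("OID"): e for e in mdv.findall(f"odm:{tag}", NS)}

        forms = by_oid("FormDef")
        groups = by_oid("ItemGroupDef")
        items = by_oid("ItemDef")
        # Follow the references event -> form -> item group -> item.
        for event in mdv.findall("odm:StudyEventDef", NS):
            for form_ref in event.findall("odm:FormRef", NS):
                form = forms[form_ref.get("FormOID")]
                for group_ref in form.findall("odm:ItemGroupRef", NS):
                    group = groups[group_ref.get("ItemGroupOID")]
                    for item_ref in group.findall("odm:ItemRef", NS):
                        item = items[item_ref.get("ItemOID")]
                        names = [name, event.get("Name"), form.get("Name"),
                                 group.get("Name"), item.get("Name")]
                        paths.append("\\" + "\\".join(names) + "\\")
    return paths
```

For the sample document, `odm_ontology_paths(ODM_XML)` returns one path for "Systolic" and one for "Diastolic", both under Demo, Baseline, Vitals, Blood pressure.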
ETL
[Figure: ETL process. Legend: data format, sub-process, work step, tool. Labels: data sources (file/db, ODM); extraction, transformation and loading in TOS; TOS-internal IDRT data structure; transformation and loading into the i2b2 database (i2b2-Oracle).]
- All ETL jobs have two sections.
- Writing to the i2b2 database is the same job for all IDRT imports.
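The two-section design can be illustrated with a small Python sketch: a source-specific section that turns each input into a common intermediate structure, and one shared loading section. All names here (`extract_rows`, `extract_csv`, `load_into_i2b2`, the `patient`/`item`/`value` record layout) are hypothetical, and a plain list stands in for the i2b2 database.

```python
import csv
import io


def extract_rows(rows, patient_col):
    """Section 1 (source-specific): rows as dicts -> intermediate records."""
    records = []
    for row in rows:
        pid = row[patient_col]
        for col, value in row.items():
            if col != patient_col and value:
                records.append({"patient": pid, "item": col, "value": value})
    return records


def extract_csv(csv_text, patient_col):
    """Section 1 for CSV input: parse the text, then reuse extract_rows."""
    return extract_rows(csv.DictReader(io.StringIO(csv_text)), patient_col)


def load_into_i2b2(records, target):
    """Section 2 (shared by all imports): append records to the target."""
    for record in records:
        target.append((record["patient"], record["item"], record["value"]))
    return len(records)


# Two different sources, one loader. The list stands in for the database.
database = []
load_into_i2b2(extract_csv("pid,hb\nP1,13.5\n", "pid"), database)
load_into_i2b2(extract_rows([{"pid": "P2", "sex": "m"}], "pid"), database)
```

Adding a new data format then only requires a new extraction function; the loading section stays unchanged.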
Security:
Integration of a patient pseudonymization service
[Figure: a Talend Open Studio sub job takes patient data with personal information and returns patient data with a pseudonym.]
Response from the patient identification service:
Success:ZLTGHE3D:Es wurde ein passender Fall gefunden.
(English: "A matching case was found.")
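A sub job of this kind has to read the service response and swap the identifying fields for the pseudonym. The Python sketch below assumes the response has the form `status:pseudonym:message`, which is inferred only from the single example on this slide; the function names and the record fields `name` and `birth_date` are hypothetical.

```python
from typing import NamedTuple


class PidResponse(NamedTuple):
    status: str
    pseudonym: str
    message: str


def parse_pid_response(line):
    """Split a response line into status, pseudonym and message.

    Assumes the colon-separated layout seen in the slide's example.
    """
    status, pseudonym, message = line.strip().split(":", 2)
    return PidResponse(status, pseudonym, message)


def pseudonymize(record, response_line, id_fields=("name", "birth_date")):
    """Return a copy of record without identifying fields, plus the pseudonym."""
    response = parse_pid_response(response_line)
    if response.status != "Success":
        raise ValueError(f"pseudonymization failed: {response.message}")
    cleaned = {k: v for k, v in record.items() if k not in id_fields}
    cleaned["pseudonym"] = response.pseudonym
    return cleaned


RESPONSE = "Success:ZLTGHE3D:Es wurde ein passender Fall gefunden."
RECORD = {"name": "Jane Doe", "birth_date": "1970-01-01", "hb": "13.5"}
RESULT = pseudonymize(RECORD, RESPONSE)
```

Here `RESULT` keeps the measurement `hb` and carries the pseudonym `ZLTGHE3D` in place of the name and birth date.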
Discussion & Outlook
IDRT ETL provides an easy-to-use package for importing patient data into i2b2. But …
- i2b2 ontologies generated from CSV/SQL are unstructured and often hard to read
- no complex i2b2 ontologies
- we need sub data elements (1 patient -> n biomaterial specimens)
IDRT 2 solutions:
- expand the ETL to incorporate sub data elements
- create an editor to manipulate and create ontologies
IDRT2 - Editor
[Figure: editor mock-up with a source ontology on the left and a target ontology on the right.]
Source ontology
> terminologies
  > ICD10
> patient data
  > diagnose (diagnosis)
    > date
    > diagnose
  > medication takings
    > Menge (amount)
    > Einheit (unit)
  > Trinkgewohnheit (drinking habit)
    > 0,2l Rotwein pro Tag (0.2 l red wine per day)
    > 200ml Wein pro Tag (200 ml wine per day)
    > 2 kleine Glas Rotwein tgl (2 small glasses of red wine daily)
Target ontology
> Ontology
  > diagnose
    > A00-B99
    > C00-D48
    > …
  > medication takings
    > Menge
    > Einheit
  > Trinkgewohnheit
    > 200 ml Wein/Tag (200 ml wine per day)
Annotated operations: mapping to a terminology, mapping of a date, mapping of more than one item, bundle concepts.
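Two of the operations in the mock-up lend themselves to a small Python sketch: placing diagnosis codes under the ICD-10 chapter ranges of the target ontology, and mapping several source items onto one target concept. The function names and the dictionary-based mapping are assumptions made for illustration; the example strings and the two chapter ranges are taken from the slide.

```python
# ICD-10 chapter ranges shown in the target ontology (only the first two).
ICD10_CHAPTERS = [("A00", "B99"), ("C00", "D48")]

# Several differently worded source items map to one target concept.
VALUE_MAPPING = {
    "0,2l Rotwein pro Tag": "200 ml Wein/Tag",
    "200ml Wein pro Tag": "200 ml Wein/Tag",
    "2 kleine Glas Rotwein tgl": "200 ml Wein/Tag",
}


def icd10_chapter(code):
    """Return the chapter range containing an ICD-10 code, or None."""
    key = code.strip().upper()[:3]  # three-character category, e.g. "A09"
    for start, end in ICD10_CHAPTERS:
        # Letter plus two digits compares correctly as a plain string.
        if start <= key <= end:
            return f"{start}-{end}"
    return None


def map_value(source_item):
    """Map a source item to its target concept; unmapped items pass through."""
    return VALUE_MAPPING.get(source_item, source_item)
```

An editor would let the user build tables like `VALUE_MAPPING` interactively instead of hard-coding them.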
[Figure: IDRT2 architecture (slide from the IDRT2 proposal, TMF ITQM, 25.01.2012, Christian Bauer), shown as layers with these labels.
Source systems: operative systems (CDMS, EHR, §21) and standard terminologies (ICD, §21, …).
Extraction (Talend Open Studio, Java): configurable extractors for each type of datasource (ODM, SQL, CSV); individual extractors for each terminology (ClaML, CSV, …); PIDgen.
Staging (i2b2, Oracle Database): source fact data, source metadata, standard metadata, observation fact table, patient tables, ontology tables (concepts).
Transformation (Talend Open Studio, Oracle Database): mapping, modelling and target metadata, with the i2b2-Ontology-Editor (Java/RCP).
Target (i2b2, Oracle Database): queries, observation fact table, patient tables, ontology tables (concepts, modifier).]