slide deck - Beuth Hochschule für Technik Berlin

Agenda
The following work is a prototype financed by the BMWi in the
projects Smart Data Web and MACSS. This talk focuses on:
Digitization: Insights from text and relational data
Our Work: In-Data-Base-Text-Mining (INDREX)
What's next? Learning to Join between Text and Relations
Beuth Hochschule für Technik Berlin – Alexander Löser – www.datexis.com
Why Digitization Again?
Exclusive signals help us and our machines optimize our core business.
1. Live longer and healthier, maximize therapy adherence.
2. Understand what your customer will likely buy, now and tomorrow.
3. Given our understanding of our customer, we optimize
• output for given human labor
• natural resource consumption
• technology development and usage
… until “good enough”
These optimizations may create cost savings in our organization.
Better: We sell a platform that enables these optimizations for our customers.
Data-driven Product Examples
We conduct BMWi projects (Smart Data Web, ExCELL and MACSS) and
industry-funded projects with our partners.
Siemens SE, Zalando SE, ubermetrics GmbH: All about
my brands, products, suppliers, competitors, acquisition
targets … from blogs, forums, news, …. and please, fresh!
Charité, SAP SE: Interactive medical systems, such as
patient diaries, anamnesis, all about me; correlate
information with signals from medical systems
SpringerNature: Return condensed tables to scientists instead
of forcing them to read publications
We heard many more: fraud, CRM, call center support, due
diligence for M&A, legal information from the Web, …
Image Sources: Brinki (http://www.flickr.com/photos/brinkmann/493590524/) [CC BY-SA 2.0]
Heidelberger Life-Science Lab (Heidelberger Life-Science Lab) [CC BY-SA 3.0]
Paul Goyette (http://www.flickr.com/photos/pgoyette/168076182/) [CC BY-SA 2.0 ]
The typical Data Science Process
Data-driven products implement an iterative process.
All about leads, brands, suppliers, patients, diseases….
Process cycle (diagram): Sample from raw data → Cleanse and recombine samples → Learn model → Test model → Iterate and resample
Outcome: Data-driven product
Platform: Distributions (IBM BigInsights, Cloudera, MapR,
Hortonworks), main-memory databases or cloud providers (AWS etc.)
INDREX leverages existing Language Models
INDREX builds a universal relation on top of the DWH and text data. The
single system benefits from built-in features of the RDBMS (columnar
layout, security, views, optimizer, transactional behavior, main memory, …).
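The idea of one relation over warehouse and text data can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for the RDBMS; the table names, columns and sample rows (customer, text_fact, customer_fact) are hypothetical and are not taken from INDREX.

```python
import sqlite3

# In-memory database standing in for the RDBMS (assumption: not the INDREX setup).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Relational side: a warehouse table (hypothetical schema).
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    INSERT INTO customer VALUES (1, 'Alice Example', 'Berlin'),
                                (2, 'Bob Sample', 'Hamburg');

    -- Text side: facts already extracted from documents (hypothetical schema).
    CREATE TABLE text_fact (doc_id INTEGER, predicate TEXT,
                            person TEXT, time TEXT);
    INSERT INTO text_fact VALUES (10, 'marry_in', 'Alice Example', '2014'),
                                 (11, 'marry_in', 'Carol Unknown', '2015');

    -- One view over both sources, so that queries see a single relation.
    CREATE VIEW customer_fact AS
        SELECT c.id AS customer_id, c.name, c.city,
               f.doc_id, f.predicate, f.time
        FROM customer AS c
        JOIN text_fact AS f ON f.person = c.name;
""")

rows = conn.execute(
    "SELECT name, city, predicate, time FROM customer_fact ORDER BY customer_id"
).fetchall()
print(rows)
```

Because the combined data is exposed as an ordinary view, the database's optimizer and access control apply to it like to any other relation, which is the benefit the slide points to.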
A. Akbik, A. Löser: KRAKEN: N-ary Facts in Open Information Extraction. AKBC-WEKEX @IJCAI 2012
A. Akbik, L. Visengeriyeva, P. Herger, H. Hemsen, A. Löser: Unsupervised Discovery of Relations and Discriminative Extraction Patterns. COLING 2012
A. Akbik, L. Visengeriyeva, J. Kirschnick, A. Löser: Effective Selectional Restrictions for Unsupervised Relation Extraction. IJCNLP 2013
J. Kirschnick, T. Kilias, H. Hemsen, A. Löser, P. Adolphs, H. Ehrig, H. Düwiger: A Marketplace for Web Scale Analytics and Text Annotation Services. COLING (Demos) 2014
T. Kilias, A. Löser, P. Andritsos: In-Database Relation Extraction. Information Systems Journal 2015
Example sentence: X and his lover Y married in Z
N-ary open information extraction: married_in(nnp:X, nnp:Y, num:Z)
Lemmatization: marry_in(nnp:X, nnp:Y, num:Z)
Unsupervised synonym resolution: tie_the_knot_in(nnp:X, nnp:Y, num:Z)
Argument type resolution: marry_in(person1:X, person2:Y, time:Z)
Entity linkage: ⋈ text.person = rdbms.customer
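The normalization steps on this slide can be traced with a toy sketch. The lookup tables (LEMMAS, SYNONYMS, ARG_TYPES), the Fact class and the customer id are hypothetical placeholders; in the actual system these mappings are learned from data rather than hard-coded.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Fact:
    predicate: str
    args: tuple  # ((argument_type, value), ...)


# Toy lookup tables (assumptions for illustration only).
LEMMAS = {"married_in": "marry_in"}
SYNONYMS = {"tie_the_knot_in": "marry_in"}
ARG_TYPES = {"marry_in": ("person1", "person2", "time")}


def lemmatize(fact):
    """Map an inflected predicate to its lemma."""
    return replace(fact, predicate=LEMMAS.get(fact.predicate, fact.predicate))


def resolve_synonym(fact):
    """Map a synonymous predicate to one canonical predicate."""
    return replace(fact, predicate=SYNONYMS.get(fact.predicate, fact.predicate))


def resolve_argument_types(fact):
    """Replace syntactic tags (nnp, num) with semantic argument types."""
    types = ARG_TYPES.get(fact.predicate)
    if types is None or len(types) != len(fact.args):
        return fact
    typed = tuple((t, value) for t, (_, value) in zip(types, fact.args))
    return replace(fact, args=typed)


def normalize(fact):
    return resolve_argument_types(resolve_synonym(lemmatize(fact)))


def link_entities(fact, customers):
    """Join person arguments against a customer lookup (name -> id)."""
    return {value: customers[value]
            for arg_type, value in fact.args
            if arg_type.startswith("person") and value in customers}


raw = Fact("married_in", (("nnp", "X"), ("nnp", "Y"), ("num", "Z")))
fact = normalize(raw)
links = link_entities(fact, {"X": 4711})  # hypothetical customer id
print(fact)
print(links)
```

Both married_in and tie_the_knot_in end up as the same marry_in fact, so a later join against the customer table does not depend on the wording of the source sentence.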
Shared Nothing + Shared Memory (from '16)
INDREX runs on the shared-nothing RDBMS Cloudera Impala. From 2016 on,
we will work on a shared-memory implementation based on SAP HANA.
GoOLAP: Analytics Search Engine
Entity Recognition with Deep Neural Networks
We develop
• character-based models for entity recognition with high recall
• unsupervised language models trained with word embeddings
• adaptive annotation using recursive neural networks (RNN)
Take Away Message
INDREX recombines text with relational data in a single system and behind your
firewall. The declarative approach has been tested with SQL programmers at
Cloudera, Zalando, Siemens and SpringerNature (and with Charité in 2016).
Further developments are:
• Shared Memory for the medical domain with SAP, DFKI and Charité
• Helper functions, such as dictionary learning, robust failure-tolerant embeddings, batch and interactive joins
• Active learning of entity linkage and relations
Feel free to test! The INDREX prototype comes as a virtual box
(Cloudera CDH 5) with a toy corpus.
Alexander Löser
Professor @ Beuth University of Applied Sciences Berlin for
Database Systems and Text-based Information Systems (DATEXIS)
Speaker of Beuth-Research-Cluster “Data Science”
Consultant for Bayer AG, IBM Inc., SpringerNature, Ebay/mobile.de,
Zalando SE, Bisnode GmbH, …
Big Data Studies for Government: BMWi, BMVIT, EU
Stations
TU-Berlin, SAP SE, IBM Almaden Research Center, HP Labs Bristol
13 international and national large research projects (EU FP 5&6, BMWi, BMBF)
Awards
Trusted Cloud (Federal Ministry for Economic Affairs and Energy 2011)
Smart Data (Federal Ministry for Economic Affairs and Energy 2014, 2x)
Smart Service World (Federal Ministry for Economic Affairs and Energy 2015, 1x)
https://prof.beuth-hochschule.de/loeser/ http://de.linkedin.com/in/loeser
http://scholar.google.com/citations?user=am2ohp0AAAAJ&hl=de
Selected Publications from the last 3 years
In-Data-Base-Text-Mining
1. Torsten Kilias, Alexander Löser, Periklis Andritsos: INDREX: In-database relation extraction. Inf. Syst. 53: 124-144 (2015)
2. Sebastian Arnold, Alexander Löser, Torsten Kilias: Resolving Common Analytical Tasks in Text Databases. DOLAP 2015: 75-84
3. Alexander Löser, Christoph Nagel, Stephan Pieper, Christoph Boden: Beyond search: Retrieving complete tuples from a text-database. Information Systems Frontiers 15(3): 311-329 (2013)
Data Marketplaces
1. Johannes Kirschnick, Torsten Kilias, Holmer Hemsen, Alexander Löser, Peter Adolphs, Heiko Ehrig, Holger Düwiger: A Marketplace for Web Scale Analytics and Text Annotation Services. COLING (Demos) 2014: 100-104
2. Alexander Muschalle, Florian Stahl, Alexander Löser, Gottfried Vossen: Pricing Approaches for Data Markets. BIRTE 2012: 129-144
Span Data Model supports RDBMS-optimizer
We represent text data with the span data model. Moreover, we provide
transformation functions from this model to the bag-of-words, sequence-based,
dependency-tree-based and relational models.
T. Kilias, A. Löser, P. Andritsos: In-Database Relation Extraction. Information Systems Journal 2015.
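A minimal sketch of what a span representation with transformation functions might look like. The Span fields, the regex tokenizer and the function names are assumptions for illustration and not the INDREX schema; the dependency-tree transformation is left out because it would need a parser.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    doc_id: int
    begin: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # annotation layer, e.g. "token" or "person"


def tokenize(doc_id, text):
    """Create one 'token' span per word (simple regex tokenizer)."""
    return [Span(doc_id, m.start(), m.end(), "token")
            for m in re.finditer(r"\w+", text)]


def covered_text(span, text):
    return text[span.begin:span.end]


def to_sequence(spans, text):
    """Sequence model: covered strings ordered by start offset."""
    return [covered_text(s, text) for s in sorted(spans, key=lambda s: s.begin)]


def to_bag_of_words(spans, text):
    """Bag-of-words model: lower-cased term counts, order discarded."""
    return Counter(word.lower() for word in to_sequence(spans, text))


def to_relation(spans, text):
    """Relational model: one tuple per span, ready for a table insert."""
    return [(s.doc_id, s.begin, s.end, s.label, covered_text(s, text))
            for s in spans]


text = "X and his lover Y married in Z"
spans = tokenize(1, text)
sequence = to_sequence(spans, text)
bag = to_bag_of_words(spans, text)
relation = to_relation(spans, text)
print(sequence)
print(relation[0])
```

Since each span is just a tuple of document id, offsets and label, spans map directly onto table rows, which is one way a relational optimizer can work with them.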