Agenda

The following work is a prototype financed by the BMWi in the projects Smart Data Web and MACSS. This talk focuses on:
• Digitization: Insights from text and relational data
• Our Work: In-Database Text Mining (INDREX)
• What's next? Learning to Join between Text and Relations

Beuth Hochschule für Technik Berlin – Alexander Löser – www.datexis.com

Why Digitization Again?

Exclusive signals help us and our machines optimize our core business.
1. Live longer and healthier, maximize therapy adherence.
2. Understand what your customer will likely buy, now and tomorrow.
3. Given our understanding of our customer, we optimize
   • output for given human labor
   • natural resource consumption
   • technology development and usage
   … until “good enough”
These optimizations may create cost savings in our organization. Better: We sell a platform that enables these optimizations for our customers.

Data-driven Product Examples

We conduct BMWi projects (Smart Data Web, ExCELL and MACSS) and industry-funded projects with our partners.
Siemens SE, Zalando SE, ubermetrics GmbH: All about my brands, products, suppliers, competitors, acquisition targets … from blogs, forums, news, … and please, fresh!
Charité, SAP SE: Interactive medical systems, such as patient diary, anamnesis, all about me; correlate information with signals from medical systems
SpringerNature: Return condensed tables to scientists instead of forcing them to read publications
We heard many more: fraud, CRM, call center support, due diligence of M&A, legal information from the Web, …

Image Sources:
Brinki (http://www.flickr.com/photos/brinkmann/493590524/) [CC BY-SA 2.0]
Heidelberger Life-Science Lab (Heidelberger Life-Science Lab) [CC BY-SA 3.0]
Paul Goyette (http://www.flickr.com/photos/pgoyette/168076182/) [CC BY-SA 2.0]

The typical Data Science Process

Data-driven products implement an iterative process. All about leads, brands, suppliers, patients, diseases, …
[Figure: iterative cycle with the steps “Sample from raw data”, “Cleanse and recombine samples”, “Learn model”, “Test model”, “Iterate and resample”, leading to the “Data-driven product”]
Platform: Distributions (IBM BigInsights, Cloudera, MapR, Hortonworks), main-memory databases or cloud providers (AWS etc.)

INDREX leverages existing Language Models

INDREX builds a universal relation on top of the DWH and text data. The single system benefits from built-in features of the RDBMS (columnar layout, security, views, optimizer, transactional behavior, main memory, …).

A. Akbik, A. Löser: KRAKEN: N-ary Facts in Open Information Extraction. AKBC-WEKEX @IJCAI 2012
A. Akbik, L. Visengeriyeva, P. Herger, H. Hemsen, A. Löser: Unsupervised Discovery of Relations and Discriminative Extraction Patterns. COLING 2012
A. Akbik, L. Visengeriyeva, J. Kirschnick, A. Löser: Effective Selectional Restrictions for Unsupervised Relation Extraction. IJCNLP 2013
J. Kirschnick, T. Kilias, H. Hemsen, A. Löser, P. Adolphs, H. Ehrig, H. Düwiger: A Marketplace for Web Scale Analytics and Text Annotation Services. COLING (Demos) 2014
T. Kilias, A. Löser, P.
Andritsos: In-Database Relation Extraction. Information Systems Journal 2015

Example sentence: X and his lover Y married in Z
N-ary open information extraction: married_in(nnp:X, nnp:Y, num:Z)
Lemmatization: marry_in(nnp:X, nnp:Y, num:Z)
Unsupervised synonym resolution: tie_the_knot_in(nnp:X, nnp:Y, num:Z)
Argument type resolution: marry_in(person1:X, person2:Y, time:Z)
Entity linkage: ⋈ text.person = rdbms.customer

Shared Nothing + Shared Memory (from '16)

INDREX runs on the Shared Nothing RDBMS Cloudera Impala. From 2016 on, we will work on a Shared Memory implementation based on SAP HANA.

GoOLAP: Analytics Search Engine

Entity Recognition with Deep Neural Networks

We develop
• character-based models for entity recognition with high recall
• unsupervised language models trained with word embeddings
• adaptive annotation using recursive neural networks (RNN)

Take Away Message

INDREX recombines text with relational data in a single system and behind your firewall. The declarative approach has been tested with SQL programmers, together with Cloudera, Zalando, Siemens and SpringerNature (and with Charité in 2016).
Further developments are:
• Shared Memory for the medical domain with SAP, DFKI and Charité
• Helper functions, such as dictionary learning, robust failure-tolerant embeddings, batch and interactive joins
• Active learning of entity linkage and relations
Feel free to test! The INDREX prototype comes as a VirtualBox image (Cloudera CDH 5) and with a toy corpus.
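As a rough illustration of how extracted facts can be recombined with relational data, the following Python sketch normalizes the example fact from the extraction pipeline (lemmatization, synonym resolution, argument type resolution) and then joins it with a customer table. It is not the INDREX implementation: the lookup tables, the table names text_fact and customer, the sample names, and SQLite standing in for the RDBMS are all assumptions made for illustration.

```python
import sqlite3
from dataclasses import dataclass

# Toy lookup tables. All entries are illustrative assumptions, not INDREX resources.
LEMMAS = {"married_in": "marry_in"}
SYNONYMS = {"tie_the_knot_in": "marry_in"}
ARG_TYPES = {"marry_in": ("person1", "person2", "time")}


@dataclass(frozen=True)
class Fact:
    predicate: str
    args: tuple  # ((type_tag, value), ...)


def normalize(fact: Fact) -> Fact:
    """Lemmatize the predicate, map synonyms to one name, then retype the arguments."""
    pred = LEMMAS.get(fact.predicate, fact.predicate)
    pred = SYNONYMS.get(pred, pred)
    types = ARG_TYPES.get(pred)
    if types is None or len(types) != len(fact.args):
        return Fact(pred, fact.args)  # no type signature known: keep the original tags
    return Fact(pred, tuple((t, v) for t, (_, v) in zip(types, fact.args)))


def link_to_customers(facts, customers):
    """Join the first person argument of each fact with a relational customer table."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE text_fact (predicate TEXT, person TEXT, time TEXT)")
    con.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
    con.executemany("INSERT INTO customer VALUES (?, ?)", customers)
    for f in facts:
        values = dict(f.args)
        con.execute(
            "INSERT INTO text_fact VALUES (?, ?, ?)",
            (f.predicate, values.get("person1"), values.get("time")),
        )
    rows = con.execute(
        "SELECT c.id, t.predicate, t.time FROM text_fact t "
        "JOIN customer c ON t.person = c.name ORDER BY c.id"
    ).fetchall()
    con.close()
    return rows


# Hypothetical sample data standing in for X, Y and Z.
raw = Fact("married_in", (("nnp", "Alice Example"), ("nnp", "Bob Example"), ("num", "2010")))
fact = normalize(raw)
rows = link_to_customers([fact], [(1, "Alice Example"), (2, "Carol Example")])
print(fact)
print(rows)
```

The join here uses exact string equality on names, which is the simplest possible linkage rule. A real entity linkage step would have to cope with name variants and ambiguity.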
Alexander Löser

Professor @ Beuth University of Applied Sciences Berlin for Database Systems and Text-based Information Systems (DATEXIS)
Spokesperson of the Beuth research cluster “Data Science”
Consultant for Bayer AG, IBM Inc., SpringerNature, eBay/mobile.de, Zalando SE, Bisnode GmbH, …
Big Data Studies for Government: BMWi, BMVIT, EU
Stations: TU Berlin, SAP SE, IBM Almaden Research Center, HP Labs Bristol
13 large international and national research projects (EU FP 5&6, BMWi, BMBF)
Awards:
• Trusted Cloud (Federal Ministry for Economic Affairs and Energy 2011)
• Smart Data (Federal Ministry for Economic Affairs and Energy 2014, 2x)
• Smart Service World (Federal Ministry for Economic Affairs and Energy 2015, 1x)
https://prof.beuth-hochschule.de/loeser/
http://de.linkedin.com/in/loeser
http://scholar.google.com/citations?user=am2ohp0AAAAJ&hl=de

Selected Publications from the last 3 years

In-Database Text Mining
1. Torsten Kilias, Alexander Löser, Periklis Andritsos: INDREX: In-database relation extraction. Inf. Syst. 53: 124-144 (2015)
2. Sebastian Arnold, Alexander Löser, Torsten Kilias: Resolving Common Analytical Tasks in Text Databases. DOLAP 2015: 75-84
3. Alexander Löser, Christoph Nagel, Stephan Pieper, Christoph Boden: Beyond search: Retrieving complete tuples from a text-database. Information Systems Frontiers 15(3): 311-329 (2013)

Data Marketplaces
1. Johannes Kirschnick, Torsten Kilias, Holmer Hemsen, Alexander Löser, Peter Adolphs, Heiko Ehrig, Holger Düwiger: A Marketplace for Web Scale Analytics and Text Annotation Services. COLING (Demos) 2014: 100-104
2. Alexander Muschalle, Florian Stahl, Alexander Löser, Gottfried Vossen: Pricing Approaches for Data Markets.
BIRTE 2012: 129-144

Span Data Model supports RDBMS-optimizer

We represent text data with the span data model. Moreover, we provide transformation functions from this model to the bag-of-words, sequence-based, dependency-tree-based and relational models.

T. Kilias, A. Löser, P. Andritsos: In-Database Relation Extraction. Information Systems Journal 2015.
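To make the idea of a span model with transformation functions more concrete, here is a minimal Python sketch. It assumes that a span is a (document id, begin offset, end offset, label) tuple over the raw text; the slides do not spell out the exact schema, so the Span type, the whitespace tokenizer and all function names are hypothetical. The sketch derives a bag-of-words view and a sequence view from the same spans.

```python
from collections import Counter
from typing import NamedTuple


class Span(NamedTuple):
    doc_id: int
    begin: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # annotation type, e.g. "token" (assumed label set)


def text_of(span, docs):
    """Return the surface text that a span covers."""
    return docs[span.doc_id][span.begin:span.end]


def tokenize(doc_id, text):
    """Naive whitespace tokenizer that emits one 'token' span per word."""
    spans, pos = [], 0
    for word in text.split():
        begin = text.index(word, pos)
        pos = begin + len(word)
        spans.append(Span(doc_id, begin, pos, "token"))
    return spans


def bag_of_words(spans, docs):
    """Transformation to the bag-of-words model: lowercased token counts."""
    return Counter(text_of(s, docs).lower() for s in spans if s.label == "token")


def as_sequence(spans, docs):
    """Transformation to the sequence model: token texts in document order."""
    ordered = sorted((s for s in spans if s.label == "token"), key=lambda s: (s.doc_id, s.begin))
    return [text_of(s, docs) for s in ordered]


def contains(outer, inner):
    """Span predicate: True if 'inner' lies completely inside 'outer'."""
    return outer.doc_id == inner.doc_id and outer.begin <= inner.begin and inner.end <= outer.end


docs = {0: "X and his lover Y married in Z"}
tokens = tokenize(0, docs[0])
bag = bag_of_words(tokens, docs)
sequence = as_sequence(tokens, docs)
sentence = Span(0, 0, len(docs[0]), "sentence")
print(bag)
print(sequence)
```

Because each span is a flat tuple of a document id and two integer offsets, a set of spans can be stored as ordinary table rows, and predicates such as contains become simple comparisons on the offset columns.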