In-Database Text Mining and Fact Extraction

Challenge: How can we support explorative text data mining? Could we leverage in-database data mining?

Beuth Hochschule für Technik Berlin – Prof. Dr. habil. Alexander Löser (www.datexis.com)

Our Recent Work on Relation Extraction

Core observation: Human language has already evolved synonymous words and grammatical (syntactic) structures for expressing binary or higher-order facts.

Example: X and his lover Y married in Z

N-ary Open Information Extraction: married_in(nnp:X, nnp:Y, num:Z)
Lemmatization: marry_in(nnp:X, nnp:Y, num:Z)
Unsupervised synonym resolution: tie_the_knot_in(nnp:X, nnp:Y, num:Z)
Argument type resolution: marry_in(person:X, person:Y, time:Z)

KRAKEN: N-ary Facts in Open Information Extraction. A. Akbik, A. Löser. AKBC-WEKEX @IJCAI 2012.
Unsupervised Discovery of Relations and Discriminative Extraction Patterns. A. Akbik, L. Visengeriyeva, P. Herger, H. Hemsen, A. Löser. COLING 2012.
Effective Selectional Restrictions for Unsupervised Relation Extraction. A. Akbik, L. Visengeriyeva, J. Kirschnick and A. Löser. IJCNLP 2013.

Span Data Model

We represent text data with the span data model. Moreover, we provide transformation functions from this model to the bag-of-words, sequence-based, dependency-tree-based and relational models.

T. Kilias, A. Löser, P. Andritsos: In-Database Relation Extraction. Information Systems Journal 2015 (to appear)

INDREX: Markus Lanz in 2012 (Example)

INDREX permits SQL developers to mine facts from text data. We support three operator classes: (1) per-document extraction, (2) cross-corpus extraction and aggregations, and (3) joining text data with existing structured data.

INDREX: Example (2)

The query extracts relations from news that likely represent acquisitions.

INDREX: Example (3)

INDREX permits text mining and SQL functionality, such as aggregations, in a single system. It benefits from built-in RDBMS optimizations.

INDREX @ Cloudera Impala

We observe nearly two orders of magnitude faster execution times on a Parquet/Impala-based system than on a Hadoop/Pig system. (Setup for both systems: annotated Reuters NIST text corpus.)

What's Next?

Facts-As-You-Type. Learning join predicates.

Infoboxes: Facts-as-You-Type

Our prototype extracts relations while you type. Can we piggyback on user feedback for learning join predicates?

https://dbl43.beuth-hochschule.de/infoboxes

Our prototype uses the open information extraction system ClausIE.
Luciano Del Corro and Rainer Gemulla: ClausIE: Clause-Based Open Information Extraction. WWW 2013
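
Sketch: spans and the three INDREX operator classes in plain SQL

To make the three operator classes named on the INDREX slides more concrete (per-document extraction, cross-corpus aggregation, joining with structured data), here is a minimal sketch that uses Python's standard-library sqlite3 module rather than INDREX itself. The span table layout, the acquisition view, the company table, the trigger verbs and the toy sentences are all assumptions made for illustration; they are not taken from the INDREX paper or its actual schema.

```python
import sqlite3

# Toy corpus (illustrative sentences, not from the slides).
DOCS = {
    1: "Oracle acquired Sun. Sun builds servers.",
    2: "Google bought YouTube.",
    3: "Oracle acquired Sun in 2010.",
}

# Hand-made annotations: (doc_id, span kind, surface text, offset to search from).
ANNOTATIONS = [
    (1, "sentence", "Oracle acquired Sun.", 0),
    (1, "org", "Oracle", 0), (1, "verb", "acquired", 0), (1, "org", "Sun", 0),
    (1, "sentence", "Sun builds servers.", 20),
    (1, "org", "Sun", 20), (1, "verb", "builds", 20),
    (2, "sentence", "Google bought YouTube.", 0),
    (2, "org", "Google", 0), (2, "verb", "bought", 0), (2, "org", "YouTube", 0),
    (3, "sentence", "Oracle acquired Sun in 2010.", 0),
    (3, "org", "Oracle", 0), (3, "verb", "acquired", 0), (3, "org", "Sun", 0),
]

# Hypothetical extraction rule: ORG, trigger verb, ORG, in that order,
# all contained in the same sentence span of the same document.
ACQUISITION_VIEW = """
CREATE VIEW acquisition AS
SELECT s.doc_id AS doc_id, a1.surface AS buyer, a2.surface AS target
FROM span AS s
JOIN span AS a1 ON a1.doc_id = s.doc_id AND a1.kind = 'org'
               AND a1.start_pos >= s.start_pos AND a1.end_pos <= s.end_pos
JOIN span AS v  ON v.doc_id = s.doc_id AND v.kind = 'verb'
               AND v.start_pos >= a1.end_pos AND v.end_pos <= s.end_pos
JOIN span AS a2 ON a2.doc_id = s.doc_id AND a2.kind = 'org'
               AND a2.start_pos >= v.end_pos AND a2.end_pos <= s.end_pos
WHERE s.kind = 'sentence' AND v.surface IN ('acquired', 'bought')
"""


def build_db():
    """Load spans (typed character intervals) and a small structured table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE span (doc_id INTEGER, kind TEXT, "
        "start_pos INTEGER, end_pos INTEGER, surface TEXT)"
    )
    for doc_id, kind, surface, search_from in ANNOTATIONS:
        start = DOCS[doc_id].find(surface, search_from)
        conn.execute(
            "INSERT INTO span VALUES (?, ?, ?, ?, ?)",
            (doc_id, kind, start, start + len(surface), surface),
        )
    conn.execute(ACQUISITION_VIEW)
    # Existing structured data to join against (toy values).
    conn.execute("CREATE TABLE company (name TEXT, country TEXT)")
    conn.executemany(
        "INSERT INTO company VALUES (?, ?)", [("Oracle", "US"), ("Google", "US")]
    )
    return conn


def per_document(conn):
    """Operator class 1: one candidate fact per matching sentence."""
    return conn.execute(
        "SELECT doc_id, buyer, target FROM acquisition ORDER BY doc_id"
    ).fetchall()


def across_corpus(conn):
    """Operator class 2: aggregate candidate facts over all documents."""
    return conn.execute(
        "SELECT buyer, target, COUNT(*) AS mentions FROM acquisition "
        "GROUP BY buyer, target ORDER BY mentions DESC, buyer"
    ).fetchall()


def with_structured_data(conn):
    """Operator class 3: join extracted facts with an existing table."""
    return conn.execute(
        "SELECT a.buyer, c.country, COUNT(*) FROM acquisition AS a "
        "JOIN company AS c ON c.name = a.buyer "
        "GROUP BY a.buyer, c.country ORDER BY a.buyer"
    ).fetchall()


if __name__ == "__main__":
    db = build_db()
    print(per_document(db))
    print(across_corpus(db))
    print(with_structured_data(db))
```

Because the extraction rule is an ordinary view, the aggregation and the join are plain SQL on top of it, and the database's own query optimizer plans them. That is the property the slides attribute to running extraction inside the RDBMS, shown here only at toy scale.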