In-Database Text Mining and Fact Extraction
Challenge: How can we support explorative text data mining?
Could we leverage in-database data mining?
Beuth Hochschule für Technik Berlin – Prof. Dr. habil. Alexander Löser (www.datexis.com)
Our recent Work on Relation Extraction
Core observation: Human language has already evolved synonymous words and
grammatical (syntactic) structures for expressing binary or higher-order facts.
X and his lover Y married in Z
KRAKEN: N-ary Facts in Open Information Extraction. A. Akbik, A. Löser. AKBC-WEKEX @IJCAI 2012.
Unsupervised Discovery of Relations and Discriminative Extraction Patterns. A. Akbik, L. Visengeriyeva, P. Herger, H. Hemsen, A. Löser. COLING 2012.
Effective Selectional Restrictions for Unsupervised Relation Extraction. A. Akbik, L. Visengeriyeva, J. Kirschnick, A. Löser. IJCNLP 2013.
N-ary open information extraction: married_in(nnp:X, nnp:Y, num:Z)
Lemmatization: marry_in(nnp:X, nnp:Y, num:Z)
Unsupervised synonym resolution: tie_the_knot_in(nnp:X, nnp:Y, num:Z)
Argument type resolution: marry_in(person:X, person:Y, time:Z)
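The normalization steps listed above can be sketched in a few lines of Python. This is a toy illustration, not the authors' implementation: the lookup tables LEMMAS, SYNONYMS and ARG_TYPES are hand-written stand-ins for what the cited systems learn without supervision, and the sketch assumes that synonym resolution maps a variant such as tie_the_knot_in onto marry_in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    predicate: str
    args: tuple  # tuple of (type, value) pairs, e.g. (("nnp", "X"), ...)


# Hypothetical hand-written resources; the cited systems learn these.
LEMMAS = {"married": "marry"}
SYNONYMS = {"tie_the_knot_in": "marry_in"}
ARG_TYPES = {"marry_in": ("person", "person", "time")}


def lemmatize(fact):
    # Lemmatize only the head word of the predicate: married_in -> marry_in.
    head, *rest = fact.predicate.split("_")
    return Fact("_".join([LEMMAS.get(head, head), *rest]), fact.args)


def resolve_synonym(fact):
    # Map synonymous predicates onto one canonical predicate.
    return Fact(SYNONYMS.get(fact.predicate, fact.predicate), fact.args)


def resolve_types(fact):
    # Replace part-of-speech tags with semantic argument types, if known.
    types = ARG_TYPES.get(fact.predicate)
    if types is None or len(types) != len(fact.args):
        return fact
    typed = tuple((t, value) for t, (_, value) in zip(types, fact.args))
    return Fact(fact.predicate, typed)


def normalize(fact):
    return resolve_types(resolve_synonym(lemmatize(fact)))


def render(fact):
    arg_text = ", ".join(t + ":" + v for t, v in fact.args)
    return fact.predicate + "(" + arg_text + ")"


raw = Fact("married_in", (("nnp", "X"), ("nnp", "Y"), ("num", "Z")))
variant = Fact("tie_the_knot_in", (("nnp", "X"), ("nnp", "Y"), ("num", "Z")))
print(render(normalize(raw)))
print(render(normalize(variant)))
```

Both surface forms end up as the same typed fact, which is what makes later joins and aggregations over extracted facts possible.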
Span Data Model
We represent text data with the span data model. Moreover, we provide
transformation functions from this model to the bag of word, sequence based,
dependency-tree based and relational model.
T. Kilias, A. Löser, P. Andritsos: In-Database Relation Extraction. Information Systems Journal 2015 (to appear)
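As a rough sketch of the idea, a span can be modeled as a typed character interval over a document, and the other representations can be derived from it. The Span class, the layer names and the transformation functions below are hypothetical and only illustrate the principle; the cited paper defines the actual model. The dependency-tree transformation is omitted here.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    doc_id: int
    begin: int  # inclusive character offset
    end: int    # exclusive character offset
    layer: str  # annotation layer, e.g. "token" or "person"


def text_of(doc, span):
    return doc[span.begin:span.end]


def tokenize(doc_id, doc):
    # Naive word tokenizer that emits one token span per match.
    return [Span(doc_id, m.start(), m.end(), "token")
            for m in re.finditer(r"\w+", doc)]


def to_sequence(doc, spans):
    # Sequence model: token texts in document order.
    tokens = sorted((s for s in spans if s.layer == "token"),
                    key=lambda s: s.begin)
    return [text_of(doc, s) for s in tokens]


def to_bag_of_words(doc, spans):
    # Bag-of-words model: token counts, order discarded.
    return Counter(word.lower() for word in to_sequence(doc, spans))


def to_relational(spans):
    # Relational model: one row per span.
    return [(s.doc_id, s.begin, s.end, s.layer) for s in spans]


doc = "X and his lover Y married in Z"
spans = tokenize(1, doc)
print(to_sequence(doc, spans))
print(to_bag_of_words(doc, spans))
print(to_relational(spans)[:2])
```

Keeping offsets rather than copied strings lets several annotation layers (tokens, sentences, entities) refer to the same text without duplication.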
INDREX: Markus Lanz in 2012 (Example)
INDREX permits SQL developers to mine facts from text data. We support three
operator classes: (1) per-document extraction, (2) cross corpus extraction and
aggregations and (3) joining text data with existing structured data.
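To make operator classes (1) and (3) concrete, here is a minimal sketch that uses Python's built-in sqlite3 module in place of the RDBMS. The span table, its column names and the known_person table are assumptions made for this illustration and are not the INDREX schema; the document text is simply the slide title.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: one row per annotated span.
    CREATE TABLE span (doc_id INTEGER, layer TEXT,
                       begin_pos INTEGER, end_pos INTEGER, text TEXT);
    -- Hypothetical structured table that already exists in the database.
    CREATE TABLE known_person (person_id INTEGER, name TEXT);
""")

# Spans over the document text "Markus Lanz in 2012".
conn.executemany(
    "INSERT INTO span VALUES (?, ?, ?, ?, ?)",
    [
        (1, "sentence", 0, 19, "Markus Lanz in 2012"),
        (1, "person", 0, 11, "Markus Lanz"),
        (1, "date", 15, 19, "2012"),
    ],
)
conn.execute("INSERT INTO known_person VALUES (1, 'Markus Lanz')")

# (1) Per-document extraction: a person span followed by a date span,
#     both contained in the same sentence span.
# (3) Join of the extracted person with existing structured data.
rows = conn.execute("""
    SELECT k.person_id, p.text, d.text
    FROM span s
    JOIN span p ON p.doc_id = s.doc_id AND p.layer = 'person'
               AND p.begin_pos >= s.begin_pos AND p.end_pos <= s.end_pos
    JOIN span d ON d.doc_id = s.doc_id AND d.layer = 'date'
               AND d.begin_pos >= p.end_pos AND d.end_pos <= s.end_pos
    JOIN known_person k ON k.name = p.text
    WHERE s.layer = 'sentence'
""").fetchall()
print(rows)
```

Expressing span containment as ordinary join predicates is what lets the database optimizer treat extraction like any other query.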
INDREX: Example (2)
The query extracts relations from news that likely represent acquisitions.
INDREX: Example (3)
INDREX permits text mining and SQL functionality, such as aggregations, in a
single system. It benefits from built-in RDBMS optimizations.
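The combination of extraction results and SQL aggregation can be sketched as follows, again with sqlite3 standing in for the RDBMS. The fact table, its columns and the placeholder company names are invented for this example; the sketch assumes that extracted relations have already been stored as rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding facts extracted from a news corpus.
conn.execute(
    "CREATE TABLE fact (doc_id INTEGER, predicate TEXT, arg1 TEXT, arg2 TEXT)"
)
conn.executemany(
    "INSERT INTO fact VALUES (?, ?, ?, ?)",
    [
        (1, "acquire", "CompanyA", "CompanyB"),
        (2, "acquire", "CompanyA", "CompanyC"),
        (3, "acquire", "CompanyD", "CompanyE"),
        (3, "marry_in", "X", "Y"),
    ],
)

# Cross-corpus aggregation: number of acquisition facts per acquirer.
rows = conn.execute("""
    SELECT arg1 AS acquirer, COUNT(*) AS acquisitions
    FROM fact
    WHERE predicate = 'acquire'
    GROUP BY arg1
    ORDER BY acquisitions DESC, acquirer
""").fetchall()
print(rows)
```

Because the facts live in ordinary tables, grouping, ordering and filtering need no export step to a separate analytics system.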
INDREX @ Cloudera IMPALA
We observe execution times nearly two orders of magnitude faster on a
Parquet/Impala-based system than on a Hadoop/Pig system.
(Setup for both systems: annotated Reuters NIST text corpus.)
Facts-As-You-Type. Learning join predicates.
WHAT'S NEXT?
Infoboxes: Facts-as-You-Type
Our prototype extracts relations while you type. Can we piggyback on user feedback
to learn join predicates? https://dbl43.beuth-hochschule.de/infoboxes
Our prototype uses the open information extraction system ClausIE.
Luciano Del Corro and Rainer Gemulla: ClausIE: Clause-Based Open Information Extraction. WWW 2013