In-Database Analytics: Predictive Analytics, Data Mining, R and Oracle Business Intelligence Detlef E. Schröder Leitender Systemberater STCC DB Mitte [email protected] www.oracledwh.de Copyright 2011 Oracle Corporation Agenda • • • • • • • • Überblick Oracle Datamining (ODM) Integration in den SQL Developer 3.X Demo ODM im SQL Dev 3.0 Integration mit OBIEE Demo ODM in OBIEE Statistische Funktionen in der DB (free) Oracle Enterprise Data Mining inkl. R Learn More Copyright 2011 Oracle Corporation Easier Überblick ODM Copyright 2011 Oracle Corporation Data-Mining laut Wikipedia Unter Data-Mining (der englische Begriff bedeutet etwa „aus einem Datenberg etwas Wertvolles extrahieren“, eine adäquate deutsche Übersetzung existiert nicht.[1] Der Duden empfiehlt die Schreibweise „Data-Mining“[2]) versteht man die systematische Anwendung von Methoden, die meist statistisch-mathematisch begründet sind, auf einen Datenbestand mit dem Ziel, neue Muster zu erkennen. Hierbei geht es auch um die Verarbeitung sehr großer Datenbestände (die nicht mehr manuell verarbeitet werden könnten), wofür effiziente Methoden benötigt werden, deren Zeitkomplexität sie für solche Datenmengen geeignet macht. Die Methoden finden aber auch für kleinere Datenmengen Anwendung. In der Praxis, vor allem im deutschen Sprachgebrauch, etablierte sich der angelsächsische Begriff „DataMining“ für den gesamten Prozess der so genannten „Knowledge Discovery in Databases“ (Wissensentdeckung in Datenbanken; KDD), der auch Schritte wie die Vorverarbeitung beinhaltet, während Data-Mining eigentlich nur den Analyseschritt des Prozesses bezeichnet.[3] Die reine Erfassung, Speicherung und Verarbeitung von großen Datenmengen wird gelegentlich fälschlicherweise auch mit dem Buzzword Data-Mining bezeichnet. Hier gibt es aber akkuratere Begriffe wie beispielsweise Data Warehousing. Korrekt verwendet bezeichnet es die Extraktion von Wissen, das „gültig (im statistischen Sinne), bisher unbekannt und potentiell nützlich“[4] ist „zur Bestimmung bestimmter Regelmäßigkeiten, Gesetzmäßigkeiten und verborgener Zusammenhänge“.[2] Fayyad definiert es als „ein Schritt des KDD-Prozesses, der darin besteht Datenanalyse- und Entdeckungsalgorithmen anzuwenden, die unter akzeptablen Effizienzbegrenzungen eine spezielle Auflistung von Mustern (oder Modellen) der Daten liefern“.[3] Copyright 2011 Oracle Corporation Ein Beispiel •Wann wird Vollmilch- oder Zartbitter-Schokolade gekauft? •Beobachtung der letzten Verkäufe Alter Geschlecht Wetter Schokolade 30-50 M Sonnig Vollmilch <30 W Regen Zartbitter >60 M Regen Vollmilch 30-50 M Regen Zartbitter •Ziel: Vorhersage der Schokolade aufgrund früherer Beobachtungen Copyright 2011 Oracle Corporation Vorhersage Alter Geschlecht Wetter Schokolade 30-50 M Sonnig Vollmilch <30 W Regen Zartbitter >60 M Regen Vollmilch 30-50 M Regen Zartbitter <30 M Regen ??? •Erster Versuch: Regeln ableiten Copyright 2011 Oracle Corporation Entscheidungsbaum Geschlecht M W Alter <30 Wetter Regen Vollmilch Alter <30-50 Wetter >60 Wetter Sonne Zartbitter Was, wenn immer mehr Beobachtungen kommen, Neue Werte, Wie groß ist der Faktor-Einfluss, Reihenfolge, ... Copyright 2011 Oracle Corporation Oracle Data Mining Algorithms Problem Algorithm Classification Logistic Regression (GLM) Decision Trees Naïve Bayes Support Vector Machine Multiple Regression (GLM) Support Vector Machine Regression Anomaly Detection Attribute Importance Association Rules Clustering Feature Extraction One Class SVM Minimum Description Length (MDL) A1 A2 A3 A4 A5 A6 A7 Apriori Hierarchical K-Means Hierarchical O-Cluster NonNegative Matrix Factorization F1 F2 F3 F4 Copyright 2011 Oracle Corporation Applicability Classical statistical technique Popular / Rules / transparency Embedded app Wide / narrow data / text Classical statistical technique Wide / narrow data / text Lack examples of target field Attribute reduction Identify useful data Reduce data noise Market basket analysis Link analysis Product grouping Text mining Gene and protein analysis Text analysis Feature reduction Oracle Data Mining Decision Trees • Classification, Prediction, Patient “profiling” Simple DM model: Age >45 <45 Age Status No Infection Can include unstructured data (e.g. physician’s comments), transactions data (e.g. lab results & longitudinal data), etc. Infection >35 Temp <100 Gender >100 Risk = 0 F M <=35 Days ICU >4 <=4 Risk = 1 Risk = 0 Risk = 1 Risk = 0 Risk = 1 IF (Age > 45 AND Status = Infection AND Temp = >100) THEN Probability(High Risk=1) = .77 Support = 250 Copyright 2011 Oracle Corporation Data Mining Provides Better Information, Valuable Insights and Predictions Cell Phone Churners vs. Loyal Customers Segment #3: IF CUST_MO > 7 AND INCOME < $175K, THEN Prediction = Cell Phone Churner, Confidence = 83%, Support = 6/39 Insight & Prediction Segment #1: IF CUST_MO > 14 AND INCOME < $90K, THEN Prediction = Cell Phone Churner, Confidence = 100%, Support = 8/39 Customer Months Source: Inspired from Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management by Michael J. A. Berry, Gordon S. Linoff Copyright 2011 Oracle Corporation Data Mining Provides Better Information, Valuable Insights and Predictions Cell Phone Fraud vs. Loyal Customers ? Customer Months Source: Inspired from Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management by Michael J. A. Berry, Gordon S. Linoff Copyright 2011 Oracle Corporation Oracle Data Mining Anomaly Detection Problem: Detect rare cases • “One-Class” SVM Models • • • • • Fraud, noncompliance Outlier detection Network intrusion detection Disease outbreaks Rare events, true novelty Copyright 2011 Oracle Corporation Typical Data Mining Use Cases • Retail · Customer segmentation · Response modeling · Recommend next likely product · Profile high value customers • Banking · Credit scoring · Probability of default · Customer profitability · Customer targeting • Insurance · Risk factor identification · Claims fraud · Policy bundling · Employee retention • Higher Education · Alumni donations · Student acquisition · Student retention · At-risk student identification • Healthcare · Patient procedure recommendation · Patient outcome prediction · Fraud detection · Doctor & nurse note analysis • Life Sciences · Drug discovery & interaction · Common factors in (un)healthy patients · Cancer cell classification · Drug safety surveillance • Telecommunications · Customer churn · Identify cross-sell opportunities · Network intrusion detection • Public Sector · Taxation fraud & anomalies · Crime analysis · Pattern recognition in military surveillance Copyright 2011 Oracle Corporation • Manufacturing · Root cause analysis of defects · Warranty analysis · Reliability analysis · Yield analysis • Automotive · Feature bundling for customer segments · Supplier quality analysis · Problem diagnosis • Chemical · New compound discovery · Molecule clustering · Product yield analysis • Utilities · Predict power line / equipment failure · Product bundling · Consumer fraud detection Datamining Klassifikation Predictive Descriptive • Classification • Regression • TimeSeries Analysis • Prediction • Clustering • Summarization • Association Rules • Sequence Discovery Copyright 2011 Oracle Corporation Data-Mining als Prozess • Daten auswählen • Daten bereiningen • Daten Vorbereiten • Data-Mining im engeren Sinne • Interpretation und Anwendung Copyright 2011 Oracle Corporation Datenvorbereitung Descriptive Datenanalyse • Welcher Wert wie häufig? • Bestimmte Häufigkeitsverteilungen • Korrelation von Attributen • Welche Attribute Trennen Klassen wie gut? • Mittelwert, Median, Mode • (Standard-) Verteilung, Sigma • Bsp DataDiscovery.sql Copyright 2011 Oracle Corporation Oracle—Hardware and Software Engineered to Work Together • Oracle is the world's most complete, open, and integrated business software and hardware systems company • Data Warehousing, VLDB and ILM • Oracle R Enterprise • R for the Enterprise • Oracle Data Mining Option • 12- in-DB data mining algorithms Oracle has taught the Database how to do Advanced Math/Statistics/Data Mining, and more… Copyright 2011 Oracle Corporation What is Data Mining? • Automatically sifts through data to find hidden patterns, discover new insights, and make predictions • Data Mining can provide valuable results: • • • • Predict customer behavior (Classification) Predict or estimate a value (Regression) Segment a population (Clustering) Identify factors more associated with a business problem (Attribute Importance) • Find profiles of targeted people or items (Decision Trees) • Determine important relationships and “market baskets” within the population (Associations) • Find fraudulent or “rare events” (Anomaly Detection) Copyright 2011 Oracle Corporation • 12 years “stem celling analytics” into Oracle • Designed advanced analytics into database kernel to leverage relational database strengths • Naïve Bayes and Association Rules—1st algorithms added • Leverages counting, conditional probabilities, and much more • Now, analytical database platform • 12 cutting edge machine learning algorithms and 50+ statistical functions • A data mining model is a schema object in the database, built via a PL/SQL API and scored via built-in SQL functions. • When building models, leverage existing scalable technology • (e.g., parallel execution, bitmap indexes, aggregation techniques) and add new core database technology (e.g., recursion within the parallel infrastructure, IEEE float, etc.) • True power of embedding within the database is evident when scoring models using built-in SQL functions (incl. Exadata) select cust_id from customers where region = ‘US’ and prediction_probability(churnmod, ‘Y’ using *) > 0.8; Copyright 2011 Oracle Corporation In-Database Data Mining Traditional Analytics Oracle Data Mining Results Data Import Data Mining Model “Scoring” Data Preparation and Transformation Savings Data Mining Model Building Data Prep & Transformation Model “Scoring” Data remains in the Database Embedded data preparation Data Extraction Cutting edge machine learning algorithms inside the SQL kernel of Database Model “Scoring” Embedded Data Prep Model Building Data Preparation Hours, Days or Weeks Source Data • Faster time for “Data” to “Insights” • Lower TCO—Eliminates • Data Movement • Data Duplication • Maintains Security Dataset s/ Work Area Analytic al Process ing Process Output Target Secs, Mins or Hours SQL—Most powerful language for data preparation and transformation Data remains in the Database Copyright 2011 Oracle Corporation You Can Think of It Like This… Traditional SQL Oracle Data Mining • “Human-driven” queries • Domain expertise • Any “rules” must be defined and managed • SQL Queries • • • • • • • • SELECT DISTINCT AGGREGATE WHERE AND OR GROUP BY ORDER BY RANK • Automated knowledge discovery, model building and deployment • Domain expertise to assemble the “right” data to mine + • ODM “Verbs” • • • • • • • • PREDICT DETECT CLUSTER CLASSIFY REGRESS PROFILE IDENTIFY FACTORS ASSOCIATE Copyright 2011 Oracle Corporation ODM im SQLDev3.X Copyright 2011 Oracle Corporation SQL Developer 3.0/ Oracle Data Miner 11g Release 2 GUI • Graphical User Interface for data analyst • SQL Developer Extension (OTN download) • Explore data— discover new insights • Build and evaluate data mining models • Apply predictive models • Share analytical workflows • Deploy SQL Apply code/scripts Copyright 2011 Oracle Corporation Oracle Data Miner Nodes (Partial List) Tables and Views Transformations Explore Data Modeling Text Copyright 2011 Oracle Corporation Oracle Data Mining and Unstructured Data • Oracle Data Mining mines unstructured i.e. “text” data • Include free text and comments in ODM models • Cluster and Classify documents • Oracle Text used to preprocess unstructured text Copyright 2011 Oracle Corporation Easier Star Demo Copyright 2011 Oracle Corporation Exadata + Data Mining 11g Release 2 “DM Scoring” Pushed to Storage! Faster • In 11g Release 2, SQL predicates and Oracle Data Mining models are pushed to storage level for execution For example, find the US customers likely to churn: select cust_id from customers Scoring function executed in Exadata where region = ‘US’ and prediction_probability(churnmod,‘Y’ using *) > 0.8; Copyright 2011 Oracle Corporation Real-time Prediction for a Customer • On-the-fly, single record apply with new data (e.g. from call center) Select prediction_probability(CLAS_DT_5_2, 'Yes' USING 7800 as bank_funds, 125 as checking_amount, 20 as credit_balance, 55 as age, 'Married' as marital_status, 250 as MONEY_MONTLY_OVERDRAWN, 1 as house_ownership) from dual; Call Center Social Media Branch ECM BI Get Advice Web Email CRM Copyright 2011 Oracle Corporation Mobile Statistische Funktionen in der DB (free) Copyright 2011 Oracle Corporation 11g Statistics & SQL Analytics (Free) • Ranking functions Statistics Descriptive Statistics • rank, dense_rank, cume_dist, percent_rank, ntile • Window Aggregate functions (moving and cumulative) • Avg, sum, min, max, count, variance, stddev, first_value, last_value • LAG/LEAD functions • Direct inter-row reference using offsets • Reporting Aggregate functions • Sum, avg, min, max, variance, stddev, count, ratio_to_report • Statistical Aggregates • Correlation, linear regression family, covariance • Linear regression • Fitting of an ordinary-least-squares regression line to a set of number pairs. • Frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions • DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count, min, max, range, mean, median, stats_mode, variance, standard deviation, quantile values, +/- n sigma values, top/bottom 5 values • Correlations • Pearson’s correlation coefficients, Spearman's and Kendall's (both nonparametric). • Cross Tabs • Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa • Hypothesis Testing • Student t-test , F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square, Mann Whitney test, Kolmogorov-Smirnov test, One-way ANOVA • Distribution Fitting • Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test, Normal, Uniform, Weibull, Exponential Note: Statistics and SQL Analytics are included in Oracle Database Standard Edition Copyright 2011 Oracle Corporation Ability to Import/Export 3rd Party DM Models • ODM 11g Release 2 adds ability to import 3rd party models (PMML), convert to native ODM models and score them in-DB • Supported models for ODM model export: • Decision Trees (PMML) • Supported algorithms for ODM model import: • Multiple regression models (PMML) • Logistic regression models (PMML) • Benefits • SAS, SPSS, R, etc. data mining models can scored on Exadata • Imported dm models become native ODM models and inherit all ODM benefits including scoring at Exadata storage layer, 1st class objects, security, etc. Faster Copyright 2011 Oracle Corporation ODM in OBIEE Copyright 2011 Oracle Corporation Oracle Communications Industry Data Model Example Better Information for OBIEE Dashboards ODM’s predictions & probabilities are available in the Database for reporting using Oracle BI EE and other tools Copyright 2011 Oracle Corporation Exadata with Analytics and Business Intelligence—Better Together • In-database data mining builds predictive models that predict customer behavior • OBIEE’s integrated spatial mapping shows where Customer “most likely” be be HIGH and VERY HIGH value customer in the future Copyright 2011 Oracle Corporation Exadata with Analytics and Business Intelligence—Better Together • Deliver advanced in-database analytics Oracle Data Mining’s Predictions versus “Actuals” highlight areas for improvement and insights through OBIEE • Ability to drill-through for detail • Harness the power of Exadata for “Better BI & analytics” Copyright 2011 Oracle Corporation Exadata with Analytics and Business Intelligence—Better Together Drill-through for details about top factors that define HIGH and VERY HIGH value customers • Exadata power • OBIEE ease-of-use Copyright 2011 Oracle Corporation Fusion HCM Predictive Analytics Factory Installed PA/ODM Methodologies Copyright 2011 Oracle Corporation Easier OBIEE Demo Copyright 2011 Oracle Corporation Oracle R Copyright 2011 Oracle Corporation What is R Enterprise? Open Source • Oracle R Enterprise brings R’s statistical functionality closer to the Oracle Database 1. Eliminate R’s memory constraint by enabling R to work directly & transparently on database objects • R Open Source Allows R to run on very large data sets 2. Architected for Enterprise production infrastructure • • Automatically exploits database parallelism without require parallel R programming Build and immediately deploy 3. Oracle R leverages the latest R algorithms and packages • R is an embedded component of the DBMS server Copyright 2011 Oracle Corporation Architecture and Performance • Transparently function-ships R constructs to database via R SQL translation • Data structures • Functions • Performs data-heavy computations in database • R for summary analysis and graphics • Transparent implementation enables using wide range of R “packages” from open source community Seconds • Data manipulation functions (select, project, join) • Basic statistical functions (avg, sum, summary) • Advanced statistical functions(gamma, beta) Copyright 2011 Oracle Corporation Better Business Intelligence Enrich BI Dashboards with Statistics, Data Mining and Adv. Analytics • • • • Ad hoc Exploratory data analysis Interactive graphics Problem-solving Oracle R Enterprise's and ODM's results become a data feed for OBIEE Copyright 2011 Oracle Corporation Benefits • "R for the Enterprise" • Oracle R Enterprise enables you to: R Open Source • Run R to interactively explore and analyze data inside the Database • Develop R scripts on big data stored as tables and views inside the Oracle database and then deploy them within the enterprise—with no changes to your code • Leverage R’s familiar R console and open source R GUIs and IDEs to explore and analyze data either in the database and stored as R data frames • Develop R scripts and deploy them within the enterprise—with no changes to your code • Meet the statistical and advanced analytical requirements of the enterprise • Exploit an information technology platform designed to support analytically-driven applications. • Leverage 30+ years of experience of ever advancing Oracle Database technology. Copyright 2011 Oracle Corporation Summary "R for the Enterprise" • • • • • Enables DBAs and LOB users to readily integrate R models into production Enables R models to be integrated into BI dashboards Enables R programmers/statisticians to work against database data without knowing SQL Reduces the number of LOB help requests for SQL queries to obtain data Removes the need to manage data outside Oracle Database Save money on SA$! • Use Oracle R Enterprise instead of Base SAS and reduce SA$ Annual Usage Fees • Private analytical sandboxes for LOB/data analyst to work directly on database data in-database Oracle in-Database Analytics for Big Data • • • • Over 100 built-in statistical functions that are compatible with Base SAS High performance in-database linear algebra Data parallelism for open source R packages executing in-database Develop your own algorithms for execution closer to the data, and leverage database parallelism Copyright 2011 Oracle Corporation Easier Learn More Copyright 2011 Oracle Corporation Learn More Oracle Data Mining on OTN Oracle Data Mining Blog Oracle Data Mining Copyright 2011 Oracle Corporation Learn More Copyright 2011 Oracle Corporation Copyright 2011 Oracle Corporation