What is Data Mining? - Oracle Data Warehouse Community Seite

In-Database Analytics:
Predictive Analytics,
Data Mining, R and Oracle Business Intelligence
Detlef E. Schröder
Leitender Systemberater
STCC DB Mitte
[email protected]
www.oracledwh.de
Copyright 2011 Oracle Corporation
Agenda
•
•
•
•
•
•
•
•
Überblick Oracle Datamining (ODM)
Integration in den SQL Developer 3.X
Demo ODM im SQL Dev 3.0
Integration mit OBIEE
Demo ODM in OBIEE
Statistische Funktionen in der DB (free)
Oracle Enterprise Data Mining inkl. R
Learn More
Copyright 2011 Oracle Corporation
Easier
Überblick ODM
Copyright 2011 Oracle Corporation
Data-Mining laut Wikipedia
Unter Data-Mining (der englische Begriff bedeutet etwa „aus einem Datenberg etwas
Wertvolles extrahieren“, eine adäquate deutsche Übersetzung existiert nicht.[1] Der Duden
empfiehlt die Schreibweise „Data-Mining“[2]) versteht man die systematische Anwendung von
Methoden, die meist statistisch-mathematisch begründet sind, auf einen Datenbestand mit
dem Ziel, neue Muster zu erkennen. Hierbei geht es auch um die Verarbeitung sehr großer
Datenbestände (die nicht mehr manuell verarbeitet werden könnten), wofür effiziente
Methoden benötigt werden, deren Zeitkomplexität sie für solche Datenmengen geeignet
macht. Die Methoden finden aber auch für kleinere Datenmengen Anwendung. In der Praxis,
vor allem im deutschen Sprachgebrauch, etablierte sich der angelsächsische Begriff „DataMining“ für den gesamten Prozess der so genannten „Knowledge Discovery in Databases“
(Wissensentdeckung in Datenbanken; KDD), der auch Schritte wie die Vorverarbeitung
beinhaltet, während Data-Mining eigentlich nur den Analyseschritt des Prozesses
bezeichnet.[3]
Die reine Erfassung, Speicherung und Verarbeitung von großen Datenmengen wird
gelegentlich fälschlicherweise auch mit dem Buzzword Data-Mining bezeichnet. Hier gibt es
aber akkuratere Begriffe wie beispielsweise Data Warehousing. Korrekt verwendet bezeichnet
es die Extraktion von Wissen, das „gültig (im statistischen Sinne), bisher unbekannt und
potentiell nützlich“[4] ist „zur Bestimmung bestimmter Regelmäßigkeiten, Gesetzmäßigkeiten
und verborgener Zusammenhänge“.[2] Fayyad definiert es als „ein Schritt des KDD-Prozesses,
der darin besteht Datenanalyse- und Entdeckungsalgorithmen anzuwenden, die unter
akzeptablen Effizienzbegrenzungen eine spezielle Auflistung von Mustern (oder Modellen) der
Daten liefern“.[3]
Copyright 2011 Oracle Corporation
Ein Beispiel
•Wann wird Vollmilch- oder Zartbitter-Schokolade gekauft?
•Beobachtung der letzten Verkäufe
Alter
Geschlecht
Wetter
Schokolade
30-50
M
Sonnig
Vollmilch
<30
W
Regen
Zartbitter
>60
M
Regen
Vollmilch
30-50
M
Regen
Zartbitter
•Ziel: Vorhersage der Schokolade aufgrund früherer
Beobachtungen
Copyright 2011 Oracle Corporation
Vorhersage
Alter
Geschlecht
Wetter
Schokolade
30-50
M
Sonnig
Vollmilch
<30
W
Regen
Zartbitter
>60
M
Regen
Vollmilch
30-50
M
Regen
Zartbitter
<30
M
Regen
???
•Erster Versuch: Regeln ableiten
Copyright 2011 Oracle Corporation
Entscheidungsbaum
Geschlecht
M
W
Alter
<30
Wetter
Regen
Vollmilch
Alter
<30-50
Wetter
>60
Wetter
Sonne
Zartbitter
Was, wenn immer mehr
Beobachtungen kommen,
Neue Werte, Wie groß ist der
Faktor-Einfluss, Reihenfolge, ...
Copyright 2011 Oracle Corporation
Oracle Data Mining Algorithms
Problem
Algorithm
Classification
Logistic Regression (GLM)
Decision Trees
Naïve Bayes
Support Vector Machine
Multiple Regression (GLM)
Support Vector Machine
Regression
Anomaly
Detection
Attribute
Importance
Association
Rules
Clustering
Feature
Extraction
One Class SVM
Minimum Description
Length (MDL)
A1 A2 A3 A4 A5 A6 A7
Apriori
Hierarchical K-Means
Hierarchical O-Cluster
NonNegative Matrix
Factorization
F1 F2 F3 F4
Copyright 2011 Oracle Corporation
Applicability
Classical statistical technique
Popular / Rules / transparency
Embedded app
Wide / narrow data / text
Classical statistical technique
Wide / narrow data / text
Lack examples of target field
Attribute reduction
Identify useful data
Reduce data noise
Market basket analysis
Link analysis
Product grouping
Text mining
Gene and protein analysis
Text analysis
Feature reduction
Oracle Data Mining
Decision Trees
• Classification, Prediction, Patient “profiling”
Simple DM model:
Age
>45
<45
Age
Status
No Infection
Can include
unstructured data (e.g.
physician’s comments),
transactions data (e.g.
lab results &
longitudinal data), etc.
Infection
>35
Temp
<100
Gender
>100
Risk = 0
F
M
<=35
Days ICU
>4
<=4
Risk = 1 Risk = 0 Risk = 1 Risk = 0 Risk = 1
IF (Age > 45 AND Status = Infection AND Temp = >100)
THEN Probability(High Risk=1) = .77 Support = 250
Copyright 2011 Oracle Corporation
Data Mining Provides
Better Information, Valuable Insights and Predictions
Cell Phone Churners
vs. Loyal Customers
Segment #3:
IF CUST_MO > 7 AND
INCOME < $175K, THEN
Prediction = Cell Phone
Churner, Confidence =
83%, Support = 6/39
Insight &
Prediction
Segment #1:
IF CUST_MO > 14 AND
INCOME < $90K, THEN
Prediction = Cell Phone
Churner, Confidence =
100%, Support = 8/39
Customer Months
Source: Inspired from Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management by Michael J. A. Berry, Gordon S. Linoff
Copyright 2011 Oracle Corporation
Data Mining Provides
Better Information, Valuable Insights and Predictions
Cell Phone Fraud
vs. Loyal Customers
?
Customer Months
Source: Inspired from Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management by Michael J. A. Berry, Gordon S. Linoff
Copyright 2011 Oracle Corporation
Oracle Data Mining
Anomaly Detection
Problem: Detect rare cases
• “One-Class” SVM Models
•
•
•
•
•
Fraud, noncompliance
Outlier detection
Network intrusion detection
Disease outbreaks
Rare events, true novelty
Copyright 2011 Oracle Corporation
Typical Data Mining Use Cases
• Retail
· Customer segmentation
· Response modeling
· Recommend next likely
product
· Profile high value customers
• Banking
· Credit scoring
· Probability of default
· Customer profitability
· Customer targeting
• Insurance
· Risk factor identification
· Claims fraud
· Policy bundling
· Employee retention
• Higher Education
· Alumni donations
· Student acquisition
· Student retention
· At-risk student identification
• Healthcare
· Patient procedure
recommendation
· Patient outcome prediction
· Fraud detection
· Doctor & nurse note analysis
• Life Sciences
· Drug discovery & interaction
· Common factors in
(un)healthy patients
· Cancer cell classification
· Drug safety surveillance
• Telecommunications
· Customer churn
· Identify cross-sell
opportunities
· Network intrusion detection
• Public Sector
· Taxation fraud & anomalies
· Crime analysis
· Pattern recognition in
military surveillance
Copyright 2011 Oracle Corporation
• Manufacturing
· Root cause analysis of
defects
· Warranty analysis
· Reliability analysis
· Yield analysis
• Automotive
· Feature bundling for
customer segments
· Supplier quality analysis
· Problem diagnosis
• Chemical
· New compound discovery
· Molecule clustering
· Product yield analysis
• Utilities
· Predict power line /
equipment failure
· Product bundling
· Consumer fraud detection
Datamining Klassifikation
Predictive
Descriptive
• Classification
• Regression
• TimeSeries Analysis
• Prediction
• Clustering
• Summarization
• Association Rules
• Sequence Discovery
Copyright 2011 Oracle Corporation
Data-Mining als Prozess
• Daten auswählen
• Daten bereiningen
• Daten Vorbereiten
• Data-Mining im
engeren Sinne
• Interpretation und
Anwendung
Copyright 2011 Oracle Corporation
Datenvorbereitung
Descriptive Datenanalyse
• Welcher Wert wie häufig?
• Bestimmte Häufigkeitsverteilungen
• Korrelation von Attributen
• Welche Attribute Trennen Klassen wie gut?
• Mittelwert, Median, Mode
• (Standard-) Verteilung, Sigma
•  Bsp DataDiscovery.sql
Copyright 2011 Oracle Corporation
Oracle—Hardware and Software
Engineered to Work Together
• Oracle is the world's most complete,
open, and integrated business
software and hardware systems
company
• Data Warehousing, VLDB and ILM
• Oracle R Enterprise
• R for the Enterprise
• Oracle Data Mining Option
• 12- in-DB data mining algorithms
Oracle has taught the Database how to do
Advanced Math/Statistics/Data Mining, and
more…
Copyright 2011 Oracle Corporation
What is Data Mining?
• Automatically sifts through data to
find hidden patterns, discover new insights,
and make predictions
• Data Mining can provide valuable results:
•
•
•
•
Predict customer behavior (Classification)
Predict or estimate a value (Regression)
Segment a population (Clustering)
Identify factors more associated with a business
problem (Attribute Importance)
• Find profiles of targeted people or items (Decision Trees)
• Determine important relationships and “market baskets”
within the population (Associations)
• Find fraudulent or “rare events” (Anomaly Detection)
Copyright 2011 Oracle Corporation
• 12 years “stem celling analytics” into Oracle
• Designed advanced analytics into database kernel to leverage relational
database strengths
• Naïve Bayes and Association Rules—1st algorithms added
• Leverages counting, conditional probabilities, and much more
• Now, analytical database platform
• 12 cutting edge machine learning algorithms and 50+ statistical functions
• A data mining model is a schema object in the database, built via a PL/SQL API
and scored via built-in SQL functions.
• When building models, leverage existing scalable technology
• (e.g., parallel execution, bitmap indexes, aggregation techniques) and add new core
database technology (e.g., recursion within the parallel infrastructure, IEEE float, etc.)
• True power of embedding within the database is evident when scoring models
using built-in SQL functions (incl. Exadata)
select cust_id
from customers
where region = ‘US’
and prediction_probability(churnmod, ‘Y’ using *) > 0.8;
Copyright 2011 Oracle Corporation
In-Database Data Mining
Traditional Analytics
Oracle Data Mining
Results
Data Import
Data Mining
Model “Scoring”
Data Preparation
and
Transformation
Savings
Data Mining
Model Building
Data Prep &
Transformation
Model “Scoring”
Data remains in the Database
Embedded data preparation
Data Extraction
Cutting edge machine learning
algorithms inside the SQL kernel of
Database
Model “Scoring”
Embedded Data Prep
Model Building
Data Preparation
Hours, Days or Weeks
Source
Data
• Faster time for
“Data” to “Insights”
• Lower TCO—Eliminates
• Data Movement
• Data Duplication
• Maintains Security
Dataset
s/ Work
Area
Analytic
al
Process
ing
Process
Output
Target
Secs, Mins or Hours
SQL—Most powerful language for data
preparation and transformation
Data remains in the Database
Copyright 2011 Oracle Corporation
You Can Think of It Like This…
Traditional SQL
Oracle Data Mining
• “Human-driven” queries
• Domain expertise
• Any “rules” must be
defined and managed
• SQL Queries
•
•
•
•
•
•
•
•
SELECT
DISTINCT
AGGREGATE
WHERE
AND OR
GROUP BY
ORDER BY
RANK
• Automated knowledge
discovery, model building and
deployment
• Domain expertise to assemble
the “right” data to mine
+
• ODM “Verbs”
•
•
•
•
•
•
•
•
PREDICT
DETECT
CLUSTER
CLASSIFY
REGRESS
PROFILE
IDENTIFY FACTORS
ASSOCIATE
Copyright 2011 Oracle Corporation
ODM im SQLDev3.X
Copyright 2011 Oracle Corporation
SQL Developer 3.0/
Oracle Data Miner 11g Release 2 GUI
• Graphical User
Interface for data
analyst
• SQL Developer
Extension (OTN download)
• Explore data—
discover new insights
• Build and evaluate
data mining models
• Apply predictive
models
• Share analytical
workflows
• Deploy SQL Apply
code/scripts
Copyright 2011 Oracle Corporation
Oracle Data Miner Nodes (Partial List)
Tables and Views
Transformations
Explore Data
Modeling
Text
Copyright 2011 Oracle Corporation
Oracle Data Mining and Unstructured Data
• Oracle Data
Mining mines
unstructured i.e.
“text” data
• Include free text
and comments in
ODM models
• Cluster and
Classify
documents
• Oracle Text
used to
preprocess
unstructured text
Copyright 2011 Oracle Corporation
Easier
Star Demo
Copyright 2011 Oracle Corporation
Exadata + Data Mining 11g Release 2
“DM Scoring” Pushed to Storage!
Faster
• In 11g Release 2, SQL predicates and Oracle Data Mining
models are pushed to storage level for execution
For example, find the US customers likely to churn:
select cust_id
from customers
Scoring function executed in Exadata
where region = ‘US’
and prediction_probability(churnmod,‘Y’ using *) > 0.8;
Copyright 2011 Oracle Corporation
Real-time Prediction for a Customer
• On-the-fly, single record apply with new data (e.g. from
call center)
Select prediction_probability(CLAS_DT_5_2, 'Yes'
USING 7800 as bank_funds, 125 as checking_amount, 20 as
credit_balance, 55 as age, 'Married' as marital_status,
250 as MONEY_MONTLY_OVERDRAWN, 1 as house_ownership)
from dual;
Call Center
Social Media
Branch
ECM
BI
Get Advice
Web
Email
CRM
Copyright 2011 Oracle Corporation
Mobile
Statistische Funktionen
in der DB (free)
Copyright 2011 Oracle Corporation
11g Statistics & SQL Analytics (Free)
• Ranking functions
Statistics
Descriptive Statistics
• rank, dense_rank, cume_dist,
percent_rank, ntile
• Window Aggregate functions
(moving and cumulative)
• Avg, sum, min, max, count, variance,
stddev, first_value, last_value
• LAG/LEAD functions
• Direct inter-row reference using offsets
• Reporting Aggregate functions
• Sum, avg, min, max, variance, stddev,
count, ratio_to_report
• Statistical Aggregates
• Correlation, linear regression family,
covariance
• Linear regression
• Fitting of an ordinary-least-squares
regression line to a set of number pairs.
• Frequently combined with the
COVAR_POP, COVAR_SAMP, and
CORR functions
• DBMS_STAT_FUNCS: summarizes
numerical columns of a table and returns
count, min, max, range, mean, median,
stats_mode, variance, standard deviation,
quantile values, +/- n sigma values,
top/bottom 5 values
• Correlations
• Pearson’s correlation coefficients, Spearman's
and Kendall's (both nonparametric).
• Cross Tabs
• Enhanced with % statistics: chi squared, phi
coefficient, Cramer's V, contingency
coefficient, Cohen's kappa
• Hypothesis Testing
• Student t-test , F-test, Binomial test, Wilcoxon
Signed Ranks test, Chi-square, Mann Whitney
test, Kolmogorov-Smirnov test, One-way
ANOVA
• Distribution Fitting
• Kolmogorov-Smirnov Test, Anderson-Darling
Test, Chi-Squared Test, Normal, Uniform,
Weibull, Exponential
Note: Statistics and SQL Analytics are included in Oracle Database Standard Edition
Copyright 2011 Oracle Corporation
Ability to Import/Export 3rd Party DM Models
• ODM 11g Release 2 adds ability to import 3rd party models
(PMML), convert to native ODM models and score them in-DB
• Supported models for ODM model export:
• Decision Trees (PMML)
• Supported algorithms for ODM model import:
• Multiple regression models (PMML)
• Logistic regression models (PMML)
• Benefits
• SAS, SPSS, R, etc. data mining models can scored on Exadata
• Imported dm models become native ODM models and inherit all ODM benefits
including scoring at Exadata storage layer, 1st class objects, security, etc.
Faster
Copyright 2011 Oracle Corporation
ODM in OBIEE
Copyright 2011 Oracle Corporation
Oracle Communications Industry Data Model Example
Better Information for OBIEE Dashboards
ODM’s predictions & probabilities
are available in the Database for
reporting using Oracle BI EE and
other tools
Copyright 2011 Oracle Corporation
Exadata with Analytics and Business
Intelligence—Better Together
• In-database
data mining
builds
predictive
models that
predict
customer
behavior
• OBIEE’s
integrated
spatial
mapping
shows
where
Customer “most likely” be
be HIGH and VERY HIGH
value customer in the future
Copyright 2011 Oracle Corporation
Exadata with Analytics and Business
Intelligence—Better Together
• Deliver
advanced
in-database
analytics
Oracle Data Mining’s Predictions
versus “Actuals” highlight areas
for improvement and insights
through
OBIEE
• Ability to
drill-through
for detail
• Harness the
power of
Exadata for
“Better BI &
analytics”
Copyright 2011 Oracle Corporation
Exadata with Analytics and Business
Intelligence—Better Together
Drill-through for details
about top factors that
define HIGH and VERY
HIGH value customers
• Exadata
power
• OBIEE
ease-of-use
Copyright 2011 Oracle Corporation
Fusion HCM Predictive Analytics
Factory Installed PA/ODM Methodologies
Copyright 2011 Oracle Corporation
Easier
OBIEE Demo
Copyright 2011 Oracle Corporation
Oracle R
Copyright 2011 Oracle Corporation
What is
R
Enterprise?
Open Source
• Oracle R Enterprise brings R’s statistical
functionality closer to the Oracle Database
1. Eliminate R’s memory constraint by enabling R
to work directly & transparently on database objects
•
R
Open Source
Allows R to run on very large data sets
2. Architected for Enterprise production infrastructure
•
•
Automatically exploits database parallelism without require parallel R
programming
Build and immediately deploy
3. Oracle R leverages the latest R algorithms and packages
•
R is an embedded component of the DBMS server
Copyright 2011 Oracle Corporation
Architecture and Performance
• Transparently function-ships R
constructs to database via R  SQL
translation
• Data structures
• Functions
• Performs data-heavy computations in
database
• R for summary analysis and graphics
• Transparent implementation enables
using wide range of R “packages” from
open source community
Seconds
• Data manipulation functions (select, project, join)
• Basic statistical functions (avg, sum, summary)
• Advanced statistical functions(gamma, beta)
Copyright 2011 Oracle Corporation
Better Business Intelligence
Enrich BI Dashboards with Statistics, Data Mining and Adv. Analytics
•
•
•
•
Ad hoc
Exploratory data analysis
Interactive graphics
Problem-solving
 Oracle R Enterprise's
and ODM's results become
a data feed for OBIEE
Copyright 2011 Oracle Corporation
Benefits
• "R for the Enterprise"
• Oracle R Enterprise enables you to:
R
Open Source
• Run R to interactively explore and analyze data inside the Database
• Develop R scripts on big data stored as tables and views inside the Oracle
database and then deploy them within the enterprise—with no changes to your
code
• Leverage R’s familiar R console and open source R GUIs and IDEs to explore
and analyze data either in the database and stored as R data frames
• Develop R scripts and deploy them within the enterprise—with no changes to your
code
• Meet the statistical and advanced analytical requirements of the enterprise
• Exploit an information technology platform designed to support analytically-driven
applications.
• Leverage 30+ years of experience of ever advancing Oracle Database
technology.
Copyright 2011 Oracle Corporation
Summary
"R for the Enterprise"
•
•
•
•
•
Enables DBAs and LOB users to readily integrate R models into production
Enables R models to be integrated into BI dashboards
Enables R programmers/statisticians to work against database data without knowing SQL
Reduces the number of LOB help requests for SQL queries to obtain data
Removes the need to manage data outside Oracle Database
Save money on SA$!
• Use Oracle R Enterprise instead of Base SAS and reduce SA$ Annual Usage Fees
• Private analytical sandboxes for LOB/data analyst to work directly on database data in-database
Oracle in-Database Analytics for Big Data
•
•
•
•
Over 100 built-in statistical functions that are compatible with Base SAS
High performance in-database linear algebra
Data parallelism for open source R packages executing in-database
Develop your own algorithms for execution closer to the data, and leverage database parallelism
Copyright 2011 Oracle Corporation
Easier
Learn More
Copyright 2011 Oracle Corporation
Learn More
Oracle Data Mining on OTN
Oracle Data Mining Blog
Oracle Data Mining
Copyright 2011 Oracle Corporation
Learn More
Copyright 2011 Oracle Corporation
Copyright 2011 Oracle Corporation