© 2016 Pivotal Software, Inc. All rights reserved.

Statistics on the Job
Data Scientist at Pivotal Software
Ronert Obst, Senior Data Scientist
Product Design
Continuous Improvement
Data Science
Engineering
Product Management
About Me
2007 – 2010: Bachelor's in Economics (VWL), FU Berlin
2011 – 2013: Master's in Statistics, LMU München
2013 – 2015: Data Scientist, Pivotal
2015 – present: Senior Data Scientist, Pivotal
Agenda
What Is Data Science?
Day-to-Day Data Science
Example Projects
How Do You Become a Data Scientist?
What Is Data Science?
Statistics Δ Data Science
• Heavier on machine learning
– Random Forest
– SVM
– Deep Learning
• More diverse data sources
– Images + video
– Audio
– Text
• Web logs
• Free text
• Larger data volumes
• P-values, AIC, BIC vs. cross-validation
• Heavier on computer science
– ETL
– Operationalization (API, dashboard)
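To make the contrast between information criteria and cross-validation concrete, here is a minimal Python sketch using only the standard library. The toy data, the helper names (`fit_mean`, `fit_line`, `aic_bic`, `loo_cv_mse`), and the Gaussian-error form of AIC/BIC (up to an additive constant, counting only the mean parameters) are assumptions made for illustration and are not taken from the talk.

```python
import math

def fit_mean(xs, ys):
    """Intercept-only model: always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def aic_bic(fit, n_params, xs, ys):
    """In-sample AIC and BIC under Gaussian errors (additive constant dropped)."""
    model = fit(xs, ys)
    n = len(ys)
    rss = sum((y - model(x)) ** 2 for x, y in zip(xs, ys))
    fit_term = n * math.log(rss / n)
    return fit_term + 2 * n_params, fit_term + n_params * math.log(n)

def loo_cv_mse(fit, xs, ys):
    """Leave-one-out cross-validated mean squared error."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        errors.append((ys[i] - model(xs[i])) ** 2)
    return sum(errors) / len(errors)

# Hypothetical toy data with a roughly linear trend.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0.1, 1.2, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8]

for name, fit, k in [("mean only", fit_mean, 1), ("linear", fit_line, 2)]:
    aic, bic = aic_bic(fit, k, xs, ys)
    cv = loo_cv_mse(fit, xs, ys)
    print(f"{name}: AIC={aic:.2f} BIC={bic:.2f} LOO-CV MSE={cv:.3f}")
```

AIC and BIC score the in-sample fit with a penalty for model size, while cross-validation measures the error on held-out points directly. On this toy data both approaches prefer the linear model.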
Statistical Modeling: The Two Cultures
Statistics ∩ Data Science
• Much in common
• "Those who ignore Statistics are condemned to reinvent it." – Brad Efron
• A statistics degree is an advantage
• But it is not sufficient
– Computer science
– Machine learning
Day-to-Day Data Science
Pinning Down the Problem
• Customers often do not know exactly what they want
• What is actually possible with the data?
• Define the problem
• Maximize value and impact
• Many meetings just to understand the requirements and the business
Extracting, Cleaning, and Understanding Data
• Extract, Transform, Load
• Data is rarely in a clean state
• At university you get a small dataset with a handful of variables
• In reality you get 40 tables spread across 5 databases
• Variables with cryptic names and no description
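As a small illustration of this kind of extract-and-transform step, the following Python sketch joins two tables that live in separate databases and renames cryptic columns to readable ones. It uses the standard-library `sqlite3` module with in-memory databases; the database alias `crm`, the table names `t_veh` and `t_wrk`, the column names, and the data are all hypothetical.

```python
import sqlite3

# Two in-memory SQLite databases stand in for separate source systems.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")

# Hypothetical tables with cryptic column names and no documentation.
conn.execute("CREATE TABLE main.t_veh (vid INTEGER PRIMARY KEY, mdl_cd TEXT)")
conn.execute(
    "CREATE TABLE crm.t_wrk (wid INTEGER PRIMARY KEY, vid INTEGER, dur_d REAL)"
)

conn.executemany(
    "INSERT INTO main.t_veh VALUES (?, ?)",
    [(1, "A1"), (2, "B7"), (3, "A1")],
)
conn.executemany(
    "INSERT INTO crm.t_wrk VALUES (?, ?, ?)",
    [(10, 1, 2.0), (11, 1, 4.0), (12, 2, 1.0), (13, 3, None)],
)

# Join across databases, drop rows with missing durations,
# and expose readable column names for downstream analysis.
rows = conn.execute(
    """
    SELECT v.mdl_cd     AS model_code,
           COUNT(*)     AS workshop_visits,
           AVG(w.dur_d) AS mean_duration_days
    FROM main.t_veh AS v
    JOIN crm.t_wrk AS w ON w.vid = v.vid
    WHERE w.dur_d IS NOT NULL
    GROUP BY v.mdl_cd
    ORDER BY v.mdl_cd
    """
).fetchall()

print(rows)
conn.close()
```

In a real project the sources would be different database systems rather than attached SQLite files, and working out what a column like `dur_d` means is usually the harder part.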
Big Data Software and Hardware
• Neither hardware nor software in the big data space works particularly well or reliably
• Nothing is really mature
• You spend a lot of time fixing obscure error messages
Modeling
• ~20% of the work
• Cross-validated predictive performance is usually the measure of all things
• Significant estimates and checks of model assumptions often matter little in practice
• Evaluating unsupervised learning is less clear-cut
• Feature engineering
• Advantage of models that find interactions themselves and select variables automatically
– Random Forest
– Boosting
– Deep Learning
• Baseline for comparison
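The idea of judging a model by cross-validated performance against a baseline can be sketched as follows. This is a minimal standard-library Python example under stated assumptions: the data is made up, the "model" is a one-feature decision stump chosen only because it is short, and the helper names (`k_fold_indices`, `fit_majority`, `fit_stump`, `cv_accuracy`) are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_majority(xs, ys):
    """Baseline: always predict the most frequent training label."""
    majority = max(set(ys), key=ys.count)
    return lambda x: majority

def fit_stump(xs, ys):
    """Decision stump: predict 1 if x > threshold, else 0."""
    best_threshold, best_correct = None, -1
    for threshold in sorted(set(xs)):
        correct = sum((x > threshold) == bool(y) for x, y in zip(xs, ys))
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return lambda x: int(x > best_threshold)

def cv_accuracy(fit, xs, ys, k=5, seed=0):
    """Mean held-out accuracy over k folds."""
    scores = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        hits = sum(model(xs[i]) == ys[i] for i in fold)
        scores.append(hits / len(fold))
    return sum(scores) / len(scores)

# Hypothetical data: one feature, binary label, slightly noisy boundary.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0,
      5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0]
ys = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 1, 0, 1, 1, 1, 1, 1, 1]

baseline_acc = cv_accuracy(fit_majority, xs, ys)
stump_acc = cv_accuracy(fit_stump, xs, ys)
print(f"baseline accuracy: {baseline_acc:.2f}")
print(f"stump accuracy:    {stump_acc:.2f}")
```

Both models are scored on the same folds, so the difference between the two numbers shows how much the model adds over always predicting the majority class.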
Presenting
• Results must be made understandable to laypeople
– Visualizations
– Clear explanations
– Show relevance
• Mind company politics
• Stay in frequent contact
Operationalizing
• Data pipelines
• Batch vs. real time
• APIs
• Smart apps
• Dashboards
• Health monitoring
• Logging
• Performance
Practices from Software Development
Pair Programming
Test Driven Development
Standups
Retros
Continuous Integration /
API First
Tracker
Pairing
with clients
TDD & CI
Pivotal Tracker
Retros
Standups
Example Projects
Automotive Industry
• Aquaplaning
• Warranty claims and workshop visit duration
• Recommender system for optional extras
• Demand forecasting
• Discount modeling in the used-car market
• Predictive maintenance for trucks
Logistics
• Arrival time prediction
• Demand forecasting and capacity planning
• Predicting whether parcels will be held by customs
• Customer segmentation
Infosec
• Detecting malware and botnets
• Proxy Logs
• Connected Components
• Reputation Propagation
• Algorithmically Generated Domain Names
• Anomalous Usage Patterns with PCA
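One way to read the combination of proxy logs, connected components, and reputation propagation is sketched below in standard-library Python. This is an assumption about how the pieces might fit together, not the approach used in the project: hosts and domains form a graph, components are found with union-find, and "propagation" is reduced to its simplest form, flagging every component that contains a known-bad domain. The log records, the `host:`/`domain:` prefixes, and the `known_bad` set are hypothetical.

```python
from collections import defaultdict

def connected_components(edges):
    """Group the nodes of an undirected graph into components via union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())

# Hypothetical proxy-log records: (client host, requested domain).
proxy_log = [
    ("host-a", "kq3vz9x.example"),
    ("host-b", "kq3vz9x.example"),
    ("host-b", "p0w8mmr.example"),
    ("host-c", "news.example"),
    ("host-d", "news.example"),
    ("host-e", "mail.example"),
]

# Prefix node names so a host and a domain can never collide.
edges = [(f"host:{h}", f"domain:{d}") for h, d in proxy_log]
components = connected_components(edges)

# Simplified stand-in for reputation propagation: a component is flagged
# if it contains at least one domain already known to be malicious.
known_bad = {"domain:kq3vz9x.example"}
flagged = [c for c in components if c & known_bad]

for component in flagged:
    print(sorted(component))
```

Here `host-b` links a second, previously unknown domain to the known-bad one, so both hosts and both domains end up in the same flagged component.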
How Do You Become a Data Scientist?
How Do You Become a Data Scientist?
• A statistics degree alone is not enough
– Machine learning
– Computer science
• Personality
– Data science moves very fast; you have to be curious and self-motivated to stay up to date
• Practice
– Kaggle
– GitHub
Machine Learning
http://deeplearning.net/
Computer Science
• Python
• Hadoop & Spark
• SQL
• Linux command line
• Version control (git)
• http://scikit-learn.org/
• http://keras.io/
• http://caffe.berkeleyvision.org/
• http://stanfordnlp.github.io/CoreNLP/
• https://spacy.io/
Practice
• www.kaggle.com
• Data science competitions
• Companies post their data science problems on Kaggle
• Prize money
• Fixed time frame
• But it does not quite reflect reality