FH Apache Hadoop: the little elephant

© MATERNA GmbH 2014
www.materna.de
Agenda
• Big Data challenges
• A bigger horse or a team of horses?
• Apache Hadoop
• History, versions, ecosystem
• Products
– HDFS: storing and distributing data
– MapReduce: parallelizing the processing
– Pig, Hive: processing data
– HBase: distributed and scalable data storage
• Benchmark: how fast can 1 TB be sorted?
• Big Data use cases with Oracle
• Conclusion and outlook
Characteristics of Big Data
• Data volume
• Rising user numbers
• Speed of data production
• Data sources
Hallmarks of Big Data: the four Vs
• Volume
• Value
• Velocity
• Variety
A bigger horse or a team of horses?
Vertical vs. horizontal scaling
The elephant has left the shop …
Hadoop origins and background: there is a mind behind it …
• Doug Cutting: Lucene search (1997; SourceForge, then ASF in 2001)
• Nutch web crawler, origin of Hadoop (2003; ASF in 2005)
• Google Labs papers:
– The Google File System, October 2003
– MapReduce algorithm, December 2004
– Bigtable: A Distributed Storage System for Structured Data, November 2006
– H-store: a high-performance, distributed main memory transaction processing system, August 2008
– Dremel: Interactive Analysis of Web-Scale Datasets, September 2010
• Member of the board of directors of the Apache Software Foundation, July 2009
• Doug Cutting joins Cloudera, August 2009
• ASF chairman, September 2010
Hadoop releases
Feature                       | 0.20/1.2         | 0.22       | 0.23/2.4
Secure authentication         | X                | -          | X
Old configuration names       | X                | Deprecated | Deprecated
New configuration names       | -                | X          | X
Old MapReduce API             | X                | X          | X
New MapReduce API             | (X)              | X          | X
MapReduce 1 runtime (Classic) | X                | X          | -
MapReduce 2 runtime (YARN)    | -                | -          | X
HDFS federation               | -                | -          | X
HDFS high-availability        | -                | -          | X
Release dates                 | 02.2013, 08.2013 | 12.2011    | 12.2013, 04.2014
The Hadoop ecosystem (de facto standard)
• HBase: columnar NoSQL store
• ZooKeeper: coordination
• Pig: data flow
• Hive: SQL
• MapReduce: distributed programming framework
• HCatalog: table and schema management
• HDFS: Hadoop Distributed File System
Evolution of Hadoop
• 2006: HDFS, MapReduce
• 2008: HBase, ZooKeeper, Pig, Hive
• 2009-10: Flume, Avro, Whirr, Sqoop, Mahout, Oozie
• 2011-12: HCatalog, Bigtop, Ambari, YARN
Who develops Hadoop?
(Source: Hadoop in Practice)
Use case: data cleansing
Collect data and apply a known algorithm to it in a trusted operational process.
[Figure: data flows from the data sources through the Hadoop platform into the data systems and applications]
• Data sources: traditional sources (RDBMS, OLTP, OLAP) and new sources (web logs, email, sensors, social media)
• Data systems: traditional repositories (RDBMS, EDW, MPP)
• Applications: business analytics, custom applications, enterprise applications
Steps:
1. Capture: capture all data
2. Process: parse, cleanse, apply structure and transform
3. Exchange: push to the existing data warehouse for use with existing analytic tools
Source: Apache Hadoop Patterns of Use, Hortonworks 2013
Use case: data analysis
Collect data, analyze it and present salient results for online apps.
[Figure: data flows from the data sources through the Hadoop platform into the data systems and applications]
• Data sources: traditional sources (RDBMS, OLTP, OLAP) and new sources (web logs, email, sensors, social media)
• Data systems: traditional repositories (RDBMS, EDW, MPP) and NoSQL
• Applications: custom applications, enterprise applications
Steps:
1. Capture: capture all data
2. Process: parse, cleanse, apply structure and transform
3. Exchange: incorporate data directly into applications
Source: Apache Hadoop Patterns of Use, Hortonworks 2013
HDFS, MapReduce, NameNode, DataNode
[Figure: master daemons and the daemons on Worker 1 … Worker N]
• MapReduce: JobTracker (master), TaskTracker 1 … TaskTracker N (workers)
• YARN: ResourceManager (RM) (master), NodeManager 1 … NodeManager N with ApplicationMaster (AM) 1 … AM N (workers)
• HDFS: NameNode (master), DataNode 1 … DataNode N (workers)
• Block sizes shown on the DataNodes: 64 MB, 64 MB, 64 MB, 18 MB
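The block sizes on this slide (three times 64 MB plus 18 MB) suggest how HDFS cuts a file into fixed-size blocks with a smaller last block. The following is a minimal sketch of that arithmetic only. The class `HdfsBlockSketch`, the method `splitIntoBlocks` and the 210 MB file size are assumptions made for illustration and are not part of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the Hadoop API. */
final class HdfsBlockSketch {

    /** Returns the block sizes (in MB) a file of the given size is cut into. */
    static List<Long> splitIntoBlocks(long fileSizeMb, long blockSizeMb) {
        if (fileSizeMb < 0 || blockSizeMb <= 0) {
            throw new IllegalArgumentException("file size must be >= 0 and block size > 0");
        }
        List<Long> blocks = new ArrayList<>();
        long remaining = fileSizeMb;
        while (remaining > 0) {
            // Every block is full except possibly the last one.
            long block = Math.min(remaining, blockSizeMb);
            blocks.add(block);
            remaining -= block;
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Assumed file size: 3 * 64 MB + 18 MB = 210 MB.
        System.out.println(splitIntoBlocks(210, 64)); // prints [64, 64, 64, 18]
    }
}
```

The last block only occupies as much space as it needs, which is why the slide shows 18 MB rather than a fourth full block.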
Writing data to HDFS: rack awareness and replication
[Figure: File 1 is split into the blocks B1, B2 and B3. The NameNode distributes three replicas of each block across the nodes n1 to n4 of Rack 1, Rack 2 and Rack 3.]
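Rack-aware replication can be sketched as a small placement function. The sketch below follows the default HDFS policy for three replicas as commonly documented: first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack. It is not necessarily the exact placement drawn in the figure. The class name, the "rack/node" notation and the deterministic "first match" selection are assumptions for illustration, since real HDFS picks candidates randomly and checks load and free space.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of rack-aware replica placement; not the HDFS implementation. */
final class RackAwarePlacementSketch {

    /** Node ids are written as "rack/node", e.g. "Rack 1/n1" (assumed notation). */
    static String rackOf(String node) {
        return node.substring(0, node.indexOf('/'));
    }

    /** Chooses up to three replica locations for one block. */
    static List<String> chooseReplicas(String writer, List<String> cluster) {
        List<String> replicas = new ArrayList<>();
        replicas.add(writer); // 1st replica: the node that writes the block

        // 2nd replica: first node found in a different rack.
        String second = null;
        for (String node : cluster) {
            if (!rackOf(node).equals(rackOf(writer))) {
                second = node;
                break;
            }
        }
        if (second == null) {
            return replicas; // only one rack available
        }
        replicas.add(second);

        // 3rd replica: another node in the same remote rack as the 2nd.
        for (String node : cluster) {
            if (rackOf(node).equals(rackOf(second)) && !node.equals(second)) {
                replicas.add(node);
                break;
            }
        }
        return replicas;
    }

    public static void main(String[] args) {
        List<String> cluster = List.of(
                "Rack 1/n1", "Rack 1/n2", "Rack 2/n1", "Rack 2/n2", "Rack 3/n1");
        System.out.println(chooseReplicas("Rack 1/n1", cluster));
    }
}
```

Placing two of the three replicas in one remote rack keeps cross-rack write traffic low while still surviving the loss of a whole rack.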
The MapReduce method
Two parallel phases, pipes and filters (UNIX), functional programming, fault tolerance
Input: "Mary had a little lamb, its fleece was white as snow, everywhere that Mary went, the little lamb was sure to go."
• Map: (key, value) → list(out_key, intermediate_value)
• Reduce: (out_key, list(intermediate_value)) → list(out_value)
Output: computed values
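The two signatures on this slide can be tried out without a cluster. The following is a minimal in-memory sketch in plain Java, not Hadoop code: `MapReduceSketch`, `map`, `reduce` and `run` are hypothetical names, and the grouping loop stands in for the shuffle that Hadoop performs between the two phases. It runs sequentially, whereas Hadoop runs map and reduce tasks in parallel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory illustration of the map and reduce signatures. */
final class MapReduceSketch {

    /** map: (key, value) -> list(out_key, intermediate_value); the input key is unused here. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("[^a-z']+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    /** reduce: (out_key, list(intermediate_value)) -> out_value */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs map on every line, groups the pairs by key (shuffle), then reduces each group. */
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((key, values) -> result.put(key, reduce(key, values)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(
                "Mary had a little lamb,",
                "its fleece was white as snow,",
                "everywhere that Mary went,",
                "the little lamb was sure to go.")));
    }
}
```

Because `map` looks at one line at a time and `reduce` at one key at a time, both phases can be distributed across machines without shared state.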
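How are MapReduce jobs executed?
[Figure: client node (client JVM with MapReduce program and JobClient), jobtracker node (JobTracker), tasktracker node (TaskTracker, child JVM), distributed file system (e.g. HDFS)]
1. The MapReduce program runs the job via the JobClient
2. The JobClient gets a new job ID from the JobTracker
3. The JobClient copies the job resources to the distributed file system
4. The JobClient submits the job to the JobTracker
5. The JobTracker initializes the job
6. The JobTracker retrieves the input splits from the distributed file system
7. The TaskTracker sends a heartbeat (which returns a task)
8. The TaskTracker retrieves the job resources
9. The TaskTracker launches a child JVM
10. The child runs the MapTask or ReduceTask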
MapReduce in Hadoop
[Figure: the client submits a job to Hadoop MapReduce, which splits it into job parts. The map tasks read the input data, the reduce tasks write the output data.]
(Source: Hadoop in Practice)
How do the parts fit together? (HBase)
[Figure: the client locates regions via ZooKeeper and the HBase Master and reads from and writes to the HBase regions. Each region stores its data in HDFS. Labels: partitioning, replication.]
(Source: Hadoop in Practice)
How do the parts fit together? (Pig and Hive)
[Figure: one client sends Pig Latin to Pig, another sends HiveQL to Hive. Both submit jobs to Hadoop MapReduce, which reads from and writes to Hadoop HDFS.]
(Source: Hadoop in Practice)
Pig and Hive compared → Tez
Characteristic   | Pig       | Hive
Developed by     | Yahoo!    | Facebook
Language         | Pig Latin | HiveQL
Type of language | Data flow | Declarative (SQL dialect)
Data structures  | Complex   | Better suited for structured data
Schema           | Optional  | Not optional
Pig components
• Pig Latin: an operation as a statement (… LOAD 'input.txt';) or a command as a statement (… ls *.txt)
• Compiler: logical plan → compile → physical plan → execute (triggered by … DUMP …)
• Execution environment: local or distributed
Three steps:
1. LOAD: load data from HDFS
2. TRANSFORM: translated to a set of map and reduce tasks
3. DUMP or STORE: display or store the result
Hive architecture
[Figure: CLI, web interface and JDBC/ODBC send DDL and queries to the parser, planner and optimizer, which run on Hadoop. The Metastore is a relational database for metadata.]
Hadoop cluster hardware requirements: a lot of iron …
Intel Cloud Builders Guide: Apache Hadoop
http://software.intel.com/en-us/articles/intel-benchmark-install-and-test-tool-intel-bitt-tools/
Hardware components of Intel Hadoop cluster
• Master (R720/C2100): MapReduce job tracking and HDFS* storage metadata; Job Tracker, HDFS Name Node, Task Tracker; Zookeeper*, Hive*, Pig*, Oozie*, Avro*
• Slave 1 … Slave N (R720XD/C2100/C6100/C6105): data storage and processing; Task Tracker, Data Node
• HDFS Client* (R720/C2100)
• Rack awareness, replication
Areas for optimizing Hadoop installations
• Areas: security & APIs, network, storage, compute
• Items shown: encryption; benchmark tuning (HI-Tune, HI-Bench); fast fabric, 10 GbE; SSDs, non-volatile memory; disk write/memory
Different compression methods
Codec        | Size (MB) | Compression speed (sec) | Compression memory used (MB) | Decompression speed | Decompression memory used (MB) | Splittable
Uncompressed | 96        | -                       | -                            | -                   | -                              | Y
Gzip         | 23        | 10                      | 0.7                          | 1.3                 | 0.5                            | N
Bzip2        | 19        | 22                      | 8                            | 5                   | 4                              | Y
lzo          | 36        | 1                       | 1                            | 0.6                 | 0                              | (Y)
(HADOOP-1824) want InputFormat for zip files
Client hadoop-site.xml:
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.LzopCodec</value>
</property>
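The size column of the table can be made tangible with a small gzip round trip. The sketch below uses only `java.util.zip` from the JDK, not the Hadoop codec classes named in the configuration above. The class `GzipSketch` and its sample text are assumptions for illustration, and the compression ratio of such repetitive sample data says nothing about the figures in the table.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical gzip round trip with the JDK; not the Hadoop GzipCodec. */
final class GzipSketch {

    /** Compresses the given bytes with gzip. */
    static byte[] compress(byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Restores the original bytes from gzip-compressed data. */
    static byte[] decompress(byte[] compressed) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gzip.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] original = "Mary had a little lamb\n".repeat(10_000)
                .getBytes(StandardCharsets.UTF_8);
        long start = System.nanoTime();
        byte[] packed = compress(original);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(original.length + " bytes -> " + packed.length
                + " bytes in " + millis + " ms");
    }
}
```

A gzip stream has to be read from its beginning, which is why the table marks Gzip as not splittable: a single large gzip file cannot be divided among several map tasks.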
Important: test your infrastructure
• bin/hadoop jar hadoop-*test*.jar (benchmarks)
• time hadoop jar ../hadoop-examples-1.1.0.jar wordcount 2 4
• hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
• time hadoop jar ../hadoop-test-1.1.0.jar
Monitor via:
• http://localhost:50070/ : web UI of the NameNode daemon
• http://localhost:50030/ : web UI of the JobTracker daemon
• http://localhost:50060/ : web UI of the TaskTracker daemon
HBase Master user interface
ZooKeeper debugging HBase
WordCount algorithm with MapReduce
Input splits:
1. Mary had a little lamb
2. Its fleece was white as snow
3. And everywhere that Mary went
4. The lamb was sure to go
map (Map.class) emits one (word, 1) pair per word:
1. Mary 1, had 1, a 1, little 1, lamb 1
2. Its 1, fleece 1, was 1, white 1, as 1, snow 1
3. And 1, everywhere 1, that 1, Mary 1, went 1
4. The 1, lamb 1, was 1, sure 1, to 1, go 1
reduce (Reduce.class) sums the counts per word:
• had 1, a 1, little 1, lamb 2, …
• Mary 2, was 2, white 1, snow 1, …
Example: WordCount Hadoop Tutorial (job setup)
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
    public static class Map extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, IntWritable> {
        // … (map implementation elided on this slide)
    }
    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
        // … (reduce implementation elided on this slide)
    }
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        // Types of the (key, value) pairs the job writes.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // The reducer doubles as combiner because summing is associative.
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        // args[0]: input path, args[1]: output path
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
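Example: WordCount Hadoop Tutorial (map and reduce)
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        // Reused output objects, as in the Hadoop tutorial.
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            // Emit (word, 1) for every whitespace-separated token.
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Sum all counts collected for this word.
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
}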
Example: Hive WordCount HQL
CREATE TABLE docs (line STRING);
LOAD DATA INPATH '/user/cloudera/wordcount/input/file'
OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
Example: Pig WordCount Script
input_lines = LOAD '/user/cloudera/wordcount/input/file' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/user/cloudera/wordcount/output/part-00000';
Example: WordCount Hadoop Tutorial (input and output)
$ echo "Hello World Bye World" > file0
$ echo "Hello Hadoop Goodbye Hadoop" > file1
$ hadoop fs -mkdir /user/cloudera /user/cloudera/wordcount /user/cloudera/wordcount/input
$ hadoop fs -put file* /user/cloudera/wordcount/input
$ hadoop fs -cat /user/cloudera/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
Example: WordCount Hadoop Tutorial (intermediate results)
The first map emits:
< Hello, 1> < World, 1> < Bye, 1> < World, 1>
The second map emits:
< Hello, 1> < Hadoop, 1> < Goodbye, 1> < Hadoop, 1>
Output of the first map after the combiner:
< Bye, 1> < Hello, 1> < World, 2>
Output of the second map after the combiner:
< Goodbye, 1> < Hadoop, 2> < Hello, 1>
The Reducer sums up the values:
< Bye, 1> < Goodbye, 1> < Hadoop, 2> < Hello, 2> < World, 2>
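The effect of the combiner in this example can be reproduced in plain Java. The sketch below is not Hadoop code: `CombinerSketch`, `mapAndCombine` and `reduce` are hypothetical names, and map and combine are folded into one local step per input file for brevity. It is meant to show why combining per map task shrinks the data that has to travel to the reducer.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of map + combiner + reducer for WordCount. */
final class CombinerSketch {

    /** Counts the words of one input split locally (map followed by combine). */
    static Map<String, Integer> mapAndCombine(String split) {
        Map<String, Integer> local = new TreeMap<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                local.merge(word, 1, Integer::sum);
            }
        }
        return local;
    }

    /** Sums the partial counts of all splits (reduce). */
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> first = mapAndCombine("Hello World Bye World");
        Map<String, Integer> second = mapAndCombine("Hello Hadoop Goodbye Hadoop");
        System.out.println(first);  // prints {Bye=1, Hello=1, World=2}
        System.out.println(second); // prints {Goodbye=1, Hadoop=2, Hello=1}
        System.out.println(reduce(List.of(first, second)));
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

Without the combiner the first map task would send four pairs to the reducer; with it, only three.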
TeraSort benchmark on Hadoop: how long does it take to sort 1 TB?
• http://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html
• hadoop jar hadoop-*examples*.jar terasort <input dir> <output dir>
• 2008: 1 TB in 3.48 minutes, 910 nodes x (4 dual-core processors, 4 disks, 8 GB memory)
• 2009: 100 TB in 173 minutes, 3452 nodes x (2 quad-core Xeons, 8 GB memory, 4 SATA disks)
• 2012: 100 TB in 10,369 seconds, IBM InfoSphere BigInsights (1,000 virtual machines, 200 nodes, 2,400 cores)
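The three results become easier to compare once they are converted to throughput. The following sketch does only that arithmetic on the figures from this slide. The class and method names are hypothetical, and it assumes decimal units (1 TB = 1,000 GB = 1,000,000 MB), which the slide does not specify.

```java
/** Hypothetical helper: converts the sort results on the slide into throughput. */
final class TeraSortThroughputSketch {

    /** Cluster-wide throughput in GB/s, assuming 1 TB = 1,000 GB. */
    static double gbPerSecond(double terabytes, double seconds) {
        return terabytes * 1_000.0 / seconds;
    }

    /** Throughput per node in MB/s, assuming 1 TB = 1,000,000 MB. */
    static double mbPerSecondPerNode(double terabytes, double seconds, int nodes) {
        return terabytes * 1_000_000.0 / seconds / nodes;
    }

    static void report(String label, double terabytes, double seconds, int nodes) {
        System.out.printf("%s: %.2f GB/s total, %.2f MB/s per node%n",
                label,
                gbPerSecond(terabytes, seconds),
                mbPerSecondPerNode(terabytes, seconds, nodes));
    }

    public static void main(String[] args) {
        report("2008", 1, 3.48 * 60, 910);    // 3.48 minutes
        report("2009", 100, 173 * 60, 3452);  // 173 minutes
        report("2012", 100, 10_369, 200);     // 10,369 seconds on 200 nodes
    }
}
```

Note that 10,369 seconds is just under 173 minutes, so the 2012 run matches the 2009 time with far fewer physical nodes.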
The Forrester Wave: Big Data Hadoop Solutions, Q1 2014
The Hadoop ecosystem: partner relationships
Hortonworks Data Platform 2.1
History of the Hortonworks Data Platform with component versions
Hadoop as an enterprise-wide platform
BITKOM guide: Big-Data-Technologien - Wissen für Entscheider
Using the Hortonworks Data Platform to analyze Twitter data
BITKOM guide: Big-Data-Technologien - Wissen für Entscheider
Big Data architecture at eBay, as of 2011
BITKOM guide: Big-Data-Technologien - Wissen für Entscheider
Traditional vs Big Data Information Architecture Capabilities
Oracle: Big Data for the Enterprise, Whitepaper, 2012
Oracle Integrated Information Architecture Capabilities
Oracle: Big Data for the Enterprise, Whitepaper, 2012
Use Case #1: Initial Data Exploration
Oracle Information Architecture: An Architect’s Guide to Big Data, Whitepaper 2012
Use Case #2: Big Data for Complex Event Processing
Oracle Information Architecture: An Architect’s Guide to Big Data, Whitepaper 2012
Use Case #3: Big Data for Combined Analytics
Oracle Information Architecture: An Architect’s Guide to Big Data, Whitepaper 2012
Use Case #4: Big Data for Combined Analytics
Oracle Information Architecture: An Architect’s Guide to Big Data, Whitepaper 2012
Use Case #5: Possible use of the Oracle Big Data Appliance
Oracle Information Architecture: An Architect’s Guide to Big Data, Whitepaper 2012
Use Case #5: Big Data for Combined Analytics
Oracle Information Architecture: An Architect’s Guide to Big Data, Whitepaper 2012
Oracle integrated Big Data Solution
Oracle Information Architecture: An Architect’s Guide to Big Data, Whitepaper 2012
Oracle Big Data Appliance
Oracle Information Architecture: An Architect’s Guide to Big Data, Whitepaper 2012
Oracle NoSQL Database integrates into the data management
Oracle Information Architecture: An Architect’s Guide to Big Data, Whitepaper 2012
Where do we stand: the leap across the great chasm
Crossing the Chasm: Geoffrey A. Moore
"Crossing the Chasm": coexistence & cooperation
RDBMS, ORACLE, Hadoop
"If I had asked people what they wanted, they would have said faster horses."
Henry Ford
Conclusion
• Cheap commodity hardware, coping with failures
• Cheap main memory can cost less than a large cluster
• Data locality principle
• Distributed, parallelized file system with replication
• Specialized data stores (columnar, key/value)
• Divide and conquer: the parallelized MapReduce algorithm
• Interactive SQL query engine for HDFS/HBase (Impala)
• More real-time processing, less batch
• Operational topics gain importance: updates, monitoring, security
Outlook
• Hadoop is the de facto standard for Big Data processing
• Linux remains the preferred Hadoop platform
• Only a few Hadoop distributions will survive
• The Hadoop ecosystem will grow
• The Hadoop services market will grow
• Hadoop appliances reduce cost and complexity
• Hybrid RDBMSs will close the gap
• Benchmarks are important for sizing, tuning and system selection
• "Keep your ecosystem simple!"
Thank you for your attention!
MATERNA GmbH
Dipl. Inform. Frank Pientka
Senior Software Architect
Business Division Applications
Phone: +49 231 5599-8854
Fax: +49 231 5599-272
http://xing.to/frank_pientka