© MATERNA GmbH 2014 – www.materna.de

Agenda
- Big Data challenges: a bigger horse or a team of horses?
- Apache Hadoop: history, versions, ecosystem, products
  - HDFS – storing and distributing data
  - Map/Reduce – parallelizing the processing
  - Pig, Hive – processing data
  - HBase – distributed, scalable data storage
- Benchmark: how fast can 1 TB be sorted?
- Big Data use cases with ORACLE
- Conclusion, outlook

Properties of Big Data
- Data volume
- Growing numbers of users
- Speed of data production
- Variety of data sources

Characteristics of Big Data: the four Vs
- Volume
- Velocity
- Variety
- Value

A bigger horse or a team of horses?
- Vertical vs. horizontal scaling: instead of an ever-larger single machine (scale-up), Hadoop spreads data and processing across many commodity machines (scale-out).

The elephant has left the building …

Hadoop origins and background: the mind behind it – Doug Cutting
- Lucene search engine: 1997 SourceForge, 2001 ASF
- Nutch web crawler; Hadoop: 2003, 2005 ASF
- Google Labs papers:
  - The Google File System, October 2003
  - MapReduce algorithm, December 2004
  - Bigtable: A Distributed Storage System for Structured Data, November 2006
  - H-Store: A High-Performance, Distributed Main Memory Transaction Processing System, August 2008
  - Dremel: Interactive Analysis of Web-Scale Datasets, September 2010
- Doug Cutting: board of directors of the Apache Software Foundation, July 2009; joined Cloudera, August 2009; ASF chairman, September 2010

Hadoop releases (X = supported, (X) = partial, – = not available)

  Feature                         0.20/1.2    0.22         0.23/2.4
  Secure authentication           X           –            X
  Old configuration names         X           Deprecated   Deprecated
  New configuration names         –           X            X
  Old MapReduce API               X           X            X
  New MapReduce API               (X)         X            X
  MapReduce 1 runtime (Classic)   X           X            –
  MapReduce 2 runtime (YARN)      –           –            X
  HDFS federation                 –           –            X
  HDFS high availability          –           –            X

  Release dates on the timeline: 12.2011, 02.2013, 08.2013, 12.2013, 04.2014.

The Hadoop ecosystem (de-facto standard)
- HDFS – Hadoop Distributed File System
- MapReduce – distributed programming framework
- HBase – columnar NoSQL store
- ZooKeeper – coordination
- Pig – data flow
- Hive – SQL
- HCatalog – table & schema management

Evolution of Hadoop
- 2006: HDFS, MapReduce
- 2008: HBase, ZooKeeper, Pig, Hive
- 2009–10: Flume, Avro, Whirr, Sqoop, Mahout, Oozie
- 2011–12: HCatalog, Bigtop, Ambari, YARN

Who develops Hadoop?
(Source: Hadoop in Practice)

Use case: data cleansing
1. Capture: ingest all data into the Hadoop platform, from traditional sources (RDBMS, OLTP, OLAP), traditional repositories (RDBMS, EDW, MPP), and new sources (web logs, email, sensors, social media).
2. Process: parse, cleanse, and apply structure to data in all forms.
3. Exchange: push the results to the existing data warehouse for use with existing analytic tools; applications (business analytics, custom applications, enterprise applications) collect the data and apply a known algorithm to it in a trusted operational process.
(Source: Apache Hadoop Patterns of Use, Hortonworks 2013)

Use case: data analysis
1. Capture: ingest all data into the Hadoop platform (traditional and new sources; NoSQL stores included).
2. Process: parse, cleanse, apply structure, and transform.
3. Exchange: incorporate the data directly into applications, which collect it, analyze it, and present salient results for online apps.
(Source: Apache Hadoop Patterns of Use, Hortonworks 2013)

HDFS, MapReduce, and YARN daemons
- MapReduce 1: one JobTracker coordinates many TaskTrackers (1..N).
- YARN (MapReduce 2): a ResourceManager (RM) coordinates NodeManagers (1..N), which host ApplicationMasters (AM 1..N).
- HDFS: one NameNode holds the metadata; DataNodes (1..N) on the worker machines store the blocks (default block size 64 MB; the last block of a file can be smaller, e.g. 18 MB).

Writing data to HDFS: rack awareness and replication
- File 1 is split into blocks B1, B2, B3; each block is stored three times (the default replication factor) on DataNodes spread over several racks, so the failure of a whole rack loses no data.
- The NameNode records which nodes (n1..n4 in racks 1–3) hold which replicas.

The MapReduce approach
- Two parallel phases; influences: pipes & filters (UNIX), functional programming; built-in fault tolerance.
- Example input: "Mary had a little lamb, its fleece was white as snow, and everywhere that Mary went, the little lamb was sure to go."
- Map: (key, value) → list(out_key, intermediate_value)
- Reduce: (out_key, list(intermediate_value)) → list(out_value), the computed results.

How are MapReduce jobs executed? (MapReduce 1)
1. Run the job (client JVM).
2. JobClient gets a new job ID from the JobTracker.
3. Copy job resources to the distributed file system (e.g. HDFS).
4. Submit the job.
5. Initialize the job (JobTracker node).
6. Retrieve the input splits.
7. A TaskTracker heartbeat returns a task.
8. Retrieve the job resources (TaskTracker node).
9. Launch a child JVM.
10. Run the MapTask or ReduceTask.

MapReduce in Hadoop
- The client submits a job; Hadoop MapReduce splits it into job parts: map tasks read the input data, reduce tasks write the output data. (Source: Hadoop in Practice)

How do the parts fit together? (HBase)
- The client asks ZooKeeper to locate the data; the HBase Master manages the regions; reads and writes go to HBase region servers, which store their data in HDFS (partitioning, replication). (Source: Hadoop in Practice)

How do the parts fit together? (Pig, Hive)
- A client submits Pig Latin scripts to Pig or HiveQL queries to Hive; both compile to Hadoop MapReduce jobs that read and write Hadoop HDFS. (Source: Hadoop in Practice)

Pig and Hive compared

  Characteristic     Pig          Hive
  Developed by       Yahoo!       Facebook
  Language           Pig Latin    HiveQL
  Type of language   Data flow    Declarative (SQL dialect)
  Data structures    Complex      Better suited for structured data
  Schema             Optional     Not optional

Pig components
- Pig = Pig Latin + compiler + execution environment (local or distributed).
- An operation as a statement: … LOAD 'input.txt'; … DUMP …
- A command as a statement: ls *.txt
- Execution: logical plan → compile → physical plan → execute.
- Three steps:
  - LOAD: load data from HDFS.
  - TRANSFORM: translated to a set of map and reduce tasks.
  - DUMP or STORE: display or store the result.

Hive architecture
- Interfaces: CLI, web interface, JDBC/ODBC (DDL, queries).
- Metastore: a relational database for the metadata.
- Parser, planner, and optimizer compile queries into jobs that run on Hadoop.
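The map, shuffle, and reduce phases sketched above can be illustrated outside Hadoop. A minimal sketch in Python (not Hadoop code; the function names and the shortened sample sentence are illustrative):

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair per token,
    # i.e. (key, value) -> list(out_key, intermediate_value)
    return [(word, 1) for word in document.split()]

def shuffle(mapped):
    # Shuffle/sort: group the intermediate values by key
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: fold each key's value list into one result -> list(out_value)
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase("Mary had a little lamb it was a little lamb")))
```

Because map runs independently per input split and reduce independently per key group, both phases parallelize naturally, which is the point of the model.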
Hadoop cluster hardware requirements: a lot of iron …
- Intel Cloud Builders Guide: Apache Hadoop
- http://software.intel.com/en-us/articles/intel-benchmark-install-and-test-tool-intel-bitt-tools/

Hardware components of an Intel Hadoop cluster
- Master (e.g. R720/C2100): MapReduce job tracking (JobTracker) and HDFS storage metadata (NameNode); also ZooKeeper, Hive, Pig, Oozie, Avro, and the HDFS client.
- Slaves 1..N (e.g. R720XD/C2100/C6100/C6105): data storage and processing (TaskTracker, DataNode); rack awareness, replication.

Optimization areas for Hadoop installations
- Security & APIs: encryption
- Benchmark tuning: HiTune, HiBench
- Network: fast fabric, 10 GbE
- Storage: SSDs, non-volatile memory
- Compute: disk write/memory

Different compression codecs

  Codec         Size    Compression  Compression   Decompression  Decompression  Splittable
                (MB)    speed (s)    memory (MB)   speed (s)      memory (MB)
  Uncompressed  96      –            –             –              –              –
  gzip          23      10           0.7           1.3            0.5            N
  bzip2         19      22           8             5              4              Y
  LZO           36      1            1             0.6            0              (Y)

- HADOOP-1824: "want InputFormat for zip files".
- Client configuration in hadoop-site.xml:

  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.LzopCodec</value>
  </property>

Important: test your infrastructure
- bin/hadoop jar hadoop-*test*.jar runs the built-in benchmarks, e.g.:
  time hadoop jar ../hadoop-test-1.1.0.jar
  time hadoop jar ../hadoop-examples-1.1.0.jar wordcount 2 4
  hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
- Monitor via the web UIs:
  http://localhost:50070/ – NameNode daemon
  http://localhost:50030/ – JobTracker daemon
  http://localhost:50060/ – TaskTracker daemon
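The trade-off in the codec table (gzip compresses well, LZO is fast, bzip2 is smallest but slowest and splittable) can be reproduced with Python's standard-library codecs. A small sketch, with zlib standing in for gzip's DEFLATE; the sample input and the resulting ratios are illustrative, not the slide's measurements:

```python
import bz2
import zlib

# Repetitive sample input; a real HDFS block would be 64 MB
data = b"to be or not to be that is the question " * 5000

gz = zlib.compress(data, 6)   # DEFLATE, the algorithm behind gzip
bz = bz2.compress(data)       # bzip2: usually smaller output, but much slower

ratio_gz = len(gz) / len(data)
ratio_bz = len(bz) / len(data)
```

Both round-trip losslessly; the exact ratios and timings depend on the input, which is exactly why the slide recommends benchmarking codecs on your own data before fixing io.compression.codecs.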
HBase Master user interface (screenshot)

Debugging HBase with ZooKeeper (screenshot)

The WordCount algorithm with MapReduce
- Input lines: "Mary had a little lamb" / "Its fleece was white as snow" / "And everywhere that Mary went" / "The lamb was sure to go"
- Map.class emits a (word, 1) pair per token: Mary 1, had 1, a 1, little 1, lamb 1, Its 1, fleece 1, was 1, white 1, as 1, snow 1, And 1, everywhere 1, that 1, Mary 1, went 1, The 1, lamb 1, was 1, sure 1, to 1, go 1
- Reduce.class sums per word: had 1, a 1, little 1, lamb 2, … and Mary 2, was 2, white 1, snow 1, …

Example: WordCount (Hadoop tutorial), the driver:

  public class WordCount {

    public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      // ... see the next slide
    }

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      // ... see the next slide
    }

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");

      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);

      conf.setMapperClass(Map.class);
      conf.setCombinerClass(Reduce.class);   // the reducer doubles as a combiner
      conf.setReducerClass(Reduce.class);

      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
    }
  }

Example: WordCount (Hadoop tutorial), map and reduce:

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    // Reusable Writable instances per mapper
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

Example: Hive WordCount (HQL):

  CREATE TABLE docs (line STRING);

  LOAD DATA INPATH '/user/cloudera/wordcount/input/file'
  OVERWRITE INTO TABLE docs;

  CREATE TABLE word_counts AS
  SELECT word, count(1) AS count
  FROM (SELECT explode(split(line, '\s')) AS word FROM docs) w
  GROUP BY word
  ORDER BY word;

Example: Pig WordCount script:

  input_lines = LOAD '/user/cloudera/wordcount/input/file' AS (line:chararray);
  words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
  filtered_words = FILTER words BY word MATCHES '\\w+';
  word_groups = GROUP filtered_words BY word;
  word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
  ordered_word_count = ORDER word_count BY count DESC;
  STORE ordered_word_count INTO '/user/cloudera/wordcount/output';

Example: running WordCount (Hadoop tutorial):

  $ echo "Hello World Bye World" > file0
  $ echo "Hello Hadoop Goodbye Hadoop" > file1
  $ hadoop fs -mkdir /user/cloudera /user/cloudera/wordcount /user/cloudera/wordcount/input
  $ hadoop fs -put file* /user/cloudera/wordcount/input
  $ hadoop fs -cat /user/cloudera/wordcount/output/part-00000
  Bye 1
  Goodbye 1
  Hadoop 2
  Hello 2
  World 2

Example: WordCount map outputs (Hadoop tutorial)
- First map input: <Hello, 1> <World, 1> <Bye, 1> <World, 1> → first map output: <Bye, 1> <Hello, 1> <World, 2>
- Second map input: <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1> → second map output: <Goodbye, 1> <Hadoop, 2> <Hello, 1>
- The reducer sums up the values: <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
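The per-map outputs above are already aggregated (<World, 2> instead of two <World, 1> pairs) because the driver registers the Reduce class as a combiner via conf.setCombinerClass(Reduce.class). A sketch of that local aggregation step in Python (illustrative, not Hadoop code):

```python
from collections import Counter

def map_output(line):
    # Raw map output: one (word, 1) pair per token
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # Combiner: locally sum the values per key on the map side,
    # shrinking the data shuffled over the network to the reducers
    combined = Counter()
    for word, count in pairs:
        combined[word] += count
    return dict(combined)

def reduce_all(*partials):
    # Reducer: merge the combined partial counts from all maps
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

first = combine(map_output("Hello World Bye World"))
second = combine(map_output("Hello Hadoop Goodbye Hadoop"))
totals = reduce_all(first, second)
```

Running this on the two tutorial inputs reproduces the slide's numbers: the first map emits <World, 2> after combining, and the reducer arrives at Hello 2, World 2, Bye 1, Hadoop 2, Goodbye 1. Using the reducer as a combiner works here only because summing is associative and commutative.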
Terasort benchmark: how long does Hadoop take to sort 1 TB?
- http://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html
- hadoop jar hadoop-*examples*.jar terasort <input dir> <output dir>
- 2008: 1 TB in 3.48 minutes; 910 nodes × (4 dual-core processors, 4 disks, 8 GB memory)
- 2009: 100 TB in 173 minutes; 3,452 nodes × (2 quad-core Xeons, 8 GB memory, 4 SATA disks)
- 2012: 100 TB in 10,369 seconds; IBM InfoSphere BigInsights (1,000 virtual machines, 200 nodes, 2,400 cores)

The Forrester Wave: Big Data Hadoop Solutions, Q1 2014 (figure)

The Hadoop ecosystem: partner relationships (figure)

Hortonworks Data Platform 2.1 (figure)

History of the Hortonworks Data Platform with component versions (figure)

Hadoop as an enterprise-wide platform
(BITKOM guide: Big-Data-Technologien – Wissen für Entscheider)

Using the Hortonworks Data Platform to analyze Twitter data
(BITKOM guide: Big-Data-Technologien – Wissen für Entscheider)

Big data architecture at eBay, as of 2011
(BITKOM guide: Big-Data-Technologien – Wissen für Entscheider)

Traditional vs. big data information architecture capabilities
(Oracle: Big Data for the Enterprise, whitepaper, 2012)

Oracle integrated information architecture capabilities
(Oracle: Big Data for the Enterprise, whitepaper, 2012)

Oracle use cases (all from Oracle Information Architecture: An Architect's Guide to Big Data, whitepaper, 2012):
- Use case #1: initial data exploration
- Use case #2: big data for complex event processing
- Use case #3: big data for combined analytics
- Use case #4: big data for combined analytics
- Use case #5: deployment options for the Oracle Big Data Appliance; big data for combined analytics
- Oracle integrated big data solution
- Oracle Big Data Appliance
- Oracle NoSQL Database integration into data management

Where do we stand? The leap across the big chasm
(Crossing the Chasm: Geoffrey A. Moore)

"Crossing the chasm": coexistence & cooperation of RDBMS (ORACLE) and Hadoop
- "If I had asked people what they wanted, they would have said faster horses." – Henry Ford

Conclusion
- Cheap commodity hardware; failures are handled in software
- Cheap main memory: a large-memory machine can be cheaper than a large cluster
- Data locality principle: move the computation to the data
- Distributed, parallelized file system with replication
- Specialized data stores (columnar, key/value)
- Divide and conquer: the parallelized MapReduce algorithm
- Interactive SQL query engines for HDFS/HBase (Impala)
- More real-time processing, less batch
- Operational topics gain importance: updates, monitoring, security

Outlook
- Hadoop is the de-facto standard for big data processing
- Linux remains the preferred Hadoop platform
- Only a few Hadoop distributions will survive
- The Hadoop ecosystem will keep growing
- The market for Hadoop services will grow
- Hadoop appliances reduce cost and complexity
- Hybrid RDBMSs will close the gap
- Benchmarks matter for sizing, tuning, and system selection
- "Keep your ecosystem simple!"

Literature (figure)

Thank you for your attention!

MATERNA GmbH
Dipl.-Inform. Frank Pientka, Senior Software Architect, Business Division Applications
Phone: +49 231 5599-8854
Fax: +49 231 5599-272
[email protected]
http://xing.to/frank_pientka