Incremental Recomputations in MapReduce
Thomas Jörg
University of Kaiserslautern

Motivation
[Diagram: Base data, MapReduce Program, Result data; Bigtable / HBase]

DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern

Motivation
[Diagram: Base data, View Definition, Materialized view]

Motivation
[Diagram: Base data, MapReduce Program, incremental MapReduce Program, Result data; Bigtable / HBase]

Agenda
• Related Work
• Case study
• Incremental view maintenance
• Summary Delta Algorithm
• Conclusion and future work

Related Work
• Caching intermediate results
  • DryadInc
  • Incoop
• Incremental programming models
  • Google Percolator
  • Continuous bulk processing (CBP)

L. Popa et al.: DryadInc: Reusing work in large-scale computations. HotCloud 2009
P. Bhatotia et al.: Incoop: MapReduce for Incremental Computations. SoCC 2011
D. Peng and F. Dabek: Large-scale Incremental Processing Using Distributed Transactions and Notifications. OSDI 2010
D. Logothetis et al.: Stateful Bulk Processing for Incremental Analytics. SoCC 2010

Challenges
• Programming model
  • SQL / relational algebra vs. MapReduce
• Efficient access paths
  • No secondary indexes in HBase
• Support for transactions
  • Only single-row transactions in HBase

Case Study
• Word histograms
• Reverse web-link graphs
• Term-vectors per host
• Count of URL access frequency
• Inverted indexes

J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004

Computing Reverse Web-Link Graphs
[Figure: a collection of HTML documents]

Sample Web-Link Graph

a.htm:
    <html>
      <a href="b.htm"> ...</a>
      <a href="b.htm"> ...</a>
    </html>

b.htm:
    <html>
      <a href="a.htm"> ...</a>
      <a href="b.htm"> ...</a>
    </html>

Computing Reverse Web-Link Graphs

Map input (a.htm): <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>
Map output:
    b.htm, a.htm
    b.htm, a.htm

Map input (b.htm): <html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a> </html>
Map output:
    a.htm, b.htm
    b.htm, b.htm

Shuffle: group the map output by link target

Reduce output:
    b.htm, {a.htm, b.htm}
    a.htm, {b.htm}

Summary Delta Algorithm

View definition:

    CREATE VIEW Parts AS
    SELECT partID, SUM(qty*price) AS revenue, COUNT(*) AS tplcnt
    FROM Orders
    GROUP BY partID

Summary delta:

    SELECT partID, SUM(revenue) AS revenue, SUM(tplcnt) AS tplcnt
    FROM (
      (SELECT partID, SUM(qty*price) AS revenue, COUNT(*) AS tplcnt
       FROM Orders_Insertions
       GROUP BY partID)
      UNION ALL
      (SELECT partID, -SUM(qty*price) AS revenue, -COUNT(*) AS tplcnt
       FROM Orders_Deletions
       GROUP BY partID)
    )
    GROUP BY partID

I. S. Mumick et al.: Maintenance of Data Cubes and Summary Tables in a Warehouse. SIGMOD Conference 1997
W. Labio et al.: Performance Issues in Incremental Warehouse Maintenance. VLDB 2000

Achieving Self-Maintainability

Map input (a.htm): <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>
Map output:
    b.htm, [a.htm, 1]
    b.htm, [a.htm, 1]

Map input (b.htm): <html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a> </html>
Map output:
    a.htm, [b.htm, 1]
    b.htm, [b.htm, 1]

Reduce output:
    b.htm, {[a.htm, 2], [b.htm, 1]}
    a.htm, {[b.htm, 1]}

Sample Web-Link Graph

a.htm (second link changed from b.htm to a.htm):
    <html>
      <a href="b.htm"> ...</a>
      <a href="a.htm"> ...</a>
    </html>

b.htm:
    <html>
      <a href="a.htm"> ...</a>
      <a href="b.htm"> ...</a>
    </html>

Summary Delta Algorithm in MapReduce

Map input, a.htm (deleted): <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>
Map output:
    b.htm, [a.htm, -1]
    b.htm, [a.htm, -1]

Map input, a.htm (inserted): <html> <a href="b.htm"> ...</a> <a href="a.htm"> ...</a> </html>
Map output:
    b.htm, [a.htm, +1]
    a.htm, [a.htm, +1]

Reduce output:
    b.htm, {[a.htm, -1]}
    a.htm, {[a.htm, +1]}

Delta Installation Approaches
• Increment Installation [Diagram: Base deltas, MapReduce, Materialized view]
• Overwrite Installation [Diagram: Base deltas, Materialized view, MapReduce, Materialized view]

Case Study – Lessons Learned
• Numerical aggregation
  • Word histogram
  • URL access frequency
• Set aggregation
  • Reverse web-link graph
  • Inverted index
• Multiset aggregation
  • Term-vector per host

General Solution
• Self-maintainable aggregates
• Computed in three steps
  • Translation
  • Grouping
  • Aggregation
    • commutative and associative binary function
    • inverse elements
    • Abelian group

Case Study – Lessons Learned
• Numerical aggregation
  • Word histogram
  • URL access frequency
  Translation function: Translate web pages into (word, 1)
  Aggregation function: Abelian group (Natural numbers, +)
• Set aggregation
  • Reverse web-link graph
  • Inverted index
  Translation function: Translate web pages into (link target, link source)
  Aggregation function: Abelian group (Power-multiset of URLs, multiset union)
• Multiset aggregation
  • Term-vector per host

Evaluation
[Figure: four panels (Word histogram, Reverse web-link graph, URL access frequency, Term-vector per host). y-axis: Elapsed time [min]; x-axis: Updates in base documents [%], 0 to 100]

Conclusion & Future Work
• View Maintenance in MapReduce
  • Case study
  • Summary delta algorithm
  • Self-maintainable aggregations
• Future Work
  • Broader class of MapReduce programs
  • High-level MapReduce languages, e.g. Jaql or Pig Latin
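
Sketch: summary deltas for the reverse web-link graph

The following is a minimal Python sketch of the three steps named in the slides (translation, grouping, aggregation), applied to the a.htm / b.htm example. It runs in memory and only stands in for the MapReduce phases; it is not the author's implementation. The names translate, group_and_aggregate, and install_increment, and the link-matching regular expression, are assumptions made for illustration. Multiplicities are signed integers, so a deletion contributes the inverse element of an insertion.

```python
import re
from collections import Counter, defaultdict
from itertools import chain

# Hypothetical link extractor; real HTML would need a proper parser.
LINK = re.compile(r'<a\s+href="([^"]+)"')

def translate(url, html, sign):
    """Translation (map): emit (link target, (link source, +1 or -1))."""
    for target in LINK.findall(html):
        yield target, (url, sign)

def group_and_aggregate(pairs):
    """Grouping (shuffle) and aggregation (reduce): sum signed counts."""
    result = defaultdict(Counter)
    for target, (source, count) in pairs:
        result[target][source] += count
    return result

def install_increment(view, delta):
    """Increment installation: add the delta to the materialized view."""
    for target, sources in delta.items():
        for source, count in sources.items():
            view[target][source] += count
            if view[target][source] == 0:
                del view[target][source]  # no links left from this source
        if not view[target]:
            del view[target]
    return view

def as_dict(view):
    """Plain-dict snapshot, convenient for comparison and printing."""
    return {target: dict(sources) for target, sources in view.items()}

OLD_A = '<html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>'
NEW_A = '<html> <a href="b.htm"> ...</a> <a href="a.htm"> ...</a> </html>'
B = '<html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a> </html>'

# Initial materialized view over the original a.htm and b.htm.
view = group_and_aggregate(chain(translate("a.htm", OLD_A, +1),
                                 translate("b.htm", B, +1)))

# Base deltas: the old a.htm is deleted (-1), the new a.htm is inserted (+1).
delta = group_and_aggregate(chain(translate("a.htm", OLD_A, -1),
                                  translate("a.htm", NEW_A, +1)))
delta_snapshot = as_dict(delta)

updated = as_dict(install_increment(view, delta))

# Full recomputation over the new base data, for comparison.
recomputed = as_dict(group_and_aggregate(chain(translate("a.htm", NEW_A, +1),
                                               translate("b.htm", B, +1))))
```

In this sketch the incrementally maintained view and the fully recomputed view agree, and b.htm is never re-read during maintenance: only the old and new versions of a.htm pass through the translation step.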