html

Werbung
Incremental Recomputations
in MapReduce
Thomas Jörg
University of Kaiserslautern
Motivation
MapReduce
Program
Base data
Result data
Bigtable / HBase
DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern
2
Motivation

Base data
DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern
View
Definition
Materialized view
3
Motivation

Base data
incremental
MapReduce
MapReduce
Program
Program
Result data
Bigtable / HBase
DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern
4
Agenda
• Related Work
• Case study
• Incremental view maintenance
• Summary Delta Algorithm
• Conclusion and future work
DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern
5
Related Work
• Caching intermediate results
• DryadInc
• Incoop
• Incremental programming models
• Google Percolator
• Continuous bulk processing (CBP)
L. Popa, et al.: DryadInc: Reusing work in large-scale computations. HotCloud 2009
P. Bhatotia, et al.: Incoop: MapReduce for Incremental Computations. SoCC 2011
D. Peng and F. Dabek: Large-scale Incremental Processing Using Distributed Transactions and Notifications. OSDI 2010
D. Logothetis et al.: Stateful Bulk Processing for Incremental Analytics. SoCC 2010
DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern
6
Challenges
• Programming model
• SQL / relational algebra vs. MapReduce
• Efficient access paths
• No secondary indexes in Hbase
• Support for transactions
• Only single-row transactions in Hbase
DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern
7
Case Study
• Word histograms
• Reverse web-link graphs
• Term-vectors per host
• Count of URL access frequency
• Inverted Indexes
J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004
DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern
8
</html>
Computing Reverse Web-Link<html>
Graphs
<htm
...
</ht
...
</html>
<html>
...
</html>
ml>
<html>
...
</html>
<html>
...
</html>
tml>
html>
..
/html>
<html>
...
</html>
<html>
...
DBAG
Thomas
Treffen
Jörg,2011
Technische
– Thomas
Universität
Jörg – TU
Kaiserslautern
Kaiserslautern
</html>
<html>
...
</html>
<html>
...
<html>
...
</html>
9
Sample Web-Link Graph
a.htm
b.htm
<html>
<a href="b.htm">
...</a>
<a href="b.htm">
...</a>
</html>
<html>
<a href="a.htm">
...</a>
<a href="b.htm">
...</a>
</html>
DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern
10
Computing Reverse Web-Link Graphs
Map
Shuffle
Reduce
a.htm
<html>
<a href="b.htm">
...</a>
<a href="b.htm">
...</a>
</html>
b.htm, a.htm
b.htm, a.htm
b.htm, {a.htm, b.htm}
b.htm
<html>
<a href="a.htm">
...</a>
<a href="b.htm">
...</a>
</html>
a.htm, b.htm
b.htm, b.htm
DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern
a.htm, {b.htm}
11
Summary Delta Algorithm
CREATE VIEW Parts AS
SELECT partID,
AS revenue,
SELECT SUM(qty*price)
partID, SUM(revenue)
AS revenue,
COUNT(*) SUM(tplcnt)
AS tplcnt AS tplcnt
FROM Orders
FROM (
GROUP BY partID
(SELECT partID, SUM(qty*price) AS revenue,
COUNT(*) as tplcnt
FROM Orders_Insertions
GROUP BY partID)
UNION ALL
(SELECT partID, -SUM(qty*price) AS revenue,
-COUNT(*) as tplcnt
FROM Orders_Deletions
GROUP
BY partID)
I. S. Mumick et al.: Maintenance of Data Cubes
and Summary
Tables in a Warehouse. SIGMOD Conference 1997
W. Labio et al.: Performance Issues in)Incremental Warehouse Maintenance. VLDB 2000
GROUP BY partID
DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern
12
Computing Reverse Web-Link Graphs
Map
Shuffle
Reduce
a.htm
<html>
<a href="b.htm">
...</a>
<a href="b.htm">
...</a>
</html>
b.htm, a.htm
b.htm, a.htm
b.htm, {a.htm, b.htm}
b.htm
<html>
<a href="a.htm">
...</a>
<a href="b.htm">
...</a>
</html>
a.htm, b.htm
b.htm, b.htm
DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern
a.htm, {b.htm}
13
Achieving Self-Maintainability
Map
Shuffle
Reduce
a.htm
<html>
<a href="b.htm">
...</a>
<a href="b.htm">
...</a>
</html>
b.htm, [a.htm, 1]
b.htm, [a.htm, 1]
b.htm, {[a.htm, 2],
[b.htm, 1]}
b.htm
<html>
<a href="a.htm">
...</a>
<a href="b.htm">
...</a>
</html>
a.htm, [b.htm, 1]
b.htm, [b.htm, 1]
DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern
a.htm, {[b.htm, 1]}
14
Sample Web-Link Graph
a.htm
b.htm
<html>
<a href="b.htm">
...</a>
<a
<ahref="b.htm">
href="a.htm">
...</a>
</html>
<html>
<a href="a.htm">
...</a>
<a href="b.htm">
...</a>
</html>
DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern
15
Summary Delta Algorithm in MapReduce
a.htm (deleted)
<html>
<a href="b.htm">
...</a>
<a href="b.htm">
...</a>
</html>
Map
Reduce
b.htm, [a.htm, -1]
b.htm, [a.htm, -1]
b.htm, {[a.htm, -1]}
a.htm, {[a.htm, +1]}
a.htm (inserted)
<html>
<a href="b.htm">
...</a>
<a href="a.htm">
...</a>
</html>
Shuffle
b.htm, [a.htm, +1]
a.htm, [a.htm, +1]
DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern
16
Delta Installation Approaches


MapReduce
Base deltas
Materialized view
Increment Installation
Materialized view

MapReduce
Base deltas
Materialized view
Overwrite Installation
DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern
17
Case Study – Lessons Learned
• Numerical aggregation
• Word histogram
• URL access frequency
• Set aggregation
• Reverse web-link graph
• Inverted index
• Multiset aggregation
• Term-vector per host
DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern
18
General Solution
• Self-maintainable aggregates
• Computed in three steps
• Translation
• Grouping
• Aggregation
• commutative and associative binary function
• inverse elements
• Abelian group
DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern
19
Case Study – Lessons Learned
• Numerical aggregation
• Word histogram
• URL access frequency
• Set aggregation
Translation function:
Translate web pages into (word, 1)
Aggregation function:
Abelian group (Natural numbers, +)
• Reverse web-link graph
• Inverted index
• Multiset aggregation
Translation function:
Translate web pages into (link target, link source)
• Term-vector per host
Aggregation function:
Abelian group (Power-multiset of URLs, multiset union)
DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern
20
Evaluation
64
256
32
128
16
64
8
32
4
16
2
Word histogram
1
8
Reverse web-link graph
4
0
25
50
75
100
0
64
64
32
32
16
16
8
8
4
4
2
URL access frequency
1
2
25
50
75
100
Term-vector per host
y-axis: Elapsed time [min]
x-axis: Updates in base
documents [%]
1
0
25
50
75
100
DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern
0
25
50
75
100
21
Conclusion & Future Work
• View Maintenance in MapReduce
• Case study
• Summary delta algorithm
• Self-maintainable aggregations
• Future Work
• Broader class of MapReduce programs
• High-level MapReduce languages, e.g. Jaql or PigLatin
DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern
22
Herunterladen