1 Arrays, Linked Lists, and Derived Data Structures

Bounded Arrays
Built-in data structure.
The size must be known from the start.
Unbounded Array
e.g. std::vector
pushBack: append an element
popBack: delete the last element
Idea: double the capacity when space runs out,
halve it when space is wasted.
Done right, n pushBack/popBack operations take time O(n).
In algorithm-speak: pushBack/popBack have constant amortized
complexity. What can one do wrong?
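A minimal sketch of the doubling strategy (UArray is a hypothetical name, not the lecture's code). The classic mistake alluded to above is shrinking already at half occupancy: then alternating pushBack/popBack at the boundary copies Θ(n) elements per operation. Shrinking only at one-quarter occupancy avoids this.

#include <cstddef>

// Unbounded array with amortized O(1) pushBack/popBack.
// Pitfall avoided: shrink only at 1/4 occupancy (to half capacity),
// never at 1/2 occupancy.
template <class T>
class UArray {
  T* data_;
  std::size_t size_ = 0, cap_ = 1;
  void reallocate(std::size_t newCap) {
    T* d = new T[newCap];
    for (std::size_t i = 0; i < size_; ++i) d[i] = data_[i];
    delete[] data_;
    data_ = d;
    cap_ = newCap;
  }
public:
  UArray() : data_(new T[1]) {}
  ~UArray() { delete[] data_; }
  void pushBack(const T& x) {
    if (size_ == cap_) reallocate(2 * cap_);              // double when full
    data_[size_++] = x;
  }
  void popBack() {
    --size_;
    if (cap_ > 1 && 4 * size_ <= cap_) reallocate(cap_ / 2);  // halve lazily
  }
  T& operator[](std::size_t i) { return data_[i]; }
  std::size_t size() const { return size_; }
};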
Doubly Linked Lists
Class Item of Element      // one link in a doubly linked list
  e : Element
  next : Handle
  prev : Handle
  invariant next→prev = prev→next = this

Trick: use a dummy header item ⊥, so the list is circular and even the
first and last item have valid prev and next handles.
Procedure splice(a, b, t : Handle)
  assert b is not before a ∧ t ∉ ⟨a,…,b⟩
  // cut out ⟨a,…,b⟩
  a′ := a→prev
  b′ := b→next
  a′→next := b′
  b′→prev := a′
  // insert ⟨a,…,b⟩ after t
  t′ := t→next
  b→next := t′
  a→prev := t
  t→next := a
  t′→prev := b
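A direct C++ transcription of splice (a sketch: hypothetical Item struct with Handle = Item*, element type fixed to int; the dummy-header trick guarantees that a→prev and b→next always exist):

// One link in a doubly linked list.
struct Item {
  int e;
  Item* next;
  Item* prev;
};

// Cut out <a,...,b> and insert it after t, exactly as in the pseudocode.
// Precondition: b is not before a, and t is not inside <a,...,b>.
void splice(Item* a, Item* b, Item* t) {
  Item* ap = a->prev;            // cut out <a,...,b>
  Item* bp = b->next;
  ap->next = bp;
  bp->prev = ap;
  Item* tp = t->next;            // insert <a,...,b> after t
  b->next = tp;
  a->prev = t;
  t->next = a;
  tp->prev = b;
}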
Singly Linked Lists
Comparison with doubly linked lists:
  less memory
  space is often also time
  more restricted, e.g., no delete
  odd user interface, e.g., deleteAfter
Memory Management for Lists
can easily cost 90% of the running time!
  Rather shift elements between (free) lists than do real mallocs
  (see the sketch below)
  Allocate many items at once
  Free everything at the end?
  Store items "parasitically", e.g., graphs:
    node array; each node stores one ListItem.
    A partition of the nodes can then be stored as linked lists:
    MST, shortest paths
Challenge: garbage collection, many data types;
also a software engineering problem; not covered here
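A sketch of such a free-list scheme (ItemPool is a hypothetical helper, assuming items expose a next pointer as above; not the lecture's code):

#include <vector>

// Chunked allocation plus a free list: allocating an item is a pointer
// pop, deallocating a pointer push -- no malloc per item.
template <class Item>
class ItemPool {
  std::vector<Item*> chunks_;   // owned blocks, all freed at the end
  Item* free_ = nullptr;        // free list threaded through 'next'
  static const int kChunk = 1024;
public:
  Item* allocate() {
    if (!free_) {                         // refill: many items at once
      Item* block = new Item[kChunk];
      chunks_.push_back(block);
      for (int i = 0; i < kChunk; ++i) {
        block[i].next = free_;
        free_ = &block[i];
      }
    }
    Item* it = free_;
    free_ = free_->next;
    return it;
  }
  void deallocate(Item* it) {             // move item back to the free list
    it->next = free_;
    free_ = it;
  }
  ~ItemPool() {                           // "free everything at the end"
    for (Item* c : chunks_) delete[] c;
  }
};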
Example: Stack

                 SList        B-Array     U-Array
dynamic          +            −           +
space waste      pointer      too big?    too big? free?
time waste       cache miss   +           copying (+)
worst case time  +            +           −

Is that it? Does every implementation have serious weaknesses?
The Best From Both Worlds

hybrid: a list of array blocks with B elements each

dynamic          +
space waste      n/B + B
time waste       +
worst case time  +
A Variant
[Figure: a directory array indexing blocks of B elements each]
− reallocation at the top level: time not worst-case constant
+ indexed access to S[i] in constant time
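The constant-time access is just two array lookups; a sketch (hypothetical Directory fragment):

// S[i] with a directory of blocks of B elements each: two memory
// accesses, independent of n. Growing only reallocates the small
// directory -- hence not worst-case constant time at the top level.
template <class T, int B>
struct Directory {
  T** dir;                                           // dir[j] = j-th block
  T& operator[](int i) { return dir[i / B][i % B]; }
};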
Beyond Stacks
[Figure: stack, FIFO queue, and deque as cyclic arrays; h marks the
front (popFront/pushFront), b the back (pushBack/popBack)]
FIFO: BArray → cyclic array (see the sketch below)
Exercise: an array that supports "[i]" in constant time and
inserting/deleting elements in time O(√n)
Exercise: an external stack that supports n push/pop operations
with O(n/B) I/Os
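The BArray → cyclic array step might look like this (a minimal sketch, hypothetical class; h and b as in the figure, one slot left empty to tell full from empty):

#include <vector>

// Bounded FIFO queue on a cyclic array of capacity n.
template <class T>
class CyclicFifo {
  std::vector<T> a_;
  int h_ = 0, b_ = 0;   // h = front, b = one past the back
public:
  explicit CyclicFifo(int n) : a_(n + 1) {}
  bool empty() const { return h_ == b_; }
  void pushBack(const T& x) {
    a_[b_] = x;
    b_ = (b_ + 1) % (int)a_.size();
  }
  T popFront() {
    T x = a_[h_];
    h_ = (h_ + 1) % (int)a_.size();
    return x;
  }
};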
Exercise: complete the table for hybrid data structures

Operation    List  SList  UArray  CArray  explanation of '∗'
[·]          n     n      1       1
|·|          1∗    1∗     1       1       not with inter-list splice
first        1     1      1       1
last         1     1      1       1
insert       1     1∗     n       n       insertAfter only
remove       1     1∗     n       n       removeAfter only
pushBack     1     1      1∗      1∗      amortized
pushFront    1     1      n       1∗      amortized
popBack      1     n      1∗      1∗      amortized
popFront     1     1      n       1∗      amortized
concat       1     1      n       n
splice       1     1      n       n
findNext,…   n     n      n∗      n∗      cache efficient
What is missing?
Facts, facts, facts:
measurements for
  different implementation variants
  different architectures
  different input sizes
effects on real applications
plots of the measurements
interpretation, possibly theory building
Exercise: traverse an array versus a randomly allocated linked list
(a benchmark sketch follows)
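One way to set up the exercise (a sketch, not the lecture's code; node positions are shuffled so that pointer chasing defeats the hardware prefetcher):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

struct Node { Node* next; long val; };

int main() {
  const int n = 1 << 22;
  std::vector<long> arr(n, 1);

  // Build a list whose successor pointers jump randomly through memory.
  std::vector<Node> pool(n);
  std::vector<int> perm(n);
  std::iota(perm.begin(), perm.end(), 0);
  std::shuffle(perm.begin(), perm.end(), std::mt19937(42));
  for (int i = 0; i + 1 < n; ++i) pool[perm[i]].next = &pool[perm[i + 1]];
  pool[perm[n - 1]].next = nullptr;
  for (auto& nd : pool) nd.val = 1;

  auto bench = [](const char* name, auto f) {
    auto t0 = std::chrono::steady_clock::now();
    long sum = f();
    auto t1 = std::chrono::steady_clock::now();
    std::printf("%s: sum=%ld, %.1f ms\n", name, sum,
                std::chrono::duration<double, std::milli>(t1 - t0).count());
  };
  bench("array scan", [&] { long s = 0; for (long x : arr) s += x; return s; });
  bench("list chase", [&] {
    long s = 0;
    for (Node* p = &pool[perm[0]]; p; p = p->next) s += p->val;
    return s;
  });
}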
Quicksort
Function quickSort(s : Sequence of Element) : Sequence of Element
  if |s| ≤ 1 then return s                  // base case
  pick p ∈ s uniformly at random            // pivot key
  a := ⟨e ∈ s : e < p⟩
  b := ⟨e ∈ s : e = p⟩
  c := ⟨e ∈ s : e > p⟩
  return concatenate(quickSort(a), b, quickSort(c))
Engineering Quicksort
  arrays
  2-way comparisons
  sentinels for the inner loop
  in-place swaps
  recurse on the smaller subproblem
  → O(log n) additional space
  cut off the recursion for small inputs (20–100 elements):
  insertion sort per small subproblem
  (not one big insertion sort at the end)
Procedure qSort(a : Array of Element; ℓ, r : N)   // Sort a[ℓ..r]
  while r − ℓ ≥ n₀ do                             // use divide-and-conquer
    j := pickPivotPos(a, ℓ, r)
    swap(a[ℓ], a[j])                              // helps to establish the invariant
    p := a[ℓ]
    i := ℓ; j := r
    repeat                                        // a: ℓ … i→   ←j … r
      while a[i] < p do i++                       // scan over elements (A)
      while a[j] > p do j−−                       // on the correct side (B)
      if i ≤ j then swap(a[i], a[j]); i++; j−−
    until i > j                                   // done partitioning
    if i < ⌊(ℓ+r)/2⌋ then qSort(a, ℓ, j); ℓ := j + 1
    else                  qSort(a, i, r); r := i − 1
  insertionSort(a[ℓ..r])                          // faster for small r − ℓ
Picking Pivots Painstakingly — Theory
probabilistically: expected 1.4 n log n element comparisons
median of three: expected 1.2 n log n element comparisons
perfect: → n log n element comparisons
(approximate using large samples)

Practice: 3 GHz Pentium 4 Prescott, g++
Picking Pivots Painstakingly — Instructions
[Plot: instruction count / (n lg n), roughly between 6 and 12, versus
log₂ n from 10 to 26, for random pivot, median of 3, exact median,
skewed pivot n/10, and skewed pivot n/11]
Picking Pivots Painstakingly — Time
[Plot: nanoseconds / (n lg n), roughly between 6.8 and 8.4, versus
log₂ n from 10 to 26, for the same five pivot strategies]
Picking Pivots Painstakingly — Branch Misses
[Plot: branch misses / (n lg n), roughly between 0.3 and 0.5, versus
log₂ n from 10 to 26, for the same five pivot strategies]
Can We Do Better? Previous Work
Integer keys
+ can be 2–3 times faster than quicksort
− naive ones are cache inefficient and slower than quicksort
− simple ones are distribution dependent
Cache efficient sorting: k-ary merge sort
[Nyberg et al. 94, Arge et al. 04, Ranade et al. 00, Brodal et al. 04]
+ factor log k fewer cache faults
− only ≈ 20% speedup, and only for laaarge inputs
Sample Sort

Function sampleSort(e = ⟨e₁,…,eₙ⟩, k)
  if n/k is "small" then return smallSort(e)
  let S = ⟨S₁,…,S_{ak−1}⟩ denote a random sample of e
  sort S
  ⟨s₀, s₁, s₂,…,s_{k−1}, s_k⟩ := ⟨−∞, S_a, S_{2a},…,S_{(k−1)a}, ∞⟩
  for i := 1 to n do
    find j ∈ {1,…,k} such that s_{j−1} < eᵢ ≤ s_j   // binary search?
    place eᵢ in bucket b_j
  return concatenate(sampleSort(b₁),…,sampleSort(b_k))

[Figure: splitters s₁…s₇ route the elements into the buckets]
Why Sample Sort?
traditionally: parallelizable on coarse-grained machines
+ cache efficiency ≈ merge sort
− binary search not much faster than merging
− complicated memory management

Super Scalar Sample Sort
binary search → implicit search tree
eliminate all conditional branches
exploit instruction parallelism
cache efficiency comes to bear
"steal" memory management from radix sort
Classifying Elements
t := ⟨s_{k/2}, s_{k/4}, s_{3k/4}, s_{k/8}, s_{3k/8}, s_{5k/8}, s_{7k/8},…⟩
for i := 1 to n do
  j := 1
  repeat log k times          // unroll
    j := 2j + (aᵢ > tⱼ)
  j := j − k + 1
  bⱼ++
  o(i) := j                   // oracle

[Figure: the splitter array read as an implicit binary search tree;
each decision aᵢ > tⱼ doubles the index j and adds the outcome, until
one of the buckets is reached]
Now the compiler should:
use predicated instructions
interleave for loop iterations (unrolling ∨ software pipelining)
template <class T>
void findOraclesAndCount(const T* const a,
                         const int n, const int k, const T* const s,
                         Oracle* const oracle, int* const bucket) {
  for (int i = 0; i < n; i++) {
    int j = 1;
    while (j < k) {
      j = j*2 + (a[i] > s[j]);
    }
    int b = j-k;
    bucket[b]++;
    oracle[i] = b;
  }
}
Predication
Hardware mechanism that allows instructions to be
conditionally executed
Boolean predicate registers (1–64) hold condition codes
predicate registers p are additional inputs of predicated
instructions I
At runtime, I is executed if and only if p is true
+ Avoids branch misprediction penalty
+ More flexible instruction scheduling
− Switched off instructions still take time
− Longer opcodes
− Complicated hardware design
Example (IA-64)
Translation of: if (r1 > r2) r3 := r3 + 4

With a conditional branch:          Via predication:
     cmp.gt p6,p7=r1,r2                  cmp.gt p6,p7=r1,r2
(p7) br.cond .label                 (p6) add r3=4,r3
     add r3=4,r3
.label: (...)
Other Current Architectures:
Conditional moves only
Unrolling (k = 16)
template <class T>
void findOraclesAndCountUnrolled([...]) {
  for (int i = 0; i < n; i++) {
    int j = 1;
    j = j*2 + (a[i] > s[j]);
    j = j*2 + (a[i] > s[j]);
    j = j*2 + (a[i] > s[j]);
    j = j*2 + (a[i] > s[j]);
    int b = j-k;
    bucket[b]++;
    oracle[i] = b;
  }
}
More Unrolling (k = 16, n even)
template <class T>
void findOraclesAndCountUnrolled2([...]) {
  for (int i = n & 1; i < n; i += 2) {
    int j0 = 1;
    int j1 = 1;
    T ai0 = a[i];
    T ai1 = a[i+1];
    j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]);
    j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]);
    j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]);
    j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]);
    int b0 = j0-k;
    int b1 = j1-k;
    bucket[b0]++;
    bucket[b1]++;
    oracle[i] = b0;
    oracle[i+1] = b1;
  }
}
Distributing Elements
for i := 1 to n do a′[B[o(i)]++] := aᵢ    // B[j]: next free slot of bucket j

Why Oracles?
  simplifies memory management:
  no overflow tests or re-copying
  simplifies software pipelining
  separates the computation and memory access aspects
  small (n bytes)
  sequential, predictable memory access
  can be hidden using prefetching / write buffering

[Figure: input a with oracle array o, bucket pointers B, output a′]
Distributing Elements
template <class T> void distribute(
const T* const a0, T* const a1,
const int n, const int k,
const Oracle* const oracle, int* const bucket)
{ for (int i = 0, sum = 0; i <= k; i++) {
int t = bucket[i]; bucket[i] = sum; sum += t;
}
for (int i = 0; i < n; i++) {
a1[bucket[oracle[i]]++] = a0[i];
}
}
Sanders: Algorithm Engineering 18. Mai 2006
Exercise
How does one implement 4-way merging (8-way merging) with
  2n memory accesses and
  2n (3n) comparisons
on a machine with 8 (16) registers?
Is a naive implementation with 3n (7n) more predictable comparisons
faster in the end?
Experiments: 1.4 GHz Itanium 2
  restrict keyword from ANSI/ISO C99 to indicate nonaliasing
  Intel's C++ compiler v8.0 uses predicated instructions automatically
  profiling gives a 9% speedup
  k = 256 splitters
  use std::sort from g++ for small inputs (n ≤ 1000)!
  insertion sort for n ≤ 100
  random 32-bit integers in [0, 10⁹]
Comparison with Quicksort
[Plot: time / n log n in ns, roughly 1–8, versus n from 4096 to 2²⁴,
for GCC STL qsort and sss-sort]
Breakdown of Execution Time
[Plot: time / n log n in ns, roughly 0–6, versus n from 4096 to 2²⁴,
for findBuckets; findBuckets + distribute; findBuckets + distribute +
smallSort; and Total]
A More Detailed View

                            dynamic      dynamic  IPC        IPC
                            instructions cycles   (small n)  (n = 2²⁵)
findBuckets, 1× outer loop  63           11       5.4        4.5
distribute, one element     14            4       3.5        0.8
Comparison with Quicksort, Pentium 4
[Plot: time / n log n in ns, roughly 2–14, versus n from 4096 to 2²⁴,
for Intel STL, GCC STL, and sss-sort]
Problems: few registers, one condition code only, compiler needs “help”
Breakdown of Execution Time, Pentium 4
[Plot: time / n log n in ns, roughly 0–8, versus n from 4096 to 2²⁴,
for findBuckets; findBuckets + distribute; findBuckets + distribute +
smallSort; and Total]
Analysis

                        mem. acc.    branches  data dep.   I/Os           registers
k-way distribution:
  sss-sort              n log k      O(1)      O(n)        3.5 n/B        4 (3× unroll: O(log k))
  quicksort, log k lvls 2n log k     n log k   O(n log k)  (2/B) n log k  O(1)
k-way merging:
  memory                n log k      n log k   O(n log k)  2n/B           7 (O(log k))
  register              2n           n log k   O(n log k)  2n/B           k (O(k))
  funnel k′             2n log_k′ k  n log k   O(n log k)  2n/B           2k′+2 (O(k′))
Conclusions
  sss-sort up to twice as fast as quicksort on Itanium
  comparisons ≠ conditional branches
  algorithm analysis is not just instructions and caches
Criticism I
Why only random keys?

Answer I
Sample sort is hardly affected by the distribution of the input
Criticism I′
But what if there are many equal keys?
They all land in one bucket.

Answer I′
It's not a bug, it's a feature:
s_i = s_{i+1} = ⋯ = s_j is an indicator for a frequent key!
Set s_i := max{x ∈ Key : x < s_i}
(optional: drop s_{i+2},…,s_j).
Now bucket i+1 no longer needs to be sorted!
todo: implement
Criticism II
Quicksort is inplace

Answer II
Use a hybrid list-array representation of sequences:
needs O(√(kn)) extra space for k-way sample sort
Criticism II′
But I want arrays for input and output

Answer II′
Inplace conversion.
Input: easy.
Output: tricky. Exercise: develop a rough idea. Hint: permutation of
blocks. Every permutation is a product of cyclic permutations, and an
inplace cyclic permutation is easy (a sketch follows).
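The hint, spelled out at element granularity (blocks work the same way); a sketch, not part of the lecture:

#include <utility>
#include <vector>

// Apply a permutation in place: the element at position i moves to pi[i].
// Each cycle is rotated with one carried temporary, so every element is
// moved exactly once.
template <class T>
void permuteInPlace(std::vector<T>& a, const std::vector<int>& pi) {
  std::vector<bool> done(a.size(), false);
  for (int start = 0; start < (int)a.size(); ++start) {
    if (done[start]) continue;
    T tmp = a[start];             // lift the first element of the cycle
    int i = start;
    do {
      int j = pi[i];              // a[i] belongs at position j
      std::swap(tmp, a[j]);       // drop carried element, pick up displaced one
      done[j] = true;
      i = j;
    } while (i != start);
  }
}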
Future Work
high level fine-tuning, e.g., clever choice of k
other architectures, e.g., Opteron, PowerPC
almost in-place implementation
multilevel cache-aware or cache-oblivious generalization
(oracles help)
student research project (Studienarbeit), diploma thesis
(Diplomarbeit), or student assistant (Hiwi) job?
Sorting by Multiway Merging
Sort ⌈n/M⌉ runs with M elements each:   2n/B I/Os
Merge M/B runs at a time:               2n/B I/Os
× ⌈log_{M/B}(n/M)⌉ merge phases, until only one run is left

In total:  sort(n) := (2n/B)·(1 + ⌈log_{M/B}(n/M)⌉) I/Os

[Figure: machine model with registers and ALU, a freely programmable
fast memory of capacity M, and a large memory accessed through
internal buffers in blocks of B words]
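A quick sanity check with the parameters used in the experiments later (input n = 16 GByte, run size M = 256 MByte, block size B = 2 MByte): M/B = 128 and n/M = 64, so ⌈log₁₂₈ 64⌉ = 1 and a single merge phase suffices. The total is 4n/B I/Os, i.e., the two-pass sort of the next slide.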
Procedure twoPassSort(M : N; a : external Array [0..n−1] of Element)
  b : external Array [0..n−1] of Element    // auxiliary storage
  formRuns(M, a, b)
  mergeRuns(M, b, a)

// Sort runs of size M from f writing sorted runs to t
Procedure formRuns(M : N; f, t : external Array [0..n−1] of Element)
  for i := 0 to n−1 step M do
    run := f[i..i+M−1]
    sort(run)                               // sort internally
    t[i..i+M−1] := run

[Figure: f and t partitioned into runs at i = 0, M, 2M, …]
// Merge n elements from f to t where f stores sorted runs of size L
Procedure mergeRuns(L : N; f, t : external Array [0..n−1] of Element)
  k := ⌈n/L⌉                                  // number of runs
  next : PriorityQueue of Key × N
  runBuffer := Array [0..k−1][0..B−1] of Element
  for i := 0 to k−1 do
    runBuffer[i] := f[iL..iL+B−1]             // first block of run i
    next.insert((key(runBuffer[i][0]), iL))
  out : Array [0..B−1] of Element

[Figure: k run buffers of B elements each, the priority queue next,
and the output buffer out in internal memory]
// k-way merging
for i := 0 to n−1 step B do
  for j := 0 to B−1 do
    (x, ℓ) := next.deleteMin
    out[j] := runBuffer[ℓ div L][ℓ mod B]
    ℓ++
    if ℓ mod B = 0 then                         // new input block needed
      if ℓ mod L = 0 ∨ ℓ = n then
        runBuffer[ℓ div L][0] := ∞              // sentinel for exhausted run
      else runBuffer[ℓ div L] := f[ℓ..ℓ+B−1]    // refill buffer
    next.insert((key(runBuffer[ℓ div L][ℓ mod B]), ℓ))
  write out to t[i..i+B−1]                      // one output step
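For comparison, the same k-way merging in internal memory shrinks to a few lines; a sketch with std::priority_queue standing in for next (int keys, no block buffering):

#include <queue>
#include <utility>
#include <vector>

// k-way merge of sorted runs, smallest key first.
std::vector<int> kWayMerge(const std::vector<std::vector<int>>& runs) {
  using Entry = std::pair<int, std::pair<int, int>>;   // (key, (run, pos))
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> next;
  for (int r = 0; r < (int)runs.size(); ++r)
    if (!runs[r].empty()) next.push({runs[r][0], {r, 0}});
  std::vector<int> out;
  while (!next.empty()) {
    auto [key, rp] = next.top(); next.pop();
    auto [r, p] = rp;
    out.push_back(key);
    if (p + 1 < (int)runs[r].size())          // advance within the same run
      next.push({runs[r][p + 1], {r, p + 1}});
  }
  return out;
}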
Example, B = 2, run size = 6
("make things as simple as possible but no simpler")

f: __aeghikmnst__aaeilmpsss__bbeilopssu__eilmnoprst
t: ________aaabbeeeeghiiiiklllmmmnnooppprss

[Figure: snapshot of the four run buffers, the priority queue next,
and the output buffer out during the merge]
2 Super Scalar Sample Sort

Comparison Based Sorting
Large data sets (thousands to millions of elements)
Theory: lots of algorithms with n log n + O(n) comparisons
Practice: Quicksort
+ n log n + O(n) expected comparisons
  using sample-based pivot selection
+ sufficiently cache efficient:
  O((n/B) log n) cache misses on all levels
+ simple
+ compiler writers are aware of it
Can we nevertheless beat quicksort?
3 Sorting with Parallel Disks
I/O step := access to a single physical block per disk
[Figure: D independent disks, internal memory M, block size B;
model of Vitter and Shriver 94]
Theory: Balance Sort [Nodine Vitter 93];
deterministic, complex, asymptotically optimal in the O(·) sense
Multiway merging: "usually" a factor ≈ 10 fewer I/Os,
but not asymptotically optimal (42%)
Basic approach: improve multiway merging
Striping
[Figure: one logical block of an emulated disk = D physical blocks,
one per disk]
That takes care of run formation and of writing the output.
But what about merging?
Prediction [folklore, Knuth]
Prefetch buffers allow parallel access to the next blocks.
The smallest element of each block triggers its fetch.
[Figure: the prediction sequence controls which blocks move from the
disks through the prefetch buffers into the internal buffers]
Warmup: Multihead Model [Aggarwal Vitter 88]
[Figure: a single memory of size M with D read/write heads and block
size B]
D prefetch buffers yield an optimal algorithm:
sort(n) := (2n/DB)·(1 + ⌈log_{M/B}(n/M)⌉) I/Os
Bigger Prefetch Buffers
[Figure: kD prefetch buffers between the prediction sequence and the
internal buffers]
good deterministic performance
O(D) prefetch buffers would yield an optimal algorithm. Possible?
Randomized Cycling [Vitter Hutchinson 01]
Block i of stripe j goes to disk π_j(i) for a random permutation π_j.
Good for naive prefetching and Ω(D log D) buffers.
[Figure: prefetch buffers, prediction sequence, internal buffers]
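Generating such an allocation is a one-liner per stripe; a sketch (hypothetical helpers, routing block i of a stripe to disk π_j(i mod D)):

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// One independent random permutation of the D disks per stripe j.
std::vector<int> cyclingPermutation(int D, unsigned seed) {
  std::vector<int> pi(D);
  std::iota(pi.begin(), pi.end(), 0);
  std::shuffle(pi.begin(), pi.end(), std::mt19937(seed));
  return pi;
}

// Disk for block i of a stripe with permutation pi_j.
inline int diskFor(const std::vector<int>& pi_j, int i) {
  return pi_j[i % (int)pi_j.size()];
}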
Buffered Writing
[Sanders Egner Korst SODA 2000, Hutchinson Sanders Vitter ESA 01]
A sequence of blocks Σ is spread over the disks' queues by a
randomized mapping. Write whenever one of the W buffers is free;
otherwise, output one block from each nonempty queue.
[Figure: queues 1, 2, 3, …, D of W/D slots each in front of the disks]
Theorem: buffered writing is optimal.
But how good is optimal?
Theorem: randomized cycling achieves efficiency 1 − O(D/W).
Analysis: negative association of random variables;
application of queueing theory to a "throttled" algorithm.
Optimal Offline Prefetching
Theorem: For buffer size W:
∃ (offline) prefetching schedule for Σ with T input steps
⇔ ∃ (online) write schedule for Σᴿ with T output steps
[Figure, animated over successive slides: Σ = r q p o n m l k j i h g
f e d c b a is read in input steps 8…1; reversing the order of reading
yields the order of writing Σᴿ in output steps 1…8, with identical
buffer contents]
Synthesis
Multiway merging [60s folklore]
+ prediction
+ optimal (randomized) writing [Sanders Egner Korst SODA 2000]
+ randomized cycling [Vitter Hutchinson 2001]
+ optimal prefetching [Hutchinson Sanders Vitter ESA 2002]
⇒ (1 + o(1))·sort(n) I/Os
"answers" a question in [Knuth 98]; difficulty 48 on a 1..50 scale.
We are not done yet!
Internal work
Overlapping I/O and computation
Reasonable hardware
Interfacing with the Operating System
Parameter Tuning
Software engineering
Pipelining
Key Sorting
The I/O bandwidth of our machine is about 1/3 of its main memory
bandwidth.
If key size ≪ element size:
sort key-pointer pairs to save memory bandwidth during run formation.
[Figure: extract (key, pointer) pairs from elements a b c d e,
sort the pairs, then permute the elements into sorted order]
(a sketch follows)
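A sketch of the extract / sort / permute steps for one run (assumed layout: 128-byte elements with a 32-bit key, key-index pairs instead of pointers; not STXXL's code):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct Element { std::uint32_t key; char payload[124]; };  // 128 bytes

// Sort a run while touching the fat elements only twice: once to
// extract the keys, once to permute; the sort moves 8-byte pairs only.
void keySort(const std::vector<Element>& run, std::vector<Element>& out) {
  std::vector<std::pair<std::uint32_t, std::uint32_t>> kp(run.size());
  for (std::size_t i = 0; i < run.size(); ++i)
    kp[i] = {run[i].key, (std::uint32_t)i};                 // extract
  std::sort(kp.begin(), kp.end());                          // sort pairs
  out.resize(run.size());
  for (std::size_t i = 0; i < run.size(); ++i)
    out[i] = run[kp[i].second];                             // permute
}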
Tournament Trees for Multiway Merging
Assume k = 2^K runs
complete binary tree with K levels
Leaves: smallest current element of each run
Internal nodes: loser of a competition for being smallest
Above the root: the global winner
deleteMin + insertNext
[Figure: loser tree over the runs before and after one
deleteMin + insertNext step; the new element plays matches on the
path from its leaf to the root]
Why Tournament Trees?
Exactly log k element comparisons
Implicit layout in an array:
simple index arithmetic (shifts)
Predictable load instructions and index computations
(Unrollable) inner loop:

for (int i = (winnerIndex + kReg) >> 1; i > 0; i >>= 1) {
  currentPos = entry + i;
  currentKey = currentPos->key;
  if (currentKey < winnerKey) {       // current winner loses here:
    currentIndex      = currentPos->index;
    currentPos->key   = winnerKey;    // park it as this node's loser
    currentPos->index = winnerIndex;
    winnerKey         = currentKey;   // carry the new winner upward
    winnerIndex       = currentIndex;
  }
}
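The loop refers to names declared outside the snippet; plausible surrounding declarations might look as follows (an assumed reconstruction, not the lecture's original code):

// Assumed context for the loser-tree loop (reconstruction):
struct Entry {
  int key;    // key of the loser parked at this internal node
  int index;  // run that loser came from
};
Entry* entry;       // entry[1..k-1]: internal nodes of the implicit tree;
                    // leaf i corresponds to position k + i
int kReg;           // k, held in a register
int winnerKey;      // current overall winner (the element above the root)
int winnerIndex;    // run the winner came from
Entry* currentPos;  // scratch variables used inside the loop
int currentKey, currentIndex;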
Overlapping I/O and Computation
One thread for each disk (or asynchronous I/O)
Possibly additional threads
Blocks filled with elements are passed by reference between the
different buffers
Overlapping During Run Formation
First post read requests for runs 1 and 2
Thread A: loop { wait-read i; sort i; post-write i };
Thread B: loop { wait-write i; post-read i+2 };
[Figure: Gantt charts of read i / sort i / write i over time, for the
I/O-bound case (the disks never idle, sorting waits) and the
compute-bound case (sorting never idles, I/O waits)]
Overlapping During Merging
Bad example: runs of the form
1^{B−1} 2  3^{B−1} 4  5^{B−1} 6  ⋯
(all runs reach their block boundaries at the same time, so block
refills come in bursts)
Overlapping During Merging
I/O threads: writing has priority over reading.
[Figure: k merge buffers feed the k-way merger; k + 3D overlap
buffers, O(D) prefetch buffers, and 2D write buffers decouple the
merging from the overlapping disk scheduling]
y := number of elements merged during one I/O step;
the cases y > DB/2 and y ≤ DB/2 are analyzed separately.
I/O-bound case: the I/O thread never blocks.
[State diagram over (#elements in output buffers w, #elements in
overlap+merge buffers r): regions "fetch", "output", "blocking",
with thresholds DB, 2DB and kB+DB, kB+2DB, kB+3DB]
Compute-bound case: the merging thread never blocks.
[State diagram over (#elements in output buffers w, #elements in
overlap+merge buffers r): regions "fetch", "output", "block I/O",
"block merging", with thresholds DB, 2DB and kB, kB+y, kB+2y,
kB+DB, kB+2DB, kB+3DB]
Hardware (mid 2002)
cost-effective I/O bandwidth: real 360 MB/s for ≈ 3000 €
  2× Xeon (2 GHz, 2 threads each), 1 GB DDR RAM, Linux
  Intel E7500 chipset (SuperMicro P4DPE3)
  several 66 MHz PCI buses (2×64×66 Mb/s)
  several fast IDE controllers (4× Promise Ultra100 TX2, 4×2×100 MB/s)
  many fast IDE disks (8× IBM IC35L080AVVA07, 8×45 MB/s, 8×80 GB)
[Figure: block diagram of CPUs (400×64 Mb/s FSB), chipset, RAM,
PCI buses, controllers, channels, and disks]
Software Interface: STXXL
Goals: efficient + simple + compatible

Applications
  STL-user layer: containers (vector, stack, set, priority_queue,
    map); algorithms (sort, for_each, merge)
  Streaming layer: pipelined sorting, zero-I/O scanning
  Block management layer: typed block, block manager, buffered
    streams, block prefetcher, buffered block writer
  Asynchronous I/O primitives layer: files, I/O requests, disk
    queues, completion handlers
Operating System
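Through the STL-user layer, external sorting then reads much like STL code; a sketch based on the documented STXXL interface (comparators must also provide min_value/max_value sentinels):

#include <climits>
#include <stxxl/vector>
#include <stxxl/sort>

// STXXL comparator: operator() plus sentinel values.
struct CmpIntLess {
  bool operator()(int a, int b) const { return a < b; }
  int min_value() const { return INT_MIN; }
  int max_value() const { return INT_MAX; }
};

int main() {
  stxxl::vector<int> v;
  // ... fill v with data ...
  stxxl::sort(v.begin(), v.end(), CmpIntLess(),
              512 * 1024 * 1024);   // internal memory to use, in bytes
}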
Default Measurement Parameters
t := number of available buffer blocks
  Input size: 16 GByte
  Element size: 128 byte
  Keys: random 32-bit integers
  Run size: 256 MByte
  Block size B: 2 MByte
  Compiler: g++ 3.2 -O6
  Write buffers: w = max(t/4, 2D)
  Prefetch buffers: 2D + (3/10)·(t − w − 2D)
(hardware as above)
Element sizes (16 GByte, 8 disks)
[Plot: time in seconds (0–400) over element sizes 16 to 1024 bytes,
broken down into run formation, merging, I/O wait in the merge phase,
and I/O wait in the run formation phase]
Parallel disks give bandwidth "for free";
internal work and overlapping are relevant.
Earlier Academic Implementations
Single disk, at most 2 GByte; old measurements use an artificial M
[Plot: sort time in seconds (0–800) over element sizes 16 to 1024
bytes, for LEDA-SM, TPIE, and <stxxl> comparison based]
Earlier Academic Implementations: Multiple Disks
[Plot: sort time in seconds (26–616) over element sizes 16 to 1024
bytes, for LEDA-SM Soft-RAID, TPIE Soft-RAID, <stxxl> Soft-RAID, and
native <stxxl>]
What are good block sizes (8 disks)?
[Plot: sort time in ns/byte (12–26) over block sizes 128 KByte to
8 MByte, for 16 GBytes and for 128 GBytes with one or two merge
phases]
B is not a technology constant.
Optimal Versus Naive Prefetching
[Plot: total merge time, difference to naive prefetching in percent
(−3 to +1), over the fraction of prefetch buffers (0–100%), for 48,
112, and 176 read buffers]
Impact of Prefetch and Overlap Buffers
[Plot: merging time in seconds (115–150) over the number of read
buffers (16–172), with no prefetch buffer versus the heuristic
schedule]
Tradeoff: Write Buffer Size Versus Read Buffer Size
[Plot: merge time in seconds (110–145) over the number of read
buffers (0–200)]
Scalability
[Plot: sort time in ns/byte (0–20) over input sizes 16 to 128 GByte,
for 128-byte and 512-byte elements, with a 4N/DB bulk-I/Os reference
line]
Discussion
Theory and practice harmonize
No expensive server hardware necessary (SCSI,. . . )
No need to work with artificial M
No 2/4 GByte limits
Faster than academic implementations
(Must be) as fast as commercial implementations but with
performance guarantees
Blocks are much larger than often assumed. Not a
technology constant
Parallel disks:
bandwidth "for free",
but don't neglect internal costs
More Parallel Disk Sorting?
Pipelining: Input does not come from disk but from a logical
input stream. Output goes to a logical output stream
only half the I/Os for sorting
todo: better overlapping
often no I/Os for scanning
Parallelism: This is the only way to go for really many disks
Tuning and Special Cases: ssssort, permutations, balancing work
between merging and run formation? ...
Longer Runs: still to do with guaranteed overlapping and fast
internal sorting!
Distribution Sorting: Better for seeks etc.?
Inplace Sorting: Could also be faster
Determinism: A practical and theoretically efficient algorithm?
Procedure formLongRuns
  q, q′ : PriorityQueue
  for i := 1 to M do q.insert(readElement)
  invariant |q| + |q′| = M
  loop
    while q ≠ ∅ do
      writeElement(e := q.deleteMin)
      if input exhausted then break outer loop
      if (e′ := readElement) < e then q′.insert(e′)
      else q.insert(e′)
    q := q′; q′ := ∅
  output q in sorted order; output q′ in sorted order

Knuth: average run length 2M
todo: cache-efficient implementation (a sketch follows)
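A compact internal-memory model of formLongRuns (a sketch: a vector stands in for the external input, and runs are returned instead of written):

#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Replacement selection with two heaps: q feeds the current run,
// q2 collects elements too small for it. Average run length ~ 2M.
std::vector<std::vector<int>> formLongRuns(const std::vector<int>& in, int M) {
  using MinHeap =
      std::priority_queue<int, std::vector<int>, std::greater<int>>;
  MinHeap q, q2;
  std::size_t next = 0;
  while (next < in.size() && (int)q.size() < M) q.push(in[next++]);

  std::vector<std::vector<int>> runs(1);
  while (!q.empty()) {
    int e = q.top(); q.pop();
    runs.back().push_back(e);            // writeElement
    if (next < in.size()) {              // keep |q| + |q2| = M
      int e2 = in[next++];
      if (e2 < e) q2.push(e2);           // must wait for the next run
      else q.push(e2);
    }
    if (q.empty() && !q2.empty()) {      // current run ends, start the next
      std::swap(q, q2);
      runs.emplace_back();
    }
  }
  return runs;
}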