1 Arrays, Linked Lists, and Derived Data Structures

Bounded Arrays
Built-in data structure.
The size must be known from the start.
Unbounded Array
e.g., std::vector
pushBack: append an element
popBack: remove the last element
Idea: double the capacity when space runs out,
halve it when space is being wasted.
Done right, n pushBack/popBack operations take time O(n).
In algorithmese: pushBack/popBack have constant amortized complexity.
What can one get wrong?
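A minimal C++ sketch (my illustration, not the actual std::vector internals): grow by doubling, shrink only when occupancy drops to one quarter. Shrinking already at one half is a classic answer to the last question: alternating pushBack/popBack right at the boundary would then reallocate on every single operation.

  #include <cassert>
  #include <cstddef>

  // Unbounded array with amortized O(1) pushBack/popBack.
  template <class T>
  class UArray {
    T* data;
    std::size_t n = 0, cap = 1;
    void reallocate(std::size_t newCap) {
      T* d = new T[newCap];
      for (std::size_t i = 0; i < n; i++) d[i] = data[i];
      delete[] data;
      data = d;
      cap = newCap;
    }
  public:
    UArray() : data(new T[1]) {}
    ~UArray() { delete[] data; }
    void pushBack(const T& x) {
      if (n == cap) reallocate(2 * cap);   // out of space: double
      data[n++] = x;
    }
    void popBack() {
      assert(n > 0);
      n--;
      // shrink only at 1/4 occupancy; shrinking at 1/2 would let
      // alternating pushBack/popBack force a reallocation each time
      if (cap > 1 && 4 * n <= cap) reallocate(cap / 2);
    }
    T& operator[](std::size_t i) { return data[i]; }
    std::size_t size() const { return n; }
  };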
Doubly Linked Lists
Class Item of Element          // one link in a doubly linked list
  e : Element
  next : Handle
  prev : Handle
  invariant next→prev = prev→next = this
Trick: Use a dummy header
[Figure: circular list with a dummy item ⊥ closing the cycle]
Procedure splice(a, b, t : Handle)
  assert b is not before a ∧ t ∉ ⟨a,…,b⟩
  // cut out ⟨a,…,b⟩
  a′ := a→prev
  b′ := b→next
  a′→next := b′
  b′→prev := a′
  // insert ⟨a,…,b⟩ after t
  t′ := t→next
  b→next := t′
  a→prev := t
  t→next := a
  t′→prev := b
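The same operation as a C++ sketch over raw item pointers (my illustration; the dummy header item makes empty and boundary cases disappear):

  // One link in a doubly linked list; a dummy header item closes the cycle.
  template <class Element>
  struct Item {
    Element e;
    Item* next;
    Item* prev;
  };

  // Remove <a,...,b> from its list and reinsert it after t.
  // Precondition: b is not before a, and t is not inside <a,...,b>.
  template <class Element>
  void splice(Item<Element>* a, Item<Element>* b, Item<Element>* t) {
    Item<Element>* ap = a->prev;   // cut out <a,...,b>
    Item<Element>* bp = b->next;
    ap->next = bp;
    bp->prev = ap;
    Item<Element>* tp = t->next;   // insert <a,...,b> after t
    b->next = tp;
    a->prev = t;
    t->next = a;
    tp->prev = b;
  }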
Singly Linked Lists
[Figure: singly linked list]
Comparison with doubly linked lists:
less memory, and space often also means time
more restricted, e.g., no O(1) delete
odd user interface, e.g., deleteAfter
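A sketch of where the odd interface comes from: without prev pointers only the successor of a known item can be unlinked in constant time, hence insertAfter/deleteAfter:

  template <class Element>
  struct SItem {
    Element e;
    SItem* next;
  };

  // O(1): link a new item in behind pos.
  template <class Element>
  void insertAfter(SItem<Element>* pos, const Element& x) {
    pos->next = new SItem<Element>{x, pos->next};
  }

  // O(1): unlink the successor of pos. Deleting pos itself would
  // require its predecessor, which costs an O(n) scan to find.
  template <class Element>
  void deleteAfter(SItem<Element>* pos) {
    SItem<Element>* victim = pos->next;
    pos->next = victim->next;
    delete victim;
  }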
Memory Management for Lists
can easily cost 90% of the running time!
Rather move elements between (free) lists than do real mallocs
Allocate many items at once
Free everything at the end?
Store "parasitically", e.g., graphs:
  node array; each node stores one ListItem
  a partition of the nodes can be stored as linked lists
  MST, shortest path
Challenge: garbage collection, many data types
  also a software engineering problem
  not covered here
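A minimal free-list sketch of these rules (my illustration; it assumes the item type exposes a next pointer, as list items here do):

  #include <vector>

  // Recycle list items through a free list instead of calling
  // malloc/free per item; allocate items in large batches.
  template <class Item>
  class ItemPool {
    std::vector<Item*> batches;          // owned batches, freed at the end
    Item* freeList = nullptr;            // linked through the items' next field
    static const int batchSize = 1024;
  public:
    Item* allocate() {
      if (freeList == nullptr) {         // allocate many items at once
        Item* batch = new Item[batchSize];
        batches.push_back(batch);
        for (int i = 0; i < batchSize; i++) {
          batch[i].next = freeList;
          freeList = &batch[i];
        }
      }
      Item* it = freeList;
      freeList = it->next;
      return it;
    }
    void deallocate(Item* it) {          // just push it back onto the free list
      it->next = freeList;
      freeList = it;
    }
    ~ItemPool() {                        // free everything at the end
      for (Item* b : batches) delete[] b;
    }
  };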
Example: Stack
[Figure: stack implementations]

                   SList        B-Array     U-Array
dynamic            +            −           +
wasted space       pointers     too big?    too big? deallocate?
wasted time        cache miss   +           recopying
worst case time    (+)          +           −

Is that it? Does every implementation have serious weaknesses?
The Best From Both Worlds
[Figure: hybrid, an SList of arrays with B elements each]

                   hybrid
dynamic            +
wasted space       n/B + B
wasted time        +
worst case time    +
A Variant
[Figure: a directory pointing to blocks of B elements each]
− reallocation in the top level: not worst-case constant time
+ indexed access to S[i] in constant time
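The constant-time indexed access as a short C++ sketch (my illustration; growth and reallocation of the directory are omitted):

  #include <cstddef>
  #include <vector>

  // S[i] in constant time: one directory lookup plus one block access.
  template <class Element, std::size_t B>
  struct BlockedSeq {
    std::vector<Element*> dir;   // top-level directory of blocks of B elements
    Element& operator[](std::size_t i) { return dir[i / B][i % B]; }
  };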
Beyond Stacks
[Figure: stack, FIFO queue, and deque; pushFront/popFront at the head h, pushBack/popBack at the tail t of a cyclic array b]
FIFO: BArray −→ cyclic array (see the sketch below)
Exercise: an array that supports "[i]" in constant time and insertion/deletion of elements in time O(√n)
Exercise: an external stack that supports n push/pop operations with O(n/B) I/Os
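The "BArray −→ cyclic array" step as a bounded C++ sketch (my illustration; h and t are the head and tail indices from the figure):

  #include <cassert>
  #include <cstddef>
  #include <vector>

  // FIFO queue in a cyclic array: h indexes the next popFront, t the next
  // pushBack; one slot stays empty to distinguish full from empty.
  template <class Element>
  class CyclicQueue {
    std::vector<Element> b;
    std::size_t h = 0, t = 0;
  public:
    explicit CyclicQueue(std::size_t capacity) : b(capacity + 1) {}
    bool empty() const { return h == t; }
    void pushBack(const Element& x) {
      b[t] = x;
      t = (t + 1) % b.size();
      assert(t != h);                    // queue must not overflow
    }
    Element popFront() {
      assert(!empty());
      Element x = b[h];
      h = (h + 1) % b.size();
      return x;
    }
  };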
Exercise: complete the table for hybrid data structures

Operation     List   SList   UArray   CArray   explanation of '∗'
[·]           n      n       1        1
|·|           1∗     1∗      1        1        not with inter-list splice
first         1      1       1        1
last          1      1       1        1
insert        1      1∗      n        n        insertAfter only
remove        1      1∗      n        n        removeAfter only
pushBack      1      1       1∗       1∗       amortized
pushFront     1      1       n        1∗       amortized
popBack       1      n       1∗       1∗       amortized
popFront      1      1       n        1∗       amortized
concat        1      1       n        n
splice        1      1       n        n
findNext,…    n      n       n∗       n∗       cache efficient
What Is Missing?
Facts, facts, facts:
measurements for
  different implementation variants
  different architectures
  different input sizes
effects on real applications
curves for all of this
interpretation, possibly theory building
Exercise: traversing an array versus a randomly allocated linked list
Quicksort

Function quickSort(s : Sequence of Element) : Sequence of Element
  if |s| ≤ 1 then return s                  // base case
  pick p ∈ s uniformly at random            // pivot key
  a := ⟨e ∈ s : e < p⟩
  b := ⟨e ∈ s : e = p⟩
  c := ⟨e ∈ s : e > p⟩
  return concatenate(quickSort(a), b, quickSort(c))
Engineering Quicksort
work on an array
2-way comparisons
sentinels for the inner loop
in-place swaps
recurse on the smaller subproblem
  → O(log n) additional space
stop the recursion for small inputs (20–100 elements) and use insertion sort
  (many small insertion sorts, not one large one)
Procedure qSort(a : Array of Element; ℓ, r : N)    // sort a[ℓ..r]
  while r − ℓ ≥ n0 do                              // use divide-and-conquer
    j := pickPivotPos(a, ℓ, r)
    swap(a[ℓ], a[j])                               // helps to establish the invariant
    p := a[ℓ]
    i := ℓ; j := r
    repeat                                         // a: ℓ … i→ … ←j … r
      while a[i] < p do i++                        // scan over elements (A)
      while a[j] > p do j−−                        // on the correct side (B)
      if i ≤ j then swap(a[i], a[j]); i++; j−−
    until i > j                                    // done partitioning
    if i < (ℓ+r)/2 then qSort(a, ℓ, j); ℓ := j + 1
    else qSort(a, i, r); r := i − 1
  insertionSort(a[ℓ..r])                           // faster for small r − ℓ
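A self-contained C++ transcription of this procedure (a sketch; the pickPivotPos below is one concrete choice, median of three, and insertionSort finishes the small ranges):

  #include <algorithm>

  const int n0 = 32;   // recursion threshold; tune in the 20..100 range

  // Insertion sort on the inclusive range a[l..r].
  template <class T>
  void insertionSort(T* a, int l, int r) {
    for (int i = l + 1; i <= r; i++) {
      T x = a[i];
      int j = i - 1;
      while (j >= l && a[j] > x) { a[j + 1] = a[j]; j--; }
      a[j + 1] = x;
    }
  }

  // One concrete pickPivotPos: index of the median of a[l], a[m], a[r].
  template <class T>
  int pickPivotPos(const T* a, int l, int r) {
    int m = l + (r - l) / 2;
    if (a[l] < a[m]) {
      if (a[m] < a[r]) return m;
      return (a[l] < a[r]) ? r : l;
    } else {
      if (a[l] < a[r]) return l;
      return (a[m] < a[r]) ? r : m;
    }
  }

  // Sort a[l..r]; recursing only on the smaller part keeps the stack O(log n).
  template <class T>
  void qSort(T* a, int l, int r) {
    while (r - l >= n0) {
      int j = pickPivotPos(a, l, r);
      std::swap(a[l], a[j]);             // pivot to the front
      T p = a[l];
      int i = l;
      j = r;
      do {                               // partition around p
        while (a[i] < p) i++;
        while (a[j] > p) j--;
        if (i <= j) { std::swap(a[i], a[j]); i++; j--; }
      } while (i <= j);
      if (i < (l + r) / 2) { qSort(a, l, j); l = j + 1; }
      else                 { qSort(a, i, r); r = i - 1; }
    }
    insertionSort(a, l, r);              // faster for small r - l
  }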
Picking Pivots Painstakingly — Theory
probabilistically: expected 1.4 n log n element comparisons
median of three: expected 1.2 n log n element comparisons
perfect: → n log n element comparisons (approximate using large samples)
Practice
3 GHz Pentium 4 Prescott, g++
Picking Pivots Painstakingly — Instructions
[Plot: instructions / n lg n (6..12) versus log n (10..26) for random pivot, median of 3, exact median, skewed pivot n/10, skewed pivot n/11]
Picking Pivots Painstakingly — Time
[Plot: nanoseconds / n lg n (6.8..8.4) versus log n (10..26) for the same five pivot strategies]
Picking Pivots Painstakingly — Branch Misses
[Plot: branch misses / n lg n (0.3..0.5) versus log n (10..26) for the same five pivot strategies]
Can We Do Better? Previous Work
Integer keys
+ can be 2–3 times faster than quicksort
− naive ones are cache-inefficient and slower than quicksort
− simple ones are distribution-dependent
Cache-efficient sorting: k-ary merge sort
[Nyberg et al. 94, Arge et al. 04, Ranade et al. 00, Brodal et al. 04]
+ factor log k fewer cache faults
− only ≈ 20% speedup, and only for laaarge inputs
Sample Sort

Function sampleSort(e = ⟨e_1,…,e_n⟩, k)
  if n/k is "small" then return smallSort(e)
  let S = ⟨S_1,…,S_{ak−1}⟩ denote a random sample of e
  sort S
  ⟨s_0, s_1, …, s_k⟩ := ⟨−∞, S_a, S_{2a}, …, S_{(k−1)a}, ∞⟩   // splitters
  for i := 1 to n do
    find j ∈ {1,…,k} such that s_{j−1} < e_i ≤ s_j            // binary search?
    place e_i in bucket b_j
  return concatenate(sampleSort(b_1), …, sampleSort(b_k))
Why Sample Sort?
traditionally: parallelizable on coarse-grained machines
+ cache efficiency ≈ merge sort
− binary search not much faster than merging
− complicated memory management
Super Scalar Sample Sort
binary search −→ implicit search tree
eliminate all conditional branches
exploit instruction parallelism
cache efficiency comes to bear
"steal" memory management from radix sort
Classifying Elements
t := ⟨s_{k/2}, s_{k/4}, s_{3k/4}, s_{k/8}, s_{3k/8}, s_{5k/8}, s_{7k/8}, …⟩
for i := 1 to n do
  j := 1
  repeat log k times                 // unroll
    j := 2j + (a_i > t_j)            // *2, then +1 depending on the decision
  j := j − k + 1                     // splitter array index → bucket index
  b_j++
  o(i) := j                          // oracle

[Figure: the splitters arranged as an implicit binary search tree over the array t; each comparison doubles the index and adds the decision bit, the leaves are the buckets]

Now the compiler should:
  use predicated instructions
  interleave for-loop iterations (unrolling ∨ software pipelining)
template <class T>
void findOraclesAndCount(const T* const a,
                         const int n, const int k, const T* const s,
                         Oracle* const oracle, int* const bucket) {
  for (int i = 0; i < n; i++) {
    int j = 1;
    while (j < k) {
      j = j*2 + (a[i] > s[j]);
    }
    int b = j-k;
    bucket[b]++;
    oracle[i] = b;
  }
}
Predication
Hardware mechanism that allows instructions to be executed conditionally:
  Boolean predicate registers (1–64) hold condition codes
  a predicate register p is an additional input of a predicated instruction I
  at runtime, I is executed if and only if p is true
+ avoids branch misprediction penalties
+ more flexible instruction scheduling
− switched-off instructions still take time
− longer opcodes
− complicated hardware design
Example (IA-64)
Translation of: if (r1 > r2) r3 := r3 + 4

With a conditional branch:
       cmp.gt p6,p7=r1,r2
  (p7) br.cond .label
       add    r3=4,r3
.label: (...)

Via predication:
       cmp.gt p6,p7=r1,r2
  (p6) add    r3=4,r3

Other current architectures: conditional moves only
Unrolling (k = 16)

template <class T>
void findOraclesAndCountUnrolled([...]) {
  for (int i = 0; i < n; i++) {
    int j = 1;
    j = j*2 + (a[i] > s[j]);
    j = j*2 + (a[i] > s[j]);
    j = j*2 + (a[i] > s[j]);
    j = j*2 + (a[i] > s[j]);
    int b = j-k;
    bucket[b]++;
    oracle[i] = b;
  }
}
More Unrolling (k = 16, n even)

template <class T>
void findOraclesAndCountUnrolled2([...]) {
  for (int i = n & 1; i < n; i += 2) {
    int j0 = 1;
    int j1 = 1;
    T ai0 = a[i];
    T ai1 = a[i+1];
    j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]);
    j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]);
    j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]);
    j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]);
    int b0 = j0-k;
    int b1 = j1-k;
    bucket[b0]++;
    bucket[b1]++;
    oracle[i] = b0;
    oracle[i+1] = b1;
  }
}
Distributing Elements
for i := 1 to n do a′[B[o(i)]++] := a_i

Why Oracles?
simplifies memory management: no overflow tests or re-copying
simplifies software pipelining: separates computation and memory access aspects
the oracle array is small (n bytes)
sequential, predictable memory accesses: can be hidden using prefetching / write buffering

[Figure: a and o are read sequentially; B holds the bucket pointers; elements move into a′]
Distributing Elements

template <class T> void distribute(
    const T* const a0, T* const a1,
    const int n, const int k,
    const Oracle* const oracle, int* const bucket) {
  for (int i = 0, sum = 0; i <= k; i++) {   // exclusive prefix sums
    int t = bucket[i]; bucket[i] = sum; sum += t;
  }
  for (int i = 0; i < n; i++) {             // place each element in its bucket
    a1[bucket[oracle[i]]++] = a0[i];
  }
}
Exercise
How does one implement 4-way merging (8-way merging) with
  2n memory accesses and
  2n (3n) comparisons
on a machine with 8 (16) registers?
Is a naive implementation with 3n (7n) more easily predictable comparisons faster in the end?
Experiments: 1.4 GHz Itanium 2
restrict keyword from ANSI/ISO C99 to indicate nonaliasing
Intel's C++ compiler v8.0 uses predicated instructions automatically
profiling gives a 9% speedup
k = 256 splitters
use std::sort from g++ (n ≤ 1000)!
insertion sort for n ≤ 100
random 32-bit integers in [0, 10^9]
Comparison with Quicksort
[Plot: time / n log n in ns (0..8) versus n (4096..2^24) for GCC STL qsort and sss-sort]
Breakdown of Execution Time
[Plot: time / n log n in ns (0..6) versus n (4096..2^24): findBuckets, findBuckets + distribute, findBuckets + distribute + smallSort, total]
A More Detailed View

                             dynamic instr.   dynamic cycles   IPC (small n)   IPC (n = 2^25)
findBuckets, 1× outer loop        63               11              5.4              4.5
distribute, one element           14                4              3.5              0.8
Comparison with Quicksort, Pentium 4
[Plot: time / n log n in ns (0..14) versus n (4096..2^24) for Intel STL, GCC STL, and sss-sort]
Problems: few registers, only one condition code, the compiler needs "help"
Breakdown of Execution Time, Pentium 4
[Plot: time / n log n in ns (0..8) versus n (4096..2^24): findBuckets, findBuckets + distribute, findBuckets + distribute + smallSort, total]
Analysis

                              mem. acc.      branches    data dep.     I/Os            registers
k-way distribution:
  sss-sort                    n log k        O(1)        O(n)          3.5 n/B         3×unroll, O(log k)
  quicksort (log k lvls.)     2n log k       n log k     O(n log k)    2(n/B) log k    4, O(1)
k-way merging:
  memory                      n log k        n log k     O(n log k)    2n/B            7, O(log k)
  register                    2n             n log k     O(n log k)    2n/B            k, O(k)
  funnel k′ (log_k′ k lvls.)  2n log_k′ k    n log k     O(n log k)    2n/B            2k′+2, O(k′)
Conclusions
sss-sort is up to twice as fast as quicksort on Itanium
comparisons ≠ conditional branches
algorithm analysis is not just instructions and caches
Criticism I
Why only random keys?
Answer I
Sample sort is hardly dependent on the distribution of the input
Criticism I′
But what if there are many equal keys?
They all land in a single bucket.
Answer I′
It's not a bug, it's a feature:
s_i = s_{i+1} = ··· = s_j is an indicator for a frequent key!
Set s_i := max{x ∈ Key : x < s_i}
(optional: remove s_{i+2}, …, s_j).
Now bucket i + 1 no longer needs to be sorted!
todo: implement
Criticism II
Quicksort is in-place
Answer II
Use a hybrid list-array representation of sequences:
needs O(√(kn)) extra space for k-way sample sort
Criticism II′
But I want arrays for input and output
Answer II′
In-place conversion.
Input: easy.
Output: tricky. Exercise: develop a rough idea. Hint: permute blocks; every permutation is a product of cyclic permutations, and an in-place cyclic permutation is easy.
Future Work
high-level fine-tuning, e.g., clever choice of k
other architectures, e.g., Opteron, PowerPC
almost in-place implementation
multilevel cache-aware or cache-oblivious generalization (oracles help)
Studienarbeit, (Diplomarbeit), Hiwi job?
Sorting by Multiway Merging
Sort ⌈n/M⌉ runs with M elements each: 2n/B I/Os
Merge M/B runs at a time: 2n/B I/Os per merge phase
× ⌈log_{M/B}(n/M)⌉ merge phases, until only one run is left
————————————————
In total: sort(n) := (2n/B) · (1 + ⌈log_{M/B}(n/M)⌉) I/Os

[Figure: external memory model: registers and ALU, a freely programmable internal memory of capacity M, and a large external memory; I/Os move blocks of B words through internal buffers]
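A worked instance with the parameters of the experiments later in this section (n = 16 GByte of data, run size M = 256 MByte, block size B = 2 MByte): there are ⌈n/M⌉ = 64 runs and the merge degree is M/B = 128 ≥ 64, so a single merge phase suffices and sort(n) = (2n/B)·(1 + 1) = 4n/B I/Os, i.e., a two-pass sort.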
Procedure twoPassSort(M : N; a : external Array [0..n − 1] of Element)
  b : external Array [0..n − 1] of Element     // auxiliary storage
  formRuns(M, a, b)
  mergeRuns(M, b, a)

// Sort runs of size M from f, writing sorted runs to t
Procedure formRuns(M : N; f, t : external Array [0..n − 1] of Element)
  for i := 0 to n − 1 step M do
    run := f[i..i + M − 1]
    sort(run)                                  // sort internally
    t[i..i + M − 1] := run

[Figure: runs of M elements read from f at i = 0, M, 2M, …, sorted internally, written to t]
// Merge n elements from f to t where f stores sorted runs of size L
Procedure mergeRuns(L : N; f, t : external Array [0..n − 1] of Element)
  k := ⌈n/L⌉                                   // number of runs
  next : PriorityQueue of Key × N
  runBuffer : Array [0..k − 1][0..B − 1] of Element
  for i := 0 to k − 1 do                       // load the first block of each run
    runBuffer[i] := f[iL..iL + B − 1]
    next.insert((key(runBuffer[i][0]), iL))
  out : Array [0..B − 1] of Element

[Figure: k run buffers of B elements each feed the priority queue next; one block out collects the output for t]
// k-way merging
for i := 0 to n − 1 step B do
  for j := 0 to B − 1 do
    (x, ℓ) := next.deleteMin
    out[j] := runBuffer[ℓ div L][ℓ mod B]
    ℓ++
    if ℓ mod B = 0 then                        // new input block
      if ℓ mod L = 0 ∨ ℓ = n then
        runBuffer[ℓ div L][0] := ∞             // sentinel for exhausted run
      else runBuffer[ℓ div L] := f[ℓ..ℓ + B − 1]   // refill buffer
    next.insert((key(runBuffer[ℓ div L][ℓ mod B]), ℓ))
  write out to t[i..i + B − 1]                 // one output step
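The same k-way merging in plain C++ over in-memory runs (a sketch: std::priority_queue plays the role of next, and simple index bounds replace the external refill logic):

  #include <functional>
  #include <queue>
  #include <utility>
  #include <vector>

  // k-way merge of sorted in-memory runs via a min-priority queue
  // of (current key, run index) pairs.
  std::vector<int> kWayMerge(const std::vector<std::vector<int>>& runs) {
    using Entry = std::pair<int, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> next;
    std::vector<std::size_t> pos(runs.size(), 0);
    for (std::size_t i = 0; i < runs.size(); i++)
      if (!runs[i].empty()) next.push({runs[i][0], i});
    std::vector<int> out;
    while (!next.empty()) {
      auto [x, r] = next.top();
      next.pop();
      out.push_back(x);
      if (++pos[r] < runs[r].size())     // refill from the same run
        next.push({runs[r][pos[r]], r});
    }
    return out;
  }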
Example (B = 2, run size = 6)
"make things as simple as possible but no simpler"
f:  __aeghikmnst__aaeilmpsss__bbeilopssu__eilmnoprst
t:  ________aaabbeeeeghiiiiklllmmmnnooppprss
[Figure: snapshot of runBuffer, out, and the priority queue next in the middle of the merge]
2 Sorting with Parallel Disks
I/O step := access to a single physical block per disk
Theory: Balance Sort [Nodine Vitter 93]:
  deterministic, complex, asymptotically optimal (in the O(·) sense)
Independent disks [Vitter Shriver 94]:
  "usually" a factor 10(?) fewer I/Os, but not asymptotically optimal (42%?)
Basic approach: improve multiway merging
[Figure: internal memory of size M, block size B, D independent disks]
Striping
A logical block consists of D physical blocks, one per disk (an emulated disk).
That takes care of run formation and of writing the output.
But what about merging?
[Figure: logical blocks 1, 2, …, D striped over the disks; internal buffers; merging marked "???"]
Prediction [Folklore, Knuth]
The smallest element of each block triggers the fetch of the next block of its run.
The resulting prediction sequence controls the prefetch buffers,
which allow parallel access to the next blocks.
[Figure: runs on disk, prediction sequence, prefetch buffers feeding the internal merge buffers]
Warmup: Multihead Model [Aggarwal Vitter 88]
D prefetch buffers yield an optimal algorithm:
sort(n) := (2n/DB) · (1 + ⌈log_{M/B}(n/M)⌉) I/Os
[Figure: multihead model: one memory of size M, D heads, block size B; the prediction sequence controls prefetch buffers and internal buffers]
Bigger Prefetch Buffer
A prefetch buffer of size Dk gives good deterministic performance.
O(D) would yield an optimal algorithm. Possible?
[Figure: prediction sequence feeding Dk prefetch buffers in front of the internal buffers]
Randomized Cycling [Vitter Hutchinson 01]
Block i of stripe j goes to disk π_j(i) for a random permutation π_j.
Good for naive prefetching and Ω(D log D) buffers.
[Figure: prediction sequence, prefetch buffers, internal buffers]
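A sketch of the mapping (my illustration; I read π_j(i) as applying the stripe's permutation to i mod D):

  #include <algorithm>
  #include <numeric>
  #include <random>
  #include <vector>

  // Each stripe j gets its own random permutation pi[j] of the D disks,
  // fixed once; block i of stripe j is stored on disk pi[j][i mod D].
  struct RandomizedCycling {
    std::vector<std::vector<int>> pi;
    RandomizedCycling(int stripes, int D, unsigned seed = 42) {
      std::mt19937 gen(seed);
      pi.assign(stripes, std::vector<int>(D));
      for (auto& p : pi) {
        std::iota(p.begin(), p.end(), 0);      // 0, 1, ..., D-1
        std::shuffle(p.begin(), p.end(), gen); // independent permutation
      }
    }
    int disk(int j, int i) const {             // disk of block i of stripe j
      return pi[j][i % pi[j].size()];
    }
  };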
Buffered Writing [Sanders Egner Korst SODA 00, Hutchinson Sanders Vitter ESA 01]
A sequence of blocks Σ goes through a randomized mapping into W buffers,
organized as one queue of W/D buffers per disk:
write whenever one of the W buffers is free;
otherwise, output one block from each nonempty queue.
Theorem: this buffered writing is optimal.
But how good is optimal?
Theorem: randomized cycling achieves efficiency 1 − O(D/W).
Analysis: negative association of random variables,
application of queueing theory to a "throttled" algorithm.
Optimal Offline Prefetching
Theorem: For buffer size W:
∃ (offline) prefetching schedule for Σ with T input steps
⇔
∃ (online) write schedule for Σ^R with T output steps
[Figure: animation: Σ = a b c … r is read in input steps 8…1; run backwards through the buffer, the same schedule writes the reversed sequence Σ^R in output steps 1…8]
Synthesis
Multiway merging
+ prediction [60s folklore]
+ optimal (randomized) writing [Sanders Egner Korst SODA 2000]
+ randomized cycling [Vitter Hutchinson 2001]
+ optimal prefetching [Hutchinson Sanders Vitter ESA 2002]
⇒ (1 + o(1)) · sort(n) I/Os
"answers" a question in [Knuth 98]: difficulty 48 on a 1..50 scale.
We are not done yet!
Internal work
Overlapping I/O and computation
Reasonable hardware
Interfacing with the Operating System
Parameter Tuning
Software engineering
Pipelining
Key Sorting
The I/O bandwidth of our machine is about 1/3 of its main memory bandwidth.
If key size ≪ element size:
sort key-pointer pairs to save memory bandwidth during run formation.
[Figure: extract (key, pointer) pairs from the elements, sort the small pairs, then permute the full elements once]
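The extract / sort / permute steps from the figure as a C++ sketch (my illustration; it permutes into a temporary array rather than in place):

  #include <algorithm>
  #include <cstddef>
  #include <utility>
  #include <vector>

  // Sort big elements by a small key: extract (key, index) pairs,
  // sort the small pairs, then permute the big elements once.
  template <class Element, class GetKey>
  void keySort(std::vector<Element>& a, GetKey key) {
    using Key = decltype(key(a[0]));
    std::vector<std::pair<Key, std::size_t>> kp(a.size());
    for (std::size_t i = 0; i < a.size(); i++) kp[i] = {key(a[i]), i};  // extract
    std::sort(kp.begin(), kp.end());                                    // sort pairs
    std::vector<Element> out(a.size());
    for (std::size_t i = 0; i < a.size(); i++) out[i] = a[kp[i].second]; // permute
    a.swap(out);
  }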
Tournament Trees for Multiway Merging
Assume k = 2^K runs.
K-level complete binary tree.
Leaves: the smallest current element of each run.
Internal nodes: the loser of the competition for being smallest.
Above the root: the global winner.
deleteMin + insertNext replays the games on one leaf-to-root path.
[Figure: loser tree over the runs, before and after one deleteMin + insertNext]
Why Tournament Trees?
Exactly log k element comparisons.
Implicit layout in an array: simple index arithmetic (shifts).
Predictable load instructions and index computations.
(Unrollable) inner loop:

for (int i = (winnerIndex + kReg) >> 1; i > 0; i >>= 1) {
  currentPos = entry + i;
  currentKey = currentPos->key;
  if (currentKey < winnerKey) {   // previous winner loses here:
    currentIndex      = currentPos->index;   // store it at node i,
    currentPos->key   = winnerKey;           // the smaller key moves up
    currentPos->index = winnerIndex;
    winnerKey         = currentKey;
    winnerIndex       = currentIndex;
  }
}
Overlapping I/O and Computation
One thread for each disk (or asynchronous I/O)
Possibly additional threads
Blocks filled with elements are passed by reference between the different buffers
Overlapping During Run Formation
First post read requests for runs 1 and 2.
Thread A: loop { wait-read i; sort i; post-write i };
Thread B: loop { wait-write i; post-read i + 2 };
[Timeline: read, sort, and write for runs 1..k overlap; in the I/O-bound case the disks are busy throughout, in the compute-bound case the sorter is]
Overlapping During Merging
Bad example:
merging ⟨1^{B−1} 2, 3^{B−1} 4, 5^{B−1} 6, …⟩ with ⟨1^{B−1} 2, 3^{B−1} 4, 5^{B−1} 6, …⟩
Overlapping During Merging
I/O threads: writing has priority over reading.
[Figure: k merge buffers and k+3D overlap buffers feed the merger; O(D) prefetch buffers and 2D write buffers connect to the D disks (overlapping disk scheduling)]
Let y := # of elements merged during one I/O step
(y > DB: I/O bound; y ≤ DB: compute bound).
I/O-bound case: the I/O thread never blocks.
[State diagram: #elements in overlap+merge buffers (kB+DB .. kB+3DB) versus #elements in output buffers (0 .. 2DB), with fetch, output, and blocking-output transitions]
Compute-bound case: the merging thread never blocks.
[State diagram: fetch and output transitions between kB and kB+3DB elements in the overlap+merge buffers and up to 2DB in the output buffers, with block-I/O and block-merging regions]
Hardware (mid 2002)
2 × 2 GHz Xeon × 2 threads, 1 GB DDR RAM, Intel E7500 chipset, Linux
Several 66 MHz PCI buses (SuperMicro P4DPE3)
Several fast IDE controllers (4 × Promise Ultra100 TX2)
Many fast IDE disks (8 × IBM IC35L080AVVA07, 8 × 80 GB)
Bandwidths: 400×64 Mb/s (RAM), 2×64×66 Mb/s (PCI buses), 4×2×100 MB/s (controller channels), 8×45 MB/s (disks)
Cost-effective I/O bandwidth (real 360 MB/s for ≈ 3000 €)
Software Interface
Goals: efficient + simple + compatible

The layers of <stxxl>:
Applications
STL-user layer: containers (vector, stack, set, priority_queue, map); algorithms (sort, for_each, merge)
Streaming layer: pipelined sorting, zero-I/O scanning
Block management layer: typed block, block manager, buffered streams, block prefetcher, buffered block writer
Asynchronous I/O primitives layer: files, I/O requests, disk queues, completion handlers
Operating System
Default Measurement Parameters
t := number of available buffer blocks
Input size: 16 GByte
Element size: 128 byte
Keys: random 32-bit integers
Run size: 256 MByte
Block size B: 2 MByte
Compiler: g++ 3.2 -O6
Write buffers: w := max(t/4, 2D)
Prefetch buffers: 2D + (3/10)·(t − w − 2D)
Element Sizes (16 GByte, 8 disks)
[Plot: time in seconds (0..400) versus element size (16..1024 byte) for run formation, merging, and the I/O wait time in each phase]
parallel disks: bandwidth "for free"
internal work and overlapping are what matters
Earlier Academic Implementations
Single disk, at most 2 GByte; old measurements use an artificial M.
[Plot: sort time in seconds (0..800) versus element size (16..1024 byte) for LEDA-SM, TPIE, and comparison-based <stxxl>]
Earlier Academic Implementations: Multiple Disks
[Plot: sort time in seconds (26..616, log scale) versus element size (16..1024 byte) for LEDA-SM Soft-RAID, TPIE Soft-RAID, <stxxl> Soft-RAID, and <stxxl>]
What Are Good Block Sizes (8 disks)?
[Plot: sort time in ns/byte (12..26) versus block size (128 KByte..8 MByte) for 16 GBytes and for 128 GBytes with 1× and 2× merge]
B is not a technology constant
Optimal Versus Naive Prefetching
[Plot: difference of total merge time to naive prefetching in percent (−3..+1) versus the fraction of prefetch buffers (0..100%), for 48, 112, and 176 read buffers]
Impact of Prefetch and Overlap Buffers
[Plot: merging time in seconds (115..150) versus number of read buffers (16..172), no prefetch buffer versus a heuristic schedule]
Tradeoff: Write Buffer Size Versus Read Buffer Size
[Plot: merge time in seconds (110..145) versus number of read buffers (0..200)]
Scalability
[Plot: sort time in ns/byte versus input size (16..128 GByte) for 128-byte and 512-byte elements, compared with the 4N/DB bulk-I/O bound]
Discussion
Theory and practice harmonize.
No expensive server hardware (SCSI, …) necessary.
No need to work with an artificial M.
No 2/4 GByte limits.
Faster than academic implementations.
(Must be) as fast as commercial implementations, but with performance guarantees.
Blocks are much larger than often assumed; B is not a technology constant.
Parallel disks: bandwidth "for free", but don't neglect internal costs.
More Parallel Disk Sorting?
Pipelining: input does not come from disk but from a logical input stream; output goes to a logical output stream
  only half the I/Os for sorting, often no I/Os for scanning; todo: better overlapping
Parallelism: the only way to go for really many disks
Tuning and special cases: sss-sort, permutations, balance work between merging and run formation?…
Longer runs: not done with guaranteed overlapping, fast internal sorting!
Distribution sorting: better for seeks etc.?
In-place sorting: could also be faster
Determinism: a practical and theoretically efficient algorithm?
Procedure formLongRuns
  q, q′ : PriorityQueue
  for i := 1 to M do q.insert(readElement)
  invariant |q| + |q′| = M
  loop
    while q ≠ ∅ do
      writeElement(e := q.deleteMin)
      if input exhausted then break outer loop
      if (e′ := readElement) < e then q′.insert(e′)
      else q.insert(e′)
    q := q′; q′ := ∅
  output q in sorted order; output q′ in sorted order
Knuth: average run length 2M
todo: cache-efficient implementation
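A compact in-memory sketch of formLongRuns (replacement selection) with two std::priority_queues; runs are returned as vectors instead of being written out:

  #include <functional>
  #include <queue>
  #include <utility>
  #include <vector>

  // Replacement selection: elements too small for the current run wait in
  // qNext; on random input the average run length is 2M (Knuth).
  std::vector<std::vector<int>>
  formLongRuns(const std::vector<int>& in, std::size_t M) {
    using MinHeap =
        std::priority_queue<int, std::vector<int>, std::greater<int>>;
    MinHeap q, qNext;                     // invariant: |q| + |qNext| <= M
    std::size_t i = 0;
    while (i < in.size() && q.size() < M) q.push(in[i++]);
    std::vector<std::vector<int>> runs(1);
    while (!q.empty()) {
      int e = q.top();
      q.pop();
      runs.back().push_back(e);           // write the current minimum
      if (i < in.size()) {
        int next = in[i++];
        if (next < e) qNext.push(next);   // too small: defer to the next run
        else q.push(next);
      }
      if (q.empty() && !qNext.empty()) {  // current run ends, start the next
        std::swap(q, qNext);
        runs.emplace_back();
      }
    }
    return runs;
  }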