Sanders: Algorithm Engineering, June 8, 2006

1. Arrays, Linked Lists, and Derived Data Structures

Bounded Arrays
Built-in data structure. The size must be known from the start.

Unbounded Arrays, e.g. std::vector
pushBack: append an element
popBack: remove the last element
Idea: double the size when space runs out, halve it when too much space is wasted.
Done right, n pushBack/popBack operations take time O(n).
In algorithmese: pushBack/popBack have constant amortized complexity.
What can go wrong?

Doubly Linked Lists

Class Item of Element        // one link in a doubly linked list
  e : Element
  next : Handle
  prev : Handle
  invariant next→prev = prev→next = this

Trick: use a dummy header.
[Figure: circular doubly linked list threaded through a dummy item ⊥]

Procedure splice(a, b, t : Handle)
  assert b is not before a ∧ t ∉ ⟨a, …, b⟩
  // Cut out ⟨a, …, b⟩
  a′ := a→prev
  b′ := b→next
  a′→next := b′
  b′→prev := a′
  // Insert ⟨a, …, b⟩ after t
  t′ := t→next
  b→next := t′
  a→prev := t
  t→next := a
  t′→prev := b

Singly Linked Lists
Comparison with doubly linked lists:
  less memory, and space is often also time
  more restricted, e.g. no delete
  odd user interface, e.g. deleteAfter

Memory Management for Lists
Can easily cost 90% of the running time!
  Rather shift elements between (free)lists than do real mallocs.
  Allocate many items at once.
  Free everything at the end?
  Store "parasitically", e.g. graphs: node array, every node stores one ListItem;
  a partition of the nodes can then be stored as linked lists (MST, shortest paths).
Challenge: garbage collection, many data types; also a software engineering problem. Not here.

Example: Stack

                  SList        B-Array    U-Array
dynamic           +            −          +
space waste       pointers     too big?   too big? freeing?
time waste        cache miss              recopying (+)
worst case time   +            +          −

Is that it? Does every implementation have serious weaknesses?

The Best From Both Worlds
Hybrid: a linked list of array blocks of size B.
  dynamic           +
  space waste       n/B + B
  time waste        +
  worst case time   +

A Variant: Directory
[Figure: a directory array pointing to blocks of B elements each]
− reallocation in the top level is not worst-case constant time
+ indexed access to S[i] in constant time

Beyond Stacks
  stack: pushBack, popBack
  FIFO queue: pushBack, popFront
  deque: pushFront, pushBack, popFront, popBack
FIFO: bounded array −→ cyclic array (head h and tail t wrap around a buffer b)
Exercise: an array that supports "[i]" in constant time and inserting/removing elements in time O(√n).
Exercise: an external stack that supports n push/pop operations with O(n/B) I/Os.

Exercise: complete the table for hybrid data structures:

Operation    List  SList  UArray  CArray  explanation of '∗'
[·]          n     n      1       1
|·|          1∗    1∗     1       1       not with inter-list splice
first        1     1      1       1
last         1     1      1       1
insert       1     1∗     n       n       insertAfter only
remove       1     1∗     n       n       removeAfter only
pushBack     1     1      1∗      1∗      amortized
pushFront    1     1      n       1∗      amortized
popBack      1     n      1∗      1∗      amortized
popFront     1     1      n       1∗      amortized
concat       1     1      n       n
splice       1     1      n       n
findNext,…   n     n      n∗      n∗      cache efficient
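Picking up "The Best From Both Worlds": a minimal sketch of such a hybrid stack, a singly linked list of blocks of B elements each. Class name and block size are illustrative, not the lecture's code.

#include <cstddef>

template <class T, std::size_t B = 1024>
class HybridStack {
  struct Block { T a[B]; Block* next; };  // one array chunk plus a link
  Block* top = nullptr;                   // block holding the topmost element
  std::size_t i = 0;                      // number of used slots in *top
public:
  void push(const T& e) {
    if (top == nullptr || i == B) {       // current block full: link a new one
      Block* b2 = new Block;
      b2->next = top;
      top = b2;
      i = 0;
    }
    top->a[i++] = e;
  }
  T pop() {                               // precondition: !empty()
    T e = top->a[--i];
    if (i == 0) {                         // block drained: unlink and free it
      Block* old = top;
      top = old->next;
      delete old;
      i = (top == nullptr) ? 0 : B;       // every block below the top is full
    }
    return e;
  }
  bool empty() const { return top == nullptr; }
};

As written, alternating push/pop exactly at a block boundary triggers an allocation or deallocation every time; keeping one spare block instead of freeing immediately avoids that (cf. the memory management slide above).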
What Is Missing? Facts, Facts, Facts
Measurements for
  different implementation variants
  different architectures
  different input sizes
Effects on real applications
Curves for all of this
Interpretation, where appropriate theory building
Exercise: traversing an array versus a randomly allocated linked list.

Quicksort

Function quickSort(s : Sequence of Element) : Sequence of Element
  if |s| ≤ 1 then return s               // base case
  pick p ∈ s uniformly at random         // pivot key
  a := ⟨e ∈ s : e < p⟩
  b := ⟨e ∈ s : e = p⟩
  c := ⟨e ∈ s : e > p⟩
  return concatenate(quickSort(a), b, quickSort(c))

Engineering Quicksort
  array
  2-way comparisons
  sentinels for the inner loop
  inplace swaps
  recursion on the smaller subproblem → O(log n) extra space
  recursion cutoff for small (20–100) inputs, insertion sort (not one big insertion sort at the end)

Procedure qSort(a : Array of Element; ℓ, r : N)    // Sort a[ℓ..r]
  while r − ℓ ≥ n₀ do                    // Use divide-and-conquer
    j := pickPivotPos(a, ℓ, r)
    swap(a[ℓ], a[j])                     // Helps to establish the invariant
    p := a[ℓ]
    i := ℓ; j := r
    repeat                               // a: ℓ … i→ … ←j … r
      while a[i] < p do i++              // Scan over elements (A)
      while a[j] > p do j−−              // on the correct side (B)
      if i ≤ j then swap(a[i], a[j]); i++; j−−
    until i > j                          // Done partitioning
    if i < (ℓ + r)/2 then qSort(a, ℓ, j); ℓ := j + 1
    else qSort(a, i, r); r := i − 1
  insertionSort(a[ℓ..r])                 // faster for small r − ℓ

Picking Pivots Painstakingly — Theory
  probabilistically: expected ≈ 1.4 n log n element comparisons
  median of three: expected ≈ 1.2 n log n element comparisons
  perfect: → n log n element comparisons (approximate using large samples)
Practice: 3 GHz Pentium 4 Prescott, g++

Picking Pivots Painstakingly — Instructions
[Plot: instructions / n lg n versus log n for random pivot, median of 3, exact median, skewed pivot n/10, skewed pivot n/11]

Picking Pivots Painstakingly — Time
[Plot: nanoseconds / n lg n versus log n for the same five pivot strategies]

Picking Pivots Painstakingly — Branch Misses
[Plot: branch misses / n lg n versus log n for the same five pivot strategies]

Can We Do Better? Previous Work
Integer keys:
  + can be 2–3 times faster than quicksort
  − naive ones are cache inefficient and slower than quicksort
  − simple ones are distribution dependent
Cache efficient sorting: k-ary merge sort [Nyberg et al. 94, Arge et al. 04, Ranade et al. 00, Brodal et al. 04]
  + factor log k fewer cache faults
  − only ≈ 20% speedup, and only for laaarge inputs
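Before turning to sample sort: the qSort pseudocode above rendered as compilable C++, with median-of-three as pickPivotPos and the insertion sort inlined. A sketch assuming int keys, not the measured implementation.

#include <utility>

const int n0 = 32;                        // cutoff, somewhere in the 20..100 range

void qSort(int* a, int l, int r) {        // sorts a[l..r]
  while (r - l >= n0) {
    int m = l + (r - l) / 2;              // median of three as pivot
    if (a[m] < a[l]) std::swap(a[m], a[l]);
    if (a[r] < a[l]) std::swap(a[r], a[l]);
    if (a[r] < a[m]) std::swap(a[r], a[m]);
    std::swap(a[l], a[m]);                // pivot to the front: sentinel for both scans
    int p = a[l], i = l, j = r;
    do {
      while (a[i] < p) i++;               // scan over elements
      while (a[j] > p) j--;               // on the correct side
      if (i <= j) std::swap(a[i++], a[j--]);
    } while (i <= j);                     // done partitioning
    if (i < (l + r) / 2) { qSort(a, l, j); l = j + 1; }  // recurse on the smaller part
    else                 { qSort(a, i, r); r = i - 1; }
  }
  for (int k = l + 1; k <= r; k++) {      // insertion sort for the small rest
    int x = a[k], h = k;
    while (h > l && a[h - 1] > x) { a[h] = a[h - 1]; h--; }
    a[h] = x;
  }
}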
Sample Sort

Function sampleSort(e = ⟨e₁, …, e_n⟩, k)
  if n/k is "small" then return smallSort(e)
  let S = ⟨S₁, …, S_{ak−1}⟩ denote a random sample of e
  sort S
  ⟨s₀, s₁, …, s_{k−1}, s_k⟩ := ⟨−∞, S_a, S_{2a}, …, S_{(k−1)a}, ∞⟩
  for i := 1 to n do
    find j ∈ {1, …, k} such that s_{j−1} < e_i ≤ s_j    // binary search?
    place e_i in bucket b_j
  return concatenate(sampleSort(b₁), …, sampleSort(b_k))

[Figure: splitters s₁, …, s₇ defining the buckets]

Why Sample Sort?
Traditionally: parallelizable on coarse grained machines.
  + cache efficient, ≈ merge sort
  − binary search not much faster than merging
  − complicated memory management
Super Scalar Sample Sort:
  binary search −→ implicit search tree
  eliminate all conditional branches
  exploit instruction parallelism
  cache efficiency comes to bear
  "steal" memory management from radix sort

Classifying Elements

t := ⟨s_{k/2}, s_{k/4}, s_{3k/4}, s_{k/8}, s_{3k/8}, s_{5k/8}, s_{7k/8}, …⟩   // splitters as implicit search tree
for i := 1 to n do
  j := 1
  repeat log k times          // unroll
    j := 2j + (a_i > t_j)
  j := j − k + 1
  b_j++
  o(i) := j                   // oracle

[Figure: the splitter array viewed as a binary decision tree; every "> ?" decision doubles the index (·2) and adds the outcome (+1), ending at buckets 1, …, k]

Now the compiler should
  use predicated instructions
  interleave for-loop iterations (unrolling ∨ software pipelining)

template <class T>
void findOraclesAndCount(const T* const a, const int n, const int k,
                         const T* const s, Oracle* const oracle,
                         int* const bucket) {
  for (int i = 0; i < n; i++) {
    int j = 1;
    while (j < k) {
      j = j*2 + (a[i] > s[j]);
    }
    int b = j-k;
    bucket[b]++;
    oracle[i] = b;
  }
}

Predication
Hardware mechanism that allows instructions to be conditionally executed:
  Boolean predicate registers (1–64) hold condition codes
  predicate registers p are additional inputs of predicated instructions I
  at runtime, I is executed if and only if p is true
+ avoids the branch misprediction penalty
+ more flexible instruction scheduling
− switched-off instructions still take time
− longer opcodes
− complicated hardware design

Example (IA-64)
Translation of: if (r1 > r2) r3 := r3 + 4

With a conditional branch:        Via predication:
     cmp.gt p6,p7=r1,r2                cmp.gt p6,p7=r1,r2
(p7) br.cond .label               (p6) add r3=4,r3
     add r3=4,r3
.label: (...)

Other current architectures: conditional moves only.

Unrolling (k = 16)

template <class T>
void findOraclesAndCountUnrolled([...]) {
  for (int i = 0; i < n; i++) {
    int j = 1;
    j = j*2 + (a[i] > s[j]);
    j = j*2 + (a[i] > s[j]);
    j = j*2 + (a[i] > s[j]);
    j = j*2 + (a[i] > s[j]);
    int b = j-k;
    bucket[b]++;
    oracle[i] = b;
  }
}

More Unrolling (k = 16, n even)

template <class T>
void findOraclesAndCountUnrolled2([...]) {
  for (int i = n & 1; i < n; i += 2) {
    int j0 = 1;                 int j1 = 1;
    T ai0 = a[i];               T ai1 = a[i+1];
    j0 = j0*2 + (ai0 > s[j0]);  j1 = j1*2 + (ai1 > s[j1]);
    j0 = j0*2 + (ai0 > s[j0]);  j1 = j1*2 + (ai1 > s[j1]);
    j0 = j0*2 + (ai0 > s[j0]);  j1 = j1*2 + (ai1 > s[j1]);
    j0 = j0*2 + (ai0 > s[j0]);  j1 = j1*2 + (ai1 > s[j1]);
    int b0 = j0-k;              int b1 = j1-k;
    bucket[b0]++;               bucket[b1]++;
    oracle[i] = b0;             oracle[i+1] = b1;
  }
}
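The splitter array t above is just a binary search tree stored implicitly, heap-style. A sketch of how it could be built from k − 1 sorted splitters, together with the branchless descent; the function names are illustrative.

#include <cstddef>
#include <vector>

// place the median of s[lo..hi) at t[node], then recurse on both halves;
// this yields t[1] = s_{k/2}, t[2] = s_{k/4}, t[3] = s_{3k/4}, ...
void buildTree(const std::vector<int>& s, std::vector<int>& t,
               std::size_t node, std::size_t lo, std::size_t hi) {
  if (lo >= hi) return;
  std::size_t mid = lo + (hi - lo) / 2;
  t[node] = s[mid];
  buildTree(s, t, 2*node,     lo,      mid);
  buildTree(s, t, 2*node + 1, mid + 1, hi);
}

// branchless classification: returns the bucket index in 0..k-1
// (k a power of two, t built by buildTree(s, t, 1, 0, k-1))
int classify(int x, const std::vector<int>& t, int k) {
  int j = 1;
  while (j < k) j = 2*j + (x > t[j]);   // descend the implicit tree
  return j - k;
}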
Distributing Elements

for i := 1 to n do a′[B[o(i)]++] := a_i

Why oracles?
  simplifies memory management: no overflow tests or re-copying
  simplifies software pipelining: separates the computation and memory access aspects
  oracles are small (n bytes)
  sequential, predictable memory access that can be hidden using prefetching / write buffering
[Figure: classification writes the oracle array o; distribution then moves a to a′ guided by o and the bucket counters B]

template <class T>
void distribute(const T* const a0, T* const a1, const int n,
                const int k, const Oracle* const oracle,
                int* const bucket) {
  for (int i = 0, sum = 0; i < k; i++) {  // exclusive prefix sums over the bucket counts
    int t = bucket[i];
    bucket[i] = sum;
    sum += t;
  }
  for (int i = 0; i < n; i++) {
    a1[bucket[oracle[i]]++] = a0[i];
  }
}

Exercise: How does one implement 4-way merging (8-way merging) with 2n memory accesses and 2n (3n) comparisons on a machine with 8 (16) registers? Is a naive implementation with 3n (7n) more easily predictable comparisons faster in the end?

Experiments: 1.4 GHz Itanium 2
  restrict keyword from ANSI/ISO C99 to indicate nonaliasing
  Intel's C++ compiler v8.0 uses predicated instructions automatically
  profiling gives a 9% speedup
  k = 256 splitters
  use std::sort from g++ for small inputs (n ≤ 1000)!
  insertion sort for n ≤ 100
  random 32 bit integers in [0, 10⁹]

Comparison with Quicksort
[Plot: time / n log n in ns for GCC STL sort versus sss-sort, n = 2¹², …, 2²⁴]

Breakdown of Execution Time
[Plot: time / n log n for the total, findBuckets + distribute + smallSort, findBuckets + distribute, and findBuckets alone]

A More Detailed View

                             dynamic instr.  dynamic cycles  IPC (small n)  IPC (n = 2²⁵)
findBuckets, 1× outer loop         63              11             5.4            4.5
distribute, one element            14               4             3.5            0.8

Comparison with Quicksort, Pentium 4
[Plot: time / n log n for Intel STL, GCC STL, and sss-sort on a Pentium 4]
Problems: few registers, only one condition code, the compiler needs "help".

Breakdown of Execution Time, Pentium 4
[Plot: the same breakdown of execution time, on the Pentium 4]

Analysis

                               mem. acc.       branches  data dep.   I/Os          registers  instructions
k-way distribution:
  sss-sort                     n log k         O(1)      O(n)        3.5n/B        3×unroll   O(log k)
  quicksort (log k lvls.)      2n log k        n log k   O(n log k)  (2n/B) log k  4          O(1)
k-way merging:
  memory                       n log k         n log k   O(n log k)  2n/B          7          O(log k)
  register                     2n              n log k   O(n log k)  2n/B          k          O(k)
  funnel k′ (log_{k′} k lvls.) 2n log_{k′} k   n log k   O(n log k)  2n/B          2k′ + 2    O(k′)

Conclusions
  sss-sort is up to twice as fast as quicksort on the Itanium
  comparisons ≠ conditional branches
  algorithm analysis is not just instructions and caches

Criticism I
Why only random keys?
Answer I
Sample sort is hardly dependent on the distribution of the input.

Criticism I′
But what if there are many equal keys? They all land in one bucket.
Answer I′
It's not a bug, it's a feature: s_i = s_{i+1} = ⋯ = s_j is an indicator for a frequent key!
Set s_i := max {x ∈ Key : x < s_i} (optional: remove s_{i+2}, …, s_j).
Now bucket i + 1 no longer needs to be sorted!
todo: implement
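One way the suggested trick could look for integer keys. adjustForDuplicates and the sorted flags are hypothetical names, and s[i] − 1 stands in for max {x ∈ Key : x < s[i]}.

#include <vector>

// s = splitters s[0..k-2]; sorted has one flag per bucket, i.e. s.size()+1
// entries; sorted[j] == true means bucket j needs no recursive sorting.
void adjustForDuplicates(std::vector<int>& s, std::vector<bool>& sorted) {
  for (std::size_t i = 0; i + 1 < s.size(); i++) {
    if (s[i] == s[i + 1]) {
      s[i] = s[i] - 1;        // predecessor key (integer keys assumed, no underflow check)
      sorted[i + 1] = true;   // bucket i+1 now holds exactly the frequent key
    }
  }
}

A run of three or more equal splitters degenerates into empty buckets between the marked ones, which is harmless for the classification with ≤.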
Criticism II
Quicksort is inplace.
Answer II
With a hybrid list-array representation of sequences, k-way sample sort needs only O(√(kn)) extra space.

Criticism II′
But I want arrays for input and output.
Answer II′
Inplace conversion. Input: easy. Output: tricky.
Exercise: develop a rough idea. Hint: permutation of blocks; every permutation is a product of cyclic permutations; an inplace cyclic permutation is easy.

Future Work
  high level fine-tuning, e.g. a clever choice of k
  other architectures, e.g. Opteron, PowerPC
  an almost in-place implementation
  a multilevel cache-aware or cache-oblivious generalization (oracles help)
Student project, (diploma thesis), student assistant job?

Sorting by Multiway Merging
  Sort ⌈n/M⌉ runs with M elements each:   2n/B I/Os
  Merge M/B runs at a time:               2n/B I/Os
  × ⌈log_{M/B}(n/M)⌉ merge phases until only one run is left.
In total: sort(n) := (2n/B)·(1 + ⌈log_{M/B}(n/M)⌉) I/Os
[Figure: machine model with registers and ALU, a freely programmable fast memory of capacity M, and a large external memory accessed in blocks of B words]

Procedure twoPassSort(M : N; a : external Array [0..n−1] of Element)
  b : external Array [0..n−1] of Element    // auxiliary storage
  formRuns(M, a, b)
  mergeRuns(M, b, a)

// Sort runs of size M from f, writing sorted runs to t
Procedure formRuns(M : N; f, t : external Array [0..n−1] of Element)
  for i := 0 to n − 1 step M do
    run := f[i..i + M − 1]
    sort(run)                               // internal sorting
    t[i..i + M − 1] := run

// Merge n elements from f to t, where f stores sorted runs of size L
Procedure mergeRuns(L : N; f, t : external Array [0..n−1] of Element)
  k := ⌈n/L⌉                                // number of runs
  next : PriorityQueue of Key × N
  runBuffer := Array [0..k−1][0..B−1] of Element
  for i := 0 to k − 1 do
    runBuffer[i] := f[iL..iL + B − 1]       // first block of run i
    next.insert((key(runBuffer[i][0]), iL))
  out : Array [0..B−1] of Element

  for i := 0 to n − 1 step B do             // k-way merging
    for j := 0 to B − 1 do
      (x, ℓ) := next.deleteMin
      out[j] := runBuffer[ℓ div L][ℓ mod B]
      ℓ++
      if ℓ mod B = 0 then                   // new input block needed
        if ℓ mod L = 0 ∨ ℓ = n then
          runBuffer[ℓ div L][0] := ∞        // sentinel for an exhausted run
        else
          runBuffer[ℓ div L] := f[ℓ..ℓ + B − 1]   // refill the buffer
      next.insert((key(runBuffer[ℓ div L][ℓ mod B]), ℓ))
    write out to t[i..i + B − 1]            // one output step

Example (B = 2, run size = 6 blocks)
The characters of "make things as simple as possible but no simpler" are sorted:
[Figure: four sorted runs ⟨__aeghikmnst⟩, ⟨__aaeilmpsss⟩, ⟨__bbeilopssu⟩, ⟨__eilmnoprst⟩ are 4-way merged through the run buffers, the priority queue next, and the output buffer out; the output so far reads ⟨________aaabbeeeeghiiiiklllmmmnnooppprss⟩]
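In internal memory the same k-way merging can be sketched with std::priority_queue standing in for next and whole in-memory runs standing in for the run buffers; a sketch, not the external implementation.

#include <queue>
#include <utility>
#include <vector>

std::vector<int> kWayMerge(const std::vector<std::vector<int>>& runs) {
  // entry = (key, (run index, position within run)), smallest key first
  using Entry = std::pair<int, std::pair<std::size_t, std::size_t>>;
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> next;
  for (std::size_t r = 0; r < runs.size(); r++)
    if (!runs[r].empty()) next.push({runs[r][0], {r, 0}});
  std::vector<int> out;
  while (!next.empty()) {
    auto [key, pos] = next.top();      // deleteMin
    next.pop();
    auto [r, i] = pos;
    out.push_back(key);
    if (i + 1 < runs[r].size())        // insert the successor from the same run
      next.push({runs[r][i + 1], {r, i + 1}});
  }
  return out;
}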
2. Sorting with Parallel Disks

An I/O step := an access to a single physical block per disk.
Theory:
  Balance Sort [Nodine Vitter 93]: deterministic, complex, asymptotically optimal (with O(·) constants)
  independent disks [Vitter Shriver 94]: "usually" a factor 10(?) fewer I/Os, but not asymptotically optimal (42%)
Basic approach here: improve multiway merging.
[Figure: parallel disk model with internal memory M, block size B, and disks 1, 2, …, D]

Striping
[Figure: an emulated disk whose logical blocks consist of D physical blocks, one per disk]
Striping takes care of run formation and of writing the output. But what about merging?

Prediction [Folklore, Knuth]
The smallest element of each block triggers its fetch; the resulting prediction sequence controls the prefetching. Prefetch buffers allow parallel access to the next blocks needed.
[Figure: merging from the disks through prefetch buffers into the internal buffers, steered by the prediction sequence]

Warmup: Multihead Model [Aggarwal Vitter 88]
In the multihead model, D prefetch buffers yield an optimal algorithm:
sort(n) := (2n/DB)·(1 + ⌈log_{M/B}(n/M)⌉) I/Os

Bigger Prefetch Buffers
Dk prefetch buffers: good deterministic performance.
O(D) prefetch buffers would yield an optimal algorithm. Possible?
[Figure: k merge streams feeding Dk prefetch buffers and the internal buffers]

Randomized Cycling [Vitter Hutchinson 01]
Block i of stripe j goes to disk π_j(i) for a random permutation π_j.
Good for naive prefetching and Ω(D log D) prefetch buffers.

Buffered Writing [Sanders Egner Korst SODA 00, Hutchinson Sanders Vitter ESA 01]
A sequence of blocks Σ is written via a randomized mapping through W buffers, organized as one queue of size W/D per disk.
Theorem: the following strategy is optimal: whenever one of the W buffers is free, admit the next block of Σ; in every output step, write one block from each nonempty queue.
But how good is optimal?
Theorem: randomized cycling achieves efficiency 1 − O(D/W).
Analysis: negative association of random variables, application of queueing theory to a "throttled" algorithm.

Optimal Offline Prefetching
Theorem: For buffer size W:
∃ an (offline) prefetching schedule for Σ with T input steps
⇔ ∃ an (online) write schedule for the reversed sequence Σ^R with T output steps.
[Animation over several slides: the blocks ⟨a, b, …, r⟩ of Σ are fetched in input steps 1, …, 8; read backwards, the same buffer contents form a write schedule for Σ^R, the order of reading becoming the order of writing]
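The write side of the duality is easy to simulate. A toy model with a shared pool of W buffers (the per-disk queues of size W/D from the previous slide are one way to organize such a pool); by the theorem, the number of output steps this greedy writer needs on Σ^R equals the optimal number of input steps for prefetching Σ. All modeling details here are assumptions of the sketch.

#include <cstdio>
#include <vector>

// disk[b] = disk holding block b of the (reversed) sequence
int greedyWriteSteps(const std::vector<int>& disk, int D, int W) {
  std::vector<int> queued(D, 0);  // buffered blocks waiting per disk
  int buffered = 0, steps = 0;
  std::size_t next = 0;           // next block of the sequence to admit
  while (next < disk.size() || buffered > 0) {
    // admit blocks in sequence order while one of the W buffers is free
    while (next < disk.size() && buffered < W) {
      queued[disk[next++]]++;
      buffered++;
    }
    // one output step: every disk with a queued block writes one block
    for (int d = 0; d < D; d++)
      if (queued[d] > 0) { queued[d]--; buffered--; }
    steps++;
  }
  return steps;
}

int main() {
  std::vector<int> sigmaR = {0, 1, 0, 0, 2, 1, 0, 2};  // D = 3 disks
  std::printf("%d output steps\n", greedyWriteSteps(sigmaR, 3, 4));
}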
Synthesis
Multiway merging
  + prediction [60s folklore]
  + optimal (randomized) writing [Sanders Egner Korst SODA 2000]
  + randomized cycling [Vitter Hutchinson 2001]
  + optimal prefetching [Hutchinson Sanders Vitter ESA 2002]
yields sorting with (1 + o(1))·sort(n) I/Os.
This "answers" a question in [Knuth 98] of difficulty 48 on a 1..50 scale.

We Are Not Done Yet!
  internal work
  overlapping I/O and computation
  reasonable hardware
  interfacing with the operating system
  parameter tuning
  software engineering
  pipelining

Key Sorting
The I/O bandwidth of our machine is about 1/3 of its main memory bandwidth.
If key size ≪ element size, sort key-pointer pairs to save memory bandwidth during run formation.
[Figure: extract (key, position) pairs from the elements, sort the pairs, then permute the elements accordingly]
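A minimal sketch of the key sorting idea above, with 128-byte elements (4-byte key plus 124-byte payload, matching the element size used in the experiments later); the struct and function are illustrative.

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct Element { std::uint32_t key; char payload[124]; };  // 128-byte element

std::vector<Element> sortByKeyPointer(const std::vector<Element>& a) {
  std::vector<std::pair<std::uint32_t, std::uint32_t>> kp(a.size());
  for (std::size_t i = 0; i < a.size(); i++)               // extract
    kp[i] = {a[i].key, static_cast<std::uint32_t>(i)};
  std::sort(kp.begin(), kp.end());                         // sort the small pairs
  std::vector<Element> out(a.size());
  for (std::size_t i = 0; i < a.size(); i++)               // permute once
    out[i] = a[kp[i].second];
  return out;
}

Only the 8-byte pairs travel through the sorter; each 128-byte element is touched once for extraction and once for the final permutation.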
Tournament Trees for Multiway Merging
Assume k = 2^K runs.
  K-level complete binary tree
  leaves: the smallest current element of each run
  internal nodes: the loser of a "competition" for being smallest
  above the root: the global winner
deleteMin + insertNext replays the competitions along one leaf-root path.
[Figure: a tournament tree over 8 runs before and after deleteMin + insertNext]

Why Tournament Trees?
  exactly log k element comparisons
  implicit layout in an array: simple index arithmetic (shifts)
  predictable load instructions and index computations
(Unrollable) inner loop:

for (int i = (winnerIndex + kReg) >> 1; i > 0; i >>= 1) {
  currentPos = entry + i;
  currentKey = currentPos->key;
  if (currentKey < winnerKey) {
    currentIndex      = currentPos->index;
    currentPos->key   = winnerKey;
    currentPos->index = winnerIndex;
    winnerKey         = currentKey;
    winnerIndex       = currentIndex;
  }
}

Overlapping I/O and Computation
  one thread for each disk (or asynchronous I/O)
  possibly additional threads
  blocks filled with elements are passed by reference between the different buffers

Overlapping During Run Formation
First post read requests for runs 1 and 2, then:
  Thread A: loop { wait-read i; sort i; post-write i }
  Thread B: loop { wait-write i; post-read i + 2 }
[Figure: timelines of both threads for runs 1, …, k in the I/O bound and in the compute bound case]

Overlapping During Merging
Bad example: merging two copies of
⟨1^{B−1} 2  3^{B−1} 4  5^{B−1} 6  ⋯⟩
[Figure: merging architecture with k merge streams, k + 3D overlap buffers, O(D) prefetch buffers, and 2D write buffers; in the I/O threads' disk scheduling, writing has priority over reading]

Let y := the number of elements merged during one I/O step.
I/O bound case (y > DB): the I/O thread never blocks.
[State diagram: number of elements in the overlap + merge buffers versus number of elements in the output buffers, with fetch, output, and blocking transitions]
Compute bound case (y ≤ DB): the merging thread never blocks.
[State diagram for the compute bound case, with block I/O and block merging transitions]

Hardware (mid 2002)
Cost-effective I/O bandwidth: real 360 MB/s for ≈ 3000 €.
  2× Intel Xeon (2 × 2 GHz, 4 threads), 1 GB DDR RAM (400 × 64 Mb/s), Intel E7500 chipset
  several 66 MHz PCI buses (2 × 64 × 66 Mb/s; SuperMicro P4DPE3)
  several fast IDE controllers (4 × Promise Ultra100 TX2, 4 × 2 × 100 MB/s)
  many fast IDE disks (8 × IBM IC35L080AVVA07, 8 × 45 MB/s, 8 × 80 GB)
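Before moving on to the software interface: a minimal sketch of the double-buffered run formation from "Overlapping During Run Formation", with a second thread standing in for asynchronous I/O and in-memory vectors standing in for the external arrays.

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

std::vector<std::vector<int>> formRunsOverlapped(const std::vector<int>& f,
                                                 std::size_t M) {
  auto readRun = [&](std::size_t i, std::vector<int>& buf) {  // stand-in for wait-read
    buf.assign(f.begin() + std::min(i, f.size()),
               f.begin() + std::min(i + M, f.size()));
  };
  std::vector<std::vector<int>> runs;    // stand-in for the output disk
  std::vector<int> cur, next;
  readRun(0, cur);
  for (std::size_t i = 0; i < f.size(); i += M) {
    std::thread io([&] { readRun(i + M, next); });  // read run i+1 ...
    std::sort(cur.begin(), cur.end());              // ... while sorting run i
    io.join();
    runs.push_back(cur);                            // stand-in for post-write
    cur.swap(next);
  }
  return runs;
}

Writing could be overlapped the same way; with real disks the two helper lambdas become posted I/O requests as in the thread A / thread B scheme above.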
Software Interface
Goals: efficient + simple + compatible.
[Layer diagram of the <stxxl> library:
  Applications
  on top of the STL-user layer (containers: vector, stack, priority_queue, map, set; algorithms: sort, for_each, merge) and the Streaming layer (pipelined sorting, zero-I/O scanning)
  on top of the Block management layer (typed block, block manager, buffered streams, block prefetcher, buffered block writer)
  on top of the Asynchronous I/O primitives layer (files, I/O requests, disk queues, completion handlers)
  on top of the Operating System]
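A hypothetical use of the STL-user layer. The header names and the exact stxxl::sort signature (a comparator providing min_value/max_value sentinels, plus an internal memory budget) are quoted from memory and may differ between <stxxl> versions; treat this as a sketch, not reference documentation.

#include <limits>
#include <stxxl/vector>
#include <stxxl/sort>

struct CmpLess {
  bool operator()(const int& a, const int& b) const { return a < b; }
  int min_value() const { return std::numeric_limits<int>::min(); }  // sentinels
  int max_value() const { return std::numeric_limits<int>::max(); }  // for stxxl
};

int main() {
  stxxl::vector<int> v;                    // external memory vector
  for (int i = 0; i < (1 << 26); i++)      // fill with scrambled data
    v.push_back(static_cast<int>((i * 2654435761u) >> 1));
  stxxl::sort(v.begin(), v.end(), CmpLess(), 256 * 1024 * 1024);  // 256 MB of RAM
  return 0;
}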
Default Measurement Parameters
Let t := the number of available buffer blocks and w := the number of write buffers.
  input size: 16 GByte
  element size: 128 byte
  keys: random 32 bit integers
  run size: 256 MByte
  block size B: 2 MByte
  compiler: g++ 3.2 -O6
  write buffers: w = max(t/4, 2D)
  prefetch buffers: 2D + (3/10)·(t − w − 2D)

Element Sizes (16 GByte, 8 disks)
[Plot: time for run formation, merging, and the I/O wait in both phases versus element size 16, …, 1024 byte]
Parallel disks give the bandwidth "for free"; internal work and overlapping are what remains relevant.

Earlier Academic Implementations
Single disk, at most 2 GByte; the old measurements use an artificial M.
[Plot: sort time versus element size for LEDA-SM, TPIE, and <stxxl>, comparison based]

Earlier Academic Implementations: Multiple Disks
[Plot: sort time versus element size for LEDA-SM Soft-RAID, TPIE Soft-RAID, <stxxl> Soft-RAID, and native <stxxl>]

What Are Good Block Sizes (8 disks)?
[Plot: sort time versus block size 128 KByte, …, 8192 KByte for 16 GBytes and for 128 GBytes with one and two merge phases]
B is not a technology constant.

Optimal Versus Naive Prefetching
[Plot: difference in total merge time to naive prefetching in %, for 48, 112, and 176 read buffers, versus the fraction of prefetch buffers]

Impact of Prefetch and Overlap Buffers
[Plot: merging time versus the number of read buffers, with no prefetch buffers versus a heuristic schedule]

Tradeoff: Write Buffer Size Versus Read Buffer Size
[Plot: merge time versus the number of read buffers]

Scalability
[Plot: sort time in ns/byte versus input size 16, …, 128 GByte for 128-byte and 512-byte elements, compared with the 4N/DB bulk I/Os bound]

Discussion
  Theory and practice harmonize.
  No expensive server hardware (SCSI, …) necessary.
  No need to work with an artificial M.
  No 2/4 GByte limits.
  Faster than the earlier academic implementations.
  (Must be) as fast as commercial implementations, but with performance guarantees.
  Blocks are much larger than often assumed; B is not a technology constant.
  Parallel disks: bandwidth "for free".
  Don't neglect internal costs.
More Parallel Disk Sorting?
Pipelining: the input does not come from disk but from a logical input stream; the output goes to a logical output stream.
  only half the I/Os for sorting
  often no I/Os at all for scanning
  todo: better overlapping
Parallelism: the only way to go for really many disks.
Tuning and special cases: sss-sort, permutations, balancing the work between merging and run formation? …
Longer runs: not yet done with guaranteed overlapping and fast internal sorting!
Distribution sorting: better for seeks etc.?
Inplace sorting: could also be faster.
Determinism: a practical and theoretically efficient algorithm?

Procedure formLongRuns
  q, q′ : PriorityQueue
  for i := 1 to M do q.insert(readElement)
  invariant |q| + |q′| = M
  loop
    while q ≠ ∅ do
      writeElement(e := q.deleteMin)
      if input exhausted then break outer loop
      if (e′ := readElement) < e then q′.insert(e′)
      else q.insert(e′)
    q := q′; q′ := ∅
  output q in sorted order; output q′ in sorted order

Knuth: average run length 2M.
todo: a cache-efficient implementation
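The pseudocode above, sketched with two binary heaps and in-memory stand-ins for the external read and write operations; a starting point, not yet the cache-efficient implementation asked for.

#include <algorithm>
#include <functional>
#include <queue>
#include <vector>

std::vector<std::vector<int>> formLongRuns(const std::vector<int>& in,
                                           std::size_t M) {
  using MinHeap =
      std::priority_queue<int, std::vector<int>, std::greater<int>>;
  MinHeap q, q2;                    // current run / next run, |q| + |q2| <= M
  std::size_t next = 0;
  for (; next < std::min(M, in.size()); next++) q.push(in[next]);
  std::vector<std::vector<int>> runs;
  runs.emplace_back();
  while (!q.empty() || !q2.empty()) {
    if (q.empty()) {                // current run finished: start the next one
      std::swap(q, q2);
      runs.emplace_back();
    }
    int e = q.top(); q.pop();
    runs.back().push_back(e);       // writeElement(q.deleteMin)
    if (next < in.size()) {
      int e2 = in[next++];          // readElement
      if (e2 < e) q2.push(e2);      // too small for the current run
      else q.push(e2);
    }
  }
  return runs;                      // random input: average run length ~ 2M
}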