Sanders: Algorithm Engineering, 18 May 2006

Arrays, Linked Lists, and Derived Data Structures

Bounded Arrays
Built-in data structure. The size must be known from the start.

Unbounded Arrays (e.g. std::vector)
pushBack: append an element
popBack: remove the last element
Idea: double the capacity when space runs out, halve it when too much space is wasted.
Done correctly, n pushBack/popBack operations take time O(n).
In algorithmese: pushBack/popBack have constant amortized complexity.
What can go wrong? (For example, halving exactly when the array becomes half full: then pushes and pops alternating at the boundary each trigger a Θ(n) reallocation.)

Doubly Linked Lists

Class Item of Element   // one link in a doubly linked list
  e : Element
  next : Handle
  prev : Handle
  invariant next→prev = prev→next = this

Trick: use a dummy header item, so the list is cyclic and no special cases are needed for empty lists or list ends.

Procedure splice(a, b, t : Handle)
  assert b is not before a ∧ t ∉ ⟨a, …, b⟩
  // Cut out ⟨a, …, b⟩
  a′ := a→prev
  b′ := b→next
  a′→next := b′
  b′→prev := a′
  // Insert ⟨a, …, b⟩ after t
  t′ := t→next
  b→next := t′
  a→prev := t
  t→next := a
  t′→prev := b

Singly Linked Lists
Comparison with doubly linked lists:
+ less memory, and space is often also time
− more restricted, e.g. no delete given only a handle
− odd user interface, e.g. deleteAfter

Memory Management for Lists
Can easily cost 90% of the running time!
Prefer moving elements between (free)lists over real mallocs.
Allocate many items at once.
Free everything at the end?
Store "parasitically", e.g. for graphs: a node array.
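The doubling/halving idea above can be sketched as follows. This is a minimal illustration, not std::vector's actual implementation; note that it halves only when the array drops to a quarter full, which is one correct answer to "what can go wrong?".

```cpp
#include <cassert>
#include <cstddef>

// Minimal unbounded-array sketch: grow by doubling, shrink by halving
// only when the array is one quarter full, so n pushBack/popBack
// operations cost O(n) overall (constant amortized time each).
template <class T>
class UArray {
    T*     data_ = nullptr;
    size_t size_ = 0, cap_ = 0;
    void reallocate(size_t newCap) {
        T* d = new T[newCap];
        for (size_t i = 0; i < size_; ++i) d[i] = data_[i];
        delete[] data_;
        data_ = d;
        cap_  = newCap;
    }
public:
    ~UArray() { delete[] data_; }
    void pushBack(const T& x) {
        if (size_ == cap_) reallocate(cap_ ? 2 * cap_ : 1); // double
        data_[size_++] = x;
    }
    void popBack() {
        --size_;
        if (cap_ > 1 && size_ <= cap_ / 4) reallocate(cap_ / 2); // halve lazily
    }
    T& operator[](size_t i) { return data_[i]; }
    size_t size() const { return size_; }
    size_t capacity() const { return cap_; }
};
```

Halving at a quarter (rather than at a half) guarantees that every reallocation is preceded by Ω(capacity) cheap operations, which is what makes the amortized bound work.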
Each node stores a ListItem; a partition of the nodes can then be stored as linked lists (MST, shortest paths).
Challenge: garbage collection, many data types; also a software-engineering problem, not covered here.

Example: Stack
Compare a singly linked list (SList), a bounded array (B-Array), and an unbounded array (U-Array):

                 SList       B-Array    U-Array
dynamic           +            −          +
space waste     pointers    too big?    too big?
time waste     cache miss      −       recopying
worst-case time   +            +          −
(And: who frees the memory?)
Is that it? Does every implementation have serious weaknesses?

The Best From Both Worlds
Hybrid: a linked list of array blocks of size B.
dynamic: +, space waste: n/B + B, time waste: +, worst-case time: +

A Variant: Directory
A directory (array of pointers) of blocks of B elements each.
− reallocation at the top level; time not worst-case constant
+ indexed access to S[i] in constant time

Beyond Stacks
stack: pushBack, popBack
FIFO queue: pushBack, popFront
deque: pushFront, pushBack, popFront, popBack
FIFO: bounded array −→ cyclic array
Exercise: an array that supports "[i]" in constant time and insertion/deletion of elements in time O(√n).
Exercise: an external stack that supports n push/pop operations with O(n/B) I/Os.

Exercise: complete the table for hybrid data structures.

Operation   List  SList  UArray  CArray   explanation of '∗'
[·]          n      n      1       1
|·|          1∗     1∗     1       1      not with inter-list splice
first        1      1      1       1
last         1      1      1       1
insert       1      1∗     n       n      insertAfter only
remove       1      1∗     n       n      removeAfter only
pushBack     1      1      1∗      1∗     amortized
pushFront    1      1      n       1∗     amortized
popBack      1      n      1∗      1∗     amortized
popFront     1      1      n       1∗     amortized
concat       1      1      n       n
splice       1      1      n       n
findNext,…   n      n      n∗      n∗     cache efficient

What Is Missing? Facts, Facts, Facts
Measurements for different implementation variants, different architectures, different input sizes; impact on real applications. Plots of the results. Interpretation and, where appropriate, theory building.
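The hybrid "best from both worlds" stack can be sketched as a singly linked list of fixed-size array blocks. The class name and block size are illustrative choices, not from the lecture; the point is the cost profile: one pointer per B elements of overhead, at most one partially filled block, and constant worst-case time per operation.

```cpp
#include <cassert>
#include <cstddef>

// Hybrid stack sketch: a singly linked list of array blocks of size B.
// Space overhead: one pointer per B elements plus at most one partially
// filled block (n/B + B). No operation ever copies more than one block,
// so push and pop run in constant worst-case time.
template <class T, int B = 1024>
class HybridStack {
    struct Block { T elems[B]; Block* below; };
    Block* top_  = nullptr; // block holding the topmost element
    int    fill_ = 0;       // number of used slots in top_
    size_t size_ = 0;
public:
    ~HybridStack() {
        while (top_) { Block* b = top_->below; delete top_; top_ = b; }
    }
    void push(const T& x) {
        if (!top_ || fill_ == B) {          // current block full:
            Block* b = new Block;           // chain a fresh one on top
            b->below = top_; top_ = b; fill_ = 0;
        }
        top_->elems[fill_++] = x; ++size_;
    }
    T pop() {
        T x = top_->elems[--fill_]; --size_;
        if (fill_ == 0) {                   // top block empty: drop it;
            Block* b = top_->below;         // the block below is full
            delete top_; top_ = b;
            fill_ = top_ ? B : 0;
        }
        return x;
    }
    size_t size() const { return size_; }
};
```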
Exercise: traverse an array versus a randomly allocated linked list.

Quicksort

Function quickSort(s : Sequence of Element) : Sequence of Element
  if |s| ≤ 1 then return s               // base case
  pick p ∈ s uniformly at random         // pivot key
  a := ⟨e ∈ s : e < p⟩
  b := ⟨e ∈ s : e = p⟩
  c := ⟨e ∈ s : e > p⟩
  return concatenate(quickSort(a), b, quickSort(c))

Engineering Quicksort
- arrays
- 2-way comparisons
- sentinels for the inner loop
- in-place swaps
- recurse on the smaller subproblem → O(log n) extra space
- stop the recursion for small inputs (20–100 elements) and use insertion sort (many small insertion sorts, not one big one)

Procedure qSort(a : Array of Element; ℓ, r : ℕ)   // Sort a[ℓ..r]
  while r − ℓ ≥ n₀ do                    // Use divide-and-conquer
    j := pickPivotPos(a, ℓ, r)
    swap(a[ℓ], a[j])                     // Helps to establish the invariant
    p := a[ℓ]
    i := ℓ; j := r
    repeat                               // a: ℓ … i→ … ←j … r
      while a[i] < p do i++              // Scan over elements (A)
      while a[j] > p do j−−              // on the correct side (B)
      if i ≤ j then swap(a[i], a[j]); i++; j−−
    until i > j                          // Done partitioning
    if i < (ℓ + r)/2 then qSort(a, ℓ, j); ℓ := j + 1
    else qSort(a, i, r); r := i − 1
  insertionSort(a[ℓ..r])                 // faster for small r − ℓ

Picking Pivots Painstakingly — Theory
- random pivot: expected 1.4 n log n element comparisons
- median of three: expected 1.2 n log n element comparisons
- perfect median: −→ n log n element comparisons (approximate it using large samples)

Practice: 3 GHz Pentium 4 Prescott, g++

Picking Pivots Painstakingly — Instructions
[Plot: instructions / n lg n over n = 2^10 … 2^26 for random pivot, median of 3, exact median, and skewed pivots n/10 and n/11.]
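The qSort procedure above can be made runnable as follows. This is a sketch under stated assumptions: median-of-three stands in for pickPivotPos, and n0 = 16 is an illustrative cutoff; the structure (in-place partitioning, recursing only on the smaller side, insertion sort for small ranges) follows the slides.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Engineered quicksort sketch: in-place Hoare-style partitioning,
// recursion on the smaller side only (O(log n) stack depth), and an
// insertion sort finishing each small range.
const int n0 = 16;  // illustrative recursion cutoff

void insertionSort(std::vector<int>& a, int l, int r) {
    for (int i = l + 1; i <= r; ++i) {
        int x = a[i], j = i - 1;
        while (j >= l && a[j] > x) { a[j + 1] = a[j]; --j; }
        a[j + 1] = x;
    }
}

void qSort(std::vector<int>& a, int l, int r) {
    while (r - l >= n0) {
        int m = l + (r - l) / 2;                  // median of three → a[l];
        if (a[m] < a[l]) std::swap(a[m], a[l]);   // a[l] then serves as the
        if (a[r] < a[l]) std::swap(a[r], a[l]);   // left sentinel, a[r] as
        if (a[r] < a[m]) std::swap(a[r], a[m]);   // the right one
        std::swap(a[l], a[m]);
        int p = a[l], i = l, j = r;
        do {                                      // partition a[l..r]
            while (a[i] < p) ++i;                 // scan from the left
            while (a[j] > p) --j;                 // scan from the right
            if (i <= j) { std::swap(a[i], a[j]); ++i; --j; }
        } while (i <= j);
        if (i < (l + r) / 2) { qSort(a, l, j); l = j + 1; } // recurse on
        else                 { qSort(a, i, r); r = i - 1; } // the smaller side
    }
    insertionSort(a, l, r);
}
```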
Picking Pivots Painstakingly — Time
[Plot: nanoseconds / n lg n over n = 2^10 … 2^26 for the same five pivot strategies.]

Picking Pivots Painstakingly — Branch Misses
[Plot: branch misses / n lg n over n = 2^10 … 2^26 for the same five pivot strategies.]

Can We Do Better? Previous Work
Integer keys:
+ can be 2–3 times faster than quicksort
− naive variants are cache inefficient and slower than quicksort
− simple variants are distribution dependent
Cache-efficient sorting: k-ary merge sort [Nyberg et al. 94, Arge et al. 04, Ranade et al. 00, Brodal et al. 04]
+ factor log k fewer cache faults
− only ≈ 20% speedup, and only for very large inputs

Sample Sort

Function sampleSort(e = ⟨e₁, …, e_n⟩, k)
  if n/k is "small" then return smallSort(e)
  let S = ⟨S₁, …, S_{ak−1}⟩ denote a random sample of e
  sort S
  ⟨s₀, s₁, …, s_{k−1}, s_k⟩ := ⟨−∞, S_a, S_{2a}, …, S_{(k−1)a}, ∞⟩  // splitters
  for i := 1 to n do
    find j ∈ {1, …, k} such that s_{j−1} < e_i ≤ s_j   // binary search?
    place e_i in bucket b_j
  return concatenate(sampleSort(b₁), …, sampleSort(b_k))

Why Sample Sort?
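The sampleSort pseudocode above can be sketched in runnable form. The parameters k = 8 and oversampling factor a = 4 are illustrative, as is the guard against degenerate splitters (many equal keys), which the slides treat separately later.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <random>
#include <vector>

// Sample sort sketch: draw a*k - 1 random elements, sort them, take
// every a-th one as a splitter, distribute the input into k buckets by
// binary search on the splitters, and recurse on each bucket.
std::vector<int> sampleSort(std::vector<int> e, int k = 8, int a = 4) {
    if ((int)e.size() <= 4 * k) {                 // n/k "small": base case
        std::sort(e.begin(), e.end());
        return e;
    }
    static std::mt19937 rng(42);
    std::uniform_int_distribution<size_t> pick(0, e.size() - 1);
    std::vector<int> S(a * k - 1);                // random sample
    for (int& x : S) x = e[pick(rng)];
    std::sort(S.begin(), S.end());
    std::vector<int> s(k + 1);                    // splitters s[0..k]
    s[0] = std::numeric_limits<int>::min();
    s[k] = std::numeric_limits<int>::max();
    for (int i = 1; i < k; ++i) s[i] = S[i * a - 1];
    std::vector<std::vector<int>> bucket(k);
    for (int x : e) {                             // s[j-1] < x <= s[j]
        int j = std::lower_bound(s.begin() + 1, s.begin() + k, x) - s.begin();
        bucket[j - 1].push_back(x);
    }
    std::vector<int> out;
    for (auto& b : bucket) {
        if (b.size() == e.size()) {               // degenerate splitters
            std::sort(b.begin(), b.end());        // (e.g. all keys equal):
            return b;                             // avoid endless recursion
        }
        std::vector<int> sb = sampleSort(std::move(b), k, a);
        out.insert(out.end(), sb.begin(), sb.end());
    }
    return out;
}
```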
- traditionally: parallelizable on coarse-grained machines
+ cache efficient, ≈ merge sort
− binary search not much faster than merging
− complicated memory management

Super Scalar Sample Sort
- binary search −→ implicit search tree
- eliminate all conditional branches
- exploit instruction parallelism
- cache efficiency comes to bear
- "steal" memory management from radix sort

Classifying Elements
Store the splitters as an implicit complete binary search tree in an array t (layout as in binary heaps):
t := ⟨s_{k/2}, s_{k/4}, s_{3k/4}, s_{k/8}, s_{3k/8}, s_{5k/8}, s_{7k/8}, …⟩
for i := 1 to n do
  j := 1
  repeat log k times        // unroll
    j := 2j + (a_i > t_j)   // go left (×2) or right (×2 + 1)
  j := j − k + 1
  b_j++
  o(i) := j                 // oracle
Now the compiler should use predicated instructions and interleave the iterations of the for loop (unrolling ∨ software pipelining).

template <class T>
void findOraclesAndCount(const T* const a, const int n, const int k,
                         const T* const s,
                         Oracle* const oracle, int* const bucket) {
  for (int i = 0; i < n; i++) {
    int j = 1;
    while (j < k) {
      j = j*2 + (a[i] > s[j]);
    }
    int b = j-k;
    bucket[b]++;
    oracle[i] = b;
  }
}

Predication
A hardware mechanism that allows instructions to be executed conditionally:
- Boolean predicate registers (1–64) hold condition codes
- a predicate register p is an additional input of a predicated instruction I
- at runtime, I is executed if and only if p is true
+ avoids branch misprediction penalties
+ more flexible instruction scheduling
− switched-off instructions still take time
− longer opcodes
− complicated hardware design

Example (IA-64)
Translation of: if (r1 > r2) r3 := r3 + 4

With a conditional branch:        Via predication:
      cmp.gt p6,p7=r1,r2               cmp.gt p6,p7=r1,r2
 (p7) br.cond .label              (p6) add r3=4,r3
      add r3=4,r3
.label: (...)
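The classification loop above is runnable as shown below; the splitter layout (s[1] = median, s[2]/s[3] = quartiles, and so on) is the implicit-tree layout the slides describe. "Branch-free" here means the comparison result feeds an index computation instead of a conditional jump; the inner while loop itself is the part a compiler unrolls.

```cpp
#include <cassert>

// Runnable findOraclesAndCount for integer keys: descend the implicit
// search tree stored in s[1..k-1] (heap layout), ending at a virtual
// leaf j in [k, 2k-1]; b = j - k is the bucket in 0..k-1.
void findOraclesAndCount(const int* a, int n, int k, const int* s,
                         int* oracle, int* bucket) {
    for (int i = 0; i < n; i++) {
        int j = 1;
        while (j < k)               // log k levels; unrollable
            j = j * 2 + (a[i] > s[j]);
        int b = j - k;
        bucket[b]++;
        oracle[i] = b;              // remember the bucket for distribution
    }
}
```

With splitters 10, 20, …, 70 and k = 8, bucket b receives the elements in the half-open range (s_b, s_{b+1}], bucket 0 everything ≤ 10, and bucket 7 everything > 70.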
Other current architectures: conditional moves only.

Unrolling (k = 16)

template <class T>
void findOraclesAndCountUnrolled([...]) {
  for (int i = 0; i < n; i++) {
    int j = 1;
    j = j*2 + (a[i] > s[j]);
    j = j*2 + (a[i] > s[j]);
    j = j*2 + (a[i] > s[j]);
    j = j*2 + (a[i] > s[j]);
    int b = j-k;
    bucket[b]++;
    oracle[i] = b;
  }
}

More Unrolling (k = 16, n even)

template <class T>
void findOraclesAndCountUnrolled2([...]) {
  for (int i = n & 1; i < n; i+=2) {
    int j0 = 1;          int j1 = 1;
    T ai0 = a[i];        T ai1 = a[i+1];
    j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]);
    j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]);
    j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]);
    j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]);
    int b0 = j0-k;       int b1 = j1-k;
    bucket[b0]++;        bucket[b1]++;
    oracle[i] = b0;      oracle[i+1] = b1;
  }
}

Distributing Elements
for i := 1 to n do a′[B[o(i)]++] := a_i   // B[j]: start position of bucket j in a′

Why oracles?
- simplifies memory management: no overflow tests or re-copying
- simplifies software pipelining
- separates computation and memory-access aspects
- small (n bytes), sequential, predictable memory access
- can be hidden using prefetching / write buffering

template <class T>
void distribute(const T* const a0, T* const a1,
                const int n, const int k,
                const Oracle* const oracle, int* const bucket) {
  for (int i = 0, sum = 0; i <= k; i++) {   // prefix sums: bucket[i]
    int t = bucket[i];                      // becomes the start position
    bucket[i] = sum;                        // of bucket i in a1
    sum += t;
  }
  for (int i = 0; i < n; i++) {
    a1[bucket[oracle[i]]++] = a0[i];
  }
}

Exercise: How do you implement 4-way merging (8-way merging) with 2n memory accesses and 2n (3n) comparisons on a machine with 8 (16) registers? Is a naive implementation with 3n (7n) more easily predictable comparisons faster in the end?
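Putting classification and distribution together, one k-way distribution step looks like the sketch below. The function name and interface are illustrative; the oracle/prefix-sum/move structure is the one from the slides (here with a plain comparison loop supplying the oracles).

```cpp
#include <cassert>
#include <vector>

// One k-way distribution step: an oracle array records each element's
// bucket, a prefix-sum pass turns the counts into start positions, and
// a final pass moves every element directly to its place in the output.
// No per-bucket overflow tests, no re-copying.
void kWayDistribute(const std::vector<int>& a, std::vector<int>& out,
                    const std::vector<int>& splitter /* sorted, size k-1 */) {
    int n = a.size(), k = (int)splitter.size() + 1;
    std::vector<int> oracle(n), bucket(k + 1, 0);
    for (int i = 0; i < n; i++) {           // classify and count
        int b = 0;
        while (b < k - 1 && a[i] > splitter[b]) b++;
        bucket[b]++;
        oracle[i] = b;
    }
    for (int i = 0, sum = 0; i <= k; i++) { // prefix sums: bucket[b] is
        int t = bucket[i];                  // now the first free slot of
        bucket[i] = sum;                    // bucket b in the output
        sum += t;
    }
    out.resize(n);
    for (int i = 0; i < n; i++)             // distribute via the oracles
        out[bucket[oracle[i]]++] = a[i];
}
```

Note that the distribution is stable: within each bucket, elements keep their input order.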
Experiments: 1.4 GHz Itanium 2
- restrict keyword from ANSI/ISO C99 to indicate non-aliasing
- Intel's C++ compiler v8.0 uses predicated instructions automatically
- profiling gives 9% speedup
- k = 256 splitters
- use std::sort from g++ for n ≤ 1000, insertion sort for n ≤ 100
- random 32-bit integers in [0, 10⁹]

Comparison with Quicksort
[Plot: time / n log n in ns over n = 4096 … 2^24 for GCC STL sort versus sss-sort.]

Breakdown of Execution Time
[Plot: time / n log n over n for findBuckets, findBuckets + distribute, + smallSort, and total.]

A More Detailed View

                              dynamic instr.  cycles  IPC (small n)  IPC (n = 2^25)
findBuckets, 1× outer loop         63           11        5.4            4.5
distribute, one element            14            4        3.5            0.8

Comparison with Quicksort, Pentium 4
[Plot: time / n log n over n for Intel STL, GCC STL, and sss-sort.]
Problems: few registers, only one condition code, the compiler needs "help".

Breakdown of Execution Time, Pentium 4
[Plot: time / n log n over n for findBuckets, + distribute, + smallSort, and total.]

Analysis

                   mem. acc.     branches   data dep.    I/Os           registers
k-way distribution:
 sss-sort          n log k       O(1)       O(n)         3.5 n/B        O(log k) (3× unrolled)
 quicksort,        2n log k      n log k    O(n log k)   2(n/B) log k   4 = O(1)
 log k levels
k-way merging:
 memory            n log k       n log k    O(n log k)   2n/B           7 = O(log k)
 register          2n            n log k    O(n log k)   2n/B           k = O(k)
 funnel, k′-ary    2n log_k′ k   n log k    O(n log k)   2n/B           2k′ + 2 = O(k′)
Conclusions
- sss-sort is up to twice as fast as quicksort on Itanium
- comparisons ≠ conditional branches
- algorithm analysis is not just instructions and caches

Critique I: Why only random keys?
Answer I: Sample sort is almost independent of the input distribution.

Critique I′: But what if there are many equal keys? They all land in one bucket.
Answer I′: It's not a bug, it's a feature. Suppose s_i = s_{i+1} = ⋯ = s_j: an indicator for a frequent key! Set s_i := max {x ∈ Key : x < s_i} (optional: drop s_{i+2}, …, s_j). Now bucket i + 1 need not be sorted any more. todo: implement.

Critique II: Quicksort is in-place.
Answer II: Use a hybrid list-of-arrays representation of sequences; it needs O(√(kn)) extra space for k-way sample sort.

Critique II′: But I want arrays for input and output.
Answer II′: In-place conversion. Input: easy. Output: tricky. Exercise: develop a rough idea. Hint: permutation of blocks; every permutation is a product of cyclic permutations, and an in-place cyclic permutation is easy.

Future Work
- high-level fine-tuning, e.g. a clever choice of k
- other architectures, e.g. Opteron, PowerPC
- almost in-place implementation
- multilevel cache-aware or cache-oblivious generalization (oracles help)
Studienarbeit, (Diplomarbeit), Hiwi-Job?

Sorting by Multiway Merging
Sort ⌈n/M⌉ runs with M elements each: 2n/B I/Os
Merge M/B runs at a time: 2n/B I/Os
× ⌈log_{M/B}(n/M)⌉ merge phases until only one run is left
————————————————
sort(n) := (2n/B) · (1 + ⌈log_{M/B}(n/M)⌉) I/Os in total

Machine model: freely programmable registers and ALU; fast memory of capacity M words; transfers between fast and large memory in blocks of B words; internal buffers for merging.
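Plugging in concrete numbers makes the sort(n) bound tangible; the values below match the experimental setup later in these slides (16 GByte input, 256 MByte runs, 2 MByte blocks).

```latex
% Worked example for sort(n) = (2n/B)(1 + ceil(log_{M/B}(n/M))):
\[
  \#\text{runs} = \left\lceil \frac{n}{M} \right\rceil
    = \frac{16\,\text{GB}}{256\,\text{MB}} = 64,
  \qquad
  \text{merge arity} = \frac{M}{B} = \frac{256\,\text{MB}}{2\,\text{MB}} = 128 .
\]
% Since 128 >= 64, all runs can be merged in a single phase:
\[
  \left\lceil \log_{M/B} \frac{n}{M} \right\rceil
    = \left\lceil \log_{128} 64 \right\rceil = 1,
  \qquad
  \text{sort}(n) = \frac{2n}{B}\,(1 + 1) = \frac{4n}{B}
  \ \text{I/Os, i.e. two passes over the data.}
\]
```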
Procedure twoPassSort(M : ℕ; a : external Array [0..n−1] of Element)
  b : external Array [0..n−1] of Element   // auxiliary storage
  formRuns(M, a, b)
  mergeRuns(M, b, a)

// Sort runs of size M from f, writing sorted runs to t
Procedure formRuns(M : ℕ; f, t : external Array [0..n−1] of Element)
  for i := 0 to n − 1 step M do
    run := f[i..i + M − 1]        // read one run
    sort(run)                     // sort it internally
    t[i..i + M − 1] := run        // write it back

// Merge n elements from f to t where f stores sorted runs of size L
Procedure mergeRuns(L : ℕ; f, t : external Array [0..n−1] of Element)
  k := ⌈n/L⌉                      // number of runs
  next : PriorityQueue of Key × ℕ
  runBuffer := Array [0..k−1][0..B−1] of Element
  for i := 0 to k − 1 do
    runBuffer[i] := f[iL..iL + B − 1]
    next.insert((key(runBuffer[i][0]), iL))
  out : Array [0..B−1] of Element
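The two-pass structure can be simulated in memory; the sketch below treats the "external" arrays as plain vectors and keeps the priority queue of (key, position) pairs from the pseudocode, omitting the block buffering.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// In-memory simulation of twoPassSort: formRuns sorts each M-element
// run; mergeRuns performs a k-way merge driven by a priority queue of
// (key, position) pairs, as in the pseudocode.
void twoPassSort(std::vector<int>& a, size_t M) {
    size_t n = a.size();
    std::vector<int> b(a);                        // auxiliary storage
    for (size_t i = 0; i < n; i += M)             // formRuns
        std::sort(b.begin() + i, b.begin() + std::min(i + M, n));
    typedef std::pair<int, size_t> KP;            // (key, position in b)
    std::priority_queue<KP, std::vector<KP>, std::greater<KP> > next;
    for (size_t i = 0; i < n; i += M)             // smallest of each run
        next.push(KP(b[i], i));
    for (size_t i = 0; i < n; ++i) {              // k-way merge
        KP kp = next.top(); next.pop();
        a[i] = kp.first;
        size_t l = kp.second + 1;                 // advance in that run
        if (l < n && l % M != 0)                  // run not yet exhausted
            next.push(KP(b[l], l));
    }
}
```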
// k-way merging
for i := 0 to n − 1 step B do
  for j := 0 to B − 1 do
    (x, ℓ) := next.deleteMin
    out[j] := runBuffer[ℓ div L][ℓ mod B]
    ℓ++
    if ℓ mod B = 0 then                          // new input block
      if ℓ mod L = 0 ∨ ℓ = n then
        runBuffer[ℓ div L][0] := ∞               // sentinel for exhausted run
      else
        runBuffer[ℓ div L] := f[ℓ..ℓ + B − 1]    // refill buffer
    next.insert((runBuffer[ℓ div L][ℓ mod B], ℓ))
  write out to t[i..i + B − 1]                   // one output step

Example, B = 2, run size = 6
"Make things as simple as possible, but no simpler."
Input runs: __aeghikmnst __aaeilmpsss __bbeilopssu __eilmnoprst
Merged output: ________aaabbeeeeghiiiiklllmmmnnooppprss…

2 Super Scalar Sample Sort

Comparison-Based Sorting
Large data sets (thousands to millions).
Theory: lots of algorithms with n log n + O(n) comparisons.
Practice: quicksort
+ n log n + O(n) expected comparisons using sample-based pivot selection
+ sufficiently cache efficient: O((n/B) log n) cache misses on all levels
+ simple
+ compiler writers are aware of it
Can we nevertheless beat quicksort?

3 Sorting with Parallel Disks
I/O step := access to a single physical block per disk; D independent disks [Vitter Shriver 94].
Theory: Balance Sort [Nodine Vitter 93]: deterministic, complex, asymptotically optimal in the O(·) sense.
Multiway merging: "usually" a factor ≈ 10 fewer I/Os, but not asymptotically optimal.
Basic approach here: improve multiway merging.

Striping
Emulate one big logical disk: a logical block consists of D physical blocks, one per disk.
That takes care of run formation and of writing the output. But what about merging?

Prediction [Folklore, Knuth]
The smallest element of each block triggers its fetch: a prediction sequence controls the prefetch buffers, which allow the next blocks to be read in parallel.

Warmup: Multihead Model [Aggarwal Vitter 88]
D prefetch buffers yield an optimal algorithm:
sort(n) := (2n/DB) · (1 + ⌈log_{M/B}(n/M)⌉) I/Os

Bigger Prefetch Buffer
Prefetch buffers of size kD for k = 1, 2, …: good deterministic performance with O(D) buffers would yield an optimal algorithm. Possible?
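The prediction idea above (the smallest element of each block triggers its fetch) rests on a simple fact: during merging, input blocks are consumed exactly in the sorted order of their smallest elements, so this order can be computed ahead of time, e.g. during run formation. A small sketch with an illustrative interface:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Compute the prediction sequence: the order in which the merger will
// need the input blocks is the sorted order of each block's smallest
// element. Each block is given as (smallestKey, blockId).
std::vector<int> predictionSequence(std::vector<std::pair<int,int> > blocks) {
    std::sort(blocks.begin(), blocks.end());   // by smallest key
    std::vector<int> order;
    for (size_t i = 0; i < blocks.size(); ++i)
        order.push_back(blocks[i].second);
    return order;                              // fetch blocks in this order
}
```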
Randomized Cycling [Vitter Hutchinson 01]
Block i of stripe j goes to disk π_j(i) for a random permutation π_j.
Good for naive prefetching with Ω(D log D) buffers.

Buffered Writing [Sanders-Egner-Korst SODA 00, Hutchinson Sanders Vitter ESA 01]
A sequence of blocks Σ is randomly mapped to D queues of W/D blocks each, one per disk. Write whenever a buffer block is free; otherwise output one block from each nonempty queue.
Theorem: Buffered writing is optimal.
But how good is optimal?
Theorem: Randomized cycling achieves efficiency 1 − O(D/W).
Analysis: negative association of random variables; application of queueing theory to a "throttled" algorithm.

Optimal Offline Prefetching
Theorem: For buffer size W:
∃ (offline) prefetching schedule for Σ with T input steps
⇔ ∃ (online) write schedule for Σ^R (Σ reversed) with T output steps.
[Animation over eight input/output steps: reading Σ = r q p o n m l k j i h g f e d c b a forwards through a prefetch buffer corresponds, step by step, to writing Σ^R backwards through a write buffer.]
Synthesis
Multiway merging [60s folklore]
+ prediction
+ optimal (randomized) writing [Sanders Egner Korst SODA 2000]
+ randomized cycling [Vitter Hutchinson 2001]
+ optimal prefetching [Hutchinson Sanders Vitter ESA 2002]
⇒ (1 + o(1)) · sort(n) I/Os
"Answers" a question in [Knuth 98]: difficulty 48 on a 1..50 scale.

We Are Not Done Yet!
- internal work
- overlapping I/O and computation
- reasonable hardware
- interfacing with the operating system
- parameter tuning
- software engineering
- pipelining

Key Sorting
The I/O bandwidth of our machine is about 1/3 of its main-memory bandwidth.
If key size ≪ element size: sort (key, pointer) pairs to save memory bandwidth during run formation. Extract the keys, sort the pairs, then permute the elements.

Tournament Trees for Multiway Merging
Assume k = 2^K runs.
- K-level complete binary tree
- leaves: smallest current element of each run
- internal nodes: the loser of the competition for being smallest
- above the root: the global winner
deleteMin + insertNext replays the games along one leaf-to-root path.
Why Tournament Trees?
- exactly log k element comparisons
- implicit layout in an array (shifts) → simple index arithmetic
- predictable load instructions and index computations
(Unrollable) inner loop:

for (int i = (winnerIndex + kReg) >> 1; i > 0; i >>= 1) {
  currentPos = entry + i;
  currentKey = currentPos->key;
  if (currentKey < winnerKey) {
    currentIndex = currentPos->index;
    currentPos->key = winnerKey;
    currentPos->index = winnerIndex;
    winnerKey = currentKey;
    winnerIndex = currentIndex;
  }
}

Overlapping I/O and Computation
- one thread for each disk (or asynchronous I/O)
- possibly additional threads
- blocks filled with elements are passed by reference between different buffers

Overlapping During Run Formation
First post read requests for runs 1 and 2.
Thread A: loop { wait-read i; sort i; post-write i }
Thread B: loop { wait-write i; post-read i + 2 }
[Timeline for runs 1..k: reads, sorts, and writes overlap pairwise; in the I/O-bound case the reads dominate, in the compute-bound case the sorts do.]

Overlapping During Merging
Bad example (without prefetching, the merger alternates between the runs every block):
1^{B−1} 2  3^{B−1} 4  5^{B−1} 6 ⋯  merged with  1^{B−1} 2  3^{B−1} 4  5^{B−1} 6 ⋯
[Architecture: k merge buffers feed the merger; O(D) prefetch buffers and k + 3D overlap buffers feed the read buffers; 2D write buffers; the I/O threads give writing priority over reading.]

Let y := the number of elements merged during one I/O step.
I/O-bound case: the I/O thread never blocks.
[State diagram: #elements in overlap+merge buffers (kB … kB + 3DB) versus #elements in output buffers (0 … 2DB), with output, fetch, and blocking transitions.]
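The loser-tree inner loop above can be packaged as a small data structure. This is a sketch with an illustrative interface (k a power of two, INT_MAX as the "run exhausted" sentinel); the replay along one leaf-to-root path is exactly the unrollable loop from the slide.

```cpp
#include <cassert>
#include <climits>
#include <utility>
#include <vector>

// Loser tree for k-way merging, k a power of two. tree[1..k-1] store,
// per internal node, the run that LOST the comparison below it;
// tree[0] holds the overall winner. replaceMin performs exactly log k
// comparisons (deleteMin + insertNext from the slides).
struct LoserTree {
    int k;
    std::vector<int> tree;   // tree[0]: winner; tree[1..k-1]: losers
    std::vector<int> key;    // current key of each run
    LoserTree(const std::vector<int>& firstKeys)
        : k((int)firstKeys.size()), tree(k, -1), key(firstKeys) {
        for (int r = 0; r < k; ++r) {        // play each run against the
            int winner = r;                  // losers stored so far
            for (int i = (r + k) / 2; i > 0; i /= 2) {
                if (tree[i] == -1) { tree[i] = winner; winner = -1; break; }
                if (key[tree[i]] < key[winner]) std::swap(tree[i], winner);
            }
            if (winner != -1) tree[0] = winner;
        }
    }
    int minRun() const { return tree[0]; }
    void replaceMin(int newKey) {            // next key of the winning run
        int winner = tree[0];
        key[winner] = newKey;
        for (int i = (winner + k) / 2; i > 0; i /= 2)
            if (key[tree[i]] < key[winner]) std::swap(tree[i], winner);
        tree[0] = winner;
    }
};
```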
Compute-bound case: the merging thread never blocks.
[State diagram: analogous, with block-I/O and block-merging regions.]

Hardware (mid 2002)
- 2 × 2 GHz Xeon (4 threads), Intel E7500 chipset, 1 GB DDR RAM, Linux
- several 66 MHz PCI buses (SuperMicro P4DPE3): 2 × 64 × 66 Mb/s
- several fast IDE controllers (4 × Promise Ultra100 TX2): 4 × 2 × 100 MB/s
- many fast IDE disks (8 × IBM IC35L080AVVA07): 8 × 45 MB/s, 8 × 80 GB
Cost-effective I/O bandwidth: real 360 MB/s for ≈ 3000 €.

Software Interface (<stxxl>)
Goals: efficient + simple + compatible.
- STL-user layer: containers (vector, stack, set, priority_queue, map), algorithms (sort, for_each, merge)
- Streaming layer: pipelined sorting, zero-I/O scanning
- Block management layer: typed block, block manager, buffered streams, block prefetcher, buffered block writer
- Asynchronous I/O primitives layer: files, I/O requests, disk queues, completion handlers
- Operating system

Default Measurement Parameters
- input size: 16 GByte; keys: random 32-bit integers; element size: 128 byte
- run size: 256 MByte; block size B: 2 MByte
- compiler: g++ 3.2 -O6
- t := number of available buffer blocks; write buffers w: max(t/4, 2D); prefetch buffers: 2D + (3/10)(t − w − 2D)

Element Sizes (16 GByte, 8 disks)
[Plot: time for run formation, merging, and the I/O wait in both phases over element sizes 16 … 1024 byte.]
With parallel disks, bandwidth comes "for free"; internal work and overlapping are relevant.
Earlier Academic Implementations
(single disk, at most 2 GByte, old measurements use an artificial M)
[Plot: sort time over element sizes 16 … 1024 byte for LEDA-SM, TPIE, and comparison-based <stxxl>.]

Earlier Academic Implementations: Multiple Disks
[Plot: sort time over element size for LEDA-SM Soft-RAID, TPIE Soft-RAID, <stxxl> Soft-RAID, and native <stxxl>.]

What Are Good Block Sizes (8 disks)?
[Plot: sort time over block sizes 128 KByte … 8 MByte for 16 GBytes and for 128 GBytes with one or two merge passes.]
B is not a technology constant.

Optimal Versus Naive Prefetching
[Plot: total merge time relative to naive prefetching over the fraction of prefetch buffers, for 48, 112, and 176 read buffers.]

Impact of Prefetch and Overlap Buffers
[Plot: merging time over the number of read buffers, with no prefetch buffers versus a heuristic schedule.]

Tradeoff: Write Buffer Size Versus Read Buffer Size
[Plot: merge time over the number of read buffers.]

Scalability
[Plot: sort time per byte over input sizes 16 … 128 GByte for 128-byte and 512-byte elements, compared with 4N/DB bulk I/Os.]

Discussion
- theory and practice harmonize
- no expensive server hardware (SCSI, …) necessary
- no need to work with an artificial M; no 2/4 GByte limits
- faster than academic implementations
- (must be) as fast as commercial implementations, but with performance guarantees
- blocks are much larger than often assumed
(and not a technology constant); with parallel disks, bandwidth comes "for free"; don't neglect internal costs.

More Parallel Disk Sorting?
- Pipelining: input does not come from disk but from a logical input stream, and output goes to a logical output stream. Only half the I/Os for sorting, often no I/Os for scanning. todo: better overlapping.
- Parallelism: the only way to go for really many disks.
- Tuning and special cases: sss-sort, permutations, balancing work between merging and run formation, …
- Longer runs: not done with guaranteed overlapping and fast internal sorting!
- Distribution sorting: better for seeks etc.?
- In-place sorting: could also be faster.
- Determinism: a practical and theoretically efficient algorithm?

Procedure formLongRuns            // replacement selection
  q, q′ : PriorityQueue
  for i := 1 to M do q.insert(readElement)
  invariant |q| + |q′| = M
  loop
    while q ≠ ∅
      writeElement(e := q.deleteMin)
      if input exhausted then break outer loop
      if e′ := readElement < e then q′.insert(e′)
      else q.insert(e′)
    q := q′; q′ := ∅
  output q in sorted order; output q′ in sorted order

Knuth: average run length 2M (for random inputs).
todo: cache-efficient implementation.
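formLongRuns can be sketched in memory as follows. The interface (a vector of runs instead of writeElement) is illustrative; the two priority queues play exactly the roles of q and q′ from the pseudocode.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Replacement selection: keep M elements buffered; repeatedly emit the
// smallest, and let the element read in its place join the current run
// if it is not smaller than the last output (q), otherwise the next
// run (qNext). Expected run length on random input: 2M (Knuth).
std::vector<std::vector<int> > formLongRuns(const std::vector<int>& in,
                                            size_t M) {
    std::priority_queue<int, std::vector<int>, std::greater<int> > q, qNext;
    size_t pos = 0;
    while (pos < in.size() && q.size() < M) q.push(in[pos++]);
    std::vector<std::vector<int> > runs;
    std::vector<int> run;
    while (!q.empty()) {
        while (!q.empty()) {
            int e = q.top(); q.pop();
            run.push_back(e);                // writeElement
            if (pos < in.size()) {           // refill the buffer
                int e2 = in[pos++];
                if (e2 < e) qNext.push(e2);  // too small: next run
                else        q.push(e2);      // still fits: current run
            }
        }
        runs.push_back(run);                 // q exhausted: close the run
        run.clear();
        std::swap(q, qNext);                 // q := q'; q' := empty
    }
    return runs;
}
```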