# High Performance Computing 3 | I/O Avoiding Algorithms

### 1. I/O Basics

#### (1) I/O Definition

In our case, I/O refers to the transfers of data between slow and fast memories.

#### (2) A Sense of Scale

Before we start the discussion, let's first see an example. Suppose we are given an input dataset to sort, with the following information,

- Record (item) size: `r = 256 Bytes = 2^8 Bytes`
- Volume of data to sort on disk (slow mem): `r * n = 1 PiB = 2^50 Bytes`
- DRAM size (fast mem): `r * Z = 64 GiB = 2^36 Bytes`
- Memory transfer size: `r * L = 32 KiB = 2^15 Bytes`

Now we can work out that,

```
n        = 2^42 records = 4 * (2^10)^4 ≈ 4 * (10^3)^4 ≈ 4.4 * 10^12 records
nlog2(n) ≈ 185 * 10^12 comparisons
Z        = 2^28 records
L        = 2^7 records
```

Take `nlog2(n)` as the baseline, and let's see the improvements relative to that baseline when we account for L-sized transactions and the Z-sized fast memory (counts in units of 10^12 operations),

```
nlog2(n/L)          = 154    ~ 1.2x
n                   = 4.4    ~ 42x
(n/L)log2(n/L)      = 1.2    ~ 154x
(n/L)log2(n/Z)      = 0.275  ~ 672x
(n/L)log_(Z/L)(n/L) = 0.0523 ~ 3530x
```

From this result, we can see that one big improvement comes from reducing `n` to `n/L`. This means that whenever we pass over the data, we should do so in L-sized transactions as much as possible. The other big improvement comes from going from `log2` to `log_(Z/L)`, and this improvement involves the capacity of the fast memory `Z`.

#### (3) The Lower Bound

The goal of this lesson is to understand the lower bound on the amount of communication needed to sort on a machine with slow and fast memory. Here is that lower bound,

```
Q(n;Z,L) = Ω((n/L)log_(Z/L)(n/L))
```

### 2. I/O Avoiding Merge Sort

#### (1) Merge Sort Phase 1

Now, let's look at the problem of sorting in a two-level memory system, where we also assume that the processor is sequential (i.e., not parallel). So here is the merge sort idea.

- Phase 1
    - Logically divide the input into chunks of size `fZ`, where `f` is a multiplier no larger than 1, so that each chunk fits into the fast memory,

      ```
      f ∈ (0,1]
      # of chunks = n/(fZ)
      ```

    - Read each chunk from the slow memory into the fast memory and sort it there, producing a **sorted run**
    - After a chunk is sorted, write it back to the slow memory
    - Sorting all the chunks gives `n/(fZ)` runs in total

We can also write the following pseudocode for phase 1 (a runnable sketch follows below),

```
Partition input into n/(fZ) chunks
for each chunk i = 1 to n/(fZ) do:
    load chunk i
    sort chunk i into a sorted run i
    write run i
```

Let's also have a taste of phase 2, as follows,

- Phase 2
    - Merge the `n/(fZ)` runs into a single sorted run

#### (2) Merge Sort Phase 1 Asymptotic Cost

Let's now analyze the asymptotic cost of the merge sort. We assume,

- `f` is a constant, so it can be ignored
- `(fZ) mod L = 0`
- `n mod (fZ) = 0`

Then in phase 1,

- `load chunk i` takes `O(n/L)` transfers over all chunks, because `n` items are moved in L-sized blocks
- `sort chunk i into a sorted run i` takes `O(nlogZ)` comparisons over all chunks, because,

  ```
  O(fZlog(fZ) * n/(fZ)) = O(nlog(fZ)) = O(nlogZ)
  ```

- `write run i` likewise takes `O(n/L)` transfers

#### (3) Merge Sort Phase 2

Then we are going to see how to merge `m` sorted runs into a single sorted run. Suppose each run has size `s`; then,

```
n = m * s
```

The classical merge sort idea is to merge pairs of runs until we are left with a single run, so let's see what happens at each level. At each level `k` (starting from 0), the sorted runs have size `2^k * s`.
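Before costing these pairwise merges level by level, here is a minimal Python sketch of phase 1, which produces the initial size-`s` runs (here `s = fZ`). The function name and the list standing in for slow memory are illustrative assumptions, not the lesson's notation.

```
def phase1_make_runs(slow_mem, fZ):
    """Phase 1: partition the input into fZ-sized chunks, sort each
    chunk entirely in fast memory, and write each sorted run back.

    `slow_mem` is a list standing in for disk; on a real machine each
    chunk load and run write would happen in L-sized transfers."""
    runs = []
    for start in range(0, len(slow_mem), fZ):
        chunk = slow_mem[start:start + fZ]  # load chunk i into fast memory
        chunk.sort()                        # sort it while it is resident
        runs.append(chunk)                  # write sorted run i back
    return runs
```

For example, `phase1_make_runs([5, 2, 9, 1, 7, 3, 8, 6], fZ=4)` returns the two sorted runs `[[1, 2, 5, 9], [3, 6, 7, 8]]`.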
#### (4) Merge Sort Phase 2 Cost - One Pair

Consider a pair of runs A and B, each of size `2^(k-1) * s`. Our goal is to produce a merged run C which will hold `2^k * s` sorted items.

```
C = merge(A, B)
```

Assume the fast memory of size `Z` holds three buffers, each of which can hold `L` items. An L-sized block of A and an L-sized block of B are loaded into the first and second buffers, and their merged result accumulates in the third buffer. When that output buffer is full, flush it to the slow memory, and continue until the whole pair of runs has been merged. The pseudocode is as follows,

```
read L-sized blocks of A, B into _A and _B
while any unmerged items in A and B do:
    merge _A, _B into _C as far as possible
    if _A empty: load more of A into _A
    if _B empty: load more of B into _B
    if _C full: flush _C to C
flush any remaining items of A or B, and the last _C
```

The cost to merge A and B is,

```
                      read A          read B          write C
Pair Transfer Cost = 2^(k-1)s / L  +  2^(k-1)s / L  +  2^(k)s / L
                   = 2^(k+1)s / L
```

Counting the comparisons, the asymptotic cost is,

```
Pair Comparison Cost = Θ(2^k s)
```

#### (5) Merge Sort Phase 2 Cost - Total

The calculation above is just for merging one pair. For the whole merge tree, the number of pairs merged at level `k` is,

```
# of pairs = n/(2^k s)
```

When `# of pairs = 1`, we have reached the maximum of `k`, so,

```
# of levels = max(k) = log(n/s)
```

Therefore, in total, the costs are,

```
Transfer Cost   = (Pair Transfer Cost * # of pairs per level) * # of levels
                = (2^(k+1)s / L * n/(2^k s)) * log(n/s)
                = 2n/L * log(n/s)

Comparison Cost = (Pair Comparison Cost * # of pairs per level) * # of levels
                = (Θ(2^k s) * n/(2^k s)) * log(n/s)
                = Θ(nlog(n/s))
```

#### (6) General Costs for Two-way Merge Sort on Two Levels

Now consider this problem in the two-level setting with both phases. Since the runs produced by phase 1 have size `s = fZ = Θ(Z)`, the costs of phases 1 and 2 are,

```
Phase    Transfers              Comparisons
1        O(n/L)                 O(nlogZ)
2        O(n/L * log(n/Z))      O(nlog(n/Z))
```

So in total we have,

```
Transfer Cost   = O(n/L) + O(n/L * log(n/Z))
                = O(n/L * log(n/Z))

Comparison Cost = O(nlogZ) + O(nlog(n/Z))
                = O(n(logZ + log(n/Z)))
                = O(nlog(Z * n/Z))
                = O(nlogn)
```

#### (7) Problem of Two-way Merge Sort

The transfer cost `Q(n;Z,L)` we derived above is,

```
Q(n;Z,L) = O(n/L * log(n/Z))
```

We can also rewrite it as,

```
Q(n;Z,L) = O(n/L * log(n/Z))
         = O(n/L * (log(n/L) - log(Z/L)))
```

which is also,

```
Q(n;Z,L) = O(n/L * log(Z/L) * (log(n/L)/log(Z/L) - 1))
```

Compare this with the lower bound,

```
Q(n;Z,L) = Ω(n/L * log_(Z/L)(n/L))
         = Ω(n/L * log(n/L)/log(Z/L))
```

The gap between this algorithm and the lower bound, i.e. the potential improvement, is a factor of,

```
log(Z/L) * (1 - log(Z/L)/log(n/L))
```

The reason for this gap is that we only used three L-sized buffers in the fast memory instead of its total size `Z`. More specifically, the 2-way merge uses just 3 of the `Z/L` available L-sized blocks of the fast memory. To do better, we have to consider **multiway merging**.

#### (8) Multiway Merge Sort

So the idea for improving on two-way merging is to merge more than two runs at a time, in order to fully utilize the fast memory. Say we merge `k` runs of size `s` each at a time into a single run; then we must satisfy,

```
(k + 1) L <= Z
```

In each step, we first find the smallest item across all `k` input buffers and then append it to the output (i.e., the `(k+1)`-th) buffer. When the output buffer fills up, we simply flush it to slow memory, as in the sketch below.
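Here is a minimal Python sketch of this loop, simulating the slow memory with plain lists; the function name and the list-slicing "transfers" are illustrative assumptions, not the lesson's notation. It picks the minimum by a linear scan over the buffer heads, which is one of the options discussed next.

```
def multiway_merge(runs, L):
    """Merge k sorted runs using k input buffers plus one output buffer,
    each of capacity L, as required by (k + 1) L <= Z."""
    k = len(runs)
    pos = [L] * k                        # next unread offset in each run
    buffers = [run[:L] for run in runs]  # initial L-sized loads
    out_buf, merged = [], []             # output buffer, simulated slow memory

    while any(buffers):
        # Find the run whose buffer holds the smallest head item
        # (linear scan here; a min-heap would do this in O(log k)).
        i = min((j for j in range(k) if buffers[j]),
                key=lambda j: buffers[j][0])
        out_buf.append(buffers[i].pop(0))
        if not buffers[i]:               # buffer i drained: load next L items
            buffers[i] = runs[i][pos[i]:pos[i] + L]
            pos[i] += L
        if len(out_buf) == L:            # output buffer full: flush it
            merged.extend(out_buf)
            out_buf = []
    merged.extend(out_buf)               # flush the final partial block
    return merged
```

For example, `multiway_merge([[1, 4, 7], [2, 5, 8], [3, 6, 9]], L=2)` returns `[1, 2, ..., 9]` while only ever holding `(k + 1) * L = 8` items in the simulated fast memory.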
Now the only question is how to pick the next smallest item across the `k` buffers. We have several options,

- Linear scan
- Min-heap (aka. priority queue)

If we go with the min-heap, we have the following costs,

- build: `O(k)`
- extractMin: `O(logk)`
- insert: `O(logk)`

Then, let's look at the cost of a single k-way merge,

- Transfers: `2ks/L`
    - `ks/L` for the loads
    - `ks/L` for the writes
- Comparisons: `O(k + kslogk)`
    - `O(k)` to build the heap
    - each of the `ks` items is inserted once and extracted once, giving `O(kslogk)`

Looking at the whole picture, the full multiway merge sort has,

- total number of comparisons: `O(nlogn)`
    - the same as any comparison-based sorting algorithm
- total number of transfers: `Q(n;Z,L) = Θ((n/L)log_(Z/L)(n/L))`
    - assume we can always do a k-way merge in fast memory, so `k = Θ(Z/L)` with `(k + 1)L <= Z`
    - the **maximum number of levels** of the merge tree is `l = Θ(log_(Z/L)(n/L))`; use this as a hint
    - at level `i`, groups of `k` runs of `k^(i-1) * s` items each are merged into single runs of `k^i * s` items
    - number of transfers per run at level i: `Θ(k^i * s/L)`
    - number of runs at level i: `n/(k^i * s)`
    - so the **total transfers** at level i: `Θ(k^i * s/L) * n/(k^i * s) = Θ(n/L)`
    - so the total number of transfers is `Θ(n/L) * Θ(log_(Z/L)(n/L)) = Θ((n/L)log_(Z/L)(n/L))`

#### (9) Performance of Multiway Merge Sort

Is this multiway merge sort good enough to match the theoretical lower bound? The answer is yes, and here is a proof sketch. Say we have `n` items in an array to sort, so,

```
# of possible orderings = n!
```

Suppose again we have a two-level memory with fast memory of size `Z` and transfer size `L`. With each transfer, `L` items come into the fast memory, and we learn something new about the ordering. Let `K(t)` denote the number of orderings still consistent with everything observed after `t` transfers, so,

```
K(0) = n!
```

Placing the `L` newly transferred items among the contents of the fast memory can distinguish at most the following number of arrangements,

```
\tbinom{Z}{L} * L!
```

so each transfer shrinks the count of consistent orderings by at most that factor,

```
K(t) >= K(t-1) / (\tbinom{Z}{L} * L!)
```

Applying this over `t` transfers,

```
K(t) >= K(0) / (\tbinom{Z}{L} * L!)^t
      = n! / (\tbinom{Z}{L} * L!)^t
```

However, this count is more conservative than necessary: the `L!` factor only applies when the transferred block has never been seen before (otherwise its internal order is already known), and there are at most `n/L` such first-time reads. So the `L!` factor appears at most `n/L` times in total, which tightens the bound to,

```
K(t) >= n! / ((\tbinom{Z}{L})^t * (L!)^(n/L))
```

For the result to be fully sorted after `t` transfers, we need `K(t) = 1`, so,

```
1 >= n! / ((\tbinom{Z}{L})^t * (L!)^(n/L))
```

Taking logs on both sides,

```
log(n!) <= log((\tbinom{Z}{L})^t * (L!)^(n/L))
```

Here we use two properties,

- `log(x!) ~ xlogx` (Stirling's approximation)
- `log(\tbinom{a}{b}) ~ blog(a/b)`

Substituting these into the inequality above, we have,

```
nlogn <= tLlog(Z/L) + (n/L) * LlogL
```

Moving `t` to one side and everything else to the other, we get,

```
t >~ (n/L) * log_(Z/L)(n/L)
```

This is to say that `(n/L) * log_(Z/L)(n/L)` is the theoretical lower bound on the number of transfers, and multiway merge sort attains it.
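For concreteness, here is that final rearrangement spelled out as a sketch (constants and lower-order terms are dropped),

```latex
n\log n \le tL\log\tfrac{Z}{L} + n\log L
\;\Longrightarrow\;
tL\log\tfrac{Z}{L} \ge n\log n - n\log L = n\log\tfrac{n}{L}
\;\Longrightarrow\;
t \ge \frac{n}{L}\cdot\frac{\log(n/L)}{\log(Z/L)} = \frac{n}{L}\log_{Z/L}\frac{n}{L}.
```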
### 3. I/O Avoiding Binary Search

#### (1) Number of Transfers in Binary Search

Suppose we have a sorted list of `n` items, a fast memory of size `Z`, and a transfer size of `L`. When `n <= L`, we only need one transfer, so,

```
Q(n;Z,L) = 1, when n <= L
```

However, when `n > L`, we load the L-sized block of items around the median, compare, and recurse on the half that remains. In this case we have,

```
Q(n;Z,L) = 1 + Q(n/2;Z,L), when n > L
```

The solution to this recurrence is,

```
Q(n;Z,L) = O(log(n/L))
```

But can we do better?

#### (2) Lower Bound for Search

Let's think about the search problem in another way. Identifying the index `i` of the target requires learning,

```
O(logn) bits
```

Also, each L-sized read tells us at most which of the `L + 1` gaps between the loaded items contains the target, so we learn at most about `logL` bits per read. Hence,

```
Q(n;Z,L) = Ω(logn/logL) = Ω(log_L(n))
```

Compared with binary search's `O(log(n/L)) = O(logn - logL) ≈ O(logn)` transfers, this lower bound leaves room for a speedup of about `logL`.

#### (3) Lower Bound for Binary Tree

However, binary search itself cannot reach this lower bound, while a search tree laid out for L-sized transfers (for example, a B-tree) can. We will skip this part here.
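To make the gap concrete, here is a minimal Python sketch of an L-ary search that approaches the `Ω(log_L(n))` bound by loading `L` evenly spaced pivots per transfer; the function name and the transfer accounting are illustrative assumptions, not the lesson's construction.

```
import bisect

def lary_search(arr, target, L):
    """Search sorted `arr`, loading L evenly spaced pivots per transfer.

    Each loaded block narrows the candidate range by a factor of about
    L + 1, so the transfer count is roughly log_(L+1)(n), versus binary
    search's roughly log2(n/L)."""
    lo, hi = 0, len(arr)                 # candidate range is arr[lo:hi]
    transfers = 0
    while hi - lo > L:
        step = (hi - lo) // (L + 1)
        pivots = [lo + (j + 1) * step for j in range(L)]
        keys = [arr[p] for p in pivots]  # one L-sized "transfer"
        transfers += 1
        g = bisect.bisect_right(keys, target)  # which gap holds the target
        if g > 0:
            lo = pivots[g - 1]           # target >= keys[g-1]
        if g < L:
            hi = pivots[g]               # target <  keys[g]
    transfers += 1                       # the final range fits in one transfer
    for i in range(lo, hi):
        if arr[i] == target:
            return i, transfers
    return -1, transfers
```

For instance, with `arr = list(range(1_000_000))` and `L = 128`, `lary_search(arr, 123456, 128)` finds the target in 3 transfers, while the `O(log(n/L))` binary search recurrence needs around 13.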