High Performance Computing 3 | I/O Avoiding Algorithms

1. I/O Basics

(1) I/O Definition

In our case, I/O refers to the transfer of data between slow and fast memory.

(2) A Sense of Scale

Before we start the discussion, let’s look at an example. Suppose we are given an input dataset to sort and we have the following information,

  • Record (item) size: r = 256 Bytes = 2^8 Bytes

  • Volume of data to sort on disk (slow mem): r * n = 1 PiB = 2^50 Bytes

  • DRAM size (fast mem): r * Z = 64 GiB = 2^36 Bytes

  • Memory transfer size: r * L = 32 KiB = 2^15 Bytes

Now we can work out that,

n = 2^42 records = 4 * (2^10)^4 ≈ 4 * (10^3)^4 ≈ 4.4 * 10^12 records
nlog2(n) = 42n ≈ 185 Tops (tera-ops, i.e. 10^12 comparisons; this is our baseline)
Z = 2^28 records
L = 2^7 records

With nlog2(n) as the baseline, let’s now see the improvement relative to it when we exploit L-sized transactions and the Z-sized fast memory (all counts in Tops),

nlog2(n/L)           = 154          ~ *1.2
n                    =   4.4        ~ *42
(n/L)log2(n/L)       =   1.2        ~ *154
(n/L)log2(n/Z)       =   0.48       ~ *384
(n/L)log_(Z/L)(n/L)  =   0.057      ~ *3230
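
Since these numbers are easy to mis-transcribe, here is a quick Python sanity check that recomputes every row from the byte sizes given above (variable names are mine):

from math import log2

r = 2**8                 # record size in bytes
n = 2**50 // r           # records to sort           -> 2^42
Z = 2**36 // r           # records that fit in DRAM  -> 2^28
L = 2**15 // r           # records per transfer      -> 2^7

T = 1e12                 # 1 Top = 10^12 operations
print(f"n*log2(n)            = {n*log2(n)/T:6.1f} Tops")
print(f"n*log2(n/L)          = {n*log2(n/L)/T:6.1f} Tops")
print(f"n                    = {n/T:6.1f} Tops")
print(f"(n/L)*log2(n/L)      = {n/L*log2(n/L)/T:6.2f} Tops")
print(f"(n/L)*log2(n/Z)      = {n/L*log2(n/Z)/T:6.2f} Tops")
print(f"(n/L)*log_(Z/L)(n/L) = {n/L*log2(n/L)/log2(Z/L)/T:6.3f} Tops")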

From these results, we can see that one big improvement comes from reducing n to n/L: whenever we pass over the data, we should do so in L-sized transactions as much as possible. The other big improvement comes from going from log2 to log_(Z/L), and this improvement involves the capacity Z of the fast memory.

(3) The Lower Bound

The goal of this lesson is to understand the lower bound on the amount of communication needed to sort on a machine with slow and fast memory. And here’s the lower bound,

Q(n;Z,L) = Ω((n/L)log_(Z/L)(n/L))

2. I/O Avoiding Merge Sort

(1) Merge Sort Phase 1

Now, let’s look at the problem of sorting n elements in a two-level memory system, where we also assume that the processor is sequential (i.e., not parallel). So here’s the merge sort idea.

  • Phase 1

    • Logically divide the input into chunks of size fZ, where f is a multiplier no larger than 1, so that each chunk fits into the fast memory.

    f ∈ (0, 1]
    # of chunks = n/(fZ)
    
    • Read one chunk at a time from the slow memory into the fast memory and sort it there, producing a sorted chunk (a run)

    • After the chunk is sorted, write it back to the slow memory

    • Sorting all the chunks this way produces n/(fZ) sorted runs in total

We can also write the following pseudo code for phase 1,

Partition input into n/(fZ) chunks
for each chunk i = 1 to n/(fZ) do:
    load chunk i
    sort chunk i into a sorted run i
    write run i
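
To make phase 1 concrete, here is a minimal runnable Python sketch (an illustration under simplifying assumptions: a plain list plays the role of slow memory, sorting a slice stands in for sorting in fast memory, and the chunk capacity fZ is passed in directly):

def phase1_make_runs(disk, fZ):
    """Sort fZ-sized chunks of `disk`, returning the list of sorted runs."""
    runs = []
    for start in range(0, len(disk), fZ):
        chunk = disk[start:start + fZ]   # "load chunk i" into fast memory
        chunk.sort()                     # sort it entirely in fast memory
        runs.append(chunk)               # "write run i" back to slow memory
    return runs

print(phase1_make_runs([5, 3, 8, 1, 9, 2, 7, 4], fZ=4))
# [[1, 3, 5, 8], [2, 4, 7, 9]]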

Let’s also have a quick preview of phase 2,

  • Phase 2

    • Merge the n/(fZ) sorted runs into a single sorted run

(2) Merge Sort Phase 1 Asymptotic Cost

Let’s then analyze the asymptotic cost of the merge sort. We assume,

  • f is a constant so it can be ignored

  • (fZ) mod L = 0, i.e., each chunk is a whole number of L-sized blocks

  • n mod (fZ) = 0, i.e., the input is a whole number of chunks

Then in phase 1,

  • load chunk i has O(n/L) transfers in total across all chunks, because all n elements are moved in L-sized blocks

  • sort chunk i into a sorted run i has O(nlogZ) comparisons in total because,

O(fZlog(fZ) * n/(fZ)) = O(nlog(fZ)) = O(nlogZ)

  • write run i has O(n/L) transfers

(3) Merge Sort Phase 2

Then we are going to see how we can merge m sorted runs into a single sorted run. Suppose each run has size s; then,

n = m * s

A classical merge sort idea is to merge pairs of runs until we get a final single run, so let’s see what happens at each level. At each level k, starting from k = 0, the sorted runs have size 2^k * s.

(4) Merge Sort Phase 2 Cost - One Pair

Consider a pair of runs A and B, each of size 2^(k-1) * s. Our goal is to produce a merged run C that will hold 2^k * s sorted items.

C = merge(A, B)

Assume the fast memory of size Z holds three buffers, each of which can hold L elements. An L-sized block of A and an L-sized block of B are loaded into the first and second buffers (_A and _B), and merged items are placed into the third buffer (_C). When _C is full, it is flushed to the slow memory, and we continue until the pair of runs is fully merged. The pseudo code is as follows,

read L-sized blocks of A, B into _A and _B
while any unmerged items in A and B do:
    merge _A, _B into _C as much as possible
    if _A empty:
        load more of A into _A
    if _B empty:
        load more of B into _B
    if _C full:
        flush _C to C
flush any remaining items to C
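
Here is a runnable Python sketch of this buffered merge (again an illustration: lists stand in for slow memory, list slicing stands in for L-sized loads, and no attempt is made at efficiency):

def merge_pair(A, B, L):
    C = []                           # merged run (slow memory)
    out = []                         # _C: output buffer in fast memory
    bufA, bufB = A[:L], B[:L]        # _A, _B: L-sized input buffers
    ia, ib = len(bufA), len(bufB)    # cursors into A and B (slow memory)
    while bufA or bufB:
        # Take the smaller head item; if one side is exhausted, drain the other.
        if bufA and (not bufB or bufA[0] <= bufB[0]):
            out.append(bufA.pop(0))
            if not bufA and ia < len(A):              # _A empty: load more of A
                bufA = A[ia:ia + L]; ia += len(bufA)
        else:
            out.append(bufB.pop(0))
            if not bufB and ib < len(B):              # _B empty: load more of B
                bufB = B[ib:ib + L]; ib += len(bufB)
        if len(out) == L:                             # _C full: flush to slow memory
            C.extend(out); out = []
    C.extend(out)                                     # flush whatever remains
    return C

print(merge_pair([1, 3, 5, 8], [2, 4, 7, 9], L=2))
# [1, 2, 3, 4, 5, 7, 8, 9]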

The cost to merge A and B should be,

                          A              B             C
                     ------------   ------------   ----------
Pair Transfer Cost = 2^(k-1)s / L + 2^(k-1)s / L + 2^(k)s / L
                   = 2^(k+1)s / L

Considering the comparisons, the asymptotic upper-bound cost should be,

Pair Comparison Cost = Θ(2^ks)

(5) Merge Sort Phase 2 Cost - Total

The calculation above is just for merging one pair. For the whole merge tree, the number of pairs merged at level k is,

# of pairs = n/(2^ks)

When # of pairs = 1, we reach the maximum of k, so,

# of levels = max(k) = log(n/s)

Therefore, in total we have the costs as,

Transfer Cost = (Pair Transfer Cost * # of pairs per level) * # of levels
              = (2^(k+1)s / L * n/(2^ks)) * log(n/s)
              = 2n/L * log(n/s)

Comparison Cost = (Pair Comparison Cost * # of pairs per level) * # of levels
                = (Θ(2^ks) * n/(2^ks)) * log(n/s)
                = Θ(nlog(n/s))

(6) General Costs for Two-way Merge Sort on Two Levels

Now let’s consider the whole problem on a two-level memory, combining the two phases. Recall from phase 1 that the runs start at size s = fZ = Θ(Z), so log(n/s) = O(log(n/Z)). In phases 1 and 2, the costs are,

Phase        Transfers                Comparisons
1            O(n/L)                   O(nlogZ)
2            O(n/L * log(n/Z))        O(nlog(n/Z))

So in total we have,

Transfer Cost = O(n/L) + O(n/L*log(n/Z))
              = O(n/L*log(n/Z))
             
Comparison Cost = O(nlogZ) + O(nlog(n/Z))
                = O(n(logZ + log(n/Z)))
                = O(nlog(Z * n/Z))
                = O(nlogn)
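
With the example numbers from section 1, and constants dropped, these totals come out as follows (a rough illustration only):

from math import log2

n, Z, L = 2**42, 2**28, 2**7
transfers = (n / L) * log2(n / Z)   # O(n/L * log(n/Z))
comparisons = n * log2(n)           # O(nlogn)
print(f"transfers ≈ {transfers:.2e}, comparisons ≈ {comparisons:.2e}")
# transfers ≈ 4.81e+11, comparisons ≈ 1.85e+14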

(7) Problem of Two-way Merge Sort

As we found above, the transfer cost Q(n;Z,L) of this two-way merge sort is,

Q(n;Z,L) = O(n/L*log(n/Z))

We can also rewrite it as,

Q(n;Z,L) = O(n/L*log(n/Z)) = O(n/L * (log(n/L) - log(Z/L)))

This is also,

Q(n;Z,L) = O(n/L * log(Z/L) * (log(n/L)/log(Z/L) - 1))

Compared with the lower bound we have as,

Q(n;Z,L) = Ω(n/L * log_(Z/L)(n/L)) = Ω(n/L * log(n/L)/log(Z/L))

Dividing the two, two-way merge sort is a factor of

log(Z/L) * (1 - log(Z/L)/log(n/L))

away from the lower bound. With the example numbers from section 1, this factor is 21 * (1 - 21/35) = 8.4, so two-way merging moves roughly 8x more data than necessary. The reason is that we use only three L-sized buffers in the fast memory instead of its total size Z. More specifically, the 2-way merge uses just 3 of the total Z/L available blocks of the fast memory. To close this gap, we have to consider multiway merging.

(8) Multiway Merge Sort

So the idea for improving on two-way merging is to merge more than two runs at a time, to fully utilize the fast memory. Say we merge k runs, each of size s, into a single run at a time. To fit the k input buffers plus one output buffer in the fast memory, we must satisfy,

(k + 1) L <= Z

In each step, we find the smallest item across all k input buffers and append it to the output (the (k+1)-th) buffer. When the output buffer fills up, the only thing we have to do is flush it to slow memory. (With the numbers from section 1, k can be as large as Z/L - 1 = 2^21 - 1, far wider than a 2-way merge.)

Now the only question is how to pick the next smallest item across k buffers. We have several options,

  • Linear Scan

  • Min-heap (aka. priority queue)

If we go with the min-heap, we will have the following costs (a runnable sketch follows this list).

  • building: O(k)

  • extractMin: O(logk)

  • insert: O(logk)
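
Here is a Python sketch of the min-heap approach using the standard heapq module (note that heapq.merge already implements exactly this; the explicit version below just makes the build/extractMin/insert operations visible):

import heapq

def kway_merge(runs):
    # Heap entries are (item, run index, position within run).
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)                                       # build: O(k)
    out = []
    while heap:
        item, i, j = heapq.heappop(heap)                      # extractMin: O(logk)
        out.append(item)
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))  # insert: O(logk)
    return out

print(kway_merge([[1, 5, 9], [2, 6], [3, 4, 8]]))
# [1, 2, 3, 4, 5, 6, 8, 9]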

Then, let’s have a look at the cost of a single k-way merge,

  • Transfers: 2ks/L

    • ks/L for a load

    • ks/L for a write

  • Comparisons: O(k + kslogk)

    • O(k) to build the heap

    • each of the ks items is inserted once and extracted once, at O(logk) each, so O(kslogk)

Looking at the whole picture, the total cost of multiway merge sort includes,

  • total number of comparisons: O(nlogn)

    • this is the same as for any comparison-based sorting algorithm

  • total number of transfers: Q(n;Z,L) = Θ((n/L)*log_(Z/L)(n/L))

    • assume we always do the widest k-way merge that fits in fast memory, so k = Θ(Z/L) (with k < Z/L from the buffer constraint)

    • the number of levels of the merge tree is then l = Θ(log_(Z/L)(n/L))

    • at level i, k runs of k^(i-1)*s items each are merged into single runs of k^i*s items

      • Number of transfers per run at level i: Θ(k^i*s/L)

      • Number of runs at level i: n/(k^i*s)

      • So total transfers at level i: Θ(k^i*s/L) * n/(k^i*s) = Θ(n/L)

    • So the total number of transfers is Θ(n/L) * Θ(log_(Z/L)(n/L)) = Θ((n/L)*log_(Z/L)(n/L))
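
Plugging the example numbers from section 1 into this result (constants dropped; this just reproduces the last row of the improvement table in section 1):

from math import log2

n, Z, L = 2**42, 2**28, 2**7
levels = log2(n / L) / log2(Z / L)   # log_(Z/L)(n/L) ≈ 1.67
per_level = n / L                    # Θ(n/L) transfers per level
print(f"levels ≈ {levels:.2f}, transfers ≈ {per_level * levels:.2e}")
# levels ≈ 1.67, transfers ≈ 5.73e+10  (vs ≈ 4.81e+11 for two-way merge)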

(9) Performance of Multiway Merge Sort

Now, is this multiway merge sort good enough to match the theoretical lower bound? The answer is yes, and let’s see a proof.

Let’s say we have n items in an array for sorting, so,

# of possible orderings = n!

Let’s also suppose we have a two-level memory with a fast memory of size Z and transfer size L. With each transfer, L items come into the fast memory, and we learn something new about the ordering. Let K(t) be the number of orderings still consistent with everything we have learned after t transfers, so that,

K(0) = n!

When a new transfer brings L items into the fast memory, the number of ways to interleave them with the Z items already there, times the number of possible internal orderings of the L new items, is at most,

C(Z,L) * L!    (C(Z,L) denotes the binomial coefficient "Z choose L")

So each transfer cuts the number of consistent orderings by at most that factor,

K(t) >= K(t-1) / (C(Z,L) * L!)

Applying this repeatedly over t transfers, and using K(0) = n!,

K(t) >= K(0) / (C(Z,L) * L!)^t = n! / (C(Z,L) * L!)^t

However, this count is more conservative than necessary, because the L! factor assumes we know nothing about the order of the new L items. That is only true the first time a block is read, and the input has only n/L distinct L-sized blocks, so the L! penalty applies at most n/L times rather than t times. Tightening the bound accordingly,

K(t) >= n! / ((C(Z,L))^t * (L!)^(n/L))

Suppose we want a fully ordered result after t transfers; then we need K(t) = 1, so,

1 >= n! / ((C(Z,L))^t * (L!)^(n/L))

Taking logs on both sides,

log(n!) <= t*log(C(Z,L)) + (n/L)*log(L!)

Here we have two properties,

  • log(x!) ~ xlogx (Stirling's approximation)

  • log(C(a,b)) ~ blog(a/b)
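
Both approximations hold only up to lower-order terms; here is a quick numeric check in Python (illustration only):

from math import lgamma, log, comb

x = 2**20
print(x * log(x), lgamma(x + 1))        # x*logx vs log(x!): same leading term

a, b = 2**28, 2**7
print(b * log(a / b), log(comb(a, b)))  # b*log(a/b) vs log(C(a,b)): same leading term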

Substituting both into the inequality above, we have,

nlogn <= t*Llog(Z/L) + (n/L)*LlogL = t*Llog(Z/L) + nlogL

Solving for t: since nlogn - nlogL = nlog(n/L), we get,

t >~ nlog(n/L) / (Llog(Z/L)) = (n/L)*log_(Z/L)(n/L)

This is to say that (n/L)*log_(Z/L)(n/L) transfers is the theoretical lower bound, and multiway merge sort attains it, i.e., it is asymptotically optimal.