High Performance Computing 9 | Midterm Review

1. Work Span Model

Average Available Parallelism

Avg. Para = W(n) / D(n)

Work-Span Law

Tp(n) >= max{D(n), ceil(W(n)/p)}

Brent’s Theorem (Slack)

Tp <= (W - D) / P + D

Best Sequential Time

T'(n) = τW(n)
T'(n) = W(n) if we assume unit time per work

Computation Time

T_comp = τW

Transfer Time

T_mem = ɑLQ

Speedup

S(p) = = T'(n) / Tp(n) = W(n) / Tp(n)

Work Optimality

W = W'

Weak Scalability

P = O(W'/D)

Computational Intensity

I = W / LQ

Machine Balance Point

B = ɑ / τ

Parfor Cost

Work: O(n)
Span: O(log(n)) or idealy O(1)

Overall Time Range

τW * max(1, B/I) <= T <= τW * (1 + B/I)

2. Power and Energy

Recall: ɑ and τ

ɑ = time per transaction operation
τ = time per computation operation

Lemma

1/ɑ = transaction operation per unit time
1/τ = computation operation per unit time

Execution time for a DAG

Usual work: W/(Pτ)
Spans: D/τ
Trasactions: QL/ɑ
Tp >= max(W/(Pτ), D/τ, QL/ɑ)

Principle of Balance

W/(Pτ) >= QL/ɑ
Also, W/Q >= τ/ɑ * PL = PL/B

Power

P = E / T = P_0 + ΔP = P_0 + CV^2 * f * a

Dynamic Power

ΔP = CV^2 * f * a

Frequency-Valt Relation

f ∝ V

Power-Clock Law

P ∝ f^3

3. Memory Locality

Block Transfer Cost

W(n) = Ω(n)
Ω(ceil(n/L)) <= Q(n;Z,L) <= ceil(n/L) + 1

Block Reduction Cost

W(n) = Θ(n)
Q(n;Z,L) >= Θ(ceil(n/L))

Matrix-Matrix Multiplication Cost

W(n) = O(n^3)
Q(n;Z,L) = Ω(n^2 / L)

Matrix-Vector Multiplication Cost

W(n) = O(n^2)
Q(n;Z,L) = 3n/L + n^2 + n^2/L = O(n^2)

Normalized Performance

R = τW' / T

Maximum Normalized Performance

R_max = W'/W * min(1, I/B)

4. I/O Avoiding

Two-way Merge Sort

# Phase 1
f ∈ [0,1)
Partition input into n/(fZ) chunks
for each chunk i = 1 to n/(fZ) do:
    load chunk i
    sort chunk i into a sorted run i
    write run i

# Phase 2
read L-sized blocks of A, B to _A and _B
while any unmerged items in A and B do:
    _A, _B -> _C as possible
    if _A empty:
        load more A to _A
    if _B empty:
        load more B to _B
    if _C full:
        flush _C to C
Flush any unmerged items in A or B

Two-way Merge Sort Cost

W(n) = O(nlogn)
Q(n;Z,L) = O(n/L*log(n/Z))

Multiway Merge Sort Cost

W(n) = O(nlogn)
Q(n;Z,L) >~ (n/L)*log_(Z/L)(n/L)

Binary Search

W(n) = O(logn)
Q(n) = O(log(n/L)) by solving Recurrence

5. Cache Oblivious

Cache Defined Transfers

Q(n;Z,L) = # of misses + # of store evictions

Lemma: Optimal and LRU

Q_LRU(n;Z,L) <= 2Q_OPT(n;Z/2,L)

Tall Cache Assumption

Z >= L^2

CO Matrix Matrix Multiplication

function mm(n, A, B, C):
    if n = 1 then C <- A + B
    else
        Logically partition A, B and C into quadrants
        for i = 1 to 2 do:
            for j = 1 to 2 do:
                 for k = 1 to 2 do:
                     mm(n/2, Aik, Bkj, Cij)

CO Matrix Matrix Multiplication Cost

W(n) = O(n^3)
Q(n;Z,L) = Θ(n^3/(L*sqrt(Z)))

CO Binary Search: the I/O avoiding binary search is already cache oblivious
Van Emde Boas BST
- partitioning the levels
- layout all the upper subtree elements together
- concatenate with the lower subtree elements
Van Emde Boas BST Maximum Cache misses

Q = h / h_sub = Ω(logN/logL)

6. Scan and Rank

Sequential Add Scan Cost

W(n) = O(n)
D(n) = O(n)

Naive Add Scan (par)

for i = 1 to n do:
    B[i] = reduce(A[:i])

Naive Add Scan Cost

W(n) = O(n^2)
D(n) = O(n)

Advanced Add Scan (par): odd-even seperate

function addScan(A[n]):
    if n == 1 do:
        return A[:n]
    Let:
        I_odd[1:n/2] = 1, 3, 5, ...
        I_even[1:n/2] = 2, 4, 6, ...
        
    for i in I_odd do:
        A[i+1] = A[i] + A[i+1]       // partial scan
    
    A[I_even] = addScan(A[I_even])   // even scan
    
    for i in I_odd[2:] do:
        A[i] += A[i-1]               // odd scan

Advanced Add Scan Cost

W(n) = O(n)
D(n) = O(log^2(n))

Sequential RankList

function rankList(head):
    r = 0
    cur = head
    
    while cur != NULL do:
        cur.rank = r
        cur = cur.next
        r++
        
node.rank = distance of node to head

Sequential RankList Cost

W(n) = O(n)

Linked List Representation: Array Pool
- Assign index to each node by list I
- Based on I, put nodes to a list V
- Replace the concept of pointer to the index and place to another array N
Premitive jump

function jumpList(N_in[1:m], N_out[1:m]):
    parallel for i = 1 to m do:
        if N_in[i] do:
            N_out[i] = N_in[N_in[i]]

Parallel List Rank

initial R_in = [0, 1, 1, ...]

function updateRanks(R_in[1:m], R_out[1:m], N[1:m]):
    parallel for i = 1 to n do:
        if N[i] do:
            R_out[N[i]] = R_in[i] + R_in[N[i]]
            
function rankList(V, N, h):
    R_in[m], R_out[m]    // array of ranks
    N_in[m], N_out[m]    // pointer index arrays
    
    // init
    R_in[:], R_out[:] = [1]
    R_in[h], R_out[h] = 0
    N_in[:], N_out[:] = N[:]
    
    // compute
    for i = 1 to (# of jumps) do:
        updateRanks(R_in, R_out, N_in)
        jumpList(N_in, N_out)
        // swap to continue the loop
        R_in, R_out = R_out, R_in
        N_in, N_out = N_out, N_in

Parallel List Rank Cost

W(n) = O(nlogn) not work optimal
D(n) = O(log^2(n))

7. Bitonic Sort

Three Item Sort Network
Four Item Bitonic Sorting Network
Bitonic Split

function bitonicSplit(A[n]):
    // assume 2|n
    parfor i = 0 to (n/2-1) do:
        a = A[i]
        b = A[i + n/2]
        A[i] = min(a, b)
        A[i + n/2] = max(a, b)

Bitonic Split Network
Bitonic Merge

function bitonicMerge(A[n]):
    // assime A is bitonic, 2|n
    if n>= 2 then:
        bitonicSplit(A[:])
        spawn bitonicMerge(A[:(n/2-1)])
        bitonicMerge(A[(n/2):(n-1)])

Bitonic Merge Network
Generate Bitonic Sequence

function genBitonic(A[n]):
    if n >= 2 then:
    // assume 2|n
    spawn genBitonic(A[:(n/2-1)])
    genBitonic(A[(n/2):(n-1)])
    sync
    spawn bitonicMerge+(A[:(n/2-1)])
    bitonicMerge-(A[(n/2):(n-1)])

Bitonic Sort

function bitonicSort(A[n]):
    genBitonic(A[:])
    bitonicMerge+(A[:])

Bitonic Sort Cost

W(n) = Θ(nlog^2(n)) not work optimal
D(n) = Θ(log^2(n))

High Performance Computing 9 | Midterm Review

1. Work Span Model

2. Power and Energy

3. Memory Locality

4. I/O Avoiding

5. Cache Oblivious

6. Scan and Rank

7. Bitonic Sort

8. Tree