# High Performance Computing 9 | Midterm Review

### 1. Work Span Model

- Average Available Parallelism
```
Avg. Para = W(n) / D(n)
```
- Work-Span Law
```
Tp(n) >= max{D(n), ceil(W(n)/p)}
```
- Brent's Theorem (Slack)
```
Tp <= (W - D) / p + D
```
- Best Sequential Time
```
T'(n) = τW(n)
T'(n) = W(n) if we assume unit time per operation (τ = 1)
```
- Computation Time
```
T_comp = τW
```
- Transfer Time
```
T_mem = ɑLQ
```
- Speedup
```
S(p) = T'(n) / Tp(n) = W(n) / Tp(n)
```
- Work Optimality
```
W = W'
```
- Weak Scalability
```
p = O(W'/D)
```
- Computational Intensity
```
I = W / (LQ)
```
- Machine Balance Point
```
B = ɑ / τ
```
- Parfor Cost
```
Work: O(n)
Span: O(log(n)), or ideally O(1)
```
- Overall Time Range (a numeric sanity check of these bounds follows §5)
```
τW * max(1, B/I) <= T <= τW * (1 + B/I)
```

### 2. Power and Energy

- Recall: ɑ and τ
```
ɑ = time per transaction operation
τ = time per computation operation
```
- Lemma
```
1/ɑ = transaction operations per unit time
1/τ = computation operations per unit time
```
- Execution time for a DAG (each term converts a count into a time, consistent with T_comp = τW and T_mem = ɑLQ above)
```
Work:         τW/p
Span:         τD
Transactions: ɑLQ
Tp >= max(τW/p, τD, ɑLQ)
```
- Principle of Balance (compute time should dominate transfer time)
```
τW/p >= ɑLQ
Also, W/Q >= (ɑ/τ) * pL = BpL
```
- Power
```
P = E / T = P_0 + ΔP = P_0 + CV^2 * f * a
```
- Dynamic Power
```
ΔP = CV^2 * f * a
```
- Frequency-Voltage Relation
```
f ∝ V
```
- Power-Clock Law (substitute V ∝ f into ΔP; a DVFS example follows §5)
```
P ∝ f^3
```

### 3. Memory Locality

- Block Transfer Cost
```
W(n) = Ω(n)
ceil(n/L) <= Q(n;Z,L) <= ceil(n/L) + 1
```
- Block Reduction Cost
```
W(n) = Θ(n)
Q(n;Z,L) = Θ(ceil(n/L))
```
- Matrix-Matrix Multiplication Cost
```
W(n) = O(n^3)
Q(n;Z,L) = Ω(n^2 / L)
```
- Matrix-Vector Multiplication Cost
```
W(n) = O(n^2)
Q(n;Z,L) = 3n/L + n^2 + n^2/L = O(n^2)
```
- Normalized Performance
```
R = τW' / T
```
- Maximum Normalized Performance (a roofline-style worked example follows §5)
```
R_max = W'/W * min(1, I/B)
```

### 4. I/O Avoiding

- Two-way Merge Sort
```
# Phase 1
Pick f ∈ (0,1]; partition the input into n/(fZ) chunks
for each chunk i = 1 to n/(fZ) do:
    load chunk i
    sort chunk i into a sorted run i
    write run i

# Phase 2 (merge two sorted runs A, B into C)
read L-sized blocks of A, B into buffers _A, _B
while any unmerged items remain in A and B do:
    merge _A, _B into _C as far as possible
    if _A empty: load more of A into _A
    if _B empty: load more of B into _B
    if _C full:  flush _C to C
flush any unmerged items of A or B
```
- Two-way Merge Sort Cost
```
W(n) = O(n log n)
Q(n;Z,L) = O((n/L) * log(n/Z))
```
- Multiway Merge Sort Cost
```
W(n) = O(n log n)
Q(n;Z,L) >~ (n/L) * log_(Z/L)(n/L)
```
- Binary Search
```
W(n) = O(log n)
Q(n) = O(log(n/L)), by solving the recurrence
```

### 5. Cache Oblivious

- Cache Defined Transfers
```
Q(n;Z,L) = # of misses + # of dirty (store) evictions
```
- Lemma: Optimal and LRU
```
Q_LRU(n;Z,L) <= 2 * Q_OPT(n;Z/2,L)
```
- Tall Cache Assumption
```
Z >= L^2
```
- CO Matrix Matrix Multiplication (a runnable sketch follows this section)
```
function mm(n, A, B, C):
    if n = 1 then
        C <- C + A * B
    else
        Logically partition A, B and C into quadrants
        for i = 1 to 2 do:
            for j = 1 to 2 do:
                for k = 1 to 2 do:
                    mm(n/2, Aik, Bkj, Cij)
```
- CO Matrix Matrix Multiplication Cost
```
W(n) = O(n^3)
Q(n;Z,L) = Θ(n^3 / (L * sqrt(Z)))
```
- CO Binary Search: the I/O-avoiding binary search is already cache oblivious
- Van Emde Boas BST
  - partition the levels at half the tree height
  - lay out all the upper-subtree elements together
  - concatenate with the lower-subtree elements, recursively
- Van Emde Boas BST Maximum Cache Misses
```
Q = h / h_sub = O(log N / log L)
```
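A quick numeric check of the §1 bounds may help; the values of W, D, and p below are made up purely for illustration.

```python
# Numeric sanity check of the work-span law and Brent's bound (§1).
# W, D, p are hypothetical values chosen only for intuition.
from math import ceil

W, D, p = 10**6, 10**3, 100       # work, span, processors (assumed)

avg_par = W / D                   # average available parallelism
lower = max(D, ceil(W / p))       # work-span law: Tp >= max(D, ceil(W/p))
upper = (W - D) / p + D           # Brent's theorem: Tp <= (W - D)/p + D

print(f"avg parallelism = {avg_par:.0f}")   # 1000
print(f"{lower} <= Tp <= {upper:.0f}")      # 10000 <= Tp <= 10990
```

Since p = 100 is well below the average parallelism W/D = 1000, both bounds sit close to W/p: execution stays work-limited.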
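For §2, a minimal sketch of why P ∝ f^3 rewards parallelism under frequency scaling; the constants C and a, and the unit constant in f ∝ V, are arbitrary assumptions.

```python
# DVFS intuition from ΔP = C V^2 f a with f ∝ V (§2): dynamic power
# scales as f^3, so two half-speed cores match one full-speed core's
# throughput at a quarter of the dynamic power.
def dyn_power(f, C=1.0, a=1.0):
    V = f                        # assumes f ∝ V with unit constant
    return C * V**2 * f * a      # ΔP = C V^2 f a, i.e. ∝ f^3

one_fast = dyn_power(1.0)        # one core at full clock
two_slow = 2 * dyn_power(0.5)    # two cores at half clock, same throughput
print(two_slow / one_fast)       # 0.25
```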
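A roofline-style check tying §1's intensity to §3's R_max; the machine and kernel numbers here are hypothetical, chosen to make a memory-bound case.

```python
# Memory- vs compute-bound check (§1, §3). tau/alpha and the kernel
# counts below are assumed values, not course data.
tau, alpha = 1e-9, 1e-8          # time per flop / per word (assumed)
B = alpha / tau                  # machine balance point: 10 flops/word
W, LQ = 2e9, 1e9                 # work and words moved (assumed)
W_opt = W                        # assume the algorithm is work optimal
I = W / LQ                       # computational intensity: 2 flops/word
R_max = (W_opt / W) * min(1.0, I / B)
print(f"I = {I:.0f}, B = {B:.0f}, R_max = {R_max:.2f} of peak")
# I < B, so the kernel is memory-bound and tops out at 20% of peak.
```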
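A runnable Python rendering of §5's recursive matrix multiply (referenced above), assuming n is a power of two; quadrants are addressed by index offsets into row-major lists rather than copied, and no Z or L appears in the code, which is exactly what makes it cache oblivious.

```python
# Cache-oblivious matmul (§5): C += A @ B on row-major n x n lists.
# Quadrants are addressed by offsets, so no submatrix is ever copied.
def mm(n, A, B, C, ai=0, aj=0, bi=0, bj=0, ci=0, cj=0, N=None):
    N = n if N is None else N            # N = full matrix dimension
    if n == 1:                           # base case: one multiply-add
        C[ci * N + cj] += A[ai * N + aj] * B[bi * N + bj]
        return
    h = n // 2                           # recurse on the 8 quadrant products
    for i in (0, h):
        for j in (0, h):
            for k in (0, h):
                mm(h, A, B, C, ai + i, aj + k, bi + k, bj + j,
                   ci + i, cj + j, N)

n = 4
A = [1.0] * (n * n); B = [1.0] * (n * n); C = [0.0] * (n * n)
mm(n, A, B, C)
print(C[:n])    # every entry is 4.0, as expected for all-ones inputs
```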
### 6. Scan and Rank

- Sequential Add Scan Cost
```
W(n) = O(n)
D(n) = O(n)
```
- Naive Add Scan (par)
```
parallel for i = 1 to n do:
    B[i] = reduce(A[:i])
```
- Naive Add Scan Cost (the span is the longest sequential reduction)
```
W(n) = O(n^2)
D(n) = O(n)
```
- Advanced Add Scan (par): odd-even separation
```
function addScan(A[n]):
    if n == 1 do: return A
    Let:
        I_odd[1:n/2]  = 1, 3, 5, ...
        I_even[1:n/2] = 2, 4, 6, ...
    parallel for i in I_odd do:
        A[i+1] = A[i] + A[i+1]         // partial scan
    A[I_even] = addScan(A[I_even])     // even scan
    parallel for i in I_odd[2:] do:
        A[i] += A[i-1]                 // odd scan
```
- Advanced Add Scan Cost
```
W(n) = O(n)
D(n) = O(log^2(n))
```
- Sequential RankList (node.rank = distance of the node from the head)
```
function rankList(head):
    r = 0
    cur = head
    while cur != NULL do:
        cur.rank = r
        cur = cur.next
        r++
```
- Sequential RankList Cost
```
W(n) = O(n)
```
- Linked List Representation: Array Pool
  - assign an index to each node
  - store the node values in an array V according to that index
  - replace pointers with indices kept in another array N
- Primitive jump
```
function jumpList(N_in[1:m], N_out[1:m]):
    parallel for i = 1 to m do:
        if N_in[i] do:
            N_out[i] = N_in[N_in[i]]
```
- Parallel List Rank (a runnable Python sketch appears at the end of these notes)
```
initially R_in = [0, 1, 1, ...]

function updateRanks(R_in[1:m], R_out[1:m], N[1:m]):
    parallel for i = 1 to m do:
        if N[i] do:
            R_out[N[i]] = R_in[i] + R_in[N[i]]

function rankList(V, N, h):
    R_in[m], R_out[m]    // arrays of ranks
    N_in[m], N_out[m]    // pointer (next-index) arrays
    // init
    R_in[:], R_out[:] = 1
    R_in[h], R_out[h] = 0
    N_in[:], N_out[:] = N[:]
    // compute: ceil(log2(m)) jump rounds suffice
    for i = 1 to ceil(log2(m)) do:
        updateRanks(R_in, R_out, N_in)
        jumpList(N_in, N_out)
        // swap to continue the loop
        R_in, R_out = R_out, R_in
        N_in, N_out = N_out, N_in
```
- Parallel List Rank Cost
```
W(n) = O(n log n), not work optimal
D(n) = O(log^2(n))
```

### 7. Bitonic Sort

- Three Item Sort Network

![](https://github.com/Sadamingh/notepics/blob/main/hpc/2.png?raw=true)

- Four Item Bitonic Sorting Network

![](https://github.com/Sadamingh/notepics/blob/main/hpc/3.png?raw=true)

- Bitonic Split
```
function bitonicSplit(A[n]):    // assume 2|n
    parfor i = 0 to (n/2 - 1) do:
        a = A[i]
        b = A[i + n/2]
        A[i] = min(a, b)
        A[i + n/2] = max(a, b)
```
- Bitonic Split Network

![](https://github.com/Sadamingh/notepics/blob/main/hpc/5.png?raw=true)

- Bitonic Merge
```
function bitonicMerge(A[n]):    // assume A is bitonic, 2|n
    if n >= 2 then:
        bitonicSplit(A[:])
        spawn bitonicMerge(A[:(n/2-1)])
        bitonicMerge(A[(n/2):(n-1)])
```
- Bitonic Merge Network

![](https://github.com/Sadamingh/notepics/blob/main/hpc/5.png?raw=true)

- Generate Bitonic Sequence
```
function genBitonic(A[n]):      // assume 2|n
    if n >= 2 then:
        spawn genBitonic(A[:(n/2-1)])
        genBitonic(A[(n/2):(n-1)])
        sync
        spawn bitonicMerge+(A[:(n/2-1)])
        bitonicMerge-(A[(n/2):(n-1)])
```
- Bitonic Sort (a runnable Python sketch appears at the end of these notes)
```
function bitonicSort(A[n]):
    genBitonic(A[:])
    bitonicMerge+(A[:])
```
- Bitonic Sort Cost
```
W(n) = Θ(n log^2(n)), not work optimal
D(n) = Θ(log^2(n))
```

### 8. Tree
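Worked sketch for §6's parallel list rank, referenced there. This is a serial Python emulation: the notes' `parallel for` loops become plain loops over double-buffered copies, `N[i]` holds the successor index (0 for null), `h` is the head's index, and arrays are 1-based via a dummy slot 0.

```python
import math

# Pointer-jumping list rank (§6), serial emulation of the parallel
# algorithm. N[i] = index of i's successor (0 = none); h = head index.
def rank_list(N, h):
    m = len(N) - 1                        # number of real nodes
    R = [1] * (m + 1)                     # initial ranks: head 0, rest 1
    R[h] = 0
    Nj = N[:]                             # jump pointers, private copy
    for _ in range(math.ceil(math.log2(max(m, 2)))):
        R_new, N_new = R[:], Nj[:]        # double buffers, as in the notes
        for i in range(1, m + 1):         # "parallel for" in the notes
            if Nj[i]:
                R_new[Nj[i]] = R[i] + R[Nj[i]]   # updateRanks
                N_new[i] = Nj[Nj[i]]             # jumpList
        R, Nj = R_new, N_new              # swap buffers
    return R

# list 1 -> 3 -> 2, head = 1: ranks are R[1]=0, R[3]=1, R[2]=2
print(rank_list([0, 3, 0, 2], 1)[1:])     # [0, 2, 1]
```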
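And a compact serial rendering of §7's bitonic sort, referenced there, assuming the length of A is a power of two; the notes' spawn/sync pairs become ordinary recursive calls, and `up` selects merge+ versus merge-.

```python
# Bitonic sort (§7) on A[lo:lo+n]; n must be a power of two.
def bitonic_merge(A, lo, n, up):
    if n >= 2:
        h = n // 2
        for i in range(lo, lo + h):       # bitonicSplit on A[lo:lo+n]
            if (A[i] > A[i + h]) == up:
                A[i], A[i + h] = A[i + h], A[i]
        bitonic_merge(A, lo, h, up)       # merge each half recursively
        bitonic_merge(A, lo + h, h, up)

def bitonic_sort(A, lo=0, n=None, up=True):
    n = len(A) if n is None else n
    if n >= 2:
        h = n // 2
        bitonic_sort(A, lo, h, True)      # genBitonic: ascending half...
        bitonic_sort(A, lo + h, h, False) # ...and descending half
        bitonic_merge(A, lo, n, up)       # bitonicMerge+/- on the whole

A = [3, 7, 4, 8, 6, 2, 1, 5]
bitonic_sort(A)
print(A)    # [1, 2, 3, 4, 5, 6, 7, 8]
```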