High Performance Computing 9 | Midterm Reviewο
1. Work Span Modelο
Average Available Parallelism
Avg. Para = W(n) / D(n)
Work-Span Law
Tp(n) >= max{D(n), ceil(W(n)/p)}
Brentβs Theorem (Slack)
Tp <= (W - D) / P + D
Best Sequential Time
T'(n) = ΟW(n)
T'(n) = W(n) if we assume unit time per work
Computation Time
T_comp = ΟW
Transfer Time
T_mem = ΙLQ
Speedup
S(p) = = T'(n) / Tp(n) = W(n) / Tp(n)
Work Optimality
W = W'
Weak Scalability
P = O(W'/D)
Computational Intensity
I = W / LQ
Machine Balance Point
B = Ι / Ο
Parfor Cost
Work: O(n)
Span: O(log(n)) or idealy O(1)
Overall Time Range
ΟW * max(1, B/I) <= T <= ΟW * (1 + B/I)
2. Power and Energyο
Recall: Ι and Ο
Ι = time per transaction operation
Ο = time per computation operation
Lemma
1/Ι = transaction operation per unit time
1/Ο = computation operation per unit time
Execution time for a DAG
Usual work: W/(PΟ)
Spans: D/Ο
Trasactions: QL/Ι
Tp >= max(W/(PΟ), D/Ο, QL/Ι)
Principle of Balance
W/(PΟ) >= QL/Ι
Also, W/Q >= Ο/Ι * PL = PL/B
Power
P = E / T = P_0 + ΞP = P_0 + CV^2 * f * a
Dynamic Power
ΞP = CV^2 * f * a
Frequency-Valt Relation
f β V
Power-Clock Law
P β f^3
3. Memory Localityο
Block Transfer Cost
W(n) = Ξ©(n)
Ξ©(ceil(n/L)) <= Q(n;Z,L) <= ceil(n/L) + 1
Block Reduction Cost
W(n) = Ξ(n)
Q(n;Z,L) >= Ξ(ceil(n/L))
Matrix-Matrix Multiplication Cost
W(n) = O(n^3)
Q(n;Z,L) = Ξ©(n^2 / L)
Matrix-Vector Multiplication Cost
W(n) = O(n^2)
Q(n;Z,L) = 3n/L + n^2 + n^2/L = O(n^2)
Normalized Performance
R = ΟW' / T
Maximum Normalized Performance
R_max = W'/W * min(1, I/B)
4. I/O Avoidingο
Two-way Merge Sort
# Phase 1
f β [0,1)
Partition input into n/(fZ) chunks
for each chunk i = 1 to n/(fZ) do:
load chunk i
sort chunk i into a sorted run i
write run i
# Phase 2
read L-sized blocks of A, B to _A and _B
while any unmerged items in A and B do:
_A, _B -> _C as possible
if _A empty:
load more A to _A
if _B empty:
load more B to _B
if _C full:
flush _C to C
Flush any unmerged items in A or B
Two-way Merge Sort Cost
W(n) = O(nlogn)
Q(n;Z,L) = O(n/L*log(n/Z))
Multiway Merge Sort Cost
W(n) = O(nlogn)
Q(n;Z,L) >~ (n/L)*log_(Z/L)(n/L)
Binary Search
W(n) = O(logn)
Q(n) = O(log(n/L)) by solving Recurrence
5. Cache Obliviousο
Cache Defined Transfers
Q(n;Z,L) = # of misses + # of store evictions
Lemma: Optimal and LRU
Q_LRU(n;Z,L) <= 2Q_OPT(n;Z/2,L)
Tall Cache Assumption
Z >= L^2
CO Matrix Matrix Multiplication
function mm(n, A, B, C):
if n = 1 then C <- A + B
else
Logically partition A, B and C into quadrants
for i = 1 to 2 do:
for j = 1 to 2 do:
for k = 1 to 2 do:
mm(n/2, Aik, Bkj, Cij)
CO Matrix Matrix Multiplication Cost
W(n) = O(n^3)
Q(n;Z,L) = Ξ(n^3/(L*sqrt(Z)))
CO Binary Search: the I/O avoiding binary search is already cache oblivious
Van Emde Boas BST
partitioning the levels
layout all the upper subtree elements together
concatenate with the lower subtree elements
Van Emde Boas BST Maximum Cache misses
Q = h / h_sub = Ξ©(logN/logL)
6. Scan and Rankο
Sequential Add Scan Cost
W(n) = O(n)
D(n) = O(n)
Naive Add Scan (par)
for i = 1 to n do:
B[i] = reduce(A[:i])
Naive Add Scan Cost
W(n) = O(n^2)
D(n) = O(n)
Advanced Add Scan (par): odd-even seperate
function addScan(A[n]):
if n == 1 do:
return A[:n]
Let:
I_odd[1:n/2] = 1, 3, 5, ...
I_even[1:n/2] = 2, 4, 6, ...
for i in I_odd do:
A[i+1] = A[i] + A[i+1] // partial scan
A[I_even] = addScan(A[I_even]) // even scan
for i in I_odd[2:] do:
A[i] += A[i-1] // odd scan
Advanced Add Scan Cost
W(n) = O(n)
D(n) = O(log^2(n))
Sequential RankList
function rankList(head):
r = 0
cur = head
while cur != NULL do:
cur.rank = r
cur = cur.next
r++
node.rank = distance of node to head
Sequential RankList Cost
W(n) = O(n)
Linked List Representation: Array Pool
Assign index to each node by list I
Based on I, put nodes to a list V
Replace the concept of pointer to the index and place to another array N
Premitive jump
function jumpList(N_in[1:m], N_out[1:m]):
parallel for i = 1 to m do:
if N_in[i] do:
N_out[i] = N_in[N_in[i]]
Parallel List Rank
initial R_in = [0, 1, 1, ...]
function updateRanks(R_in[1:m], R_out[1:m], N[1:m]):
parallel for i = 1 to n do:
if N[i] do:
R_out[N[i]] = R_in[i] + R_in[N[i]]
function rankList(V, N, h):
R_in[m], R_out[m] // array of ranks
N_in[m], N_out[m] // pointer index arrays
// init
R_in[:], R_out[:] = [1]
R_in[h], R_out[h] = 0
N_in[:], N_out[:] = N[:]
// compute
for i = 1 to (# of jumps) do:
updateRanks(R_in, R_out, N_in)
jumpList(N_in, N_out)
// swap to continue the loop
R_in, R_out = R_out, R_in
N_in, N_out = N_out, N_in
Parallel List Rank Cost
W(n) = O(nlogn) not work optimal
D(n) = O(log^2(n))
7. Bitonic Sortο
Three Item Sort Network
Four Item Bitonic Sorting Network
Bitonic Split
function bitonicSplit(A[n]):
// assume 2|n
parfor i = 0 to (n/2-1) do:
a = A[i]
b = A[i + n/2]
A[i] = min(a, b)
A[i + n/2] = max(a, b)
Bitonic Split Network
Bitonic Merge
function bitonicMerge(A[n]):
// assime A is bitonic, 2|n
if n>= 2 then:
bitonicSplit(A[:])
spawn bitonicMerge(A[:(n/2-1)])
bitonicMerge(A[(n/2):(n-1)])
Bitonic Merge Network
Generate Bitonic Sequence
function genBitonic(A[n]):
if n >= 2 then:
// assume 2|n
spawn genBitonic(A[:(n/2-1)])
genBitonic(A[(n/2):(n-1)])
sync
spawn bitonicMerge+(A[:(n/2-1)])
bitonicMerge-(A[(n/2):(n-1)])
Bitonic Sort
function bitonicSort(A[n]):
genBitonic(A[:])
bitonicMerge+(A[:])
Bitonic Sort Cost
W(n) = Ξ(nlog^2(n)) not work optimal
D(n) = Ξ(log^2(n))