High-Performance Computer Architecture 25 | Cache Experiment

NOTE: This article does NOT include anything related to the final submission (no answers, no specific values, no results) for the course CS6290 HPCA, because of the honor code. Most of the content in this article repeats the basic instructions and directions of Project 2; additional Linux commands are provided as a complement to the project's guidelines.
Please feel free to contact me if this article violates the rules of Georgia Tech, and I will delete it immediately.
0. Basic Setup
In this experiment, we are going to focus on the cache. We will use the original cmp4-noc.conf file, and you can view it by,
$ cd sesc/confs
$ cat cmp4-noc.conf | more
Also, we will be using the FMM benchmark with 256 particles and single-core execution. You can find this benchmark at,
$ cd ~/sesc/apps/Splash2/fmm
Then we can view the content of this directory by,
$ ls
If you are accessing this directory for the first time, you will probably see the following output,
Changed Changed1 Changed.new Changed.NOLOCK Input Makefile Source
From this output, we can see that the file fmm.mipseb does not exist yet. We can use the Makefile to compile it,
$ make
$ ls
Then the output should be,
Changed Changed.new fmm.mipseb Makefile
Changed1 Changed.NOLOCK Input Source
Then we can run the simulation, as in the previous experiments, with the following command,
$ ~/sesc/sesc.opt -f Default -c ~/sesc/confs/cmp4-noc.conf -iInput/input.256 -ofmm.out -efmm.err fmm.mipseb -p 1
The output report will be named sesc_fmm.mipseb.Default, and we have to make sure that the fmm.err file is empty,
$ cat fmm.err
and for fmm.out,
$ cat fmm.out
the output should begin with,
Creating a two cluster, non uniform distribution for 256 particles
and end with,
Total time for steps 3 to 5 : 0
1. Experiment 1: Cache Performance
In this experiment, we will be modifying the data caches of the simulated processor. So let's take a closer look at the configuration file ~/sesc/confs/cmp4-noc.conf by,
$ cat ~/sesc/confs/cmp4-noc.conf | more
Let's then have a look at the DMemory section,
# data source
[DMemory]
deviceType = 'smpcache'
size = 32*1024
assoc = 4
bsize = $(cacheLineSize)
writePolicy = 'WB'
replPolicy = 'LRU'
protocol = 'DMESI'
numPorts = 2
portOccp = 1 # Number of occupancy per port. 0: UnlimitedPort, 1:FullyPipelinedPort, other value: PortPipe
hitDelay = 1
missDelay = 1
#displNotify = false
MSHR = "DMSHR"
lowerLevel = "Router RTR sharedBy 1"
It says that the structure the processor gets data from is of type smpcache, which
- can store 32 kB of data: by size
- is 4-way set associative: by assoc
- has a 64 B block/line size: by bsize, which uses cacheLineSize (64)
- is a write-back cache: by writePolicy
- uses the LRU replacement policy: by replPolicy
- can handle 2 cache accesses every cycle: by numPorts and portOccp
- has a 1-cycle hit time: by hitDelay
- takes 1 cycle to detect a miss: by missDelay
- keeps track of misses with the DMSHR (data miss status holding registers): by MSHR
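To make these numbers concrete, here is a quick, purely illustrative sanity check of the L1 geometry (this snippet is not part of the project code; the constants are simply the values from the [DMemory] section above),
#include <cstdio>

int main() {
    // Geometry of the default [DMemory] L1 data cache from cmp4-noc.conf
    const int size  = 32 * 1024;  // total capacity: 32 kB
    const int assoc = 4;          // 4-way set associative
    const int bsize = 64;         // 64 B line size (cacheLineSize)

    const int sets = size / (assoc * bsize);
    printf("number of sets = %d\n", sets);  // prints 128 -> 7 index bits, 6 block-offset bits
    return 0;
}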
In the DMSHR section of the configuration file, we can see that it is a 64-entry structure where each entry can keep track of a miss to an entire 64-byte block. On a miss, the L1 cache requests data from the core's local slice of the L2 cache, or from the on-chip router that connects it to the L2 slices of other cores.
[DMSHR]
type = 'single' # Options: none, nodeps, full, single, banked Check libsuc/MSHR
size = 64
bsize = $(cacheLineSize)
Note that in this project we will still be using only one core (Core 0), so it gets to use the entire L2 cache (all four slices). Looking at the L2Slice section,
[L2Slice]
deviceType = 'slicecache'
inclusive = false
size = 1*1024*1024
assoc = 16
bsize = $(cacheLineSize)
writePolicy = 'WB'
replPolicy = 'LRU'
numPorts = 2 # one for L1, one for snooping
portOccp = 1 # throughput of a cache
hitDelay = 12
missDelay = 12 # exclusive, i.e., not added to hitDelay
numPortsDir = 1 # one for L1, one for snooping
portOccpDir = 1 # throughput of a cache
hitDelayDir = 1
MSHR = 'L2MSHR'
lowerLevel = "Router RTR sharedBy 1"
We see that
- each slice can store 1 MB of data (so the total L2 cache size is 4 MB): by size
- it is a 16-way set-associative cache: by assoc
- it has a 64 B block size: by bsize
- it is a write-back cache: by writePolicy
- it uses the LRU replacement policy: by replPolicy
- it can handle 2 cache accesses every cycle: by numPorts and portOccp
- it has a 12-cycle hit time: by hitDelay
- it takes 12 cycles to detect a miss: by missDelay
- it keeps track of misses with a 64-entry MSHR (L2MSHR): by MSHR
When there is a miss, it is handed off to a local on-chip router (see the router section), which uses the on-chip network (NoC) to deliver the message to a memory controller (see the NOC section). The memory controller, in turn, uses the off-chip processor-memory bus to access main memory, which can be found in the Memory section,
[Memory]
deviceType = 'niceCache'
size = 64
assoc = 1
bsize = 64
writePolicy = 'WB'
replPolicy = 'LRU'
numPorts = 1
portOccp = 1
hitDelay = 200
missDelay = 10000
MSHR = NoMSHR
lowerLevel = 'voidDevice'
which is modeled in this configuration as an infinite cache with a 200-cycle hit delay (hitDelay).
Now, let's read the default report sesc_fmm.mipseb.Default by,
$ ~/sesc/scripts/report.pl sesc_fmm.mipseb.Default
Then we should change the L1 cache size to 2 kB, leave all other configuration parameters unchanged, and save the result as a new configuration file named cmp4-noc-SmallL1.conf.
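As a minimal sketch (assuming we simply copy cmp4-noc.conf and edit only the size line of the [DMemory] section, writing 2 kB in the same 2*1024 notation), the edited entry would look like,
[DMemory]
deviceType = 'smpcache'
size = 2*1024        # the only change: 2 kB instead of 32*1024
assoc = 4
bsize = $(cacheLineSize)
# ... the remaining fields stay exactly as in the original cmp4-noc.conf
Then we can redo the simulation by,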
$ ~/sesc/sesc.opt -f SmallL1 -c ~/sesc/confs/cmp4-noc-SmallL1.conf -iInput/input.256 -ofmm.out -efmm.err fmm.mipseb -p 1
We can then read this report by,
$ ~/sesc/scripts/report.pl sesc_fmm.mipseb.SmallL1
Comparing the miss rates of these two caches, we expect to see that the larger (32 kB) cache has a lower miss rate and thus better performance.
Now, let's modify the original configuration file cmp4-noc.conf to make the L1 a direct-mapped cache (i.e. an associativity of 1). The new configuration file is named cmp4-noc-DMapL1.conf. Then we can redo the simulation by,
$ ~/sesc/sesc.opt -f DMapL1 -c ~/sesc/confs/cmp4-noc-DMapL1.conf -iInput/input.256 -ofmm.out -efmm.err fmm.mipseb -p 1
We can then read this report by,
$ ~/sesc/scripts/report.pl sesc_fmm.mipseb.DMapL1
We should find that the set-associative cache performs better than the direct-mapped cache.
Now let's restore the default configuration (32 kB, 4-way set-associative L1 cache) and change the L1 cache latency to 4 cycles (change both hitDelay and missDelay to 3), and then to 7 cycles. The two configuration files should be named cmp4-noc-4CycL1.conf and cmp4-noc-7CycL1.conf, respectively. Then we can run the simulations by,
$ ~/sesc/sesc.opt -f 4CycL1 -c ~/sesc/confs/cmp4-noc-4CycL1.conf -iInput/input.256 -ofmm.out -efmm.err fmm.mipseb -p 1
$ ~/sesc/sesc.opt -f 7CycL1 -c ~/sesc/confs/cmp4-noc-7CycL1.conf -iInput/input.256 -ofmm.out -efmm.err fmm.mipseb -p 1
We can then read these reports by,
$ ~/sesc/scripts/report.pl sesc_fmm.mipseb.4CycL1
$ ~/sesc/scripts/report.pl sesc_fmm.mipseb.7CycL1
2. Experiment 2: NXLRU Policy Implementation
The cache implementation in the simulator can only model the LRU replacement policy. Even though a RANDOM policy can be specified in the configuration file, the code that models the replacement policy will still implement LRU.
Now we will explore what happens when we actually change the cache's replacement policy. We will implement the NXLRU (Next-to-Least-Recently-Used) policy. While LRU replaces the block that is first in LRU order (i.e. the least recently used block) in the cache set, NXLRU should replace the block that is second in LRU order in the set.
To implement NXLRU, we need to modify the code of the simulator. The source files that implement the smpcache (used for our L1 cache) are in the ~/sesc/src/libcmp/ directory, named
SMPCache.h
SMPCache.cpp
For much of the basic cache behavior, the SMPCache also uses code in ~/sesc/src/libsuc/, named
CacheCore.h
CacheCore.cpp
In these files, there are separate classes for
- CacheDM (for direct-mapped caches)
- CacheAssoc (for set-associative caches)
Since direct-mapped caches do not have a replacement policy (they must replace the one line where the new block must go), we will be looking at the CacheAssoc class.
First, we must add NXLRU as an option that can be specified in the conf file and selected when a CacheAssoc object is constructed. A good approach is probably to look for LRU in the code to see how this is done for LRU (and RANDOM), and then add NXLRU the same way (see the sketch after the following list). Remember to check the following parts,
- macros
- CacheGeneric
- CacheAssoc
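As a purely illustrative sketch (the identifiers below are hypothetical, not the actual SESC code; the real names must be mirrored from how LRU and RANDOM are handled in CacheCore.h and CacheCore.cpp), registering a new policy name generally means adding it to the policy macros and enumeration, and to the string comparison that selects the policy when the cache object is constructed,
#include <cstdio>
#include <strings.h>  // strcasecmp

// Hypothetical names for illustration only -- not SESC identifiers.
#define k_RANDOM "RANDOM"
#define k_LRU    "LRU"
#define k_NXLRU  "NXLRU"   // the new option we want to accept from the conf file

enum PolicyType { RANDOM, LRU, NXLRU };

// The same kind of string-to-policy mapping a cache constructor would do.
static PolicyType parsePolicy(const char *pStr) {
    if (strcasecmp(pStr, k_RANDOM) == 0) return RANDOM;
    if (strcasecmp(pStr, k_LRU)    == 0) return LRU;
    if (strcasecmp(pStr, k_NXLRU)  == 0) return NXLRU;  // newly added case
    fprintf(stderr, "Invalid cache replacement policy [%s]\n", pStr);
    return LRU;  // fall back, just for this illustration
}

int main() {
    printf("policy id for NXLRU = %d\n", parsePolicy("NXLRU"));  // prints 2
    return 0;
}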
Then we must actually implement the policy. The function that actually implements the cache's replacement policy is the findLine2Replace method of the CacheAssoc class in CacheCore.cpp.
The parameter addr supplied to this method is the new address that needs a line in the cache. Note that this method does not only implement the replacement policy, because an actual replacement (replacing one valid line with another) may not be needed. For example, when addr is already in the cache (a cache hit), this method returns the line that contains addr. As a fast path, the method first checks only the first line of the set (the most recently used one), as if the cache were direct-mapped; only if that quick check misses does it fall back to searching the whole set as a normal set-associative cache. This works like a simple MRU-based way prediction.
// extract tag from the address
Addr_t tag = calcTag(addr);
// extract index and get the corresponding set
Line **theSet = &content[calcIndex4Tag(tag)];
/* Check most typical case */
// if the tag in the set equals the tag we extracted
// means a cache hit
if ((*theSet)->getTag() == tag) {
GI(tag,(*theSet)->isValid()); // assertion: a non-zero tag implies the line is valid
return *theSet; // return the line that contains addr
}
However, when this way-prediction check misses, we need to search the whole set and treat the cache as a normal set-associative cache. When we have a cache hit, the hit line is stored in the local variable lineHit. If we have a cache miss, we need to check whether there is an invalid (non-valid) line: when the set that addr maps to contains invalid lines, one of them is stored in lineFree and will be used, because a valid block may still be hit in the future while an invalid line cannot, so we should only replace a valid line if the set has no invalid lines. Finally, locked lines are skipped.
// get the set ending line based on associativity
Line **setEnd = theSet + assoc;
// assume we have a cache hit line and initialize it with NULL
Line **lineHit=0;
// assume we have a non-valid line and initialize it with NULL
Line **lineFree=0;
/* search the cache hit in the set */
// let's start looping from the LRU to the MRU line in the set
Line **l = setEnd -1;
// when the address of line l is smaller than the
// set beginning address, we have to break the loop
// because we have checked the whole set
while(l >= theSet) {
// let's check if the current line has a cache hit
if ((*l)->getTag() == tag) {
lineHit = l; // set to lineHit if we have a hit
break; // then break the loop
}
// if the current line is non-valid, it can be useful
if (!(*l)->isValid())
lineFree = l; // set the free line
// if we haven't got a free line, we set the free line to
// the LRU line without lock
else if (lineFree == 0 && !(*l)->isLocked())
lineFree = l;
// If line is invalid, isLocked must be false
GI(!(*l)->isValid(), !(*l)->isLocked());
l--; // loop the next line
}
// if we have a cache hit, we can directly return the line hit
if (lineHit)
return *lineHit;
So the actual LRU policy implemented by findLine2Replace is, for the set that addr belongs to:
- return the line that contains addr, if there is such a line
- otherwise, return the invalid line that was accessed most recently, if there are any invalid lines
- otherwise, return the LRU line among the lines that are not locked
Even that is not the complete specification, because findLine2Replace must also consider what should happen when all lines are valid and locked. In that case, it returns 0, unless ignoreLocked is true, in which case it returns the least recently used line among all the (valid and locked) lines in the set.
Our NXLRU policy should treat hits and invalid lines just like the existing LRU and RANDOM policies, but when there is no hit and no invalid line to return, NXLRU should find the second-least-recently-used line among the non-locked lines. However, if only one non-locked line exists in the set, that line must be returned. And if all lines are valid and locked, the second-least-recently-used line in the set should be returned.
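To make the ordering concrete, here is a tiny, self-contained illustration (not simulator code) of which line each policy would evict when a set contains only valid, unlocked lines,
#include <cstdio>

int main() {
    // A 4-way set listed from MRU (index 0) to LRU (index 3);
    // assume all four lines are valid and none of them are locked.
    const char *way[] = {"A", "B", "C", "D"};
    const int assoc = 4;

    printf("LRU victim:   %s\n", way[assoc - 1]);  // D, the least recently used line
    printf("NXLRU victim: %s\n", way[assoc - 2]);  // C, the second in LRU order
    return 0;
}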
Because changing the behavior of the existing policies would change the behavior of all cache-like structures in the processor, including TLBs, we want to change the replacement policy only in the L1 caches and leave the behavior of TLBs, L2 caches, etc. unchanged! So we must add a new NXLRU policy instead of modifying the existing LRU (or RANDOM) code.
After the modification, we have to copy the files from the shared folder into the sesc source directory by,
$ cp /media/sf_CS6290/sesc/src/libsuc/CacheCore.h ~/sesc/src/libsuc/CacheCore.h
$ cp /media/sf_CS6290/sesc/src/libsuc/CacheCore.cpp ~/sesc/src/libsuc/CacheCore.cpp
Then we have to rebuild the sesc simulator by,
$ cd ~/sesc
$ make
Now, let's run a simulation with a 2 kB L1 cache, using the NXLRU policy, and with all other settings at their default values. The configuration file should be named cmp4-noc-L1NXLRU.conf, and the simulation report for this run will be named sesc_fmm.mipseb.L1NXLRU.
$ ~/sesc/sesc.opt -f L1NXLRU -c ~/sesc/confs/cmp4-noc-L1NXLRU.conf -iInput/input.256 -ofmm.out -efmm.err fmm.mipseb -p 1
We can then read this report by,
$ ~/sesc/scripts/report.pl sesc_fmm.mipseb.L1NXLRU
Because report.pl does not provide summary statistics for the L2 cache, you will have to examine the report file generated by SESC directly to find the number of blocks fetched by the L1 and L2 caches.
$ cat sesc_fmm.mipseb.L1NXLRU | more
This file begins with a copy of the configuration that was used, then reports how many events of each kind were observed in each part of the processor. Events in the DL1 cache of processor zero (the one running the application) are reported in lines that start with P(0)_DL1:.
In the report file, the number of blocks requested by the L1 cache from the L2 cache is reported as lineFill (these become entire-block reads from the L2 cache), and the number of write-backs the L1 performs to the L2 is reported as writeBack (these become entire-block writes to the L2 cache). So if we want to count the block reads from L2, we can use,
$ cat sesc_fmm.mipseb.L1NXLRU | grep "P(0)_DL1:lineFill"
Comparing this result with the sesc_fmm.mipseb.SmallL1 report that we generated in Experiment 1, you should find that NXLRU causes more block reads from L2, because this policy is worse than LRU.
3. Experiment 3: Miss Classification
Now we will change the simulator to identify what kind of L1 cache miss we are having each time. Recall that the misses can be,
- compulsory: a miss is a compulsory miss if it would occur in an infinite-sized cache, i.e. if the block has never been in the cache before
- capacity: a non-compulsory miss that would occur even in a fully associative LRU cache that has the same block size and overall capacity
- conflict: a miss that is neither a compulsory nor a capacity miss
The L1 cache in the simulator counts read and write misses in separate counters, which appear in the simulation report as the readMiss and writeMiss numbers for each cache. For example, there is a line of the form P(0)_DL1:readMiss=something in the report file. These counters are implemented in the files SMPCache.h and SMPCache.cpp by the following variables,
GStatsCntr readMiss;
GStatsCntr writeMiss;
Now we need to add new counters, which should appear in the simulation report file as compMiss, capMiss, and confMiss (these three values should add up to the readMiss + writeMiss value). Each of the new counters should count both read and write misses of that kind.
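As a minimal sketch (assuming the new counters are declared and initialized exactly the way the existing readMiss and writeMiss counters are in SMPCache.h and SMPCache.cpp), the new declarations would simply mirror the ones above,
// Sketch only: declarations mirroring readMiss/writeMiss in SMPCache.h.
// The SMPCache constructor would also need to initialize these the same way
// readMiss and writeMiss are initialized, so that they appear in the report.
GStatsCntr compMiss;  // compulsory misses (reads + writes)
GStatsCntr capMiss;   // capacity misses (reads + writes)
GStatsCntr confMiss;  // conflict misses (reads + writes)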
Note that we may also add counters that count compulsory, capacity, and conflict misses separately for reads and writes, or do this classification of misses for other caches, but we must show the overall result in the report.
From the SMPCache.cpp file, we can see that readMiss and writeMiss are incremented in the doRead and doWrite methods, so we may consider adding our classification logic there.
Another potentially confusing consideration is that it is possible for an access to be a hit in the set-associative cache but a miss in the fully associative cache that we model to determine whether a miss is a conflict miss.
After the modification, we have to copy the files from the shared folder into the sesc source directory by,
$ cp /media/sf_CS6290/sesc/src/libcmp/SMPCache.h ~/sesc/src/libcmp/SMPCache.h
$ cp /media/sf_CS6290/sesc/src/libcmp/SMPCache.cpp ~/sesc/src/libcmp/SMPCache.cpp
Then we have to rebuild the sesc simulator by,
$ cd ~/sesc
$ make
With your new miss-classification code in the simulator, you should run a simulation with,
- the default configuration (32 kB 4-way set-associative LRU L1 cache)
- the 2kB 4-way set-associative LRU L1 cache
- the direct-mapped 32kB L1 cache
- the 32kB 4-way set-associative NXLRU L1 cache
We have already created most of these configuration files in the previous experiments:
cmp4-noc.conf
cmp4-noc-SmallL1.conf
cmp4-noc-DMapL1.conf
We haven't created the last configuration file yet, so we have to make a new one named cmp4-noc-DefNXLRU.conf.
Then we can generate the new reports by,
$ ~/sesc/sesc.opt -f DefLRU -c ~/sesc/confs/cmp4-noc.conf -iInput/input.256 -ofmm.out -efmm.err fmm.mipseb -p 1
$ ~/sesc/sesc.opt -f SmallLRU -c ~/sesc/confs/cmp4-noc-SmallL1.conf -iInput/input.256 -ofmm.out -efmm.err fmm.mipseb -p 1
$ ~/sesc/sesc.opt -f DefDM -c ~/sesc/confs/cmp4-noc-DMapL1.conf -iInput/input.256 -ofmm.out -efmm.err fmm.mipseb -p 1
$ ~/sesc/sesc.opt -f DefNXLRU -c ~/sesc/confs/cmp4-noc-DefNXLRU.conf -iInput/input.256 -ofmm.out -efmm.err fmm.mipseb -p 1
And the newly generated reports are,
sesc_fmm.mipseb.DefLRU
sesc_fmm.mipseb.SmallLRU
sesc_fmm.mipseb.DefDM
sesc_fmm.mipseb.DefNXLRU