High-Performance Computer Architecture 25 | Cache Experiment

Series: High-Performance Computer Architecture

NOTE: This article does NOT include anything related to the final submission (no answers, no specific values, no results) for the course CS6290 HPCA because of the honor code. Most of the content in this article repeats the basic instructions and directions of Project 2. Additional Linux commands are provided to complement the project’s guidelines.

Please feel free to contact me if this article violates the rules of Georgia Tech, and I will delete it immediately.

0. Basic Setup

In this experiment, we are going to focus on the cache. We will use the original cmp4-noc.conf file and you can view the file by,

$ cd ~/sesc/confs
$ cat cmp4-noc.conf | more

Also, we will be using the FMM benchmark with 256 particles and single-core execution, which can be found in,

$ cd ~/sesc/apps/Splash2/fmm

Then we can view the content of this directory by,

$ ls

If you access this directory for the first time, you will probably get the following output,

Changed  Changed1  Changed.new  Changed.NOLOCK  Input  Makefile  Source

From this output, we can find out that the file fmm.mipseb doesn’t exist. We can use the Makefile to compile this file,

$ make
$ ls

Then the output should be,

Changed   Changed.new     fmm.mipseb  Makefile
Changed1  Changed.NOLOCK  Input       Source

Then we can run the simulation as we did in the previous experiments, using the following command,

$ ~/sesc/sesc.opt -f Default -c ~/sesc/confs/cmp4-noc.conf -iInput/input.256 -ofmm.out -efmm.err fmm.mipseb -p 1

The output report will be named sesc_fmm.mipseb.Default, and we have to make sure that the fmm.err file is empty,

$ cat fmm.err

and for fmm.out,

$ cat fmm.out

the output should begin with,

Creating a two cluster, non uniform distribution for 256 particles

and have a,

Total time for steps 3 to 5 : 0

at the end.

1. Experiment 1: Cache Performance

In this experiment, we will be modifying the data caches of the simulated processor. So let’s take a closer look at the configuration file ~/sesc/confs/cmp4-noc.conf by,

$ cat ~/sesc/confs/cmp4-noc.conf | more

Let’s then have a look at the DMemory section,

# data source
[DMemory]
deviceType    = 'smpcache'
size          = 32*1024  
assoc         = 4 
bsize         = $(cacheLineSize)
writePolicy   = 'WB'
replPolicy = 'LRU'
protocol      = 'DMESI'
numPorts      = 2 
portOccp      = 1 # Number of occupancy per port. 0: UnlimitedPort, 1:FullyPipelinedPort, other value: PortPipe
hitDelay      = 1
missDelay     = 1               
#displNotify   = false
MSHR          = "DMSHR"
lowerLevel   = "Router RTR sharedBy 1"

It says that the structure the processor gets data from is of type smpcache which,

can store 32KB of data: by size
4-way set associative: by assoc
has a 64B block/line size: by bsize which uses the cacheLineSize (64)
write-back cache: by writePolicy
LRU replacement policy: by replPolicy
handle 2 cache accesses every cycle: by numPorts and portOccp
has a 1-cycle hit time: by hitDelay
takes 1 cycle to detect a miss: by missDelay
keeps track of the misses with the DMSHR (data miss status holding registers): by MSHR
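
As a worked example of what these parameters imply, the sketch below derives the number of sets and the address-bit split from size, assoc, and bsize, using the textbook relation sets = size / (assoc * block size). The names CacheGeometry, log2Exact, and computeGeometry are made up for illustration and are not part of SESC, and the way SESC computes tags and indices internally may differ.

```cpp
#include <cassert>
#include <cstdint>

// Geometry of a set-associative cache derived from its configuration.
// Hypothetical helper type, not part of SESC.
struct CacheGeometry {
    std::uint32_t numSets;     // number of sets in the cache
    std::uint32_t offsetBits;  // address bits that select a byte within a block
    std::uint32_t indexBits;   // address bits that select a set
};

// Integer log2 for exact powers of two.
inline std::uint32_t log2Exact(std::uint32_t x) {
    assert(x != 0 && (x & (x - 1)) == 0);  // x must be a power of two
    std::uint32_t bits = 0;
    while (x > 1) {
        x >>= 1;
        ++bits;
    }
    return bits;
}

// sets = size / (assoc * blockSize); the bit widths follow from that.
inline CacheGeometry computeGeometry(std::uint32_t sizeBytes,
                                     std::uint32_t assoc,
                                     std::uint32_t blockBytes) {
    assert(assoc > 0 && blockBytes > 0);
    assert(sizeBytes % (assoc * blockBytes) == 0);
    CacheGeometry g;
    g.numSets    = sizeBytes / (assoc * blockBytes);
    g.offsetBits = log2Exact(blockBytes);
    g.indexBits  = log2Exact(g.numSets);
    return g;
}
```

For the default L1 shown above (32*1024 bytes, 4 ways, 64-byte blocks), computeGeometry(32 * 1024, 4, 64) gives 128 sets, 6 offset bits, and 7 index bits.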

In the DMSHR section of the configuration file, we can find out that it is a 64-entry structure where each entry can keep track of a miss to an entire 64-byte block. On a miss, the L1 cache requests data from the core’s local slice of the L2 cache, or from the on-chip router that connects it to the L2 slices of other cores.

[DMSHR]
type = 'single' # Options: none, nodeps, full, single, banked Check libsuc/MSHR
size = 64
bsize = $(cacheLineSize)

Note that in this project we will still be using only one core (Core 0) so it gets to use the entire L2 cache (all four slices). Looking at the L2Slice section,

[L2Slice]
deviceType    = 'slicecache'
inclusive     = false
size          = 1*1024*1024
assoc         = 16
bsize         = $(cacheLineSize)
writePolicy   = 'WB'
replPolicy = 'LRU'
numPorts      = 2                # one for L1, one for snooping
portOccp      = 1  # throughput of a cache
hitDelay      = 12
missDelay     = 12               # exclusive, i.e., not added to hitDelay
numPortsDir      = 1                # one for L1, one for snooping
portOccpDir      = 1  # throughput of a cache
hitDelayDir      = 1
MSHR          = 'L2MSHR'
lowerLevel    = "Router RTR sharedBy 1"

We see that,

each slice can store 1 MB of data (so the total L2 cache size is 4MB): by size
it is a 16-way set-associative cache: by assoc
with 64B block size: by bsize
with write-back cache: by writePolicy
with LRU replacement policy: by replPolicy
handle 2 cache accesses every cycle: by numPorts and portOccp
has a 12-cycle hit time: by hitDelay
takes 12 cycles to detect a miss: by missDelay
it uses a 64-entry MSHR (L2MSHR) to keep track of misses: by MSHR

When there is a miss, it is handed off to a local on-chip router (see the router section), which uses the on-chip network (NOC) to deliver the message to a memory controller (see the NOC section).

It, in turn, uses the off-chip processor-memory bus to access the main memory, and this can be found from the memory section,

[Memory]
deviceType    = 'niceCache'
size          = 64
assoc         = 1
bsize         = 64
writePolicy   = 'WB'
replPolicy = 'LRU'
numPorts      = 1
portOccp      = 1
hitDelay      = 200 
missDelay     = 10000
MSHR          = NoMSHR
lowerLevel    = 'voidDevice'

which is modeled in this configuration as an infinite cache with,

a 200-cycle hit delay: by hitDelay
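
To see how the three latencies above (1 cycle for L1, 12 for L2, 200 for memory) combine, here is a minimal sketch of the textbook average memory access time (AMAT) formula. It is only a back-of-the-envelope model: it assumes the latencies simply add up, and it ignores the router and NOC delays as well as the overlap that the MSHRs allow, so it will not match what the simulator reports. The function name amatCycles is hypothetical, and the miss rates are inputs that have to be supplied by the caller.

```cpp
#include <cassert>

// Textbook average memory access time for a two-level cache hierarchy:
//   AMAT = L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * memory latency)
// All times are in cycles; miss rates are fractions in [0, 1].
// Assumption: latencies add up serially (no overlap, no network delay).
inline double amatCycles(double l1HitTime, double l1MissRate,
                         double l2HitTime, double l2MissRate,
                         double memLatency) {
    assert(l1MissRate >= 0.0 && l1MissRate <= 1.0);
    assert(l2MissRate >= 0.0 && l2MissRate <= 1.0);
    return l1HitTime + l1MissRate * (l2HitTime + l2MissRate * memLatency);
}
```

For instance, with a made-up 50% miss rate at both levels, amatCycles(1, 0.5, 12, 0.5, 200) evaluates to 57 cycles. Any meaningful miss rates have to be taken from the simulation reports.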

Now, let’s read the default report sesc_fmm.mipseb.Default by,

$ ~/sesc/scripts/report.pl sesc_fmm.mipseb.Default

Next, we copy cmp4-noc.conf to a new configuration file named cmp4-noc-SmallL1.conf and, in the copy, change the L1 cache size to 2kB while leaving all other parameters unchanged. Then we can redo the simulation by,

$ ~/sesc/sesc.opt -f SmallL1 -c ~/sesc/confs/cmp4-noc-SmallL1.conf -iInput/input.256 -ofmm.out -efmm.err fmm.mipseb -p 1

We can then read this report by,

$ ~/sesc/scripts/report.pl sesc_fmm.mipseb.SmallL1

Comparing the miss rates of these two caches, we expect the larger cache (32KB) to have a lower miss rate and therefore better performance.

Now, let’s modify the original configuration file cmp4-noc.conf to make the L1 cache direct-mapped. The new configuration file should be named cmp4-noc-DMapL1.conf. Then we can redo the simulation by,

$ ~/sesc/sesc.opt -f DMapL1 -c ~/sesc/confs/cmp4-noc-DMapL1.conf -iInput/input.256 -ofmm.out -efmm.err fmm.mipseb -p 1

We can then read this report by,

$ ~/sesc/scripts/report.pl sesc_fmm.mipseb.DMapL1

We expect to find that the set-associative cache performs better than the direct-mapped cache.
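
The intuition behind this expectation can be shown with a toy model that is unrelated to SESC and to the FMM benchmark. The sketch below (ToyCache and pingPongMisses are hypothetical names) tracks only block addresses, uses LRU within each set, and alternates between two blocks that map to the same set. With the same total number of lines, the direct-mapped organization keeps evicting one block to make room for the other, while the 2-way organization holds both.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy cache model: stores block addresses only, LRU replacement within a set.
// Each set is a vector ordered from MRU (front) to LRU (back).
class ToyCache {
public:
    ToyCache(std::size_t numSets, std::size_t assoc)
        : assoc_(assoc), sets_(numSets) {
        assert(numSets > 0 && assoc > 0);
    }

    // Access one block address; returns true on a hit.
    bool access(std::uint64_t block) {
        std::vector<std::uint64_t>& set = sets_[block % sets_.size()];
        for (std::size_t i = 0; i < set.size(); ++i) {
            if (set[i] == block) {
                // Hit: move the block to the MRU position.
                set.erase(set.begin() + static_cast<std::ptrdiff_t>(i));
                set.insert(set.begin(), block);
                return true;
            }
        }
        // Miss: evict the LRU block if the set is full, then insert at MRU.
        if (set.size() == assoc_) set.pop_back();
        set.insert(set.begin(), block);
        ++misses_;
        return false;
    }

    std::size_t misses() const { return misses_; }

private:
    std::size_t assoc_;
    std::vector<std::vector<std::uint64_t>> sets_;
    std::size_t misses_ = 0;
};

// Alternate between blocks a and b for the given number of rounds
// and return how many misses the cache suffered.
inline std::size_t pingPongMisses(ToyCache cache, std::uint64_t a,
                                  std::uint64_t b, int rounds) {
    for (int i = 0; i < rounds; ++i) {
        cache.access(a);
        cache.access(b);
    }
    return cache.misses();
}
```

With eight lines in total, pingPongMisses(ToyCache(8, 1), 0, 8, 10) returns 20 (every access misses), while pingPongMisses(ToyCache(4, 2), 0, 8, 10) returns 2 (only the first access to each block misses).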

Now let’s restore the default configuration (32kB, 4-way set associative L1 cache) and change the L1 cache latency to 4 cycles (change both hitDelay and missDelay to 4) and then to 7 cycles. The two configuration files should be named cmp4-noc-4CycL1.conf and cmp4-noc-7CycL1.conf, respectively. Then we can run the simulations by,

$ ~/sesc/sesc.opt -f 4CycL1 -c ~/sesc/confs/cmp4-noc-4CycL1.conf -iInput/input.256 -ofmm.out -efmm.err fmm.mipseb -p 1
$ ~/sesc/sesc.opt -f 7CycL1 -c ~/sesc/confs/cmp4-noc-7CycL1.conf -iInput/input.256 -ofmm.out -efmm.err fmm.mipseb -p 1

We can then read these reports by,

$ ~/sesc/scripts/report.pl sesc_fmm.mipseb.4CycL1
$ ~/sesc/scripts/report.pl sesc_fmm.mipseb.7CycL1

2. Experiment 2: NXLRU Policy Implementation

The cache implementation in the simulator can only model the LRU replacement policy. Even though a RANDOM policy can be specified in the configuration file, the code that models the replacement policy will still implement LRU.

Now we will explore what happens when we actually change the cache’s replacement policy. We will implement the NXLRU (Next to Least Recently Used) policy. While LRU replaces the block that is first in LRU order (i.e. the least recently used block) in the cache set, NXLRU should replace the block that is second in LRU order in the set.

To implement NXLRU, we need to modify the code of the simulator. The source files that implement the smpcache (used for our L1 cache) are in the ~/sesc/src/libcmp/ directory and are named,

SMPCache.h
SMPCache.cpp

For much of the basic cache behavior, the SMPCache also uses code in ~/sesc/src/libsuc/ named,

CacheCore.h
CacheCore.cpp

In these files, there are separate classes for,

CacheDM (for direct-mapped caches)
CacheAssoc (for set-associative caches).

Since direct-mapped caches do not have a replacement policy (they must replace the one line where the new block must go), we will be looking at the CacheAssoc class.

First, we must add NXLRU as an option that can be specified in the conf file, so that it can be selected when a CacheAssoc object is constructed. A good approach is probably to search for LRU in the code to see how this is done for LRU (and RANDOM), and then add NXLRU. Remember to check the following parts,

macros
CacheGeneric
CacheAssoc

Then we must actually implement this policy. The function that actually implements the cache’s replacement policy is the findLine2Replace method of the CacheAssoc class in CacheCore.cpp.

The parameter addr supplied to this method is the new address that needs a line in the cache. Note that this method does not only implement the replacement policy, because an actual replacement (replacing one valid line with another) may not be needed. For example, when addr is already in the cache (a cache hit), this method returns the line that contains addr. The method first checks the most typical case: it looks only at the first line of the set (the MRU position), much as a direct-mapped cache looks at a single line, instead of searching every line in the set. Only if this check fails is the set searched like a normal set-associative cache. This resembles way prediction.

// extract tag from the address
Addr_t tag    = calcTag(addr);

// extract index and get the corresponding set
Line **theSet = &content[calcIndex4Tag(tag)];

/* Check most typical case */
// if the tag in the set equals the tag we extracted
// means a cache hit
if ((*theSet)->getTag() == tag) {
    GI(tag,(*theSet)->isValid());        // sanity check: if the tag is non-zero, the line must be valid
    return *theSet;               // return the line that contains addr
}

However, when this first check misses, we need to search the whole set and treat the cache as a normal set-associative cache. If the search finds a hit, the hit line is stored in the local variable lineHit. Otherwise, we need to check whether the set contains a non-valid line. When the set that addr belongs to contains non-valid lines, one of them is stored in lineFree and will be used. The reason is that a valid block may still get a cache hit in the future while a non-valid line cannot, so we should only replace a valid line if the set has no non-valid lines. Finally, when the set contains locked lines, they are skipped.

// get the set ending line based on associativity
Line **setEnd = theSet + assoc;

// will point to the line that hits, if any; starts as NULL
Line **lineHit=0;

// will point to the candidate line to reuse or replace, if any; starts as NULL
Line **lineFree=0;

/* search the cache hit in the set */
// let's start looping from the LRU to the MRU line in the set
Line **l = setEnd -1;

// when the address of line l is smaller than the 
// set beginning address, we have to break the loop
// because we have checked the whole set
while(l >= theSet) {
    // let's check if the current line has a cache hit
    if ((*l)->getTag() == tag) {
        lineHit = l;         // set to lineHit if we have a hit
        break;               // then break the loop
    }
    // if the current line is non-valid, it can be useful
    if (!(*l)->isValid())
        lineFree = l;        // set the free line
    // if we haven't got a free line, we set the free line to 
    // the LRU line without lock
    else if (lineFree == 0 && !(*l)->isLocked())
        lineFree = l;

    // If line is invalid, isLocked must be false
    GI(!(*l)->isValid(), !(*l)->isLocked());
    l--;                     // loop the next line
}

// if we have a cache hit, we can directly return the line hit
if (lineHit)
    return *lineHit;

So the actual LRU policy implemented by findLine2Replace is that,

from the set where addr belongs,

return the line that contains addr if there is such a line
otherwise return the invalid line that was accessed most recently if there are any invalid lines
otherwise return the LRU line among the lines that are not locked

Even that is not the complete specification because findLine2Replace must consider what should happen when all lines are valid and locked. In that case, it returns 0 unless ignoreLocked is true, in which case it returns the least recently used line chosen among all the (valid and locked) lines in the set.
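
The selection order described above for the existing LRU behavior can be restated as a small standalone sketch. This is a toy model, not the SESC code: ToyLine and chooseLineLRU are hypothetical names, the set is a plain vector ordered from MRU (index 0) to LRU (last index), the function returns an index instead of a line pointer, and it returns -1 where findLine2Replace returns 0. It also checks the valid flag explicitly on a hit, which the excerpt above does not do.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a cache line (hypothetical, not SESC's Line class).
struct ToyLine {
    std::uint64_t tag;
    bool valid;
    bool locked;
};

// Returns the index of the chosen line in 'set' (index 0 = MRU, last = LRU),
// or -1 when every line is valid and locked and ignoreLocked is false.
inline int chooseLineLRU(const std::vector<ToyLine>& set, std::uint64_t tag,
                         bool ignoreLocked) {
    int hit = -1;
    int freeLine = -1;
    // Walk from the LRU end toward the MRU end, as the excerpt above does.
    for (int i = static_cast<int>(set.size()) - 1; i >= 0; --i) {
        const ToyLine& line = set[static_cast<std::size_t>(i)];
        assert(line.valid || !line.locked);  // an invalid line is never locked
        if (line.valid && line.tag == tag) {
            hit = i;  // 1st priority: the line that already holds the block
            break;
        }
        if (!line.valid) {
            freeLine = i;  // 2nd priority: invalid line closest to MRU
        } else if (freeLine == -1 && !line.locked) {
            freeLine = i;  // 3rd priority: LRU line among the unlocked ones
        }
    }
    if (hit != -1) return hit;
    if (freeLine != -1) return freeLine;
    // Every line is valid and locked.
    if (ignoreLocked && !set.empty()) return static_cast<int>(set.size()) - 1;
    return -1;
}
```

For a set whose four lines are all valid and unlocked, a tag that is not present selects index 3 (the LRU line). If the lines at indices 1 and 3 are invalid, index 1 is selected instead.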

Our NXLRU policy should treat hits and invalid lines just like the existing LRU and RANDOM policies, but when there is no hit and no invalid line to return, the NXLRU policy should find the second-least-recently-used line among the non-locked lines.

However, if only one non-locked line exists in the set, that line must be returned. And if all lines are valid and locked, the second-least-recently-used one in the set should be returned.

Changing the behavior of the existing policies would change the behavior of all cache-like structures in the processor, including TLBs. We want to change the replacement policy only in the L1 caches and leave the behavior of TLBs, L2 caches, etc. unchanged! So we must add a new NXLRU policy instead of modifying the existing LRU (or RANDOM) code.

After modification, we have to copy the files in the shared folder to the sesc library by,

$ cp /media/sf_CS6290/sesc/src/libsuc/CacheCore.h ~/sesc/src/libsuc/CacheCore.h
$ cp /media/sf_CS6290/sesc/src/libsuc/CacheCore.cpp ~/sesc/src/libsuc/CacheCore.cpp

Then we have to rebuild the sesc simulator by,

$ cd ~/sesc
$ make

Now, let’s run a simulation with a 2kB L1 cache, using the NXLRU policy, and with all other settings at their default values. The configuration file should be named cmp4-noc-L1NXLRU.conf. The simulation report for this should be named sesc_fmm.mipseb.L1NXLRU.

$ ~/sesc/sesc.opt -f L1NXLRU -c ~/sesc/confs/cmp4-noc-L1NXLRU.conf -iInput/input.256 -ofmm.out -efmm.err fmm.mipseb -p 1

We can then read this report by,

$ ~/sesc/scripts/report.pl sesc_fmm.mipseb.L1NXLRU

Because report.pl does not provide summary statistics on the L2 cache, you will have to directly examine the report file generated by SESC to find the number of blocks that are fetched for L1 and L2.

$ cat sesc_fmm.mipseb.L1NXLRU | more

This file begins with a copy of the configuration that was used, then reports how many events of each kind were observed in each part of the processor. Events in the DL1 cache of processor zero (the one running the application) are reported in lines that start with P(0)_DL1:.

In the report file, the number of blocks requested by the L1 cache from the L2 cache is reported as lineFill (these become entire-block reads from the L2 cache), and the number of write-backs the L1 wants to do to the L2 is reported as writeBack (these become entire-block writes to the L2 cache). Then if we want to get the block reads from L2, we can use,

$ cat sesc_fmm.mipseb.L1NXLRU | grep "P(0)_DL1:lineFill"
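
If you prefer to pull such counters out programmatically, a minimal sketch follows. It assumes that a counter appears in a report line as name=value with an unsigned integer value, as in the P(0)_DL1:readMiss=something pattern. The rest of the line format is not assumed, and counterValue is a hypothetical helper, not part of SESC or report.pl.

```cpp
#include <cassert>
#include <string>

// Extracts the unsigned integer that follows "<key>=" in one report line.
// Returns -1 if the key is not present or is not followed by a digit.
// Assumption: counters are printed as key=value with an integer value.
inline long long counterValue(const std::string& line, const std::string& key) {
    assert(!key.empty());
    const std::string needle = key + "=";
    const std::string::size_type pos = line.find(needle);
    if (pos == std::string::npos) return -1;

    std::string::size_type i = pos + needle.size();
    if (i >= line.size() || line[i] < '0' || line[i] > '9') return -1;

    long long value = 0;
    while (i < line.size() && line[i] >= '0' && line[i] <= '9') {
        value = value * 10 + (line[i] - '0');
        ++i;
    }
    return value;
}
```

Given a line such as "P(0)_DL1:lineFill=123" (the number is made up), counterValue(line, "P(0)_DL1:lineFill") returns 123.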

Comparing this result with the sesc_fmm.mipseb.SmallL1 report that we generated in experiment 1, you are expected to find that NXLRU causes more block reads from L2, because this policy is worse than LRU.

3. Experiment 3: Misses Classification

Now we will change the simulator to identify what kind of L1 cache miss we are having each time. Recall that the misses can be,

compulsory: a miss is a compulsory miss if it would occur in an infinite-sized cache, i.e. if the block has never been in the cache before
capacity: a non-compulsory miss that would occur even in a fully associative LRU cache that has the same block size and overall capacity
conflict: a miss that is neither a compulsory nor a capacity miss
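
These three definitions amount to a simple decision rule, sketched below as a standalone function. The sketch covers only the final decision. How the simulator learns whether a block has been in the cache before, or whether the access would also miss in a fully associative LRU cache of the same capacity, is the actual work of this experiment and is deliberately not shown. MissKind and classifyMiss are hypothetical names.

```cpp
#include <cassert>

// The three kinds of cache misses described above.
enum class MissKind { Compulsory, Capacity, Conflict };

// Classifies one miss of the real cache.
//   seenBefore:         the block has been in the cache at some earlier point
//   missesInFullyAssoc: the same access also misses in a fully associative
//                       LRU cache with the same block size and capacity
inline MissKind classifyMiss(bool seenBefore, bool missesInFullyAssoc) {
    if (!seenBefore) return MissKind::Compulsory;       // would miss in any cache
    if (missesInFullyAssoc) return MissKind::Capacity;  // associativity would not help
    return MissKind::Conflict;                          // neither of the above
}
```

For example, classifyMiss(true, false) yields MissKind::Conflict: the block was seen before and the fully associative cache would have hit.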

The L1 cache in the simulator counts read and write misses in separate counters, which appear in the simulation report as the readMiss and writeMiss numbers for each cache. For example, there is a line for P(0)_DL1:readMiss=something in the report file. This is implemented in the files SMPCache.h and SMPCache.cpp by the following variables,

GStatsCntr readMiss;
GStatsCntr writeMiss;

Now we need to have additional counters, which should appear in the simulation report file as compMiss, capMiss, and confMiss counters (these three values should add up to the readMiss+writeMiss value). Each of the new counters should count both read and write misses of that kind.
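
A quick way to sanity-check the new counters is to verify this sum after reading the five values from the report. The sketch below assumes the values have already been extracted by some means. MissCounters and countersConsistent are hypothetical names and not part of SESC.

```cpp
#include <cassert>

// The five L1 miss counters taken from one simulation report.
struct MissCounters {
    long long readMiss;
    long long writeMiss;
    long long compMiss;
    long long capMiss;
    long long confMiss;
};

// True when the three classified counters add up to all read and write misses.
inline bool countersConsistent(const MissCounters& c) {
    assert(c.readMiss >= 0 && c.writeMiss >= 0);
    assert(c.compMiss >= 0 && c.capMiss >= 0 && c.confMiss >= 0);
    return c.compMiss + c.capMiss + c.confMiss == c.readMiss + c.writeMiss;
}
```

With made-up values, a report with readMiss=6, writeMiss=4, compMiss=3, capMiss=5, and confMiss=2 passes the check, while the same report with confMiss=1 does not.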

Note that we can also have counters that count compulsory, capacity, and conflict misses separately for reads and writes, or that classify misses for other caches. But we must show the overall result in the report.

From the SMPCache.cpp file, we can find out that readMiss and writeMiss are incremented in the doRead and doWrite methods. So we may consider modifying something in these methods.

Another potentially confusing consideration is that an access can be a hit in the set-associative cache but a miss in the fully associative cache that we are modeling to determine whether a cache miss is a conflict miss.

After modification, we have to copy the files in the shared folder to the sesc library by,

$ cp /media/sf_CS6290/sesc/src/libcmp/SMPCache.h ~/sesc/src/libcmp/SMPCache.h
$ cp /media/sf_CS6290/sesc/src/libcmp/SMPCache.cpp ~/sesc/src/libcmp/SMPCache.cpp

Then we have to rebuild the sesc simulator by,

$ cd ~/sesc
$ make

With your new miss-classification code in the simulator, you should run a simulation with,

the default configuration (32kB 4-way set-associative LRU L1 cache)
the 2kB 4-way set-associative LRU L1 cache
the direct-mapped 32kB L1 cache
the 32kB 4-way set-associative NXLRU L1 cache

We have implemented some of the configuration files above. And the configuration files should be,

cmp4-noc.conf
cmp4-noc-SmallL1.conf
cmp4-noc-DMapL1.conf

The last configuration file does not exist yet, so we have to create a new one named cmp4-noc-DefNXLRU.conf.

Then we can generate the new reports by,

$ ~/sesc/sesc.opt -f DefLRU -c ~/sesc/confs/cmp4-noc.conf -iInput/input.256 -ofmm.out -efmm.err fmm.mipseb -p 1
$ ~/sesc/sesc.opt -f SmallLRU -c ~/sesc/confs/cmp4-noc-SmallL1.conf -iInput/input.256 -ofmm.out -efmm.err fmm.mipseb -p 1
$ ~/sesc/sesc.opt -f DefDM -c ~/sesc/confs/cmp4-noc-DMapL1.conf -iInput/input.256 -ofmm.out -efmm.err fmm.mipseb -p 1
$ ~/sesc/sesc.opt -f DefNXLRU -c ~/sesc/confs/cmp4-noc-DefNXLRU.conf -iInput/input.256 -ofmm.out -efmm.err fmm.mipseb -p 1

And the newly generated reports are,

sesc_fmm.mipseb.DefLRU
sesc_fmm.mipseb.SmallLRU
sesc_fmm.mipseb.DefDM
sesc_fmm.mipseb.DefNXLRU