Come interpreti l'output di cachegrind per i problemi di memorizzazione nella cache?

Per curiosità, ho eseguito la codifica di diverse versioni di Matrix Multiplication e ho eseguito cachegrind contro di esso. Nei miei risultati qui sotto, mi chiedevo quali parti fossero L1, L2, L3 mancati e riferimenti e cosa significasse veramente tutto ciò? Di seguito è riportato il mio codice per le moltiplicazioni di matrice, nel caso in cui qualcuno ne avesse bisogno.Come interpreti l'output di cachegrind per i problemi di memorizzazione nella cache?

#define SLOWEST 
==6933== Cachegrind, a cache and branch-prediction profiler 
==6933== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al. 
==6933== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info 
==6933== Command: ./a.out 500 
==6933== 
--6933-- warning: L3 cache found, using its data for the LL simulation. 
--6933-- warning: pretending that LL cache has associativity 24 instead of actual 16 
Multiplied matrix A and B in 60.7487 seconds. 
==6933== 
==6933== I refs:  6,039,791,314 
==6933== I1 misses:   1,611 
==6933== LLi misses:   1,519 
==6933== I1 miss rate:   0.00% 
==6933== LLi miss rate:   0.00% 
==6933== 
==6933== D refs:  2,892,704,678 (2,763,005,485 rd + 129,699,193 wr) 
==6933== D1 misses:  136,223,560 ( 136,174,705 rd +  48,855 wr) 
==6933== LLd misses:   53,675 (  5,247 rd +  48,428 wr) 
==6933== D1 miss rate:   4.7% (   4.9%  +   0.0% ) 
==6933== LLd miss rate:   0.0% (   0.0%  +   0.0% ) 
==6933== 
==6933== LL refs:   136,225,171 ( 136,176,316 rd +  48,855 wr) 
==6933== LL misses:   55,194 (  6,766 rd +  48,428 wr) 
==6933== LL miss rate:   0.0% (   0.0%  +   0.0% ) 

#define SLOWER 
==8463== Cachegrind, a cache and branch-prediction profiler 
==8463== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al. 
==8463== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info 
==8463== Command: ./a.out 500 
==8463== 
--8463-- warning: L3 cache found, using its data for the LL simulation. 
--8463-- warning: pretending that LL cache has associativity 24 instead of actual 16 
Multiplied matrix A and B in 49.7397 seconds. 
==8463== 
==8463== I refs:  4,537,213,120 
==8463== I1 misses:   1,571 
==8463== LLi misses:   1,487 
==8463== I1 miss rate:   0.00% 
==8463== LLi miss rate:   0.00% 
==8463== 
==8463== D refs:  2,891,485,608 (2,761,862,312 rd + 129,623,296 wr) 
==8463== D1 misses:  59,961,522 ( 59,913,256 rd +  48,266 wr) 
==8463== LLd misses:   53,113 (  5,246 rd +  47,867 wr) 
==8463== D1 miss rate:   2.0% (   2.1%  +   0.0% ) 
==8463== LLd miss rate:   0.0% (   0.0%  +   0.0% ) 
==8463== 
==8463== LL refs:   59,963,093 ( 59,914,827 rd +  48,266 wr) 
==8463== LL misses:   54,600 (  6,733 rd +  47,867 wr) 
==8463== LL miss rate:   0.0% (   0.0%  +   0.0% ) 

#define SLOW 
==9174== Cachegrind, a cache and branch-prediction profiler 
==9174== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al. 
==9174== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info 
==9174== Command: ./a.out 500 
==9174== 
--9174-- warning: L3 cache found, using its data for the LL simulation. 
--9174-- warning: pretending that LL cache has associativity 24 instead of actual 16 
Multiplied matrix A and B in 35.8901 seconds. 
==9174== 
==9174== I refs:  3,039,713,059 
==9174== I1 misses:   1,570 
==9174== LLi misses:   1,486 
==9174== I1 miss rate:   0.00% 
==9174== LLi miss rate:   0.00% 
==9174== 
==9174== D refs:  1,893,235,586 (1,763,112,301 rd + 130,123,285 wr) 
==9174== D1 misses:  63,285,950 ( 62,987,684 rd +  298,266 wr) 
==9174== LLd misses:   53,113 (  5,246 rd +  47,867 wr) 
==9174== D1 miss rate:   3.3% (   3.5%  +   0.2% ) 
==9174== LLd miss rate:   0.0% (   0.0%  +   0.0% ) 
==9174== 
==9174== LL refs:   63,287,520 ( 62,989,254 rd +  298,266 wr) 
==9174== LL misses:   54,599 (  6,732 rd +  47,867 wr) 
==9174== LL miss rate:   0.0% (   0.0%  +   0.0% ) 

#define MEDIUM 
==7838== Cachegrind, a cache and branch-prediction profiler 
==7838== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al. 
==7838== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info 
==7838== Command: ./a.out 500 
==7838== 
--7838-- warning: L3 cache found, using its data for the LL simulation. 
--7838-- warning: pretending that LL cache has associativity 24 instead of actual 16 
Multiplied matrix A and B in 23.4097 seconds. 
==7838== 
==7838== I refs:  2,548,967,151 
==7838== I1 misses:   1,610 
==7838== LLi misses:   1,522 
==7838== I1 miss rate:   0.00% 
==7838== LLi miss rate:   0.00% 
==7838== 
==7838== D refs:  1,399,237,303 (1,267,363,440 rd + 131,873,863 wr) 
==7838== D1 misses:   592,807 (  293,091 rd +  299,716 wr) 
==7838== LLd misses:   53,147 (  5,248 rd +  47,899 wr) 
==7838== D1 miss rate:   0.0% (   0.0%  +   0.2% ) 
==7838== LLd miss rate:   0.0% (   0.0%  +   0.0% ) 
==7838== 
==7838== LL refs:    594,417 (  294,701 rd +  299,716 wr) 
==7838== LL misses:   54,669 (  6,770 rd +  47,899 wr) 
==7838== LL miss rate:   0.0% (   0.0%  +   0.0% ) 

#define MEDIUMISH 
==8438== Cachegrind, a cache and branch-prediction profiler 
==8438== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al. 
==8438== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info 
==8438== Command: ./a.out 500 
==8438== 
--8438-- warning: L3 cache found, using its data for the LL simulation. 
--8438-- warning: pretending that LL cache has associativity 24 instead of actual 16 
Multiplied matrix A and B in 24.0327 seconds. 
==8438== 
==8438== I refs:  2,550,211,553 
==8438== I1 misses:   1,576 
==8438== LLi misses:   1,488 
==8438== I1 miss rate:   0.00% 
==8438== LLi miss rate:   0.00% 
==8438== 
==8438== D refs:  1,400,107,343 (1,267,610,303 rd + 132,497,040 wr) 
==8438== D1 misses:   339,977 (  42,583 rd +  297,394 wr) 
==8438== LLd misses:   53,114 (  5,248 rd +  47,866 wr) 
==8438== D1 miss rate:   0.0% (   0.0%  +   0.2% ) 
==8438== LLd miss rate:   0.0% (   0.0%  +   0.0% ) 
==8438== 
==8438== LL refs:    341,553 (  44,159 rd +  297,394 wr) 
==8438== LL misses:   54,602 (  6,736 rd +  47,866 wr) 
==8438== LL miss rate:   0.0% (   0.0%  +   0.0% )

Codice di moltiplicazione della matrice.

#if defined(SLOWEST) 
    void multiply (float **A, float **B, float **out, int size) { 
     for (int row=0;row<size;row++) 
      for (int col=0;col<size;col++) 
       for (int in=0;in<size;in++) 
        out[row][col] += A[row][in] * B[in][col]; 
    } 
// Takes in 1-D arrays, same as before. 
#elif defined(SLOWER) 
    void multiply (float *A, float *B, float *out, int size) { 
     for (int row=0;row<size;row++) 
      for (int col=0;col<size;col++) 
       for (int in=0;in<size;in++) 
        out[row * size + col] += A[row * size + in] * B[in * size + col]; 
    } 
// Flips first and second loops 
#elif defined(SLOW) 
    void multiply (float *A, float *B, float *out, int size) { 
     for (int col=0;col<size;col++) 
      for (int row=0;row<size;row++) { 
       float curr = 0; // prevents from calculating position each time through 
       for (int in=0;in<size;in++) 
        curr += A[row * size + in] * B[in *size + col]; 
       out[row * size + col] = curr; 
      } 
    } 
#elif defined(MEDIUM) 
    // Keeps it organized for future codes. 
    float dotProduct(float *A, float *B, int size) { 
     float curr = 0; 

     for (int i=0;i<size;i++) 
      curr += A[i] * B[i]; 

     return curr; 
    } 
    void multiply (float *A, float *B, float *out, int size) { 
     float *temp = new float[size]; 

     for (int col=0;col<size;col++) { 
      for (int i=0;i<size;i++) // stores column into sequential array 
       temp[i] = B[i * size + col]; 
      for (int row=0;row<size;row++) 
       out[row * size + col] = dotProduct(&A[row], temp, size); // uses function above for dot product. 
     } 

     delete[] temp; 
    } 
#elif defined(MEDIUMISH) 
    float dotProduct(float *A, float *B, int size) { 
     float curr = 0; 

     for (int i=0;i<size;i++) 
      curr += A[i] * B[i]; 

     return curr; 
    } 
    void multiply (float *A, float *B, float *out, int size) { 
     for (int i=0;i<size-1;i++) 
      for (int j=i+1;j<size;j++) 
       std::swap(B[i * size + j], B[j * size + i]); 

     for (int col=0;col<size;col++) 
      for (int row=0;row<size;row++) 
       out[row * size + col] = dotProduct(&A[row], &B[row], size); // uses function above for dot product. 
    } 
#elif defined(FAST) 

#elif defined(FASTER) 

#endif

fonte

2013-11-24 Kevin Melkowski

Secondo il documentation Cachegrind simulare solo la prima e le cache ultimo livello:

Cachegrind simula come il programma interagisce con la cache gerarchia di una macchina e (opzionalmente) predizione delle diramazioni. Simula una macchina con istruzioni di primo livello indipendenti e cache di dati (I1 e D1), supportate da una cache di secondo livello unificata (L2). Questo corrisponde esattamente alla configurazione di molte macchine moderne.

Tuttavia, alcune macchine moderne hanno tre o quattro livelli di cache. Per queste macchine (nei casi in cui Cachegrind può rilevare automaticamente la configurazione della cache ) Cachegrind simula le cache di primo livello e di ultimo livello. Il motivo di questa scelta è che la cache di ultimo livello ha la maggiore influenza sul runtime, poiché maschera gli accessi alla memoria principale . Inoltre, le cache L1 hanno spesso un'associatività bassa, pertanto la simulazione di esse può rilevare casi in cui il codice interagisce male con la cache (ad esempio attraversando una colonna in base alla lunghezza della riga con una potenza di 2).

Ciò significa che non è possibile ottenere informazioni L2 ma solo L1 e L3 nel proprio caso.

La prima parte dell'output di cachegrind riporta informazioni sulla cache delle istruzioni L1. In tutto il tuo esempio, il numero di mancate missioni di istruzioni L1 è insignificante, il tasso di errore è sempre 0%. Significa che tutti i tuoi programmi si adattano alla cache delle istruzioni L1.

La seconda parte dell'output riporta informazioni sulle cache di dati L1 e LL (cache di ultimo livello, L3 nel tuo caso). Utilizzando il tasso di perdere D1: informazioni si dovrebbe vedere quale versione del vostro algoritmo di moltiplicazione di matrici è "il più cache di efficienza"

La parte finale del summs uscita cachegrind le informazioni su LL (cache ultimo livello, L3 nel tuo caso) per entrambe le istruzioni e i dati. Fornisce quindi il numero di accessi alla memoria e la percentuale di richieste di memoria servite dalla cache.

fonte

2014-01-13 07:57:54

Come interpreti l'output di cachegrind per i problemi di memorizzazione nella cache?

risposta

Problemi correlati