Measuring how slow an atomic increment is compared to a regular integer increment

A recent discussion made me wonder how expensive an atomic increment is compared to a regular integer increment.

I wrote the following code to try to benchmark this:

#include <iostream>
#include <atomic>
#include <chrono>
#include <cstdlib>  // atoi

static const int NUM_TEST_RUNS = 100000; 
static const int ARRAY_SIZE = 500; 

void runBenchmark(std::atomic<int>& atomic_count, int* count_array, int array_size, bool do_atomic_increment){  
    for(int i = 0; i < array_size; ++i){ 
     ++count_array[i];   
    } 

    if(do_atomic_increment){ 
     ++atomic_count; 
    } 
} 

int main(int argc, char* argv[]){ 

    int num_test_runs = NUM_TEST_RUNS; 
    int array_size = ARRAY_SIZE; 

    if(argc == 3){ 
     num_test_runs = atoi(argv[1]); 
     array_size = atoi(argv[2]);   
    } 

    if(num_test_runs == 0 || array_size == 0){ 
     std::cout << "Usage: atomic_operation_overhead <num_test_runs> <num_integers_in_array>" << std::endl; 
     return 1; 
    } 

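    // Instantiate the atomic counter, explicitly initialized to zero
    std::atomic<int> atomic_count(0);

    // Allocate the integer buffer that will be updated every time,
    // value-initialized so that every element starts at zero
    int* count_array = new int[array_size]();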

    // Track the time elapsed when each run also performs the atomic increment
    auto start = std::chrono::steady_clock::now(); 
    for(int i = 0; i < num_test_runs; ++i){ 
     runBenchmark(atomic_count, count_array, array_size, true);   
    } 
    auto end = std::chrono::steady_clock::now(); 

    // Calculate time elapsed for the runs with the atomic increment
    auto diff_with_lock = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start); 
    std::cout << "Elapsed time with atomic increment for " 
       << num_test_runs << " test runs: " 
       << diff_with_lock.count() << " ns" << std::endl; 

    // Track the time elapsed when the atomic increment is skipped
    start = std::chrono::steady_clock::now();
    for(int i = 0; i < num_test_runs; ++i){
     runBenchmark(atomic_count, count_array, array_size, false); 
    } 
    end = std::chrono::steady_clock::now(); 

    // Calculate time elapsed for the runs without the atomic increment
    auto diff_without_lock = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start); 
    std::cout << "Elapsed time without atomic increment for " 
       << num_test_runs << " test runs: " 
       << diff_without_lock.count() << " ns" << std::endl; 

    auto difference_running_times = diff_with_lock - diff_without_lock; 
    auto proportion = difference_running_times.count()/(double)diff_without_lock.count();   
    std::cout << "How much slower was locking: " << proportion * 100.0 << " %" << std::endl;   

    // We loop over all entries in the array and print their sum 
    // We do this mainly to prevent the compiler from optimizing out 
    // the loop where we increment all the values in the array 
    int array_sum = 0; 
    for(int i = 0; i < array_size; ++i){ 
     array_sum += count_array[i]; 
    } 
    std::cout << "Array sum (just to prevent loop getting optimized out): " << array_sum << std::endl; 

    delete [] count_array; 

    return 0; 
}

The problem I'm having is that this program produces widely divergent results on each run:

[email protected]:~/Projects/Misc$ ./atomic_operation_overhead 1000 500 
Elapsed time with atomic increment for 1000 test runs: 99852 ns 
Elapsed time without atomic increment for 1000 test runs: 96396 ns 
How much slower was locking: 3.58521 % 
[email protected]:~/Projects/Misc$ ./atomic_operation_overhead 1000 500 
Elapsed time with atomic increment for 1000 test runs: 182769 ns 
Elapsed time without atomic increment for 1000 test runs: 138319 ns 
How much slower was locking: 32.1359 % 
[email protected]:~/Projects/Misc$ ./atomic_operation_overhead 1000 500 
Elapsed time with atomic increment for 1000 test runs: 98858 ns 
Elapsed time without atomic increment for 1000 test runs: 96404 ns 
How much slower was locking: 2.54554 % 
[email protected]:~/Projects/Misc$ ./atomic_operation_overhead 1000 500 
Elapsed time with atomic increment for 1000 test runs: 107848 ns 
Elapsed time without atomic increment for 1000 test runs: 105174 ns 
How much slower was locking: 2.54245 % 
[email protected]:~/Projects/Misc$ ./atomic_operation_overhead 1000 500 
Elapsed time with atomic increment for 1000 test runs: 113865 ns 
Elapsed time without atomic increment for 1000 test runs: 100559 ns 
How much slower was locking: 13.232 % 
[email protected]:~/Projects/Misc$ ./atomic_operation_overhead 1000 500 
Elapsed time with atomic increment for 1000 test runs: 98956 ns 
Elapsed time without atomic increment for 1000 test runs: 106639 ns 
How much slower was locking: -7.20468 %

This leads me to believe there may be a bug in the benchmark code itself. Is there a mistake I'm missing? Is my use of std::chrono for benchmarking incorrect? Or is the time difference due to signal-handling overhead in the OS related to atomic operations?

What could I be doing wrong?

Test bed:

Intel® Core™ i7-4700MQ CPU @ 2.40GHz × 8 
8GB RAM 
GNU/Linux:Ubuntu LTS 14.04 (64 bit) 
GCC version: 4.8.4  
Compilation: g++ -std=c++11 -O3 atomic_operation_overhead.cpp -o atomic_operation_overhead

EDIT: Updated the test output after compiling with -O3 optimization.

EDIT: After running the tests with a larger number of iterations and adding a summation loop to keep the increment loop from being optimized away, as suggested by Adam, I get more convergent results:

[email protected]:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500 
Elapsed time with atomic increment for 99999999 test runs: 7111974931 ns 
Elapsed time without atomic increment for 99999999 test runs: 6938317779 ns 
How much slower was locking: 2.50287 % 
Array sum (just to prevent loop getting optimized out): 1215751192 
[email protected]:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500 
Elapsed time with atomic increment for 99999999 test runs: 7424952991 ns 
Elapsed time without atomic increment for 99999999 test runs: 7262721866 ns 
How much slower was locking: 2.23375 % 
Array sum (just to prevent loop getting optimized out): 1215751192 
[email protected]:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500 
Elapsed time with atomic increment for 99999999 test runs: 7172114343 ns 
Elapsed time without atomic increment for 99999999 test runs: 7030985219 ns 
How much slower was locking: 2.00725 % 
Array sum (just to prevent loop getting optimized out): 1215751192 
[email protected]:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500 
Elapsed time with atomic increment for 99999999 test runs: 7094552104 ns 
Elapsed time without atomic increment for 99999999 test runs: 6971060941 ns 
How much slower was locking: 1.77148 % 
Array sum (just to prevent loop getting optimized out): 1215751192 
[email protected]:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500 
Elapsed time with atomic increment for 99999999 test runs: 7099907902 ns 
Elapsed time without atomic increment for 99999999 test runs: 6970289856 ns 
How much slower was locking: 1.85958 % 
Array sum (just to prevent loop getting optimized out): 1215751192 
[email protected]:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500 
Elapsed time with atomic increment for 99999999 test runs: 7763604675 ns 
Elapsed time without atomic increment for 99999999 test runs: 7229145316 ns 
How much slower was locking: 7.39312 % 
Array sum (just to prevent loop getting optimized out): 1215751192 
[email protected]:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500 
Elapsed time with atomic increment for 99999999 test runs: 7164534212 ns 
Elapsed time without atomic increment for 99999999 test runs: 6994993609 ns 
How much slower was locking: 2.42374 % 
Array sum (just to prevent loop getting optimized out): 1215751192 
[email protected]:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500 
Elapsed time with atomic increment for 99999999 test runs: 7154697145 ns 
Elapsed time without atomic increment for 99999999 test runs: 6997030700 ns 
How much slower was locking: 2.25333 % 
Array sum (just to prevent loop getting optimized out): 1215751192

source

2015-09-21 balajeerc

How did you compile your code? With which optimization flags? Which compiler version? And you should repeat each run several times!

@BasileStarynkevitch Thanks for pointing that out. I've edited the post to add the compiler information. As for running multiple times, note that runBenchmark is already executed many times (the count is given as a command-line argument). – balajeerc

You forgot at least '-O1' or '-O2' (or even '-O3') in the compilation command. Benchmark code compiled without optimizations is useless.

Some thoughts:

Run more iterations, at least enough to take a few seconds. Your runs finish in milliseconds or less, so a single I/O interrupt could be enough to skew the results.
Print the sum at the end. The compiler may be smart enough to optimize your loops far more than you expect, so your code may be doing less work than you think. If the compiler sees that the value is never read, it may remove your loops entirely.
Do all the iteration in a single loop, rather than in a loop that calls a function. The compiler will probably inline your function call, but it's best not to introduce another potential source of noise.
I'm sure you'll do this next, but add a threaded test. You might as well do it for both variants; you'll get a wrong sum in the non-atomic variable because of races, but at least you'll see the performance penalty you pay for consistency.

source

2015-09-21 05:47:59 Adam

I added printing of the array sum and increased the number of iterations so that the program runs for a few seconds. I now get more reasonable results. I've updated the post to reflect these changes. Thanks a lot! – balajeerc

I'll add a multi-threaded test soon. – balajeerc
