Haskell/GHC per thread costa

Sto cercando di capire quanto sia costoso un thread (verde) in Haskell (GHC 7.10.1 su OS X 10.10.5). Sono consapevole del fatto che è molto economico rispetto a un thread OS reale, sia per l'utilizzo della memoria che per la CPU.Haskell/GHC per thread costa

destro, così ho iniziato a scrivere un programma super semplice con forche n (verde) discussioni (utilizzando l'eccellente libreria async) e quindi può ospitare solo ogni thread per m secondi.

Beh, che è abbastanza facile:

$ cat PerTheadMem.hs 
import Control.Concurrent (threadDelay) 
import Control.Concurrent.Async (mapConcurrently) 
import System.Environment (getArgs) 

main = do 
    args <- getArgs 
    let (numThreads, sleep) = case args of 
           numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int) 
           _ -> error "wrong args" 
    mapConcurrently (\_ -> threadDelay (sleep*1000*1000)) [1..numThreads]

e prima di tutto, cerchiamo di compilare ed eseguire esso:

$ ghc --version 
The Glorious Glasgow Haskell Compilation System, version 7.10.1 
$ ghc -rtsopts -O3 -prof -auto-all -caf-all PerTheadMem.hs 
$ time ./PerTheadMem 100000 10 +RTS -sstderr

che dovrebbe sborsare 100k discussioni e attendere 10s in ogni e poi ci stampare alcuni informazioni:

$ time ./PerTheadMem 100000 10 +RTS -sstderr 
340,942,368 bytes allocated in the heap 
880,767,000 bytes copied during GC 
164,702,328 bytes maximum residency (11 sample(s)) 
21,736,080 bytes maximum slop 
350 MB total memory in use (0 MB lost due to fragmentation) 

Tot time (elapsed) Avg pause Max pause 
Gen 0  648 colls,  0 par 0.373s 0.415s  0.0006s 0.0223s 
Gen 1  11 colls,  0 par 0.298s 0.431s  0.0392s 0.1535s 

INIT time 0.000s ( 0.000s elapsed) 
MUT  time 79.062s (92.803s elapsed) 
GC  time 0.670s ( 0.846s elapsed) 
RP  time 0.000s ( 0.000s elapsed) 
PROF time 0.000s ( 0.000s elapsed) 
EXIT time 0.065s ( 0.091s elapsed) 
Total time 79.798s (93.740s elapsed) 

%GC  time  0.8% (0.9% elapsed) 

Alloc rate 4,312,344 bytes per MUT second 

Productivity 99.2% of total user, 84.4% of total elapsed 


real 1m33.757s 
user 1m19.799s 
sys 0m2.260s

Ci sono voluti abbastanza lunghi (1m33.757s) dato che ogni thread dovrebbe solo aspetta solo 10 secondi, ma lo abbiamo costruito senza fili per cui ora è abbastanza giusto. Tutto sommato, abbiamo usato 350 MB, non male, quello è 3,5 KB per thread. Dato che la dimensione dello stack iniziale (-ki is 1 KB).

destro, ma ora cerchiamo di compilare è in modalità filettato e vedere se siamo in grado di ottenere più velocemente:

$ ghc -rtsopts -O3 -prof -auto-all -caf-all -threaded PerTheadMem.hs 
$ time ./PerTheadMem 100000 10 +RTS -sstderr 
3,996,165,664 bytes allocated in the heap 
2,294,502,968 bytes copied during GC 
3,443,038,400 bytes maximum residency (20 sample(s)) 
14,842,600 bytes maximum slop 
3657 MB total memory in use (0 MB lost due to fragmentation) 

Tot time (elapsed) Avg pause Max pause 
Gen 0  6435 colls,  0 par 0.860s 1.022s  0.0002s 0.0028s 
Gen 1  20 colls,  0 par 2.206s 2.740s  0.1370s 0.3874s 

TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1) 

SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled) 

INIT time 0.000s ( 0.001s elapsed) 
MUT  time 0.879s ( 8.534s elapsed) 
GC  time 3.066s ( 3.762s elapsed) 
RP  time 0.000s ( 0.000s elapsed) 
PROF time 0.000s ( 0.000s elapsed) 
EXIT time 0.074s ( 0.247s elapsed) 
Total time 4.021s (12.545s elapsed) 

Alloc rate 4,544,893,364 bytes per MUT second 

Productivity 23.7% of total user, 7.6% of total elapsed 

gc_alloc_block_sync: 0 
whitehole_spin: 0 
gen[0].sync: 0 
gen[1].sync: 0 

real 0m12.565s 
user 0m4.021s 
sys 0m1.154s

Wow, molto più veloce, a soli 12s ora, modo migliore. Da Activity Monitor ho visto che utilizzava approssimativamente 4 thread del sistema operativo per i thread 100k verdi, il che ha senso.

Tuttavia, 3657 MB di memoria totale! Questo è 10 volte superiore rispetto alla versione senza thread utilizzata ...

Fino ad ora, non ho effettuato alcuna profilazione utilizzando -prof o -hy o così. Per indagare un po 'di più ho poi fatto un po' di profilatura dell'heap (-hy) in separate esecuzioni. L'utilizzo della memoria non è cambiato in entrambi i casi, i grafici di profilatura dell'heap hanno un aspetto interessante (a sinistra: non a thread, a destra: a thread) ma non riesco a trovare il motivo della differenza 10x.

Diffusione dell'output di profilo (file .prof) Non riesco a trovare alcuna differenza reale.

Pertanto, la mia domanda: da dove viene la differenza di 10 volte nell'utilizzo della memoria?

EDIT: Solo per dirlo: la stessa differenza si applica quando il programma non è nemmeno compilato con il supporto di profilazione. Quindi l'esecuzione di time ./PerTheadMem 100000 10 +RTS -sstderr con ghc -rtsopts -threaded -fforce-recomp PerTheadMem.hs è 3559 MB. E con ghc -rtsopts -fforce-recomp PerTheadMem.hs è 395 MB.

EDIT 2: Su Linux (GHC 7.10.2 sul Linux 3.13.0-32-generiC#57-Ubuntu SMP, x86_64) lo stesso accade: non filettato 460 MB in 1m28.538s e filettato è 3483 MB è 12.604s. /usr/bin/time -v ... report Maximum resident set size (kbytes): 413684 e Maximum resident set size (kbytes): 1645384 rispettivamente.

EDIT 3: cambiato anche il programma da utilizzare forkIO direttamente:

import Control.Concurrent (threadDelay, forkIO) 
import Control.Concurrent.MVar 
import Control.Monad (mapM_) 
import System.Environment (getArgs) 

main = do 
    args <- getArgs 
    let (numThreads, sleep) = case args of 
           numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int) 
           _ -> error "wrong args" 
    mvar <- newEmptyMVar 
    mapM_ (\_ -> forkIO $ threadDelay (sleep*1000*1000) >> putMVar mvar()) 
      [1..numThreads] 
    mapM_ (\_ -> takeMVar mvar) [1..numThreads]

E non cambia nulla: non filettati: 152 MB, filettato: 3308 MB.

fonte

2015-10-15 Johannes Weiß

Mi chiedo quanti dettagli di overhead aggiungono. Sotto Linux, puoi convincere 'time' a generare statistiche sulla memoria. Cosa succede se si compila senza profiling e si chiede al sistema operativo le statistiche di memoria? – MathematicalOrchid

@MathematicalOrchid Ho eseguito quattro run in totale, 2 senza profiling (1 threaded/1 non-threaded), 2 con profiling. L'output '-sstderr' non è cambiato. Le immagini provengono dalle ultime due corse. Inoltre ho controllato l'utilizzo di mem in Activity Monitor e non sono riuscito a vedere un grande diff tra w/e w/o profiling. –

OK, vale la pena provare. Sono fuori di idee ora. : -} – MathematicalOrchid

IMHO, il colpevole è threadDelay. * threadDelay ** utilizza molta memoria. Ecco un programma equivalente al tuo che si comporta meglio con la memoria. Assicura che tutti i thread siano eseguiti contemporaneamente avendo un calcolo a lunga esecuzione.

uBound = 38 
lBound = 34 

doSomething :: Integer -> Integer 
doSomething 0 = 1 
doSomething 1 = 1 
doSomething n | n < uBound && n > 0 = let 
        a = doSomething (n-1) 
        b = doSomething (n-2) 
       in a `seq` b `seq` (a + b) 
       | otherwise = doSomething (n `mod` uBound) 

e :: Chan Integer -> Int -> IO() 
e mvar i = 
    do 
     let y = doSomething . fromIntegral $ lBound + (fromIntegral i `mod` (uBound - lBound)) 
     y `seq` writeChan mvar y 

main = 
    do 
     args <- getArgs 
     let (numThreads, sleep) = case args of 
            numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int) 
            _ -> error "wrong args" 
      dld = (sleep*1000*1000) 
     chan <- newChan 
     mapM_ (\i -> forkIO $ e chan i) [1..numThreads] 
     putStrLn "All threads created" 
     mapM_ (\_ -> readChan chan >>= putStrLn . show) [1..numThreads] 
     putStrLn "All read"

e qui ci sono le statistiche di temporizzazione:

$ ghc -rtsopts -O -threaded test.hs 
$ ./test 200 10 +RTS -sstderr -N4 

133,541,985,480 bytes allocated in the heap 
    176,531,576 bytes copied during GC 
     356,384 bytes maximum residency (16 sample(s)) 
      94,256 bytes maximum slop 
       4 MB total memory in use (0 MB lost due to fragmentation) 

            Tot time (elapsed) Avg pause Max pause 
    Gen 0  64246 colls, 64246 par 1.185s 0.901s  0.0000s 0.0274s 
    Gen 1  16 colls, 15 par 0.004s 0.002s  0.0001s 0.0002s 

    Parallel GC work balance: 65.96% (serial 0%, perfect 100%) 

    TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4) 

    SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled) 

    INIT time 0.000s ( 0.003s elapsed) 
    MUT  time 63.747s (16.333s elapsed) 
    GC  time 1.189s ( 0.903s elapsed) 
    EXIT time 0.001s ( 0.000s elapsed) 
    Total time 64.938s (17.239s elapsed) 

    Alloc rate 2,094,861,384 bytes per MUT second 

    Productivity 98.2% of total user, 369.8% of total elapsed 

gc_alloc_block_sync: 98548 
whitehole_spin: 0 
gen[0].sync: 0 
gen[1].sync: 2

residenza massima è di circa 1,5 kb per thread. Ho giocato un po 'con il numero di thread e la durata del calcolo. Poiché i thread iniziano a fare cose subito dopo forkIO, la creazione di 100000 thread richiede davvero molto tempo. Ma i risultati hanno tenuto per 1000 thread.

Ecco un altro programma in cui è stato threadDelay "fattorizzato", questo non fa uso di qualsiasi CPU e può essere eseguito facilmente con 100000 discussioni:

e :: MVar() -> MVar() -> IO() 
e start end = 
    do 
     takeMVar start 
     putMVar end() 

main = 
    do 
     args <- getArgs 
     let (numThreads, sleep) = case args of 
            numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int) 
            _ -> error "wrong args" 
     starts <- mapM (const newEmptyMVar) [1..numThreads] 
     ends <- mapM (const newEmptyMVar) [1..numThreads] 
     mapM_ (\ (start,end) -> forkIO $ e start end) (zip starts ends) 
     mapM_ (\ start -> putMVar start()) starts 
     putStrLn "All threads created" 
     threadDelay (sleep * 1000 * 1000) 
     mapM_ (\ end -> takeMVar end) ends 
     putStrLn "All done"

E i risultati:

 129,270,632 bytes allocated in the heap 
    404,154,872 bytes copied during GC 
     77,844,160 bytes maximum residency (10 sample(s)) 
     10,929,688 bytes maximum slop 
      165 MB total memory in use (0 MB lost due to fragmentation) 

            Tot time (elapsed) Avg pause Max pause 
    Gen 0  128 colls, 128 par 0.178s 0.079s  0.0006s 0.0152s 
    Gen 1  10 colls,  9 par 0.367s 0.137s  0.0137s 0.0325s 

    Parallel GC work balance: 50.09% (serial 0%, perfect 100%) 

    TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4) 

    SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled) 

    INIT time 0.000s ( 0.001s elapsed) 
    MUT  time 0.189s (10.094s elapsed) 
    GC  time 0.545s ( 0.217s elapsed) 
    EXIT time 0.001s ( 0.002s elapsed) 
    Total time 0.735s (10.313s elapsed) 

    Alloc rate 685,509,460 bytes per MUT second 

    Productivity 25.9% of total user, 1.8% of total elapsed

Sul mio i5, ci vuole meno di un secondo per creare i 100000 thread e mettere il mvar "start". La residenza di picco è di circa 778 byte per filo, non male affatto!

verificare la realizzazione di threadDelay, vediamo che esso è effettivamente diverso per il caso filettato e svitato:

https://hackage.haskell.org/package/base-4.8.1.0/docs/src/GHC.Conc.IO.html#threadDelay

Poi qui: https://hackage.haskell.org/package/base-4.8.1.0/docs/src/GHC.Event.TimerManager.html

che sembra abbastanza innocente. Ma una vecchia versione di base ha un'ortografia arcano (memoria) castigo per coloro che invocano threadDelay:

https://hackage.haskell.org/package/base-4.4.0.0/docs/src/GHC-Event-Manager.html#line-121

Se c'è ancora un problema o no, è difficile da dire. Tuttavia, si può sempre sperare che un programma concorrente di "vita reale" non abbia bisogno di avere troppi thread in attesa su threadDelay allo stesso tempo. D'ora in poi terremo d'occhio il mio utilizzo di threadDelay.

fonte

2015-10-16 05:46:53 dsign

WOW! Posso confermare, ho appena cambiato il mio programma per usare 'MVar's e le nuove figure sono: 221 MB senza thread e 282 MB threaded. Mai pensato mai che 'threadDelay' potrebbe essere un problema. Grazie mille. –

Haskell/GHC per thread costa

risposta

Problemi correlati