2011-06-15 13 views
9

Detecting dips in a 2D plot

I need to automatically detect dips in a 2D plot, like the regions marked with red circles in the figure below. I am only interested in the "main" dips, meaning that a dip has to span a minimum length along the x axis. The number of dips is unknown, i.e. different plots will contain different numbers of dips. Any ideas?

Dips in a 2D plot

Update:

As requested, here is the sample data, together with an attempt to smooth it using median filtering, as suggested by vines.

It looks like I now need a robust way to approximate the derivative at each point, one that ignores the small blips that remain in the data. Is there a standard approach?

y <- c(0.9943,0.9917,0.9879,0.9831,0.9553,0.9316,0.9208,0.9119,0.8857,0.7951,0.7605,0.8074,0.7342,0.6374,0.6035,0.5331,0.4781,0.4825,0.4825,0.4879,0.5374,0.4600,0.3668,0.3456,0.4282,0.3578,0.3630,0.3399,0.3578,0.4116,0.3762,0.3668,0.4420,0.4749,0.4556,0.4458,0.5084,0.5043,0.5043,0.5331,0.4781,0.5623,0.6604,0.5900,0.5084,0.5802,0.5802,0.6174,0.6124,0.6374,0.6827,0.6906,0.7034,0.7418,0.7817,0.8311,0.8001,0.7912,0.7912,0.7540,0.7951,0.7817,0.7644,0.7912,0.8311,0.8311,0.7912,0.7688,0.7418,0.7232,0.7147,0.6906,0.6715,0.6681,0.6374,0.6516,0.6650,0.6604,0.6124,0.6334,0.6374,0.5514,0.5514,0.5412,0.5514,0.5374,0.5473,0.4825,0.5084,0.5126,0.5229,0.5126,0.5043,0.4379,0.4781,0.4600,0.4781,0.3806,0.4078,0.3096,0.3263,0.3399,0.3184,0.2820,0.2167,0.2122,0.2080,0.2558,0.2255,0.1921,0.1766,0.1732,0.1205,0.1732,0.0723,0.0701,0.0405,0.0643,0.0771,0.1018,0.0587,0.0884,0.0884,0.1240,0.1088,0.0554,0.0607,0.0441,0.0387,0.0490,0.0478,0.0231,0.0414,0.0297,0.0701,0.0502,0.0567,0.0405,0.0363,0.0464,0.0701,0.0832,0.0991,0.1322,0.1998,0.3146,0.3146,0.3184,0.3578,0.3311,0.3184,0.4203,0.3578,0.3578,0.3578,0.4282,0.5084,0.5802,0.5667,0.5473,0.5514,0.5331,0.4749,0.4037,0.4116,0.4203,0.3184,0.4037,0.4037,0.4282,0.4513,0.4749,0.4116,0.4825,0.4918,0.4879,0.4918,0.4825,0.4245,0.4333,0.4651,0.4879,0.5412,0.5802,0.5126,0.4458,0.5374,0.4600,0.4600,0.4600,0.4600,0.3992,0.4879,0.4282,0.4333,0.3668,0.3005,0.3096,0.3847,0.3939,0.3630,0.3359,0.2292,0.2292,0.2748,0.3399,0.2963,0.2963,0.2385,0.2531,0.1805,0.2531,0.2786,0.3456,0.3399,0.3491,0.4037,0.3885,0.3806,0.2748,0.2700,0.2657,0.2963,0.2865,0.2167,0.2080,0.1844,0.2041,0.1602,0.1416,0.2041,0.1958,0.1018,0.0744,0.0677,0.0909,0.0789,0.0723,0.0660,0.1322,0.1532,0.1060,0.1018,0.1060,0.1150,0.0789,0.1266,0.0965,0.1732,0.1766,0.1766,0.1805,0.2820,0.3096,0.2602,0.2080,0.2333,0.2385,0.2385,0.2432,0.1602,0.2122,0.2385,0.2333,0.2558,0.2432,0.2292,0.2209,0.2483,0.2531,0.2432,0.2432,0.2432,0.2432,0.3053,0.3630,0.3578,0.3630,0.3668,0.3263,0.3992,0.4037,0.4556,0.4703,0.5173,0.6219,0.6412,0.7275,0.6984,0.6756,0.7079,0.7192,0.7342,0.7458,0.7501,0.7540,0.7605,0.7605,0.7342,0.7912,0.7951,0.8036,0.8074,0.8074,0.8118,0.7951,0.8118,0.8242,0.8488,0.8650,0.8488,0.8311,0.8424,0.7912,0.7951,0.8001,0.8001,0.7458,0.7192,0.6984,0.6412,0.6516,0.5900,0.5802,0.5802,0.5762,0.5623,0.5374,0.4556,0.4556,0.4333,0.3762,0.3456,0.4037,0.3311,0.3263,0.3311,0.3717,0.3762,0.3717,0.3668,0.3491,0.4203,0.4037,0.4149,0.4037,0.3992,0.4078,0.4651,0.4967,0.5229,0.5802,0.5802,0.5846,0.6293,0.6412,0.6374,0.6604,0.7317,0.7034,0.7573,0.7573,0.7573,0.7772,0.7605,0.8036,0.7951,0.7817,0.7869,0.7724,0.7869,0.7869,0.7951,0.7644,0.7912,0.7275,0.7342,0.7275,0.6984,0.7342,0.7605,0.7418,0.7418,0.7275,0.7573,0.7724,0.8118,0.8521,0.8823,0.8984,0.9119,0.9316,0.9512)

yy <- runmed(y, 41) 
plot(y, type="l", ylim=c(0,1), ylab="", xlab="", lwd=0.5) 
points(yy, col="blue", type="l", lwd=2) 

Median filtering

+0

I guess you could smooth the data a bit and then use this: http://stackoverflow.com/questions/6324354/ –

+2

Sample data would have been nice... –

+1

@Joris I have added the data used to generate the plot. Thanks for pointing that out. – Leo

Answers

6

EDIT: the function now trims the regions so that they contain only the lowest part, if desired.

Actually, using a running mean is easier than using a running median. It lets you find regions where the actual values stay below the running mean for a long stretch. The median is not smooth enough to be applied easily.
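To see the two smoothers side by side, here is a minimal sketch on synthetic data. The toy series, the window width `k` and the helper name `running_mean` are assumptions made for illustration; they are not part of the function in this answer.

```r
# Toy series with one broad dip plus noise (made-up data for illustration)
set.seed(1)
x <- 1 - 0.6 * exp(-((1:200 - 100) / 25)^2) + rnorm(200, sd = 0.03)

k <- 21  # odd window width (an assumed value)

# Centred running mean via stats::filter; the result is NA at both ends,
# where the window does not fit entirely inside the series
running_mean <- function(v, k) {
  as.numeric(stats::filter(v, rep(1 / k, k), sides = 2))
}

sm_mean   <- running_mean(x, k)
sm_median <- runmed(x, k)   # running median, no NAs with the default end rule

plot(x, type = "l", lwd = 0.5)
lines(sm_mean, col = "red", lwd = 2)
lines(sm_median, col = "blue", lwd = 2)
```

With the data from the question you would pass `y` instead of the toy series.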

An example function for finding such regions would be:

FindLowRegion <- function(x, n = length(x)/4, tol = length(x)/20, p = 0.5) {
    n <- 2*(n %/% 2) + 1                 # force an odd window width
    half <- n %/% 2
    # smooth with a centred running mean (NA padding at both ends)
    sx <- rowMeans(embed(c(rep(NA, half), x, rep(NA, half)), n), na.rm = TRUE)
    # runs where the data lie below the running mean
    rlesx <- rle((sx - x) > 0)
    # construct start and end of each run (each row: next start, this start)
    int <- embed(cumsum(c(1, rlesx$lengths)), 2)
    # which runs are below the mean and longer than tol
    id <- rlesx$values & rlesx$lengths > tol
    # trim each region to the part below its p-quantile
    regions <-
    apply(int[id, , drop = FALSE], 1, function(i) {
     # stop one short of the next run's start, so the last run
     # cannot index past the end of x
     i <- min(i):(max(i) - 1)
     tmp <- x[i]
     id <- which(tmp < quantile(tmp, p))
     id <- min(id):max(id)
     i[id]
    })
    # return the indices of all low regions
    unlist(regions)
}

where

  • n determines how many values are used to calculate the running mean,
  • tol determines how many consecutive values must be below the running mean to count as a low region, and
  • p determines the cutoff (as a quantile) used to trim the regions down to their lowest part. When p = 1, the complete low region is shown.

The function is tuned to work on the data as presented, but you may need to adjust the numbers a bit to make it work with other data.

The function returns a set of indices, which lets you locate the low regions. Illustrated with your y vector:

Lows <- FindLowRegion(y) 

newx <- seq_along(y) 
newy <- ifelse(newx %in% Lows,y,NA) 
plot(y, col="blue", type="l", lwd=2) 
lines(newx,newy,col="red",lwd=3) 

Gives:

enter image description here

+0

Statistical validity? – hadley

+1

@Hadley: about the same as loess() and friends. –

+1

In your statement 'newy <- ifelse(x %in% Lows, y, NA)', where does 'x' come from? Shouldn't it be 'newx'? –

3

You have to smooth the plot somehow. Median filtering is quite useful for that purpose (see http://en.wikipedia.org/wiki/Median_filter). After smoothing, you simply search for the minima as usual (i.e. look for the points where the first derivative changes from negative to positive).
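The search for minima can be sketched in R roughly as follows. The helper name `local_minima` and the toy series are assumptions for illustration; the first difference stands in for the derivative, and flat stretches inherit the previous sign so that a plateau at the bottom of a dip is reported once rather than twice.

```r
# Indices where the first difference changes from negative to positive,
# i.e. local minima of an (already smoothed) series without NAs
local_minima <- function(s) {
  d <- sign(diff(s))
  # carry the previous sign across flat stretches (zero differences)
  for (i in seq_along(d)[-1]) if (d[i] == 0) d[i] <- d[i - 1]
  which(diff(d) > 0) + 1
}

# Toy example: two smooth dips (synthetic data, not the series from the question)
s <- cos(seq(0, 4 * pi, length.out = 201))
local_minima(s)
```

For the data in the question, the call would be `local_minima(runmed(y, 41))`; any blips left after smoothing will still show up as extra minima.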

+0

Thanks for the suggestion. I have updated the question with median filtering. There is still some noise, so one problem remains: a robust approximation of the first derivative. – Leo

+0

@Leo: I know practically nothing about R... But speaking algorithmically, the first thing I'd try is a sliding window: when all the points in the window, except the leftmost and rightmost ones, are below the lower of the leftmost and rightmost, a dip has been found and the window is shifted by its full width at once; otherwise the window is shifted by a single step... – vines
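A rough R sketch of the sliding-window idea from this comment, under stated assumptions: the function name `find_dips_window` and the `width` parameter are hypothetical, and the window width has to be chosen to match the minimum dip length you care about.

```r
# Sliding-window dip search as described in the comment.
# `width` (window size in points) is an assumed tuning parameter.
find_dips_window <- function(x, width) {
  stopifnot(width >= 3, width <= length(x))
  dips <- list()
  start <- 1
  while (start + width - 1 <= length(x)) {
    end <- start + width - 1
    edge <- min(x[start], x[end])        # the lower of the two end points
    inner <- x[(start + 1):(end - 1)]    # everything strictly between them
    if (all(inner < edge)) {
      dips[[length(dips) + 1]] <- c(from = start, to = end)
      start <- start + width             # dip found: jump by a whole window
    } else {
      start <- start + 1                 # otherwise advance one step
    }
  }
  do.call(rbind, dips)                   # one row per dip, or NULL if none
}

# Toy series with a single dip (made-up data)
x <- c(5, 5, 5, 4, 2, 1, 2, 4, 5, 5, 5)
find_dips_window(x, width = 7)   # one row: from = 3, to = 9
```

On noisy data a single point sticking up inside the window defeats the test, so this would probably be run on a smoothed series.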

+1

Mean smoothing is much more useful, as shown in my answer. –

0

My first thought was something much cruder than filtering. Why not look for large drops followed by sufficiently long stable periods?

span.b <- 20 
threshold.b <- 0.2 
dy.b <- c(rep(NA, span.b), diff(y, lag = span.b)) 
span.f <- 10 
threshold.f <- 0.05 
dy.f <- c(diff(y, lag = span.f), rep(NA, span.f)) 
down <- which(dy.b < -1 * threshold.b & abs(dy.f) < threshold.f) 
abline(v = down) 

The plot shows that it is not perfect, but it does not discard the outliers (I guess that depends on your view of the data).

1

A simpler answer (which also requires no smoothing) could be obtained by adapting the maxdrawdown() function from the tseries package. A drawdown is commonly defined as the retreat from the most recent maximum; here we want the opposite. Such a function could then be applied in a sliding window over the data, or to segmented data.

maxdrawdown <- function(x) { 
    if(NCOL(x) > 1) 
     stop("x is not a vector or univariate time series") 
    if(any(is.na(x))) 
     stop("NAs in x") 
    cmaxx <- cummax(x)-x 
    mdd <- max(cmaxx) 
    to <- which(mdd == cmaxx) 
    from <- double(NROW(to)) 
    for (i in 1:NROW(to)) 
     from[i] <- max(which(cmaxx[1:to[i]] == 0)) 
    return(list(maxdrawdown = mdd, from = from, to = to)) 
} 

So, instead of using cummax(), you would switch to cummin(), etc.
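One possible reading of that adaptation, as a hedged sketch: the name `max_drawup` is hypothetical (it is not a tseries function), and it measures the largest rise above the running minimum, mirroring the structure of maxdrawdown() above.

```r
# Hypothetical counterpart of maxdrawdown(): the largest rise from the
# most recent running minimum (the name max_drawup is an assumption)
max_drawup <- function(x) {
  if (NCOL(x) > 1) stop("x is not a vector or univariate time series")
  if (any(is.na(x))) stop("NAs in x")
  rise <- x - cummin(x)            # height above the running minimum
  mdu <- max(rise)
  to <- which(rise == mdu)         # where the largest rise is reached
  # for each end point, the last position that sat on the running minimum
  from <- vapply(to, function(k) max(which(rise[1:k] == 0)), integer(1))
  list(maxdrawup = mdu, from = from, to = to)
}

max_drawup(c(3, 2, 1, 2, 4, 3))   # maxdrawup = 3, from = 3, to = 5
```

Here `from` marks the bottom of the dip and `to` the top of the recovery that follows it.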

+0

I like simple answers. – Jubbles
