R: fastest way to extract all substrings contained between two substrings

I am looking for an efficient way to extract all matches between two substrings in a character string. For example, say I want to extract all substrings contained between the strings

start="strt"

stop="stp" 
in the string
x="strt111stpblablastrt222stp"

I would like to get the vector

"111" "222"

What is the most efficient way to do this in R? Using a regular expression, perhaps? Or are there better ways?

source

2014-07-16 Tom Wenseleers

For something simple like this, base R handles it just fine.

You can switch on PCRE by using perl=T and then use lookaround assertions.

x <- 'strt111stpblablastrt222stp' 
regmatches(x, gregexpr('(?<=strt).*?(?=stp)', x, perl=T))[[1]] 
# [1] "111" "222"

Explanation:

(?<=   # look behind to see if there is: 
    strt  # 'strt' 
)    # end of look-behind 
.*?   # any character except \n (0 or more times, as few as possible)
(?=   # look ahead to see if there is: 
    stp   # 'stp' 
)    # end of look-ahead
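
To see why the lazy quantifier matters in the pattern above, here is a minimal sketch (base R only) that runs the same lookarounds once with .*? and once with the greedy .* on the example string. The names lazy and greedy are only illustrative.

```r
x <- "strt111stpblablastrt222stp"

# Lazy: stop at the first 'stp' that follows each 'strt'
lazy <- regmatches(x, gregexpr("(?<=strt).*?(?=stp)", x, perl = TRUE))[[1]]

# Greedy: run on to the last 'stp' in the string
greedy <- regmatches(x, gregexpr("(?<=strt).*(?=stp)", x, perl = TRUE))[[1]]

print(lazy)
print(greedy)
```

The greedy version swallows everything up to the final stp, so it returns one long match instead of the two short ones.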

EDIT: Updated the answers below to follow the new syntax.

You may also consider using the stringi package.

library(stringi) 
x <- 'strt111stpblablastrt222stp' 
stri_extract_all_regex(x, '(?<=strt).*?(?=stp)')[[1]] 
# [1] "111" "222"

And rm_between from the qdapRegex package.

library(qdapRegex) 
x <- 'strt111stpblablastrt222stp' 
rm_between(x, 'strt', 'stp', extract=TRUE)[[1]] 
# [1] "111" "222"
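
If the delimiters are not fixed in advance, the same lookaround idea can be wrapped in a small base R function. This is only a sketch: extract_between is a hypothetical helper name, not a function from base R, stringi or qdapRegex. It assumes the delimiters do not themselves contain the sequence \E, because it relies on PCRE's \Q...\E quoting to treat them as literal text.

```r
# Hypothetical helper, not part of any package mentioned in this thread.
extract_between <- function(x, start, stop) {
  # \Q...\E makes PCRE read the delimiters literally, so characters
  # such as '[' or '.' need no manual escaping.
  pattern <- paste0("(?<=\\Q", start, "\\E).*?(?=\\Q", stop, "\\E)")
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

res_words    <- extract_between("strt111stpblablastrt222stp", "strt", "stp")[[1]]
res_brackets <- extract_between("a[1]b[2]", "[", "]")[[1]]
print(res_words)
print(res_brackets)
```

Like regmatches itself, the helper returns a list with one element per input string, so it can be applied to a whole character vector.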

source

2014-07-16 06:43:47 hwnd

Many thanks, this is perfect, and thanks for the very nice explanation! –

@TomWenseleers you're welcome. – hwnd

+1, for completeness I'll mention that 'strt\K' could replace the '(?<=strt)' (nothing wrong with it, just another option) – zx81
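
A minimal sketch of the \K variant mentioned in the comment above, in base R with perl = TRUE. \K tells PCRE to drop everything matched so far from the reported match, so strt is consumed but not returned. The names res_k and res_lb are only illustrative.

```r
x <- "strt111stpblablastrt222stp"

# \K variant: match 'strt', then discard it from the reported match
res_k <- regmatches(x, gregexpr("strt\\K.*?(?=stp)", x, perl = TRUE))[[1]]

# Lookbehind variant from the answer, for comparison
res_lb <- regmatches(x, gregexpr("(?<=strt).*?(?=stp)", x, perl = TRUE))[[1]]

print(res_k)
print(identical(res_k, res_lb))
```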

Since there may be several start/stop strings per input, I think a regex will be the most efficient solution:

(?<=strt)(?:(?!stp).)*

will match everything after strt until the end of the string or stp, whichever occurs first. If you want to assert that there is always a stp, add (?=stp) to the end of the regex. You can even apply this regex to a vector.

regmatches(subject, gregexpr("(?<=strt)(?:(?!stp).)*", subject, perl=TRUE))
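
As a small illustration of the two variants described above, the following sketch applies both regexes to a vector whose second element has no closing stp. The vector subject and the names open_ended and strict are assumptions made up for this example.

```r
subject <- c("strt111stpblablastrt222stp", "strtabc")

# Without the final lookahead: match up to 'stp' or the end of the string
open_ended <- regmatches(
  subject,
  gregexpr("(?<=strt)(?:(?!stp).)*", subject, perl = TRUE)
)

# With (?=stp) appended: only matches that are really followed by 'stp'
strict <- regmatches(
  subject,
  gregexpr("(?<=strt)(?:(?!stp).)*(?=stp)", subject, perl = TRUE)
)

print(open_ended)
print(strict)
```

The open-ended pattern still returns the tail of the second string, while the strict one returns an empty character vector for it.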

source

2014-07-16 06:43:34

You may also consider:

library(qdap) 
unname(genXtract(x, "strt", "stp")) 
#[1] "111" "222"

Speed comparison

x1 <- rep(x,1e5) 
system.time(res1 <- regmatches(x1,gregexpr('(?<=strt).*?(?=stp)',x1,perl=T))) 
# user system elapsed 
# 2.187 0.000 2.015 

system.time(res2 <- regmatches(x1, gregexpr("(?<=strt)(?:(?!stp).)*", x1, perl=TRUE))) 
#user system elapsed 
# 1.902 0.000 1.780 

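library(stringr)  # needed for str_extract_all(); perl() was deprecated in stringr 1.0.0 in favour of regex()
system.time(res3 <- str_extract_all(x1, perl('(?<=strt).*?(?=stp)')))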
# user system elapsed 
# 6.990 0.000 6.636 

system.time(res4 <- genXtract(x1, "strt", "stp")) ##setNames(genXtract(...), NULL) is a bit slower 
# user system elapsed 
# 1.457 0.000 1.414 

names(res4) <- NULL 
identical(res1,res4) 
#[1] TRUE
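
The comparison above only checks res1 against res4. A minimal sketch of the same sanity check for the two base R regexes, so that it runs without qdap or stringr, could look like this. The input is deliberately smaller than the 1e5 repetitions used above, and the names res_lazy, res_tempered and same are only illustrative.

```r
x  <- "strt111stpblablastrt222stp"
x1 <- rep(x, 1e3)  # smaller than the 1e5 used above, to keep the check quick

res_lazy     <- regmatches(x1, gregexpr("(?<=strt).*?(?=stp)", x1, perl = TRUE))
res_tempered <- regmatches(x1, gregexpr("(?<=strt)(?:(?!stp).)*", x1, perl = TRUE))

# Timings are only comparable if both approaches return the same result
same <- identical(res_lazy, res_tempered)
print(same)
```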

source

2014-07-16 11:42:56 akrun

Thanks for the additional, slightly faster option, that's great!! –

If you are talking about speed in string searching, there is only one package for that: stringi

x <- "strt111stpblablastrt222stp" 
hwnd <- function(x1) regmatches(x1,gregexpr('(?<=strt).*?(?=stp)',x1,perl=T)) 
Tim <- function(x1) regmatches(x1, gregexpr("(?<=strt)(?:(?!stp).)*", x1, perl=TRUE)) 
stringr <- function(x1) str_extract_all(x1, perl('(?<=strt).*?(?=stp)')) 
akrun <- function(x1) genXtract(x1, "strt", "stp") 
stringi <- function(x1) stri_extract_all_regex(x1, perl('(?<=strt).*?(?=stp)')) 

library(stringi)         # stri_extract_all_regex()
library(stringr)         # str_extract_all(), perl()
library(microbenchmark)
microbenchmark(stringi(x), hwnd(x), Tim(x), stringr(x)) 
Unit: microseconds
       expr     min       lq  median       uq     max neval
 stringi(x)  46.778  58.1030  64.017  67.3485 123.398   100
    hwnd(x)  61.498  73.1095  79.084  85.5190 111.757   100
     Tim(x)  60.243  74.6830  80.755  86.3370 102.678   100
 stringr(x) 236.081 261.9425 272.115 279.6750 440.036   100

Unfortunately I could not test @akrun's solution because the qdap package throws some errors during installation. And his solution looks like the only one that might beat stringi...

source

2014-08-23 23:05:54 bartektartanus

I would expect 'genXtract' to be much slower (up to 10-20 times slower). It is built for flexibility and ease of use. In many cases a researcher's time is more valuable than computing time. If that is the case, 'genXtract' is an excellent choice. If you are after speed, I, like you, am a big fan of 'stringi'. –

I'm more than a fan of 'stringi' - I'm an author :) – bartektartanus
