Java Lucene NGramTokenizer

I am trying to convert strings into n-grams. Strangely, in the documentation for NGramTokenizer I do not see a method that returns the individual n-grams that were tokenized. In fact, I only see two methods in the NGramTokenizer class that return String objects.

Here is the code that I have:

Reader reader = new StringReader("This is a test string"); 
NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 1, 3);

Where are the n-grams that were tokenized?
How can I get the output as Strings/words?

I want my output to be like: This, is, a, test, string, This is, is a, a test, test string, This is a, is a test, a test string.


2012-11-17 CodeKingPlusPlus

I don't think you'll find what you're looking for by searching for methods that return String. You'll need to deal with Attributes.

Something like this should work:

Reader reader = new StringReader("This is a test string"); 
NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 1, 3); 
CharTermAttribute charTermAttribute = gramTokenizer.addAttribute(CharTermAttribute.class); 
gramTokenizer.reset(); 

while (gramTokenizer.incrementToken()) { 
    String token = charTermAttribute.toString(); 
    //Do something 
} 
gramTokenizer.end(); 
gramTokenizer.close();

Remember to reset() the Tokenizer if it needs to be reused after that, though.

Creating tokens that group words, rather than characters, per the comments:

Reader reader = new StringReader("This is a test string"); 
TokenStream tokenizer = new StandardTokenizer(Version.LUCENE_36, reader); 
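// minShingleSize must be at least 2; single words are still emitted because outputUnigrams defaults to true
tokenizer = new ShingleFilter(tokenizer, 2, 3); 
CharTermAttribute charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class); 
tokenizer.reset(); 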

while (tokenizer.incrementToken()) { 
    String token = charTermAttribute.toString(); 
    //Do something 
}
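
To make the expected result concrete without a Lucene dependency, here is a minimal plain-Java sketch of word shingles of 1 to 3 words. It is not how ShingleFilter is implemented; the class name WordShingles, the method shingles, and the whitespace-only splitting are assumptions made for illustration (StandardTokenizer also deals with punctuation, which this sketch does not).

```java
import java.util.ArrayList;
import java.util.List;

final class WordShingles {

    // Hypothetical helper: returns every run of minSize..maxSize consecutive
    // words, grouped by the word the run starts at.
    static List<String> shingles(String text, int minSize, int maxSize) {
        List<String> result = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        // Assumption: words are separated by whitespace only.
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            StringBuilder sb = new StringBuilder();
            // Grow the shingle one word at a time, up to maxSize words.
            for (int size = 1; size <= maxSize && start + size <= words.length; size++) {
                if (size > 1) {
                    sb.append(' ');
                }
                sb.append(words[start + size - 1]);
                if (size >= minSize) {
                    result.add(sb.toString());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String shingle : shingles("This is a test string", 1, 3)) {
            System.out.println(shingle);
        }
    }
}
```

For "This is a test string" this yields the twelve strings listed in the question, grouped by starting word rather than by length.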


2012-11-20 23:06:32 femtoRgon

What can I do to work with strings, instead of characters, in terms of attributes? So my result would be something like: This, is, a, test, string, This is, is a, ... a test string. – CodeKingPlusPlus

OK, that's not what Lucene's NGramTokenizer is designed to handle. What you'd want to use, I believe, is a ShingleFilter combined with a StandardTokenizer. I'll update my answer, it's easier to express there ... – femtoRgon

Do you know of any stop-word filters that I can use in the tokenization process? – CodeKingPlusPlus

Without writing a test program, I would guess that incrementToken() returns the next token, which will be one of the n-grams.

For example, using n-gram lengths 1-3 with the string 'a b c d', NGramTokenizer might return:

a 
a b 
a b c 
b 
b c 
b c d 
c 
c d 
d

where 'a', 'a b', etc. are the resulting n-grams.
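
Note that NGramTokenizer builds its grams from characters, not words (see femtoRgon's comment above), so spaces in the input end up inside the grams. The plain-Java sketch below illustrates character n-grams of length 1 to 3; the class CharNGrams and the method ngrams are hypothetical names, and the order in which Lucene itself emits the grams may differ from the order used here.

```java
import java.util.ArrayList;
import java.util.List;

final class CharNGrams {

    // Hypothetical helper: returns every substring of minGram..maxGram
    // characters, grouped by start position. Spaces count as characters.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> result = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= text.length(); len++) {
                result.add(text.substring(start, start + len));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Brackets make the grams that contain spaces visible.
        for (String gram : ngrams("a b", 1, 3)) {
            System.out.println("[" + gram + "]");
        }
    }
}
```

For the input "a b" this gives "a", "a ", "a b", " ", " b" and "b", which is why word-level output needs a different tool.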

[Edit]

You might also want to look at Querying lucene tokens without indexing, as it talks about peeking into the token stream.


2012-11-20 22:33:04

The problem is that incrementToken() returns a boolean ... – CodeKingPlusPlus

For a recent version of Lucene (4.2.1), this is clean code that works. Before running this code, you need to add 2 jar files to your classpath:

lucene-core-4.2.1.jar
lucene-analyzers-common-4.2.1.jar

You can find these files at http://www.apache.org/dyn/closer.cgi/lucene/java/4.2.1

//LUCENE 4.2.1 
Reader reader = new StringReader("This is a test string");  
NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 1, 3); 

CharTermAttribute charTermAttribute = gramTokenizer.addAttribute(CharTermAttribute.class); 

while (gramTokenizer.incrementToken()) { 
    String token = charTermAttribute.toString(); 
    System.out.println(token); 
}


2013-04-25 05:12:48 Amir

package ngramalgoimpl;

import java.util.*;

public class NGR {

public static List<String> n_grams(int n, String str) { 
    List<String> n_grams = new ArrayList<String>(); 
    String[] words = str.split(" "); 
    for (int i = 0; i < words.length - n + 1; i++) 
     n_grams.add(concatination(words, i, i+n)); 
    return n_grams; 
} 
/* StringBuilder is used to join words[start..end) into one space-separated string */ 
public static String concatination(String[] words, int start, int end) { 
    StringBuilder sb = new StringBuilder(); 
    for (int i = start; i < end; i++) 
     sb.append((i > start ? " " : "") + words[i]); 
    return sb.toString(); 
} 

public static void main(String[] args) { 
    for (int n = 1; n <= 3; n++) { 
     for (String ngram : n_grams(n, "This is my car.")) 
      System.out.println(ngram); 
     System.out.println(); 
    } 
}

}


2017-09-19 12:29:26

Please provide context: what does this code do, and how does it answer the question? –

@KevinKloet view the question and provide an answer –
