How to generate n-grams in Scala?

I am trying to code a dissociated press algorithm based on n-grams in Scala. How do I generate n-grams for a large file? For example, for a file containing "the bee is the bee of the bees".

First it has to pick a random n-gram. For example, "the bee".
Then it has to look for n-grams starting with the last (n-1) words. For example, "bee of".
It prints the last word of this n-gram. Then it repeats.

Can you please give me some hints on how to do this? Sorry for the inconvenience.
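The three steps described above can be sketched roughly as follows. This is only a minimal illustration under some assumptions, not a definitive implementation: the names `DissociatedPress`, `buildTable`, `generate` and `maxWords` are made up for this example, the whole word list is assumed to fit in memory, and generation simply stops when no n-gram starts with the last (n-1) words.

```scala
import scala.util.Random

object DissociatedPress {
  // Map each (n-1)-word prefix to the words that follow it somewhere in the text.
  def buildTable(words: Vector[String], n: Int): Map[Vector[String], Vector[String]] =
    words.sliding(n)
      .filter(_.size == n) // sliding yields one short window if there are fewer than n words
      .toVector
      .groupBy(_.init)     // key: the first n-1 words of each n-gram
      .map { case (prefix, grams) => prefix -> grams.map(_.last) }

  // Start from a random n-gram, then keep appending the last word of a random
  // n-gram that begins with the last n-1 words produced so far.
  def generate(words: Vector[String], n: Int, maxWords: Int, rnd: Random): Vector[String] = {
    require(n >= 2, "n must be at least 2")
    val grams = words.sliding(n).filter(_.size == n).toVector
    if (grams.isEmpty) Vector.empty
    else {
      val table = buildTable(words, n)
      @annotation.tailrec
      def loop(out: Vector[String]): Vector[String] =
        if (out.size >= maxWords) out
        else table.get(out.takeRight(n - 1)) match {
          case Some(next) => loop(out :+ next(rnd.nextInt(next.size)))
          case None       => out // dead end: nothing follows these n-1 words
        }
      loop(grams(rnd.nextInt(grams.size))).take(maxWords)
    }
  }

  def main(args: Array[String]): Unit = {
    val words = "the bee is the bee of the bees".split(" ").toVector
    // The output is random, so it differs from run to run.
    println(generate(words, 2, 10, new Random()).mkString(" "))
  }
}
```

Storing only the following words per prefix, rather than whole n-grams, keeps the lookup in the second step to a single map access.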

2011-11-24 user1002579

I don't know what an n-gram is. Are you just picking words at random? Or is there some logic to it? – santiagobasulto

@santiagobasulto Wikipedia is your friend: http://en.wikipedia.org/wiki/N-gram –

Is this by any chance related to http://stackoverflow.com/questions/8256830/how-to-make-string-sequence-in-scala? –

Your question could be a bit more specific, but here is my attempt.

val words = "the bee is the bee of the bees"
// sliding(2) yields every pair of adjacent words; join them with a space when printing
words.split(' ').sliding(2).foreach(p => println(p.mkString(" ")))

2011-11-24 15:08:46 peri4n

Note that this gives you only 2-grams. If you want n-grams, then n has to be parameterized. – tuxdna

You can try this, with n as a parameter:

val words = "the bee is the bee of the bees"
val w = words.split(" ")

val n = 4
// collect all 1-grams, then all 2-grams, and so on up to n-grams
val ngrams = (1 to n).flatMap(i => w.sliding(i).map(_.toList))
ngrams foreach println

Output:

List(the) 
List(bee) 
List(is) 
List(the) 
List(bee) 
List(of) 
List(the) 
List(bees) 
List(the, bee) 
List(bee, is) 
List(is, the) 
List(the, bee) 
List(bee, of) 
List(of, the) 
List(the, bees) 
List(the, bee, is) 
List(bee, is, the) 
List(is, the, bee) 
List(the, bee, of) 
List(bee, of, the) 
List(of, the, bees) 
List(the, bee, is, the) 
List(bee, is, the, bee) 
List(is, the, bee, of) 
List(the, bee, of, the) 
List(bee, of, the, bees)

2013-05-24 09:58:58 tuxdna

Here is a stream-based approach. It does not require too much memory while computing n-grams.

object ngramstream extends App {

  // Apply f to each element of the stream, one element at a time
  def process(st: Stream[Array[String]])(f: Array[String] => Unit): Stream[Array[String]] = st match {
    case x #:: xs =>
      f(x)
      process(xs)(f)
    case _ => Stream[Array[String]]()
  }

  def ngrams(n: Int, words: Array[String]) = {
    // exclude 1-grams
    (2 to n).map { i => words.sliding(i).toStream }
      .foldLeft(Stream[Array[String]]()) {
        (a, b) => a #::: b
      }
  }

  val words = "the bee is the bee of the bees"
  val n = 4
  val ngrams2 = ngrams(n, words.split(" "))

  process(ngrams2) { x =>
    println(x.toList)
  }

}

Output:

List(the, bee) 
List(bee, is) 
List(is, the) 
List(the, bee) 
List(bee, of) 
List(of, the) 
List(the, bees) 
List(the, bee, is) 
List(bee, is, the) 
List(is, the, bee) 
List(the, bee, of) 
List(bee, of, the) 
List(of, the, bees) 
List(the, bee, is, the) 
List(bee, is, the, bee) 
List(is, the, bee, of) 
List(the, bee, of, the) 
List(bee, of, the, bees)
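
As a variation that is not part of the original answer: a `Stream` memoizes its elements for as long as a reference to its head is kept, so for a really large file a plain `Iterator` may be the safer choice. The sketch below assumes whitespace-separated words, lets n-grams span line boundaries, and uses the made-up names `LazyNgrams` and `corpus.txt` purely for illustration.

```scala
import scala.io.Source

object LazyNgrams {
  // Lazily turn an iterator of lines into an iterator of n-grams.
  // Only about n words are buffered at any time.
  def ngrams(lines: Iterator[String], n: Int): Iterator[Seq[String]] =
    lines
      .flatMap(_.split("\\s+").iterator) // split each line into words
      .filter(_.nonEmpty)                // drop empty tokens from leading whitespace
      .sliding(n)
      .filter(_.size == n)               // skip the short window produced when there are fewer than n words
      .map(_.toList)

  def main(args: Array[String]): Unit = {
    val path = args.headOption.getOrElse("corpus.txt") // hypothetical file name
    val source = Source.fromFile(path)
    try ngrams(source.getLines(), 3).foreach(g => println(g.mkString(" ")))
    finally source.close()
  }
}
```

Because the result is an `Iterator`, it can be traversed only once; call `ngrams` again to make another pass over the file.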

2013-12-17 12:48:58 tuxdna

I like it, but I'm not sure what 'process' is useful for. Why not just do 'ngrams(...).foreach(x => println(x.toList))'? – Mortimer

@Mortimer: good question. 'process' is just an extra function. We can certainly use 'ngrams2 foreach { x => println(x.toList) }'. Thanks :-) – tuxdna
