How do I tokenize a string sentence in NLTK?

I am using nltk, and I want to create my own custom texts just like the default ones in nltk.book. However, so far I have only got as far as a method like:

my_text = ['This', 'is', 'my', 'text']

I would like to find a way to input my "text" as:

my_text = "This is my text, this is a nice way to input text."

Which method, from Python itself or from NLTK, allows me to do this? And more importantly, how can I dismiss punctuation symbols?

2013-02-24 diegoaguilar

Could you clarify what you mean by 'dismiss punctuation symbols'? – quetzalcoatl

I think he meant tokenizing the input sentence – alvas

Yes, for example if I did: sentence = "This is my sentence, a sentence is a short expression" then 'sentence,' and 'sentence' would be two different elements ... – diegoaguilar
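The behaviour described in this comment can be reproduced with plain str.split(), which splits on whitespace only and leaves punctuation attached to the neighbouring word. A minimal sketch (the variable names are illustrative):

```python
# Plain whitespace splitting keeps punctuation glued to the neighbouring word.
sentence = "This is my sentence, a sentence is a short expression"
tokens = sentence.split()

# 'sentence,' (with the comma) and 'sentence' end up as different elements.
print(tokens[3], tokens[5])
print(tokens[3] == tokens[5])
```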

This is actually on the main page of nltk.org:

>>> import nltk 
>>> sentence = """At eight o'clock on Thursday morning 
... Arthur didn't feel very good.""" 
>>> tokens = nltk.word_tokenize(sentence) 
>>> tokens 
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

2013-02-24 23:28:02

The problem is that it does not split on /. If you have "today and/or tomorrow are good days", it gives "and/or" as a single token by default. – thang
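One possible workaround, sketched here with the standard library's re module rather than with NLTK itself, is to treat every run of punctuation as its own token. The function name split_words_and_punct is made up for illustration; NLTK's wordpunct_tokenize follows a similar pattern-based idea.

```python
import re

def split_words_and_punct(text):
    """Return runs of word characters and runs of punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]+", text)

tokens = split_words_and_punct("today and/or tomorrow are good days")
print(tokens)
```

The trade-off is that contractions get split as well: with this pattern "didn't" becomes 'didn', "'", 't', so it suits text where that does not matter.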

How do I convert "n't" to "not"? – Omayr
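One way to do this is a small post-processing step, assuming the tokens come from a tokenizer that splits contractions the way the output above does. The mapping CLITIC_MAP and the helper name expand_negation are hypothetical and not part of NLTK.

```python
# Hypothetical mapping from clitic tokens to full words; extend as needed.
CLITIC_MAP = {"n't": "not"}

def expand_negation(tokens):
    """Replace clitic tokens such as "n't" with their full form."""
    return [CLITIC_MAP.get(tok, tok) for tok in tokens]

tokens = ['Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
expanded = expand_negation(tokens)
print(expanded)
```

Irregular stems would need extra handling: a tokenizer may split "can't" into "ca" and "n't", and this sketch would leave the "ca" untouched.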

As @PavelAnossov answered, the canonical answer is to use the word_tokenize function in NLTK:

from nltk import word_tokenize 
sent = "This is my text, this is a nice way to input text." 
word_tokenize(sent)
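To also drop the punctuation tokens, which is the second half of the question, the token list can be filtered afterwards. This sketch hard-codes the list that word_tokenize is expected to return for the sentence above, so that it runs without NLTK or its data files; the helper name drop_punctuation is illustrative.

```python
import string

def drop_punctuation(tokens):
    """Keep only tokens that contain at least one non-punctuation character."""
    return [t for t in tokens if not all(ch in string.punctuation for ch in t)]

# Assumed tokenizer output for:
# "This is my text, this is a nice way to input text."
tokens = ['This', 'is', 'my', 'text', ',', 'this', 'is', 'a', 'nice',
          'way', 'to', 'input', 'text', '.']
words = drop_punctuation(tokens)
print(words)
```

Checking every character rather than the whole token means a contraction piece such as "n't" is kept, while tokens made up only of punctuation are removed.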

If the sentence is really simple enough:

Using the string.punctuation set, remove the punctuation and then split on the whitespace delimiter:

import string
x = "This is my text, this is a nice way to input text."
# Drop every punctuation character, then split on single spaces.
y = "".join([i for i in x if i not in string.punctuation]).split(" ")
print(y)
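An equivalent approach, sketched with str.translate, removes the punctuation in one pass and calls split() without an argument so that repeated spaces do not produce empty strings. The lowercasing step is an optional extra added here for illustration.

```python
import string

x = "This is my text, this is a nice way to input text."

# Build a translation table that deletes every character in string.punctuation.
table = str.maketrans("", "", string.punctuation)
y = x.translate(table).split()
print(y)

# Optional: lowercase so that 'This' and 'this' count as the same word.
lowered = [w.lower() for w in y]
print(lowered)
```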

2013-03-01 07:48:29 alvas

Pavel's answer will handle cases such as 'didn't' -> 'did' + 'n't' – alvas
