NLTK word_tokenize behaviour for double quotation marks is confusing

>>> import nltk
>>> nltk.__version__ 
'3.0.4' 
>>> nltk.word_tokenize('"') 
['``'] 
>>> nltk.word_tokenize('""') 
['``', '``'] 
>>> nltk.word_tokenize('"A"') 
['``', 'A', "''"]

See how it changes " into a double `` and ''?

What is happening here? Why is it changing the character? Is there a fix? I need to search for each token in the string later.

Python 2.7.6, if it makes any difference.

2015-08-24 Motasim

It helps avoid errors (how would you correctly escape '"'?). If you want to change it, you can update [the source](http://www.nltk.org/_modules/nltk/tokenize/punkt.html#PunktLanguageVars.word_tokenize). But you can also replace the unwanted characters in your token list... – clemtoy

TL;DR:

nltk.word_tokenize changes starting quotes from " -> `` and ending quotes from " -> ''.

In long:

First, nltk.word_tokenize tokenizes the way the Penn Treebank was tokenized; it comes from nltk.tokenize.treebank, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L91 and https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L23

class TreebankWordTokenizer(TokenizerI): 
    """ 
    The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank. 
    This is the method that is invoked by ``word_tokenize()``. It assumes that the 
    text has already been segmented into sentences, e.g. using ``sent_tokenize()``.

Then comes a list of regex substitutions for contractions at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L48, which comes from "Robert MacIntyre's tokenizer", i.e. https://www.cis.upenn.edu/~treebank/tokenizer.sed

The contraction rules split words like "gonna", "wanna", etc.:

>>> from nltk import word_tokenize 
>>> word_tokenize("I wanna go home") 
['I', 'wan', 'na', 'go', 'home'] 
>>> word_tokenize("I gonna go home") 
['I', 'gon', 'na', 'go', 'home']
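
To make the contraction step concrete, here is a minimal sketch that uses only `re`. The two patterns and the helper name `split_contractions` are illustrative assumptions modeled on the behaviour shown above; the actual list in treebank.py is longer and may differ in detail.

```python
import re

# Illustrative patterns only (an assumption, not copied from NLTK):
# each one captures the two halves of a contraction.
CONTRACTION_PATTERNS = [
    re.compile(r"(?i)\b(gon)(na)\b"),
    re.compile(r"(?i)\b(wan)(na)\b"),
]


def split_contractions(text):
    """Insert spaces so that each matched contraction becomes two words."""
    for pattern in CONTRACTION_PATTERNS:
        text = pattern.sub(r" \1 \2 ", text)
    return text


# Splitting on whitespace afterwards yields the separate pieces.
print(split_contractions("I wanna go home").split())
print(split_contractions("I gonna go home").split())
```

The point is only that the split is done by inserting spaces with a regex substitution before the final whitespace split, which is the same idea the quote handling relies on.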

After that we reach the punctuation part that you are asking about, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L63:

def tokenize(self, text): 
    #starting quotes 
    text = re.sub(r'^\"', r'``', text) 
    text = re.sub(r'(``)', r' \1 ', text) 
    text = re.sub(r'([ (\[{<])"', r'\1 `` ', text)

Ah ha, the starting quotes change from " -> ``:

>>> import re 
>>> text = '"A"' 
>>> re.sub(r'^\"', r'``', text) 
'``A"' 
>>> re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text)) 
' `` A"' 
>>> re.sub(r'([ (\[{<])"', r'\1 `` ', re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text))) 
' `` A"' 
>>> text_after_startquote_changes = re.sub(r'([ (\[{<])"', r'\1 `` ', re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text))) 
>>> text_after_startquote_changes 
' `` A"'
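
The three starting-quote substitutions can also be wrapped in a small function to try them on other inputs. This is a sketch that reuses the regexes quoted above; the function name `apply_starting_quote_rules` and the sample strings are my own.

```python
import re


def apply_starting_quote_rules(text):
    """Apply only the three starting-quote substitutions quoted above."""
    text = re.sub(r'^\"', r'``', text)              # quote at the very start of the string
    text = re.sub(r'(``)', r' \1 ', text)           # pad every `` with spaces
    text = re.sub(r'([ (\[{<])"', r'\1 `` ', text)  # quote after a space or an opening bracket
    return text


for sample in ['"A"', '""', 'say "hi"', '("x")']:
    print(repr(sample), '->', repr(apply_starting_quote_rules(sample)))
```

Note what happens with '""': the padding in the second substitution puts a space in front of the second quote, so the third substitution treats it as a starting quote too. That is consistent with the ['``', '``'] output in the question.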

Then we see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L85, which deals with ending quotes:

    # ending quotes
    text = re.sub(r'"', " '' ", text)
    text = re.sub(r'(\S)(\'\')', r'\1 \2 ', text)

Applying the regexes:

>>> re.sub(r'"', " '' ", text_after_startquote_changes) 
" `` A '' " 
>>> re.sub(r'(\S)(\'\')', r'\1 \2 ', re.sub(r'"', " '' ", text_after_startquote_changes)) 
" `` A '' "

So, if you want to search the list of tokens for double quotes after nltk.word_tokenize, simply search for `` and '' instead of ".
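
If you need to look the tokens up in the original string afterwards, one option is to map the two quote tokens back to " first. The following is a minimal sketch; `restore_double_quotes` and `token_offsets` are hypothetical helper names, not part of NLTK, and the token list is typed in by hand so the snippet runs without NLTK installed.

```python
def restore_double_quotes(tokens):
    """Map the Treebank-style quote tokens `` and '' back to a plain "."""
    return ['"' if tok in ("``", "''") else tok for tok in tokens]


def token_offsets(text, tokens):
    """Locate each token in text, scanning left to right.

    Returns a list of (start, end) character offsets. str.index raises
    ValueError if a token cannot be found from the current position.
    """
    offsets = []
    position = 0
    for tok in restore_double_quotes(tokens):
        start = text.index(tok, position)
        offsets.append((start, start + len(tok)))
        position = start + len(tok)
    return offsets


tokens = ["``", "A", "''"]  # the tokens shown in the question for '"A"'
print(restore_double_quotes(tokens))
print(token_offsets('"A"', tokens))
```

Keep in mind that this mapping is lossy: if the original text already contained a literal `` or '', those tokens would also be turned into ", and the offset lookup could then fail or point at the wrong place.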

2015-08-25 06:53:23 alvas

Thanks, that makes sense. – Motasim
