2015-12-24 9 views
5

How to prevent stemming of proper nouns

I am trying to write a keyword extraction program using the Stanford POS and NER taggers. For keyword extraction, I am only interested in proper nouns. Here is the basic approach:

  1. Clean up the data by removing anything but alphabetic characters
  2. Remove stopwords
  3. Stem each word
  4. Determine the POS tag of each word
  5. If the POS tag is a noun, send the word to the NER
  6. The NER then determines whether the word is a person, organization, or location

Sample code

import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tag import StanfordNERTagger, StanfordPOSTagger

docText = "Jack Frost works for Boeing Company. He manages 5 aircraft and their crew in London"

# Split on runs of non-word characters
words = re.split(r"\W+", docText)

stops = set(stopwords.words("english"))

# Remove stop words and very short tokens from the list
words = [w for w in words if w not in stops and len(w) > 2]

# Stemming
pstem = PorterStemmer()
words = [pstem.stem(w) for w in words]

nounsWeWant = set(['NN', 'NNS', 'NNP', 'NNPS'])

finalWords = []

stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')

# Lowercase every word that is not tagged as a noun; keep nouns as they are
for w in words:
    if stp.tag([w.lower()])[0][1] not in nounsWeWant:
        finalWords.append(w.lower())
    else:
        finalWords.append(w)

finalString = " ".join(finalWords)
print(finalString)

tagged = stn.tag(finalWords)
print(tagged)

which gives me

Jack Frost work Boe Compani manag aircraft crew London 
[(u'Jack', u'PERSON'), (u'Frost', u'PERSON'), (u'work', u'O'), (u'Boe', u'O'), (u'Compani', u'O'), (u'manag', u'O'), (u'aircraft', u'O'), (u'crew', u'O'), (u'London', u'LOCATION')] 

So clearly, I did not want Boeing to be stemmed, nor Company. I need to stem the words because my input might contain terms like Performing. I have seen that a word like Performing will be picked up by the NER as a proper noun and could therefore be categorized as Organization. Hence, I first stem all the words. Then I check whether the POS tag of the word is a noun. If so, I keep it as is. If not, I convert the word to lower case and add it to the final word list that will be passed to the NER.

Any idea on how to avoid stemming proper nouns?

+0

One idea could be to check all nouns against an English dictionary. Proper nouns are less likely to be listed in a dictionary. – Riyaz
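A minimal sketch of that dictionary check, assuming a plain set of lowercase English words is available. `ENGLISH_WORDS` and `looks_like_proper_noun` are hypothetical names used only for illustration, and the tiny set below stands in for a real word list.

```python
# Hypothetical stand-in for a real English dictionary (assumption: in
# practice this set would be loaded from a full word list).
ENGLISH_WORDS = {"work", "works", "company", "manage", "manages",
                 "aircraft", "crew", "frost"}


def looks_like_proper_noun(word, dictionary=ENGLISH_WORDS):
    """Heuristic: a capitalised word that is missing from the dictionary
    is probably a proper noun and should be left unstemmed."""
    return word[:1].isupper() and word.lower() not in dictionary


words = ["Jack", "Frost", "works", "Boeing", "Company", "London"]
protected = [w for w in words if looks_like_proper_noun(w)]
print(protected)
```

With this toy dictionary the check protects Jack, Boeing and London but misses Frost and Company, because both are also ordinary dictionary words. That is the main weakness of the heuristic.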

+0

This problem is typically referred to as error propagation in NLP. – Riyaz

+0

Why are you doing NER on stemmed words? Usually, that will give you the wrong output. – alvas

Answer

3

Use the full Stanford CoreNLP pipeline to handle your NLP tool chain. Avoid your own tokenizer, cleaner, POS tagger, etc. They will not play well with the NER tool.

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2015-12-09.zip 
unzip stanford-corenlp-full-2015-12-09.zip 
cd stanford-corenlp-full-2015-12-09 
echo "Jack Frost works for Boeing Company. He manages 5 aircraft and their crew in London" > test.txt 
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file test.txt 
cat test.txt.out 

[out]:

<?xml version="1.0" encoding="UTF-8"?> 
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?> 
<root> 
    <document> 
    <sentences> 
     <sentence id="1"> 
     <tokens> 
      <token id="1"> 
      <word>Jack</word> 
      <lemma>Jack</lemma> 
      <CharacterOffsetBegin>0</CharacterOffsetBegin> 
      <CharacterOffsetEnd>4</CharacterOffsetEnd> 
      <POS>NNP</POS> 
      <NER>PERSON</NER> 
      <Speaker>PER0</Speaker> 
      </token> 
      <token id="2"> 
      <word>Frost</word> 
      <lemma>Frost</lemma> 
      <CharacterOffsetBegin>5</CharacterOffsetBegin> 
      <CharacterOffsetEnd>10</CharacterOffsetEnd> 
      <POS>NNP</POS> 
      <NER>PERSON</NER> 
      <Speaker>PER0</Speaker> 
      </token> 
      <token id="3"> 
      <word>works</word> 
      <lemma>work</lemma> 
      <CharacterOffsetBegin>11</CharacterOffsetBegin> 
      <CharacterOffsetEnd>16</CharacterOffsetEnd> 
      <POS>VBZ</POS> 
      <NER>O</NER> 
      <Speaker>PER0</Speaker> 
      </token> 
      <token id="4"> 
      <word>for</word> 
      <lemma>for</lemma> 
      <CharacterOffsetBegin>17</CharacterOffsetBegin> 
      <CharacterOffsetEnd>20</CharacterOffsetEnd> 
      <POS>IN</POS> 
      <NER>O</NER> 
      <Speaker>PER0</Speaker> 
      </token> 
      <token id="5"> 
      <word>Boeing</word> 
      <lemma>Boeing</lemma> 
      <CharacterOffsetBegin>21</CharacterOffsetBegin> 
      <CharacterOffsetEnd>27</CharacterOffsetEnd> 
      <POS>NNP</POS> 
      <NER>ORGANIZATION</NER> 
      <Speaker>PER0</Speaker> 
      </token> 
      <token id="6"> 
      <word>Company</word> 
      <lemma>Company</lemma> 
      <CharacterOffsetBegin>28</CharacterOffsetBegin> 
      <CharacterOffsetEnd>35</CharacterOffsetEnd> 
      <POS>NNP</POS> 
      <NER>ORGANIZATION</NER> 
      <Speaker>PER0</Speaker> 
      </token> 
      <token id="7"> 
      <word>.</word> 
      <lemma>.</lemma> 
      <CharacterOffsetBegin>35</CharacterOffsetBegin> 
      <CharacterOffsetEnd>36</CharacterOffsetEnd> 
      <POS>.</POS> 
      <NER>O</NER> 
      <Speaker>PER0</Speaker> 
      </token> 
     </tokens> 
     <parse>(ROOT (S (NP (NNP Jack) (NNP Frost)) (VP (VBZ works) (PP (IN for) (NP (NNP Boeing) (NNP Company)))) (. .))) </parse> 
     <dependencies type="basic-dependencies"> 
      <dep type="root"> 
      <governor idx="0">ROOT</governor> 
      <dependent idx="3">works</dependent> 
      </dep> 
      <dep type="compound"> 
      <governor idx="2">Frost</governor> 
      <dependent idx="1">Jack</dependent> 
      </dep> 
      <dep type="nsubj"> 
      <governor idx="3">works</governor> 
      <dependent idx="2">Frost</dependent> 
      </dep> 
      <dep type="case"> 
      <governor idx="6">Company</governor> 
      <dependent idx="4">for</dependent> 
      </dep> 
      <dep type="compound"> 
      <governor idx="6">Company</governor> 
      <dependent idx="5">Boeing</dependent> 
      </dep> 
      <dep type="nmod"> 
      <governor idx="3">works</governor> 
      <dependent idx="6">Company</dependent> 
      </dep> 
      <dep type="punct"> 
      <governor idx="3">works</governor> 
      <dependent idx="7">.</dependent> 
      </dep> 
     </dependencies> 
     <dependencies type="collapsed-dependencies"> 
      <dep type="root"> 
      <governor idx="0">ROOT</governor> 
      <dependent idx="3">works</dependent> 
      </dep> 
      <dep type="compound"> 
      <governor idx="2">Frost</governor> 
      <dependent idx="1">Jack</dependent> 
      </dep> 
      <dep type="nsubj"> 
      <governor idx="3">works</governor> 
      <dependent idx="2">Frost</dependent> 
      </dep> 
      <dep type="case"> 
      <governor idx="6">Company</governor> 
      <dependent idx="4">for</dependent> 
      </dep> 
      <dep type="compound"> 
      <governor idx="6">Company</governor> 
      <dependent idx="5">Boeing</dependent> 
      </dep> 
      <dep type="nmod:for"> 
      <governor idx="3">works</governor> 
      <dependent idx="6">Company</dependent> 
      </dep> 
      <dep type="punct"> 
      <governor idx="3">works</governor> 
      <dependent idx="7">.</dependent> 
      </dep> 
     </dependencies> 
     <dependencies type="collapsed-ccprocessed-dependencies"> 
      <dep type="root"> 
      <governor idx="0">ROOT</governor> 
      <dependent idx="3">works</dependent> 
      </dep> 
      <dep type="compound"> 
      <governor idx="2">Frost</governor> 
      <dependent idx="1">Jack</dependent> 
      </dep> 
      <dep type="nsubj"> 
      <governor idx="3">works</governor> 
      <dependent idx="2">Frost</dependent> 
      </dep> 
      <dep type="case"> 
      <governor idx="6">Company</governor> 
      <dependent idx="4">for</dependent> 
      </dep> 
      <dep type="compound"> 
      <governor idx="6">Company</governor> 
      <dependent idx="5">Boeing</dependent> 
      </dep> 
      <dep type="nmod:for"> 
      <governor idx="3">works</governor> 
      <dependent idx="6">Company</dependent> 
      </dep> 
      <dep type="punct"> 
      <governor idx="3">works</governor> 
      <dependent idx="7">.</dependent> 
      </dep> 
     </dependencies> 
     </sentence> 
     <sentence id="2"> 
     <tokens> 
      <token id="1"> 
      <word>He</word> 
      <lemma>he</lemma> 
      <CharacterOffsetBegin>37</CharacterOffsetBegin> 
      <CharacterOffsetEnd>39</CharacterOffsetEnd> 
      <POS>PRP</POS> 
      <NER>O</NER> 
      <Speaker>PER0</Speaker> 
      </token> 
      <token id="2"> 
      <word>manages</word> 
      <lemma>manage</lemma> 
      <CharacterOffsetBegin>40</CharacterOffsetBegin> 
      <CharacterOffsetEnd>47</CharacterOffsetEnd> 
      <POS>VBZ</POS> 
      <NER>O</NER> 
      <Speaker>PER0</Speaker> 
      </token> 
      <token id="3"> 
      <word>5</word> 
      <lemma>5</lemma> 
      <CharacterOffsetBegin>48</CharacterOffsetBegin> 
      <CharacterOffsetEnd>49</CharacterOffsetEnd> 
      <POS>CD</POS> 
      <NER>NUMBER</NER> 
      <NormalizedNER>5.0</NormalizedNER> 
      <Speaker>PER0</Speaker> 
      </token> 
      <token id="4"> 
      <word>aircraft</word> 
      <lemma>aircraft</lemma> 
      <CharacterOffsetBegin>50</CharacterOffsetBegin> 
      <CharacterOffsetEnd>58</CharacterOffsetEnd> 
      <POS>NN</POS> 
      <NER>O</NER> 
      <Speaker>PER0</Speaker> 
      </token> 
      <token id="5"> 
      <word>and</word> 
      <lemma>and</lemma> 
      <CharacterOffsetBegin>59</CharacterOffsetBegin> 
      <CharacterOffsetEnd>62</CharacterOffsetEnd> 
      <POS>CC</POS> 
      <NER>O</NER> 
      <Speaker>PER0</Speaker> 
      </token> 
      <token id="6"> 
      <word>their</word> 
      <lemma>they</lemma> 
      <CharacterOffsetBegin>63</CharacterOffsetBegin> 
      <CharacterOffsetEnd>68</CharacterOffsetEnd> 
      <POS>PRP$</POS> 
      <NER>O</NER> 
      <Speaker>PER0</Speaker> 
      </token> 
      <token id="7"> 
      <word>crew</word> 
      <lemma>crew</lemma> 
      <CharacterOffsetBegin>69</CharacterOffsetBegin> 
      <CharacterOffsetEnd>73</CharacterOffsetEnd> 
      <POS>NN</POS> 
      <NER>O</NER> 
      <Speaker>PER0</Speaker> 
      </token> 
      <token id="8"> 
      <word>in</word> 
      <lemma>in</lemma> 
      <CharacterOffsetBegin>74</CharacterOffsetBegin> 
      <CharacterOffsetEnd>76</CharacterOffsetEnd> 
      <POS>IN</POS> 
      <NER>O</NER> 
      <Speaker>PER0</Speaker> 
      </token> 
      <token id="9"> 
      <word>London</word> 
      <lemma>London</lemma> 
      <CharacterOffsetBegin>77</CharacterOffsetBegin> 
      <CharacterOffsetEnd>83</CharacterOffsetEnd> 
      <POS>NNP</POS> 
      <NER>LOCATION</NER> 
      <Speaker>PER0</Speaker> 
      </token> 
     </tokens> 
     <parse>(ROOT (S (NP (PRP He)) (VP (VBZ manages) (NP (NP (CD 5) (NN aircraft)) (CC and) (NP (NP (PRP$ their) (NN crew)) (PP (IN in) (NP (NNP London)))))))) </parse> 
     <dependencies type="basic-dependencies"> 
      <dep type="root"> 
      <governor idx="0">ROOT</governor> 
      <dependent idx="2">manages</dependent> 
      </dep> 
      <dep type="nsubj"> 
      <governor idx="2">manages</governor> 
      <dependent idx="1">He</dependent> 
      </dep> 
      <dep type="nummod"> 
      <governor idx="4">aircraft</governor> 
      <dependent idx="3">5</dependent> 
      </dep> 
      <dep type="dobj"> 
      <governor idx="2">manages</governor> 
      <dependent idx="4">aircraft</dependent> 
      </dep> 
      <dep type="cc"> 
      <governor idx="4">aircraft</governor> 
      <dependent idx="5">and</dependent> 
      </dep> 
      <dep type="nmod:poss"> 
      <governor idx="7">crew</governor> 
      <dependent idx="6">their</dependent> 
      </dep> 
      <dep type="conj"> 
      <governor idx="4">aircraft</governor> 
      <dependent idx="7">crew</dependent> 
      </dep> 
      <dep type="case"> 
      <governor idx="9">London</governor> 
      <dependent idx="8">in</dependent> 
      </dep> 
      <dep type="nmod"> 
      <governor idx="7">crew</governor> 
      <dependent idx="9">London</dependent> 
      </dep> 
     </dependencies> 
     <dependencies type="collapsed-dependencies"> 
      <dep type="root"> 
      <governor idx="0">ROOT</governor> 
      <dependent idx="2">manages</dependent> 
      </dep> 
      <dep type="nsubj"> 
      <governor idx="2">manages</governor> 
      <dependent idx="1">He</dependent> 
      </dep> 
      <dep type="nummod"> 
      <governor idx="4">aircraft</governor> 
      <dependent idx="3">5</dependent> 
      </dep> 
      <dep type="dobj"> 
      <governor idx="2">manages</governor> 
      <dependent idx="4">aircraft</dependent> 
      </dep> 
      <dep type="cc"> 
      <governor idx="4">aircraft</governor> 
      <dependent idx="5">and</dependent> 
      </dep> 
      <dep type="nmod:poss"> 
      <governor idx="7">crew</governor> 
      <dependent idx="6">their</dependent> 
      </dep> 
      <dep type="conj:and"> 
      <governor idx="4">aircraft</governor> 
      <dependent idx="7">crew</dependent> 
      </dep> 
      <dep type="case"> 
      <governor idx="9">London</governor> 
      <dependent idx="8">in</dependent> 
      </dep> 
      <dep type="nmod:in"> 
      <governor idx="7">crew</governor> 
      <dependent idx="9">London</dependent> 
      </dep> 
     </dependencies> 
     <dependencies type="collapsed-ccprocessed-dependencies"> 
      <dep type="root"> 
      <governor idx="0">ROOT</governor> 
      <dependent idx="2">manages</dependent> 
      </dep> 
      <dep type="nsubj"> 
      <governor idx="2">manages</governor> 
      <dependent idx="1">He</dependent> 
      </dep> 
      <dep type="nummod"> 
      <governor idx="4">aircraft</governor> 
      <dependent idx="3">5</dependent> 
      </dep> 
      <dep type="dobj"> 
      <governor idx="2">manages</governor> 
      <dependent idx="4">aircraft</dependent> 
      </dep> 
      <dep type="cc"> 
      <governor idx="4">aircraft</governor> 
      <dependent idx="5">and</dependent> 
      </dep> 
      <dep type="nmod:poss"> 
      <governor idx="7">crew</governor> 
      <dependent idx="6">their</dependent> 
      </dep> 
      <dep type="dobj" extra="true"> 
      <governor idx="2">manages</governor> 
      <dependent idx="7">crew</dependent> 
      </dep> 
      <dep type="conj:and"> 
      <governor idx="4">aircraft</governor> 
      <dependent idx="7">crew</dependent> 
      </dep> 
      <dep type="case"> 
      <governor idx="9">London</governor> 
      <dependent idx="8">in</dependent> 
      </dep> 
      <dep type="nmod:in"> 
      <governor idx="7">crew</governor> 
      <dependent idx="9">London</dependent> 
      </dep> 
     </dependencies> 
     </sentence> 
    </sentences> 
    <coreference> 
     <coreference> 
     <mention representative="true"> 
      <sentence>1</sentence> 
      <start>1</start> 
      <end>3</end> 
      <head>2</head> 
      <text>Jack Frost</text> 
     </mention> 
     <mention> 
      <sentence>2</sentence> 
      <start>1</start> 
      <end>2</end> 
      <head>1</head> 
      <text>He</text> 
     </mention> 
     </coreference> 
    </coreference> 
    </document> 
</root> 

Or to get the JSON output:

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file test.txt -outputFormat json 

And if you really need a Python wrapper, see https://github.com/smilli/py-corenlp

$ cd stanford-corenlp-full-2015-12-09 
$ export CLASSPATH=protobuf.jar:joda-time.jar:jollyday.jar:xom-1.2.10.jar:stanford-corenlp-3.6.0.jar:stanford-corenlp-3.6.0-models.jar:slf4j-api.jar 
$ java -mx4g edu.stanford.nlp.pipeline.StanfordCoreNLPServer & 
$ cd 
$ git clone https://github.com/smilli/py-corenlp.git 
$ cd py-corenlp 
$ python 
>>> from corenlp import StanfordCoreNLP 
>>> nlp = StanfordCoreNLP('http://localhost:9000') 
>>> text = ("Jack Frost works for Boeing Company. He manages 5 aircraft and their crew in London") 
>>> output = nlp.annotate(text, properties={'annotators': 'tokenize,ssplit,pos,ner', 'outputFormat': 'json'}) 
>>> output 
{u'sentences': [{u'parse': u'SENTENCE_SKIPPED_OR_UNPARSABLE', u'index': 0, u'tokens': [{u'index': 1, u'word': u'Jack', u'lemma': u'Jack', u'after': u' ', u'pos': u'NNP', u'characterOffsetEnd': 4, u'characterOffsetBegin': 0, u'originalText': u'Jack', u'ner': u'PERSON', u'before': u''}, {u'index': 2, u'word': u'Frost', u'lemma': u'Frost', u'after': u' ', u'pos': u'NNP', u'characterOffsetEnd': 10, u'characterOffsetBegin': 5, u'originalText': u'Frost', u'ner': u'PERSON', u'before': u' '}, {u'index': 3, u'word': u'works', u'lemma': u'work', u'after': u' ', u'pos': u'VBZ', u'characterOffsetEnd': 16, u'characterOffsetBegin': 11, u'originalText': u'works', u'ner': u'O', u'before': u' '}, {u'index': 4, u'word': u'for', u'lemma': u'for', u'after': u' ', u'pos': u'IN', u'characterOffsetEnd': 20, u'characterOffsetBegin': 17, u'originalText': u'for', u'ner': u'O', u'before': u' '}, {u'index': 5, u'word': u'Boeing', u'lemma': u'Boeing', u'after': u' ', u'pos': u'NNP', u'characterOffsetEnd': 27, u'characterOffsetBegin': 21, u'originalText': u'Boeing', u'ner': u'ORGANIZATION', u'before': u' '}, {u'index': 6, u'word': u'Company', u'lemma': u'Company', u'after': u'', u'pos': u'NNP', u'characterOffsetEnd': 35, u'characterOffsetBegin': 28, u'originalText': u'Company', u'ner': u'ORGANIZATION', u'before': u' '}, {u'index': 7, u'word': u'.', u'lemma': u'.', u'after': u' ', u'pos': u'.', u'characterOffsetEnd': 36, u'characterOffsetBegin': 35, u'originalText': u'.', u'ner': u'O', u'before': u''}]}, {u'parse': u'SENTENCE_SKIPPED_OR_UNPARSABLE', u'index': 1, u'tokens': [{u'index': 1, u'word': u'He', u'lemma': u'he', u'after': u' ', u'pos': u'PRP', u'characterOffsetEnd': 39, u'characterOffsetBegin': 37, u'originalText': u'He', u'ner': u'O', u'before': u' '}, {u'index': 2, u'word': u'manages', u'lemma': u'manage', u'after': u' ', u'pos': u'VBZ', u'characterOffsetEnd': 47, u'characterOffsetBegin': 40, u'originalText': u'manages', u'ner': u'O', u'before': u' '}, {u'index': 3, u'after': u' ', 
u'word': u'5', u'lemma': u'5', u'normalizedNER': u'5.0', u'pos': u'CD', u'characterOffsetEnd': 49, u'characterOffsetBegin': 48, u'originalText': u'5', u'ner': u'NUMBER', u'before': u' '}, {u'index': 4, u'word': u'aircraft', u'lemma': u'aircraft', u'after': u' ', u'pos': u'NN', u'characterOffsetEnd': 58, u'characterOffsetBegin': 50, u'originalText': u'aircraft', u'ner': u'O', u'before': u' '}, {u'index': 5, u'word': u'and', u'lemma': u'and', u'after': u' ', u'pos': u'CC', u'characterOffsetEnd': 62, u'characterOffsetBegin': 59, u'originalText': u'and', u'ner': u'O', u'before': u' '}, {u'index': 6, u'word': u'their', u'lemma': u'they', u'after': u' ', u'pos': u'PRP$', u'characterOffsetEnd': 68, u'characterOffsetBegin': 63, u'originalText': u'their', u'ner': u'O', u'before': u' '}, {u'index': 7, u'word': u'crew', u'lemma': u'crew', u'after': u' ', u'pos': u'NN', u'characterOffsetEnd': 73, u'characterOffsetBegin': 69, u'originalText': u'crew', u'ner': u'O', u'before': u' '}, {u'index': 8, u'word': u'in', u'lemma': u'in', u'after': u' ', u'pos': u'IN', u'characterOffsetEnd': 76, u'characterOffsetBegin': 74, u'originalText': u'in', u'ner': u'O', u'before': u' '}, {u'index': 9, u'word': u'London', u'lemma': u'London', u'after': u'', u'pos': u'NNP', u'characterOffsetEnd': 83, u'characterOffsetBegin': 77, u'originalText': u'London', u'ner': u'LOCATION', u'before': u' '}]}]} 
>>> annotated_sent0 = output['sentences'][0] 
>>> for token in annotated_sent0['tokens']: 
...  print token['word'], token['lemma'], token['pos'], token['ner'] 
... 
Jack Jack NNP PERSON 
Frost Frost NNP PERSON 
works work VBZ O 
for for IN O 
Boeing Boeing NNP ORGANIZATION 
Company Company NNP ORGANIZATION 
. . . O 

Perhaps this is the output you want:

>>> " ".join(token['lemma'] for token in annotated_sent0['tokens']) 
Jack Frost work for Boeing Company 
>>> " ".join(token['word'] for token in annotated_sent0['tokens']) 
Jack Frost works for Boeing Company 

If you want a wrapper that comes with NLTK, then you will have to wait a little longer until this issue is resolved ;P
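Since the question is about keywords, here is a rough sketch of how the proper-noun keywords could be pulled out of the JSON above by grouping consecutive tokens that share an NER tag. The `output` dictionary is a trimmed-down copy of the fields shown earlier, and `extract_entities` is a hypothetical helper, not part of py-corenlp.

```python
from itertools import groupby

# Trimmed-down copy of the annotate() result shown above: only the
# 'word' and 'ner' fields of each token are kept.
output = {"sentences": [
    {"tokens": [
        {"word": "Jack", "ner": "PERSON"},
        {"word": "Frost", "ner": "PERSON"},
        {"word": "works", "ner": "O"},
        {"word": "for", "ner": "O"},
        {"word": "Boeing", "ner": "ORGANIZATION"},
        {"word": "Company", "ner": "ORGANIZATION"},
        {"word": ".", "ner": "O"},
    ]},
    {"tokens": [
        {"word": "He", "ner": "O"},
        {"word": "manages", "ner": "O"},
        {"word": "5", "ner": "NUMBER"},
        {"word": "aircraft", "ner": "O"},
        {"word": "and", "ner": "O"},
        {"word": "their", "ner": "O"},
        {"word": "crew", "ner": "O"},
        {"word": "in", "ner": "O"},
        {"word": "London", "ner": "LOCATION"},
    ]},
]}


def extract_entities(annotation,
                     wanted=("PERSON", "ORGANIZATION", "LOCATION")):
    """Join runs of adjacent tokens with the same NER tag into one phrase."""
    entities = []
    for sentence in annotation["sentences"]:
        for tag, run in groupby(sentence["tokens"], key=lambda t: t["ner"]):
            if tag in wanted:
                entities.append((" ".join(t["word"] for t in run), tag))
    return entities


print(extract_entities(output))
```

Note that this simple grouping would merge two different entities of the same type if they happened to sit next to each other, so treat it as a starting point only.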

+1

Note that the lemmatizer would (linguistically) be better than the stemmer you want to apply. Computationally, it is a bit more expensive, but since it is already part of the tool chain, you might as well use it. – alvas
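To illustrate the point, a small sketch of one way the lemmas could replace the stemming step: tag the original text first, then normalise only the tokens that the NER left untagged. The triples are copied from the CoreNLP output for the first sentence, and `normalise` is a hypothetical helper name.

```python
# (word, lemma, ner) triples copied from the CoreNLP output for sentence 1.
tokens = [
    ("Jack", "Jack", "PERSON"),
    ("Frost", "Frost", "PERSON"),
    ("works", "work", "O"),
    ("for", "for", "O"),
    ("Boeing", "Boeing", "ORGANIZATION"),
    ("Company", "Company", "ORGANIZATION"),
    (".", ".", "O"),
]


def normalise(tokens):
    """Keep entity tokens untouched; use the lowercased lemma otherwise."""
    return [word if ner != "O" else lemma.lower()
            for word, lemma, ner in tokens]


print(" ".join(normalise(tokens)))
```

Because normalisation happens after tagging, Boeing and Company reach the output unchanged while works is reduced to work.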