Python split text on sentences

105

The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates that this does it:

import nltk.data

# Load the pre-trained Punkt sentence tokenizer for English
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
with open("test.txt") as fp:
    data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))

(I haven't tried it!)

2011-01-01 22:27:43

+0

Thanks, I hope this library works with Russian. – Artyom

+2

@Artyom: It can probably work with Russian; see [can NLTK/pyNLTK work "per language" (i.e. non-English), and how?](http://stackoverflow.com/questions/1795410/can-nltk-pynltk-work-per-language-i-e-non-english-and-how). – martineau

+4

@Artyom: Here is a link to the online documentation for [nltk.tokenize.punkt.PunktSentenceTokenizer](http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.punkt.PunktSentenceTokenizer-class.html). – martineau

3

For simple cases (where sentences are terminated normally), this should work:

import re

with open('somefile.txt') as f:
    text = f.read()
# Split on a terminator, any closing quotes/brackets, and surrounding spaces
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

The regex is *\. +, which matches a period surrounded by 0 or more spaces to the left and 1 or more to the right (to prevent something like the period in re.split from being counted as a change of sentence).

Obviously, this is not the most robust solution, but it will do fine in most cases. The only case it won't cover is abbreviations (perhaps run through the list of sentences and check that each string in sentences starts with a capital letter?)
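The capital-letter check suggested above could be sketched roughly as follows. This is only an illustration, not the answerer's code: the function name split_and_merge is hypothetical, and the split pattern (a terminator followed by one or more spaces) is an assumption chosen to keep the punctuation attached to each fragment.

```python
import re


def split_and_merge(text):
    """Split after '.', '?' or '!' followed by spaces, then glue back
    any fragment that starts with a lowercase letter, on the assumption
    that it continues the previous sentence (e.g. after "p.m.")."""
    fragments = re.split(r'(?<=[.?!]) +', text.strip())
    sentences = []
    for fragment in fragments:
        if sentences and fragment and fragment[0].islower():
            # Probably an abbreviation split: rejoin with the previous piece
            sentences[-1] += ' ' + fragment
        else:
            sentences.append(fragment)
    return sentences


print(split_and_merge("He arrived at 5 p.m. and left. Then it rained."))
```

Note that this heuristic does nothing for abbreviations followed by a capitalized word: "Mr. Smith left." is still cut after "Mr.".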

2011-01-01 22:34:48

+26

You can't think of a situation in English where a sentence doesn't end with a period? Imagine that! My response to that would be, "think again." (See what I did there?)

+0

@Now wow, I can't believe I was that stupid. I must be drunk or something.

+0

I am using Python 2.7.2 on Win 7 x86, and the regex in the code above gives me this error: "SyntaxError: EOL while scanning string literal", pointing to the closing parenthesis (after text). Also, the regex you reference in your text does not exist in your code sample. – Sabuncu

1

@Artyom,

Hi! You could build a new tokenizer for Russian (and some other languages) using this function:

def russianTokenizer(text):
    result = text
    # Surround each punctuation mark with spaces so it becomes its own token
    result = result.replace('.', ' . ')
    # Three padded periods in a row are an ellipsis: glue them back together
    result = result.replace(' .  .  . ', ' ... ')
    result = result.replace(',', ' , ')
    result = result.replace(':', ' : ')
    result = result.replace(';', ' ; ')
    result = result.replace('!', ' ! ')
    result = result.replace('?', ' ? ')
    result = result.replace('\"', ' \" ')
    result = result.replace('\'', ' \' ')
    result = result.replace('(', ' (')
    result = result.replace(')', ') ')
    # Collapse the double spaces introduced above (repeated for longer runs)
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.strip()
    result = result.split(' ')
    return result

and then call it like this:

text = 'вы выполняете поиск, используя Google SSL;' 
tokens = russianTokenizer(text)
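For comparison, much the same word-level tokenization can be sketched with a single regular expression. This is an alternative illustration, not the function above: the names TOKEN_PATTERN and regex_tokenizer are hypothetical, and it treats every punctuation character (including parentheses) as its own token, which differs slightly from the replace-based version.

```python
import re

# One alternation: an ellipsis, a run of word characters, or a single
# non-space, non-word character (punctuation). In Python 3, \w matches
# Unicode letters, so Cyrillic words are handled without extra flags.
TOKEN_PATTERN = re.compile(r'\.\.\.|\w+|[^\w\s]')


def regex_tokenizer(text):
    return TOKEN_PATTERN.findall(text)


tokens = regex_tokenizer('вы выполняете поиск, используя Google SSL;')
print(tokens)
# ['вы', 'выполняете', 'поиск', ',', 'используя', 'Google', 'SSL', ';']
```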

Good luck, Marilena.

2012-01-28 17:42:20

1

There is no doubt that NLTK is the most suitable tool for the purpose. But getting started with NLTK is quite painful (although once it is installed, you just reap the rewards).

So here is some simple re-based code, available at http://pythonicprose.blogspot.com/2009/09/python-split-paragraph-into-sentences.html

# split up a paragraph into sentences
# using regular expressions
import re


def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    # to split by multiple characters,
    # regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList


if __name__ == '__main__':
    p = """This is a sentence. This is an excited sentence! And do you think this is a question?"""

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print(s.strip())

# output:
# This is a sentence
# This is an excited sentence
# And do you think this is a question
# (followed by one empty line, from the empty string after the final '?')

2012-05-14 01:59:41 vaichidrewar

+3

Yeah, but this fails so easily, e.g. with: "Mr. Smith knows this is a sentence." – thomas
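To make that failure concrete, here is a small sketch that applies the same '[.!?]' split to the sentence from the comment (the variable names are only illustrative):

```python
import re

sentence_enders = re.compile('[.!?]')
pieces = sentence_enders.split("Mr. Smith knows this is a sentence.")
print(pieces)
# ['Mr', ' Smith knows this is a sentence', '']
```

The period after "Mr" is treated as a sentence boundary, and the terminators themselves are discarded.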

7

Here is a middle-of-the-road approach that doesn't rely on any external libraries. I use list comprehensions to exclude overlaps between abbreviations and terminators, as well as overlaps between variations on terminations, for example: '.' vs. '."'

abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                 'i.e.': 'that is', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
    # Peel sentences off the end of the paragraph, one at a time
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences


def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    # Drop terminators that are really the end of a known abbreviation
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    # Yield the start index of every non-overlapping occurrence of sub
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)

I used Karl's find_all function from this entry: Find all occurrences of a substring in Python

2015-01-22 15:59:12 TennisVisuals

+1

Perfect approach! The others don't catch '...' and '?!'.

40

This function can split the entire text of Huckleberry Finn into sentences in about 0.1 seconds, and it handles many of the more painful edge cases that make sentence parsing non-trivial, e.g. "Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst."

# -*- coding: utf-8 -*- 
import re 
caps = "([A-Z])" 
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]" 
suffixes = "(Inc|Ltd|Jr|Sr|Co)" 
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)" 
websites = "[.](com|net|org|io|gov)" 

def split_into_sentences(text): 
    text = " " + text + " " 
    text = text.replace("\n"," ") 
    text = re.sub(prefixes,"\\1<prd>",text) 
    text = re.sub(websites,"<prd>\\1",text) 
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>") 
    text = re.sub(r"\s" + caps + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text) 
    text = re.sub(caps + "[.]" + caps + "[.]" + caps + "[.]","\\1<prd>\\2<prd>\\3<prd>",text) 
    text = re.sub(caps + "[.]" + caps + "[.]","\\1<prd>\\2<prd>",text) 
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text) 
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text) 
    text = re.sub(" " + caps + "[.]"," \\1<prd>",text) 
    if "”" in text: text = text.replace(".”","”.") 
    if "\"" in text: text = text.replace(".\"","\".") 
    if "!" in text: text = text.replace("!\"","\"!") 
    if "?" in text: text = text.replace("?\"","\"?") 
    text = text.replace(".",".<stop>") 
    text = text.replace("?","?<stop>") 
    text = text.replace("!","!<stop>") 
    text = text.replace("<prd>",".") 
    sentences = text.split("<stop>") 
    sentences = sentences[:-1] 
    sentences = [s.strip() for s in sentences] 
    return sentences

2015-07-19 20:50:33

+6

This is an awesome solution. However, I added two more lines to it: digits = "([0-9])" in the declaration of the regular expressions, and text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text) in the function. Now it doesn't split the line at decimals such as 5.5. Thank you for this answer.
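A minimal, self-contained sketch of that decimal fix, using the same <prd>/<stop> placeholder idea as the answer. The function name split_protecting_decimals is hypothetical, and this stripped-down version deliberately ignores all the other cases (titles, acronyms, websites) that the full function handles.

```python
import re

digits = "([0-9])"


def split_protecting_decimals(text):
    # Hide the period inside decimals such as 5.5 behind a placeholder
    text = re.sub(digits + "[.]" + digits, "\\1<prd>\\2", text)
    # Mark every remaining terminator as a sentence boundary
    text = re.sub(r"([.?!])", r"\1<stop>", text)
    # Restore the protected periods, then cut at the boundaries
    text = text.replace("<prd>", ".")
    return [s.strip() for s in text.split("<stop>") if s.strip()]


print(split_protecting_decimals("It costs 5.5 dollars. Is that OK?"))
# ['It costs 5.5 dollars.', 'Is that OK?']
```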

+0

How did you parse the entire Huckleberry Finn? Where is it available in text format? – PascalVKooten

+1

A great solution. In the function, I added if "e.g." in text: text = text.replace("e.g.","e<prd>g<prd>") and if "i.e." in text: text = text.replace("i.e.","i<prd>e<prd>"), and it fully solved my problem.

2

Instead of using regular expressions to split the text into sentences, you can also use the NLTK library.

>>> from nltk import tokenize 
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3." 

>>> tokenize.sent_tokenize(p) 
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']

ref: https://stackoverflow.com/a/9474645/2877052

2017-10-30 13:34:56

0

You can try using spaCy instead of regex. I use it and it does the job.

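import spacy
# spaCy 3.x: the 'en' shortcut was removed, so load a named model such as
# en_core_web_sm (install it with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

text = '''Your text here'''
tokens = nlp(text)

for sent in tokens.sents:
    # Span.string was removed in spaCy 3.x; use Span.text instead
    print(sent.text.strip())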

2018-01-10 12:03:41 Elf
