How can I get all the plain text from a website with Scrapy?

I would like to get all the visible text from a website, after the HTML is rendered. I'm working in Python with the Scrapy framework. With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks!

2014-04-18 tomasyany

The simplest option would be to extract //body//text() and join everything found:

''.join(sel.xpath("//body//text()").extract()).strip()

where sel is a Selector instance.
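
As a hedged aside: joining with an empty string can run words from adjacent elements together, and it keeps the raw whitespace of every text node. A minimal sketch of a whitespace-normalizing join follows. The name join_text_nodes is a hypothetical helper, and fragments stands in for the list of strings that extract() returns.

```python
def join_text_nodes(fragments):
    """Join text fragments with single spaces, dropping whitespace-only ones."""
    # str.split() with no arguments splits on any run of whitespace, so this
    # collapses newlines, tabs and repeated spaces inside each fragment.
    cleaned = (" ".join(fragment.split()) for fragment in fragments)
    return " ".join(piece for piece in cleaned if piece)


# Made-up fragments, similar to what extract() might return for a small page.
fragments = ["\n  ", "First paragraph.", "\n\t", "Second   paragraph.", "\n"]
text = join_text_nodes(fragments)
print(text)
```

Whether a single space is the right separator depends on the page; a newline may suit block-level elements better.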

Another option is to use nltk's clean_html():

>>> import nltk 
>>> html = """ 
... <div class="post-text" itemprop="description"> 
... 
...   <p>I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework. 
... With <code>xpath('//body//text()')</code> I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !</p> 
... 
...  </div>""" 
>>> nltk.clean_html(html) 
"I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.\nWith xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !"

Another option is to use BeautifulSoup's get_text():

get_text()

If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, "html.parser")
>>> print(soup.get_text().strip())
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework. 
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !

Another option is to use lxml.html's text_content():

.text_content()

Returns the text content of the element, including the text content of its children, with no markup.

>>> import lxml.html
>>> tree = lxml.html.fromstring(html)
>>> print(tree.text_content().strip())
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework. 
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !

2014-04-18 15:18:56 alecxe

I deleted my question. I used the following code: html = sel.select("//body//text()"); tree = lxml.html.fromstring(html); item['description'] = tree.text_content().strip(). But I am getting this error: is_full_html = _looks_like_full_html_unicode(html) exceptions.TypeError: expected string or buffer. What went wrong? – Backtrack
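
The TypeError in the comment above is consistent with passing a selector result to lxml.html.fromstring(), which expects the HTML markup itself as a string. A minimal sketch, assuming the markup is available as a string: in a Scrapy callback that would typically be response.text in recent versions, which is an assumption about the setup, and the sample html value here is made up.

```python
import lxml.html

# Made-up markup standing in for the page source.
html = "<html><body><div><p>Hello <b>world</b></p></div></body></html>"

# fromstring() parses a string of HTML, not a Selector or a list of selectors.
tree = lxml.html.fromstring(html)

# text_content() returns the text of the element and all of its descendants.
description = tree.text_content().strip()
print(description)
```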

'nltk' worked best for me – user4421975

Just as an update: nltk deprecated its clean_html method and now recommends instead: 'NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function' – TheNastyOne

Have you tried this?

xpath('//body//text()').re('(\w+)')

xpath('//body//text()').extract()

2014-04-18 15:08:41

That actually works pretty well, but it still returns some HTML tags and other things. – tomasyany
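
One plausible cause, not confirmed in the thread, is that //body//text() also matches the contents of script and style elements. Below is a sketch of an XPath predicate that skips them, shown with lxml on made-up markup. The same expression should also work with a Scrapy selector's xpath() method, though that is not tested here.

```python
import lxml.html

# Made-up markup with a <style> and a <script> element mixed into the page.
html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello</p><script>var x = 1;</script><p>world</p></body></html>"
)
tree = lxml.html.fromstring(html)

# Select text nodes under <body>, excluding those inside <script> or <style>.
visible_xpath = "//body//text()[not(ancestor::script) and not(ancestor::style)]"
fragments = tree.xpath(visible_xpath)

# Drop whitespace-only fragments and join the rest with single spaces.
text = " ".join(fragment.strip() for fragment in fragments if fragment.strip())
print(text)
```

This only filters by element name. Text hidden through CSS or injected by JavaScript is outside what a static XPath query can detect.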
