
Answer

What you can do is define an errback on your Request instances:

errback (callable) – a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives a Twisted Failure instance as first parameter.

Here is some example code (for Scrapy 1.0) that you can use:

# -*- coding: utf-8 -*-
# errbacks.py
import scrapy

# from scrapy.contrib.spidermiddleware.httperror import HttpError
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError


class ErrbackSpider(scrapy.Spider):
    name = "errbacks"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.error('Got successful response from {}'.format(response.url))
        # do something useful now

    def errback_httpbin(self, failure):
        # log all errback failures;
        # in case you want to do something special for some errors,
        # you may need the failure's type
        self.logger.error(repr(failure))

        # if isinstance(failure.value, HttpError):
        if failure.check(HttpError):
            # you can get the response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        # elif isinstance(failure.value, DNSLookupError):
        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        # elif isinstance(failure.value, TimeoutError):
        elif failure.check(TimeoutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

And the output of a run (with only 1 retry and a 5s download timeout):

$ scrapy runspider errbacks.py --set DOWNLOAD_TIMEOUT=5 --set RETRY_TIMES=1 
2015-06-30 23:45:55 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot) 
2015-06-30 23:45:55 [scrapy] INFO: Optional features available: ssl, http11 
2015-06-30 23:45:55 [scrapy] INFO: Overridden settings: {'DOWNLOAD_TIMEOUT': '5', 'RETRY_TIMES': '1'} 
2015-06-30 23:45:56 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState 
2015-06-30 23:45:56 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2015-06-30 23:45:56 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2015-06-30 23:45:56 [scrapy] INFO: Enabled item pipelines: 
2015-06-30 23:45:56 [scrapy] INFO: Spider opened 
2015-06-30 23:45:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2015-06-30 23:45:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2015-06-30 23:45:56 [scrapy] DEBUG: Retrying <GET http://www.httphttpbinbin.org/> (failed 1 times): DNS lookup failed: address 'www.httphttpbinbin.org' not found: [Errno -5] No address associated with hostname. 
2015-06-30 23:45:56 [scrapy] DEBUG: Gave up retrying <GET http://www.httphttpbinbin.org/> (failed 2 times): DNS lookup failed: address 'www.httphttpbinbin.org' not found: [Errno -5] No address associated with hostname. 
2015-06-30 23:45:56 [errbacks] ERROR: <twisted.python.failure.Failure <class 'twisted.internet.error.DNSLookupError'>> 
2015-06-30 23:45:56 [errbacks] ERROR: DNSLookupError on http://www.httphttpbinbin.org/ 
2015-06-30 23:45:56 [scrapy] DEBUG: Crawled (200) <GET http://www.httpbin.org/> (referer: None) 
2015-06-30 23:45:56 [scrapy] DEBUG: Crawled (404) <GET http://www.httpbin.org/status/404> (referer: None) 
2015-06-30 23:45:56 [errbacks] ERROR: Got successful response from http://www.httpbin.org/ 
2015-06-30 23:45:56 [errbacks] ERROR: <twisted.python.failure.Failure <class 'scrapy.spidermiddlewares.httperror.HttpError'>> 
2015-06-30 23:45:56 [errbacks] ERROR: HttpError on http://www.httpbin.org/status/404 
2015-06-30 23:45:56 [scrapy] DEBUG: Retrying <GET http://www.httpbin.org/status/500> (failed 1 times): 500 Internal Server Error 
2015-06-30 23:45:57 [scrapy] DEBUG: Gave up retrying <GET http://www.httpbin.org/status/500> (failed 2 times): 500 Internal Server Error 
2015-06-30 23:45:57 [scrapy] DEBUG: Crawled (500) <GET http://www.httpbin.org/status/500> (referer: None) 
2015-06-30 23:45:57 [errbacks] ERROR: <twisted.python.failure.Failure <class 'scrapy.spidermiddlewares.httperror.HttpError'>> 
2015-06-30 23:45:57 [errbacks] ERROR: HttpError on http://www.httpbin.org/status/500 
2015-06-30 23:46:01 [scrapy] DEBUG: Retrying <GET http://www.httpbin.org:12345/> (failed 1 times): User timeout caused connection failure. 
2015-06-30 23:46:06 [scrapy] DEBUG: Gave up retrying <GET http://www.httpbin.org:12345/> (failed 2 times): User timeout caused connection failure. 
2015-06-30 23:46:06 [errbacks] ERROR: <twisted.python.failure.Failure <class 'twisted.internet.error.TimeoutError'>> 
2015-06-30 23:46:06 [errbacks] ERROR: TimeoutError on http://www.httpbin.org:12345/ 
2015-06-30 23:46:06 [scrapy] INFO: Closing spider (finished) 
2015-06-30 23:46:06 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/exception_count': 4, 
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2, 
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2, 
'downloader/request_bytes': 1748, 
'downloader/request_count': 8, 
'downloader/request_method_count/GET': 8, 
'downloader/response_bytes': 12506, 
'downloader/response_count': 4, 
'downloader/response_status_count/200': 1, 
'downloader/response_status_count/404': 1, 
'downloader/response_status_count/500': 2, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2015, 6, 30, 21, 46, 6, 537191), 
'log_count/DEBUG': 10, 
'log_count/ERROR': 9, 
'log_count/INFO': 7, 
'response_received_count': 3, 
'scheduler/dequeued': 8, 
'scheduler/dequeued/memory': 8, 
'scheduler/enqueued': 8, 
'scheduler/enqueued/memory': 8, 
'start_time': datetime.datetime(2015, 6, 30, 21, 45, 56, 322235)} 
2015-06-30 23:46:06 [scrapy] INFO: Spider closed (finished) 
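
The DOWNLOAD_TIMEOUT and RETRY_TIMES values above were passed on the command line with --set; if you prefer to keep them with the spider itself, the same can be done through custom_settings (a per-spider settings override available since Scrapy 1.0). A minimal sketch:

class ErrbackSpider(scrapy.Spider):
    name = "errbacks"
    # equivalent to --set DOWNLOAD_TIMEOUT=5 --set RETRY_TIMES=1
    custom_settings = {
        'DOWNLOAD_TIMEOUT': 5,
        'RETRY_TIMES': 1,
    }
    # ... start_requests(), parse_httpbin() and errback_httpbin() as above ...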

Note how Scrapy records the exceptions in its stats:

'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2, 
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2, 
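
If you want to act on these counters from code rather than just read them in the log, the stats collector is reachable from the spider. A minimal sketch, assuming you add a closed() method to the spider above (closed() is called automatically when the spider finishes):

    def closed(self, reason):
        # the stats collector holds the same counters that are dumped at the end of the run
        stats = self.crawler.stats
        dns_errors = stats.get_value(
            'downloader/exception_type_count/twisted.internet.error.DNSLookupError', 0)
        timeouts = stats.get_value(
            'downloader/exception_type_count/twisted.internet.error.TimeoutError', 0)
        self.logger.info('DNS errors: %s, timeouts: %s', dns_errors, timeouts)
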
Another answer:

I prefer to have a custom retry middleware like this:

# from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware  # pre-1.0 location
from scrapy.downloadermiddlewares.retry import RetryMiddleware

from fake_useragent import FakeUserAgentError


class FakeUserAgentErrorRetryMiddleware(RetryMiddleware):

    def process_exception(self, request, exception, spider):
        # retry the request whenever the fake_useragent library raises its error
        if isinstance(exception, FakeUserAgentError):
            return self._retry(request, exception, spider)
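
For this to take effect, the middleware has to be enabled in the project settings. A minimal sketch, assuming the class lives in a (hypothetical) myproject/middlewares.py; whether you keep or disable the stock RetryMiddleware depends on whether you still want the standard retry behaviour as well:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # run the custom middleware alongside the built-in RetryMiddleware (priority 550);
    # add 'scrapy.downloadermiddlewares.retry.RetryMiddleware': None here instead
    # if you want the built-in one fully replaced
    'myproject.middlewares.FakeUserAgentErrorRetryMiddleware': 560,
}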