2016-03-29 41 views

I've been trying to use Scrapy to get some data out of Google Analytics, and despite being a complete Python beginner I've made some progress. I can now log into Google Analytics with Scrapy, but I need to make an AJAX request to get the data I want. I tried to replicate the browser's HTTP request headers with the code below, but it doesn't seem to work; my error log says:

too many values to unpack

Could someone help? I've been working on this for two days; I feel like I'm very close, but I'm also very confused.

Here is the code:

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 
from scrapy.http import FormRequest, Request 
from scrapy.selector import Selector 
import logging 
from super.items import SuperItem 
from scrapy.shell import inspect_response 
import json 

class LoginSpider(BaseSpider): 
    name = 'super' 
    start_urls = ['https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr#identifier'] 

    def parse(self, response):
        return [FormRequest.from_response(response,
            formdata={'Email': 'Email'},
            callback=self.log_password)]

    def log_password(self, response):
        return [FormRequest.from_response(response,
            formdata={'Passwd': 'Password'},
            callback=self.after_login)]

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.log("Login failed", level=logging.ERROR)
            return
        else:
            # We've successfully authenticated, let's have some fun!
            print("Login Successful!!")
            return Request(url="https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0",
                method='POST',
                headers=[{'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
                    'Galaxy-Ajax': 'true',
                    'Origin': 'https://analytics.google.com',
                    'Referer': 'https://analytics.google.com/analytics/web/?hl=fr&pli=1',
                    'User-Agent': 'My-user-agent',
                    'X-GAFE4-XSRF-TOKEN': 'Mytoken'}],
                callback=self.parse_tastypage, dont_filter=True)

    def parse_tastypage(self, response):
        response = json.loads(jsonResponse)
        inspect_response(response, self)
        yield item

And here is part of the log:

2016-03-28 19:11:39 [scrapy] INFO: Enabled item pipelines: 
[] 
2016-03-28 19:11:39 [scrapy] INFO: Spider opened 
2016-03-28 19:11:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-03-28 19:11:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-03-28 19:11:40 [scrapy] DEBUG: Crawled (200) <GET https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr#identifier> (referer: None) 
2016-03-28 19:11:46 [scrapy] DEBUG: Crawled (200) <POST https://accounts.google.com/AccountLoginInfo> (referer: https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr) 
2016-03-28 19:11:50 [scrapy] DEBUG: Redirecting (302) to <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA> from <POST https://accounts.google.com/ServiceLoginAuth> 
2016-03-28 19:11:57 [scrapy] DEBUG: Redirecting (302) to <GET https://www.google.com/analytics/web/?hl=fr> from <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA> 
2016-03-28 19:12:01 [scrapy] DEBUG: Crawled (200) <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo) 
Login Successful!! 
2016-03-28 19:12:01 [scrapy] ERROR: Spider error processing <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo) 
Traceback (most recent call last): 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 577, in _runCallbacks 
    current.result = callback(current.result, *args, **kw) 
    File "/Users/aminbouraiss/super/super/spiders/mySuper.py", line 42, in after_login 
    callback=self.parse_tastypage, dont_filter=True) 
    File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/http/request/__init__.py", line 35, in __init__ 
    self.headers = Headers(headers or {}, encoding=encoding) 
    File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/http/headers.py", line 12, in __init__ 
    super(Headers, self).__init__(seq) 
    File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 193, in __init__ 
    self.update(seq) 
    File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 229, in update 
    super(CaselessDict, self).update(iseq) 
    File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 228, in <genexpr> 
    iseq = ((self.normkey(k), self.normvalue(v)) for k, v in seq) 
ValueError: too many values to unpack 
2016-03-28 19:12:01 [scrapy] INFO: Closing spider (finished) 
2016-03-28 19:12:01 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 6419, 
'downloader/request_count': 5, 
'downloader/request_method_count/GET': 3, 
'downloader/request_method_count/POST': 2, 
'downloader/response_bytes': 75986, 
'downloader/response_count': 5, 
'downloader/response_status_count/200': 3, 
'downloader/response_status_count/302': 2, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 3, 28, 23, 12, 1, 824033), 
'log_count/DEBUG': 6, 

No, use the API. –


I'm trying to get some data that I can't get through the API. –

Answer


Your error is because the headers must be a dictionary, not a list containing a dict:

headers={'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8', 

          'Galaxy-Ajax': 'true', 
          'Origin': 'https://analytics.google.com', 
          'Referer': 'https://analytics.google.com/analytics/web/?hl=fr&pli=1', 
          'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36', 
          }, 

This will fix the current problem, but you will then get a 411, since you also need to specify Content-Length. If you add what you want to pull, I'll be able to show you how. You can see the output below:

2016-03-29 14:02:11 [scrapy] DEBUG: Redirecting (302) to <GET https://www.google.com/analytics/web/?hl=fr> from <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA> 
2016-03-29 14:02:13 [scrapy] DEBUG: Crawled (200) <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo) 
Login Successful!! 
2016-03-29 14:02:14 [scrapy] DEBUG: Crawled (411) <POST https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0> (referer: https://analytics.google.com/analytics/web/?hl=fr&pli=1) 
2016-03-29 14:02:14 [scrapy] DEBUG: Ignoring response <411 https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0>: HTTP status code is not handled or not allowed 
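One way past the 411 is to give the POST request a non-empty body, so `Content-Length` can be derived from the encoded bytes rather than left unset. A sketch (Python 3 shown; the form fields here are placeholders — the real `getPage` payload would have to be copied from the browser's network tab):

```python
from urllib.parse import urlencode

# Hypothetical form fields, standing in for whatever the real
# getPage request actually sends.
payload = urlencode({'token': 'Mytoken', 'hl': 'fr'})
body = payload.encode('utf-8')

headers = {
    'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
    'Galaxy-Ajax': 'true',
    # Content-Length must match the byte length of the encoded body.
    'Content-Length': str(len(body)),
}
print(headers['Content-Length'])
```

In Scrapy itself, passing `body=payload` to `Request` should be enough; the downloader then fills in `Content-Length` for you.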

Thanks Padraic, I owe you a beer! I changed the HTTP request headers and it finally worked. –


@gerardbaste, no problem, glad that sorted it out — happy scraping. –