I've been trying to use Scrapy to get some data out of Google Analytics, and despite being a complete Python beginner I've made some progress. I can now log into Google Analytics with Scrapy, but I need to make an AJAX request to get the data I want. I tried to replicate the browser's HTTP request headers with the code below, but it doesn't seem to work; my error log says:
ValueError: too many values to unpack
Could anyone help? I've been working on this for two days; I feel I'm very close, but I'm also very confused.
Here is the code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request
from scrapy.selector import Selector
import logging
from super.items import SuperItem
from scrapy.shell import inspect_response
import json


class LoginSpider(BaseSpider):
    name = 'super'
    start_urls = ['https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr#identifier']

    def parse(self, response):
        return [FormRequest.from_response(response,
                                          formdata={'Email': 'Email'},
                                          callback=self.log_password)]

    def log_password(self, response):
        return [FormRequest.from_response(response,
                                          formdata={'Passwd': 'Password'},
                                          callback=self.after_login)]

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.log("Login failed", level=logging.ERROR)
            return
        # We've successfully authenticated, let's have some fun!
        else:
            print("Login Successful!!")
            return Request(url="https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0",
                           method='POST',
                           headers=[{'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
                                     'Galaxy-Ajax': 'true',
                                     'Origin': 'https://analytics.google.com',
                                     'Referer': 'https://analytics.google.com/analytics/web/?hl=fr&pli=1',
                                     'User-Agent': 'My-user-agent',
                                     'X-GAFE4-XSRF-TOKEN': 'Mytoken'}],
                           callback=self.parse_tastypage, dont_filter=True)

    def parse_tastypage(self, response):
        data = json.loads(response.body)
        inspect_response(response, self)
        item = SuperItem()  # to be filled in from `data`
        yield item
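The traceback pinpoints the problem: `Request` expects `headers` to be a dict mapping header names to values, but the spider wraps that dict in a list. Scrapy's `CaselessDict.update` then iterates the sequence and unpacks each element into a `(key, value)` pair (`for k, v in seq` in the traceback); the single element is the whole multi-key dict, so the unpack fails. A minimal, Scrapy-free sketch of the mechanism (the header values below are just placeholders taken from the question):

```python
# The spider passed headers=[{...}] -- a LIST containing one dict.
# Internally Scrapy does roughly:
#     iseq = ((normkey(k), normvalue(v)) for k, v in seq)
# so every element of `seq` must itself be a (key, value) pair.

browser_headers = {
    'Galaxy-Ajax': 'true',
    'Origin': 'https://analytics.google.com',
    'Referer': 'https://analytics.google.com/analytics/web/?hl=fr&pli=1',
}

# Reproducing the failure: the only element of the list is the whole
# 3-key dict, which cannot be unpacked into just (k, v).
try:
    [(k, v) for k, v in [browser_headers]]   # like headers=[{...}]
except ValueError as exc:
    print(exc)                               # too many values to unpack

# Passing the plain dict is what Request expects; Scrapy can then read
# it as header-name -> value pairs:
pairs = [(k, v) for k, v in browser_headers.items()]
print(len(pairs))  # 3
```

So in `after_login`, dropping the surrounding square brackets, i.e. `headers={...}` instead of `headers=[{...}]`, should get past this ValueError.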
And here is part of the log:
2016-03-28 19:11:39 [scrapy] INFO: Enabled item pipelines:
[]
2016-03-28 19:11:39 [scrapy] INFO: Spider opened
2016-03-28 19:11:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-28 19:11:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-28 19:11:40 [scrapy] DEBUG: Crawled (200) <GET https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr#identifier> (referer: None)
2016-03-28 19:11:46 [scrapy] DEBUG: Crawled (200) <POST https://accounts.google.com/AccountLoginInfo> (referer: https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr)
2016-03-28 19:11:50 [scrapy] DEBUG: Redirecting (302) to <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA> from <POST https://accounts.google.com/ServiceLoginAuth>
2016-03-28 19:11:57 [scrapy] DEBUG: Redirecting (302) to <GET https://www.google.com/analytics/web/?hl=fr> from <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA>
2016-03-28 19:12:01 [scrapy] DEBUG: Crawled (200) <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo)
Login Successful!!
2016-03-28 19:12:01 [scrapy] ERROR: Spider error processing <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo)
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Users/aminbouraiss/super/super/spiders/mySuper.py", line 42, in after_login
callback=self.parse_tastypage, dont_filter=True)
File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/http/request/__init__.py", line 35, in __init__
self.headers = Headers(headers or {}, encoding=encoding)
File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/http/headers.py", line 12, in __init__
super(Headers, self).__init__(seq)
File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 193, in __init__
self.update(seq)
File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 229, in update
super(CaselessDict, self).update(iseq)
File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 228, in <genexpr>
iseq = ((self.normkey(k), self.normvalue(v)) for k, v in seq)
ValueError: too many values to unpack
2016-03-28 19:12:01 [scrapy] INFO: Closing spider (finished)
2016-03-28 19:12:01 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 6419,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 2,
'downloader/response_bytes': 75986,
'downloader/response_count': 5,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/302': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 3, 28, 23, 12, 1, 824033),
'log_count/DEBUG': 6,
No, use the API instead –
I'm trying to get some data that I can't get through the API –