Scrapy Cloud - Storing Results in Google Cloud Storage

While building my pet project, I needed to scrape files from certain websites. Using the Scrapy library in Python, I wanted to download all the files linked on the website pages to Google Cloud Storage (GCS).

The Scrapy documentation and Scrapinghub are not very clear about the steps needed to authenticate Scrapy Cloud so it can upload scraped files to a GCS bucket. I tried to follow the tutorial here but without success.

Here are the steps I followed to successfully connect Scrapy Cloud and GCS:

  1. Set up the GCS bucket and take note of the bucket name.
  2. In your Scrapy project, open settings.py and add the FilesPipeline to ITEM_PIPELINES.

         ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
  3. Specify your bucket name and Google Cloud Project ID in settings.py:

     FILES_STORE = 'gs://bucket-name/'
     GCS_PROJECT_ID = 'project-id'

    Tip: Don’t forget the trailing slash (directory separator) after the bucket name in FILES_STORE.

  4. Next, you have to obtain the credentials that let your Scrapy project connect to GCS. Create the service account key here by selecting New service account and the JSON key type. From the role list, select Project > Owner. A JSON file will be downloaded to your computer. See this page for more instructions.
  5. In the same directory as your settings.py, open the project's __init__.py and add this code (source). It writes the credentials to a file at runtime and points GOOGLE_APPLICATION_CREDENTIALS at it so the GCS client can authenticate.

     import os
     import logging

     # Write the credentials JSON to a file at runtime so that
     # GOOGLE_APPLICATION_CREDENTIALS can point to it.
     path = "{}/google-cloud-storage-credentials.json".format(os.getcwd())
     credentials_content = '''
         escaped content of the credentials JSON
     '''
     with open(path, "w") as text_file:
         text_file.write(credentials_content)
     logging.warning("Path to credentials: %s" % path)
     os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = path
  6. Open the JSON file and copy and paste all of its contents into this site here to escape the string. Then paste the escaped string into credentials_content in the code above, replacing the placeholder line. (If you prefer to do the escaping locally, see the sketch after this list.)
  7. Then, in the spider code, return a dict with a file_urls key containing the list of file URLs to download; FilesPipeline picks these up and stores the downloads in the bucket. See the Scrapy documentation for more detail, and the minimal spider sketch after this list.
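
Alternatively, for step 6, a minimal Python sketch can produce the same escaped string locally (assuming the key file is saved as service-account-key.json in the current directory; adjust the filename to match yours):

     import json

     # Read the downloaded service account key file.
     # 'service-account-key.json' is a placeholder filename.
     with open("service-account-key.json") as key_file:
         content = key_file.read()

     # json.dumps returns a quoted, escaped string; stripping the surrounding
     # quotes leaves text that can be pasted between the triple quotes of
     # credentials_content, where Python's escape rules undo the escaping.
     print(json.dumps(content)[1:-1])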
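
And for step 7, a minimal spider sketch. The spider name, start URL, and the .pdf filter are placeholders for illustration; adjust them to the site you are scraping.

     import scrapy

     class FilesSpider(scrapy.Spider):
         # Placeholder name and start URL; change them for your target site.
         name = "files"
         start_urls = ["https://example.com/downloads"]

         def parse(self, response):
             # Collect absolute URLs of the linked files and hand them to
             # FilesPipeline via the special 'file_urls' field.
             urls = [
                 response.urljoin(href)
                 for href in response.css("a::attr(href)").getall()
                 if href.endswith(".pdf")
             ]
             yield {"file_urls": urls}

Once the crawl finishes, the downloaded files appear in the GCS bucket under the FILES_STORE path, and each yielded item gains a files field describing what was stored.
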
Written on April 8, 2020