Scrapy and persistent cookie manager middleware

Scrapy is a nice Python framework for web scraping, i.e. extracting information from web sites automatically by crawling them. It works best for anonymous data discovery, but nothing stops you from using authenticated sessions as well. In fact, Scrapy transparently manages cookies, which are usually what tracks a user session. Unfortunately, those sessions don't survive between runs. This can be fixed quite easily by adding a custom cookie middleware. Here is an example:

    from __future__ import absolute_import

    import logging
    import os.path
    import pickle

    from scrapy.downloadermiddlewares.cookies import CookiesMiddleware

    import settings  # the project's settings module; must define COOKIES_STORAGE_FILE


    class PersistentCookiesMiddleware(CookiesMiddleware):
        """Cookie middleware that saves the cookie jars to disk and reloads them on startup."""

        def __init__(self, debug=False):
            super(PersistentCookiesMiddleware, self).__init__(debug)
            self.load()

        def process_response(self, request, response, spider):
            # TODO: optimize so that we don't do it on every response
            res = super(PersistentCookiesMiddleware, self).process_response(request, response, spider)
            self.save()
            return res

        def getPersistenceFile(self):
            return settings.COOKIES_STORAGE_FILE

        def save(self):
            logging.debug("Saving cookies to disk for reuse")
            with open(self.getPersistenceFile(), "wb") as f:
                pickle.dump(self.jars, f)
                f.flush()

        def load(self):
            filename = self.getPersistenceFile()
            logging.debug("Trying to load cookies from file '{0}'".format(filename))
            if not os.path.exists(filename):
                logging.info("File '{0}' for cookie reload doesn't exist".format(filename))
                return
            if not os.path.isfile(filename):
                raise Exception("File '{0}' is not a regular file".format(filename))

            # Restore the cookie jars pickled by save() in a previous run
            with open(filename, "rb") as f:
                self.jars = pickle.load(f)

Then configure your spider to use the new middleware in settings.py:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
        'middlewares.cookies.PersistentCookiesMiddleware': 701,
    }
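
Setting the built-in CookiesMiddleware to None disables it, and 701 slots the replacement in right next to the default cookie middleware priority of 700. Since getPersistenceFile() reads settings.COOKIES_STORAGE_FILE, that setting also has to exist; the path below is only an illustrative choice:

    # settings.py
    COOKIES_STORAGE_FILE = 'cookies.pkl'  # any writable path will do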

Comments

I suggest putting the following link at the end of this thread:
http://stackoverflow.com/questions/20748475/how-to-add-custom-spider-download-middlewares-to-scrapy

Your comment should be good enough to preserve the link :)

Hi,

Why did you add this line?

# TODO: optimize so that we don't do it on every response

I thought it was obvious - right now the persistence file is written to on every call. This might not be very efficient, especially if the cookies don't change often. One way to deal with it would be to keep a cache of what was last written and write only when the value is new. This is left as an exercise for the reader :)
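
For what it's worth, a minimal sketch of that caching idea (not part of the original post; the class name is made up, it reuses the imports and the PersistentCookiesMiddleware class from the snippet above, and it simply compares the pickled bytes against what was last written) might look like this:

    class CachingPersistentCookiesMiddleware(PersistentCookiesMiddleware):
        _last_saved = None  # pickled snapshot of the jars from the most recent write

        def save(self):
            data = pickle.dumps(self.jars)
            if data == self._last_saved:
                # Nothing changed since the last write, so skip the disk I/O.
                return
            logging.debug("Saving cookies to disk for reuse")
            with open(self.getPersistenceFile(), "wb") as f:
                f.write(data)
            self._last_saved = data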
