This blog post is the first in a series about using C++, Python and OpenCV to train a classifier to detect red trousers (it’s a fairly arbitrary choice of feature to detect - I have no strong feelings against them!). In this post, I’ll explain how you can use Python’s Scrapy module to acquire training data by building a spider to extract image URLs, and the Image Pipeline functionality to download the images. The site in question is “”, which contains around 200 images of people wearing red trousers. Whether or not this is enough images to train our classifier remains to be seen!

To start off with, download and install Scrapy. This can normally be done using pip install scrapy, or perhaps easy_install scrapy if you don’t have pip installed (you should; it’s better!).

Once installed, the scrapy tool should be accessible from the command line (if not, there’s probably a problem with your PATH environment variable, but that’s outside the scope of this post). We can use this tool to help write the XPath selector we need to access the relevant images. XPath is an XML query language that can be used to select elements from an XML tree. In this case, we’re interested in image nodes held within divs of the class ‘entry-content’.

By running the command scrapy shell, we’re dropped into an interactive Scrapy session (which is really just a jazzed up Python shell), through which we can query the DOM.

As mentioned, we are interested in divs of class entry-content, as these would seem to hold the content for each post. The Scrapy shell exposes an object called response, which contains the HTML of the response received from the web server, along with several methods that we can use to query said HTML. You’d need to consult other resources to learn all of the XPath syntax, but in this case the query '//*[contains(@class, "entry-content")]' will match every node whose class attribute contains entry-content (‘//*’ matches all nodes, and these nodes are then filtered by class value using '[contains(@class, "entry-content")]').

To query the HTML in our Scrapy session, we can run: response.selector.xpath('//*[contains(@class, "entry-content")]'). You should see a big list of Scrapy Selectors, 50 in total. Now that we have a selector to match the post divs, we need to be able to extract the image URLs. As it turns out, in this case the img tags aren’t directly inside the “entry-content” divs. They are held within another div (with the class separator), which contains an a node, which then contains the img node we’re interested in. Phew.

With a little bit of trial and error, the final XPath selector that we can use to grab the src attribute from the relevant images is: '//*[contains(@class, "entry-content")]/div[contains(@class, "separator")]/a/img/@src'.
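To make the nesting that this selector walks a bit more concrete, here is a small sketch using only Python’s standard library. The HTML fragment and the example.com URLs are made up for illustration, and because ElementTree’s limited XPath support has no contains() function, the class filtering is spelled out in plain Python rather than run as the query itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment mimicking the nesting described above.
SAMPLE = """
<html><body>
  <div class="post-body entry-content">
    <div class="separator"><a href="#"><img src="http://example.com/a.jpg"/></a></div>
    <div class="separator"><a href="#"><img src="http://example.com/b.jpg"/></a></div>
    <img src="http://example.com/not-in-separator.jpg"/>
  </div>
  <div class="sidebar">
    <div class="separator"><a href="#"><img src="http://example.com/sidebar.jpg"/></a></div>
  </div>
</body></html>
"""


def has_class(element, name):
    # Mirrors contains(@class, name): a plain substring test on the attribute.
    return name in element.get("class", "")


def image_urls(root):
    urls = []
    for node in root.iter():                      # '//*'
        if not has_class(node, "entry-content"):  # [contains(@class, "entry-content")]
            continue
        for div in node.findall("div"):           # /div
            if has_class(div, "separator"):       # [contains(@class, "separator")]
                for img in div.findall("a/img"):  # /a/img
                    urls.append(img.get("src"))   # /@src
    return urls


urls = image_urls(ET.fromstring(SAMPLE))
print(urls)
```

Only the two images nested as entry-content, separator, a, img are returned. The image sitting directly inside the post div and the one in the sidebar are skipped, which is the behaviour we want from the real selector.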

Now we can create our Scrapy project. To do this, simply run "scrapy startproject trousers". This will copy over some boilerplate code that we can use to create the spider. You should now have a folder called “trousers”, inside which is another folder named “trousers”. Open up the file named “”, and change the item class to look like the following:

class TrousersItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

We need to use these specific field names, as Scrapy’s ImagesPipeline (which we can use to do the actual image downloading) expects them. We will populate image_urls with data extracted with our XPath query, and the ImagesPipeline will populate images with details of each downloaded file, such as its original URL and the path it was saved to.

Now that we have a class to hold data about our images, we can write the Spider that tells Scrapy how it should acquire the images. Create a new file (I called mine, you can name it anything) in the spiders/ directory, containing the following:

from scrapy import Spider, Request

# Import from the project package created by "scrapy startproject trousers".
from trousers.items import TrousersItem


class TrouserScraper(Spider):
    name = "Trousers"
    start_urls = [""]

    def parse(self, response):
        # Yield one item per image URL found inside the post bodies.
        for image in response.selector.xpath('//*[contains(@class, "entry-content")]/div[contains(@class, "separator")]/a/img/@src'):
            yield TrousersItem(image_urls=[image.extract()])
        # Follow the "Older Posts" link, if there is one, to handle pagination.
        for url in response.selector.xpath("//*[contains(@class, 'blog-pager-older-link')]/@href"):
            yield Request(url.extract(), callback=self.parse)

In this file, we create a new class (inheriting from Scrapy’s Spider class) and define a name, starting URL and parse method for our spider. The parse method loops over each match from our XPath query, yielding a new TrousersItem for each one. It also looks for the “Older Posts” hyperlink and, if one exists, yields a Request for it with parse as the callback, so the next page is processed in exactly the same way. This is an easy way of dealing with pagination.
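If the way a single parse method can emit both items and follow-up requests feels a bit magical, the toy model below may help. It is not Scrapy code: the three-page FAKE_SITE, the ("item", ...) and ("request", ...) tuples and the crawl function are all invented stand-ins for the real items, Request objects and scheduler.

```python
from collections import deque

# Hypothetical three-page site: each page lists image URLs and an optional
# "older posts" link. These names are invented for the example.
FAKE_SITE = {
    "page1": {"images": ["a.jpg", "b.jpg"], "older": "page2"},
    "page2": {"images": ["c.jpg"], "older": "page3"},
    "page3": {"images": ["d.jpg"], "older": None},
}


def parse(url):
    """Yield ('item', src) for each image, then ('request', url) for the next page."""
    page = FAKE_SITE[url]
    for src in page["images"]:
        yield ("item", src)
    if page["older"] is not None:
        yield ("request", page["older"])


def crawl(start_url):
    """Toy stand-in for the scheduler: run parse once on every requested page."""
    items, seen, queue = [], {start_url}, deque([start_url])
    while queue:
        for kind, value in parse(queue.popleft()):
            if kind == "item":
                items.append(value)
            elif value not in seen:  # skip pages that were already requested
                seen.add(value)
                queue.append(value)
    return items


collected = crawl("page1")
print(collected)
```

The point of the sketch is that parse never calls itself directly. It just hands back a description of the next page to visit, and whatever is driving the crawl decides when to run it. The crawl ends naturally when a page has no older link.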

As we want to download the matching images, we can use the ImagesPipeline feature in Scrapy. To enable it, modify the “” file in the trousers subdirectory, adding the following two lines (inserting a valid path in place of /Path/To/Your/Chosen/Directory):

# On Scrapy 1.0 and later, use 'scrapy.pipelines.images.ImagesPipeline' instead.
ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}
IMAGES_STORE = '/Path/To/Your/Chosen/Directory'

The Image Pipeline will now consume the TrousersItem objects we yield from our TrouserScraper.parse() method, downloading their images to the IMAGES_STORE folder.

To run the spider, execute the command “scrapy crawl Trousers”, cross your fingers, and check the IMAGES_STORE directory for a shed load of images of people wearing red trousers!
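Since the number of images matters for training later on, it can be handy to count what actually landed on disk. The following is a minimal sketch, assuming the images were saved as ordinary .jpg, .jpeg or .png files somewhere under your chosen directory. The count_images helper and the list of suffixes are my own additions for illustration, not part of Scrapy.

```python
from pathlib import Path

# Assumed set of file extensions to treat as images.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images(store):
    """Count image files anywhere under the given directory."""
    root = Path(store)
    if not root.is_dir():
        return 0
    return sum(
        1 for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


if __name__ == "__main__":
    # Same placeholder path as in the settings; replace it with your own.
    print(count_images("/Path/To/Your/Chosen/Directory"))
```

The search is recursive, so it doesn’t matter whether the pipeline writes the files into a subfolder of the directory you chose.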


If you receive errors about PIL not being available, it’s because you’re missing the Python Imaging Library. Running pip install Pillow should sort the problem out.

In the next post, we’ll build our Cascade Classifier using our scraped images, and start detecting some red trousers!