Using Giphy to generate random numbers

While chatting on Slack one day, we got onto the subject of generating random numbers in Python, and how we might go about it without the random module (or os.urandom()). Of course, there are many possible ways to achieve this, but Matthew Nunes jokingly suggested using Giphy as a source of random data.

I decided to give this a bash — it seems to work! Obviously, it’s very silly and shouldn’t be used for anything at all important. It was more of a “just because” exercise…
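For the curious, here is a minimal sketch of the idea (not the original script), using the requests library and Giphy's public random-GIF endpoint. The API key is a placeholder, and the exact response fields may vary slightly:

import hashlib
import requests

GIPHY_API_KEY = "YOUR_API_KEY"  # placeholder: a real Giphy API key goes here

def giphy_random_int(bits=32):
    # Ask Giphy for a random GIF, download it, and hash its bytes for "randomness".
    meta = requests.get("https://api.giphy.com/v1/gifs/random",
                        params={"api_key": GIPHY_API_KEY}).json()
    gif_url = meta["data"]["images"]["original"]["url"]
    gif_bytes = requests.get(gif_url).content
    digest = hashlib.sha256(gif_bytes).digest()
    # Use the top `bits` bits of the digest as an integer.
    return int.from_bytes(digest, "big") >> (256 - bits)

print(giphy_random_int())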

Downloading Files Quickly with aria2

At University, I have been blessed with a gigabit Ethernet connection, which is great for downloading large datasets, ISOs and the like. However, I often find that the bandwidth of the server from which I am downloading a file is the limiting factor, meaning I cannot always max out the connection.

After some searching, I came across the tool aria2c, which has quickly become my wget replacement. Aria2 is a cross-platform tool that can download a file over multiple connections at once, letting you take full advantage of CDNs and load balancing.

Where you might normally run the command:

wget <URL>

the aria2 equivalent is:

aria2c -x4 <URL>

This tells aria2 to use 4 concurrent connections to download the file.

Aria2 supports more than just the http/https protocols: it comes with support for SFTP, BitTorrent and Metalink, and automatically detects the correct connection type based on the URL scheme. Additionally, aria2 can be controlled remotely over an RPC API.
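For example (a rough sketch, assuming aria2c has been started with --enable-rpc and is listening on its default port of 6800), a new download can be queued over the JSON-RPC interface from Python; the URL below is just a placeholder:

import json
import requests

# Queue a new download on a locally running "aria2c --enable-rpc" instance.
payload = {
    "jsonrpc": "2.0",
    "id": "1",
    "method": "aria2.addUri",
    "params": [["https://example.com/some-large-file.iso"]],  # placeholder URL
}
response = requests.post("http://localhost:6800/jsonrpc", data=json.dumps(payload))
print(response.json())  # the "result" field holds the GID of the queued download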

As per the documentation, aria2 can also be used to download batches of files. By placing a list of URLs in a text file, one per line, and then calling aria2c -i urls_filename.txt, aria2 will chomp through and download each entry.

All major platforms are supported, including most flavours of Linux, OS X, Windows and Android.

Under Ubuntu/Debian, aria2 can be installed with:

sudo apt-get install aria2

under CentOS/Fedora/Scientific Linux with:

sudo yum install aria2

or under OS X (using brew) with:

brew install aria2

For other platforms, see the guide listed on the aria2 website.

Research into Sonification of Video/Depth Data (University Dissertation)

I have recently completed my Undergraduate Degree in Computer Science at Cardiff University. My final year project was on the topic of “Video to Audio Conversion for the Visually Impaired”.

The project was quite broad, research-heavy, and in an area that I had little experience in – so it was quite a learning experience!

Using an Asus Xtion camera to retrieve both RGB and depth information from the environment, I experimented with ways of extracting shapes from the footage (in real time), extracting various properties from these shapes (including Hu invariant moments and elliptical Fourier coefficients), using these properties to calculate shape similarity, and conveying this information in the form of audio.
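As a flavour of what was involved, here is a minimal Python/OpenCV sketch (not the dissertation code, and the function names are my own) that extracts the largest contour from a greyscale frame, computes its Hu invariant moments, and compares two shapes. Constant names assume a reasonably recent OpenCV:

import cv2
import numpy as np

def largest_contour(gray):
    # Threshold the frame and keep the biggest external contour by area.
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2:]
    return max(contours, key=cv2.contourArea)

def hu_signature(contour):
    # Seven Hu invariant moments, log-scaled so they are comparable in magnitude.
    hu = cv2.HuMoments(cv2.moments(contour)).flatten()
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-12)

def shape_similarity(c1, c2):
    # OpenCV's built-in Hu-moment-based comparison (lower means more similar).
    return cv2.matchShapes(c1, c2, cv2.CONTOURS_MATCH_I1, 0.0)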

I intend to write some posts detailing these individual topics, but in the meantime, if you’re interested, my dissertation can be downloaded here. As I mentioned, it’s a fairly broad, whistle-stop tour of the approaches I took in attempting to solve the problem posed, and is by no means a “ground truth”.

I’m returning to Cardiff in October as a PhD Student, working on another computer-vision related project.

Three Information Disclosure Vulnerability

A few weeks ago, I got an email from Three asking me to fill out a survey for them, rating my satisfaction with their services. They offered “the chance to win an iPad”, so I decided I’d fill in the survey to provide some feedback (I’m generally a fairly satisfied customer).

The link opened in my default web browser (Firefox), which happened to be linked up to Burp. After filling in and submitting the survey, I was able to view the requests that Firefox had made during the process, along with the responses. Looking through these, I noticed something quite worrying.

The site (now closed) was making an AJAX request, over cleartext HTTP, to an API endpoint whose URL contained my 3 phone number. The response included my 3 account number, my full name, my email address and some other account identifier. I confirmed that this was the case for other numbers by entering a friend’s phone number (with their permission); sure enough, their name and contact details were presented to me. Information from the API was returned in the following form, as JSON:

{"success":true,"id":"813xxx","user_id":"9555xxxxxx","phone":"447598xxxxxx","email":"xxxxxx@yyyyy.zzz","title":"Mr","name":"Joseph","surname":"Redfern","email_vs_sms":"xxx","timestamp":"2015-05-xx xx:xx:xx"}

Clearly, this information disclosure isn’t ideal. The ability to find out the account holder’s name and contact details behind ANY 3 phone number could come in handy for social engineering attacks, stalking, spamming and so on. It would also be possible to build up a database of 3 customers by brute-forcing the API using H3G number prefixes, which can be found here; such a database could be very valuable to Three’s competitors, marketing companies and the like. I’d consider it a fairly severe breach of privacy.

The bizarre thing is that the survey didn’t appear to use any of the information returned by the API: the thank-you page made no reference to my name, email address or account number.

I reported the issue to Three customer support, and requested that I be notified once their security team had acknowledged the issue. Customer Support said that they’d pass the request on, but that they couldn’t promise anything – sadly, they didn’t bother to get back to me (and I didn’t even win their competition!). The survey has now been taken down, along with the offending API. I can’t be sure if this was in response to these issues, or if the closure was planned – but either way, this no longer seems to be a problem.

Video Demo:

OpenCV Feature Detection – Part #1: Acquiring Red Trousers

This blog post is the first in a series about using C++, Python and OpenCV to train a classifier to detect red trousers (a fairly arbitrary choice of feature to detect; I have no strong feelings against them!). In this post, I’ll explain how you can use Python’s Scrapy module to acquire training data by building a spider to extract image URLs, and Scrapy’s Images Pipeline functionality to download the images. The site in question contains around 200 images of people wearing red trousers. Whether or not this is enough images to train our classifier remains to be seen!

To start off with, download and install Scrapy. This can normally be done using pip install scrapy, or perhaps easy_install scrapy if you don’t have pip installed (you should; it’s better!).

Once installed, the ‘scrapy’ tool should be accessible from the command line (if not, there’s probably a problem with your PATH environment variable, but that’s outside the scope of this post). We can use this tool to help write the XPath selector we need to be able to access the relevant images. XPath is an XML query language that can be used to select elements from an XML tree; in this case, we’re interested in image nodes held within divs of the class ‘entry-content’.

By running ‘scrapy shell’ against the site’s URL, we’re dropped into an interactive Scrapy session (which is really just a jazzed-up Python shell), through which we can query the DOM.

As mentioned, we are interested in divs of class ‘entry-content’, as these would seem to hold the content for each post. The Scrapy shell exposes an object called ‘response’, which contains the HTML of the response received from the web server, along with several methods that we can use to query said HTML. You’d need to consult other resources to learn all of the XPath syntax, but in this case, the query '//*[contains(@class, "entry-content")]' will match every node whose class attribute contains "entry-content" (‘//*’ matches all nodes; these nodes are then filtered by class value using '[contains(@class, "entry-content")]').

To query the HTML in our Scrapy session, we can run: response.selector.xpath('//*[contains(@class, "entry-content")]'). You should see a big list of Scrapy Selectors, 50 in total. Now that we have a selector to match the post divs, we need to be able to extract the image URLs. As it turns out, the <img> tags aren’t directly inside the “entry-content” divs; they are held within another div (with the class “separator”), which contains an a node, which in turn contains the img node we’re interested in. Phew.

With a little bit of trial and error, the final XPath selector that we can use to grab the src attribute from the relevant images is: '//*[contains(@class, "entry-content")]/div[contains(@class, "separator")]/a/img/@src'.
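Back in the shell, the full selector can be tried out and the matching URLs pulled out as plain strings, along these lines (a sketch of the session rather than verbatim output):

# Still inside the scrapy shell: extract the image URLs as a list of strings.
image_urls = response.selector.xpath('//*[contains(@class, "entry-content")]'
                                     '/div[contains(@class, "separator")]/a/img/@src').extract()
print(len(image_urls))  # roughly one entry per red-trouser image on the page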

Now we can create our Scrapy project. To do this, simply run "scrapy startproject trousers". This will copy over some boilerplate code that we can use to create the spider. You should now have a folder called “trousers”, inside which is another folder named “trousers”. Open up the file named “items.py”, and change the “TrousersItem” class to look like the following:

import scrapy

class TrousersItem(scrapy.Item):
    image_urls = scrapy.Field()  # populated by our spider with scraped URLs
    images = scrapy.Field()      # populated by the Images Pipeline after download

We need to use these specific field names, as Scrapy’s ImagesPipeline (which we can use to do the actual image downloading) expects them. We will populate image_urls with data extracted using our XPath query, and the ImagesPipeline will populate images with details of the downloaded files.

Now that we have a class to hold data about our images, we can write the spider that tells Scrapy how it should acquire them. Create a new file in the spiders/ directory (the file name doesn’t matter), containing the following:

from scrapy import Spider, Request
from trousers.items import TrousersItem

class TrouserScraper(Spider):
    name, start_urls = "Trousers", [""]  # start_urls should hold the blog's URL

    def parse(self, response):
        # Yield an item for each matching image URL on the page.
        for image in response.selector.xpath('//*[contains(@class, "entry-content")]/div[contains(@class, "separator")]/a/img/@src'):
            yield TrousersItem(image_urls=[image.extract()])
        # Follow the "Older Posts" link (if present), parsing that page in the same way.
        for url in response.selector.xpath("//*[contains(@class, 'blog-pager-older-link')]/@href"):
            yield Request(url.extract(), callback=self.parse)

In this file, we are creating a new class (inheriting from Scrapy’s Spider class), and defining a name, a starting URL and a parse method for our spider. The parse method loops over each element matched by our XPath query, yielding a new TrousersItem for each image URL. It also finds the hyperlink behind the “Older Posts” link (if such a link exists), and yields a Request for it with itself as the callback. This is an easy way of dealing with pagination.

As we want to download the matching images, we can use the ImagesPipeline feature in Scrapy. To enable it, modify the “settings.py” file in the trousers subdirectory, adding the following two lines (inserting a valid path in place of /Path/To/Your/Chosen/Directory):

ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}
IMAGES_STORE = '/Path/To/Your/Chosen/Directory'

The ImagesPipeline will now consume the TrousersItems we yield from our TrouserScraper.parse() method, downloading the referenced images to the IMAGES_STORE folder.

To run the spider, execute the command “scrapy crawl Trousers”, cross your fingers, and check the IMAGES_STORE directory for a shed load of images of people wearing red trousers!


If you receive errors about PIL not being available, it’s because you’re missing the Python Imaging Library. Running ‘pip install Pillow’ should sort the problem out.

In the next post, we’ll build our Cascade Classifier using our scraped images, and start detecting some red trousers!