Capturing Captchas – How to plan a data science project

Hi ya all!

It has been a really long time since my last blog post. But now it is time for some Python and Machine Learning. In this blog series, I will tell you about reading and processing captchas and how I planned this data science project.

What are Captchas?

Captchas were built to test whether a human or a machine is using an online service. Usually, they are a combination of letters which have been visually blurred, transformed or otherwise made hard to read. As we all know, the human mind is a fascinating thing and is still able to recognise the different letters and make guesses about them. A machine should not be able to read a captcha and can therefore not know what is written.

Emphasis on should, because by now machines can, and teaching a computer to do it is exactly what I wanted to do when starting this project.

About the project

When I say I, what I really mean is we 🙂 I did this project for school and I worked together with a school mate on this. I have her permission to write this blog, and I hope she likes it!

The Data

There are several sources for pictures of captchas which are open and free to use. There are also different kinds of captchas. Some are considered to be easier than others. For this project we went for an easy dataset of captchas, which you can find at this link.

Divide and Conquer

I downloaded the data and saved it in a folder. Now that we have the material to work with, we need to decide what to do.

If you look at a problem as a whole, it often looks overwhelming. But as soon as you break it down into little pieces, it becomes feasible. This is often referred to as “divide and conquer”.

Idea of how to proceed

It is clear that we will work with Python, as this is the language we are working with at school. As we are working with pictures here, a convolutional neural network (CNN) seems to be a good choice to teach the computer to read captchas.

The captchas in our dataset have four letters. It does not make sense to train the net on the captchas as a whole. Rather, we should train it on the single letters. Therefore we will have to cut the captchas into four single letters.
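To make the cutting idea concrete, here is a minimal sketch of the simplest possible approach: slicing the image into four equally wide pieces. It assumes the letters are evenly spaced, which may not hold for the real data, and the function name `split_into_letters` and the image size are made up for illustration.

```python
import numpy as np


def split_into_letters(image, n_letters=4):
    """Cut an image array (height, width) into n_letters equal-width slices."""
    width = image.shape[1]
    # n_letters + 1 evenly spaced column boundaries from 0 to width.
    bounds = np.linspace(0, width, n_letters + 1).astype(int)
    return [image[:, bounds[i]:bounds[i + 1]] for i in range(n_letters)]


# Stand-in for a real captcha: a blank 24 x 72 grayscale array (assumed size).
captcha = np.zeros((24, 72), dtype=np.uint8)
letters = split_into_letters(captcha)
print([letter.shape for letter in letters])
```

If the letters turn out to have different widths, fixed slices will cut through them, which is why looking for the gaps between letters is the more promising route.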

In this dataset the captchas are always written in black on a pale coloured background. An idea could be to try to convert this coloured image into a black and white one and then to erase all the non-black colours, such that the lines of the letters are very clear to recognise. Then it should be easier for the computer to see where to draw a line and separate the letters.
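As a rough sketch of this idea, the snippet below turns a colour image into a mask of "ink" pixels and then lists the columns that contain no ink at all, which are candidates for where to cut. The threshold of 60 and the helper names `binarize` and `empty_columns` are assumptions for illustration, not values we tuned on the dataset.

```python
import numpy as np


def binarize(rgb, threshold=60):
    """Return a boolean mask that is True where a pixel is dark enough to be ink."""
    # Average the colour channels to get one grayscale value per pixel.
    gray = rgb.astype(float).mean(axis=2)
    return gray < threshold


def empty_columns(ink):
    """Return the indices of columns without any ink (candidate cut positions)."""
    return np.flatnonzero(ink.sum(axis=0) == 0)


# Toy image: pale background (value 220) with two black vertical strokes.
image = np.full((5, 8, 3), 220, dtype=np.uint8)
image[:, 1:3] = 0  # first "letter"
image[:, 5:7] = 0  # second "letter"

ink = binarize(image)
print(empty_columns(ink))
```

On real captchas, letters may touch or overlap, so the empty columns alone might not be enough to find all three cuts.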

When we have the single letters, we need to know which letter they are representing. So we need to map the labels to the different letters.
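One way this mapping could look, assuming (hypothetically) that each file is named after the text it shows, for example `KXBD.png`, and that only uppercase letters occur. Both are assumptions for this sketch, so check how your dataset actually stores its labels.

```python
import string
from pathlib import Path

# Assumption: the captchas only use the 26 uppercase letters.
ALPHABET = string.ascii_uppercase
CHAR_TO_INDEX = {char: index for index, char in enumerate(ALPHABET)}


def labels_from_filename(filename):
    """Turn a hypothetical file name like 'KXBD.png' into one integer per letter."""
    text = Path(filename).stem.upper()
    return [CHAR_TO_INDEX[char] for char in text]


print(labels_from_filename("KXBD.png"))
```

The i-th label then belongs to the i-th letter image cut from that captcha, counting from the left.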

Last but not least, we also need to make sure that the pictures of the letters all have the same size. This is a prerequisite for using a CNN in Keras.
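A simple way to get equal sizes is to paste every letter onto a fixed canvas. The sketch below centres each letter on a 32 x 32 canvas; the canvas size, the fill value and the name `pad_to_size` are assumptions chosen for illustration.

```python
import numpy as np


def pad_to_size(letter, height=32, width=32, fill=0):
    """Centre a 2-D letter image on a fixed-size canvas filled with `fill`."""
    h, w = letter.shape
    if h > height or w > width:
        raise ValueError("letter is larger than the target canvas")
    canvas = np.full((height, width), fill, dtype=letter.dtype)
    top = (height - h) // 2
    left = (width - w) // 2
    canvas[top:top + h, left:left + w] = letter
    return canvas


# Two stand-in letters of different sizes.
narrow = np.ones((20, 10), dtype=np.uint8)
wide = np.ones((24, 18), dtype=np.uint8)
batch = np.stack([pad_to_size(narrow), pad_to_size(wide)])
print(batch.shape)
```

Padding keeps the shape of the letter intact, whereas resizing would stretch narrow letters. Which one works better for the net is something to try out.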

Having prepared the data, we can separate the dataset into a training, test and validation set and finally start to work on the CNN.
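The split itself needs nothing more than a shuffle and some slicing. Here is a minimal sketch; the 70/15/15 proportions, the fixed seed and the name `split_dataset` are assumptions, not the values we settled on.

```python
import random


def split_dataset(samples, train=0.7, validation=0.15, seed=42):
    """Shuffle the samples and split them into train, validation and test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever is left becomes the test set
    )


# Stand-in for 100 (image, label) pairs.
samples = list(range(100))
train_set, val_set, test_set = split_dataset(samples)
print(len(train_set), len(val_set), len(test_set))
```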

From the above, we roughly had the following steps in mind:

Step 1

Figure out what to do. This is what we are doing right now!

Step 2

Find out how to work with pictures and separate one captcha into four different letters.

Step 3

Repeat step 2 for all the captchas and label the letters in the process. Meanwhile, also make sure they all have the same size.

Step 4

Define, compile, fit and evaluate the CNN. This is an iterative process.

Step 5

Make predictions! (and write a blogpost about it ;))

I think we now have an outline of the project and will proceed with Step 2 next time!

Thank you so much for reading!


