Learning Captchas – Training the CNN

Happy Sunday!

Today we dive into step 4 of the Captcha Project! Always been keen on implementing a convolutional neural network? Then stay tuned!

Neural Networks

A confession is in order: I am not really good with Neural Networks.

In January I started to read the Deep Learning Book, which is available online for free. A month later school started again, and during the first few weeks we learned about neural networks. This is also why we did the captchas as a school project.

So be kind if my neural network is really not the best. Whenever I reread my posts a while later I cringe, because what I did a few weeks or months ago now looks so “newbie”. But hey, you have to start somewhere, and this is also something that I want to show on this blog. No master ever fell from the sky. And it is okay to make mistakes, as long as you learn from them.

Why a convolutional neural network?

Why did I go for a convolutional neural network (CNN) and not for a recurrent one? The data we are working with are pictures, which we translate into matrices that store the individual pixels. The positions of the pixels are relevant. Just think about it: if we shuffled the positions of all the pixels in a picture, we as humans would no longer be able to read the letters.

A CNN takes the positions into account. For other networks, such as a plain fully connected one, the pixels are just plain data, and whether they come in this order or in a permutation of it neither matters nor changes the result.
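
To make this a bit more tangible, here is a small numpy-only toy example (not part of the captcha code). It assumes a random 6×6 "picture" and a made-up helper called conv2d_valid. A single fully connected neuron gives the same output when pixels and weights are shuffled in the same way, while a sliding 3×3 filter produces a different feature map once the pixels are scrambled.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((6, 6))           # toy "picture" with random pixel values
perm = rng.permutation(image.size)   # one fixed reordering of the 36 pixels

# Fully connected neuron: one weight per pixel, position plays no role.
weights = rng.random(image.size)
dense_out = image.reshape(-1) @ weights
# Shuffle pixels and weights in the same way: the output stays the same.
dense_out_perm = image.reshape(-1)[perm] @ weights[perm]

def conv2d_valid(img, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h, out_w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(img[r:r + kh, c:c + kw] * kernel)
    return out

kernel = rng.random((3, 3))
scrambled = image.reshape(-1)[perm].reshape(image.shape)

# Same result for the dense neuron, different feature maps for the filter
print(np.isclose(dense_out, dense_out_perm))
print(np.allclose(conv2d_valid(image, kernel), conv2d_valid(scrambled, kernel)))
```

The filter only ever looks at neighbouring pixels, so destroying the neighbourhoods changes what it sees. That is the property we want for letters.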

Setting a Baseline

But before implementing a CNN it is advisable to set a baseline to which we can compare the CNN we are creating.

We already did that in the last post. Check it out if you haven’t yet.

On to the CNN

More data prep

I think I mentioned it in one of my earlier posts: 80% of the work is data preparation. Although we already did a lot in this department, there are still a few steps missing before we can hand over the data to Keras.

When implementing the baseline we saw that the data is not balanced. This will most likely also be an issue for the CNN.

Further, we need to make sure that we pass the correct shape for each picture to Keras. Otherwise, the CNN won’t fit. Keras expects a three-dimensional array for each picture. When OpenCV reads our image, we get a two-dimensional result because we read it in grayscale. So we need to add a third “empty” dimension to our images. We can do this by changing one line of code.

We can reuse the logic for resampling and balancing the data which we use for the baseline.

import os
import cv2
import numpy as np
from imutils import paths  # paths.list_images is assumed to come from imutils
from sklearn.utils import resample

# Path to where the single letter images are
einfach_letters = r'...\Einfach_letters'

# initialize the data and labels
data = []
labels = []

files = os.listdir(einfach_letters)

for f in files:
    # loop through the letters
    data_sub = []
    labels_sub = []
    for image_file in paths.list_images(einfach_letters+'\\'+f):
        # loop over the input images
        # Load the image
        image = cv2.imread(image_file, cv2.IMREAD_GRAYSCALE)

        # Reshape to later use in Random Forest
        #image = np.asarray(image).reshape(-1)

        # Add a dimension to the image to process it in Keras later
        image = np.expand_dims(image, axis=2)

        # Get the name of the letter based on the folder it was in
        label = image_file.split(os.path.sep)[-2]

        # add to the sub lists
        data_sub.append(image)
        labels_sub.append(label)

    if len(data_sub) < 110:
        # fewer than 110 pictures for this letter: bootstrap up to 110
        boot = resample(data_sub, replace = True, n_samples = 110, random_state = 1)
        boot_label = labels_sub[0]
        for i in boot:
            data.append(i)
            labels.append(boot_label)
    else:
        # assumption: letters with enough pictures are added unchanged
        for i in data_sub:
            data.append(i)
            labels.append(labels_sub[0])

Then we should make a quick check on the balance. We can do this the same way we did before.

#check whether the dataset is balanced
import pandas as pd
label_bal = pd.DataFrame(labels)
# count how many pictures there are per letter
print(label_bal[0].value_counts())

At the moment both data and labels are lists. We need to transform them into n-dimensional arrays, and numpy can help us with that. If you have not imported numpy at the beginning, you can do it now and then use the following lines. If you are not yet familiar with numpy, the numpy documentation on n-dimensional arrays is a good place to start.

import numpy as np
data = np.array(data)

The shape should now be (2090, 60, 60, 1). 2090 because we now have 19*110 pictures, each has 60×60 pixels and we added the third dimension to each image to later process in Keras.
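
If you want to convince yourself of that shape without loading any pictures, here is a minimal sketch that uses blank stand-in images. The names n_letters, n_per_letter and data_demo are made up for this example.

```python
import numpy as np

n_letters, n_per_letter = 19, 110

# Stand-in for the real pictures: blank 60x60 grayscale images
images = [np.zeros((60, 60), dtype=np.uint8) for _ in range(n_letters * n_per_letter)]

# Same step as in the loading loop: add a trailing "empty" dimension
images = [np.expand_dims(img, axis=2) for img in images]

# A list of equally shaped arrays becomes one 4-dimensional array
data_demo = np.array(images)
print(data_demo.shape)  # (2090, 60, 60, 1)
```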


Last but not least we need to tend to the labels. Just as with the random forest, Keras wants to get numeric values for labels. We can again copy the label conversion from before.

alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
           'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u',
           'v', 'w', 'x', 'y', 'z']
dic = dict(zip(alphabet, list(range(1,len(alphabet)+1))))
dicRev = dict(zip(list(range(1,len(alphabet)+1)),alphabet))
new_labels = [dic[v] for v in labels]
new_labels = np.asarray(new_labels)
new_labels = new_labels.reshape(-1,1)

After this preparation, the new_labels are already an array and not a list anymore.

Keras specific pre-processing

We just did the labels a few seconds ago. Actually, Keras does not only want numerical values, it also wants the labels to come in a matrix, or more precisely, one-hot encoded.


This really just stands for a format for the labels. At the moment, each letter is represented by the number it has in the alphabet. Meaning “a” is shown as 1, “b” is shown as 2 and so on.

Keras now wants us to pass something like this to it:

observation nr   a   b   c   d   ...
1                1   0   0   0   0
2                0   1   0   0   0
3                0   0   1   0   0

(Yay, please admire my html/css skills I used for the table above ;))

from sklearn import preprocessing

## One Hot Encoder from sklearn preprocessing transforms an array-like of integers or strings
## creates a binary column for each category and returns a sparse matrix or dense array

##then we can apply the OneHotEncoding
## note: from scikit-learn 1.2 on this argument is called sparse_output instead of sparse
ohe = preprocessing.OneHotEncoder(sparse = False)
labels_ohe = ohe.fit_transform(new_labels)

Actually, now that I look at it, sklearn expects numeric values too for the OneHotEncoding *imagine the monkey covering its eyes emoji here*. So the last conversion did actually make sense 🙂
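
To see what the encoding does without any library magic, here is a numpy-only sketch. It is not sklearn's OneHotEncoder, and the helper name one_hot is made up. It creates one column per label that actually occurs and puts a 1 in the column of each observation's label.

```python
import numpy as np

def one_hot(int_labels):
    """Return (classes, matrix) with one column per distinct label."""
    flat = np.asarray(int_labels).reshape(-1)
    # classes: sorted distinct labels, positions: column index of each observation
    classes, positions = np.unique(flat, return_inverse=True)
    matrix = np.eye(len(classes))[positions]
    return classes, matrix

# 1 = "a", 2 = "b", 3 = "c", in the same column format as new_labels
classes, matrix = one_hot([[1], [2], [3], [1]])
print(classes)
print(matrix)
```

In this sketch only labels that occur get a column. With 19 different letters in the data, that gives 19 columns, which fits the 19 output nodes of the network later on.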

Shuffling the data

This is not Keras specific, but I only want to do it now. We need to shuffle the data again. We can copy and paste this from the baseline and adjust it a little.

## because all the letters are sorted right now we need to shuffle the dataset a bit

## numpy has a shuffle method
permutation = np.arange(len(data)) #n = number of pictures/length of data
np.random.shuffle(permutation) #shuffles the indices in place

## use the same shuffled indices for pictures and labels so they stay aligned
data_shuffled = [data[i] for i in permutation]
labels_ohe_shuffled = [labels_ohe[i] for i in permutation]
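
The important detail when shuffling is that pictures and labels are reordered with the same indices. Here is a minimal sketch with toy arrays. The names toy_data and toy_labels and the seed are assumptions for this example only.

```python
import numpy as np

rng = np.random.default_rng(1)        # fixed seed so the shuffle is repeatable

toy_data = np.arange(10) * 10         # stand-in for the pictures: 0, 10, ..., 90
toy_labels = np.arange(10)            # stand-in for the labels: 0, 1, ..., 9

# One random order, applied to both arrays
order = rng.permutation(len(toy_data))
toy_data_shuffled = toy_data[order]
toy_labels_shuffled = toy_labels[order]

# Every "picture" still sits next to its own label
print(np.array_equal(toy_data_shuffled, toy_labels_shuffled * 10))
```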

Training, test and validation sets

For the baseline we only had training and test sets. Now with the CNN we will add a validation set. When we set up the neural network later, you will see why. Also, I do it in a very complicated fashion… but hey, it’s not dumb if it works 😉

## now we can define a training, test and validation set
n = len(data)
test_size = 1/3
val_size = 1/5
train_size = 2/3

X_train = data_shuffled[0:round(n*train_size)]
Y_train = labels_ohe_shuffled[0:round(n*train_size)]

X_val = data_shuffled[round(n*val_size)*2:round(n*val_size)*2+round(n*val_size)]
Y_val = labels_ohe_shuffled[round(n*val_size)*2:round(n*val_size)*2+round(n*val_size)]

X_test = data_shuffled[round(n*train_size):n]
Y_test = labels_ohe_shuffled[round(n*train_size):n]
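
One thing to be aware of: with n = 2090 the validation slice above runs from position 836 up to 1254, which lies inside the training slice (0 up to 1393). So the validation pictures are also used for training. The test set is not affected by this. If you prefer three sets that do not overlap, a sketch could look like the following. The helper name three_way_split and the 60/20/20 proportions are assumptions of this example, not what was used for the results in this post.

```python
def three_way_split(items, train_frac=0.6, val_frac=0.2):
    """Cut a shuffled sequence into three consecutive, non-overlapping parts."""
    n = len(items)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return items[:train_end], items[train_end:val_end], items[val_end:]

# Demonstrated on plain indices; the same slicing works for the shuffled lists
indices = list(range(2090))
train_idx, val_idx, test_idx = three_way_split(indices)
print(len(train_idx), len(val_idx), len(test_idx))  # 1254 418 418
```

Calling the helper once for the pictures and once for the labels keeps both in step, because they were shuffled with the same indices.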

Finally putting the network together

After all this preparation, we can finally start with the real deal. First, we need modules out of the Keras package. Also, we will set the random seed in TensorFlow to make our results reproducible.

import matplotlib.pyplot as plt
import matplotlib.image as imgplot

import tensorflow as tf
# set the TensorFlow seed (TF 1.x API; the value 1 is an arbitrary choice)
tf.set_random_seed(1)

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, BatchNormalization
from keras.layers import Conv1D, Conv2D, Convolution2D, MaxPooling2D, Flatten
import keras
import sys
print("Keras {} TF {} Python {}".format(keras.__version__, tf.__version__, sys.version_info))

I also like to check the versions of Tensorflow and Keras. Some combinations of versions work better than others. I am using Keras 2.2.4 and Tensorflow 1.12.0.

# Build the neural network!
model = Sequential()

# First convolutional layer with max pooling
model.add(Conv2D(20, (6, 6), padding="same", input_shape=(60, 60, 1), activation="relu"))
model.add(MaxPooling2D(pool_size=(3, 3), strides=(3, 3)))

# Second convolutional layer with max pooling
model.add(Conv2D(15, (3, 3), padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=(3, 3), strides=(3, 3)))

# Flatten the feature maps so the dense layers get one vector per picture
model.add(Flatten())

# Hidden layer with 300 nodes
model.add(Dense(300, activation="relu"))

# Output layer with 19 nodes (one for each possible letter we predict)
model.add(Dense(19, activation="softmax"))

For the first layer we need to pass the input shape of each picture to Keras. This is the (60, 60, 1) part. In the last layer, which is the output layer, we need to tell Keras how many classes there are. Since our data only contains 19 letters instead of the whole alphabet, we type 19.
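
If you wonder how big the pictures still are after the two convolution and pooling blocks, the arithmetic can be sketched in plain Python. It assumes that padding="same" keeps the width and height unchanged and that max pooling runs without padding (the Keras default). The helper name pooled_size is made up for this example.

```python
def pooled_size(size, pool, stride):
    """Output width of max pooling without padding."""
    return (size - pool) // stride + 1

size = 60                                                 # pictures are 60x60
after_first_block = pooled_size(size, pool=3, stride=3)   # conv keeps 60, pooling shrinks it
after_second_block = pooled_size(after_first_block, pool=3, stride=3)

# 15 filters in the second convolutional layer
flat_features = after_second_block * after_second_block * 15
print(after_first_block, after_second_block, flat_features)  # 20 6 540
```

So once the feature maps are flattened, the hidden layer with 300 nodes receives 540 values per picture.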

The different layers were found by trial and error. For now, that is all I can say on how to find a good neural network. Once I have a little more experience, I might attempt to write a guideline on how to find a suitable network. Alternatively, you can also load pre-trained networks and use them. However, this is neither the time nor the place. For now, I was happy with this network.

Compile and evaluate the network

Last but not least we need to compile the network.

#Compile the network
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

Further, we also want to know how well our CNN is doing. This is why we created the validation set earlier, so we can use it when fitting the model.

#fit the network and check it on the validation set after each epoch
history=model.fit(np.array(X_train), np.array(Y_train), validation_data=(np.array(X_val), np.array(Y_val)), batch_size=32, epochs=10, verbose=1)

It is easier to see this in a chart…

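# visualize history
plt.plot(history.history['acc'])      # the key is 'accuracy' in newer Keras versions
plt.plot(history.history['val_acc'])  # the key is 'val_accuracy' in newer Keras versions
plt.title('model accuracy')
plt.legend(['train', 'valid'], loc='lower right')
plt.show()

# evaluate the network on the test set
model.evaluate(np.array(X_test), np.array(Y_test))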

The line above will return the loss and the metric. For my model it gave back [0.16969188620233466, 0.9555236721141438]. This means I have an accuracy of 96% or an error of 4%.
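
In case you wonder what these two numbers mean, here is a small numpy sketch that computes a categorical cross-entropy loss and an accuracy by hand. The three observations and their predicted probabilities are made up for this example and have nothing to do with the captcha data.

```python
import numpy as np

# Three made-up observations with three classes each (one-hot labels)
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]])
# Made-up softmax outputs: each row sums to 1
y_pred = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.5, 0.3, 0.2]])

# Categorical cross-entropy: minus the log of the probability given to the true class
loss = -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Accuracy: share of observations where the most probable class is the true one
accuracy = np.mean(np.argmax(y_pred, axis=1) == np.argmax(y_true, axis=1))

print(round(loss, 4), round(accuracy, 4))
```

The third observation is predicted wrongly, so the accuracy here is two out of three, and the low probability for its true class is what drives the loss up.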

Wow! We did a lot! I think a break is in order. In the next post we will wrap up the project and implement the function which will “predict” what the captcha stands for.

Hope you enjoyed reading!


