Capturing Captchas – Setting a baseline


Hiya!

This time we are finally applying some of those machine learning skills! Before we can start with the CNN, we should set a baseline. So let's start!

Why a baseline?

You might have asked this yourself. If you go back to the first post of this series, I did not mention a baseline in the planning. Mainly because I forgot 🙂

A baseline helps you evaluate how good your neural network is compared to the “normal” machine learning methods. You might sometimes find that a random forest is just as good as, or even better than, a deep learning method.

Setting the baseline

Choosing a baseline method

Assigning the correct letter to each of the images is a classification problem. For these tasks I personally quite like random forests, so I will use a random forest as our baseline.

Toolbox

The Python package sklearn already provides all the methods and functions we need to train a random forest. In case you do not have sklearn in your Anaconda distribution yet, open the Anaconda Prompt and type the following:

conda install scikit-learn

In our script, we will use the following command to import sklearn:

import sklearn

Actually, we will only need the RandomForestClassifier. If you are not familiar with it, you can read more about it in the sklearn documentation.

from sklearn.ensemble import RandomForestClassifier

I think those are all the tools that have not come up in one of the previous blog posts. If you find a package missing, let me know in the comments below.
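
One more note: the data-loading snippets below also use OpenCV (cv2), NumPy, and the paths helper from the imutils package. If any of these are missing from your Anaconda environment, pip can install all three (package names as on PyPI):

pip install opencv-python numpy imutils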

Data Prep

Last time we ended with slicing the captchas into four images, each containing one letter. We saved them into a folder called Einfach_letters and created subfolders for each letter. Meaning all the “a”s we cut out from the captchas are stored in a folder named “a”, all the “b”s we cut out from the captchas are stored in a folder named “b”, and so on…

So first we need to read in those letters and their labels, which are given by the folder path, and put them into a shape the random forest will like.

import os
import cv2
import numpy as np
from imutils import paths

# Path to the folder where the letter images are, was output_folder in the old script
einfach_letters = r'...\Einfach_letters'

# initialize the data and labels
data = []
labels = []

images = paths.list_images(einfach_letters)

# loop over the input images
for image_file in images:
    # Load the image in grayscale
    image = cv2.imread(image_file, cv2.IMREAD_GRAYSCALE)

    # flatten the 2-D image into a 1-D vector to later use in the random forest
    image = np.asarray(image).reshape(-1)

    # Get the name of the letter based on the folder it was in
    label = image_file.split(os.path.sep)[-2]

    # Add the letter image and its label to our training data
    data.append(image)
    labels.append(label)
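
One assumption hiding in this snippet: reshape(-1) only produces feature vectors of equal length if every letter image has the same dimensions. If your slices vary in size, a quick cv2.resize before flattening takes care of it; the 30x30 target below is an arbitrary choice of mine, not something from the original script:

# hypothetical fix: force every slice to the same size before flattening
image = cv2.resize(image, (30, 30))
image = image.reshape(-1)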

If you are working with the same data as I do, you will see that we do not get all the letters of the alphabet. Apparently, the captchas we have do not use all of them. In real life this would be a big issue and we would have to make sure to have a “full” dataset. As this was for a school project and mainly served the purpose of getting some experience, I let it slip this time 😉
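
If you are curious which letters actually show up in your data, a quick one-liner on the labels list tells you:

# list the letters that actually occur in the dataset
print(sorted(set(labels)))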

Label conversion

The labels for each letter are – letters. Duh! The random forest does not work with letters; it wants numeric input. So we need to convert the letters to numbers. Many roads lead to Rome, this is how I did it:

alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
            'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u',
            'v', 'w', 'x', 'y', 'z']
# map each letter to a number (a -> 1, b -> 2, ...) and keep the
# reverse mapping to translate predictions back into letters later
dic = dict(zip(alphabet, range(1, len(alphabet) + 1)))
dicRev = dict(zip(range(1, len(alphabet) + 1), alphabet))
# sklearn expects the labels as a flat 1-D array, so no extra reshape is needed
new_labels = np.asarray([dic[v] for v in labels])
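
For the record, sklearn ships a LabelEncoder that does this mapping (and its reverse) for you. A minimal sketch; note that it numbers the classes from 0 instead of 1:

from sklearn.preprocessing import LabelEncoder

# letters -> 0..25, learned from the labels themselves
encoder = LabelEncoder()
new_labels = encoder.fit_transform(labels)
# predictions can later be turned back into letters with
# encoder.inverse_transform(predictions)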

Shuffle shuffle

Because all the letters are alphabetically sorted and processed in that order, I want to shuffle the dataset. This way I can make sure that the training and the test data, which we will generate shortly, both contain all the (available) letters.

## because all the letters are sorted right now we need to shuffle the dataset a bit

## numpy has a shuffle method
permutation = np.arange(len(data))
np.random.shuffle(permutation)

data_shuffled = [data[i] for i in permutation]
labels_shuffled = [new_labels[i] for i in permutation]
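
By the way, if you want the shuffle to be reproducible between runs, you can draw the permutation from a seeded NumPy generator instead; a small, behaviourally equivalent sketch:

# a seeded generator gives the same shuffle on every run
rng = np.random.default_rng(1)
permutation = rng.permutation(len(data))

data_shuffled = [data[i] for i in permutation]
labels_shuffled = [new_labels[i] for i in permutation]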

Training and Test Set

This is the only point where I am not 100% sure. For a random forest I would divide the dataset into a training and a test set. For a neural network, I learned that we need a training, a test AND a validation set. I am not sure, however, whether they need to be the same for both models in order to be really comparable. Well… I will just go ahead and think about this when we are working on the CNN.

# create training and test set: the first 70% for training, the rest for testing
split = int(len(data) * 0.7)

x_train = data_shuffled[:split]
y_train = labels_shuffled[:split]
x_test = data_shuffled[split:]
y_test = labels_shuffled[split:]
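
For completeness: sklearn also has a helper that does the shuffling and splitting in one go and could replace both snippets above. This is a sketch, not what I used; the stratify argument keeps the letter proportions similar in both sets:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    data, new_labels, test_size=0.3, random_state=1, stratify=new_labels)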

Building the forest

Now the data is in the right size and shape to be handed over to a random forest. We already imported the classifier from the sklearn package.

First we need to instantiate the random forest and define its parameters.

rf = RandomForestClassifier(n_estimators = 100, max_depth = 3, random_state = 1)

n_estimators defines how many trees are grown and max_depth defines how “deep” each tree may go. random_state sets a seed for the random number generator; if you leave it out, every run will use a different seed and give slightly different results.

I just started with some low numbers. Now, I can fit the random forest to our captcha data. I can do this in one line.

rf.fit(x_train, y_train)

Of course, I also want to know how well my forest is performing. This again takes no more than one line.

rf.score(x_test, y_test)

The score returns the mean accuracy.
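
The accuracy alone hides how the forest does on the individual letters. If you are curious, sklearn's classification_report breaks the result down per class:

from sklearn.metrics import classification_report

# per-letter precision, recall and f1-score on the test set
y_pred = rf.predict(x_test)
print(classification_report(y_test, y_pred))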

Is this a good baseline?

When I let it run on my data, I first got around 35% accuracy. In my opinion, this is not a very good value. I then tweaked the parameters a bit, meaning I made the trees deeper and grew more of them, and achieved around 72%. But I was still not happy with the result.
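
If you do not want to tweak the parameters by hand, sklearn's GridSearchCV can try the combinations for you. A minimal sketch; the parameter grid here is just an example, not the values I tested:

from sklearn.model_selection import GridSearchCV

# try every combination in the grid with 3-fold cross-validation
param_grid = {'n_estimators': [100, 200], 'max_depth': [3, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=3)
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)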

One possible reason for the mediocre accuracy is that our dataset is not balanced, meaning the different classes do not have an equal amount of training data. A quick check shows that there really is an imbalance. For the check I used the following lines of code:

#check whether the dataset is balanced
import pandas as pd
label_bal = pd.DataFrame(labels)
label_bal[0].value_counts()

Rebalancing the dataset

Seeing these results, we can try to rebalance the dataset. We can either put a cap on the number of images per letter, or we can bootstrap our dataset.
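
As a side note, there is also a shortcut that does not touch the data at all: RandomForestClassifier accepts a class_weight parameter, and setting it to 'balanced' re-weights the classes inversely to their frequency. I did not use it here, but it is a one-line alternative:

# alternative: let sklearn re-weight the classes instead of resampling
rf = RandomForestClassifier(n_estimators=100, max_depth=3,
                            random_state=1, class_weight='balanced')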

Limit the data

If we want to limit the data, we can do so in the step where we read in the data, by looping over the letter folders one at a time and adding a counter and a small if condition:

# initialize the data and labels
data = []
labels = []

files = os.listdir(einfach_letters)

for f in files:
    # loop through the letter folders
    i = 0
    for image_file in paths.list_images(os.path.join(einfach_letters, f)):
        # keep at most 52 images per letter
        if i < 52:
            # Load the image in grayscale
            image = cv2.imread(image_file, cv2.IMREAD_GRAYSCALE)

            # flatten to later use in the random forest
            image = np.asarray(image).reshape(-1)

            # Get the name of the letter based on the folder it was in
            label = image_file.split(os.path.sep)[-2]

            # Add the letter image and its label to our training data
            data.append(image)
            labels.append(label)
            i += 1

This balanced dataset is now obviously smaller than the one we had before. But I reached an accuracy of around 82%. This is okay.

Bootstrap the data

Another, preferable way is to bootstrap the available data, so that all the letters get a more or less equal amount of representation. Bootstrapping means that you may use an observation multiple times; which observations are used more than once is random.

Again I changed the part where we read the data. I used the same folder structure as before when limiting the data, and used sklearn's built-in resample function to “blow up” my dataset. I also needed some sublists to store my data in during the loops. But I think the code explains it much better than I do 🙂

from sklearn.utils import resample

# initialize the data and labels
data = []
labels = []

files = os.listdir(einfach_letters)

for f in files:
    # loop through the letter folders, collecting each letter separately
    data_sub = []
    labels_sub = []
    for image_file in paths.list_images(os.path.join(einfach_letters, f)):
        # Load the image in grayscale
        image = cv2.imread(image_file, cv2.IMREAD_GRAYSCALE)

        # flatten to later use in the random forest
        image = np.asarray(image).reshape(-1)

        # Get the name of the letter based on the folder it was in
        label = image_file.split(os.path.sep)[-2]

        # add to the per-letter sublists
        data_sub.append(image)
        labels_sub.append(label)

    if len(data_sub) < 110:
        # too few images for this letter: resample with replacement until we have 110
        boot = resample(data_sub, replace=True, n_samples=110, random_state=1)
        boot_label = labels_sub[0]
        for i in boot:
            data.append(i)
            labels.append(boot_label)
    else:
        # enough images already: take them as they are
        for i in data_sub:
            data.append(i)
            labels.append(labels_sub[0])

With this “bigger” dataset I got an accuracy of over 93%. For the random forest I used max_depth = 10 and n_estimators = 200.

The end of the baseline

This is how I implemented the baseline. Along the way I also learned some important tricks to balance a dataset. The neural network will also prefer a balanced dataset for classification, so the lines of code above will come in handy later on.

I hope you enjoyed reading the post! If you have any suggestions on improving my code or on how to better implement a baseline, I am all ears and eyes 🙂

Best,

Blondie
