This time we are finally applying some of those machine learning skills! Before we can start with the CNN, we should set a baseline. So let’s start!
Why a baseline?
You might have asked this yourself. If you go back to the first post of this series, I did not mention a baseline in the planning. Mainly because I forgot 🙂
A baseline helps you evaluate how good your neural network is compared to the “normal” machine learning methods. You might otherwise never know whether the extra effort of a neural network actually pays off.
Setting the baseline
Choosing a baseline method
Assigning the correct letter to each of the images is a classification problem. For such tasks I personally quite like random forests, so I will use a random forest as the baseline.
The Python package
We will use scikit-learn, which can be installed with conda:
conda install scikit-learn
In our script we only need the RandomForestClassifier, which we can import with the following command. If you are not familiar with random forests, you can read more about them here.
from sklearn.ensemble import RandomForestClassifier
I think those are all the tools which we might not have mentioned in one of the blog posts before. If you find a package missing, let me know down below in the comments.
Last time we ended with slicing the captchas into four images, each containing one letter. We saved them into a folder called Einfach_letters.
So first we need to read in those letters and their labels, which are now given by the folder path, and put them into a shape the random forest will like.
# path to the folder where the letter images are; this was output_folder in the old script
# (cv2, numpy, os and the imutils paths helper were introduced in the earlier posts)
import os
import cv2
import numpy as np
from imutils import paths

einfach_letters = r'...\Einfach_letters'

# initialize the data and labels
data = []
labels = []

images = paths.list_images(einfach_letters)

# loop over the input images
for image_file in images:
    # load the image in grayscale
    image = cv2.imread(image_file, cv2.IMREAD_GRAYSCALE)
    # flatten the image to a 1D vector to later use in the random forest
    image = np.asarray(image).reshape(-1)
    # get the name of the letter based on the folder it was in
    label = image_file.split(os.path.sep)[-2]
    # add the letter image and its label to our training data
    data.append(image)
    labels.append(label)
If you are working with the same data as I have, you will see that we do not get all the letters of the alphabet. Apparently, the captchas we have do not use all the letters. In a real-world application, you would want to make sure that your training data covers every class you expect to see later.
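A quick way to check which letters actually occur is to look at the unique values in the labels list we just built; a one-line sketch:

print(sorted(set(labels)))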
The labels for each letter are – letters. Duh! The random forest does not work with letters; it wants numeric input. So we need to convert the letters to numbers. Many roads lead to Rome, but this is how I did it:
alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
            'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

# map each letter to a number (and back)
dic = dict(zip(alphabet, list(range(1, len(alphabet) + 1))))
dicRev = dict(zip(list(range(1, len(alphabet) + 1)), alphabet))

new_labels = [dic[v] for v in labels]
new_labels = np.asarray(new_labels)
new_labels = new_labels.reshape(-1, 1)
Because all the letters are alphabetically sorted and processed in this order, I want to shuffle the dataset. This way I can make sure that the training and the test data, which we will generate shortly, contain all the (available) letters.
## because all the letters are sorted right now we need to shuffle the dataset a bit
## numpy has a shuffle method
permutation = np.arange(len(data))
np.random.shuffle(permutation)

data_shuffled = [data[i] for i in permutation]
labels_shuffled = [new_labels[i] for i in permutation]
Training and Test Set
So this is the only point where I am not 100% sure. For a random forest, a simple 70/30 split into training and test data seemed reasonable to me:
# create training and test set (70/30 split)
x_train = data_shuffled[0:int(len(data) * 0.7)]
y_train = labels_shuffled[0:int(len(data) * 0.7)]
x_test = data_shuffled[int(len(data) * 0.7):len(data)]
y_test = labels_shuffled[int(len(data) * 0.7):len(data)]
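As a side note: scikit-learn also ships a helper that shuffles and splits in one step. This is not what I used above, but it should give an equivalent split:

from sklearn.model_selection import train_test_split

# shuffles by default, so the manual permutation step is not needed
x_train, x_test, y_train, y_test = train_test_split(data, new_labels, test_size=0.3, random_state=1)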
Building the forest
Now the data is in the right size and shape to be handed over to a random forest. We already imported the RandomForestClassifier from the sklearn package.
First we need to instantiate the random forest and define its parameters.
rf = RandomForestClassifier(n_estimators = 100, max_depth = 3, random_state = 1)
The n_estimators parameter defines how many trees are calculated and max_depth defines how “deep” they go. The random_state is set so that the results are reproducible.
I just started with some low numbers. Now, I can fit the random forest to our captcha data. I can do this in one line.
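In scikit-learn that line is the standard fit call; I pass the labels through np.ravel so sklearn does not warn about their column-vector shape:

rf.fit(x_train, np.ravel(y_train))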
Of course, I also want to know how well my forest is performing. This again takes no more than one line.
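This is scikit-learn’s built-in score method, evaluated on the test set:

rf.score(x_test, np.ravel(y_test))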
The score returns the mean accuracy.
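If you want to look at individual predictions, the dicRev dictionary from the label encoding step translates the numeric output back into letters; a small sketch:

predictions = rf.predict(x_test)
predicted_letters = [dicRev[p] for p in predictions]
print(predicted_letters[:10])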
Is this a good baseline?
When I let it run with my data, I first got around 35% accuracy. In my opinion, this is not a very good value. I then tweaked the parameters a bit, meaning I made the trees deeper and used more of them. With that I achieved around 72%. But still, I was not happy with the result.
One possible reason for the mediocre accuracy could be that our dataset is not balanced, meaning the different classes do not have an equal amount of data for training. A quick check shows us that there really is no balance. For the check I used the following lines of code:
# check whether the dataset is balanced
import pandas as pd

label_bal = pd.DataFrame(labels)
label_bal.value_counts()
Rebalancing the dataset
Seeing these results, we can try to rebalance the dataset. We can either put a limit on the number of images per letter, or we can bootstrap our dataset.
Limit the data
If we want to limit the data, we can do so in the step where we read in the data, by slightly changing the path, introducing a counter, and adding a small if condition:
# initialize the data and labels
data = []
labels = []

files = os.listdir(einfach_letters)
for f in files:  # loop through the letter folders
    i = 0
    # loop over the input images of this letter
    for image_file in paths.list_images(einfach_letters + '\\' + f):
        if i <= 52:
            # load the image
            image = cv2.imread(image_file, cv2.IMREAD_GRAYSCALE)
            # flatten the image to later use in the random forest
            image = np.asarray(image).reshape(-1)
            # get the name of the letter based on the folder it was in
            label = image_file.split(os.path.sep)[-2]
            # add the letter image and its label to our training data
            data.append(image)
            labels.append(label)
            i = i + 1
This balanced dataset is now obviously smaller than the one we had before. But I reached an accuracy of around 82%. This is okay.
Bootstrap the data
Another, and preferable, way is to bootstrap the available data so that all the letters have a more or less equal representation. Bootstrapping means that you use an observation multiple times; which observation is used more than once is random.
Again I changed the part where we read the data. I used the same folder structure as before when limiting the data and used the resample function from sklearn.utils:
from sklearn.utils import resample

# initialize the data and labels
data = []
labels = []

files = os.listdir(einfach_letters)
for f in files:  # loop through the letter folders
    data_sub = []
    labels_sub = []
    # loop over the input images of this letter
    for image_file in paths.list_images(einfach_letters + '\\' + f):
        # load the image
        image = cv2.imread(image_file, cv2.IMREAD_GRAYSCALE)
        # flatten the image to later use in the random forest
        image = np.asarray(image).reshape(-1)
        # get the name of the letter based on the folder it was in
        label = image_file.split(os.path.sep)[-2]
        # add the image and its label to the sub-lists for this letter
        data_sub.append(image)
        labels_sub.append(label)
    if len(data_sub) < 110:
        # resample this letter with replacement until we have 110 samples
        boot = resample(data_sub, replace=True, n_samples=110, random_state=1)
        data.extend(boot)
        # every image in this folder has the same label
        labels.extend([labels_sub[0]] * len(boot))
    else:
        data.extend(data_sub)
        labels.extend(labels_sub)
With this “bigger” dataset I got an accuracy of over 93%. For the random forest I used max_depth = 10 and n_estimators = 200.
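For reference, that is the same classifier as above, just with the tweaked parameters:

rf = RandomForestClassifier(n_estimators = 200, max_depth = 10, random_state = 1)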
The end of the baseline
This is how I implemented the baseline. Along the way I also learned some important tricks to balance a dataset. The neural network also prefers a balanced dataset when doing classification, so the lines of code above will come in handy later on.
I hope you enjoyed reading the post! If you have any suggestions on improving my code or on how to better implement a baseline, I am all ears and eyes 🙂