Slicing Captchas – How to process images

Hiya!

Welcome to part two of the Captcha project. In the last blog post we laid out the plan for processing captchas and making predictions on them. In this second part, we will look at how to turn a captcha into four separate letter images!

Fill our toolbox

Image processing

First things first: I will work with Python, so I need a package which can handle images. A very quick internet search showed me that OpenCV might do the trick.

If you are working with the Anaconda distribution, you can get it by opening the Anaconda Prompt and typing

pip install opencv-python 

To import the package in your Python script, type

import cv2

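Further, we will need to resize the images we are reading. For this task, we will use misc from the package scipy. This should already be in your Conda distribution. Note that misc.imresize has been removed from recent SciPy releases, so on a current installation cv2.resize from OpenCV can do the same job. To use misc, put the following into your code: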

from scipy import misc

Deep Learning Packages

As we are filling our toolbox right now, we might as well get the necessary Deep Learning packages. I will work with Keras and TensorFlow. Keras is a Deep Learning API and is built on TensorFlow. Hence I installed TensorFlow before I installed Keras, though I am not sure whether this order is really necessary. You can do this by using

conda install tensorflow
conda install keras

Little helper

One last package which will come in handy is the glob package. Glob finds all path names matching a specific pattern. This helps, for example, if you have Word files and PDFs in the same folder but only want to pick out the PDFs. glob is part of the Python standard library, so there is nothing to install. The same goes for os, which we will use to build file paths.
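To make the PDF example a bit more concrete, here is a small sketch. The function name list_files_with_extension and the file names are made up for the demo, and it works in a temporary folder so nothing on your disk is touched.

```python
import glob
import os
import tempfile


def list_files_with_extension(folder, extension):
    """Return the sorted paths in `folder` whose names end with `extension`.

    Hypothetical helper; the pattern matching itself is done by glob.glob.
    """
    pattern = os.path.join(folder, "*" + extension)
    return sorted(glob.glob(pattern))


with tempfile.TemporaryDirectory() as demo_folder:
    # Create a few empty files with mixed extensions
    for name in ("report.pdf", "notes.docx", "slides.pdf"):
        open(os.path.join(demo_folder, name), "w").close()

    pdf_paths = list_files_with_extension(demo_folder, ".pdf")
    pdf_names = [os.path.basename(path) for path in pdf_paths]

print(pdf_names)
```

Only the two PDF files end up in the list; the Word file is ignored because it does not match the pattern.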

Slicing

Getting captchas and labels

The necessary tools are ready! So let’s begin. As you might have seen when downloading the images, each captcha is divided into label and picture. The label is a text file with four letters and the pictures are JPG files. So for each captcha, we have to process two files and match them together. I saved all these pictures into one folder on my computer and named it “Einfach”, as we are working with the simpler type of captchas.

Hence I want to loop through this folder and always take the text and jpg file with the same name and process them as captcha and the corresponding label. I want to divide the captcha into four separate images and divide the label into four corresponding pieces as well.

Storage space

This means we need to tell Python where the information is coming from and where to put it once the captchas have been processed. We define the image_folder and the output_folder:

image_folder = r'...\Einfach'
output_folder = r'...\Einfach_letters'

For each JPG in the image folder, we want to know the path. This is why we imported glob 🙂

image_files = glob.glob(os.path.join(image_folder, '*.jpg'))
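Once we have the image paths, each one has to be matched with its label file. Below is a rough sketch of that pairing. It assumes the label file carries the same name as the image with the ".jpg" ending stripped, which is how my loop further down treats it. The helper names label_path_for and read_label and the demo file name are hypothetical.

```python
import os
import tempfile


def label_path_for(image_path):
    """Derive the label path from an image path by dropping the extension.

    Assumption: "folder/0.1234.jpg" has its label in "folder/0.1234".
    """
    base, _extension = os.path.splitext(image_path)
    return base


def read_label(label_path, length=4):
    """Read at most `length` characters from the label file."""
    with open(label_path) as label_file:
        return label_file.read(length)


with tempfile.TemporaryDirectory() as demo_folder:
    image_path = os.path.join(demo_folder, "0.1234.jpg")
    open(image_path, "w").close()  # empty stand-in for a real captcha

    # Write a label with trailing characters to show they are ignored
    with open(label_path_for(image_path), "w") as label_file:
        label_file.write("abcd\nextra")

    demo_label = read_label(label_path_for(image_path))

print(demo_label)
```

Using os.path.splitext only removes the final extension, so a dot earlier in the file name stays untouched.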

Slices

With an idea in mind on how to loop through the captchas, we also want to slice them into four letters. OpenCV can help us with this. This package can read in images, manipulate them and find the area a shape occupies.

#Load the image and convert it to greyscale 
image = cv2.imread(image_file) 
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

#convert the image to only black and white using threshold
ret, thresh = cv2.threshold(gray, 90, 200, cv2.THRESH_BINARY_INV) 
ret, thresh2 = cv2.threshold(thresh, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
thresh3 = cv2.threshold(thresh2, 0, 255, cv2.THRESH_BINARY_INV)[1]

#find the contours
contours, hierarchy = cv2.findContours(thresh3.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)

With these code pieces we already get quite far. For some captchas the letters might stand quite close together and OpenCV might recognise them as one letter. See the captcha below. The two w’s will most likely be considered as one letter.

To prevent this we build in a condition. If the width of the contour we found is bigger than x times the height of the contour, we will just cut it in the middle. This might not be a very clean way, but it is a reasonable fail-safe. The value for x was found by simple trial and error; in my loop it ended up as 1.9.
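The cut-in-the-middle rule can be sketched as a tiny function. The name split_wide_box is hypothetical, and the default ratio of 1.9 is simply the value that worked for my captchas, so you may need a different one.

```python
def split_wide_box(box, max_ratio=1.9):
    """Split a bounding box into two halves if it is suspiciously wide.

    `box` is (x, y, w, h). If w / h exceeds `max_ratio`, the box probably
    holds two letters, so it is cut in the middle. Otherwise it is
    returned unchanged. Always returns a list of boxes.
    """
    x, y, w, h = box
    if w / h > max_ratio:
        half_width = w // 2
        return [(x, y, half_width, h), (x + half_width, y, half_width, h)]
    return [box]


wide_result = split_wide_box((10, 5, 40, 20))    # ratio 2.0, gets split
normal_result = split_wide_box((10, 5, 20, 20))  # ratio 1.0, stays whole
print(wide_result)
print(normal_result)
```

Note that for an odd width the integer division drops one pixel column on the right, which is harmless for this purpose.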

Labels

At this point we still need to get the labels, i.e. the letters in the captcha. Therefore we open the corresponding text file and read in only the first four characters, just in case there are more than four, which should not happen.

with open(letter_file) as f:
    captcha_correct_text = f.read(4)

Putting it together

This is what my loop looked like:

for (i, image_file) in enumerate(image_files):
    # The label file has the same name as the image without ".jpg" (e.g. "0.XXXX"); read the label
    letter_file = image_file.replace(".jpg", "")
    captcha_original_file_name = letter_file[-16:]
    with open(letter_file) as f:
        captcha_correct_text = f.read(4)
    
    #Load the image and convert it to greyscale
    image = cv2.imread(image_file)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    
    #Add some extra padding around the image
    gray = cv2.copyMakeBorder(gray, 8, 8, 8, 8, cv2.BORDER_REPLICATE)
    
    #convert the image to only black and white (using the threshold)
    
    ret, thresh = cv2.threshold(gray, 90, 200, cv2.THRESH_BINARY_INV)
    ret, thresh2 = cv2.threshold(thresh, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    thresh3 = cv2.threshold(thresh2, 0, 255, cv2.THRESH_BINARY_INV)[1]
        
    #find the contours
    contours, hierarchy = cv2.findContours(thresh3.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    
    letter_image_regions = []
    counts={}
  
    for contour in contours:        
        x, y, w, h = cv2.boundingRect(contour)
        #more than one letter in the image?
        if w / h > 1.9:
            half_width = int(w / 2)
            letter_image_regions.append((x, y, half_width, h))
            letter_image_regions.append((x + half_width, y, half_width, h))
        else:
            letter_image_regions.append((x, y, w, h))
            
    # If we did not find exactly 4 letters in the captcha, something went wrong, skip the captcha
    if len(letter_image_regions) != 4:
        continue
    
    # Sort the letter images by the x coordinate and match them with the right label
    letter_image_regions = sorted(letter_image_regions, key=lambda x: x[0])

    # Save out each letter as a single image
    for letter_bounding_box, letter_text in zip(letter_image_regions, captcha_correct_text):
        # Get the coordinates of the letter in the image
        x, y, w, h = letter_bounding_box

        # Extract the letter from the original image with a 2-pixel margin around the edge
        letter_image = thresh2[y - 2:y + h + 2, x - 2:x + w + 2]
        # For the network the pictures all need to have the same size
        # (cv2.resize is used here because misc.imresize is no longer available in recent SciPy releases)
        letter_image = cv2.resize(letter_image, (60, 60))
    
        # Get the folder to save the image in
        save_path = os.path.join(output_folder, letter_text)

        # if the output directory does not exist, create it
        if not os.path.exists(save_path):
            os.makedirs(save_path)

        # write the letter image to a file
        count = counts.get(letter_text, 1)
        p = os.path.join(save_path, "{}.png".format((str(count)+captcha_original_file_name).zfill(4)))
        cv2.imwrite(p, letter_image)

        # increment the count for the current key
        counts[letter_text] = count + 1
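The matching of boxes and label letters is the part where things can silently go wrong, because zip simply stops at the shorter input. Here is a small sketch of a stricter version. The function name match_letters is hypothetical, and it assumes that reading the boxes from left to right gives the same order as the letters in the label.

```python
def match_letters(regions, label, expected=4):
    """Pair each bounding box with its letter, left to right.

    Returns a list of (box, letter) tuples, or None if the number of
    boxes or letters differs from `expected`, so the caller can skip
    that captcha.
    """
    if len(regions) != expected or len(label) != expected:
        return None
    ordered = sorted(regions, key=lambda box: box[0])
    return list(zip(ordered, label))


unsorted_boxes = [(50, 0, 10, 20), (5, 0, 10, 20), (35, 0, 10, 20), (20, 0, 10, 20)]
pairs = match_letters(unsorted_boxes, "ab3d")
too_few = match_letters(unsorted_boxes[:3], "ab3d")
print(pairs)
print(too_few)
```

Returning None instead of a shortened list makes it obvious to the calling loop that this captcha should be skipped.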

Did you notice?

Oh my! Now that I am reading through the text above again, I just realized that I actually squeezed steps 2 and 3 into one blog post! Well, what is done is done 😉

How do you get there?

With the code above we can process the captchas as wanted. Hurray!

Did I write all of this code by myself?

No!

I googled a lot, read a lot on Stack Overflow, found people who had done similar things and looked at their code. In the end, this is just a huge jigsaw puzzle of all these pieces of information. Do you see a better, easier or faster way to do the same? Then let me know down below in the comments!

But this is the great thing about the internet and all these forums and open source projects. We get to learn and exchange. So here it comes again, a very big thank you to all the people out there who contribute and share on the internet!

I hope you enjoyed reading this! I would love to hear about your current projects!

See you soon!

Blondie
