Cat Vs Dog Classification Using Edge Detection And Other Techniques.

By Samuel Muiruri | April 5, 2019 | Machine Learning

I've been working on Kaggle's Cat vs Dog classification challenge and in the process learned a lot about how image classification works, especially how to build a good model.

Let's start with a common example: the MNIST dataset, where you classify handwritten digits.

This is how the digits look. When you process each digit it comes in at 28 x 28, meaning 28 pixels by 28 pixels.
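As a sketch of what those dimensions mean in practice (using a dummy NumPy array in place of the real MNIST download), each digit arrives as a 28 x 28 array, and Keras convolutional layers expect an extra channel axis:

```python
import numpy as np

# Stand-in for a batch of MNIST digits: 10 grayscale images of 28 x 28 pixels.
digits = np.random.randint(0, 256, size=(10, 28, 28), dtype=np.uint8)

# Conv2D layers expect a channel axis, so each image becomes 28 x 28 x 1,
# and pixel values are scaled from 0-255 down to 0-1.
digits = digits.reshape((-1, 28, 28, 1)).astype("float32") / 255.0
print(digits.shape)  # (10, 28, 28, 1)
```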

With such a small image it's been proven you can classify up to 10 different classes (the digits 0-9). Taking on cat vs dog classification, you'd expect that a relatively small image, though larger than MNIST's, would still be easy to classify, especially when you're working with only two classes.

If you've used Keras before and know enough about how machine learning works, you should find it easy to follow along. I'll briefly mention what the code does and why I implemented a test a certain way, but if you truly want a practical, hands-on introduction to machine learning, I recommend Jeff Heaton's course on YouTube. He's a professor at Washington University, and I particularly liked his approach of showing how to deal with data to the point where you're proficient enough to start tackling real data science problems, like the ones posted on Kaggle, on your own. Note this isn't the same as being an AI researcher looking for unique solutions to unique problems, such as coming up with a new algorithm that beats the ones used today; it's using what's available in tools like Keras to do what most data scientists can do with the same tools.

In this field a lot comes down to how quickly your code can make a prediction. You might have heard of YOLO (You Only Look Once), an image classification model that also works on video and can annotate objects in practically real time (you can't notice the delay). The second key point is accuracy, which means not missing an object you were looking for in an image and also reducing false positives, where the model misclassifies an object. Much of this you can fine-tune just by adjusting the sensitivity of your model via the variables you pass in.

If you've ever used Python before, it's going to feel a lot like importing a class and passing in the arguments that will be used by `__init__`.

You start by importing what you'll be using

os will be used to check the folders and list the image files.

cv2, which is OpenCV, will be used to read the images and return them as numpy arrays. Each pixel will be a value from 0 to 255, with 0 being black and 255 being white. If you use all the color channels (RGB), each channel has its own value in the same range, and like an RGB LED light, the three values mix to produce a different color.

From Keras, the imports are the pieces that will be used to define the model.
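The original import snippet was an image; a plausible reconstruction, assuming the layers and callbacks used later in the post, would be:

```python
import os          # walk the train/test folders and list the image files

import cv2         # OpenCV: read images and return them as numpy arrays
import numpy as np

# Keras pieces used to define and train the model
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.callbacks import EarlyStopping, TensorBoard
```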

At first, I started with an image of 200 x 200 dimensions and a model set up like this:
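The original model snippet was shown as an image; a plausible reconstruction of that first 200 x 200 model, assuming a single Conv2D block feeding dense layers, would look something like:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential()
# 64 filters, each sliding a 3 x 3 window over a 200 x 200 grayscale image
model.add(Conv2D(64, (3, 3), activation="relu", input_shape=(200, 200, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))   # randomly drop connections to fight overfitting
model.add(Flatten())
model.add(Dense(64, activation="relu"))
model.add(Dense(1, activation="sigmoid"))  # one output: cat (0) vs dog (1)
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```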

This never actually trained well enough. I was getting an accuracy of around 48-49%, and it hovered there, bobbing up and down, never stabilizing or showing hope of moving up. The reason this is bad is that it's just as good as an uneducated guess. Say I gave you an image and asked whether Robert Mugabe is in it, and assume you've never seen him: if I just asked you to guess, you'd have a 50% chance of being right. If I showed you another image of another person and asked the same question, the probability would still be the same. So this cat vs dog model didn't seem to learn what the difference was.

The issue was that this model was too small for a 200 x 200 image, and in this case adding Dropout wasn't helping either. What dropout does is randomly shut off some of the connections on each iteration to prevent overfitting, forcing the model to generalize. If your model is barely finding a pattern, dropout makes finding one even harder. The main issue was this: with a window of analysis of 3 by 3 pixels in a 200 by 200 pixel image, the window was about 66 times smaller than the image on each side, meaning you could stack 66 of these sliding windows vertically and horizontally. Considering how small the window is relative to the image, these small pattern-hunting windows would at best see an ear, whiskers, or a nose. The face of the cat, or even part of it, wouldn't fit into the window, and even for a human, being shown such a small window and asked to identify what's inside is pretty damn hard.

So I went down to 64 by 64 pixels, which makes the image about 3 times smaller on each side than the original. The smaller the difference between the window and the image, the better for your model. In the same respect, the smaller the dimensions of the image you feed in, the less time a single epoch takes, and you never run a model for a single epoch; you mostly run it for at least 20 epochs, with Early Stopping so that if the validation loss starts going up, training stops and you keep the model that had the lowest validation loss. By the same logic, the smaller the window you use to scan the image, the quicker a single epoch runs. Finally, the fewer units the model has, as in `model.add(Conv2D(64, (3, 3), input_shape=(200, 200, 1)))`, where 64 is the number of filters in the layer, the quicker the model runs.
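That Early Stopping behavior is a single Keras callback; a minimal sketch (the patience value of 3 is my assumption, not from the original post):

```python
from keras.callbacks import EarlyStopping

# Stop training when validation loss stops improving, then roll back
# to the best weights seen so far.
early_stop = EarlyStopping(
    monitor="val_loss",         # watch validation loss, not training loss
    patience=3,                 # allow 3 bad epochs before stopping (assumed)
    restore_best_weights=True,  # keep the model with the lowest val_loss
)

# Usage (sketch):
# model.fit(x_train, y_train, validation_split=0.2,
#           epochs=20, callbacks=[early_stop])
```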

It's tempting, with the hype around deep, powerful models, to go all out and start big, scaling down if need be, but that's not the best approach if you ask me. The big dogs who do this, companies like Google and NVIDIA, have two good reasons:

  1. They have crazy powerful servers on which small models run so quickly that they genuinely learn what the limits of the small models are and can scale up for better performance.
  2. They tackle interesting problems from the angle of having massive data, and correspondingly massive (if regulated) models, to get better results. Remember these are top-notch data scientists who built things like TensorFlow and at the very least understand its underlying logic well. So the need for a bigger model comes from hitting the bottleneck of attempts with a smaller one.

If you've been coding for a while you know the rule of good code: avoid complications. If you can do something with two lines of code, writing 20 lines of complicated code just to flex isn't a wise move.

Work from the angle that you're a lazy coder with high standards for your code; you'd be surprised how well that rule works. The only true exception is when the client or your boss knows just enough about code and, for their own reasons, wants big scary lines of code that flex on each other.

That's why you should watch Jeff Heaton's course: he doesn't assume you have a beefy machine with the latest NVIDIA GTX card ready to go. You could run this on just your average PC or laptop, and most of the examples will run, with about 2 exceptions that use up a lot of RAM, for which you might need at least 24GB. Take this rule to heart and you'll see the sense in approaching machine learning with the lazy-programmer attitude: I want my code to work, and work well, but I'm not waiting 2 hours per epoch unless I have to. If I can get 95% accuracy with 15 minutes of training per epoch, why not start there?

With this we can finally start on the logic the model will run on. If you're not new to machine learning, you already know the two most popular approaches to image classification: with or without color. You might be wondering why not use color; there are a few reasons.

  1. Your model might end up with a higher sensitivity to something else. One popular example: if you were training a model to differentiate between a wolf and a dog and all your wolf pictures were set in snow, it would pick snow as a key identifier of whether it's a wolf. In the case of cats vs dogs, if your data is biased toward cats of a single color, it might pick that as the identifier of a cat.
  2. A color image also takes longer to train on than a grayscale image.
  3. Given the same model, it usually never reaches a higher accuracy than a grayscale model.

I've gone a step further and, using OpenCV, reduced the image to only its edges, just to see how well that would work compared to the other models.

You can view the whole code in a Jupyter notebook on GitHub. The images come in two folders, `train` and `test`, and the train folder has 25,000 images.

So I load the images into a list, and as you can tell from the code snippet above, the filenames carry a `cat.` or `dog.` prefix to identify whether the image is a cat or a dog. Then, using OpenCV, I convert the image to edges. For example, here's a photo of a cat.
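Extracting the label from that filename prefix is a one-liner; a minimal sketch (the filenames below are examples in the `cat.<n>.jpg` / `dog.<n>.jpg` style the dataset uses):

```python
filenames = ["cat.0.jpg", "dog.0.jpg", "cat.1.jpg"]  # e.g. from os.listdir("train")

# Label each image from its filename prefix: 1 for dog, 0 for cat.
labels = [1 if name.startswith("dog.") else 0 for name in filenames]
print(labels)  # [0, 1, 0]
```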

Using the Canny edge detector you get these edges.

And this is its size when resized.

It's small, but I've still been able to get high accuracy levels.

There's TensorBoard, which, when you use Keras, plots the training of the model and is really useful when you want to figure out how different configurations of your model fare. I've worked with the three image styles, edges, grayscale and colored, each with the model using dropout or without dropout.
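Hooking TensorBoard into Keras training is one more callback; a sketch (the log directory name is my assumption):

```python
from keras.callbacks import TensorBoard

# Write training curves for this run to a per-configuration log directory,
# then inspect them with: tensorboard --logdir logs
tensorboard = TensorBoard(log_dir="logs/edges_with_dropout")

# Usage (sketch):
# model.fit(x_train, y_train, validation_split=0.2,
#           epochs=20, callbacks=[tensorboard])
```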

This is how the model performed with edges only

The model without dropout got a higher training accuracy (because of overfitting) of 85%, vs 79% with dropout.

If you look at the validation accuracy which is based on the validation set the model using dropout has 75% vs 74% without dropout.

That's the advantage of dropout.

That said the more important question is between the three models which one performed the best

and that's grayscale without dropout at 91%

and grayscale with dropout gets the highest on validation set accuracy at 83%.

You can also tell how long it took to train each model. With Early Stopping, which stops when the model starts to overfit based on the set tolerance level, edges took the least time at around 2 and 1/2 hours, followed by colored and then grayscale, depending on whether you're comparing with dropout. In general, adding dropout means the model avoids overfitting, so it takes longer to converge to the lowest loss, and will hopefully get a higher accuracy on the validation set, not the main training set.

That's it! That's all it takes to train a model. This is, however, more of an art than just normal code, so getting it just right requires intuition about what works and what doesn't, which is best learned by doing this often and, more importantly, by forming an intuition, even if wrong, of why one model is better than another. You fill in the holes of what you didn't know, learn a bit more as time goes by, and with time you should be able to confidently say you are a data scientist.