Recently I wanted to build object detection for a robot I'm working on that will detect electrical outlets and plug itself in. The robot needs to perform with a high level of accuracy: at least 99% success at each step of the way. One thing to remember about robot operations is that if a goal requires n steps and each one succeeds only 99% of the time, the overall success rate is 0.99^n, which can be significantly lower than 99%. So each step must succeed well over 99% of the time. Object detection is the first step in many robotic operations, and every subsequent step depends on it.
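A quick sketch of that arithmetic, assuming every step succeeds independently with the same probability (a simplification; real steps are often correlated). The function names are my own.

```python
def overall_success(step_success: float, num_steps: int) -> float:
    """Probability that all num_steps steps succeed, assuming independent steps."""
    return step_success ** num_steps


def required_step_success(target: float, num_steps: int) -> float:
    """Per-step success rate needed so num_steps independent steps reach `target` overall."""
    return target ** (1.0 / num_steps)


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(f"{n:2d} steps at 99% each: {overall_success(0.99, n):.1%} overall")
    print(f"Per-step rate for 95% over 10 steps: {required_step_success(0.95, 10):.2%}")
```

With 10 steps at 99% each, the whole task succeeds only about 90% of the time, and reaching 95% overall needs roughly 99.5% per step.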
Because the performance of object detection directly affects the performance of the robots that rely on it, I took the time to understand how OpenCV's object detection works and how to optimize it. I also found the available documentation and tutorials incomplete or outdated, and a few Stack Overflow questions similar to mine remain unanswered. So it seemed that writing a detailed reference with my findings might benefit others.
Here's a great example of how well OpenCV's object detection can work when you get it right!!
In this post, I use *nix programs; I apologize to Windows users in advance.
I want to point out that installing OpenCV on certain platforms can be complicated and slow. I suggest reading this post thoroughly, collecting your images, and then installing OpenCV on a remote server. Installation is much easier on a remote server running Ubuntu, and you can rent one with far more CPU power than your laptop has, so training finishes much faster.
As I began to learn about OpenCV’s object detection capabilities, I had numerous questions:
Would using the create_samples utility to generate distorted versions of a single positive image suffice?
First off, let's briefly look at how the Viola-Jones algorithm works and try to understand what it's doing. Reading the abstract of the original Viola-Jones paper, we find some new but important terms: integral image, cascade, classifier, feature, etc. Let's take a minute to learn about them.
Integral images and rectangular boxes are the building blocks that the Viola-Jones algorithm uses to detect features. An object’s features are seen by the computer as differences in pixel intensities between different parts of images. The algorithm doesn’t care what color our objects and images are, just the relative darkness between parts of the images.
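Here's a minimal pure-Python sketch of the integral image idea (also called a summed-area table). It's an illustration, not OpenCV's implementation, and the function names are hypothetical. Each entry holds the sum of all pixels above and to the left of it, so the sum of any rectangle takes four lookups regardless of its size, which is what makes evaluating thousands of rectangle features fast.

```python
from typing import List

Image = List[List[int]]


def integral_image(img: Image) -> Image:
    """Return an (h+1) x (w+1) summed-area table with a zero top row and left column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            # Everything above (previous row of the table) plus this row so far
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii: Image, x: int, y: int, w: int, h: int) -> int:
    """Sum of the w x h rectangle whose top-left pixel is (x, y), using four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


if __name__ == "__main__":
    img = [
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9],
    ]
    ii = integral_image(img)
    print(rect_sum(ii, 1, 1, 2, 2))  # 5 + 6 + 8 + 9 = 28
```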
The original paper uses the most obvious feature of human faces: the difference in darkness between the eye region and the cheeks. The training program looks at all combinations of adjacent rectangles as sub-images within each training image and computes the difference in darkness between them.
A simplification that can help us understand how object features are detected is to reduce the image to how the computer sees it. Computers don't see images; they see numbers. In this case, the algorithm measures the darkness of adjacent rectangles and compares them. Individual features are differences in the darkness of adjacent rectangles.
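To make "differences in the darkness of adjacent rectangles" concrete, here's a sketch of a two-rectangle feature computed on a tiny grayscale patch. The patch values and helper names are made up for illustration; real Viola-Jones features are evaluated with the integral image and compared against thresholds learned during training.

```python
from typing import List

Patch = List[List[int]]


def region_sum(patch: Patch, x: int, y: int, w: int, h: int) -> int:
    """Sum of pixel intensities in the w x h region whose top-left pixel is (x, y)."""
    return sum(patch[r][c] for r in range(y, y + h) for c in range(x, x + w))


def two_rect_vertical(patch: Patch, x: int, y: int, w: int, h: int) -> int:
    """Bottom half minus top half of a w x h window (h should be even).

    A large positive value means the top is darker than the bottom,
    like eyes (dark) sitting above cheeks (bright).
    """
    half = h // 2
    top = region_sum(patch, x, y, w, half)
    bottom = region_sum(patch, x, y + half, w, half)
    return bottom - top


if __name__ == "__main__":
    # 0 = black, 255 = white: a dark band on top, a bright band below
    patch = [
        [30, 40, 35, 30],
        [25, 35, 40, 30],
        [200, 210, 205, 200],
        [190, 205, 210, 200],
    ]
    print(two_rect_vertical(patch, 0, 0, 4, 4))
```

A uniform patch scores 0, so the feature only "fires" where there is contrast in the expected direction.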
Below is a brief overview of the steps required to generate a cascade file for object detection. Don't worry about the details now; we'll walk through each step below.
.vec file that contains images of your objects in binary format, using the annotation file above
I mentioned that training can take a long time; I've read that it can take weeks. I strongly recommend using a remote server to train your cascade, for two reasons: one, it speeds up training immensely (mine took only 18 minutes); and two, installing OpenCV on Ubuntu is far faster than compiling it from source on a Mac. When I did this, there were no precompiled binaries available for OS X.
I used an 8-core Digital Ocean server to train mine. It cost about $5 per day. You should only need one for a few hours, or perhaps a few days if you struggle to get the training right. I believe that when you sign up for Digital Ocean you also get $10 in credit, so you can probably do this for free.
Tip: It's not hard to find $10 coupons for Digital Ocean if you look around a bit. Another benefit of using a remote server is that your local machine doesn't have to be devoted to training: a remote server keeps training even if you accidentally close your laptop.
There are a few ways to develop positives.
I used a combination of these two approaches.
Here are a few things to remember when taking pictures of your object(s). Probably the most important: multiple images of the same object count as multiple positives. You can slightly (but not too much) tilt and rotate your object, by approximately 10-20º. If you have multiple instances of the object, like shoes, take pictures of all of them positioned the same way (e.g., toes all facing left or all facing right).
When googling, I found outlets in different colors, against different backgrounds, and from different angles. When searching for your object, you can also specify the size of the images Google returns.
To set the size once you have clicked "Images",
I recommend starting with at least 100-200 positives. Some people have gotten decent results with fewer. I used ~380 for my final, nearly perfect cascade file, which produced no false positives other than a few that briefly flickered on screen.
Once you have your positive images, you should make an annotations file. I say "should" because I think this is an important step: I didn't generate a working cascade.xml file until I used the opencv_annotation tool (described below) to create an annotations file. At first it seems like making such a file will take a long time, but it doesn't. I suggest using this tool from the start rather than trying to train your cascade without it.
Along with OpenCV's createsamples and traincascade applications, when you type opencv_[tab] in your terminal (once you have OpenCV installed), you will find another tool: opencv_annotation. It helps you quickly generate an annotation file containing the paths to your positive images and the location and size of the objects within them. Note that each object's position is given by the top-left corner of the rectangle that contains it. When you're done, the file will look something like this:
The "2" after the file path is the number of objects annotated in that image (lots of mine were two because outlets come in pairs). Then, for each object, come the x and y coordinates of the top-left corner of its rectangle, followed by the rectangle's width and height.
So in the first line of the annotations image above, "230 169" is the pixel at the top-left corner of an outlet in GOPR4620.JPG, and that outlet is 33x40 pixels. You get the point.
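If you want to sanity-check your annotations file before training, here's a small parser for the line format described above: path, object count, then x, y, width, and height for each object. It's a sketch that assumes your file paths contain no spaces; the second rectangle in the example line is made up.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height); (x, y) is the top-left pixel


def parse_annotation_line(line: str) -> Tuple[str, List[Box]]:
    """Parse 'path count x y w h [x y w h ...]' into (path, boxes)."""
    parts = line.split()
    path, count = parts[0], int(parts[1])
    numbers = [int(v) for v in parts[2:]]
    if len(numbers) != 4 * count:
        raise ValueError(f"{path}: expected {4 * count} numbers, got {len(numbers)}")
    boxes = [tuple(numbers[i:i + 4]) for i in range(0, len(numbers), 4)]
    return path, boxes


if __name__ == "__main__":
    # First box is from the example in the text; the second box is hypothetical.
    path, boxes = parse_annotation_line("GOPR4620.JPG 2 230 169 33 40 300 170 33 40")
    for x, y, w, h in boxes:
        print(f"{path}: outlet at ({x}, {y}), {w}x{h} px")
```

Running it over every line catches count/coordinate mismatches before they turn into confusing training errors.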
The annotation tool writes the image paths and the rectangles you outline in each image to the file for you, which saves a ton of time.
Here’s the command that I used to create the annotations file.
opencv_annotation -images . -annotations annotations.txt
I had one problem with this tool that will hopefully not happen to you, or will have been fixed by the time you read this: the annotation tool would not write to the file when I pressed "n" after outlining an object. It only wrote to the file once all of the images in the directory had been processed.
As a workaround, I split my images into a series of directories and appended each directory's annotations file to the main one with a command like the following, which adds the contents of one file to the end of another.
cat ./sub-dir/annotations.txt >> ./main-annotations.txt
Be sure to use two angle brackets (>>); with a single > the shell will overwrite your annotations file and you'll have to start over!
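If you'd rather not risk a stray single >, here's a Python sketch of the same merge. It opens the main file in append mode ("a"), which, like >>, never truncates what's already there. The layout (sub-directories that each hold an annotations.txt) mirrors the workaround above; the function name is hypothetical.

```python
from pathlib import Path


def merge_annotations(sub_dirs, main_file="main-annotations.txt"):
    """Append each sub-directory's annotations.txt to main_file; return lines added."""
    added = 0
    # Mode "a" appends, like the shell's >>; mode "w" would truncate, like >.
    with open(main_file, "a", encoding="utf-8") as out:
        for sub_dir in sub_dirs:
            src = Path(sub_dir) / "annotations.txt"
            for line in src.read_text(encoding="utf-8").splitlines():
                if line.strip():  # skip blank lines
                    out.write(line.rstrip() + "\n")
                    added += 1
    return added


if __name__ == "__main__":
    dirs = sorted(p for p in Path(".").iterdir() if (p / "annotations.txt").is_file())
    if dirs:
        print(merge_annotations(dirs), "lines appended")
```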
After you create the annotations file, you can use the opencv_createsamples tool to create a .vec file from it. This gives you a more varied set of positives than generating distorted copies of a single image.
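This step can be scripted. Below is a sketch that totals the objects in your annotations file (the second field of each line) so that -num matches what you actually annotated, then builds the command. -info, -vec, -num, -w, and -h are real opencv_createsamples options; the file names and the 24x24 size are placeholder assumptions, and whatever -w and -h you choose must match what you later pass to traincascade.

```python
from pathlib import Path


def count_objects(annotations_path):
    """Total number of annotated objects: the sum of each line's count field."""
    total = 0
    with open(annotations_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:  # skip blank or malformed lines
                total += int(parts[1])
    return total


def createsamples_cmd(annotations_path, vec_path, num, w=24, h=24):
    """Build an opencv_createsamples argument list that packs annotated positives into a .vec file."""
    return [
        "opencv_createsamples",
        "-info", annotations_path,
        "-vec", vec_path,
        "-num", str(num),
        "-w", str(w),
        "-h", str(h),
    ]


if __name__ == "__main__":
    if Path("annotations.txt").is_file():
        cmd = createsamples_cmd("annotations.txt", "positives.vec",
                                count_objects("annotations.txt"))
        # Print it and run it yourself, or pass cmd to subprocess.run(cmd, check=True)
        print(" ".join(cmd))
```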
Ideally, your positive images will show the objects you're trying to detect in their natural environment, and your negative images will show that same environment without them.
There are a few ways to generate negative images. One thing to remember is that you will get the best results when using negatives from the environment you intend to use your cascade file in.
This post's repository has a directory with a few tarballs containing a total of 3,100 negatives. Note that you will need to look through them to make sure none of them contain your object.
Here's another way to generate negatives: download videos and grab frames from them with ffmpeg. You can decide what percentage of each video's frames to keep, depending on how many negatives you think you need.
If you use this frame-grabbing approach, keep only one out of every few dozen frames unless the camera is moving around the environment a lot. Most videos show the same view for at least a few seconds, so make sure the negatives you generate this way are actually different from one another.
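Here's a sketch that builds an ffmpeg command keeping roughly one frame out of every N. It uses ffmpeg's fps filter (-vf fps=...) to resample the video to a lower frame rate. The 30 fps source rate, N = 36, and the file names are assumptions; check your video's actual frame rate (for example with ffprobe) first.

```python
from fractions import Fraction


def frame_grab_cmd(video_path, source_fps, keep_every, out_pattern="neg_%05d.jpg"):
    """ffmpeg command that keeps about one frame out of every `keep_every` frames.

    source_fps may be an int, a decimal like 29.97, or a string like "30000/1001".
    """
    if keep_every < 1:
        raise ValueError("keep_every must be >= 1")
    # e.g. 30 fps / 36 -> "5/6", i.e. one output frame every 1.2 seconds
    out_fps = Fraction(str(source_fps)) / keep_every
    return ["ffmpeg", "-i", video_path, "-vf", f"fps={out_fps}", out_pattern]


if __name__ == "__main__":
    print(" ".join(frame_grab_cmd("kitchen-walkthrough.mp4", 30, 36)))
```

Using a Fraction keeps the rate exact (ffmpeg accepts rates written as num/den), so long videos don't drift the way a rounded decimal rate would.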
Here are a couple of commands you can use to easily copy your positive and negative images to the remote server.
First, I suggest creating a tarball for each directory of images. This will speed up and simplify the transfer process.
While in your image directories do something like this:
tar -cvzf positives.tar.gz /path/to/positives-folder/*.jpg
tar -cvzf negatives.tar.gz /path/to/negatives-folder/*.jpg
Each command creates a single compressed archive (named by the argument right after -f, in your current directory) containing the images matched by the last argument; here, only files with the .jpg extension.
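If you're on a machine without tar (Windows users, this one's for you), Python's standard tarfile module can build the same kind of archive. A sketch with placeholder folder names; unlike the tar commands above, it stores only the file names (via arcname), so the images extract directly into whatever directory you extract from. Like the shell glob, the *.jpg pattern is case-sensitive on Linux and macOS, so a file named GOPR4620.JPG would not match it.

```python
import tarfile
from pathlib import Path


def pack_images(folder, archive_path, pattern="*.jpg"):
    """Write a gzip-compressed tarball of the files in `folder` matching `pattern`."""
    files = sorted(Path(folder).glob(pattern))
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for f in files:
            tar.add(str(f), arcname=f.name)  # store only the file name, not the full path
    return len(files)


if __name__ == "__main__":
    for name in ("positives", "negatives"):
        if Path(name).is_dir():
            print(name, pack_images(name, f"{name}.tar.gz"), "files")
```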
Here's the command to copy a tarball to your remote server. scp copies the file over SSH to the path you specify on the server:
scp positives.tar.gz root@[your-remote-ip]:/remote-project-dir/positive-image-dir
Don't forget the ":" between the server address and the remote path in the command above.
Once you've connected to your remote server, extract your tarballs while in the appropriate directories:
tar -xvf negatives.tar.gz
Some people use consistently sized images; I didn't. What is important is that your images are at least as large as the training window, which defaults to 24x24 pixels.
Steven Puttemans, an OpenCV author, says he never uses images with dimensions larger than 80px. I tried resizing my images to 80px to speed things up, but I got tons of false positives; I believe too much image information was lost. I ended up using 256x256-pixel images. Smaller images may work, but 256 pixels square worked for me.
Note that if your images are small to begin with, enlarging them with mogrify will not magically make them useful to the algorithm. I only resized images that started out larger than this, to speed up training.
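For reference, here's the arithmetic behind a shrink-only resize: scale the image so it fits inside a 256x256 box while keeping its aspect ratio, and leave smaller images alone. The helper is hypothetical; ImageMagick's ">" geometry flag (mogrify -resize "256x256>" *.jpg) applies the same only-shrink rule.

```python
def fit_within(width: int, height: int, box: int = 256) -> tuple:
    """New (width, height) that fits inside box x box, preserving aspect ratio.

    Images already within the box are returned unchanged (shrink only),
    since enlarging a small image adds no real detail.
    """
    scale = min(box / width, box / height)
    if scale >= 1:
        return width, height
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    for size in [(4000, 3000), (1080, 1920), (200, 150)]:
        print(size, "->", fit_within(*size))
```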
What definitely does matter are the width (-w) and height (-h) arguments you pass to traincascade. You will not be able to detect objects smaller than these dimensions. Both default to 24 pixels.
annotation tool to create an annotations file.
createsamples inserts your positives on your negatives.
.vec file*. This is because each training stage may reject a few positives (the minimum hit rate is below 100%), and traincascade replaces them with unused positives from the .vec file; if it runs out, training fails. If you use create_samples to create a .vec file, you are more likely to run into this problem. See this link for more of an explanation on why to use 85%.
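Here's a sketch of that arithmetic. Besides the 85% rule of thumb, it implements an estimate that circulates on OpenCV's Q&A forum: the .vec file needs at least numPos + (numStages - 1) * (1 - minHitRate) * numPos + S samples, where S is the number of positives rejected as background right away. The defaults below (numStages = 20, minHitRate = 0.995) are traincascade's defaults; S isn't known in advance, so it's a guess you supply, and the sample count in the example is hypothetical.

```python
import math


def num_pos_rule_of_thumb(vec_count: int) -> int:
    """About 85% of the samples in the .vec file (integer math avoids float rounding)."""
    return vec_count * 85 // 100


def max_num_pos(vec_count: int, num_stages: int = 20,
                min_hit_rate: float = 0.995, rejected_as_bg: int = 0) -> int:
    """Largest numPos satisfying
    vec_count >= numPos + (numStages - 1) * (1 - minHitRate) * numPos + S.
    """
    per_pos = 1 + (num_stages - 1) * (1 - min_hit_rate)
    return math.floor((vec_count - rejected_as_bg) / per_pos)


if __name__ == "__main__":
    vec_count = 380  # hypothetical number of samples in your .vec file
    print("85% rule:", num_pos_rule_of_thumb(vec_count))
    print("estimate:", max_num_pos(vec_count, rejected_as_bg=10))
```

The 85% rule comes out a bit more conservative than the estimate with default settings, which leaves headroom for the unknown S.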
-data: The directory where OpenCV will store your cascade file and other related files.
-bg bg.txt: This file contains the paths to your negatives. It's easy to create: just run
ls *.jpg > bg.txt while in your negatives directory.
-acceptanceRatioBreakValue: You can use this to stop training at 0.00001 (1e-5).
-vec: The file output by
opencv_createsamples that contains your positives.
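Putting the arguments together, here's a sketch that builds an opencv_traincascade command. -data, -vec, -bg, -numPos, -numNeg, -numStages, -w, -h, and -acceptanceRatioBreakValue are real traincascade options, but apart from the 1e-5 break value, the paths and numbers are placeholders, not the values I trained with. -w and -h must match the values used with opencv_createsamples.

```python
def traincascade_cmd(data_dir, vec_path, bg_path, num_pos, num_neg,
                     num_stages=20, w=24, h=24, break_value=1e-5):
    """Build an opencv_traincascade argument list (all values are placeholders)."""
    return [
        "opencv_traincascade",
        "-data", data_dir,
        "-vec", vec_path,
        "-bg", bg_path,
        "-numPos", str(num_pos),
        "-numNeg", str(num_neg),
        "-numStages", str(num_stages),
        "-w", str(w),
        "-h", str(h),
        "-acceptanceRatioBreakValue", str(break_value),  # str(1e-5) == "1e-05"
    ]


if __name__ == "__main__":
    # Hypothetical counts; paths assume you run this from inside the negatives folder.
    cmd = traincascade_cmd("../data", "../positives.vec", "bg.txt",
                           num_pos=323, num_neg=1500)
    print(" ".join(cmd))
```

The relative paths assume you run training from inside your negatives folder, so the bare file names in bg.txt resolve.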
To quickly test the performance of your cascade files, I have included a Python file that runs object detection locally using your computer's webcam. I'd like to credit Shantnu for originally posting a file very similar to the one included (mine adds a version-error fix). To test your cascade file, just run this command:
python webcam.py cascade.xml
This file runs OpenCV's detection on your webcam's video feed, so it will only work if you have one of your objects handy. Sometimes an image of the object on your phone, or a printed image, will work too.
Shantnu wrote a post about this file that explains what's going on inside. I recommend taking a minute to understand it, especially the faces = faceCascade.detectMultiScale(...) call.
This is the core OpenCV function that actually uses our cascade files to detect objects, and its parameters are important. Pay close attention to scaleFactor, minNeighbors, and minSize. minSize is self-explanatory, but the others aren't:
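scaleFactor controls how much your image is shrunk at each step of an image pyramid, so that the detector's fixed-size window can match larger objects. scaleFactor = 1.1 shrinks the image by roughly 10% at each step (to 1/1.1 of its previous size); it zooms out, so to speak.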
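To get a feel for scaleFactor, here's a sketch of the image pyramid it implies. OpenCV's actual loop differs in details, so treat this as an approximation: starting at scale 1, the image is shrunk by scaleFactor at each step until it's smaller than the detector's window (the -w x -h size from training). A smaller scaleFactor means more steps, which is slower but less likely to skip an object size.

```python
def pyramid_scales(img_w, img_h, win_w=24, win_h=24, scale_factor=1.1):
    """Scales at which a win_w x win_h detector window is applied (approximation)."""
    if scale_factor <= 1:
        raise ValueError("scale_factor must be > 1")
    scales = []
    factor = 1.0
    # Keep shrinking the image while the detector window still fits inside it
    while img_w / factor >= win_w and img_h / factor >= win_h:
        scales.append(factor)
        factor *= scale_factor
    return scales


if __name__ == "__main__":
    for sf in (1.05, 1.1, 1.3):
        s = pyramid_scales(640, 480, 24, 24, sf)
        # At scale f, the window covers objects of about 24 * f pixels in the original image
        print(f"scaleFactor={sf}: {len(s)} scales, largest object ~{24 * s[-1]:.0f} px")
```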
minNeighbors is also very important. This Stack Overflow answer explains it better than I can, so please check it out. The gist is that the higher minNeighbors is, the higher the threshold for detecting an object. If minNeighbors is too low, you will get too many false positives. This image shows exactly what I'm talking about.
See how the actual faces have more squares? Even with a working cascade file, we still get some false positives. detectMultiScale slides a window over the image looking for matches. Strong matches (our actual objects) have neighboring windows that also match; those are the neighbors minNeighbors counts.
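Here's a deliberately simplified sketch of the minNeighbors idea: count how many other raw detections overlap each candidate, and keep only candidates with at least minNeighbors neighbors. OpenCV's real grouping (see groupRectangles) clusters rectangles of similar size and position and averages each cluster, so this is an illustration rather than its algorithm; the overlap test, the 0.5 threshold, and the sample rectangles are assumptions.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, w, h)


def overlap_ratio(a: Rect, b: Rect) -> float:
    """Intersection area divided by the smaller rectangle's area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    return (iw * ih) / min(aw * ah, bw * bh)


def filter_by_neighbors(rects: List[Rect], min_neighbors: int,
                        min_overlap: float = 0.5) -> List[Rect]:
    """Keep rects that overlap at least `min_neighbors` other raw detections."""
    kept = []
    for i, r in enumerate(rects):
        neighbors = sum(
            1 for j, other in enumerate(rects)
            if j != i and overlap_ratio(r, other) >= min_overlap
        )
        if neighbors >= min_neighbors:
            kept.append(r)
    return kept


if __name__ == "__main__":
    raw = [
        (100, 100, 40, 40), (102, 98, 40, 40), (98, 103, 42, 42), (101, 101, 39, 39),
        (300, 50, 30, 30),  # a lone hit, likely a false positive
    ]
    print(filter_by_neighbors(raw, min_neighbors=3))  # only the four clustered boxes
```

Raising min_neighbors from 3 to 4 here would drop the real object too, which is the trade-off the text describes.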
Included with OpenCV are a few working cascade.xml files too. It's fun to run things just to see them work, so check those out.
This depends on how small your samples are in your .vec file. I set mine at 20x20 pixels because I want my robot to detect outlets from a long way away. Your situation may be different.
traincascade wants to keep training past 0.00001? Simple: stop training with Ctrl-C. Then add the -numStages n-1 parameter to the traincascade command you were just using, where n is the number of the stage after it reached 0.00001.
detectMultiScale a bit if you're getting too many false positives or otherwise poor detection results. Try reducing minNeighbors to 3 or below to see whether your cascade is detecting anything at all.
bg.txt file. It is common for this file to end up with some extraneous files in it, so generate it with a command that lists only your .jpg files: run ls *.jpg > bg.txt while in the negatives directory. If you're on Windows, make sure the file doesn't contain a BOM (byte order mark).
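If you'd rather generate bg.txt from Python (handy on Windows, where some editors add a BOM), here's a sketch. It lists only .jpg files (case-insensitively, unlike ls *.jpg) and writes plain UTF-8 with no BOM and \n line endings. The default file name matches the text; the function name and script usage are hypothetical.

```python
import sys
from pathlib import Path


def write_bg_file(neg_dir=".", out_name="bg.txt"):
    """Write the negatives list: one .jpg file name per line, UTF-8 without a BOM."""
    neg = Path(neg_dir)
    names = sorted(p.name for p in neg.iterdir()
                   if p.is_file() and p.suffix.lower() == ".jpg")
    # "utf-8" (unlike "utf-8-sig") never writes a byte order mark
    with open(neg / out_name, "w", encoding="utf-8", newline="\n") as f:
        for name in names:
            f.write(name + "\n")
    return len(names)


if __name__ == "__main__":
    # Usage: python make_bg.py /path/to/negatives
    if len(sys.argv) > 1:
        print(write_bg_file(sys.argv[1]), "negatives listed in bg.txt")
```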
traincascade. I ended up running traincascade from inside my negatives folder, which solved most of those problems. Just pass relative paths such as -data ../data.
Add more information! Increase your positive and negative image sets. Your classifier does not have enough information to correctly determine that your object is not in your test images. When I had too many false positives and increased my positives and negatives, the false positives immediately declined and training started producing more stages.
Here are the likely causes of various error messages.
“Required leaf false alarm rate achieved. Branch training terminated.”
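The training algorithm can run out of information that helps it add to its classifiers. If it has already gleaned as much as it can from the images, it simply stops, and this is the output you get. Strictly speaking, the message means the false alarm rate on your negatives has already dropped below the overall target (roughly maxFalseAlarmRate raised to the power of numStages), so training ends before reaching the number of stages you asked for.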
This happens earlier when you use smaller image sets. If you only pass it a few dozen or a few hundred images, it can only train a few stages. The more images you pass, the later you will run into this message and the better your cascade file will detect your objects.
But maybe your object is very static and it doesn't take many positives to develop a good classifier. In that case, add the argument -numStages n-1 to opencv_traincascade, where n is the stage number that produced this message. This will produce a cascade.xml file that may work, or that can at least tell you whether your arguments and images are on track.
Train dataset for temp stage can not be filled. Branch training terminated.
This is most common when you have not provided enough positives, which is really the most time-consuming aspect of training. Add more positives!
OpenCV is a mature, robust computer vision library. If you don't get solid results, you are most likely passing traincascade too few images, or the wrong images. Keep working at it until you get good detection. It may take a few tries, like it did for me, but stick with it; it's magical when it works!