These are my notes on refreshing my object detection knowledge. We will start with bounding boxes for localization and cover everything we need before jumping in to implement YOLO algorithms.
This tutorial includes answers to the following questions:
 What is localization?
 What are a bounding box and sliding window?
 How to measure the success of a predicted bounding box: Intersection over the union.
 How to get rid of extra bounding boxes: Nonmax suppression.
 Evaluation Metric for Object Detection: Mean average precision
Check references for addresses of the images.
Object Detection
Object Detection is finding objects in an image and where they are located in the image. Adding localization or location on detected objects for a classification task will give us object detection.
We mainly use Deep Learning approaches for modern applications (what a surprise 🙂). On the other hand, object detection focuses on how to solve localization problems for the most part, so we will focus on some methods to help us solve this issue to begin with.
Localization
Localization is an easy concept. For example, imagine we have an image of a cat; I can classify this image as a cat with some confidence level. If I want to show where the cat is in the image, I need to use localization; to determine what part of the image the cat is at. Similarly, if we had multiple objects on the scene, we could detect each separately, classifying the image as numerous. We call this location identification of various objects localization.
While we try to classify an image, our model will also predict where is the predicted object located, or rather where the bounding box is located. To do this, we have additional parameters to describe the location of the bounding box. For example, we can define a bounding box using the coordinates of its corners or the location of its middle point and height and weight. I’ll talk about that later.
Bounding Box & Sliding Window
As we discussed, a bounding box surrounds an object in the image. The red, green and blue boxes in the image above are examples of bounding boxes. That’s great; how do we handle drawing these boxes, though?
There are many ways proposed, and I am sure the research will continue on it for some time, but the primary approach we have is called a sliding window. A sliding window is to have a box run around all the images and try to find which part of the image actually has the object we are predicting.
As we can guess, this is a slow method, considering how many boxes there are for every image (you also have to run your model on each window). So there is some work on improving this method’s speed.
The next problem is getting multiple bounding boxes for an image. We will see how to handle this as well.
Intersection over Union
This is a simple method to calculate the error of a given prediction. We check the intersection of the real bounding box and the prediction, and divide it into the union of the two. Very simple, isn’t it?
What we need to do now is to write a simple geometric formula to determine the area of intersection and the union. Let’s jump in using PyTorch. For the sake of understanding, I will first give the nonvectorized implementation, then upgrade the lines to vectorized version
The first thing we need to do is to convert the midpoint representation to corners. W


We are basically just iterating through all the boxes we have in the list and changing their values using width and height from the middle point. If we know the middle point, we can remove half of the width and height to find the top left corner of the image (in python images are 0,0 on the top left and getting higher numbers towards south and east). So the formula for the top left corner is $x_1 = m_x  \frac{w}{2}$ where $m_x$ is the x for the middle point and $w$ is the width. The same logic goes for y. If we add $w$ to this value we will find the $x_2$.
Now if we do this way, we are not making use of tensor operations, so let’s alter the code to get a faster calculation


If you didn’t get what’s happening here, please ponder over the code a little to grasp how the two of them are the same.
Now that we setup the corners, we need to find the intersection and the union of the areas. To find the area we can simply multiply the height and width which are equal to the distance between the x’s and y’s. So $A = abs(y_2y_1)\times abs(x_2x_1)$. We can get the area for boxes with this logic


We now have everything but the intersection. To find the intersection we can use a simple idea.
 The $x$ for the first point (top left corner of the intersection) will be the maximum of the $x_1$ of the target and the prediction.
 The $y$ for the first point will be the maximum of the $y_1$’s.
 In the same way, the $x_2$ will be the minimum of the two $x_2$’s.
 $y_2$ will be the minimum of the current $y_2$’s.
So we can just find the corners and use the same logic as the boxes to find the area of the intersection intersection = (x2  x1) * (y2  y1)
. Though we need a little extra here, we have a probability that there is nothing at the intersection in which case we need to just say so, meaning we need to increase the value to 0
.


We could just use clamp(0)
instead of an ifelse
statement there, but I wanted to make it as easy to comprehend as possible.
Let’s combine everything and PyTorchify at the same time


You can check the easy version on the GitHub repo.
That’s all for the IOU! Now let’s jump over to nonmax suppression.
Nonmax Suppression
As we mentioned before we might get multiple bounding boxes that fit an object. We need to clean these up and keep only one (one box to rule them all…). We introduce nonmax suppression precisely to do this.
For each object in our scene we get multiple boxes around, and we need to see if these boxes are actually for the same object and if so we should remove them and keep a single one.
For this, we get all the boxes that say this part of the image is a dog with some confidence. We pick the box with the most confidence and compare all the others with this box using IoU. After that, by using some threshold value, we remove all the boxes that are above the threshold.
Before all this, we can also discard all the boxes that are below some confidence level, which would ease our job a little.
One last thing to mention before jumping in the code, we do this separately for each class. So for bikes, we would go over the boxes one more time, and for cars too etc.
Time for the code!
So to begin with, we will assume we got some boxes bboxes
as a tensor, iou_threshold
for the IoU comparison and threshold
for confidence threshold.
We first handle the conversion from h
and w
as before.
Now that we have proper variables, we then eliminate the boxes that are below the prediction threshold, then we sort the boxes based on their probabilities (so we can consider the highest probability first.)
Then we simply iterate through each box and remove all the boxes that have a higher IoU value than the threshold we gave (we also keep the boxes from other classes). We then append the box we examined for among the boxes to keep.
That’s all, let’s bring it all together.


Done with that as well, we now can focus on the boxes we actually care about, next up is mean average precision.
Mean Average Precision
So we have an object detection model, how do we evaluate this? The most common metric out there (currently) is the Mean Average Precision (mAP). As we do, we will quickly cover the basics and jump into code.
We trained our model, now we are testing using the test or validation data. We will use precision/recall to evaluate. So before doing more, let’s go over precision and recall really quickly.
Precision/Recall
When we make a prediction, we are either right or wrong. Though we can be wrong in different ways. We can say false to something that was true, or true to something that was false. This introduces the idea of False Positive and False Negative. False positive is when the predicted value is positive but the actual result is negative, and False negative is vice versa. Of course, for this, we need to define truth values to the results.
Other notions introduced here are True Positive and True Negative. True positives are the true values our model got to predict right, and true negatives are the negative values where our model got it right.
In our case, for object detection, the predictions we make are the positives, and the predictions we didn’t make are the negatives. So false negatives would be the target boxes that we could not predict (I will explain in a bit how we say if we actually predicted a box right, though you can already guess). If we combine true positives and false positives we get all the predictions we made. If we divide the correct predictions from all predictions we get precision, so $p=\frac{TP}{TP + FP}$. If we combine all the truths, so all the target values whether or not we predicted right, we can reach recall $r = \frac{TP}{TP + FN}$. The diagram below explains it perfectly.
Back to mAP
Now that we know what precision and recall are, how are they used for evaluation in our case?
First of all, how do we know if a prediction is wrong? Yes, we will use IoU as described above. If the IoU value (with a target) is greater than some threshold we will assume that box is correct.
Here are the steps for finding mean average precision:
 First, we will find the truth values (TP, FP) of all the predictions we made.
 Then we will sort the boxes based on their confidence score (just as before).
 Then we will iterate through the boxes (starting from the highest confidence) and calculate precision and recall for the values up to that point. For example, if we have ten target boxes, and the first box in our list has a TP as the result; we will have a precision of $1/1$ and recall of $1/10$. Let’s say the second one is an FP, then precision will get to $1/2$ and recall will stay the same. Long story short, whenever we see an FP we will increment only the denominator on the precision and if we see a TP we will increment both nominators.
 Next, we will calculate the area under the P/R curve which will be an average precision for a class.
 Then, we do all these again for all the classes and average the results.
 Lastly, we must use different IoU values to do the same thing and get the average of the results.
Well, that seemed longer than it actually is, let’s dive into code to get a better grasp.
Code
To make things easier to follow, I want to start with the function definition, so we have all the variables set in place before we piece everything else together.


After that, we will continue with the first step as usual: convert the point format…
In the main part, we will iterate through all the classes, and keep our attention on those only. So to do that we get the targets and predictions for a single class to begin with. We also create a list to keep track of the average precisions.


We will use only these boxes for our next steps (so we only focus on one class at a time). This is preferred since we need to check each box with possible targets. It will make more sense in a bit.
Next up, we sort our predictions based on their probabilities and create a couple of variables for tracking and all. We define precisions
list for keeping true positives and false positives. 1’s will be TP and 0’s will be FP.


Now that we are set, we will iterate through all the predictions. While only considering the target boxes that are for the same image we will check each target and decide if the prediction we are checking passes the IoU threshold for that target. If so we will mark that target done so we don’t consider it for the next prediction. We also will add a true positive to our precisions (adding 1).


Now we need to calculate precisions and recalls, just like we mentioned while explaining the algorithm (adding to the nominator/denominator thing). We will also add an extra zero to make the graph (for AUC) go from 0 to 1. Lastly, we use the trapezoidal rule to calculate the AUC for precision recall.


Let’s put all the bells and whistles together and get our fully formed function:


That’s it! Wait… One last thing. This was for a single IoU threshold, we will need more than that. Let’s write a simple function that calls our mean_average_precision
.


And we are fully done. Now we know every bit we need to actually go ahead and implement our first object detection algorithm, which will be the first version of the still stateoftheart YOLO algorithm. It now has YOLOv7, but we will start with implementing v1.
You can find all the code from this tutorial here.
References
 https://www.youtube.com/c/AladdinPersson
 https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)
 https://labelyourdata.com/articles/meanaverageprecisionmap
 https://en.wikipedia.org/wiki/Precision_and_recallhttps://web.archive.org/web/20191114213255/https://www.flinders.edu.au/science_engineering/fms/SchoolCSEM/publications/tech_repsresearch_artfcts/TRRA_2007.pdf
 Image 1 from https://www.datacamp.com/tutorial/objectdetectionguide
 Image 2 from https://pyimagesearch.com/2015/03/23/slidingwindowsforobjectdetectionwithpythonandopencv/
 Image 3 from https://pyimagesearch.com/2016/11/07/intersectionoverunioniouforobjectdetection/
 Image 4 from https://pjreddie.com/darknet/yolov1/
 Image 5 from https://towardsdatascience.com/whatsthedealwithaccuracyprecisionrecallandf1f5d8b4db1021