For this purpose I came up with an experimentation road map. I tried asking every question I could think of and tried to answer them in a systematic way. In this post we will go over this journey and discuss the results.
Qualitative analysis is a method of analyzing non-numerical data to understand its meaning. It is used in many different fields, such as psychology, sociology, and anthropology, and also in business. We collect qualitative data via different methods, such as interviews, focus groups, and surveys. After collection, we need to analyze the data to extract its meaning, which is nontrivial precisely because the data is not numerical.
This is where the “qualitative coding” process starts. Qualitative coding is the process of assigning labels to data. These labels are called “codes” or “themes”, and they describe the meaning of the data. Qualitative coding is very time consuming and requires a lot of effort. It is also very subjective, since it is done by humans. This is why we want to automate this process as much as we can, and make it more robust, accurate, and fast.
As much recent research has shown, LLMs are still not at the point where they can outperform qualitative coding done by a human expert. However, we speculate that they can be used to speed up the process, and provide a more robust and accurate starting point for experts. This is what we are aiming for in this research, and we will discuss the results in detail.
Topic modelling is a technique that allows us to extract topics from a corpus of text. It is an unsupervised technique that discovers hidden semantic structures in text. It is a probabilistic model and, in essence, a very simple method that does not provide much accuracy. Different methods take different approaches to this problem, but one of the most popular is BERTopic, which uses BERT embeddings to cluster documents. That is pretty much it: even though the library is amazingly implemented and well maintained, the method is very simple. It uses sentence similarity to cluster documents, then analyzes word frequency to assign topics and extract keywords.
Let’s go over why you would, or would not, want to use topic modelling.
Pros
Cons
So if you are interested in a high quality analysis, you would not want to use topic modelling. But if you want a quick analysis and have no labelled data, topic modelling is a great tool. If you want to read more about topic modelling, I strongly suggest you check out BERTopic. It is a great library, and the documentation is very well written.
What we are aiming for, however, is seeing the potential of GPT-3.5/4 in topic classification and generation for a high quality analysis. We hypothesize that GPT models can speed up the process of qualitative analysis, and provide a more robust and accurate starting point for experts.
Along the way, we use topic modelling as a baseline, since we don’t have a better choice. One thing to mention here is that we are not making use of existing topic classification models. This is because topic classification assumes new samples fall into already known (labelled) topics. This is not the case for us, since we are trying to discover new topics and then classify into them with no prior knowledge. This begs for few-shot or zero-shot learning, which is what we test with GPT models.
One thing we have not mentioned yet, and it is crucial in every part of this process, is that topic classification is a multi-label, multi-class classification task, which makes it much harder than ordinary classification. We will discuss this in further detail later on when we talk about the evaluation metrics.
It is well established that GPT (and other LLMs) perform better on complex tasks when those tasks are divided into subtasks. This means we can expect better results if we divide our end goal into smaller pieces. In our case, the task divides naturally into classification and generation. This also helps us evaluate existing methods, so that we have two separate baselines to compare with.
One thing to consider in this separation is that the pieces must work well together, which begs the question of cohesion: how well do the two models do together, rather than alone? So after testing the models on their separate tasks, we will also test them together and see how well they perform toward the end goal.
Another consideration is that this split might actually hurt the task at hand (at least cost-wise), since dividing the task means repeating a lot of the information. This is why we will also try a combined approach (one prompt) and tackle the complexity issues with prompting techniques.
In this experiment we assume that BERTopic has all the correct labels for the given dataset, and should classify documents into these classes. We then compare the results with the actual labels and see how well it performs. This is a very simple experiment, but it gives us a good idea of how well BERTopic performs in classifying topics.
BERTopic is not designed to perform a classification task without training. What we do instead is perform topic modelling on the dataset, then map each generated topic label to the closest class by cosine similarity. This gives us a proper class for each cluster and document. One good thing is that we also know the exact number of topics, so we can use that as a hyperparameter.
We use our internal data for the experiments, but you can use any dataset you want. We have relatively large survey data, with small and big surveys (ranging from 50 to 20k responses per survey). We want to make sure the method we end up with can handle both ends of the spectrum. We also have a lot of different topics, which is another thing we want to make sure we can handle.
We first load the data, and then we will use BERTopic to perform topic modelling. We will use the default settings, and then we will map the topics to the closest class. We will then compare the results with the true labels, and see how well it performs.
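A sketch of this step might look like the following (assuming the responses are already loaded as a list of strings; `nr_topics` uses the known class count mentioned earlier, and the import sits inside the function so the snippet stands alone):

```python
def fit_topics(docs, n_classes):
    """Fit BERTopic with default settings and reduce outlier assignments."""
    # Imported here to keep the sketch self-contained.
    from bertopic import BERTopic

    # We know the number of classes, so we use it as the nr_topics hyperparameter.
    topic_model = BERTopic(nr_topics=n_classes)
    topics, _ = topic_model.fit_transform(docs)
    # Reassign outlier documents (topic -1) to their nearest topic.
    topics = topic_model.reduce_outliers(docs, topics)
    return topic_model, topics
```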


Now that we trained and reduced the outliers, we can map the topics to the closest class.
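The mapping step can be sketched like this (the `embed` callable stands in for whatever sentence-embedding model is used; it is an assumption, not fixed here):

```python
import math

def cosine(a, b):
    # Plain cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def map_topics_to_classes(topic_labels, class_names, embed):
    """Map each generated topic label to the most similar known class.

    `embed` is any callable returning a vector for a string, e.g. a
    sentence-embedding model's encode method (an assumption here).
    """
    class_vecs = {c: embed(c) for c in class_names}
    mapping = {}
    for label in topic_labels:
        vec = embed(label)
        mapping[label] = max(class_names, key=lambda c: cosine(vec, class_vecs[c]))
    return mapping
```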


Now we can check the accuracy of the topic modelling.
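The check itself is just a match-rate helper along these lines:

```python
def simple_accuracy(predicted, actual):
    """Fraction of documents whose mapped class matches the human label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```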


With our data, we get an average accuracy of 0.09, which I think is not much of a surprise. We have a lot of topics, and the topics are very similar to each other. This is a very hard task for topic modelling, and we cannot expect it to perform well. But we needed this experiment to see what we are dealing with, and what we can expect from topic modelling.
After this experiment we speculated that skipping topic modelling and testing just using cosine similarity and BERT embeddings might be a better approach. Due to the similarity of the approach, we include this experiment under the same section. The change is only in the main loop, so let’s just see that.
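A sketch of that loop (again with a generic `embed` callable standing in for the BERT embedding model):

```python
import math

def cosine(a, b):
    # Plain cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def classify_by_similarity(docs, class_names, embed):
    """Skip clustering entirely: embed each response and assign it to the
    class whose embedding is most similar."""
    class_vecs = {c: embed(c) for c in class_names}
    labels = []
    for doc in docs:
        vec = embed(doc)
        labels.append(max(class_names, key=lambda c: cosine(vec, class_vecs[c])))
    return labels
```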


This method yielded an accuracy of 0.21. As we can see, it is better than topic modelling, but still not good enough. Still, it works better as the baseline, so we will use this method for comparison when it comes to the final results.
Since we mentioned accuracy a couple of times here, let’s talk about which metrics we should be using to properly evaluate our model (hint: it is not accuracy).
We cannot really calculate simple accuracy for multi-label classification; we need different metrics. For our case, what matters most is not labelling a response with a wrong class. We can tolerate missing the correct class, but we cannot tolerate assigning a wrong one. This is why we will be using precision as our main metric. We will also use recall and F1 score to get a better idea of how well our model performs.
Besides these, we will use another common multi-label metric that replaces accuracy: Jaccard similarity, the intersection over union of the predicted and true labels. It is a good metric when we have a lot of classes and want to see how well the model performs in general. We will use it to compare our model with the baseline.
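To make this concrete, the sample-averaged versions of these metrics can be sketched in a few lines (roughly what scikit-learn computes with `average="samples"`; edge cases like empty label sets are handled by convention here):

```python
def multilabel_scores(true_sets, pred_sets):
    """Sample-averaged precision, recall, and Jaccard similarity for
    multi-label predictions, given parallel lists of label sets."""
    n = len(true_sets)
    precision = recall = jaccard = 0.0
    for t, p in zip(true_sets, pred_sets):
        t, p = set(t), set(p)
        inter = len(t & p)
        union = len(t | p)
        precision += inter / len(p) if p else 1.0   # wrong labels hurt this most
        recall += inter / len(t) if t else 1.0      # missed labels hurt this
        jaccard += inter / union if union else 1.0  # intersection over union
    return precision / n, recall / n, jaccard / n
```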
Before talking about each metric, we introduce two other friends of ours, price and time. Since we are actually hoping to productionize this method, it is important to talk about these two metrics as well. We will be using the same dataset for all the experiments, so we can compare the time it takes to train and predict for each method. We will also talk about the price of each method, and how much it would cost to run it in production.
Precision is the number of true positives divided by the sum of true positives and false positives. In other words, it measures how well the model predicts the positive instances of each class. A high precision means that the model is good at avoiding false positives, which is important in our case since we want to avoid labeling a response with the wrong class.
Precision can be calculated as:

Precision = TP / (TP + FP)
where TP is the number of true positives and FP is the number of false positives.
Recall is the number of true positives divided by the sum of true positives and false negatives. It measures how well the model identifies the positive instances of each class. A high recall means that the model is good at finding the relevant instances, but it might also produce more false positives.
Recall can be calculated as:

Recall = TP / (TP + FN)
where TP is the number of true positives and FN is the number of false negatives.
The F1 score is the harmonic mean of precision and recall, which provides a balance between these two metrics. It ranges from 0 to 1, with 1 being the best possible score. A high F1 score indicates that the model is good at both avoiding false positives and finding relevant instances.
The F1 score can be calculated as:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
Jaccard similarity, also known as the Jaccard index, is a measure of similarity between two sets. In our case, it is used to measure the similarity between the predicted and true labels. The Jaccard similarity ranges from 0 to 1, with 1 being a perfect match between the two sets.
Jaccard similarity can be calculated as:

J(A, B) = |A ∩ B| / |A ∪ B|

where A is the set of predicted labels and B is the set of true labels.
In addition to the above-mentioned evaluation metrics, time and cost are also important factors when considering a model for production use. The time required for training and predicting with each method should be compared, as well as the cost associated with using a particular method, such as the price of the GPT-3.5/4 API, which can be significant depending on the size of the dataset.
With the metrics and the baseline ready, we can start talking about the implementation of our second experiment: how well GPT performs on classification.
The second experiment is to see how well GPT-3.5/4 performs on the same classification task: multi-label, multi-class classification. We use the same dataset and the same metrics to compare the results. We also compare the time and cost of each method, to see how well they would perform in production.
When an analyst handles the data, there are a couple of human errors that are expected to happen from time to time:
I mention these here since we are about to use GPT for the classification task, and these errors in general lead to wrong labels. We will see how well GPT performs regardless of these errors, because we are using human-generated labels to begin with.
Later on when we are checking the results for cohesion, we will actually be using GPT generated themes and a human will manually evaluate the results. This will help us see how well GPT performs in a real world scenario. There are some issues with this method but we will discuss them later on.
Prompting is the single most important component when it comes to zero/few-shot learning. If you are not familiar with prompting, I highly suggest you go through Lilian Weng’s Blog Post. It is a great resource for understanding the techniques and the importance of prompting.
Let’s talk about what prompting techniques we will be using for this experiment. I won’t be able to provide the exact prompts we have used, since they are company knowledge, but I will mention the general idea.
For this step, we feed the existing themes to GPT, ask it to classify responses into these bins, and then compare the results with the true labels. We use the same metrics as before, and we also compare the time and cost of each method.
The only parameters that change during the experiment are the GPT model used (3.5 or 4) and whether we do few-shot or zero-shot learning.
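While the exact prompts are company knowledge, a hypothetical zero-shot prompt builder gives the general idea (all wording below is illustrative, including the batching of several responses into one prompt):

```python
def build_classification_prompt(themes, responses):
    """Hypothetical zero-shot classification prompt; the real wording differs.
    Batching multiple responses per prompt is what the 'batch size' column
    in the results refers to."""
    theme_list = "\n".join(f"- {t}" for t in themes)
    numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(responses))
    return (
        "You are classifying survey responses into themes.\n"
        f"Available themes:\n{theme_list}\n\n"
        "For each response, return every theme that applies "
        "(a response may match several themes or none).\n\n"
        f"Responses:\n{numbered}"
    )
```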
Here are the results:
| Model | Batch Size | Prompt ID | Zero-/Few-shot | Precision | Recall | Jaccard Similarity (Acc) | Price | Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3.5 | 1 | 2 | zero-shot | 0.412 | 0.636 | 0.412 | 0.664 | 8 min |
| GPT-3.5 | 10 | 1 | zero-shot | 0.425 | 0.678 | 0.487 | | |
| GPT-3.5 | 25 | 1 | zero-shot | 0.394 | 0.665 | 0.459 | 0.096 | 5 min |
| GPT-3.5 | 25 | 3 | few-shot | 0.425 | 0.574 | 0.459 | 0.128 | 12 min |
| GPT-3.5 | 1 | 4 | few-shot | 0.411 | 0.663 | 0.411 | 0.661 | 21 min |
| GPT-4 | 1 | 2 | zero-shot | 0.46 | 0.74 | 0.46 | 6.46 | 24 min |
| GPT-4 | 25 | 1 | zero-shot | 0.475 | 0.770 | 0.551 | 0.823 | 11 min |
| GPT-4 | 25 | 3 | few-shot | 0.506 | 0.663 | 0.561 | 1.166 | 8.5 min |
| GPT-4 | 1 | 4 | few-shot | 0.463 | 0.738 | 0.463 | 6.43 | 18 min |
The prompts here simply explain the task. We do not use chain of thought or any other prompting technique; we simply ask GPT to classify the response into one of the themes. This is due to the time constraints we were aiming for (since we already have two layers here, we tested prompts that just assign) and because this is not really a complex task.
We can see that the results clearly indicate that GPT models outperform the baseline by a large margin, though they are still not near what we wanted. We think some of this is due to the human error we mentioned before. Once we have the complete pipeline (with generation) we will be able to see how well GPT performs in a real-world scenario, eliminating the human error.
We have a baseline for classification, but we do not have one for generation, because there is no existing method for theme generation. We could use topic modelling, but that is not really a generation method. By generation we mean generating a theme from scratch after reading the responses. Topic modelling comes close, since it groups responses together (clustering) and then assigns a name to each cluster. But this is not really generation, since we are only assigning a name to a cluster rather than generating a theme from scratch.
Anyhow, a baseline is irrelevant here (since we can only perform zero/one-shot anyway). We will just go ahead and test GPT-3.5/4 for generation and see how well it performs. Since we use a human evaluator to judge the results, we will be able to see how acceptably GPT performs in a real-world scenario (this is pretty much the final workflow we will be going through anyway).
Again for this task we used simple prompting. After going through multiple iterations of prompting we picked the best performing one (from a quick evaluation), and used that for the final results. This is the step we test the “cohesion” between our models, and get a human evaluator to evaluate the results.
We have now come to the end of the first phase, where we can evaluate the results. We run generation and classification one after the other, report the results, and ask a human expert to analyze them.
We evaluated the results of 130 responses, and got an F-beta score of 0.81. This is a very good result, and we are very happy with it. We also got a lot of feedback from the evaluator, which we used to improve the prompting. For the beta value we used 0.65, as we give more importance to precision.
This evaluation happens in two steps: the analyst first looks through the generated themes and evaluates how good (and how descriptive) they are. Then they look at the classification results in the context of the generated themes, and evaluate how well the classifications fit.
Overall we are happy with the current state of the model. But this process gave us the idea that the separation might not have been a good idea.
Next we test a combined approach, where we use a single prompt for both generation and classification. This will help us see whether the separation is actually helping us or not.
To handle some of the complications and give a clearer direction to GPT, we use a prompting technique called “Chain of Thought”. This is a very powerful technique, and it is very easy to implement. We will be using this technique for both generation and classification.
We also gave GPT a quite descriptive expert-analyst persona that directs the model to think like an analyst we would approve of. This is an important step, since we want to make sure GPT does not generate themes that are not useful for us.
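To give the general idea, a hypothetical combined prompt could be put together like this (the persona and instructions below are illustrative placeholders, not our production prompt):

```python
def build_combined_prompt(responses):
    """Hypothetical one-shot prompt combining persona, chain of thought,
    generation, and classification; wording is illustrative only."""
    persona = (
        "You are a senior qualitative research analyst with years of "
        "experience coding open-ended survey responses."
    )
    steps = (
        "Think step by step: first read all responses, then propose a "
        "short list of themes, then assign each response to every theme "
        "that applies."
    )
    numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(responses))
    return f"{persona}\n\n{steps}\n\nResponses:\n{numbered}"
```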
After all the experiments, we finally have a system in production. I might have missed some of the details while experimenting, but it took a long time to get to this point, and I am a little too lazy to fill in details that don’t really matter at this stage, especially since I am working on something new now.
I will just go ahead and explain the final system and what we found to be the best approach. If you have any further questions, feel free to reach out to me.
We have implemented a three-stage system: we first generate themes, and since we do this with parallel compute, we then merge the redundant themes. Finally, we classify the responses into these themes. While doing this we use GPT function calling to reduce parsing errors at the end. As simple as it sounds, this whole process is quite a complex system to implement in production. We use a lot of different techniques to make sure the system is robust and accurate.
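As one illustrative piece of that pipeline, merging redundant themes can be sketched as a greedy similarity filter (the `embed` callable and the 0.85 threshold are assumptions for illustration, not our production values):

```python
import math

def cosine(a, b):
    # Plain cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_redundant_themes(themes, embed, threshold=0.85):
    """Greedy merge: drop a theme if it is too similar to one already kept."""
    kept, kept_vecs = [], []
    for theme in themes:
        vec = embed(theme)
        if all(cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(theme)
            kept_vecs.append(vec)
    return kept
```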
Overall, we found this to be the best-performing approach using GPT. We are now focused on iterating and reducing the errors we found in production. As a final goal, we are hoping to train our own proprietary fine-tuned model using our own data. This will help us reduce the cost and increase the accuracy of the system. Stay tuned for the results.
Our startup specializes in delivering top-notch qualitative coding services to businesses, presenting the results on a user-friendly dashboard for our clients. In an effort to better serve their needs, we decided to incorporate sentiment analysis as a key feature.
Sentiment analysis is a popular NLP task that classifies text based on its sentiment. This can be accomplished in various ways, such as categorizing text as positive, negative, or neutral. Alternatively, more nuanced classifications like very positive, positive, neutral, negative, and very negative can be used. Other sentiment analysis tasks, like emotion classification or aspect-based sentiment analysis, focus on different aspects of the text. You can learn more about these tasks here.
Ultimately, we chose the most common sentiment analysis task, which classifies text as positive, negative, or neutral. This approach offers the greatest flexibility in terms of data use and compatibility with existing models.
Having settled on our sentiment analysis task, the next step was to find a pretrained model to serve as a baseline for comparison. However, we first encountered a challenge: our data was not in the same format as the models or publicly available data. Consequently, we needed labeled data to test the models and determine which one performed best for our specific needs.
Our first task was to label our data. Given the sheer volume of data and time constraints, we opted to label a small subset. We employed Doccano, a user-friendly tool designed for effortless data labeling. You can find more details about Doccano on its GitHub page.
With the labeling complete, we had a modest dataset of 200 samples, chosen via stratified sampling, to test our models. While our initial plan was to label 1,000 samples, we reduced it to 200 to save time.
Armed with our labeled data, we set out to test various models. Our first port of call was Hugging Face’s Transformers, which offers a range of attention-based Transformer models known for their exceptional performance in NLP tasks, including sentiment analysis. Later in this post, I’ll discuss some specific base models I used, their distinctions, and my rationale for selecting them.
For our initial testing, I chose several top-ranked models from Hugging Face’s Transformers and a baseline model, ‘VADER,’ a rule-based sentiment analysis tool. I compared the Transformer models’ results with those of the baseline. In light of GPT-3.5 and GPT-4’s success, I also incorporated a few zero-shot and few-shot GPT setups using the OpenAI API.
Here’s a list of the models I utilized:
Now, let’s delve into basic usage examples for each model type and our initial results.
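I can’t reproduce our exact snippets here, but the general shape was as follows (the checkpoint name is just one example model, and the ±0.05 compound cut-offs are VADER’s conventional thresholds):

```python
def compound_to_label(compound):
    """Map VADER's compound score to a label using the usual ±0.05 cut-offs."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

def vader_sentiment(texts):
    # Rule-based baseline; import kept local so the sketch stands alone.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    analyzer = SentimentIntensityAnalyzer()
    return [compound_to_label(analyzer.polarity_scores(t)["compound"]) for t in texts]

def transformer_sentiment(texts, checkpoint="cardiffnlp/twitter-roberta-base-sentiment-latest"):
    # Any Hugging Face sentiment checkpoint works here; this one is an example.
    from transformers import pipeline
    classifier = pipeline("sentiment-analysis", model=checkpoint)
    return [result["label"].lower() for result in classifier(texts)]
```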






To effectively assess our models’ performance, we need to employ appropriate evaluation metrics. Common metrics for sentiment analysis include:
Using a combination of these metrics allows for a more comprehensive understanding of the model’s performance, especially when dealing with an imbalanced dataset. For instance, if we have 1,000 samples with 900 positive and 100 negative, we could achieve a high accuracy score by consistently predicting positive outcomes. However, this doesn’t necessarily indicate a good model. Therefore, we need to utilize additional metrics to evaluate our model’s performance.
The F1 score combines precision and recall, making it an ideal choice for our evaluation. Consequently, we opted to use both F1 score and accuracy as our evaluation metrics.
Below is the function we’ll use to calculate accuracy and F1 score.
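A self-contained version of such a function looks like this (macro-averaged F1; in practice scikit-learn’s `accuracy_score` and `f1_score` do the same job):

```python
def accuracy_and_f1(y_true, y_pred):
    """Accuracy and macro-averaged F1 for single-label predictions."""
    n = len(y_true)
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / n
    f1s = []
    for lbl in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == p == lbl for t, p in zip(y_true, y_pred))
        fp = sum(p == lbl and t != lbl for t, p in zip(y_true, y_pred))
        fn = sum(t == lbl and p != lbl for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return acc, sum(f1s) / len(f1s)
```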


With our models and evaluation metrics in place, we can now test the pretrained models using the 200 labeled samples. Since no training is involved, we’ll use all the data for testing.
These results serve as a sanity check and a general evaluation of how closely our data aligns with the training data used for the models. If our data happens to be highly similar to the training data, we can expect favorable results and stop there. However, if the results are unsatisfactory, we’ll need to put in more effort to obtain better results or find a more suitable model.
Below are the accuracy and F1 score plots for all the models:
As evident from the plots, the VADER model performs the worst, while the GPT-4 model emerges as the best-performing one. GPT-3.5 also delivers relatively strong results. The Hugging Face models, on the other hand, don’t perform quite as well. The best open-source model is pysentimiento, but its performance still falls short of our desired level.
It’s worth noting that our data labeling is complex, making it difficult even for humans. This could introduce some bias in the data, but we won’t delve into that in this post since the data itself won’t be disclosed.
The GPT-3.5 and GPT-4 models, both zero-shot, show promising performance. We could potentially achieve better results with few-shot examples.
Considering the potential of GPT models and the underwhelming performance of the pre-trained sentiment analysis models, we decided to first explore the GPT-3.5 and GPT-4 models and then train our own sentiment analysis model using GPT as the labeler. This approach will give us a smaller open-source model for our system, offering performance comparable to GPT models without the associated costs.
We began by testing different prompting methods on the same small dataset to determine the best approach for labeling our sentiment analysis model.
Aside from the prompts, we also tested the general prompting technique. We introduced a parameter called `sample batch size` for this individually dependent task. This parameter controls the number of samples sent to the model at once. It is crucial, since sending many samples simultaneously makes it more challenging for the model to generate all labels. However, a benefit of larger batches is cost efficiency, since the same pre-prompt (or instructions) doesn’t need to be repeated for each sample.
While we won’t delve into the specifics of the prompts used, we ensured that our instructions to the model were clear. GPT models allow us to explain what we want from the model, so we provided detailed definitions of positive, negative, and neutral sentiments.
The results for different prompting methods are shown below:
We included four metrics in the plot. In summary:

- Performance is best with a `sample batch size` of 1. The performance drops significantly with a `sample batch size` of 10.
- Cost-wise, a `sample batch size` of 1 is more expensive than a `sample batch size` of 10.
- Overall, both models do better with a `sample batch size` of 1 outperforming a `sample batch size` of 10. Though GPT-4 performs slightly better, we chose GPT-3.5 due to its lower cost and faster processing time.

To train an open-source model, we’ll use GPT-3.5 to generate the majority of the labels (120,000 data points) and GPT-4 for an additional 10,000 data points. This approach will help us assess how closely we can achieve GPT-4 performance with a smaller model.
Now that we have the labels for our data, we can start training our sentiment analysis model. We will go through the following steps, in order:
Before training our sentiment analysis model, we need to preprocess and prepare the data. Here’s the code for these preprocessing steps:
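A sketch of the general shape (the cleaning, the `LABEL2ID` mapping, and the 80/10/10 split below are illustrative assumptions, not our exact pipeline):

```python
import random

# Assumed label encoding for the three sentiment classes.
LABEL2ID = {"negative": 0, "neutral": 1, "positive": 2}

def preprocess(rows, seed=42):
    """Illustrative preprocessing: light whitespace cleaning, label
    encoding, and a shuffled 80/10/10 train/validation/test split."""
    data = [
        {"text": " ".join(text.split()), "label": LABEL2ID[label]}
        for text, label in rows
        if text.strip()  # drop empty responses
    ]
    random.Random(seed).shuffle(data)
    n = len(data)
    train = data[: int(0.8 * n)]
    val = data[int(0.8 * n): int(0.9 * n)]
    test = data[int(0.9 * n):]
    return train, val, test
```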






The distribution of labels in our dataset is somewhat imbalanced. We’ve observed that negative labels are the most common, while positive labels are the least common. We took this into account when deciding how to evaluate our model. Still, it’s a good idea to be aware of the label distribution in the dataset.
Next, we’ll create a data loader for PyTorch and prepare the data for training by creating a `Dataset` class.
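A minimal version of such a `Dataset` (the `max_length` of 128 and the padding strategy are illustrative; the import fallback only lets the sketch stand alone without torch installed):

```python
try:
    from torch.utils.data import Dataset
except ImportError:  # torch not installed: fall back so the sketch still parses
    Dataset = object

class SentimentDataset(Dataset):
    """Pairs tokenized texts with labels for the transformers Trainer."""

    def __init__(self, texts, labels, tokenizer, max_length=128):
        # Illustrative settings: pad/truncate everything to a fixed length.
        self.encodings = tokenizer(
            texts, truncation=True, padding="max_length", max_length=max_length
        )
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # The default data collator turns these lists into tensors.
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item
```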


Now it’s time to choose a model to train. We’ll use the `transformers` library to train our models, building a classifier on top of pre-trained language models such as BERT. In this section, we’ll discuss the different models we considered, their differences, pros and cons, and then we’ll implement, train, and evaluate each model. The models we considered include:
We have 8 different models. Let’s go over each model and explain how they differ.
BERT, introduced by Google in 2018, is a transformer-based model that marked a significant milestone in the field of NLP. It achieved state-of-the-art results on various tasks and can be fine-tuned for specific tasks like sentiment analysis and question-answering. BERT uses bidirectional context, allowing the model to better understand the textual context. However, it’s not the most recent model, so it’s usually outperformed by newer models.
Pros:
Cons:
RoBERTa, an optimized version of BERT, was introduced by Facebook AI in 2019. It builds upon BERT’s architecture but includes several modifications that improve its performance. RoBERTa uses a larger training dataset, longer training time, and removes the next-sentence prediction task during pretraining. It also employs dynamic masking, resulting in better performance on downstream tasks.
Pros:
Cons:
DistilBERT, a smaller version of BERT, was introduced by Hugging Face in 2019. It aims to maintain most of BERT’s performance while reducing its size and computational requirements. DistilBERT has about half the parameters of BERT and is faster during training and inference.
Pros:
Cons:
XLM-RoBERTa is a multilingual version of RoBERTa, introduced by Facebook AI in 2019. It’s pretrained on a dataset comprising 100 languages and aims to offer improved performance on cross-lingual tasks, such as machine translation and multilingual sentiment analysis.
Pros:
Cons:
GPT-2, a transformer-based language model, was introduced by OpenAI in 2019. It is a large generative model that generates text one token at a time, using a left-to-right autoregressive language modeling (LM) objective. GPT-2 is generally better at generating creative text compared to BERT. Since our goal is to imitate GPT-generated output, we’ll give it a try.
Pros:
Cons:
We’ll train each model on the training set and evaluate it on the validation set, using the `transformers` library. The library provides a unified API for all the models, making it easy to switch between them.
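Loading any of the candidates through that unified API looks roughly like this (the checkpoint names in the docstring are examples):

```python
def load_checkpoint(checkpoint_name, num_labels=3):
    """Load a tokenizer and a classification head on top of a pre-trained
    model, e.g. "bert-base-uncased" or "roberta-base"."""
    # Imports kept local so the sketch stands alone.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint_name, num_labels=num_labels
    )
    return tokenizer, model
```

One caveat worth noting: GPT-2’s tokenizer has no padding token by default, so for that model you would additionally set `tokenizer.pad_token = tokenizer.eos_token`.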


To train our models, we’ll use the `Trainer` class, a high-level API that handles the training, evaluation, and prediction loops. It also manages data loading, model saving, and model loading. We’ll use the `TrainingArguments` class to specify the training arguments, such as the number of epochs, the batch size, and the learning rate.
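An illustrative `TrainingArguments` setup (these hyperparameter values are placeholders, not the tuned ones we ended up with):

```python
def make_training_args():
    # Import kept local so the sketch stands alone.
    from transformers import TrainingArguments

    # Illustrative hyperparameters: the exact values were tuned per model.
    return TrainingArguments(
        output_dir="./results",            # where checkpoints are written
        num_train_epochs=3,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=64,
        learning_rate=2e-5,
        weight_decay=0.01,
        evaluation_strategy="epoch",       # evaluate on the validation set each epoch
        save_strategy="epoch",
        load_best_model_at_end=True,
    )
```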


After setting the arguments, we also set up the optimizer and the learning rate scheduler. We use the AdamW optimizer with a linear learning rate scheduler. We also set the random seed to 42 for reproducibility.


We are now ready to train our model. We instantiate the `Trainer` class and call the `train` method to start training.
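Putting the seed, optimizer, scheduler, and `Trainer` together, the training step can be sketched as follows (the zero warm-up steps and other specifics are illustrative):

```python
def run_training(model, train_dataset, val_dataset, training_args, num_training_steps):
    """Train with AdamW and a linear learning-rate schedule, seeded at 42."""
    # Imports kept local so the sketch stands alone.
    from torch.optim import AdamW
    from transformers import Trainer, get_linear_schedule_with_warmup, set_seed

    set_seed(42)  # reproducibility, as mentioned above
    optimizer = AdamW(model.parameters(), lr=training_args.learning_rate)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        optimizers=(optimizer, scheduler),
    )
    trainer.train()
    return trainer
```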


We can then evaluate the model on the validation set one more time. I am keeping the test set for when we are fully done with training and evaluation, so we don’t make biased decisions.
We can also save the model and the tokenizer.
So we have finalized the training of our models. After running each model and tweaking the parameters to get the best performance, we can see the results.
| Model | Accuracy | F1 Score | Price | Time |
| --- | --- | --- | --- | --- |
| BERT | 0.8559 | 0.8595 | $0.379 | 2 mins |
| RoBERTa | 0.8630 | 0.8661 | $0.379 | 0.5 min |
| DistilBERT | 0.8685 | 0.8698 | $0.379 | 1.5 mins |
| GPT-2 | 0.8619 | 0.8649 | $0.379 | 1.5 mins |
| pysentimiento | 0.4049 | 0.4121 | — | — |
| XLM-RoBERTa | 0.8638 | 0.8665 | $0.526 | 0.5 min |
| RoBERTa Large | 0.8691 | 0.8715 | $0.526 | 1 min |
| GPT-2 Medium | 0.8654 | 0.8583 | $0.526 | 1 min |
| GPT-3.5 | 0.8609 | 0.8326 | $3.6 | 1 h 26 mins |
| GPT-4 | 1* | 1* | $24 | 5 h 6 mins |

*GPT-4 is the ground truth, so we can’t compare it with the other models.
These results are on the test set we generated using GPT-4.
We can see that the results are pretty much the same for all the models we trained, and they all perform really well. They are very similar to the results we got with GPT-3.5, which shows the training was successful, since we trained on the dataset generated by GPT-3.5.
We have also tested how the models perform on the initial hand-crafted dataset and saw that the results are pretty much the same as those we got with GPT-3.5.
In this post, we saw how we can use GPT models to generate a dataset for sentiment analysis. We then trained a range of models on the generated dataset and compared their results.
We decided to go with BERT as the model we use since we already utilize it for another part of our pipeline, but we could go for any of the models we trained.
Our decision to label our dataset using GPT-3.5 allowed us to generate a reliable training set for fine-tuning other models, ultimately leading to the successful implementation of BERT for Aelous. This process demonstrates the versatility and value of GPT models in real-world applications, even if we are not using them directly for the feature itself.
Thanks for reading! If you have any questions or suggestions, feel free to reach out to me, or leave a comment/like below.
This tutorial includes answers to the following questions:
Object detection is finding objects in an image and determining where in the image they are located. Adding localization (the location of detected objects) to a classification task gives us object detection.
We mainly use deep learning approaches for modern applications (what a surprise 🙂). Object detection is, for the most part, about solving the localization problem, so we will begin with some methods that help us solve it.
Localization is an easy concept. For example, imagine we have an image of a cat; I can classify this image as a cat with some confidence level. If I want to show where the cat is in the image, I need localization: determining what part of the image the cat is in. Similarly, if we had multiple objects in the scene, we could detect each separately, assigning each its own class. We call this identification of the locations of various objects localization.
While we classify an image, our model will also predict where the predicted object is located, or rather where its bounding box is. To do this, we add extra parameters describing the location of the bounding box. For example, we can define a bounding box using the coordinates of its corners, or using its middle point together with its height and width. I’ll talk about that later.
As we discussed, a bounding box surrounds an object in the image. The red, green and blue boxes in the image above are examples of bounding boxes. That’s great; how do we handle drawing these boxes, though?
There are many proposed ways, and I am sure the research will continue for some time, but the primary approach is called a sliding window: we run a box across the whole image and check which part of the image actually contains the object we are predicting.
As we can guess, this is a slow method, considering how many boxes there are for every image (you also have to run your model on each window). So there is some work on improving this method’s speed.
The next problem is getting multiple bounding boxes for an image. We will see how to handle this as well.
This is a simple method to calculate the error of a given prediction. We take the intersection of the real bounding box and the predicted one, and divide its area by the area of their union. Very simple, isn’t it?
What we need to do now is write a simple geometric formula to determine the area of the intersection and the union. Let’s jump in using PyTorch. For the sake of understanding, I will first give the non-vectorized implementation, then upgrade the lines to the vectorized version.
The first thing we need to do is to convert the midpoint representation to corners.
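A possible non-vectorized sketch of that conversion (assuming each box is a list `[x, y, w, h]` with `x, y` the midpoint; names are my own):

```python
def midpoint_to_corners(boxes):
    # boxes: list of [mx, my, w, h] in midpoint representation
    corners = []
    for mx, my, w, h in boxes:
        x1 = mx - w / 2  # top-left x
        y1 = my - h / 2  # top-left y
        x2 = x1 + w      # bottom-right x
        y2 = y1 + h      # bottom-right y
        corners.append([x1, y1, x2, y2])
    return corners
```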


We are basically just iterating through all the boxes in the list and changing their values using the width and height from the middle point. If we know the middle point, we can subtract half of the width and height to find the top-left corner (in Python images, (0, 0) is the top left, and the coordinates grow towards the south and east). So the formula for the top-left corner is $x_1 = m_x - \frac{w}{2}$, where $m_x$ is the x of the middle point and $w$ is the width. The same logic goes for y. If we add $w$ to this value we find $x_2$.
Now, doing it this way, we are not making use of tensor operations, so let’s alter the code to get a faster calculation.
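A vectorized sketch using tensor slicing (assuming `boxes` is an `[N, 4]` tensor in midpoint format):

```python
import torch

def midpoint_to_corners(boxes: torch.Tensor) -> torch.Tensor:
    # boxes: [N, 4] tensor of (mx, my, w, h)
    x1 = boxes[..., 0:1] - boxes[..., 2:3] / 2
    y1 = boxes[..., 1:2] - boxes[..., 3:4] / 2
    x2 = boxes[..., 0:1] + boxes[..., 2:3] / 2
    y2 = boxes[..., 1:2] + boxes[..., 3:4] / 2
    return torch.cat([x1, y1, x2, y2], dim=-1)
```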


If you didn’t get what’s happening here, please ponder over the code a little to grasp how the two of them are the same.
Now that we have set up the corners, we need to find the intersection and the union of the areas. To find a box’s area we can simply multiply its height and width, which are equal to the distances between the y’s and the x’s. So $A = |y_2 - y_1| \times |x_2 - x_1|$. We can get the area for the boxes with this logic:
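The area computation might look like this (corner-format tensors assumed):

```python
import torch

def box_area(boxes: torch.Tensor) -> torch.Tensor:
    # boxes: [N, 4] tensor of (x1, y1, x2, y2)
    # area = |y2 - y1| * |x2 - x1|
    return (boxes[..., 2] - boxes[..., 0]).abs() * (boxes[..., 3] - boxes[..., 1]).abs()
```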


We now have everything but the intersection. To find the intersection we can use a simple idea.
So we can just find the corners of the overlap and use the same logic as for the boxes to find the area of the intersection: `intersection = (x2 - x1) * (y2 - y1)`. Though we need a little extra here: there is a chance that the boxes do not overlap at all, in which case we need to clamp the value up to 0.


We could just use `clamp(0)` instead of an if-else statement there, but I wanted to make it as easy to comprehend as possible.
Let’s combine everything and PyTorchify it at the same time.
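A sketch of how the combined function might look (assuming `[N, 4]` midpoint-format tensors; the epsilon in the denominator is my addition to avoid division by zero):

```python
import torch

def intersection_over_union(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Convert midpoint (mx, my, w, h) to corners (x1, y1, x2, y2).
    p_x1 = pred[..., 0:1] - pred[..., 2:3] / 2
    p_y1 = pred[..., 1:2] - pred[..., 3:4] / 2
    p_x2 = pred[..., 0:1] + pred[..., 2:3] / 2
    p_y2 = pred[..., 1:2] + pred[..., 3:4] / 2
    t_x1 = target[..., 0:1] - target[..., 2:3] / 2
    t_y1 = target[..., 1:2] - target[..., 3:4] / 2
    t_x2 = target[..., 0:1] + target[..., 2:3] / 2
    t_y2 = target[..., 1:2] + target[..., 3:4] / 2

    # Corners of the intersection rectangle.
    x1 = torch.max(p_x1, t_x1)
    y1 = torch.max(p_y1, t_y1)
    x2 = torch.min(p_x2, t_x2)
    y2 = torch.min(p_y2, t_y2)

    # clamp(0) handles the no-overlap case.
    intersection = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    p_area = (p_x2 - p_x1).abs() * (p_y2 - p_y1).abs()
    t_area = (t_x2 - t_x1).abs() * (t_y2 - t_y1).abs()
    return intersection / (p_area + t_area - intersection + 1e-6)
```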


You can check the easy version on the GitHub repo.
That’s all for the IoU! Now let’s jump over to non-max suppression.
As we mentioned before, we might get multiple bounding boxes that fit an object. We need to clean these up and keep only one (one box to rule them all…). We introduce non-max suppression precisely to do this.
For each object in our scene we get multiple boxes around it, and we need to check whether these boxes are actually for the same object; if so, we should remove all but one.
For this, we take all the boxes that say this part of the image is a dog with some confidence. We pick the box with the highest confidence and compare all the others with it using IoU. Then, using some threshold value, we remove all the boxes whose IoU is above the threshold.
Before all this, we can also discard all the boxes that are below some confidence level, which would ease our job a little.
One last thing to mention before jumping into the code: we do this separately for each class. So for bikes we would go over the boxes one more time, and for cars too, etc.
Time for the code!
So to begin with, we will assume we got some boxes `bboxes` as a tensor, `iou_threshold` for the IoU comparison, and `threshold` for the confidence threshold.
We first handle the conversion from `h` and `w` as before.
Now that we have proper variables, we eliminate the boxes that are below the prediction threshold, then sort the boxes based on their probabilities (so we consider the highest probability first).
Then we simply iterate through each box and remove all the boxes that have a higher IoU with it than the threshold we set (we keep boxes from other classes). We then append the box we examined to the list of boxes to keep.
That’s all, let’s bring it all together.
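A sketch of the whole procedure (the box layout `[class_pred, prob, x1, y1, x2, y2]` and the helper `iou_corners` are my assumptions, not necessarily the post’s exact format):

```python
import torch

def iou_corners(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # a, b: [4] tensors in (x1, y1, x2, y2) corner format
    x1, y1 = torch.max(a[0], b[0]), torch.max(a[1], b[1])
    x2, y2 = torch.min(a[2], b[2]), torch.min(a[3], b[3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def non_max_suppression(bboxes, iou_threshold, threshold):
    # bboxes: list of [class_pred, prob, x1, y1, x2, y2]
    bboxes = [b for b in bboxes if b[1] > threshold]           # drop low-confidence boxes
    bboxes = sorted(bboxes, key=lambda b: b[1], reverse=True)  # highest probability first
    kept = []
    while bboxes:
        chosen = bboxes.pop(0)
        # Keep boxes of other classes, or boxes that overlap little with the chosen one.
        bboxes = [
            b for b in bboxes
            if b[0] != chosen[0]
            or iou_corners(torch.tensor(chosen[2:]), torch.tensor(b[2:])) < iou_threshold
        ]
        kept.append(chosen)
    return kept
```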


Done with that as well, we can now focus on the boxes we actually care about. Next up is mean average precision.
So we have an object detection model; how do we evaluate it? The most common metric out there (currently) is Mean Average Precision (mAP). As usual, we will quickly cover the basics and jump into the code.
We trained our model, now we are testing using the test or validation data. We will use precision/recall to evaluate. So before doing more, let’s go over precision and recall really quickly.
When we make a prediction, we are either right or wrong, though we can be wrong in different ways: we can say false to something that was true, or true to something that was false. This introduces the ideas of false positives and false negatives. A false positive is when the predicted value is positive but the actual result is negative, and a false negative is the opposite. Of course, for this, we need to assign truth values to the results.
The other notions introduced here are true positives and true negatives. True positives are the positive values our model predicted correctly, and true negatives are the negative values our model predicted correctly.
In our case, for object detection, the predictions we make are the positives, and the predictions we didn’t make are the negatives. So false negatives would be the target boxes that we failed to predict (I will explain in a bit how we decide whether we predicted a box right, though you can already guess). If we combine true positives and false positives we get all the predictions we made. If we divide the correct predictions by all predictions we get precision, so $p=\frac{TP}{TP + FP}$. If we combine all the truths, so all the target values whether or not we predicted them right, we reach recall, $r = \frac{TP}{TP + FN}$. The diagram below explains it perfectly.
Now that we know what precision and recall are, how are they used for evaluation in our case?
First of all, how do we know if a prediction is wrong? Yes, we will use IoU as described above. If the IoU value (with a target) is greater than some threshold we will assume that box is correct.
Here are the steps for finding mean average precision:
1. For each class, gather all the predicted boxes and sort them by confidence.
2. Walk down the ranked predictions, marking each one as a true positive or a false positive by checking its IoU against the not-yet-matched target boxes.
3. Compute cumulative precision and recall as you go.
4. Take the area under the resulting precision-recall curve; this is the average precision (AP) for that class.
5. Average the APs over all classes to get the mean average precision for one IoU threshold (and optionally average over several thresholds).
Well, that seemed longer than it actually is, let’s dive into code to get a better grasp.
To make things easier to follow, I want to start with the function definition, so we have all the variables set in place before we piece everything else together.


After that, we will continue with the first step as usual: convert the point format…
In the main part, we will iterate through all the classes, and keep our attention on those only. So to do that we get the targets and predictions for a single class to begin with. We also create a list to keep track of the average precisions.


We will use only these boxes for our next steps (so we only focus on one class at a time). This is preferred since we need to check each box with possible targets. It will make more sense in a bit.
Next up, we sort our predictions based on their probabilities and create a couple of variables for tracking. We define a `precisions` list for keeping true positives and false positives: 1’s will be TP and 0’s will be FP.


Now that we are set, we iterate through all the predictions. While only considering the target boxes that are in the same image, we check each target and decide whether the prediction we are examining passes the IoU threshold for that target. If so, we mark that target as done so we don’t consider it for the next prediction, and we add a true positive to our precisions (appending a 1).


Now we need to calculate precisions and recalls, just like we mentioned while explaining the algorithm (the cumulative numerator/denominator idea). We also add an extra point so the graph (for the AUC) starts from recall 0. Lastly, we use the trapezoidal rule to calculate the AUC of the precision-recall curve.
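That step might be sketched like this (assuming `TP` is a 0/1 tensor ordered by prediction confidence and `total_true_bboxes` counts the targets for the class; the epsilons are my guard against division by zero):

```python
import torch

def average_precision(TP: torch.Tensor, total_true_bboxes: int) -> torch.Tensor:
    # Cumulative TP/FP counts as we walk down the ranked predictions.
    TP_cumsum = torch.cumsum(TP, dim=0)
    FP_cumsum = torch.cumsum(1 - TP, dim=0)
    recalls = TP_cumsum / (total_true_bboxes + 1e-6)
    precisions = TP_cumsum / (TP_cumsum + FP_cumsum + 1e-6)
    # Prepend (recall=0, precision=1) so the curve starts at the y-axis.
    precisions = torch.cat((torch.tensor([1.0]), precisions))
    recalls = torch.cat((torch.tensor([0.0]), recalls))
    # Trapezoidal rule for the area under the precision-recall curve.
    return torch.trapz(precisions, recalls)
```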


Let’s put all the bells and whistles together and get our fully formed function:


That’s it! Wait… one last thing. This was for a single IoU threshold, and we will need more than that. Let’s write a simple function that calls our `mean_average_precision` for each threshold.


And we are fully done. Now we know every bit we need to go ahead and implement our first object detection algorithm, which will be the first version of the still state-of-the-art YOLO family. It is now at YOLOv7, but we will start with implementing v1.
You can find all the code from this tutorial here.
In this post, I will work my way into basic Sentiment Analysis methods and experiment with some techniques. I will use the data from the IMDB review dataset acquired from Kaggle.
We will be examining/going over the following:
Your model will be, at most, as good as your data, and your data will only be as good as your understanding of it, hence the features. I want to take the most naive approaches along with some quick methods and benchmark them, both for prediction success and for training and prediction time.
Before anything else, let’s load, organize and clean our data really quick:


Let’s start with creating a proper and clean vocabulary that we will use for all the representations we will examine.
To begin with, we just read all the words as a set:
So at the beginning we have 331.056 words in our vocabulary. This number includes all kinds of nonsense, though. We also didn’t consider any lowercase/uppercase conversion. So let’s clean it up step by step.
We reduced the number from 331.056 to 84.757. We can do more. With this method, we encode every word we see in every form possible. So, for example, “called,” “calling,” “calls,” and “call” will all be separate words. Let’s get rid of that and reduce them to their roots. Here we start getting help from the dedicated NLP library NLTK, since I don’t want to define all these rules myself (nor could I):
The last step of cleaning will be to get rid of stopwords. These are words like ‘and,’ ‘are,’ ‘is,’ etc. in the English language.
Now that we have good words, we can set up a lookup table to keep encodings for each word.
Now we have a dictionary for every proper word we have in the data set. Therefore, we are ready to prepare different feature representations.
Since we will convert sentences into this clean form again and again later on, let’s create a function that combines all these methods:
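In spirit, the combined cleaner looks like this (a toy sketch: the hand-rolled `toy_stem` and the tiny stopword set are stand-ins for NLTK’s `PorterStemmer` and `stopwords`, so the real function would differ):

```python
import re

STOP_WORDS = {"is", "are", "the", "a", "an", "and", "this", "it"}  # tiny stand-in set

def toy_stem(word):
    # Crude suffix stripping, standing in for a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean_sentence(sentence):
    words = re.findall(r"[a-z]+", sentence.lower())    # tokenize, keep letters only
    words = [w for w in words if w not in STOP_WORDS]  # drop stopwords
    return [toy_stem(w) for w in words]                # reduce words to their roots
```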


Ideally, we could initialize `tokenizer`, `stemmer`, and `stop_words` globally (or as a class parameter), so we don’t have to keep reinitializing them.
This will represent every word we see in the database as a feature… Sounds infeasible? Yeah, it should. I see multiple problems here. The main one we all think of is that this is a massive vector for each sentence, with a lot of zeros (hence the name “sparse”). This means most of the data is telling us practically the same thing as the minor part: we have these words in this sentence vs. we don’t have all those words. Second, we are not keeping any correlation between words (since we are just examining them word by word).
We go ahead and create a function for encoding every word for a sentence:
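A sketch of such an encoder (assuming `vocab` is the word-to-index lookup table built earlier):

```python
import numpy as np

def sparse_encode(words, vocab):
    # words: already-cleaned tokens; vocab: {word: index}
    vec = np.zeros(len(vocab))
    for w in words:
        if w in vocab:  # unseen words are simply skipped
            vec[vocab[w]] = 1.0
    return vec
```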
We then convert all the data we have using this encoding (in a single matrix):
That’s it for this representation.
This version practically reduces the 10.667 dimensions to 3 instead. We are going to count the number of negative sentences a word appears in, as well as the positive ones. This will give us a table indicating how many positive and negative sentences each word has been found in:
The next thing to do is to convert these enormous counts into probabilities. There are multiple points to add here. First, we are getting the probability of a single word being in many positive and negative sentences, so the values will be minimal; hence we need to use a log scale to avoid floating point problems. Second, we might get words that don’t appear in our dictionary, which would have a likelihood of 0. Since we don’t want a division by zero, we add Laplacian smoothing, which normalizes all the values with a small initial count. Here goes the code:
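A sketch of that computation (assuming `freqs[word] = [pos_count, neg_count]` counts built from the training sentences; the smoothing constant `alpha` is my choice):

```python
import numpy as np

def log_likelihoods(freqs, alpha=1.0):
    # freqs: {word: [pos_count, neg_count]}
    vocab_size = len(freqs)
    total_pos = sum(p for p, _ in freqs.values())
    total_neg = sum(n for _, n in freqs.values())
    table = {}
    for word, (pos, neg) in freqs.items():
        # Laplacian smoothing avoids zero probabilities; logs avoid underflow.
        p_pos = (pos + alpha) / (total_pos + alpha * vocab_size)
        p_neg = (neg + alpha) / (total_neg + alpha * vocab_size)
        table[word] = np.log(p_pos) - np.log(p_neg)
    return table
```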
After getting the frequencies and fixing the problems we mentioned, we now define the new encoding method for this version of the features
We end by converting our data as before
Let’s take a sneak peek at what our data looks like:


A better approach would be to use PCA for this kind of representation, but for now we will ignore that, since we want to explore it in episode 2.
This episode mainly focuses on cleaning the data and developing decent representations. This is why I will only include Logistic Regression for the representation comparison; we can then compare Naive Bayes and Logistic Regression to pick a baseline for ourselves.
Logistic regression is a simple single-layer network with sigmoid activation. It is an excellent baseline, as it is one of the simplest binary classification methods. I am not explaining this method in depth, so if you want to learn more, please do so. I will use a simple `PyTorch` implementation.
We then define the loss function and the optimizer. I am using Binary Cross Entropy for the loss and Adam for the optimization, with a learning rate of `0.01`.
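A minimal sketch of that setup (the input dimension here is a placeholder, and `train_step` is my own helper name):

```python
import torch
import torch.nn as nn

input_dim = 3  # e.g. the 3-dimensional word-frequency representation

# Logistic regression: a single linear layer followed by a sigmoid.
model = nn.Sequential(nn.Linear(input_dim, 1), nn.Sigmoid())

loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

def train_step(x, y):
    # One optimization step for a batch (x, y).
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```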


Sparse Representation Training: We first train on the sparse representation. I trained for `100` epochs and reached `0.614` training accuracy and `0.606` validation accuracy. Here is the learning curve:
Word Frequency Representation Training: I trained using the same parameter settings as above, reaching `0.901` training accuracy and `0.861` validation accuracy. Here is the learning curve on the log scale:
The next really good baseline is Naive Bayes. This is a very simple model that is fast to train and achieves very good accuracy. Naive Bayes is a probabilistic model that uses Bayes’ theorem to calculate the probability of a class given the input. The main assumption of this model is that the features are independent of each other, which is why it is called naive. To give a basic intuition of how this model works, let’s say we have the sentence `I love this movie` and we want to classify it as positive or negative. We first calculate the probabilities of the sentence being positive and negative using the conditional frequency probabilities we calculated above, and multiply them by the prior probability of each class. The class with the highest probability is the predicted class.
To put it in other terms, this is the Bayes Rule:
$$P(C|X) = \frac{P(X|C)P(C)}{P(X)}$$
We then calculate $P(w_i|pos)$ and $P(w_i|neg)$ for each word in the sentence, where $w_i$ is the $i^{th}$ word in the sentence and $pos$ and $neg$ are the positive and negative classes respectively. We then multiply the ratios of these, so:
$$\prod_{i=1}^{n} \frac{P(w_i|pos)}{P(w_i|neg)}$$
If the result is greater than 1, we predict the sentence to be positive, otherwise negative. When we convert this to log space and add the log prior, we get the Naive Bayes equation:
$$\log \frac{P(pos)}{P(neg)} + \sum_{i=1}^{n} \log \frac{P(w_i|pos)}{P(w_i|neg)}$$
We now implement this in python and numpy.
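A compact sketch of that rule (assuming the per-word log-ratio table from the previous section and a precomputed log prior):

```python
def naive_bayes_predict(words, log_ratios, log_prior=0.0):
    # Sum of per-word log likelihood ratios plus the log prior;
    # unseen words contribute 0 (a neutral vote).
    score = log_prior + sum(log_ratios.get(w, 0.0) for w in words)
    return 1 if score > 0 else 0  # 1 = positive, 0 = negative
```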


Here we recreate the frequency table as `lambda_`, converting the counts to frequencies as well as log likelihoods, so we have a self-contained naive Bayes method.
We then test and get `0.9` for training accuracy and `0.859` for test accuracy.


So we got pretty much the same result as logistic regression. The upside of Naive Bayes is that it is very fast to train and achieves very good accuracy. The downside is that it is not very flexible and does not capture relationships between the features. This is why we use more complex models like neural networks. Later on I might write another post on more mature methods.
This post includes my notes from the lecture “Makemore Part 3: Activations & Gradients, BatchNorm” by Andrej Karpathy.
Fixing the initial loss:
* At initialization we don’t want the network to be confidently wrong, so we scale the output layer’s weights down (e.g., multiplying them by `0.01`) so that the initial logits are near zero and the softmax is close to uniform.
* Having many pre-activations land on the flat regions of tanh, at -1 and 1, is really bad, since the gradient there will be 0 (vanishing gradient). This is called saturated tanh.
* So we keep the pre-activations from going above 1 or below -1 by reducing the initial weight values, and then continue the training process.

Okay, we know how to fix initialization now, but how much should we reduce these numbers? Meaning, what is the value we should scale the layers with? Here comes Kaiming init.
Here are two plots: the left is for `x` and the right is for `y` (the pre-activation, `x @ w`) layer values. We see that even though `x` and `w` are gaussian with zero mean and unit standard deviation, the result of their dot product, `y`, has a non-unit standard deviation (though it is still gaussian).
We don’t want this in a neural network, we want the nn to have relatively simple activations, so we want unit gaussian throughout the network.
To keep the std of `y` at unit, we need to scale `w` down, as shown in the figure below (`w` scaled by `0.2`), but by what exactly?
Mathematically, this scale is equal to one over the square root of the fan-in (the number of input dimensions, e.g. `10` for a tensor of shape `(10, 1000)`).
Depending on the activation function used, this value needs to be scaled by a gain. This gain is $\frac{5}{3}$ for tanh, $1$ for linear, and $\sqrt{2}$ for relu. These values compensate for how relu and tanh shrink and clamp the values.
Kaiming init is implemented in PyTorch as `torch.nn.init.kaiming_normal_`.
Since the development of more sophisticated techniques in neural networks, accurately initializing the weights has become less critical. To name some: residual connections, normalizations (batch normalization etc.), and modern optimizers (Adam, RMSProp).
In practice, just normalizing by the square root of the fan-in is enough.
Now that we have seen how to initialize the network, and mentioned some methods that make this process more relaxed, let’s talk about one of these innovations: batch normalization.
We mentioned while training the network that we want balance in the pre-activation values: we don’t want them to be zero or too small, so that tanh actually does something, and we don’t want them to be too large, because then tanh is saturated.
So we want roughly a uniform gaussian at initialization.
Batch Normalization basically says, why don’t we just take the hidden states and normalize them to be gaussian.
Right before the activation, we standardize the pre-activations to be unit gaussian. We do this by taking the mean and std of the batch and scaling the values. Since all these operations are easily differentiable, there will be no issues during the backprop phase.
For our example, we will have the batch norm right before the activation. For this we need to calculate the mean and standard deviation of the batch:
Here we use `0` for the dimension, since the shape of `preact` is `[num_samples, num_hidden_layers]` and we want the mean and std across all the samples for the weights connecting to one hidden unit. So the dimensions of `hmean` and `hstd` will be `[1, num_hidden_layers]`. In the end we update our `hpreact` to:
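In code, this update could look like the following (the shapes here are illustrative):

```python
import torch

num_samples, num_hidden = 32, 100
hpreact = torch.randn(num_samples, num_hidden) * 3.0  # pre-activations

hmean = hpreact.mean(0, keepdim=True)  # shape [1, num_hidden]
hstd = hpreact.std(0, keepdim=True)    # shape [1, num_hidden]
hpreact = (hpreact - hmean) / hstd     # roughly unit gaussian per hidden unit
```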


If we leave it at that, the pre-activations are forced to be unit gaussian at every step of the training. We want this only at initialization; in the general case we want the neural network to be able to move and scale the distribution. So we introduce one more component, called scale and shift.
These will be two new parameter sets we add to our list; we initialize the scale with `1` and the shift with `0`. We then backpropagate through these values and give the network the freedom to shift and scale the distribution.
We then update our `hpreact_bn`:
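With the trainable scale and shift, the batch-norm line becomes something like (shapes illustrative; parameter names are common conventions, not necessarily the post’s):

```python
import torch

num_samples, num_hidden = 32, 100
hpreact = torch.randn(num_samples, num_hidden)

# Trainable parameters: start as identity (scale 1, shift 0).
bngain = torch.ones((1, num_hidden), requires_grad=True)
bnbias = torch.zeros((1, num_hidden), requires_grad=True)

hpreact_bn = bngain * (hpreact - hpreact.mean(0, keepdim=True)) / hpreact.std(0, keepdim=True) + bnbias
```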


We also add the new parameters to our parameter list, to be updated while optimizing the network:


It’s common to use batch norm throughout the neural network to allow more relaxed initializations.
When we introduce batch norm, we make the results of the forward and backward pass for any one input dependent on its batch. Meaning, the result for a single sample now depends not just on itself but on the batch it came with as well. Surprisingly, this turns out to be a good thing, acting as a regularizer.
This coupling effect is not always desired, which is why researchers have looked into other non-coupling normalizers, such as layer normalization.
One thing that still needs adjustment is how to use batch norm in the testing phase. We trained the network on batches using the batch mean and std, but when the model is deployed, we want to feed it a single sample and get a result based on that alone. The first method for accomplishing this is to calculate the exact mean and std over the complete dataset after training, like:
And using `bn_mean` and `bn_std` instead of `hmean` and `hstd` from the training loop.
We can further eliminate this step using a running mean and std. For this purpose we introduce two new parameters:
Then in the main training loop we update these values slowly. This will give us a close estimate.
There is a minor addition of $\epsilon$ in the paper to the denominator of the batch normalization, to avoid division by zero. We did not make use of this epsilon, since it is highly unlikely that we get a zero std in our case.
The last fix we need is for the bias. When we introduced batch norm, we made the bias `b1` useless. This is due to subtracting the mean after applying the bias: since the mean includes the bias, we are practically adding and removing the same value, an unnecessary operation. With batch norm, we do not need an explicit bias for that layer; instead the batch norm bias, `bnorm_bias`, handles the shifting of the values.
Use batchnorm carefully. It is really easy to make mistakes, mainly due to the coupling. More recent networks usually prefer layer normalization or group normalization. Batchnorm was very influential around 2015, since it enabled reliable training of deeper networks by controlling the statistics of the activations.
Training neural networks without tools that make initialization more forgiving, such as Adam or batch normalization, is excruciating. Here we introduce a multitude of techniques to evaluate the health of the neural network.
First off: the activation distribution throughout the layers. We are using a somewhat deep network, with 5 layers, to be able to see the effects. Each linear layer is followed by a `tanh`. As we saw before, the `tanh` Kaiming gain is $\frac{5}{3}$. Here is how the activations look when we have it right:
We see that the layers have fairly similar activations throughout; saturation is around 5%, which is what we wanted. If we change the scaling value to $1$ instead:
We get unbalanced activations with 0 saturation. To see it even more clearly, let’s set the value to $0.5$:
The next test is on gradients. As before, we want the gradients throughout the layers to be similar. Here is the gradient distribution when we actually use $\frac{5}{3}$ as our scaling value:
As opposed to $3$:
We can see here that the gradients are shrinking.
What are we checking:
The grad:data ratio gives us an intuition of the scale of the gradient compared to the actual values. This is important because we will be taking an update step of the form `w = w - lr * grad`. If the gradient is too large compared to the actual values, we will overshoot the minimum; if it is too small, we will take too many steps to reach the minimum.
The std of the gradient is a measure of how much the gradient changes across the weights. If the std for a layer is too different from the std of the other layers, this will be an issue because this layer will be learning at a different rate than the other layers.
This is for the initialization phase. If we let the network train for a while, it will fix this issue by itself. Nevertheless, it is an issue, especially if we are using a simple optimizer like SGD; with an optimizer like Adam, it is handled automatically.
Here are examples;


We can see that the ratio of the last layer is way too large, as is its standard deviation, which is why the pink line on the graph is too wide.
We calculate the ratio of the update’s std to the data’s std, which gives us a measure of the effective learning rate. Roughly, all the layers should sit around `-3` on the log10 scale. The formula, for each epoch: `[(lr * p.grad.std() / p.data.std()).log10().item() for p in params]`.
Supervised learning is the most common machine learning problem. In supervised learning we already know what the correct output should look like.
There are two categories of supervised learning: the first is the “regression problem” and the second is the “classification problem.” I will explain both with examples.
In regression we have an output that is continuous, and we are trying to predict the correct answer/output/label for our input. Let’s see an example to understand the concept better.
Let’s say we have a friend who has a chocolate company. He has a lot of money and he wants his product to sell as much as Snickers. OK. But his chocolates are not as famous as Snickers. Now, what he should do is take a look at the competitors. There is a chart with two dimensions: one is the price, the other is the popularity. Since we have a continuous output for the prices, we will predict the one we are looking for. (I will just assign the popularities according to myself.)
Now, looking at this output, what we should do is fit a straight or polynomial line to the outputs.
Then we will have our line, which will help us predict the price. According to the surveys, our chocolates have 8 points for popularity. So what will be the best price according to the survey and the industry…
It seems something like 70¢ …
This is the regression problem …
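With made-up numbers (hypothetical popularity/price pairs that happen to lie on a straight line), the fit-and-predict step can be sketched as:

```python
import numpy as np

# Hypothetical competitor data: (popularity, price in cents).
popularity = np.array([2.0, 4.0, 6.0, 9.0])
price = np.array([10.0, 30.0, 50.0, 80.0])

# Fit a straight line (degree-1 polynomial) through the points.
slope, intercept = np.polyfit(popularity, price, 1)

# Our chocolate scored 8 in popularity; predict its price.
predicted = slope * 8.0 + intercept  # 70.0 for this made-up data
```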
In classification, in the simplest case, binary classification, we have two options: either true or false. We can also say that we will predict the result on a binary map. Let’s check an example.
Let’s give an absurd example so that it will be more permanent. We have a friend who just ate 5 kilos of Nutella, and he is 24 years old. We want to predict if he will get sick or not. And we have a dataset with the ages of people who ate 5 kilos of Nutella and whether they got sick or not!
So, according to this graph, will our friend get sick or not? It is a binary example; there are just two possibilities. This is a classification problem. Let’s see the expected result…
(He will probably get sick, according to our prediction.)
Unsupervised learning is the second most common machine learning problem. In unsupervised learning we don’t know the result for each input; instead, we obtain a structure from the data. We do not know the exact effects of our inputs. We will use clustering for this.
We will basically separate the data according to variables.
Let’s say you got a hundred different composition classes’ final essays. They all have different topics. What clustering does is classify all the essays according to their topics, so if we use clustering, all these classes’ essays will be separated. This is just one variable (topic). If you want, you can add more variables to make the groups more specific; in this case we could add word count, for example.