I recently had the opportunity to develop a sentiment analysis tool for my company. Although I had some prior experience in this area, I quickly realized that I had more to learn. After extensive research and experimentation, we achieved the desired results. In this post, I’ll share my journey, thought process, and the techniques I employed to meet our objectives.
Identifying the Issue & Setting the Stage
Our startup specializes in delivering top-notch qualitative coding services to businesses, presenting the results on a user-friendly dashboard for our clients. In an effort to better serve their needs, we decided to incorporate sentiment analysis as a key feature.
Sentiment analysis is a popular NLP task that classifies text based on its sentiment. This can be accomplished in various ways, such as categorizing text as positive, negative, or neutral. Alternatively, more nuanced classifications like very positive, positive, neutral, negative, and very negative can be used. Other sentiment analysis tasks, like emotion classification or aspect-based sentiment analysis, focus on different aspects of the text. You can learn more about these tasks in the references at the end of this post.
Ultimately, we chose the most common sentiment analysis task, which classifies text as positive, negative, or neutral. This approach offers the greatest flexibility in terms of data use and compatibility with existing models.
Having settled on our sentiment analysis task, the next step was to find a pre-trained model to serve as a baseline for comparison. However, we first encountered a challenge: our data was not in the same format or domain as the data these models were trained on, nor as other publicly available datasets. Consequently, we needed labeled data of our own to test the models and determine which one performed best for our specific needs.
Data Labeling
Our first task was to label our data. Given the sheer volume of data and time constraints, we opted to label a small subset. We employed Doccano, a user-friendly tool designed for effortless data labeling. You can find more details about Doccano on its GitHub page.
With the labeling complete, we had a modest dataset of 200 samples, chosen via stratified sampling, to test our models. While our initial plan was to label 1,000 samples, we reduced it to 200 to save time.
Pre-trained Models
Armed with our labeled data, we set out to test various models. Our first port of call was HuggingFace’s Transformers, which offers a range of attention-based Transformer models known for their exceptional performance in NLP tasks, including sentiment analysis. Later in this post, I’ll discuss some specific base models I used, their distinctions, and my rationale for selecting them.
For our initial testing, I chose several top-ranked models from HuggingFace’s Transformers along with a baseline, VADER, a rule-based sentiment analysis tool, and compared the Transformer models’ results against it. In light of GPT-3.5 and GPT-4’s success, I also incorporated a few zero-shot and few-shot setups with the GPT models via the OpenAI API.
Here’s a list of the models I utilized:
- VADER
- Huggingface “sbcBI/sentiment_analysis_model”
- Huggingface “cardiffnlp/twitter-xlm-roberta-base-sentiment”
- Huggingface “Seethal/sentiment_analysis_generic_dataset”
- Huggingface “LiYuan/amazon-review-sentiment-analysis”
- Huggingface “ahmedrachid/FinancialBERT-Sentiment-Analysis”
- Huggingface “mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis”
- PySentimento
- GPT-3.5 (zero-shot, few-shot)
- GPT-4 (zero-shot, few-shot)
Now, let’s delve into basic usage examples for each model type and our initial results.
VADER
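VADER is available through the vaderSentiment package (it also ships with NLTK). Below is a minimal sketch of how it can be used, mapping the compound score to our three labels with the conventional ±0.05 threshold; the example sentence is illustrative:

```python
# pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def vader_label(text: str) -> str:
    """Map VADER's compound score (range -1 to 1) to positive / neutral / negative."""
    compound = analyzer.polarity_scores(text)["compound"]
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

print(vader_label("The dashboard makes our coding results so much easier to read!"))
```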
HuggingFace Transformers
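With HuggingFace’s `pipeline` API, every checkpoint in the list above can be tried with the same few lines. A minimal sketch (the example text is illustrative):

```python
from transformers import pipeline

# Any of the checkpoints listed above can be swapped in here.
classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-xlm-roberta-base-sentiment",
)

# Returns a list of {'label': ..., 'score': ...} dicts, one per input text.
print(classifier("The dashboard is easy to use, but exports are slow."))
```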
PySentimento
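PySentimento is distributed on PyPI as pysentimiento. A minimal sketch using its English sentiment analyzer (the example text is illustrative):

```python
# pip install pysentimiento
from pysentimiento import create_analyzer

analyzer = create_analyzer(task="sentiment", lang="en")

result = analyzer.predict("The new report feature saves us hours every week.")
print(result.output)   # one of POS / NEU / NEG
print(result.probas)   # probability per class
```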
GPT-3.5/4
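Our actual prompt isn’t shown here, but the call itself looks roughly like the sketch below. It uses the chat completions endpoint of the openai SDK available at the time (newer SDK versions use `client.chat.completions.create` instead), and the system prompt is a simplified placeholder rather than our real instructions:

```python
import openai

openai.api_key = "YOUR_API_KEY"

def gpt_sentiment(text: str, model: str = "gpt-3.5-turbo") -> str:
    """Zero-shot sentiment classification with a chat model."""
    response = openai.ChatCompletion.create(
        model=model,
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "Classify the sentiment of the user's text as positive, "
                           "negative, or neutral. Answer with a single word.",
            },
            {"role": "user", "content": text},
        ],
    )
    return response["choices"][0]["message"]["content"].strip().lower()

print(gpt_sentiment("Support never answered my ticket."))
```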
Evaluation Metrics
To effectively assess our models’ performance, we need to employ appropriate evaluation metrics. Common metrics for sentiment analysis include:
- Accuracy
- Precision
- Recall
- F1 Score
Using a combination of these metrics allows for a more comprehensive understanding of the model’s performance, especially when dealing with an imbalanced dataset. For instance, if we have 1,000 samples with 900 positive and 100 negative, we could achieve 90% accuracy simply by always predicting positive. However, this doesn’t necessarily indicate a good model. Therefore, we need to utilize additional metrics to evaluate our model’s performance.
The F1 score combines precision and recall, making it an ideal choice for our evaluation. Consequently, we opted to use both F1 score and accuracy as our evaluation metrics.
Below is the function we’ll use to calculate accuracy and F1 score.
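A minimal version of such a helper using scikit-learn; whether to average the F1 score as weighted or macro is a choice, and weighted is shown here because of our imbalanced labels:

```python
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(y_true, y_pred):
    """Return accuracy and F1 for lists of true and predicted labels."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # 'weighted' accounts for class imbalance; 'macro' is a common alternative.
        "f1": f1_score(y_true, y_pred, average="weighted"),
    }

print(compute_metrics(["positive", "neutral", "negative"],
                      ["positive", "negative", "negative"]))
```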
Initial Results
With our models and evaluation metrics in place, we can now test the pre-trained models using the 200 labeled samples. Since no training is involved, we’ll use all the data for testing.
These results serve as a sanity check and a general evaluation of how closely our data aligns with the training data used for the models. If our data happens to be highly similar to the training data, we can expect favorable results and stop there. However, if the results are unsatisfactory, we’ll need to put in more effort to obtain better results or find a more suitable model.
Below are the accuracy and F1 score plots for all the models:
As evident from the plots, the VADER model performs the worst, while the GPT-4 model emerges as the best-performing one. GPT-3.5 also delivers relatively strong results. The Hugging Face models, on the other hand, don’t perform quite as well. The best open-source model is PySentimento, but its performance still falls short of our desired level.
It’s worth noting that our data labeling is complex, making it difficult even for humans. This could introduce some bias in the data, but we won’t delve into that in this post since the data itself won’t be disclosed.
The GPT-3.5 and GPT-4 models, both zero-shot, show promising performance, and we could potentially achieve even better results with few-shot prompting.
Considering the potential of GPT models and the underwhelming performance of the pre-trained sentiment analysis models, we decided to first explore the GPT-3.5 and GPT-4 models and then train our own sentiment analysis model using GPT as the labeler. This approach gives us a smaller open-source model for our system, offering performance comparable to the GPT models without the per-request API costs.
Evaluating GPT-3.5 and GPT-4
We began by testing different prompting methods on the same small dataset to determine the best approach for labeling our sentiment analysis model.
Aside from the prompt wording, we also tested the overall prompting setup. We introduced a parameter called `sample batch size` for this task, in which each sample is labeled independently of the others. This parameter controls the number of samples sent to the model in a single request. It is crucial, since sending many samples simultaneously makes it more challenging for the model to generate a correct label for every one of them. The benefit of batching, however, is cost efficiency, since the same pre-prompt (the instructions) doesn’t need to be repeated for each sample.
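To make this concrete, a `sample batch size` of n simply means packing n numbered texts into one request and asking the model to return one label per line. The helper below is a simplified sketch, not our production prompt:

```python
def build_batched_prompt(texts):
    """Pack several samples into one request so the instructions are sent only once."""
    numbered = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(texts))
    return (
        "Classify the sentiment of each numbered text as positive, negative, or neutral.\n"
        "Return one label per line, in the same order as the inputs.\n\n" + numbered
    )

batch = ["Great support team!", "The export keeps failing.", "It works."]
# sample batch size = 3 -> one API call instead of three
print(build_batched_prompt(batch))
```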
While we won’t delve into the specifics of the prompts used, we ensured that our instructions to the model were clear. GPT models allow us to explain what we want from the model, so we provided detailed definitions of positive, negative, and neutral sentiments.
The results for different prompting methods are shown below:
We included four metrics in the plot:
- Accuracy: This primary measure of our model’s prediction capabilities shows that both GPT-3.5 and GPT-4 perform well with a `sample batch size` of 1. The performance drops significantly with a `sample batch size` of 10.
- F1 Score: This combination of precision and recall follows the same pattern as accuracy.
- Price: This cost metric is essential if we plan to use this model in production. For example, a `sample batch size` of 1 is more expensive than a `sample batch size` of 10.
- Time: This measures the time it takes to generate labels, which is also important if we use this model in production.
Both GPT-3.5 and GPT-4 perform well, with a `sample batch size` of 1 outperforming a `sample batch size` of 10. Though GPT-4 performs slightly better, we chose GPT-3.5 due to its lower cost and faster processing time.
To train an open-source model, we’ll use GPT-3.5 to generate the majority of the labels (120,000 data points) and GPT-4 for an additional 10,000 data points. This approach will help us assess how closely we can achieve GPT-4 performance with a smaller model.
Training a Sentiment Analysis Model
Now that we have the labels for our data, we can start training our sentiment analysis model. We will walk through the following steps in order:
- Data Preprocessing and Preparation
- Model Selection
- Model Training
- Model Evaluation
Data Preprocessing and Preparation
Before training our sentiment analysis model, we need to preprocess and prepare the data. The process involves the following steps:
- Split the data into train and test sets. We will allocate 80% of the data for training and 20% for validation. The extra 10,000 data points generated with GPT-4 will serve as our test set.
- Convert the labels to numbers using the following mapping:
- Positive: 2
- Neutral: 1
- Negative: 0
- Address any malformed labels generated by GPT-3.5, such as outputs that were just dots or labels phrased differently from the three we asked for. We can either drop these data points or correct them; in our case, we chose to manually fix most of them and leave a few outliers untouched.
- Check the distribution of labels in the train and test sets. This will give us an idea of how balanced our data is. It’s essential to ensure that our dataset is balanced to avoid biases during training and evaluation.
Here’s the code for the above steps:
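Since the real code (and data) can’t be shared, here is a minimal sketch of those steps; the file paths and column names are illustrative, and for brevity it drops malformed labels instead of fixing them by hand:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed layout: a 'text' column and a 'label' column.
df = pd.read_csv("gpt35_labeled.csv")       # 120k GPT-3.5-labeled samples (path is illustrative)
test_df = pd.read_csv("gpt4_labeled.csv")   # 10k GPT-4-labeled samples used as the test set

label2id = {"negative": 0, "neutral": 1, "positive": 2}

# Drop anything GPT returned that isn't one of the three labels
# (stray dots, rephrasings, etc.), then map labels to numbers.
df["label"] = df["label"].str.strip().str.lower()
df = df[df["label"].isin(label2id)]
df["label"] = df["label"].map(label2id)
test_df["label"] = test_df["label"].str.strip().str.lower().map(label2id)

# 80/20 train/validation split, stratified to keep label proportions comparable.
train_df, val_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)

# Check how balanced the labels are in each split.
for name, split in [("train", train_df), ("val", val_df), ("test", test_df)]:
    print(name, split["label"].value_counts(normalize=True).round(3).to_dict())
```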
The distribution of labels in our dataset is somewhat imbalanced. We’ve observed that negative labels are the most common, while positive labels are the least common. We took this into account when deciding how to evaluate our model. Still, it’s a good idea to be aware of the label distribution in the dataset.
Next, we’ll create a data loader for PyTorch and prepare the data for training by creating a `Dataset` class.
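A sketch of what such a `Dataset` might look like; the tokenizer comes from the model selection step below, and the maximum length of 128 tokens is an assumption:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SentimentDataset(Dataset):
    """Tokenizes texts and returns tensors the model (and Trainer) can consume."""

    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = list(texts)
        self.labels = list(labels)
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
        )
        item = {k: v.squeeze(0) for k, v in encoding.items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

# Usage (the tokenizer is created in the model selection step below):
# train_dataset = SentimentDataset(train_df["text"], train_df["label"], tokenizer)
# val_dataset = SentimentDataset(val_df["text"], val_df["label"], tokenizer)
# train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
```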
Model Selection
Now it’s time to choose a model to train. We’ll use the `transformers` library to train our models, building a classifier on top of pre-trained language models such as BERT. In this section, we’ll discuss the different models we considered, their differences, pros and cons, and then we’ll implement, train, and evaluate each model. The models we considered include:
- BERT
- RoBERTa
- DistilBERT
- XLM-RoBERTa
- GPT2
- RoBERTa-Large
- DistilBERT-Large
- GPT2-Medium
We have 8 different models. Let’s go over each model and explain how they differ.
BERT
BERT, introduced by Google in 2018, is a transformer-based model that marked a significant milestone in the field of NLP. It achieved state-of-the-art results on various tasks and can be fine-tuned for specific tasks like sentiment analysis and question-answering. BERT uses bidirectional context, allowing the model to better understand the textual context. However, it’s not the most recent model, so it’s usually outperformed by newer models.
Pros:
- High performance on many NLP tasks.
- Fine-tuning capabilities.
- Bidirectional context for better understanding.
Cons:
- Large model size.
RoBERTa
RoBERTa, an optimized version of BERT, was introduced by Facebook AI in 2019. It builds upon BERT’s architecture but includes several modifications that improve its performance. RoBERTa uses a larger training dataset, longer training time, and removes the next-sentence prediction task during pre-training. It also employs dynamic masking, resulting in better performance on downstream tasks.
Pros:
- Improved performance compared to BERT.
- Retains benefits of BERT.
- Fine-tuning capabilities.
Cons:
- Large model size and high computational requirements.
DistilBERT
DistilBERT, a smaller version of BERT, was introduced by Hugging Face in 2019. It aims to maintain most of BERT’s performance while reducing its size and computational requirements. DistilBERT has roughly 40% fewer parameters than BERT-base and is noticeably faster during training and inference.
Pros:
- Reduced model size and faster training and inference.
- Retains a substantial portion of BERT’s performance.
- Fine-tuning capabilities.
Cons:
- Slightly lower performance compared to BERT and RoBERTa.
XLM-RoBERTa
XLM-RoBERTa is a multilingual version of RoBERTa, introduced by Facebook AI in 2019. It’s pre-trained on data covering 100 languages and aims to offer improved performance on cross-lingual tasks, such as cross-lingual classification and multilingual sentiment analysis.
Pros:
- Multilingual model for cross-lingual tasks.
- Retains benefits of RoBERTa.
- Fine-tuning capabilities.
Cons:
- Large model size and high computational requirements.
GPT2
GPT2, a transformer-based language model, was introduced by OpenAI in 2019. It is a large, generative model that generates text one token at a time, using a left-to-right autoregressive language modeling (LM) objective. GPT2 is generally better at generating creative text compared to BERT. Since our goal is to imitate GPT-generated output, we’ll give it a try.
Pros:
- Generates creative text.
- Architecturally closer to the GPT models we used to generate our dataset.
- Fine-tuning capabilities.
Cons:
- Generally worse at classification tasks compared to BERT and RoBERTa.
Training
We’ll train each model on the training set and evaluate it on the validation set, using the `transformers` library. The library provides a unified API for all the models, making it easy to switch between them.
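A sketch of loading a checkpoint and its tokenizer with the Auto classes; swap the checkpoint name to try any of the models listed above (the padding fix at the end only matters for the GPT-2 variants, which have no padding token by default):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Swap the checkpoint name to try any of the models listed above.
checkpoint = "roberta-base"   # or "bert-base-uncased", "distilbert-base-uncased", "gpt2", ...

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# GPT-2 has no padding token, so set one for batched training.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id
```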
To train our models, we’ll use the `Trainer` class, a high-level API that handles the training loop, evaluation loop, and prediction loop. It also manages data loading, model saving, and model loading. We’ll use the `TrainingArguments` class to specify the training arguments, such as the number of epochs, the batch size, and the learning rate.
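A sketch of the training arguments; the hyperparameter values shown are typical fine-tuning defaults rather than our exact final settings (recent transformers releases rename `evaluation_strategy` to `eval_strategy`):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",      # evaluate on the validation set every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    logging_steps=50,
)
```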
After setting the arguments, we also set up the optimizer and the learning rate scheduler. We use the AdamW optimizer with a linear learning rate scheduler. We also set the random seed to 42 for reproducibility.
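A sketch of the optimizer, scheduler, and seed setup, reusing `model`, `train_dataset`, and `training_args` from the snippets above:

```python
import torch
from transformers import get_linear_schedule_with_warmup, set_seed

set_seed(42)  # fix the random seed for reproducibility

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

# One scheduler step per optimizer step, decaying linearly to zero over training.
steps_per_epoch = len(train_dataset) // training_args.per_device_train_batch_size
num_training_steps = int(steps_per_epoch * training_args.num_train_epochs)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
```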
We are now ready to train our model. We instantiate the `Trainer` class and call the `train` method to start training.
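Putting it together, a sketch of the `Trainer` setup that reuses the objects defined above; the metrics adapter reports the same accuracy and weighted F1 we used for the pre-trained models:

```python
from sklearn.metrics import accuracy_score, f1_score
from transformers import Trainer

def hf_compute_metrics(eval_pred):
    """Compute accuracy and weighted F1 from the Trainer's predictions."""
    preds = eval_pred.predictions.argmax(axis=-1)
    labels = eval_pred.label_ids
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted"),
    }

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    optimizers=(optimizer, scheduler),  # pass our own optimizer and scheduler
    compute_metrics=hf_compute_metrics,
)

trainer.train()
```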
We can then evaluate the model on the validation set one more time. I am keeping the test set for when we are fully done with training and evaluation, so we don’t make biased decisions.
We can also save the model and the tokenizer.
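Both of these are one-liners with the `Trainer` API (the output directory is illustrative):

```python
print(trainer.evaluate())       # metrics on the validation set

save_dir = "./sentiment-model"  # illustrative path
trainer.save_model(save_dir)    # writes the model weights and config
tokenizer.save_pretrained(save_dir)
```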
Evaluation
So we have finalized the training of our models. After running each model and tweaking the hyperparameters to get the best performance, we can look at the results.
Model | Accuracy | F1 Score | Price | Time
---|---|---|---|---
BERT | 0.8559 | 0.8595 | $0.379 | 2 mins
RoBERTa | 0.8630 | 0.8661 | $0.379 | 0.5 min
DistilBERT | 0.8685 | 0.8698 | $0.379 | 1.5 mins
GPT2 | 0.8619 | 0.8649 | $0.379 | 1.5 mins
PySentimento | 0.4049 | 0.4121 | - | -
XLM-RoBERTa | 0.8638 | 0.8665 | $0.526 | 0.5 min
RoBERTa Large | 0.8691 | 0.8715 | $0.526 | 1 min
GPT2 Medium | 0.8654 | 0.8583 | $0.526 | 1 min
GPT-3.5 | 0.8609 | 0.8326 | $3.6 | 1 h 26 mins
GPT-4 | 1* | 1* | $24 | 5 h 6 mins
*GPT-4 is the ground truth, so we can’t compare it with the other models.
These results are on the test set we generated using GPT-4.
We can see that the results are very similar across all the models we trained, and they all perform well. They are also close to the results we got with GPT-3.5, which suggests the training was successful, since GPT-3.5 generated the training labels.
We also tested how the models perform on the initial hand-labeled dataset and found the results to be roughly the same as those we got with GPT-3.5.
Conclusion
In this post, we saw how we can use GPT models to generate a dataset for sentiment analysis. We then trained several models on the generated dataset and compared their results.
We decided to go with BERT as the model we use since we already utilize it for another part of our pipeline, but we could go for any of the models we trained.
Our decision to label our dataset using GPT-3.5 allowed us to generate a reliable training set for fine-tuning other models, ultimately leading to the successful implementation of BERT for Aelous. This process demonstrates the versatility and value of GPT models in real-world applications, even when they are not used directly in the feature itself.
Thanks for reading! If you have any questions or suggestions, feel free to reach out to me, or leave a comment/like below.
References
- A review on sentiment analysis and emotion detection from text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8402961/?report=classic
- Deep learning approach to text analysis for human emotion detection from big data: https://www.degruyter.com/document/doi/10.1515/jisys-2022-0001/html?lang=en
- Transformer models for text-based emotion detection - A review of BERT‑based approaches: https://link.springer.com/article/10.1007/s10462-021-09958-2
- Sentiment analysis using deep learning architectures - A review: https://link.springer.com/article/10.1007/s10462-019-09794-5
- https://monkeylearn.com/sentiment-analysis/
- https://towardsdatascience.com/sentiment-analysis-concept-analysis-and-applications-6c94d6f58c17
- https://huggingface.co/blog/sentiment-analysis-python
- https://brand24.com/blog/sentiment-analysis/