Text Classification Using Data Annotation with ChatGPT

By usmanmalik57

Data annotation for text classification is time-consuming and expensive. In the case of smaller training datasets, pre-trained ChatGPT models might achieve higher classification accuracy on test sets than training classifiers from scratch or fine-tuning existing models. Additionally, ChatGPT can aid in annotating data for fine-tuning text classification models.

In this article, I demonstrate two experiments. First, I make predictions on text data using ChatGPT and compare the results with the test set. Next, I annotate text data using ChatGPT and utilize the annotated data to train a machine learning model. The findings reveal that directly predicting text labels using ChatGPT outperforms data annotation followed by model training. These experiments highlight the practical benefits of using ChatGPT in data annotation and text classification tasks.

Text Classification Using a Baseline Machine Learning Model

To start, I will use a basic machine-learning model to classify text. This will give us a starting point to compare the results later. In the next part of the experiment, we will use ChatGPT to annotate the data and see how it performs compared to the baseline. This way, we can find out if ChatGPT helps improve the classification results.

We'll use the IMDB dataset of labeled movie reviews to train a text classification model. The dataset consists of positive and negative movie reviews. We'll convert the text into numerical TF-IDF features and train a Random Forest classifier on them. By splitting the dataset into training and testing sets, we can assess the model's sentiment predictions using accuracy as the metric.

Here is the code that trains the text classification model for predicting IMDB movie review sentiments.

# Import necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the IMDb dataset from a CSV file and display the first few rows
dataset = pd.read_csv(r"D:\Datasets\IMDB Dataset.csv")
dataset.head()

# retain first 300 rows from the dataset for experimentation
dataset = dataset.head(300)

# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(dataset['review'], 
                                                    dataset['sentiment'], 
                                                    test_size=0.2, 
                                                    random_state=42)

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)

# Train the Random Forest model
rf_model = RandomForestClassifier(n_estimators=500)
rf_model.fit(X_train_tfidf, y_train)

# Transform the test data using the same vectorizer
X_test_tfidf = vectorizer.transform(X_test)

# Predict the sentiment on the test data
y_pred = rf_model.predict(X_test_tfidf)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Training a baseline Random Forest classifier on 240 records returns an accuracy of 0.65 (65%) on the test set.
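To put that score in context, it helps to know what accuracy you would get by always predicting the most common label. The following is a minimal sketch in plain Python; majority_baseline_accuracy is a hypothetical helper, and the label lists are made-up stand-ins for y_train and y_test, not values from the experiment.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)

# Hypothetical labels standing in for y_train and y_test
toy_train = ["positive", "negative", "negative", "positive", "negative"]
toy_test = ["negative", "positive", "negative", "negative"]

baseline_label, baseline_acc = majority_baseline_accuracy(toy_train, toy_test)
print(baseline_label, baseline_acc)
```

With the real data you could call majority_baseline_accuracy(y_train.tolist(), y_test.tolist()); a trained model is only informative if it beats that figure.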

Text Classification with ChatGPT

Let’s now use ChatGPT to make predictions directly on the test set and see what performance we achieve.

We will access the OpenAI API by retrieving the API key from the environment variable "OPENAI_KEY2" and assigning it to the openai.api_key variable.

import os
import openai

api_key = os.getenv('OPENAI_KEY2')
openai.api_key = api_key
# Avoid printing the key itself; just confirm that it was found
print("API key loaded:", api_key is not None)
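Since an API key is a secret, it can be safer to fail early when it is missing and to show only a masked form when confirming that it loaded. Here is a small sketch of that idea; load_api_key, mask_key, and the DEMO_KEY variable are hypothetical names used for illustration only.

```python
import os

def load_api_key(env_var="OPENAI_KEY2"):
    """Return the API key stored in env_var, or raise a clear error if it is unset."""
    key = os.getenv(env_var)
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return key

def mask_key(key):
    """Hide all but the last four characters so the full key never reaches logs."""
    return "*" * max(len(key) - 4, 0) + key[-4:]

# Demo with a fake value; a real key would come from your shell environment
os.environ["DEMO_KEY"] = "sk-demo-1234abcd"
print(mask_key(load_api_key("DEMO_KEY")))
```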

Next, we define a function called find_sentiment(review) that uses the OpenAI GPT-3.5 Turbo language model to determine the sentiment expressed in a given IMDB movie review. Behind the scenes, ChatGPT uses the GPT-3.5 Turbo language model. Note that the code calls openai.ChatCompletion, which is the interface of openai Python package versions before 1.0.

import time

def find_sentiment(review):

    content = """What is the sentiment expressed in the following IMDB movie review? 
    Select sentiment value from positive or negative. Return only the sentiment value.
    Movie review: {}""".format(review)

    sentiment = openai.ChatCompletion.create(
      model="gpt-3.5-turbo",
      temperature = 0,
      messages=[
            {"role": "user", "content": content}
        ]
    )

    return sentiment["choices"][0]["message"]["content"]

The find_sentiment(review) function takes the review as input and prompts the model to provide the sentiment value, which should be either "positive" or "negative." By using the OpenAI ChatCompletion API, we interact with the language model to extract the sentiment value from the generated response, and the function returns the sentiment as the output.
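Even with a prompt asking for only the sentiment value, the model may reply with different casing, trailing punctuation, or extra words, and such replies would not match the dataset labels exactly. The sketch below shows one way to normalize replies, assuming the dataset labels are the lowercase strings "positive" and "negative"; normalize_sentiment is a hypothetical helper, not part of the original script.

```python
def normalize_sentiment(raw_response):
    """Map a free-text model reply to 'positive', 'negative', or None if unclear."""
    text = raw_response.strip().lower()
    has_positive = "positive" in text
    has_negative = "negative" in text
    if has_positive and not has_negative:
        return "positive"
    if has_negative and not has_positive:
        return "negative"
    return None  # ambiguous or unexpected reply; handle separately

for reply in ["Positive", " negative.\n", "Sentiment: Positive", "mixed"]:
    print(repr(reply), "->", normalize_sentiment(reply))
```

One option is to wrap the return value of find_sentiment in this function and inspect any None results by hand before computing accuracy.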

Next, we iterate through all the reviews in the test set, pass them to the find_sentiment(review) method, and retrieve their sentiments.

all_sentiments = []

X_test_list = X_test.tolist()

i = 0
while i < len(X_test):

    try:
        review = X_test_list[i]
        sentiment_value = find_sentiment(review)
        all_sentiments.append(sentiment_value)
        i = i + 1
        print(i, sentiment_value)

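    except Exception as error:
        # Most likely a rate limit or a transient API error
        print("===================")
        print("time limit reached:", error)
        # Wait before retrying the same review (20 s is an arbitrary choice)
        time.sleep(20)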
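A loop that retries on every exception can spin forever if a request keeps failing. A bounded alternative is sketched below: a helper that retries a call a limited number of times, doubling the wait after each failure. The helper call_with_retries, its parameters, and the fake flaky_sentiment function are assumptions made for illustration; in practice you would pass find_sentiment instead.

```python
import time

def call_with_retries(func, *args, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(*args); on failure wait base_delay * 2**attempt and retry, up to max_attempts."""
    for attempt in range(max_attempts):
        try:
            return func(*args)
        except Exception as error:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** attempt)
            print(f"attempt {attempt + 1} failed ({error}); retrying in {delay}s")
            sleep(delay)

# Demo with a fake function that fails twice before succeeding
calls = {"count": 0}

def flaky_sentiment(review):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limit")
    return "positive"

# Record the waits instead of actually sleeping, to keep the demo fast
waits = []
result = call_with_retries(flaky_sentiment, "a fine film", sleep=waits.append)
print(result, waits)
```

Passing the sleep function as a parameter is a design choice that keeps the demo instant; with the default time.sleep, the same call would pause for real between attempts.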

Finally, we compare the ChatGPT-predicted sentiments with the sentiments in the test set.

accuracy = accuracy_score(y_test, all_sentiments)
print("Accuracy:", accuracy)

I achieved an accuracy of 0.95 (95%) using ChatGPT-predicted sentiments, which is 30 percentage points higher than the 0.65 achieved by the baseline model. This is a large gain and shows how powerful ChatGPT can be for text classification tasks.
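The gap between the two scores can be expressed either as an absolute difference in percentage points or as a relative improvement over the baseline. A short sketch of the arithmetic, using the two accuracies reported in this article (the variable names are my own):

```python
baseline_accuracy = 0.65   # Random Forest trained on manual labels
chatgpt_accuracy = 0.95    # direct ChatGPT predictions

absolute_gain = chatgpt_accuracy - baseline_accuracy   # difference in accuracy
relative_gain = absolute_gain / baseline_accuracy      # as a fraction of the baseline

print(f"absolute gain: {absolute_gain * 100:.0f} percentage points")
print(f"relative gain: {relative_gain * 100:.0f}%")
```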

In the next section, I will explain how you can annotate data using ChatGPT and use it to train your text classification model.

Data Annotation with ChatGPT

The approach for data annotation remains similar to label prediction since essentially annotation involves assigning a label to a record. The following script annotates reviews in the training set with positive or negative sentiments.

all_sentiments = []

X_train_list = X_train.tolist()

i = 0
while i < len(X_train):

    try:
        review = X_train_list[i]
        sentiment_value = find_sentiment(review)
        all_sentiments.append(sentiment_value)
        i = i + 1
        print(i, sentiment_value)

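    except Exception as error:
        # Most likely a rate limit or a transient API error
        print("===================")
        print("time limit reached:", error)
        # Wait before retrying the same review (20 s is an arbitrary choice)
        time.sleep(20)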
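Before training on the ChatGPT annotations, it can be useful to check how often they agree with the manual labels you already have. The sketch below is one way to do that in plain Python; annotation_agreement is a hypothetical helper, and the two lists are made-up stand-ins for y_train and all_sentiments.

```python
from collections import Counter

def annotation_agreement(manual_labels, model_labels):
    """Return the agreement rate and a count of each (manual, model) label pair."""
    if len(manual_labels) != len(model_labels):
        raise ValueError("label lists must have the same length")
    pairs = Counter(zip(manual_labels, model_labels))
    matches = sum(n for (manual, model), n in pairs.items() if manual == model)
    return matches / len(manual_labels), pairs

# Hypothetical labels standing in for y_train and all_sentiments
manual = ["positive", "negative", "negative", "positive", "negative"]
chatgpt = ["positive", "negative", "positive", "positive", "negative"]

rate, pairs = annotation_agreement(manual, chatgpt)
print(rate)
print(pairs[("negative", "positive")])
```

The pair counts work like a small confusion matrix: they show which direction the disagreements go, for example reviews labeled negative by hand but positive by the model.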

Next, we can use the records annotated by ChatGPT to train a machine learning model for text classification, as shown in the following script:

# Train the Random Forest model
rf_model = RandomForestClassifier(n_estimators=500)
rf_model.fit(X_train_tfidf, all_sentiments)

# Transform the test data using the same vectorizer
X_test_tfidf = vectorizer.transform(X_test)

# Predict the sentiment on the test data
y_pred = rf_model.predict(X_test_tfidf)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

In the output, I achieved an accuracy of 0.6833 (about 68%), which is roughly 3 percentage points better than the 0.65 achieved with the manual annotations.
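To see what these accuracies mean in terms of individual reviews, you can convert them back into counts. The sketch below assumes the 80/20 split of 300 rows used earlier, which leaves 60 reviews in the test set; the dictionary names are my own.

```python
test_size = 60  # 20% of the 300 reviews kept for this experiment

reported = {
    "baseline (manual labels)": 0.65,
    "ChatGPT direct prediction": 0.95,
    "model trained on ChatGPT labels": 0.6833,
}

# Convert each reported accuracy into a number of correctly classified reviews
correct_counts = {name: round(acc * test_size) for name, acc in reported.items()}
for name, count in correct_counts.items():
    print(f"{name}: {count} of {test_size} correct")
```

On a test set this small, each review is worth about 1.7 percentage points, so small gaps between models amount to only a handful of reviews.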

Conclusion

In conclusion, on this small dataset, predicting labels directly with ChatGPT performed far better than training a machine learning model from scratch. A model trained on ChatGPT-annotated data also scored slightly higher than one trained on the manually annotated data, although that gap is small and comes from a single run on a small test set.

AndreRet 526 Senior Poster

Great tutorial again, thank you!
