On April 18, 2024, Meta AI released Llama 3, which it claimed to be the most capable openly available LLM to date. Shortly after, on May 13, 2024, OpenAI announced GPT-4o (omni), touting it as the state-of-the-art proprietary model on various NLP benchmarks.
As a guy who loves to compare open-source and proprietary models, I decided to test the performance of both these models on a simple zero-shot text classification task. I present my findings in this article.
Note: Check out one of my previous articles for a comparison of GPT-4 vs. Gemini-Pro vs. Claude-3 for zero-shot text classification.
So, let’s begin comparing GPT-4o vs Llama 3.
Installing and Importing Required Libraries
The following script installs the libraries required to run the code in this article. We will call the GPT-4o model via the official OpenAI API and the Llama 3 model via the Groq API. Both require API keys, which you can obtain by signing up for the respective services.
!pip install openai
!pip install groq
!pip install pandas
!pip install scikit-learn
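Both APIs read their keys from environment variables in the scripts below. If you have not configured them yet, one simple approach is to set them at the top of the notebook; the values below are placeholders, not real keys:

import os

# Placeholder values; paste the keys from your OpenAI and Groq dashboards
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
os.environ["GROQ_API_KEY"] = "your-groq-api-key"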
Next, we will import the required libraries.
import os
import pandas as pd
from sklearn.metrics import accuracy_score
from openai import OpenAI
from groq import Groq
Importing and Preprocessing the Data
We will use the same dataset we used to compare GPT-4 vs Claude 3 and Gemini Pro models. You can download the dataset from this Kaggle link. The dataset consists of sentiments expressed in public tweets towards various US airlines.
The script below imports the dataset.
dataset = pd.read_csv(r"/home/mani/Datasets/Tweets.csv")
print(dataset.shape)
Output:
(14640, 15)
The dataset originally consists of 14,640 tweets, but we will take only 100 tweets with a nearly equal distribution of neutral, positive, and negative tweets for the comparison. The following script preprocesses the dataset and filters the tweets.
def preprocess_data(dataset):
    # Remove rows where 'airline_sentiment' or 'text' are NaN
    dataset = dataset.dropna(subset=['airline_sentiment', 'text'])

    # Remove rows where 'airline_sentiment' or 'text' are empty strings
    dataset = dataset[(dataset['airline_sentiment'].str.strip() != '') & (dataset['text'].str.strip() != '')]

    # Filter the DataFrame for each sentiment
    neutral_df = dataset[dataset['airline_sentiment'] == 'neutral']
    positive_df = dataset[dataset['airline_sentiment'] == 'positive']
    negative_df = dataset[dataset['airline_sentiment'] == 'negative']

    # Randomly sample records from each sentiment
    neutral_sample = neutral_df.sample(n=34)
    positive_sample = positive_df.sample(n=33)
    negative_sample = negative_df.sample(n=33)

    # Concatenate the samples into one DataFrame
    dataset = pd.concat([neutral_sample, positive_sample, negative_sample])

    # Reset the index
    dataset.reset_index(drop=True, inplace=True)

    return dataset
dataset = preprocess_data(dataset)
# print value counts
print(dataset["airline_sentiment"].value_counts())
Output:
airline_sentiment
neutral 34
positive 33
negative 33
Name: count, dtype: int64
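Note that sample() draws random rows on every run, so your 100 tweets will differ from mine. If you want a reproducible subset, you can pass a fixed random_state to each sampling call inside preprocess_data(), as in this sketch (the seed value 42 is arbitrary):

# Fixing the seed makes the sampled subset reproducible across runs
neutral_sample = neutral_df.sample(n=34, random_state=42)
positive_sample = positive_df.sample(n=33, random_state=42)
negative_sample = negative_df.sample(n=33, random_state=42)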
Zero-Shot Text Classification with GPT-4o
As in the previous article, we will define a function that accepts a tweet and returns its sentiment. But this time, we will use the OpenAI GPT-4o model instead of GPT-4.
The following script creates a client object of the OpenAI class and defines the find_sentiment_gpt() function, which accepts a single tweet as a parameter. Inside the function, we use the chat.completions.create() method of the OpenAI client object to find the tweet's sentiment. We also specify a prompt instructing the LLM on the classification strategy.
client = OpenAI(
    # Reading the key from the environment is the default and can be omitted
    api_key=os.environ.get('OPENAI_API_KEY'),
)

def find_sentiment_gpt(tweet):
    content = """What is the sentiment expressed in the following tweet about an airline?
    Select sentiment value from positive, negative, or neutral. Return only the sentiment value in small letters.
    tweet: {}""".format(tweet)

    sentiment = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        max_tokens=10,
        messages=[
            {"role": "user", "content": content}
        ]
    )

    return sentiment.choices[0].message.content
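Before classifying the whole dataset, you can sanity-check the function on a single tweet. The tweet below is a made-up example, not one from the dataset:

# Hypothetical tweet for a quick sanity check
print(find_sentiment_gpt("@united The crew on my delayed flight was incredibly helpful, thank you!"))
# should print something like: positive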
The next step is to iterate through all the tweets in the dataset, pass each tweet to the find_sentiment_gpt() function, and predict the sentiment. The following script does this.
%%time

all_sentiments = []

tweets_list = dataset["text"].tolist()

i = 0
exceptions = 0
while i < len(tweets_list):
    try:
        tweet = tweets_list[i]
        sentiment_value = find_sentiment_gpt(tweet)
        all_sentiments.append(sentiment_value)
        i = i + 1
        print(i, sentiment_value)
    except Exception as e:
        print("===================")
        print("Exception occurred:", e)
        exceptions = exceptions + 1

print("Total exception count:", exceptions)
Output:
Total exception count: 0
CPU times: user 658 ms, sys: 43.4 ms, total: 701 ms
Wall time: 57.8 s
The above output shows that it took 57.8 seconds to process 100 tweets.
The following script displays the accuracy of predictions.
accuracy = accuracy_score(dataset["airline_sentiment"], all_sentiments)
print("Accuracy:", accuracy)
Output:
Accuracy: 0.78
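Accuracy alone can hide per-class differences, e.g., a model that handles negative tweets well but struggles with neutral ones. If you want a per-class breakdown, scikit-learn's classification_report gives precision, recall, and F1 for each label; a quick sketch:

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 (assumes the model returned only the three expected labels)
print(classification_report(dataset["airline_sentiment"], all_sentiments))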
Next, we will perform the same steps with the Llama 3 model using the Groq API.
Zero-Shot Text Classification with Groq Llama 3
We will create an object of the Groq class and define a find_sentiment_llama3() function that calls the client.chat.completions.create() method, passing the llama3-70b-8192 model ID to the model attribute. The rest of the process remains the same as for GPT-4o.
client = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)

def find_sentiment_llama3(tweet):
    content = """What is the sentiment expressed in the following tweet about an airline?
    Select sentiment value from positive, negative, or neutral. Return only the sentiment value in small letters.
    tweet: {}""".format(tweet)

    sentiment = client.chat.completions.create(
        model="llama3-70b-8192",
        temperature=0,
        max_tokens=10,
        messages=[
            {"role": "user", "content": content}
        ]
    )

    return sentiment.choices[0].message.content
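As with GPT-4o, you can sanity-check the function on a single made-up tweet before running the full loop:

# Hypothetical tweet for a quick sanity check
print(find_sentiment_llama3("@AmericanAir Three-hour delay and zero updates from the gate."))
# should print something like: negative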
The script below predicts sentiments for all the tweets using the find_sentiment_llama3() function:
%%time

all_sentiments = []

tweets_list = dataset["text"].tolist()

i = 0
exceptions = 0
while i < len(tweets_list):
    try:
        tweet = tweets_list[i]
        sentiment_value = find_sentiment_llama3(tweet)
        all_sentiments.append(sentiment_value)
        i = i + 1
        print(i, sentiment_value)
    except Exception as e:
        print("===================")
        print("Exception occurred:", e)
        exceptions = exceptions + 1

print("Total exception count:", exceptions)
Output:
Total exception count: 0
CPU times: user 899 ms, sys: 44.4 ms, total: 944 ms
Wall time: 4min 14s
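Because LLMs sometimes return extra characters (capitalization, whitespace, or a trailing period) despite the prompt instructions, it can help to normalize the raw responses before scoring. A small defensive sketch:

# Normalize raw model responses: strip whitespace, lowercase, drop a trailing period
all_sentiments = [s.strip().lower().rstrip(".") for s in all_sentiments]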
Finally, the following script prints the model accuracy:
accuracy = accuracy_score(dataset["airline_sentiment"], all_sentiments)
print("Accuracy:", accuracy)
Final Thoughts
With this comparison between GPT-4o and Llama 3, I have come to the following conclusions:

- Accuracy: Both models achieve the same accuracy on simple tasks like text classification.
- Speed: Though Groq claims to be 10x to 15x faster than other contemporary LLM APIs, it was by far the slower option in this comparison (4 min 14 s against OpenAI's 57.8 s). This could be a one-off, and more experiments are needed.
- Price: GPT-4o is expensive, currently priced at $5/$15 per million input/output tokens. Llama 3, on the other hand, is open source and free, but requires substantial computational power to run yourself; via the Groq API, it costs $0.59/$0.79 per million input/output tokens. See the rough cost sketch below.
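To put these prices in perspective, here is a rough back-of-the-envelope estimate of what this 100-tweet experiment costs on each API. The token counts are assumptions (about 60 input tokens per prompt and a one-token response), not measured values:

# Assumed token counts; actual usage depends on tweet length and the tokenizer
num_tweets = 100
input_tokens = num_tweets * 60   # prompt template + tweet, rough estimate
output_tokens = num_tweets * 1   # single-word sentiment label

gpt4o_cost = input_tokens / 1e6 * 5 + output_tokens / 1e6 * 15
llama3_groq_cost = input_tokens / 1e6 * 0.59 + output_tokens / 1e6 * 0.79

print(f"GPT-4o:         ~${gpt4o_cost:.4f}")
print(f"Llama 3 (Groq): ~${llama3_groq_cost:.4f}")

Even with generous token estimates, both runs cost well under a dollar, but the roughly 9x price gap would matter at scale.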
To summarize, Llama 3 is a much better and cheaper choice for simple tasks like sentiment classification. However, the high computational requirements for self-hosting it, and the high latency I observed with APIs like Groq, remain concerns.
Feel free to share your thoughts regarding the two models.