In my previous article on GPT-4o mini, I compared the performance of GPT-4o mini against GPT-3.5 Turbo and GPT-4o for zero-shot text classification. We saw that GPT-4o mini, despite being around 36 times cheaper, achieved only 2% lower accuracy than GPT-4o. Furthermore, despite costing roughly a third as much as GPT-3.5 Turbo, GPT-4o mini significantly outperformed it.
This article will compare GPT-4o mini, GPT-4o, and GPT-3.5 turbo for zero-shot text summarization. We will evaluate the models' text summarization capabilities using metrics such as ROUGE scores and LLM-based evaluation.
So, let's begin without further ado.
Importing and Installing Required Libraries
You must install the following Python libraries to run the code in this article.
!pip install openai
!pip install rouge-score
!pip install --upgrade openpyxl
!pip install pandas openpyxl
The script below imports the required libraries.
import os
import time
from collections import defaultdict  # used later to collect per-model evaluation scores
import pandas as pd
from rouge_score import rouge_scorer
from openai import OpenAI
Importing the Dataset
We will summarize the articles in the News Articles with Summary dataset. The dataset consists of article content and human-generated summaries.
The following script imports the dataset's Excel file into a Pandas DataFrame.
# Dataset download link:
# https://github.com/reddzzz/DataScience_FP/blob/main/dataset.xlsx

dataset = pd.read_excel(r"D:\Datasets\dataset.xlsx")
dataset = dataset.sample(frac=1)  # shuffle the rows
print(dataset.shape)
dataset.head()
Output:
The content column contains the article text, while the human_summary column contains a human-generated summary of each article.
Next, we will find the average number of characters across all human-generated summaries. We will use this number as the target length when asking the LLMs to summarize the articles.
dataset['summary_length'] = dataset['human_summary'].apply(len)
average_length = dataset['summary_length'].mean()
print(f"Average length of summaries: {average_length:.2f} characters")
Output:
Average length of summaries: 1168.78 characters
Text Summarization with GPT-4o mini, GPT-4o, and GPT-3.5 Turbo
We are now ready to summarize articles using GPT-4o mini, GPT-4o, and GPT-3.5 turbo.
We will first create an object of the OpenAI
class and use this object to make calls to various OpenAI LLMs. You must pass your OpenAI API Key to the OpenAI
class constructor.
Next, we will define the generate_summary() function, which accepts the LLM model name and the article content. We then ask the LLM to generate a roughly 1150-character summary of the passed article. Notice that we use the 1150-character target because the average length of the human-generated summaries was around 1168 characters. The generate_summary() function returns the summary generated by the LLM.
We will also define the calculate_rouge()
function that calculates the ROUGE 1, ROUGE 2, and ROUGE L scores by comparing LLM-generated and human-generated summaries.
ROUGE scores compare machine-generated texts, like summaries, to human-generated texts to assess the quality of the former. ROUGE-1 measures the overlap of unigrams (single words), ROUGE-2 measures the overlap of bigrams (two consecutive words), and ROUGE-L focuses on the longest common subsequence between the two texts.
client = OpenAI(
    # This is the default and can be omitted
    api_key=os.environ.get('OPENAI_API_KEY'),
)

# Function to generate summary using OpenAI API
def generate_summary(model, article):
    prompt = f"Summarize the following article in 1150 characters. The summary should look like human created:\n\n{article}\n\nSummary:"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1150,
        temperature=0.7
    )
    return response.choices[0].message.content

# Function to calculate ROUGE scores
def calculate_rouge(reference, candidate):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, candidate)
    return {key: value.fmeasure for key, value in scores.items()}
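To get a feel for these metrics before running the full experiment, you can call calculate_rouge() on a toy pair of sentences. This quick check is only illustrative and is not part of the main workflow.

# Illustrative check of the ROUGE helper on a toy sentence pair
reference = "The central bank raised interest rates to curb inflation."
candidate = "The central bank increased rates to fight inflation."
print(calculate_rouge(reference, candidate))
# Expect a high ROUGE-1 score (many shared words) and a lower ROUGE-2 score (fewer shared bigrams)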
Subsequently, we will define a nested for loop. The outer for loop will iterate through the articles in our dataset, while the inner for loop will iterate through the LLMs. To save cost, we will only summarize the first 20 articles in the dataset.
During each inner loop iteration, we will pass the model name and the article to the generate_summary() function. The result will be stored as a dictionary in the results list and will contain the model name, the article ID, the generated summary, and the three ROUGE scores.
The results list will therefore contain 60 records (3 models × 20 articles). Finally, the results list is converted to the results_df DataFrame for easier viewing and data manipulation.
models = ["gpt-4o-mini",
          "gpt-4o",
          "gpt-3.5-turbo"]

results = []

i = 0
for _, row in dataset[:20].iterrows():
    article = row['content']
    human_summary = row['human_summary']
    i = i + 1

    for model in models:
        print(f"Summarizing article {i} with model {model}")
        generated_summary = generate_summary(model, article)
        rouge_scores = calculate_rouge(human_summary, generated_summary)

        results.append({
            'model': model,
            'article_id': row.id,
            'generated_summary': generated_summary,
            'rouge1': rouge_scores['rouge1'],
            'rouge2': rouge_scores['rouge2'],
            'rougeL': rouge_scores['rougeL']
        })

# Create a DataFrame with results
results_df = pd.DataFrame(results)
Output:
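Since each run makes 60 API calls, you may optionally want to persist the results before moving on. The file path below is only an example; any location works, and writing to Excel uses the openpyxl package installed earlier.

# Optional: save the results so the API calls don't have to be repeated
results_df.to_excel(r"D:\Datasets\summarization_results.xlsx", index=False)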
Finally, we group results_df by model, compute the average ROUGE scores for each model, and sort the result in descending order of the rouge1 column to see which model performs best on ROUGE-1. You can also sort by the other metrics.
average_scores = results_df.groupby('model')[['rouge1', 'rouge2', 'rougeL']].mean()
average_scores_sorted = average_scores.sort_values(by='rouge1', ascending=False)
print("Average ROUGE scores by model:")
average_scores_sorted.head()
Output:
The above output shows that GPT-4o performs best with a ROUGE-1 score of 0.381, followed by GPT-4o mini and GPT-3.5 Turbo with ROUGE-1 scores of 0.375 and 0.363, respectively. GPT-4o also leads on the ROUGE-2 and ROUGE-L scores; however, GPT-3.5 Turbo edges out GPT-4o mini on those two metrics.
Another way to evaluate an LLM's text summarization performance is via another LLM, which you will see in the next section.
Evaluating LLM-Generated Summaries Using an LLM
The script below defines the llm_evaluate_summary() function, which accepts an article's content and an LLM-generated summary. Inside the function, we ask an LLM (GPT-4o mini in this case) to evaluate the generated summary on different criteria: completeness, conciseness, and coherence. The llm_evaluate_summary() function returns the ratings for these criteria, on a scale of 1-10, as a list.
def llm_evaluate_summary(article, summary):
    prompt = f"""Evaluate the following summary for the given article. Rate it on a scale of 1-10 for:
    1. Completeness: Does it capture all key points?
    2. Conciseness: Is it brief and to the point?
    3. Coherence: Is it well-structured and easy to understand?
    Article: {article}
    Summary: {summary}
    Provide the ratings as a comma-separated list (completeness,conciseness,coherence).
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
        temperature=0.7
    )
    return [float(score) for score in response.choices[0].message.content.strip().split(',')]
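Before scoring all 60 summaries, you may want to spot-check the function on a single article/summary pair. The snippet below is only an illustrative check and assumes the dataset and results_df DataFrames from the previous section are still in memory.

# Illustrative spot check: score the first generated summary in results_df
sample_row = results_df.iloc[0]
sample_article = dataset.loc[dataset['id'] == sample_row['article_id'], 'content'].iloc[0]
print(llm_evaluate_summary(sample_article, sample_row['generated_summary']))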
Next, we iterate through the results_df
and pass the LLM-generated summaries and the corresponding original content to the llm_evaluate_summary()
function. We store the result in a defaultdict
object.
# Initialize a dictionary to store scores for each model
scores_dict = defaultdict(lambda: {'completeness': [], 'conciseness': [], 'coherence': []})

i = 0
for _, row in results_df.iterrows():
    i = i + 1

    # Access the article content by its article_id
    article = dataset.loc[dataset['id'] == row['article_id'], 'content'].iloc[0]
    scores = llm_evaluate_summary(article, row['generated_summary'])
    print(f"Model: {row['model']}, Scores: {scores}")

    # Store the scores for the model
    model = row['model']
    scores_dict[model]['completeness'].append(scores[0])
    scores_dict[model]['conciseness'].append(scores[1])
    scores_dict[model]['coherence'].append(scores[2])
Output:
Finally, we calculate the average scores for completeness, conciseness, and coherence for all models, store them in a Pandas DataFrame, and display them in the output.
# Calculate the average scores for each model
average_scores = {}
for model, scores in scores_dict.items():
    average_scores[model] = {
        'completeness': sum(scores['completeness']) / len(scores['completeness']),
        'conciseness': sum(scores['conciseness']) / len(scores['conciseness']),
        'coherence': sum(scores['coherence']) / len(scores['coherence']),
    }

# Convert to DataFrame for better visualization (optional)
average_scores_df = pd.DataFrame.from_dict(average_scores, orient='index')
average_scores_df.columns = ['Completeness', 'Conciseness', 'Coherence']

average_scores_df.head()
Output:
The above output shows that GPT-4o mini, GPT-4o, and GPT-3.5 Turbo perform equally well on the completeness metric. For the conciseness metric, GPT-4o and GPT-3.5 Turbo score equally, both slightly ahead of GPT-4o mini. Finally, GPT-4o performs best for coherence, followed by GPT-4o mini and GPT-3.5 Turbo.
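As an optional final step, you can place the average ROUGE scores and the LLM-based ratings side by side. The snippet below is a small convenience sketch that assumes the average_scores_sorted and average_scores_df DataFrames from the earlier steps are still in memory.

# Optional: combine ROUGE averages and LLM-based ratings into one comparison table
comparison_df = average_scores_sorted.join(average_scores_df)
print(comparison_df)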
Conclusion
This article compared GPT-4o mini, GPT-4o, and GPT-3.5 Turbo for zero-shot text summarization. The results show no significant difference between the ROUGE scores obtained by the three models, and the LLM-based evaluation likewise reveals only slight performance differences between them.
Given that GPT-4o mini is 36 times cheaper than GPT-4o and a third of the price of GPT-3.5 Turbo, I suggest using GPT-4o mini for your text summarization tasks, as its performance difference with the other models is negligible.