Qwen 2.5-72b Vs. Llama 3.3-70b for Text Classification and Summarization

usmanmalik57

Open-source LLMs are gaining significant traction due to their ability to match the performance of advanced proprietary LLMs. These models are free to use and allow users to modify their source code or fine-tune them on their own systems, making them highly versatile for various applications.

Alibaba's Qwen and Meta's Llama series are two prominent contenders in the open-source LLM landscape.

In this article, we compare the performance of Qwen 2.5-72b and Llama 3.3-70b. Llama 3.3-70b was released on December 6, 2024, and Meta claims that it achieves performance comparable to its Llama 3.1 model with 405 billion parameters, making it a powerful and efficient choice for NLP tasks.

By the end of this article, you will understand which model best suits specific NLP needs.

So, let's dive right in!

Installing and Importing Required Libraries

We need the huggingface_hub library to call the Hugging Face Inference API and access the Qwen and Llama models. We will also use the rouge-score library to calculate ROUGE scores for text summarization, scikit-learn to compute accuracy, and pandas with openpyxl to read the datasets. Below is the script to install the necessary libraries:

!pip install huggingface_hub==0.24.7
!pip install rouge-score
!pip install scikit-learn
!pip install --upgrade pandas openpyxl

After installation, import the required libraries as shown below:

from huggingface_hub import InferenceClient
import os
import pandas as pd
from rouge_score import rouge_scorer
from sklearn.metrics import accuracy_score
from collections import defaultdict

Calling Qwen 2.5 and Llama 3.3 Models Using Hugging Face Inference API

To access models using the Hugging Face Inference API, you will need a Hugging Face User Access Token. Then, create a client object for each model using the InferenceClient class from the huggingface_hub library. Pass the Hugging Face model path and your access token to the InferenceClient constructor.

Below is the script to create model clients for Qwen 2.5-72b and Llama 3.3-70b models:

hf_token = os.environ.get('HF_TOKEN')

#qwen 2.5 endpoint
#https://huggingface.co/Qwen/Qwen2.5-72B-Instruct
qwen_model_client = InferenceClient(
    "Qwen/Qwen2.5-72B-Instruct",
    token=hf_token
)

#Llama 3.3 endpoint
#https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
llama_model_client = InferenceClient(
    "meta-llama/Llama-3.3-70B-Instruct",
    token=hf_token
)
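Note that os.environ.get() returns None when HF_TOKEN is not set, and the problem may then only surface later as a less obvious authentication or rate-limit error. A minimal sketch of an early check is shown below; require_token() is a hypothetical helper name, and DEMO_TOKEN with its value is a placeholder used only so the example runs without a real token.

```python
import os

def require_token(env_var="HF_TOKEN"):
    """Return the token stored in env_var, or fail early with a clear message."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {env_var} is not set; "
            "create a User Access Token in your Hugging Face settings and export it first."
        )
    return token

# Placeholder variable and value, for demonstration only
os.environ["DEMO_TOKEN"] = "hf_example"
print(require_token("DEMO_TOKEN"))
```

In the script above, you could then write hf_token = require_token() instead of reading the environment variable directly.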

Next, to get a response from a model, use the chat_completion() method. This method requires a list of system and user messages passed to its messages parameter.

We will define the make_prediction() function, which generates a response from the model client.

def make_prediction(model, system_role, user_query):

    response = model.chat_completion(
        messages=[
            {"role": "system", "content": system_role},
            {"role": "user", "content": user_query},
        ],
        max_tokens=10,
    )

    return response.choices[0].message.content
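If you want to check this function without spending API calls, one option is a small stand-in client that mimics the response shape used above (choices[0].message.content). The sketch below is only an illustration: FakeClient is a hypothetical class, not part of huggingface_hub, and make_prediction() is repeated here so the example is self-contained.

```python
from types import SimpleNamespace

class FakeClient:
    """Hypothetical stand-in that returns a canned reply in the same shape as the real client."""

    def __init__(self, canned_reply):
        self.canned_reply = canned_reply
        self.calls = []  # records the arguments of each call for inspection

    def chat_completion(self, messages, max_tokens=None):
        self.calls.append({"messages": messages, "max_tokens": max_tokens})
        message = SimpleNamespace(content=self.canned_reply)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])

def make_prediction(model, system_role, user_query):
    # Same logic as the function defined in the article
    response = model.chat_completion(
        messages=[
            {"role": "system", "content": system_role},
            {"role": "user", "content": user_query},
        ],
        max_tokens=10,
    )
    return response.choices[0].message.content

fake = FakeClient("positive")
print(make_prediction(fake, "Classify the sentiment.", "I like this movie a lot"))
```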

Now, let's generate the response to a dummy question using the Qwen 2.5-72b model:

system_role = "Assign positive, negative, or neutral sentiment to the movie review. Return only a single word in your response"
user_query = "I like this movie a lot"
make_prediction(qwen_model_client,
               system_role,
               user_query)

Output:

'positive'

Similarly, let's try the Llama 3.3-70b model:

system_role = "Assign positive, negative, or neutral sentiment to the movie review. Return only a single word in your response"
user_query = "I hate this movie a lot"
make_prediction(llama_model_client,
               system_role,
               user_query)

Output:

'Negative'

Both models generate accurate responses. Next, we will compare their performance for zero-shot text classification and summarization tasks.
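Notice that the two replies differ in capitalization ('positive' vs. 'Negative'). If predictions are later compared with lowercase labels by exact string match, such differences could count as errors. The following is a minimal sketch of a normalization step; normalize_label() and the "unknown" fallback value are hypothetical names introduced for illustration and are not used in the experiments reported in this article.

```python
import re

VALID_LABELS = ("positive", "negative", "neutral")

def normalize_label(raw_reply, default="unknown"):
    """Map a free-form model reply to one of the three labels, or to default."""
    text = raw_reply.strip().lower()
    # Collect every valid label that appears as a whole word in the reply
    found = [label for label in VALID_LABELS if re.search(rf"\b{label}\b", text)]
    # Accept the reply only when exactly one label is mentioned
    return found[0] if len(found) == 1 else default

print(normalize_label("Negative"))
print(normalize_label(" Positive. "))
print(normalize_label("positive or negative"))
```

Replies that mention no label, or more than one, fall back to the default so that they can be inspected manually.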

Qwen 2.5-72b vs. Llama 3.3-70b For Zero-shot Text Classification

For this task, we will use the Twitter US Airline Sentiment dataset, which includes positive, negative, and neutral tweets about US airlines.

First, import the dataset into a Pandas DataFrame:

## Dataset download link
## https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment?select=Tweets.csv

dataset = pd.read_csv(r"D:\Datasets\Tweets.csv")
dataset.head()

Output:

img1.png

Next, we will preprocess the dataset and select 100 tweets (34 neutral, 33 positive, and 33 negative). You can select more tweets if you want.


# Remove rows where 'airline_sentiment' or 'text' are NaN
dataset = dataset.dropna(subset=['airline_sentiment', 'text'])

# Remove rows where 'airline_sentiment' or 'text' are empty strings
dataset = dataset[(dataset['airline_sentiment'].str.strip() != '') & (dataset['text'].str.strip() != '')]

# Filter the DataFrame for each sentiment
neutral_df = dataset[dataset['airline_sentiment'] == 'neutral']
positive_df = dataset[dataset['airline_sentiment'] == 'positive']
negative_df = dataset[dataset['airline_sentiment'] == 'negative']

# Randomly sample records from each sentiment
neutral_sample = neutral_df.sample(n=34)
positive_sample = positive_df.sample(n=33)
negative_sample = negative_df.sample(n=33)

# Concatenate the samples into one DataFrame
dataset = pd.concat([neutral_sample, positive_sample, negative_sample])

# Reset index if needed
dataset.reset_index(drop=True, inplace=True)

# print value counts
print(dataset["airline_sentiment"].value_counts())

Output:

neutral     34
positive    33
negative    33
Name: airline_sentiment, dtype: int64
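Because sample() is called without a seed, every run selects different tweets, so the accuracy figures reported below will vary from run to run. If you want a repeatable selection, one option is to pass random_state to sample(). The sketch below assumes pandas is installed; balanced_sample() and the toy DataFrame are hypothetical and only illustrate the idea.

```python
import pandas as pd

def balanced_sample(df, label_col, counts, seed=42):
    """Draw a fixed number of rows per label, reproducibly."""
    parts = [
        df[df[label_col] == label].sample(n=n, random_state=seed)
        for label, n in counts.items()
    ]
    return pd.concat(parts).reset_index(drop=True)

# Toy data standing in for the airline tweets
toy = pd.DataFrame({
    "airline_sentiment": ["neutral"] * 5 + ["positive"] * 5 + ["negative"] * 5,
    "text": [f"tweet {k}" for k in range(15)],
})
counts = {"neutral": 2, "positive": 2, "negative": 2}

first = balanced_sample(toy, "airline_sentiment", counts)
second = balanced_sample(toy, "airline_sentiment", counts)
print(first.equals(second))  # the same seed gives the same rows
```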

Next, we will define the predict_sentiment() function, which accepts the model client, the system prompt, and the user query and generates a model response.


def predict_sentiment(model, system_role, user_query):

    response = model.chat_completion(
        messages=[
            {"role": "system", "content": system_role},
            {"role": "user", "content": user_query},
        ],
        max_tokens=10,
    )

    return response.choices[0].message.content

The next step is to iterate through the 100 tweets in the dataset and predict the sentiment of each tweet using the Qwen 2.5-72b and Llama 3.3-70b models. The following script performs this step:

models = {
    "qwen2.5-72b": qwen_model_client,
    "llama3.3-70b": llama_model_client
}

tweets_list = dataset["text"].tolist()
all_sentiments = []
exceptions = 0

for i, tweet in enumerate(tweets_list, 1):
    for model_name, model_client in models.items():
        try:
            print(f"Processing tweet {i} with model {model_name}")


            system_role = "You are an expert in annotating tweets with positive, negative, and neutral emotions"

            user_query = (
                f"What is the sentiment expressed in the following tweet about an airline? "
                f"Select sentiment value from positive, negative, or neutral. "
                f"Return only the sentiment value in small letters.\n\n"
                f"tweet: {tweet}"
            )

            sentiment_value = predict_sentiment(model_client, system_role, user_query)
            all_sentiments.append({
                'tweet_id': i,
                'model': model_name,
                'sentiment': sentiment_value
            })
            print(i, model_name, sentiment_value)

        except Exception as e:
            print("===================")
            print("Exception occurred with model:", model_name, "| Tweet:", i, "| Error:", e)
            exceptions += 1

    print("============================================")
print("Total exception count:", exceptions)

Output:

img2.png
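The loop above counts failed calls and moves on, so a tweet that hits a transient API error gets no prediction. A common alternative is to retry the call a few times with an increasing delay. The following is a minimal sketch of that idea; call_with_retries() and its parameters are hypothetical, and flaky_predict() only simulates a call that fails twice before succeeding.

```python
import time

def call_with_retries(func, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func, retrying on any exception with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == max_attempts:
                raise  # give up and let the caller handle the error
            sleep(base_delay * 2 ** (attempt - 1))

# Simulated model call: fails on the first two attempts, then succeeds
attempts = {"count": 0}

def flaky_predict(tweet):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "negative"

delays = []  # record the requested delays instead of actually sleeping
result = call_with_retries(flaky_predict, "late again", sleep=delays.append)
print(result, delays)
```

Inside the loop, the prediction call could then be wrapped as call_with_retries(predict_sentiment, model_client, system_role, user_query).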

Finally, we can use the following script to calculate the accuracy scores for both models:

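results_df = pd.DataFrame(all_sentiments)
for model_name in models.keys():
    model_results = results_df[results_df['model'] == model_name]
    # Align labels by tweet_id (1-based) so that tweets skipped after an exception do not shift the comparison
    true_labels = dataset["airline_sentiment"].iloc[(model_results['tweet_id'] - 1).to_numpy()]
    accuracy = accuracy_score(true_labels, model_results['sentiment'])
    print(f"Accuracy for {model_name}: {accuracy}")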

Output:


Accuracy for qwen2.5-72b: 0.79
Accuracy for llama3.3-70b: 0.79

The above output shows that both the Qwen and Llama models achieve exactly the same accuracy (0.79) on this 100-tweet sample. However, since Llama is a slightly lighter model (roughly 2 billion fewer parameters), I will call it the winner here.
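Keep in mind that an accuracy measured on only 100 tweets carries noticeable uncertainty. As a rough illustration, the sketch below computes a 95% Wilson score interval for an accuracy of 0.79 on 100 samples, assuming the tweets can be treated as independent draws. wilson_interval() is a hypothetical helper written for this example.

```python
import math

def wilson_interval(accuracy, n, z=1.96):
    """Wilson score interval for a proportion observed on n samples."""
    denom = 1 + z**2 / n
    center = (accuracy + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

low, high = wilson_interval(0.79, 100)
print(f"95% interval: {low:.2f} to {high:.2f}")
```

The interval spans roughly 0.70 to 0.86, so small differences between models on a sample of this size should be read with caution.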

Next, we will compare the Qwen 2.5-72b and Llama 3.3-70b models for zero-shot text summarization.

Qwen 2.5-72b vs Llama 3.3-70b For Text Summarization

We will use the News Article Dataset to compare the performance of the Qwen and Llama models. The following script imports the dataset into a Pandas DataFrame.

# Dataset download link
# https://github.com/reddzzz/DataScience_FP/blob/main/dataset.xlsx

dataset = pd.read_excel(r"D:\Datasets\dataset.xlsx")
dataset = dataset.sample(frac=1)
print(dataset.shape)
dataset.head()

Output:

img3.png

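Next, we will check the average number of characters in the human-written summaries. We will use this number as a guide for the summary length requested in the prompt and for the max_tokens cap on the model response. Note that max_tokens counts tokens, not characters, so the cap of 1200 tokens used below leaves generous headroom for a summary of about 1,150 characters.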

dataset['summary_length'] = dataset['human_summary'].apply(len)
average_length = dataset['summary_length'].mean()
print(f"Average length of summaries: {average_length:.2f} characters")

Output:

Average length of summaries: 1168.78 characters
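Since this average is measured in characters while the model's output limit is measured in tokens, it can help to convert between the two. The sketch below uses the common rule of thumb of about four characters per token for English text, which is an assumption and not an exact property of either model's tokenizer. estimate_max_tokens() and its headroom factor are hypothetical.

```python
import math

def estimate_max_tokens(avg_chars, chars_per_token=4.0, headroom=1.5):
    """Rough token budget for a target length in characters (rule-of-thumb only)."""
    return math.ceil(avg_chars / chars_per_token * headroom)

print(estimate_max_tokens(1168.78))
```

Any cap above this estimate, including the 1200 tokens used in this article, should be enough for summaries of this length. For an exact count you would need each model's own tokenizer.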

Next, we will define the generate_summary() function, which accepts the model client, the system prompt, and the user query containing the article, and returns the summary of the article.

def generate_summary(model, system_role, user_query):

    response = model.chat_completion(
        messages=[
            {"role": "system", "content": system_role},
            {"role": "user", "content": user_query},
        ],
        max_tokens=1200,
    )

    return response.choices[0].message.content

We will use the ROUGE score metric to evaluate model text summarization performance. The following script defines the calculate_rouge() function, which accepts human-written and model-generated summaries as parameters and returns the ROUGE scores.


# Function to calculate ROUGE scores
def calculate_rouge(reference, candidate):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, candidate)
    return {key: value.fmeasure for key, value in scores.items()}
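To build some intuition for what these numbers mean, here is a simplified sketch of the ROUGE-1 F-measure computed by hand from unigram overlap. It is not the rouge-score implementation: it splits on whitespace and skips stemming, so its values will differ slightly from the library's. simple_rouge1_f() is a hypothetical name.

```python
from collections import Counter

def simple_rouge1_f(reference, candidate):
    """Simplified ROUGE-1 F-measure: unigram overlap between two texts."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Each shared word counts at most as often as it appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = simple_rouge1_f("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 3))
```

ROUGE-2 applies the same idea to pairs of adjacent words, and ROUGE-L is based on the longest common subsequence of the two texts.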

Next, we will loop through the first 20 articles in the dataset to summarize them using the Qwen 2.5-72b and Llama 3.3-70b models. The generate_summary() function will generate model summaries, and the calculate_rouge() function will compute ROUGE scores for each generated summary.

The results, including ROUGE scores for all summaries generated by the Qwen 2.5-72b and Llama 3.3-70b models, will be stored in a Pandas DataFrame.

models = {"qwen2.5-72b": qwen_model_client,
          "llama3.3-70b": llama_model_client}

results = []

i = 0
for _, row in dataset[:20].iterrows():
    article = row['content']
    human_summary = row['human_summary']

    i = i + 1

    for model_name, model_client in models.items():

        print(f"Summarizing article {i} with model {model_name}")
        system_role = "You are an expert in creating summaries from text"
        user_query = f"Summarize the following article in 1150 characters. The summary should look like human created:\n\n{article}\n\nSummary:"

        generated_summary = generate_summary(model_client, system_role, user_query)
        rouge_scores = calculate_rouge(human_summary, generated_summary)

        results.append({
            'model': model_name,
            'article_id': row.id,
            'generated_summary': generated_summary,
            'rouge1': rouge_scores['rouge1'],
            'rouge2': rouge_scores['rouge2'],
            'rougeL': rouge_scores['rougeL']
        })

# Create a DataFrame with results
results_df = pd.DataFrame(results)

Output:

img4.png

Finally, we will compare the two models by calculating the mean ROUGE scores for all the article summaries.

average_scores = results_df.groupby('model')[['rouge1', 'rouge2', 'rougeL']].mean()
average_scores_sorted = average_scores.sort_values(by='rouge1', ascending=False)
print("Average ROUGE scores by model:")
average_scores_sorted.head()

Output:

img5.png

The Qwen model performs better for text summarization per all the ROUGE metrics.
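Mean scores can hide how consistent the difference is across articles. As a complementary check, you could count how many articles each model wins on a given metric. The sketch below assumes result rows shaped like the dictionaries appended to the results list above; count_wins() is a hypothetical helper, and the toy scores are made up for illustration and are not results from this experiment.

```python
from collections import defaultdict

def count_wins(results, metric="rouge1"):
    """Count, per model, the articles on which it has the strictly highest score."""
    by_article = defaultdict(dict)
    for row in results:
        by_article[row["article_id"]][row["model"]] = row[metric]

    wins = defaultdict(int)
    for scores in by_article.values():
        best = max(scores.values())
        winners = [model for model, score in scores.items() if score == best]
        # A shared best score is counted as a tie rather than a win
        wins[winners[0] if len(winners) == 1 else "tie"] += 1
    return dict(wins)

# Made-up scores for illustration only
toy_results = [
    {"model": "qwen2.5-72b", "article_id": 1, "rouge1": 0.50},
    {"model": "llama3.3-70b", "article_id": 1, "rouge1": 0.45},
    {"model": "qwen2.5-72b", "article_id": 2, "rouge1": 0.40},
    {"model": "llama3.3-70b", "article_id": 2, "rouge1": 0.42},
    {"model": "qwen2.5-72b", "article_id": 3, "rouge1": 0.55},
    {"model": "llama3.3-70b", "article_id": 3, "rouge1": 0.55},
]
print(count_wins(toy_results))
```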

Conclusion

Qwen and Meta Llama models are two major players in the open-source LLM arena. In this article, we compared the performance of the Qwen 2.5-72b and Llama 3.3-70b models for zero-shot text classification and summarization. The results show that Qwen and Llama achieve the same accuracy for text classification. However, Qwen performs better than Llama for text summarization.

To conclude, I recommend Qwen 2.5-72b if you are looking for a robust open-source LLM. However, we will still have to wait to see how Llama 3.3-70b evolves before reaching a final conclusion.
