Text Classification and Summarization with DeepSeek R1 Distill Llama 70B


In the last article, I explained how you can use the DeepSeek-R1-Distill-Qwen-32B model for text classification and summarization problems.

In this article, we will use the DeepSeek-R1-Distill-Llama-70b model for the same tasks.

The following results from DeepSeek-AI's official paper show that DeepSeek-R1-Distill-Llama-70b outperforms the other distilled models on 4 out of 6 benchmarks.

img1a.png

So, let's begin without further ado and see what results we achieve with the DeepSeek-R1-Distill-Llama-70b model.

Installing and Importing Required Libraries

We will access the DeepSeek-R1-Distill-Llama-70b model via the Groq API. To run the scripts in this article, you will need a Groq API key.
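The scripts in this article read the key from Google Colab's Secrets tab, where it is stored under the name GROQ_API_KEY. If you run the code outside Colab, one alternative, shown here only as a sketch, is to supply the key through an environment variable and pass it to the Groq client instead of using userdata.get():

import os

# Sketch for non-Colab environments: store the key in an environment
# variable instead of Colab secrets. Replace the placeholder first.
os.environ["GROQ_API_KEY"] = "<your-groq-api-key>"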

The following script installs the Groq Python library and the other Python modules you will need to run the scripts in this article.

!pip install groq
!pip install rouge-score
!pip install --upgrade openpyxl
!pip install pandas

The script below imports the required libraries.

from groq import Groq
import os
import pandas as pd
from rouge_score import rouge_scorer
from sklearn.metrics import accuracy_score
from collections import defaultdict
from google.colab import userdata

Calling DeepSeek R1 Distill Llama 70B using the Groq API

Let's first see how to call the DeepSeek-R1-Distill-Llama-70b model via the Groq API.

First, you will need to create an object of the Groq client and pass it your Groq API key.

Next, we define the generate_response() function, which accepts the system instructions and the user prompt.

Finally, you need to call the chat.completions.create() function of the Groq client object to generate a response via any of the Groq models.

The model ID must be passed to the model attribute of the create() function.

The response from the DeepSeek-R1-Distill-Llama-70b model also contains the reasoning process within the <think> </think> tags. We split the response using the last </think> tag and return the remaining answer to the calling function.
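To illustrate the splitting logic on a toy string (a made-up example, not an actual model response):

# Toy illustration of stripping the <think> reasoning block
raw_output = "<think>step-by-step reasoning...</think>Final answer here."
answer = raw_output.strip().split("</think>")[-1].strip()
print(answer)  # Final answer here.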

The following script shows an example of calling the DeepSeek-R1-Distill-Llama-70b model using the Groq API.


# Create a Groq client using the API key stored in Colab secrets
client = Groq(
    api_key=userdata.get('GROQ_API_KEY'),
)

def generate_response(system_instructions, user_query):
    # Call the model with a system prompt and a user prompt
    response = client.chat.completions.create(
        model="deepseek-r1-distill-llama-70b",
        temperature=0,
        max_tokens=1000,
        messages=[
            {"role": "system", "content": system_instructions},
            {"role": "user", "content": user_query}
        ]
    )

    # Strip the reasoning enclosed in <think> </think> tags
    # and return only the final answer
    output = response.choices[0].message.content
    final_response = output.strip().split("</think>")[-1].strip()
    return final_response


system_instructions = "You are an expert Pizza chef"
user_query = "Explain the process of baking a pizza in three simple steps."
response = generate_response(system_instructions, user_query)
print(response)

Output:

img1.png

Let's now see how to use DeepSeek-R1-Distill-Llama-70b for text classification and summarization.

DeepSeek R1 Distill Llama 70B For Text Classification

We will classify tweets based on sentiment in the Twitter US Airline Sentiment Dataset. The following script imports the dataset and displays its header.

## Dataset download link
## https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment?select=Tweets.csv

dataset = pd.read_csv(r"/content/Tweets.csv")
dataset.head()

Output:

img2.png

The text column contains the tweet text, while the airline_sentiment column contains the corresponding sentiment, which can be positive, negative, or neutral.

We will perform sentiment classification on 100 tweets. The following script creates a dataset of 100 randomly selected tweets, with 34 tweets labeled as neutral, 33 as positive, and 33 as negative.


# Remove rows where 'airline_sentiment' or 'text' are NaN
dataset = dataset.dropna(subset=['airline_sentiment', 'text'])

# Remove rows where 'airline_sentiment' or 'text' are empty strings
dataset = dataset[(dataset['airline_sentiment'].str.strip() != '') & (dataset['text'].str.strip() != '')]

# Filter the DataFrame for each sentiment
neutral_df = dataset[dataset['airline_sentiment'] == 'neutral']
positive_df = dataset[dataset['airline_sentiment'] == 'positive']
negative_df = dataset[dataset['airline_sentiment'] == 'negative']

# Randomly sample records from each sentiment
neutral_sample = neutral_df.sample(n=34)
positive_sample = positive_df.sample(n=33)
negative_sample = negative_df.sample(n=33)

# Concatenate the samples into one DataFrame
dataset = pd.concat([neutral_sample, positive_sample, negative_sample])

# Reset index if needed
dataset.reset_index(drop=True, inplace=True)

# print value counts
print(dataset["airline_sentiment"].value_counts())

Output:

airline_sentiment
neutral     34
positive    33
negative    33
Name: count, dtype: int64

Next, we will iterate through the list of 100 tweets in our dataset and predict each tweet's sentiment using the generate_response() method we defined earlier.



tweets_list = dataset["text"].tolist()
all_sentiments = []
exceptions = 0

for i, tweet in enumerate(tweets_list, 1):

    try:
        print(f"Processing tweet {i}")
        system_instructions = """You are an expert in annotating tweets with positive, negative, and neutral emotions. Think step by step."""

        user_query = (
            f"What is the sentiment expressed in the following tweet about an airline? "
            f"Select sentiment value from positive, negative, or neutral. "
            f"Return only the sentiment value in small letters.\n\n"
            f"tweet: {tweet}"
        )

        sentiment_value = generate_response(system_instructions, user_query)

        all_sentiments.append({
            'tweet_id': i,
            'sentiment': sentiment_value
        })
        print(i, sentiment_value)

    except Exception as e:
        print("===================")
        print("Exception occurred with Tweet:", i, "| Error:", e)
        exceptions += 1

print("Total exception count:", exceptions)

Output:

Overall Accuracy: 0.69

The above output shows that the DeepSeek-R1-Distill-Llama-70b model achieves an accuracy of 69%, which is far lower than the 87% accuracy achieved by the DeepSeek-R1-Distill-Qwen-32B model in the previous article.

Based on these results, the DeepSeek-R1-Distill-Qwen-32B model appears to be the better choice for text classification, particularly sentiment classification.

Let's see how both models compare for text summarization.

DeepSeek R1 Distill Llama 70B for Text Summarization

We will summarize articles in the News Article Dataset. The following script imports the dataset.

# Dataset download link
# https://github.com/reddzzz/DataScience_FP/blob/main/dataset.xlsx

dataset = pd.read_excel(r"/content/dataset.xlsx")
dataset = dataset.sample(frac=1)
print(dataset.shape)
dataset.head()

Output:

img3.png

To compare the models, we will use the average ROUGE scores for all model-generated summaries.

The following function returns ROUGE scores for a single summary.

# Function to calculate ROUGE scores
def calculate_rouge(reference, candidate):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, candidate)
    return {key: value.fmeasure for key, value in scores.items()}
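As a quick sanity check, you can call calculate_rouge() on a pair of toy strings (a hypothetical example, not part of the dataset):

# Hypothetical sanity check with two toy strings
example_scores = calculate_rouge(
    "The cat sat on the mat.",
    "A cat was sitting on the mat."
)
print(example_scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ...}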

Next, we will iterate through the first 20 articles in our dataset, generate article summaries using the generate_response() function, and calculate ROUGE scores for each summary using the calculate_rouge() function.

Finally, we calculate the average ROUGE scores of all the summaries.

Note: We used the first 20 articles to test model performance. You can use any number of articles for testing.



results = []

# Iterate through the first 20 articles in the dataset
for i, (_, row) in enumerate(dataset[:20].iterrows(), start=1):
    article = row['content']
    human_summary = row['human_summary']

    print(f"Summarizing article {i}.")
    system_instructions = "You are an expert in creating summaries from text"
    user_query = f"""Summarize the following article in 1150 characters. Do not return your thought process. Only the summary.
    Your summary will be evaluated using ROUGE score. The summary should look like human created:\n\n{article}\n\nSummary:"""

    generated_summary = generate_response(system_instructions, user_query)

    rouge_scores = calculate_rouge(human_summary, generated_summary)

    results.append({
        'article_id': row.id,
        'generated_summary': generated_summary,
        'rouge1': rouge_scores['rouge1'],
        'rouge2': rouge_scores['rouge2'],
        'rougeL': rouge_scores['rougeL']
    })

# Create a DataFrame with results
results_df = pd.DataFrame(results)

mean_values = results_df[['rouge1', 'rouge2', 'rougeL']].mean()
print(mean_values)

Output:

rouge1    0.347660
rouge2    0.100158
rougeL    0.183272

The above output shows ROUGE scores for the DeepSeek-R1-Distill-Llama-70b model. These scores are lower than those achieved by the DeepSeek-R1-Distill-Qwen-32B model in the previous article.

Conclusion

This article explains how to classify and summarize text using the DeepSeek-R1-Distill-Llama-70b model. The results show that for both text classification and summarization tasks, the DeepSeek-R1-Distill-Llama-70b model returns inferior results compared to the much lighter DeepSeek-R1-Distill-Qwen-32B model.

Based on these experiments, the DeepSeek-R1-Distill-Qwen-32B appears to be the go-to model of the two for text classification and summarization tasks.

rproffitt (Moderator) commented:

"Wiz Research Uncovers Exposed DeepSeek Database Leaking Sensitive Information"
"Security researchers tested 50 well-known jailbreaks against DeepSeek’s popular new AI chatbot. It didn’t stop a single one."

It only seems to get worse the more you look at DeepSeek. And I must note how it is known to not want to talk about Tiananmen Square.

As such I think I will be downvoting DeepSeek R1 tutorials.
