Summarizing YouTube Video Transcriptions Using Distil Whisper and LLM


In this tutorial, you will see how to summarize YouTube video transcriptions using Distil Whisper Large V3 and Mistral-7B-Instruct. Both models are open-source and free to use.

The Distil Whisper Large V3 model is a smaller, faster variant of Whisper Large V3, a state-of-the-art speech-to-text model. You will use this model to transcribe YouTube audio. Next, you will use the Mistral-7B-Instruct LLM to summarize the transcription. In the process, you will learn how to extract audio from YouTube videos. We have many interesting things to cover, so let's begin without further ado.

Importing and Installing Required Libraries

As always, the first step is to install and import the required libraries. The following script installs the libraries needed to run the code in this tutorial.


!pip install -q -U transformers==4.38.0
!pip install -q -U bitsandbytes==0.42.0
!pip install -q -U accelerate==0.27.1
!pip install -q datasets
!pip install -q pytube

The script below imports the required libraries.


import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, logging
from datasets import load_dataset
from pytube import YouTube
from transformers import BitsAndBytesConfig

Extracting Audio from YouTube Videos

We will begin by extracting audio from the YouTube video we want to transcribe.
You can use the YouTube class from the pytube module, as shown in the following script.

youtube_video_url = "https://www.youtube.com/watch?v=5sLYAQS9sWQ"
youtube_video_content = YouTube(youtube_video_url)
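Before going further, it is worth sanity-checking that pytube fetched the right video. The YouTube object exposes basic metadata such as the title, author, and duration:

print(youtube_video_content.title)   # video title
print(youtube_video_content.author)  # channel name
print(youtube_video_content.length)  # duration in seconds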

The streams attribute of the YouTube class object returns various audio and video streams.

for stream in youtube_video_content.streams:
  print(stream)

Output:

[Image: the list of the video's available audio and video streams]

We are only interested in the audio streams here. We will filter for the audio/mp4 stream with an abr (average bitrate) of 128kbps. You can select any other audio stream.

audio_stream = [
    stream for stream in youtube_video_content.streams
    if stream.mime_type == "audio/mp4" and stream.abr == "128kbps"
][0]
audio_stream

Output:

<Stream: itag="140" mime_type="audio/mp4" abr="128kbps" acodec="mp4a.40.2" progressive="False" type="audio">
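If you prefer not to filter the streams manually, pytube's stream query API has helpers for this. For example, get_audio_only() returns the highest-bitrate audio/mp4 stream, which on most videos is the same 128kbps stream selected above:

# Equivalent selection using pytube's built-in helpers
audio_stream = youtube_video_content.streams.get_audio_only()

# Or, more explicitly, filter the audio streams and pick the highest bitrate:
# audio_stream = youtube_video_content.streams.filter(
#     only_audio=True, mime_type="audio/mp4").order_by("abr").last()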

Next, we will download the audio stream using the download method of the stream we filtered.

audio_path = audio_stream.download("intro_to_llms")
audio_path

Output:

/content/intro_to_llms/How Large Language Models Work.mp4

We have downloaded the audio for our YouTube video. The next step is to transcribe this audio into text.

Transcribing Audio Using Distil Whisper Large V3

To transcribe the YouTube audio, we will use the Distil Whisper Large V3 model from the Hugging Face library. The following script downloads the model and its input/output processor.

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

whisper_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True
)
whisper_model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

Next, we will define a pipeline that takes the audio file as input, preprocesses and tokenizes it into segments, and generates the transcription.


pipe = pipeline(
    "automatic-speech-recognition",
    model=whisper_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe(audio_path)
print(result["text"])

Output:

[Image: the transcribed text of the video]
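For longer videos, you can speed up transcription by letting the pipeline chunk the audio and transcribe several chunks per forward pass. The chunk_length_s and batch_size arguments below are standard options of the transformers automatic-speech-recognition pipeline; the values shown are reasonable starting points rather than tuned settings:

# Optional: chunked, batched transcription for long audio files
pipe_chunked = pipeline(
    "automatic-speech-recognition",
    model=whisper_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=25,  # split the audio into 25-second chunks
    batch_size=16,      # transcribe several chunks in parallel
    torch_dtype=torch_dtype,
    device=device,
)
result = pipe_chunked(audio_path)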

We have transcribed the YouTube video; the next step is to summarize it using an LLM.

Summarizing YouTube Video Text Using Mistral-7B-Instruct LLM

We will use the Mistral-7B-Instruct LLM to summarize the YouTube audio transcription. To learn more about Mistral-7B, check out my article on 7 NLP Tasks to Perform for Free in Python with Mistral 7b LLM. You can use any other LLM instead of Mistral-7B.

Mistral-7B consists of seven billion parameters, so it requires a large amount of memory even for inference, let alone fine-tuning. You can use quantization techniques to reduce the size of the weights of such huge LLMs. The script below defines a quantization configuration that loads an LLM's weights in 4-bit precision. We will use this configuration when loading our Mistral-7B model.


# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

Next, we will import the Mistral-7B model and its tokenizer from the Hugging Face library.

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
device = "cuda" # the device to load the model onto
LLM = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"": 0}
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
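To confirm that 4-bit quantization actually shrank the model, you can print its memory footprint; get_memory_footprint() is a standard method on Hugging Face models. For a seven-billion-parameter model, expect roughly 4GB in 4-bit precision versus roughly 14GB in float16:

# Report the quantized model's size in gigabytes
print(f"Memory footprint: {LLM.get_memory_footprint() / 1e9:.2f} GB")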

We can now use the Mistral-7B model for summarization. To do so, we will define the generate_response() function, which takes the input text, the number of response tokens, and the temperature value as parameters. The temperature should be between 0 and 1, with higher values allowing more creative model responses. The function uses the Mistral-7B model to generate a response.


def generate_response(input_text, response_tokens, temperature):
  # Wrap the input text in Mistral's chat prompt format
  messages = [
      {"role": "user", "content": input_text},
  ]
  encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

  model_inputs = encodeds.to(device)

  # Generate a response of up to response_tokens tokens
  generated_ids = LLM.generate(model_inputs,
                               max_new_tokens=response_tokens,
                               temperature=temperature,
                               do_sample=True)
  decoded = tokenizer.batch_decode(generated_ids)

  # Keep only the text after the prompt and drop the end-of-sequence token
  response = decoded[0].split("[/INST]")[1]
  return response.replace("</s>", "").strip()

Finally, we can summarize the YouTube audio transcript by embedding it in a summarization prompt and passing it to the generate_response() function.

input_text = f"Summarize the following text: {result['text']}"
response = generate_response(input_text, 1000, 0.1)
print(f"Total characters in summarized result: {len(response)}")
print(response)

Output:

[Image: the summarized text of the video transcription]

The above output shows that the YouTube video transcription is summarized in less than 1000 characters.
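Keep in mind that Mistral-7B-Instruct has a limited context window (about 8K tokens), so the transcript of a very long video may not fit into a single prompt. A common workaround is map-reduce summarization: split the transcript into chunks, summarize each chunk, and then summarize the combined summaries. Here is a minimal sketch built on the generate_response() function defined above; the chunk size is a rough character-based heuristic, not a tuned value:

def summarize_long_text(text, chunk_size=6000):
    # Map step: split the transcript into chunks and summarize each one
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    partial_summaries = [
        generate_response(f"Summarize the following text: {chunk}", 500, 0.1)
        for chunk in chunks
    ]

    # Reduce step: summarize the concatenated partial summaries
    combined = " ".join(partial_summaries)
    return generate_response(f"Summarize the following text: {combined}", 1000, 0.1)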

Asking Other Questions About the Video

In addition to summarization, you can use the generate_response() function to ask other questions about the YouTube video. For example, the following script asks the model whether the video's tone is positive, negative, or neutral.

input_text = f"What is the overall tone of the following video text, positive, negative, or neutral: {result['text']}"
response = generate_response(input_text, 50, 0.1)
print(response)

Output:

[Image: the model's assessment of the video's tone]
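Along the same lines, you can ask the model for the main topics covered in the video. The prompt below is only an illustration; any question grounded in the transcript works:

input_text = f"List the three main topics discussed in the following text: {result['text']}"
response = generate_response(input_text, 200, 0.1)
print(response)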

Conclusion

Summarizing YouTube video transcriptions is a handy task: it lets users retrieve the important information in a video without spending time watching it in full. With the help of Distil Whisper and Mistral-7B, you can summarize YouTube transcriptions easily and at no cost. I hope you liked the article. Feel free to share your feedback.
