In this tutorial, you will see how to summarize YouTube video transcriptions using Distil Whisper Large V3 and Mistral-7B-Instruct. Both models are open source and free to use.
The Distil Whisper Large V3 model is a faster and smaller variant of the Whisper Large V3 model, a state-of-the-art speech-to-text model. You will use this model to transcribe YouTube audio. Next, you will use the Mistral-7B-Instruct LLM to summarize the transcriptions. In the process, you will learn to extract audio from YouTube videos. We have many interesting things to see, so let's begin without further ado.
Importing and Installing Required Libraries
As always, the first step is to install and import the required libraries. The following script installs the libraries required to run the code in this tutorial.
!pip install -q -U transformers==4.38.0
!pip install -q -U bitsandbytes==0.42.0
!pip install -q -U accelerate==0.27.1
!pip install -q datasets
!pip install -q pytube
The script below imports the required libraries.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, logging
from datasets import load_dataset
from pytube import YouTube
from transformers import BitsAndBytesConfig
Extracting Audio from YouTube Videos
We will begin by extracting audio from the YouTube video we want to transcribe.
You can use the YouTube class from the pytube module, as shown in the following script.
youtube_video_url = "https://www.youtube.com/watch?v=5sLYAQS9sWQ"
youtube_video_content = YouTube(youtube_video_url)
The streams attribute of the YouTube class object returns various audio and video streams.
for stream in youtube_video_content.streams:
    print(stream)
Output:
We are only interested in the audio streams here. We will filter for the audio/mp4 stream with a 128kbps ABR (average bitrate). You can select any other audio stream.
audio_stream = [stream for stream in youtube_video_content.streams if stream.mime_type == "audio/mp4" and stream.abr == "128kbps"][0]
audio_stream
Output:
<Stream: itag="140" mime_type="audio/mp4" abr="128kbps" acodec="mp4a.40.2" progressive="False" type="audio">
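If a video does not offer the exact 128kbps stream, the list indexing above raises an IndexError. As a minimal sketch, assuming each stream exposes the mime_type and abr attributes seen in the output above, the hypothetical helper pick_audio_stream below selects the highest-bitrate audio/mp4 stream instead. The SimpleNamespace objects are stand-ins for pytube streams so that the example runs without network access.

```python
from types import SimpleNamespace


def pick_audio_stream(streams, mime_type="audio/mp4"):
    """Return the stream with the highest average bitrate for a MIME type.

    Hypothetical helper: it only assumes each stream has `mime_type` and
    `abr` attributes, with abr formatted like "128kbps" (or None).
    """
    def bitrate_kbps(stream):
        # "128kbps" -> 128
        return int(stream.abr.replace("kbps", ""))

    candidates = [s for s in streams if s.mime_type == mime_type and s.abr]
    if not candidates:
        raise ValueError(f"No {mime_type} stream found")
    return max(candidates, key=bitrate_kbps)


# Stand-in objects that mimic the attributes of pytube streams.
demo_streams = [
    SimpleNamespace(itag=18, mime_type="video/mp4", abr=None),
    SimpleNamespace(itag=139, mime_type="audio/mp4", abr="48kbps"),
    SimpleNamespace(itag=140, mime_type="audio/mp4", abr="128kbps"),
    SimpleNamespace(itag=251, mime_type="audio/webm", abr="160kbps"),
]

best = pick_audio_stream(demo_streams)
print(best.itag, best.abr)
```

With pytube, you would pass youtube_video_content.streams to the helper in place of demo_streams.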
Next, we will download the audio stream using the download method of the stream we filtered.
audio_path = audio_stream.download("intro_to_llms")
audio_path
Output:
/content/intro_to_llms/How Large Language Models Work.mp4
We have downloaded the audio for our YouTube video. The next step is to transcribe this audio into text.
Transcribing Audio Using Distil Whisper Large V3
To transcribe the YouTube audio, we will use the Distil Whisper Large V3 model from the Hugging Face library. The following script downloads the model and its input/output processor.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "distil-whisper/distil-large-v3"
whisper_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True
)
whisper_model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
Next, we will define a pipeline that takes the audio file as input, preprocesses and tokenizes it into segments, and generates transcriptions.
pipe = pipeline(
    "automatic-speech-recognition",
    model=whisper_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)
result = pipe(audio_path)
print(result["text"])
Output:
We have transcribed the YouTube video; the next step is to summarize it using an LLM.
Summarizing YouTube Video Text Using Mistral-7B-Instruct LLM
We will use the Mistral-7B-Instruct LLM to summarize YouTube audio transcriptions. To learn more about Mistral-7B, check my article on 7 NLP Tasks to Perform for Free in Python with Mistral 7b LLM. You can use any other LLM instead of Mistral-7B.
Mistral-7B consists of seven billion parameters. Using it requires a large amount of memory even for inference, let alone fine-tuning. You can use quantization techniques to reduce the weight sizes of such huge LLMs. The script below defines a quantization configuration that reduces the weights of an LLM to 4 bits. We will use this configuration when loading our Mistral-7B model.
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
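To get a rough sense of why 4-bit quantization helps, you can estimate the memory needed just to store the weights at different precisions. The sketch below is back-of-the-envelope arithmetic only: it rounds the model to seven billion parameters and ignores activations, the generation cache, and any bookkeeping overhead that quantization adds, so real memory usage will be higher. The weight_memory_gb function is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(num_parameters, bits_per_weight):
    """Approximate memory needed to store the weights alone, in gigabytes."""
    bytes_total = num_parameters * bits_per_weight / 8
    return bytes_total / 1e9


num_parameters = 7_000_000_000  # "seven billion parameters", rounded

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(num_parameters, bits):.1f} GB")
```

Under these assumptions, the weights shrink from roughly 14 GB in 16-bit precision to roughly 3.5 GB in 4-bit precision.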
Next, we will load the Mistral-7B model and its tokenizer from the Hugging Face library.
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
device = "cuda"  # the device to load the model onto
LLM = AutoModelForCausalLM.from_pretrained(model_id,
                                           quantization_config=bnb_config,
                                           device_map={"": 0})
tokenizer = AutoTokenizer.from_pretrained(model_id)
We can now use the Mistral-7B model for summarization. To do so, we will define the generate_response() function, which takes the input text, the number of response tokens, and the temperature value. In this tutorial, we keep the temperature between 0 and 1 (it must be greater than 0 when sampling is enabled), with higher values allowing more creative model responses. The generate_response() function uses the Mistral-7B model to generate the model response.
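def generate_response(input_text, response_tokens, temperature):
    # Wrap the prompt in the chat format expected by the instruct model
    messages = [
        {"role": "user", "content": input_text},
    ]
    encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
    model_inputs = encodeds.to(device)
    generated_ids = LLM.generate(model_inputs,
                                 max_new_tokens=response_tokens,
                                 temperature=temperature,
                                 do_sample=True)
    decoded = tokenizer.batch_decode(generated_ids)
    # Keep the text after [/INST] and remove the exact end-of-sequence marker.
    # removesuffix() needs Python 3.9+; rstrip("</s>") would also strip a
    # trailing "s" from the last word of the response.
    return decoded[0].split("[/INST]")[1].removesuffix("</s>")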
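To make the effect of the temperature parameter more concrete, the following sketch shows the standard way temperature is applied when sampling: the raw token scores (logits) are divided by the temperature before being converted to probabilities. This is a simplified illustration with made-up scores for three candidate tokens, not the internal code of the transformers library, and softmax_with_temperature is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

for temperature in (0.1, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At a temperature of 0.1, nearly all of the probability goes to the top-scoring token, so the output is close to deterministic. At 1.0, the lower-scoring tokens keep a meaningful share, which is where the more varied responses come from.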
Finally, we can summarize the YouTube audio transcript by passing it in the summarization prompt to the generate_response() function.
input_text = f"Summarize the following text: {result['text']}"
response = generate_response(input_text, 1000, 0.1)
print(f"Total characters in summarized result: {len(response)}")
print(response)
Output:
The above output shows that the YouTube video transcription is summarized in fewer than 1,000 characters.
Asking Other Questions About the Video
In addition to summarization, you can use the generate_response() function to ask other questions about the YouTube video. For example, the following script asks the model whether the video's tone is positive, negative, or neutral.
input_text = f"What is the overall tone of the following video text, positive, negative, or neutral: {result['text']}"
response = generate_response(input_text, 50, 0.1)
print(response)
Output:
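The model answers in free text, so if you want to use the tone in later code, you may need to map the response onto one of the three labels. The sketch below shows one simple way to do that with a regular expression. The extract_tone_label function is a hypothetical helper, and the sample responses are made up for illustration rather than taken from the model.

```python
import re


def extract_tone_label(response_text):
    """Return 'positive', 'negative', or 'neutral', whichever is mentioned first.

    Hypothetical helper; returns None when no label appears in the text.
    """
    match = re.search(r"\b(positive|negative|neutral)\b", response_text.lower())
    return match.group(1) if match else None


# Made-up responses for illustration, not actual model output.
print(extract_tone_label("The overall tone of the video text is Neutral."))
print(extract_tone_label("I cannot determine the tone."))
```

In practice, you would call extract_tone_label(response) after generate_response() and handle the None case, for example by asking the question again.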
Conclusion
Summarizing YouTube video transcriptions is a handy task, as it lets users retrieve the important information in a video without spending time watching it in full. With the help of Distil Whisper and Mistral-7B, you can summarize YouTube transcriptions easily and at no cost. I hope you liked the article. Feel free to share your feedback.