Extract and Count Reviews Script

This script was the proof of concept for a similar WordPress plugin that automatically counts all single-product ratings in each category and writes the correct total number of reviews into the "aggregate rating" Schema.org markup on the category pages.

We had a case where this was the optimal solution for displaying the correct "aggregate rating" count in "Recipe Rich Results" for a food blog/recipe website.

As of today, Google does not seem to pay too much attention to this, but there are indications that the math is becoming more important.
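To make "the math" concrete, here is a minimal sketch of how per-product ratings could be combined into one category-level aggregate rating. The function name build_aggregate_rating and the sample numbers are hypothetical and not taken from the plugin; ratingValue and reviewCount are the Schema.org AggregateRating property names, and the review-weighted average is an assumption about how the category value would be computed.

```python
import json

def build_aggregate_rating(products):
    """Combine (rating_value, review_count) pairs into one AggregateRating dict.

    Hypothetical helper: the category rating is assumed to be the
    review-weighted average of the individual product ratings.
    """
    total_reviews = sum(count for _, count in products)
    if total_reviews == 0:
        # No reviews at all: better to emit no markup than a made-up rating
        return None
    weighted_sum = sum(value * count for value, count in products)
    return {
        "@type": "AggregateRating",
        "ratingValue": round(weighted_sum / total_reviews, 2),
        "reviewCount": total_reviews,
    }

if __name__ == "__main__":
    # Sample data only: (average rating, number of reviews) per product
    sample = [(4.0, 10), (5.0, 30), (3.0, 0)]
    print(json.dumps(build_aggregate_rating(sample), indent=2))
```

Products without reviews contribute nothing to either sum, so they do not distort the category value.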

Description

This script extracts the total number of reviews from all categories listed in a sitemap and saves the results to a file. It is specifically designed to work with webpages where review counts are displayed in a specific format (e.g., "(123)").

Usage

Run the Script: Replace the placeholder (https://example.com/category-sitemap.xml) with the actual URL of the sitemap, then execute the script in a Python environment.
Output: The total reviews per category are saved in result.txt, one line per category in the form "category-name: N reviews".
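If the totals need to be reused afterwards, the lines in result.txt can be read back into a dictionary. The following is a small sketch that assumes the "name: N reviews" line format written by the script; parse_results and the sample category names are hypothetical.

```python
import re

# Matches lines such as "desserts: 128 reviews"
LINE_PATTERN = re.compile(r"^(?P<name>.+): (?P<count>\d+) reviews$")

def parse_results(lines):
    """Turn 'name: N reviews' lines into a {name: N} dictionary."""
    totals = {}
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match:  # silently skip lines that do not fit the format
            totals[match.group("name")] = int(match.group("count"))
    return totals

if __name__ == "__main__":
    # Sample lines only; with a real file, pass the open file object instead:
    #   with open('result.txt', encoding='utf-8') as file:
    #       totals = parse_results(file)
    sample_lines = ["desserts: 128 reviews\n", "soups: 0 reviews\n"]
    print(parse_results(sample_lines))
```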

Requirements

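Python libraries: requests, beautifulsoup4, and lxml (needed for the 'xml' parser that reads the sitemap). The re module is part of the Python standard library and does not need to be installed.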

Special Notes

Review Format: This script is suitable for webpages where the number of reviews is enclosed in parentheses, such as "(123)". It uses a regular expression to identify and extract these numbers.
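To see what that regular expression picks up, here is a short standalone sketch using the same pattern as the script. The helper extract_counts and the sample strings are made up for illustration. Note that the pattern matches any integer in parentheses, so a string like "(2024)" in a page would be counted as well.

```python
import re

# Same pattern as in the script: an integer enclosed in parentheses
review_pattern = re.compile(r'\((\d+)\)')

def extract_counts(text):
    """Return every parenthesised integer found in the text."""
    return [int(number) for number in review_pattern.findall(text)]

if __name__ == "__main__":
    print(extract_counts("Apple Pie (123)"))
    print(extract_counts("Pancakes (7) and Waffles (12)"))
    print(extract_counts("No reviews yet"))
    print(extract_counts("Rated 4.5 (out of 5)"))
```

Unlike this helper, the script calls search() once per text node, so it counts only the first "(n)" in each node.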

# scrape_review_count.py
# Author: Christopher Hüneke
# Date: 04.08.2024
# Description: This script extracts the total number of reviews from all categories listed in a sitemap and saves the results to a file.
#              It is specifically designed to work with webpages where review counts are displayed in a specific format (e.g., "(123)").

import requests
from bs4 import BeautifulSoup
import re

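# Function to sum the review counts shown on every paginated page of a category URL
def get_total_reviews(url):
    total_reviews = 0
    page_number = 1
    review_pattern = re.compile(r'\((\d+)\)')
    base_url = url.rstrip('/')  # avoid a double slash when appending /page/N/

    while True:
        page_url = f"{base_url}/page/{page_number}/" if page_number > 1 else url
        response = requests.get(page_url, timeout=10)

        # Stop at the first missing page (404) or any other non-success status
        if response.status_code != 200:
            break

        soup = BeautifulSoup(response.content, 'html.parser')
        # Text nodes that contain a parenthesised number such as "(123)"
        page_reviews = soup.find_all(string=review_pattern)

        # A page without any counts marks the end of the listing
        if not page_reviews:
            break

        for review_text in page_reviews:
            # Only the first "(n)" in each text node is counted
            match = review_pattern.search(review_text)
            if match:
                total_reviews += int(match.group(1))

        page_number += 1

    return total_reviews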

# Main function to process the sitemap and extract reviews for each category
def main():
    sitemap_url = 'https://example.com/category-sitemap.xml'  # Replace with the actual sitemap URL
    response = requests.get(sitemap_url, timeout=10)
    response.raise_for_status()  # fail early if the sitemap cannot be fetched
    soup = BeautifulSoup(response.content, 'xml')  # the 'xml' parser requires lxml
    categories = soup.find_all('loc')

    results = []

    for category in categories:
        category_url = category.text.strip()
        # Use the last path segment as the name, with or without a trailing slash
        category_name = category_url.rstrip('/').split('/')[-1]
        print(f"Processing category: {category_name}")
        total_reviews = get_total_reviews(category_url)
        results.append(f"{category_name}: {total_reviews} reviews\n")

    # One line per category, e.g. "category-name: 42 reviews"
    with open('result.txt', 'w', encoding='utf-8') as file:
        file.writelines(results)

    print("Results saved to result.txt")

if __name__ == '__main__':
    main()

Sorry, I got that wrong. I can't post scripts in the Digital Marketing section, right?

No, just in Web Development, but you can use the seo tag.

Great script! It’s a practical solution for aggregating review counts across categories. Automating the process for accurate Schema.org markup can definitely improve SEO and visibility. Thanks for sharing!
