Web Crawler - Leetcode Solution

Difficulty: Medium

Topics: string, depth-first-search, breadth-first-search

Problem Statement:

The problem asks us to implement a web crawler that starts at a given URL, follows every link it finds to new pages on the same site, and continues recursively until every reachable page has been visited exactly once. The crawler must not revisit pages it has already visited.

Solution:

A web crawler is a bot that automatically visits pages and extracts information from them. First, we need to understand the requirements a web crawler has to fulfill. In general, a web crawler should have the following components:

  1. A queue, which stores links to be fetched.
  2. A set, which stores the links that have already been fetched.
  3. A method to check if a link has already been fetched.
  4. A method to fetch the content of a page at a given URL.
  5. A method to extract the links from a page's content.
  6. A method to add found links to the queue.

Using these components, we can build a web crawler.

To implement the web crawler, we start with a queue containing only the starting URL. We repeatedly take a URL from the front of the queue, fetch the page at that URL, and extract any links it contains, adding each link to the queue as long as it has not been fetched before.

We continue this process until the queue is empty, which guarantees that every reachable page is visited exactly once.
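
Before looking at a full crawler, here is a small sketch of the same breadth-first idea run over a hypothetical in-memory link graph; the page names and their links are made up purely for illustration:

from collections import deque

# Hypothetical link graph: each page maps to the links found on it.
link_graph = {
    'A': ['B', 'C'],
    'B': ['A', 'D'],
    'C': ['D'],
    'D': [],
}

visited = set()
queue = deque(['A'])  # start crawling at page 'A'
while queue:
    page = queue.popleft()
    if page in visited:
        continue
    visited.add(page)
    print('visiting', page)
    for link in link_graph[page]:
        if link not in visited:
            queue.append(link)

The pages are printed in the order A, B, C, D, and the visited set is what stops A and D from being processed a second time.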

A simple implementation of the web crawler might look something like this:

import urllib.request
import urllib.error
from urllib.parse import urlparse, urljoin
from collections import deque
from bs4 import BeautifulSoup

class WebCrawler:
    def __init__(self, start_url):
        self.start_url = start_url
        # Only pages on the same host as the start URL are crawled.
        self.hostname = urlparse(start_url).netloc
        self.visited_urls = set()      # links that have already been fetched
        self.links_to_visit = deque()  # queue of links waiting to be fetched

    def crawl(self):
        self.links_to_visit.append(self.start_url)
        while self.links_to_visit:
            url = self.links_to_visit.popleft()
            if url in self.visited_urls:
                continue
            self.visited_urls.add(url)
            page_content = self.fetch_page(url)
            if page_content is None:
                continue
            for link in self.extract_links(page_content, url):
                if link not in self.visited_urls:
                    self.links_to_visit.append(link)

    def fetch_page(self, url):
        # Fetch the raw HTML of a page; return None for pages that cannot be retrieved.
        try:
            with urllib.request.urlopen(url) as response:
                return response.read()
        except (urllib.error.URLError, ValueError):
            return None

    def extract_links(self, page_content, url):
        # Parse the HTML and collect absolute http(s) links on the same host.
        soup = BeautifulSoup(page_content, 'html.parser')
        links = []
        for link in soup.find_all('a'):
            href = link.get('href')
            if href is None:
                continue
            # Resolve relative links against the current page's URL.
            href = urljoin(url, href)
            parsed_href = urlparse(href)
            if parsed_href.scheme not in {'http', 'https'}:
                continue
            if parsed_href.netloc != self.hostname:
                continue
            links.append(href)
        return links

The above implementation provides each of the components listed earlier.

The crawl() method loops until the links_to_visit queue is empty. On each iteration it takes the next URL from the front of the queue, skips it if it has already been visited, and otherwise fetches the page and adds any newly discovered links to the queue.

The extract_links() method uses the BeautifulSoup library to parse the page content and collect every link that uses the http or https scheme and points to the same host as the start URL. Relative links are converted to absolute links with the urljoin() function from Python's urllib.parse module.
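
To run the crawler, create an instance with a start URL and call crawl(); afterwards the visited_urls set holds every page that was reached. A quick usage sketch, where https://example.com is only a placeholder for the site you actually want to crawl:

if __name__ == '__main__':
    # Placeholder start URL; replace it with a real site you are allowed to crawl.
    crawler = WebCrawler('https://example.com')
    crawler.crawl()
    for visited in sorted(crawler.visited_urls):
        print(visited)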

Conclusion:

In conclusion, a web crawler is one of the key components that make web scraping possible. It lets a program navigate from one page to another, extracting useful data along the way. The implementation provided here should work for most websites, though additional handling may be needed for edge cases such as pages that fail to load, URLs that differ only in query strings or fragments, or sites that restrict crawling through robots.txt.

Web Crawler Solution Code
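
On LeetCode itself, pages are not fetched over the network: the judge supplies an HtmlParser helper whose getUrls(url) method returns the links found on a page, and only URLs that share the start URL's hostname should be returned. The sketch below maps the same BFS onto that interface; treat the exact method names as an assumption, since they may differ between problem versions:

from collections import deque
from urllib.parse import urlparse
from typing import List

class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        # Restrict the crawl to URLs on the same hostname as the start URL.
        hostname = urlparse(startUrl).netloc
        visited = {startUrl}
        queue = deque([startUrl])
        while queue:
            url = queue.popleft()
            # Assumed helper: getUrls(url) returns the links found on that page.
            for link in htmlParser.getUrls(url):
                if link not in visited and urlparse(link).netloc == hostname:
                    visited.add(link)
                    queue.append(link)
        return list(visited)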
