Web Crawler Multithreaded - Leetcode Solution


Difficulty: Medium

Topics: depth-first-search, breadth-first-search

Problem statement:

Given a URL startUrl and an interface HtmlParser, implement a multi-threaded web crawler that crawls all links under the same hostname as startUrl.

Return all URLs obtained by your web crawler in any order.

Your crawler should:

Start from the page: startUrl

Call HtmlParser.getUrls(url) to get all URLs from the webpage at the given URL.

Do not crawl the same link twice.

Explore only the links that are under the same hostname as startUrl.

The hostname of a URL is the part between http:// and the next /; for example, the hostname of http://example.org/foo is example.org. For simplicity's sake, you may assume all URLs use the HTTP protocol without any port specified. For example, the URLs http://leetcode.com/problems and http://leetcode.com/contest are under the same hostname, while http://example.org/test and http://example.com/abc are not.
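Because every URL in this problem is guaranteed to start with http:// and to carry no port, the hostname can be recovered with a simple string split. A minimal sketch (getHostname is an illustrative helper, not part of the problem's API):

// Illustrative helper: "http://news.yahoo.com/news".split("/") yields
// ["http:", "", "news.yahoo.com", "news"], so index 2 is the hostname.
private static String getHostname(String url) {
    return url.split("/")[2];
}

java.net.URI's getHost() would also work, but for this guaranteed format the split avoids dealing with the checked URISyntaxException.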

The HtmlParser interface is defined as follows:

interface HtmlParser {
    // Return a list of all urls from a webpage of given url.
    // This is a blocking call: it performs an HTTP request and returns
    // once that request has finished.
    public List<String> getUrls(String url);
}

Note that getUrls(String url) simulates performing an HTTP request. Treat it as a blocking call that waits for the request to finish. It is guaranteed to return its URLs within 15 ms. A single-threaded solution will exceed the time limit, so your crawler must be multi-threaded.
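Because the judge supplies HtmlParser, testing locally needs a stand-in. A hypothetical in-memory implementation might look like the following, where the sleep mimics the up-to-15 ms blocking request and the link graph is supplied as a map:

import java.util.*;

// Hypothetical stand-in for local testing only; the real HtmlParser is
// provided by the judge and performs actual blocking HTTP requests.
class MockHtmlParser implements HtmlParser {
    private final Map<String, List<String>> graph;

    MockHtmlParser(Map<String, List<String>> graph) {
        this.graph = graph;
    }

    public List<String> getUrls(String url) {
        try {
            Thread.sleep(15); // simulate the blocking HTTP request
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return graph.getOrDefault(url, Collections.emptyList());
    }
}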

Example:

Input:
urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com",
  "http://news.yahoo.com/us"
]
edges = [[2,0],[2,1],[3,2],[3,1],[0,4]]
startUrl = "http://news.yahoo.com/news/topics/"
Output: ["http://news.yahoo.com","http://news.yahoo.com/news","http://news.yahoo.com/news/topics/","http://news.yahoo.com/us"]

Here an edge [u, v] means that the page urls[u] links to the page urls[v]. http://news.google.com is excluded from the output because its hostname is not news.yahoo.com.

Solution:

Approach 1: BFS with Multi-threading

In this approach, we perform a BFS traversal with multi-threading: at each level, we spawn one thread per URL so that the blocking getUrls calls of that level run in parallel. To avoid visiting the same link twice, we maintain a thread-safe set of visited links, claimed with the atomic pattern sketched below.
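A plain HashSet with a check-then-act sequence (if (!visited.contains(url)) { visited.add(url); ... }) is racy: two threads can pass the contains check at the same time and crawl the same page twice. A minimal sketch of the atomic alternative, assuming a ConcurrentHashMap-backed set (the class and method names are illustrative):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class VisitedSet {
    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    // add() on a ConcurrentHashMap-backed set is atomic: exactly one
    // thread receives true for a given URL, so each page is claimed once.
    boolean claim(String url) {
        return visited.add(url);
    }
}

The solutions below inline this pattern as visited.add(url) directly in the hostname check.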

Algorithm:

1. Extract the hostname of startUrl (the segment between "http://" and the next "/").
2. Create a thread-safe visited set and a thread-safe queue, and add startUrl to both. Because only same-hostname links are ever added, the visited set doubles as the result set.
3. While the queue is not empty:
   - Dequeue every URL currently in the queue; these form one BFS level.
   - For each dequeued URL, start a thread that calls htmlParser.getUrls(url) and, for every returned sub-link that shares the hostname and has not been visited, marks it visited and enqueues it.
   - Join all of the level's threads before moving on, so each level is fully explored before the next begins.
4. Return the contents of the visited set.

import java.util.*;
import java.util.concurrent.*;

class Solution {
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        String hostname = startUrl.split("/")[2];

        // Both collections are touched by many threads, so both must be thread-safe.
        Set<String> visited = ConcurrentHashMap.newKeySet();
        Queue<String> queue = new ConcurrentLinkedQueue<>();

        queue.offer(startUrl);
        visited.add(startUrl);

        while (!queue.isEmpty()) {
            int size = queue.size();
            List<Thread> threads = new ArrayList<>();

            // One thread per URL on the current level, so the blocking
            // getUrls() calls of a level run in parallel.
            for (int i = 0; i < size; i++) {
                String currentUrl = queue.poll();
                Thread t = new Thread(() -> {
                    for (String url : htmlParser.getUrls(currentUrl)) {
                        // add() is atomic and returns false if the URL was
                        // already claimed, so each link is enqueued at most once.
                        if (url.split("/")[2].equals(hostname) && visited.add(url)) {
                            queue.offer(url);
                        }
                    }
                });
                t.start();
                threads.add(t);
            }

            // Finish the whole level before starting the next one.
            for (Thread t : threads) {
                try {
                    t.join();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }

        // Only same-hostname URLs are ever added, so visited is the answer.
        return new ArrayList<>(visited);
    }
}

Time Complexity: O(n), where n is the total number of links.

Each URL is fetched exactly once, and every link returned by getUrls is checked against the visited set in O(1) expected time, so the total work is linear in the number of URLs and links. Multi-threading does not reduce the total work, but it lets the blocking getUrls calls within a level run in parallel, which is what keeps the wall-clock time within the limit.

Space Complexity: O(n)

The visited set holds up to n URLs, the queue holds at most one frontier of URLs at a time, and a level can spawn up to O(n) threads in the worst case, so the overall space is O(n).
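Spawning a raw Thread per URL is acceptable at this problem's scale, but a fixed thread pool caps thread creation and reuses workers. The following is a sketch of the same level-order crawl using ExecutorService and Future; the class name and the pool size of 8 are arbitrary illustrative choices:

import java.util.*;
import java.util.concurrent.*;

class PooledCrawler {
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        String hostname = startUrl.split("/")[2];
        Set<String> visited = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(8); // arbitrary size
        List<String> level = new ArrayList<>(List.of(startUrl));
        visited.add(startUrl);

        while (!level.isEmpty()) {
            // Submit one blocking fetch per URL on the current level.
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String url : level) {
                futures.add(pool.submit(() -> htmlParser.getUrls(url)));
            }

            // Collect sub-links; the atomic add() keeps each URL enqueued once.
            List<String> next = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                try {
                    for (String url : f.get()) {
                        if (url.split("/")[2].equals(hostname) && visited.add(url)) {
                            next.add(url);
                        }
                    }
                } catch (InterruptedException | ExecutionException e) {
                    Thread.currentThread().interrupt();
                }
            }
            level = next;
        }

        pool.shutdown();
        return new ArrayList<>(visited);
    }
}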

Approach 2: DFS with Multi-threading

In this approach, we perform a DFS traversal with multi-threading: every time we discover a new link, we spawn a thread that crawls that link and, recursively, its sub-links. As before, a thread-safe visited set prevents crawling the same link twice.

Algorithm:

1. Extract the hostname of startUrl and create a thread-safe visited set containing startUrl; as in Approach 1, the visited set doubles as the result set.
2. Define dfs(url):
   - Call htmlParser.getUrls(url) to fetch the page's sub-links.
   - For every sub-link that shares the hostname and has not been visited, mark it visited and start a new thread that runs dfs on it.
   - Join all of the spawned threads before returning.
3. Call dfs(startUrl) and, once it returns, return the contents of the visited set.

import java.util.*;
import java.util.concurrent.*;

class Solution {
    private String hostname;
    private HtmlParser htmlParser;
    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        this.hostname = startUrl.split("/")[2];
        this.htmlParser = htmlParser;
        visited.add(startUrl);
        dfs(startUrl);
        return new ArrayList<>(visited);
    }

    // Crawls one page, spawning a thread per newly claimed sub-link.
    private void dfs(String url) {
        List<Thread> threads = new ArrayList<>();
        for (String next : htmlParser.getUrls(url)) {
            // add() is atomic: exactly one thread claims each URL, so no
            // page is crawled twice even under heavy contention.
            if (next.split("/")[2].equals(hostname) && visited.add(next)) {
                Thread t = new Thread(() -> dfs(next));
                t.start();
                threads.add(t);
            }
        }
        // Wait for the entire subtree rooted at this page to finish.
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}

Time Complexity: O(n), where n is the total number of links.

Each URL is claimed exactly once through the atomic visited.add call, so each page is fetched once and each link is examined once, giving linear total work. The depth of the recursive thread tree is bounded by the longest chain of pages, and the parallel fetches keep the wall-clock time within the limit.

Space Complexity: O(n)

The visited set uses O(n) space, and in the worst case O(n) threads are alive at once (a parent thread stays alive until all of its descendants finish), each with a constant-size frame, so the overall space is O(n).
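The recursive DFS can also be expressed with CompletableFuture so that pool threads never block waiting on children; only the caller blocks, in the final join. A sketch on a dedicated pool (the class name and pool size are illustrative, and the pool is kept separate from the common ForkJoinPool because getUrls blocks):

import java.util.*;
import java.util.concurrent.*;

class FutureCrawler {
    private final ExecutorService pool = Executors.newFixedThreadPool(8); // arbitrary
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private String hostname;
    private HtmlParser htmlParser;

    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        this.hostname = startUrl.split("/")[2];
        this.htmlParser = htmlParser;
        visited.add(startUrl);
        dfs(startUrl).join(); // only the caller blocks
        pool.shutdown();
        return new ArrayList<>(visited);
    }

    private CompletableFuture<Void> dfs(String url) {
        return CompletableFuture
            .supplyAsync(() -> htmlParser.getUrls(url), pool) // blocking fetch
            .thenCompose(links -> {
                // Recurse into newly claimed same-host links without blocking.
                List<CompletableFuture<Void>> children = new ArrayList<>();
                for (String next : links) {
                    if (next.split("/")[2].equals(hostname) && visited.add(next)) {
                        children.add(dfs(next));
                    }
                }
                return CompletableFuture.allOf(children.toArray(new CompletableFuture[0]));
            });
    }
}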
