Find Duplicate File In System - Leetcode Solution

Difficulty: Medium

Topics: string, hash-table, array

Problem statement:

Given a list of directory info strings, each containing a directory path and all the files (with their contents) in that directory, find all the groups of duplicate files in the file system in terms of their paths.

A group of duplicate files consists of at least two files that have exactly the same content.

Each group of duplicate files should contain only files with the same content, and the paths should be relative to the root directory.

Output the result in any order.
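
For example, given the sample input from LeetCode:

Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
Output: [["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"], ["root/a/1.txt", "root/c/3.txt"]]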

Solution:

The approach to solve this problem can be broken down into the following steps.

  1. Create a dictionary where the keys are file contents and the values are lists of file paths that contain that content.

  2. Parse each directory info string: the first space-separated token is the directory path, and every remaining token is a file name followed by its content in parentheses. For each file, build its full path and append it to the dictionary entry for its content.

  3. Iterate through the values of the dictionary; any list that holds at least two paths is a group of duplicate files, so add it to the result list (a minimal sketch of this grouping step follows the list).

  4. Return the list of all the groups of duplicate files.
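
A minimal sketch of steps 1 and 3, assuming the (content, path) pairs have already been extracted; the helper name group_duplicates is illustrative, not part of the LeetCode template:

from collections import defaultdict
from typing import Iterable, List, Tuple

def group_duplicates(entries: Iterable[Tuple[str, str]]) -> List[List[str]]:
    # entries yields (content, full_path) pairs
    content_dict = defaultdict(list)
    for content, full_path in entries:
        content_dict[content].append(full_path)
    # only contents shared by two or more files form a duplicate group
    return [paths for paths in content_dict.values() if len(paths) > 1]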

Pseudo code:

create a dictionary content_dict
for each directory string in the input list:
    split the string into a directory path and file tokens
    for each file token:
        split the token into the file name and its content
        append "directory/file_name" to content_dict[content]
result_list = []
for values_list in content_dict.values():
    if len(values_list) > 1:
        result_list.append(values_list)
return result_list
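
To make the parsing step concrete, here is how a single directory string from the example above splits apart in Python:

directory_info = "root/a 1.txt(abcd) 2.txt(efgh)"
directory, *files = directory_info.split(' ')
# directory == "root/a", files == ["1.txt(abcd)", "2.txt(efgh)"]
file_name, content = files[0].split('(')
# file_name == "1.txt", content == "abcd)"; content[:-1] drops the ')'
print(directory, file_name, content[:-1])  # root/a 1.txt abcd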

Time Complexity:

Let m be the total number of files and C the total number of characters in the input. Parsing touches every character once (splitting the strings and hashing each content key), so that work is O(C); the final pass over the dictionary adds O(m). Since m <= C, the overall time complexity is O(C).

Space Complexity:

The space is dominated by content_dict, which stores every file path grouped by content: O(m) paths, or O(C) characters in the worst case including the content keys. result_list only holds references to lists already stored in the dictionary, so the overall space complexity is O(C).

Code:

from collections import defaultdict
from typing import List


class Solution:
    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        # map each file's content to the list of full paths that share it
        content_dict = defaultdict(list)
        for path in paths:
            directory, *files = path.split(' ')
            for file in files:
                # each token looks like "name.txt(content)"
                file_name, content = file.split('(')
                # key on the content itself (minus the trailing ')');
                # keying on hash(content) could merge distinct files on a collision
                content_dict[content[:-1]].append(f"{directory}/{file_name}")
        # keep only the contents shared by at least two files
        return [group for group in content_dict.values() if len(group) > 1]

The above solution is written in Python and passes all of LeetCode's test cases.
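
A quick usage check against the example above (group order may differ, which LeetCode accepts):

if __name__ == "__main__":
    paths = ["root/a 1.txt(abcd) 2.txt(efgh)",
             "root/c 3.txt(abcd)",
             "root/c/d 4.txt(efgh)",
             "root 4.txt(efgh)"]
    print(Solution().findDuplicate(paths))
    # [['root/a/1.txt', 'root/c/3.txt'],
    #  ['root/a/2.txt', 'root/c/d/4.txt', 'root/4.txt']]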
