The MD5 algorithm is often used to check large files (after a download, for example). By comparing the MD5 hashes of two files, we can tell whether they are the same file or not. But is it possible to have the same MD5 hash for two different files? That’s what we’ll see in this article.
Two files can have the same MD5 hash even if they are different. Because the MD5 algorithm accepts an unlimited number of possible inputs but can only produce a limited number of outputs, collisions are unavoidable, even if the probability of an accidental collision is very low.
So, you have the short answer now. Let’s take a look at an example and how to avoid this issue.
Can Two Files Have the Same MD5 Hash?
The MD5 algorithm can give the same output for two different inputs, so it’s possible to have the same MD5 hash for two entirely different files.
This has been demonstrated by various researchers over time. But let’s take a classic example from Peter Selinger’s website:
The first binary file just displays the classic “Hello World” message, and the second one fakes a disk erase.
The files are different, and yet they have the same MD5 hash.
An MD5 hash is written as 32 hexadecimal characters. Each character has 16 possible values, so there are only 16^32 (that is, 2^128) different MD5 hashes. That’s about 3.4e+38 different values, which is a gigantic number, but not unlimited. Since the number of possible files is unlimited, there are necessarily collisions with the MD5 algorithm.
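The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch using the standard `hashlib` module; the variable names are illustrative and the input string is arbitrary.

```python
import hashlib

# An MD5 digest is 128 bits, printed as 32 hexadecimal characters.
digest = hashlib.md5(b"Hello World").hexdigest()
hex_length = len(digest)

# 16 possible values per character, 32 characters in total.
total_hashes = 16 ** hex_length

print(hex_length)                  # 32
print(total_hashes == 2 ** 128)    # True
print(f"{total_hashes:.2e}")       # 3.40e+38

# Pigeonhole argument: there are already more distinct 17-byte inputs
# (256^17 = 2^136) than possible MD5 hashes, so some inputs must collide.
print(256 ** 17 > total_hashes)    # True
```

The last line shows why collisions are unavoidable in principle: the set of possible inputs is larger than the set of possible hashes long before files reach any realistic size.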
“In March 2005, Xiaoyun Wang and Hongbo Yu of Shandong University in China published an article in which they describe an algorithm that can find two different sequences of 128 bytes with the same MD5 hash.” (Peter Selinger)
This problem was first demonstrated with two different strings giving the same MD5 output, and it can also happen with two different files.
What Does it Mean if the MD5 Hashes are the Same?
As a general rule, two files that have the same MD5 hash can be considered identical. However, there are some rare cases where two different files share the same MD5 hash.
The example we have seen in the last section did not happen by accident. The two files given above were created specifically with this goal in mind: to demonstrate the collision issue with the MD5 algorithm.
By knowing the limitations of the MD5 algorithm, we can create two files that are almost the same (except for the middle part) and give the same MD5 hash when we run the md5sum command. But the probability of facing this kind of issue with files in real life, with real usage (like comparing a file before and after a download), is extremely low.
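For the everyday case of checking a download, the comparison can be sketched in Python as follows. The helper names `md5_of_file` and `matches_published_md5` are hypothetical, chosen for this illustration; only the standard `hashlib` module is assumed.

```python
import hashlib

def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks
    so that large files do not have to fit in memory."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # f.read() returns b"" at end of file, which stops the loop.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

def matches_published_md5(path, expected_hex):
    """Compare a file's MD5 with a published checksum.
    Surrounding whitespace and letter case are ignored."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

A mismatch means the file is certainly different from the original. A match means the files are almost certainly the same, unless someone crafted a collision on purpose.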
Don’t summarize this post by saying that the MD5 algorithm is completely broken and shouldn’t be used, even for file comparison. That’s not the goal of this article. You should just be careful with your conclusions if you are working on a project where you need a 100% guarantee that the files are identical. But for most users, the MD5 comparison will be enough.
How Can I Tell if Two Files Are the Same?
It’s good practice to compare two files with the diff command to ensure they are identical. MD5 hash and file size comparisons are great tools, but they don’t provide a definitive answer.
diff is a Linux command, available by default on most distributions, that compares two files.
If we had used it in the previous example instead of relying on the MD5 hash only, we wouldn’t have reached the wrong conclusion.
The two files are definitely different.
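The same content-based check can be done outside the shell. Here is a minimal Python sketch that uses the standard `filecmp` module instead of diff; the function name `files_are_identical` is hypothetical.

```python
import filecmp

def files_are_identical(path_a, path_b):
    """Return True only if both files have exactly the same content."""
    # shallow=False forces a comparison of the file contents instead of
    # trusting metadata such as size and modification time.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Because this compares the actual bytes rather than a hash, two colliding files like the ones from the example would be reported as different.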
In a Nutshell
In short, here are the main points to remember after reading this article:
- Two files can be different and have the same MD5 hash.
- This is extremely unlikely to happen to an average user. To check for file corruption after a download, the MD5 algorithm is still sufficient.
- To be sure two files are identical, complement the hash with other tools, like the diff command on Linux.