Can 2 Files Have the Same MD5 Hash? (and why)

The MD5 algorithm is often used to verify large files, for example after a download. By comparing the MD5 hash of the downloaded file with the expected one, we can tell whether they are the same file. But is it possible for two different files to have the same MD5 hash? That’s what we’ll see in this article.

Two files can have the same MD5 hash even if they are different. Because the MD5 algorithm accepts an unlimited number of possible inputs but produces only a limited number of outputs, collisions are bound to exist, even if the probability of running into one by accident is very low.

Now that you have the short answer, let’s take a look at an example and at how to avoid this issue.

Can Two Files Have the Same MD5 Hash?

The MD5 algorithm can give the same output for two different inputs, so it’s possible to have the same MD5 hash for two entirely different files.

This has been demonstrated by several researchers over time. Let’s take a classic example from Peter Selinger’s website:

The first binary file just displays the classic “Hello World” message, and the second one pretends to erase a disk.
The files are different, and yet they have the same MD5 hash.

An MD5 hash is written as 32 hexadecimal characters. Each character has 16 possible values, so there are only 16^32 (that is, 2^128) different MD5 hashes. That’s about 3.4 × 10^38 different values, which is a gigantic number, but a finite one, so collisions necessarily exist with the MD5 algorithm.
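The arithmetic above can be checked with a short Python sketch. It only assumes the standard hashlib module; the input string and the variable names are arbitrary choices made for illustration.

```python
import hashlib

# An MD5 digest is 128 bits long, usually printed as 32 hexadecimal characters.
digest = hashlib.md5(b"Hello World").hexdigest()

HEX_DIGITS = 32                    # length of the printed hash
total_hashes = 16 ** HEX_DIGITS    # 16 possible values per character

print(len(digest))                 # number of hex characters in the digest
print(total_hashes == 2 ** 128)    # same count, expressed in bits
print(total_hashes)                # the full (finite) number of possible hashes
```

Since the number of possible files is unlimited while the number of hashes is fixed, at least two different files must share a hash (the pigeonhole principle).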

In March 2005, Xiaoyun Wang and Hongbo Yu of Shandong University in China published an article in which they describe an algorithm that can find two different sequences of 128 bytes with the same MD5 hash.

Peter Selinger

This problem was first demonstrated with two different strings giving the same MD5 output, and it can also happen with two different files.

What Does it Mean if the MD5 Hashes are the Same?

As a general rule, two files with the same MD5 hash can be considered identical. However, there are rare cases where two different files share the same MD5 hash.
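As a rough illustration of this kind of check, here is a minimal Python sketch that hashes two files and compares the results. The function names md5_of_file and same_md5 are hypothetical helpers introduced for this example, not part of any tool mentioned in this article.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks
    so that large files do not need to fit in memory."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def same_md5(path_a, path_b):
    """True if both files have the same MD5 hash.
    Matching hashes suggest, but do not prove, identical content."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```

A call such as same_md5("original.iso", "downloaded.iso") (both file names are placeholders) returns True when the hashes match, which is the same comparison we would do by eye on the md5sum output.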

The example we saw in the last section did not happen by accident. The two files given above were created specifically with this goal in mind: demonstrating the collision issue with the MD5 algorithm.

The above files were generated by exploiting two facts: the block structure of the MD5 function, and the fact that Wang and Yu’s technique works for an arbitrary initialization vector.

Peter Selinger

By exploiting the limitations of the MD5 algorithm, it is possible to craft two files that are almost the same (except for the middle part) and that give the same MD5 hash when we run the md5sum command. But the probability of facing this kind of issue with real files in real usage (like comparing a file before and after a download) is extremely low.
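To put a number on "extremely low", here is a back-of-the-envelope sketch using the birthday bound. It assumes that MD5 behaves like a uniformly random 128-bit function on ordinary, non-crafted files, which is an idealization; the function name and the figure of one billion files are arbitrary choices for illustration.

```python
from fractions import Fraction

MD5_BITS = 128  # size of an MD5 hash in bits


def accidental_collision_probability(n_files, bits=MD5_BITS):
    """Birthday-bound estimate of the chance that at least two of
    n_files random inputs share a hash: n(n-1) / (2 * 2**bits).
    This is an upper bound, and a close approximation when it is small."""
    return Fraction(n_files * (n_files - 1), 2 * 2 ** bits)


# Hypothetical scenario: one billion unrelated files.
p = accidental_collision_probability(10 ** 9)
print(float(p))  # on the order of 1e-21
```

This estimate only covers accidental collisions. It says nothing about files deliberately crafted to collide, like the two binaries above.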

Don’t summarize this post by saying that the MD5 algorithm is completely broken and shouldn’t be used, even for file comparison. That’s not the goal of this article. You should just be careful with your conclusions if you are working on a project where you need a 100% guarantee that the files are identical. For most users, the MD5 comparison will be enough.

How Can I Tell if Two Files Are the Same?

It’s good practice to compare the two files with the diff command to make sure they are identical. MD5 hash and file size comparisons are great tools, but they don’t provide a definitive answer.

diff is a Linux command, available by default on most distributions, that compares two files.
If we had used it in the previous example instead of relying on the MD5 hash alone, we wouldn’t have reached the wrong conclusion.

The two files are definitely different.
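The same byte-for-byte check can be scripted. The sketch below uses filecmp from the Python standard library as a stand-in for diff; it is not the command used in this article, and files_identical is a hypothetical helper name.

```python
import filecmp


def files_identical(path_a, path_b):
    """Compare two files by content.
    shallow=False makes filecmp read and compare the file contents
    instead of relying only on os.stat() information (size, mtime)."""
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Unlike a hash comparison, this reads both files in full and cannot be fooled by a collision: two files crafted to share an MD5 hash would still be reported as different.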

In a nutshell

In short, here are the main points to remember after reading this article:

  • Two files can be different and still have the same MD5 hash.
  • This is extremely unlikely to happen to an average user.
    To check for file corruption after a download, the MD5 algorithm is still a sufficient check.
  • To be sure two files are identical, complement the hash with other tools, like the diff command on Linux.
