How does file hashing work?

Asked by Amitraj in Cyber Security, asked on Sep 26, 2022

So when you run sha256sum filename or md5sum filename, does it generate the hash based on the file size or on the whole contents of the file?

I'm wondering how it's different from a password hash. Given a string, the program uses its algorithm to create a hash; does it decrypt the hash in a similar way?

Answered by Anisha Dalal

File hashing uses a one-way digest function. It takes an arbitrary number of input bytes (for a file, its entire contents) and computes a fixed-length value from them. If you compute the hash of the same data again, you get the same result. The file size does not need to be considered separately, as the data is inherently changed if you change the length.
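As a rough illustration of the point above, here is a minimal Python sketch of what a tool like sha256sum does conceptually. The helper names sha256_of_file and write_temp, the chunk size, and the sample data are assumptions made for this example and are not taken from any particular tool. It hashes two files of the same size but different contents and shows that the digests differ.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file's full contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Feed the file to the hash in chunks so large files need little memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_temp(data):
    """Write bytes to a new temporary file and return its path (hypothetical helper)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


# Two files of identical size (11 bytes) but different contents.
path_a = write_temp(b"hello world")
path_b = write_temp(b"hello worle")

hash_a = sha256_of_file(path_a)
hash_b = sha256_of_file(path_b)
same_size = os.path.getsize(path_a) == os.path.getsize(path_b)
print(same_size, hash_a == hash_b)  # prints: True False

os.remove(path_a)
os.remove(path_b)
```

Reading in chunks gives the same digest as hashing the whole file at once; it only keeps memory use low for large files.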

Hashes cannot be decrypted. They are lossy compression functions by nature and in most cases you cannot recover the full original data from them (this may not be the case for non-cryptographic indexing hash functions and small input values).
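To make the "lossy" part concrete, here is a small Python sketch. The function tiny_hash is a toy invented for this example (the first byte of SHA-256); it is not a real hash anyone should use. Because it has only 256 possible outputs, any 257 distinct inputs must contain at least two that collide, so the output cannot identify the input uniquely. The same argument applies to full-size hashes, just with a far larger output space.

```python
import hashlib


def tiny_hash(data):
    """Toy 8-bit hash: the first byte of SHA-256 (illustration only)."""
    return hashlib.sha256(data).digest()[0]


# The digest length does not depend on the input length.
short_len = len(hashlib.sha256(b"a").digest())
long_len = len(hashlib.sha256(b"a" * 1_000_000).digest())

# With only 256 possible outputs, 257 distinct inputs must collide somewhere.
seen = {}
collision = None
for i in range(257):
    data = str(i).encode()
    h = tiny_hash(data)
    if h in seen:
        collision = (seen[h], data)
        break
    seen[h] = data

print(short_len, long_len, collision is not None)  # prints: 32 32 True
```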

There are three main types of hash function: those used for indexing, integrity, and cryptography.

An indexing hash (e.g. MurmurHash or Java's hashCode) can be used to divide equal or very similar values into groups for improved performance in collection operations. For example, by precomputing a hash of each object and keeping the internal array for a collection sorted by the objects' hashes, a binary search can perform lookups in worst-case O(log n) time rather than O(n). The general requirements for indexing hashes are as follows:

Extremely high performance

Generally specialised to certain input data types (e.g. ASCII strings or pointers)

In some cases, designed to produce output values which represent the magnitude of the input value. For example, in string searching, you might choose to convert the first four characters to their ASCII integer values and interpret those together as a 32-bit integer, since sorting that integer also pre-sorts the list alphabetically.
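The string-searching idea above can be sketched in Python roughly as follows. The names prefix_key and contains, the zero padding for short strings, and the sample words are assumptions made for this illustration. The key is packed big-endian so that integer order matches alphabetical order for strings that differ within their first four characters, and lookups use a binary search over the sorted keys.

```python
import bisect


def prefix_key(text):
    """Pack the first four ASCII characters into one big-endian 32-bit integer."""
    raw = text.encode("ascii")[:4].ljust(4, b"\x00")  # pad short strings with zeros
    return int.from_bytes(raw, "big")


words = ["pear", "apple", "fig", "banana", "cherry"]

# Sort once by the integer key; ties on the key fall back to the full string.
entries = sorted((prefix_key(w), w) for w in words)
keys = [k for k, _ in entries]


def contains(word):
    """Binary-search the key, then compare the candidates that share it."""
    k = prefix_key(word)
    i = bisect.bisect_left(keys, k)
    while i < len(keys) and keys[i] == k:
        if entries[i][1] == word:
            return True
        i += 1
    return False


print([w for _, w in entries])  # ['apple', 'banana', 'cherry', 'fig', 'pear']
print(contains("fig"), contains("grape"))  # True False
```

Strings that share their first four characters get the same key, which is why the lookup still compares the full string after the binary search.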

An integrity hash (e.g. CRC32), sometimes called a "redundancy hash", has the properties necessary to provide a fairly good indication of accidental corruption of a file. These hashes will generally produce a different hash when a bit is changed, but will not withstand someone purposefully trying to generate a collision. For example, the strings "Maria has nine red beds." and "Steven has fifteen white tables." are different but generate the same CRC32 hash of 0ECB65F5. The general requirements for integrity hashes are as follows:

Has a high statistical likelihood of producing a different value when an input bit is changed

Works with any type of input data (universal)

Very high performance

These two hash function types generally make little effort to prevent someone from learning about the input data or finding collisions. This is why they are considered non-cryptographic.
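As a quick way to see an integrity hash at work, the following Python sketch uses zlib.crc32 from the standard library. The sample message and the choice of which bit to flip are assumptions for illustration. It shows that a single flipped bit changes the checksum, and it prints the CRC32 of the two sentences quoted earlier so the collision claim can be checked by running it; the sketch deliberately does not assert a result for that pair.

```python
import zlib

original = b"The quick brown fox jumps over the lazy dog"

# Simulate accidental corruption by flipping the lowest bit of the first byte.
corrupted = bytes([original[0] ^ 0x01]) + original[1:]

crc_original = zlib.crc32(original)
crc_corrupted = zlib.crc32(corrupted)

print(f"{crc_original:08X} {crc_corrupted:08X}")
print(crc_original != crc_corrupted)  # a single-bit error is always detected

# The pair of sentences quoted in the answer; compare the two printed values.
crc_maria = zlib.crc32(b"Maria has nine red beds.")
crc_steven = zlib.crc32(b"Steven has fifteen white tables.")
print(f"{crc_maria:08X} {crc_steven:08X}")
```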

A cryptographic hash, such as MD5, SHA1, SHA256, or Keccak, has many more requirements. These hashes are designed to be highly resistant against attempts to discover any information about the original input data, or collisions in the hash function. The general requirements for a cryptographic hash function are as follows:

It should be very difficult to determine any information about the input value from the output value, even if an attacker can select parts of the input.

Each output bit changes with a probability of roughly 50% when any one bit in the input changes.

Resistance to first-order preimage attacks, i.e. if someone gives you a hash value y, it is very difficult to find some value x such that h(x) = y. Or, put more simply, it is computationally infeasible to find the input value that produced a particular output value.

Resistance to second order preimage attacks, i.e. for any given input a it is very difficult to find another value b such that h(a) = h(b). Or, put more simply, it is hard to find a collision for a given plaintext.

General collision resistance, i.e. it is hard to find any two values a and b such that h(a) = h(b) even if the attacker selects both a and b.

Works with any type of input data (universal)

High performance

As you noted, it turns out that these functions are quite useful for obscuring passwords and turning passwords into cryptographic keys, although their main purpose is not for password storage or key derivation at all. In fact, it turns out that using a plain hash function for either of these purposes can be quite insecure. You can read about the arms race between attackers and defenders of passwords here, and more about good practice here.
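Going back to the requirement in the list above that each output bit changes with a probability of roughly 50%, here is a small Python sketch that measures it for SHA-256. The helper bit_difference and the sample message are assumptions for this example. It flips one input bit and counts how many of the 256 output bits differ; the count should usually land somewhere near 128, though the exact number depends on the input.

```python
import hashlib


def bit_difference(a, b):
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


message = bytearray(b"file hashing example")
flipped = bytearray(message)
flipped[0] ^= 0x01  # change exactly one input bit

digest_a = hashlib.sha256(message).digest()
digest_b = hashlib.sha256(flipped).digest()

changed = bit_difference(digest_a, digest_b)
print(f"{changed} of {len(digest_a) * 8} output bits changed")
```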

When a hash function or password storage function is used, the verification is not done by decrypting anything. Instead, the password given by a user is hashed in the same way and the resulting hash is compared against the one in the database. If the password is correct then the hash will be the same, otherwise the hashes will not be equal and the server can reject the login attempt. Doing this with a basic hash like SHA256 is generally flawed, though, because such hashes are fast to compute and therefore cheap to brute-force, so I suggest reading the two links I posted above to better understand why and what good practice options you have.
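The hash-and-compare flow described above might look roughly like this in Python. This is only a sketch under stated assumptions: the function names hash_password and verify_password, the 16-byte salt, and the iteration count are illustrative choices, and PBKDF2 is used here simply because it ships with the standard library, not because the answer prescribes it. Note that nothing is ever decrypted; the attempt is re-hashed with the stored salt and the results are compared.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative work factor (an assumption); tune for real use


def hash_password(password, salt=None):
    """Return (salt, derived_key) for storage; uses a fresh random salt by default."""
    if salt is None:
        salt = os.urandom(16)
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, key


def verify_password(password, salt, stored_key):
    """Re-derive the key from the login attempt and compare in constant time."""
    _, candidate = hash_password(password, salt)
    return hmac.compare_digest(candidate, stored_key)


salt, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored))  # True
print(verify_password("wrong guess", salt, stored))  # False
```

The salt makes identical passwords produce different stored values, and the iteration count makes each guess deliberately slow, which is what a single pass of SHA256 lacks.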
