less_retarded_wiki

main page, file list, single page HTML, source, report, wiki last updated on 06/07/23

Hash

Hash is a number that's computed from some data in a chaotic way and which is used for many different purposes, e.g. for quick comparisons (instead of comparing big data structures we just compare their hashes) or mapping data structures to table indices.

Hash is computed by a hash function, a function that takes some data and turns it into a number (the hash) that's in terms of bit width much smaller than the data itself, has a fixed size (number of bits) and which has additional properties such as being completely different from hash values computed from very similar (but slightly different) data. Thanks to these properties hashes have a very wide use in computer science -- they are often used to quickly compare whether two pieces of non-small data, such as documents, are the same, they are used in indexing structures such as hash tables which allow for quick search of data, and they find a great use in cryptocurrencies and security, e.g. for digital signatures or storing passwords (for security reasons in databases of users we store just hashes of their passwords, never the passwords themselves). Hashing is extremely important and as a programmer you won't be able to avoid encountering hashes somewhere in the wild.

{ Talking about wilderness, hyenas have their specific smells that are determined by bacteria in them and are unique to each individual depending on the exact mix of the bacteria. They use these smells to quickly identify each other. The smell is kind of like the animal's hash. But of course the analogy isn't perfect, for example similar mixes of bacteria may produce similar smells, which is not how hashes should behave. ~drummyfish }

It is good to know that we distinguish between "normal" hashes used for things such as indexing data and cryptographic hashes that are used in computer security and have to satisfy some stricter mathematical criteria. For the sake of simplicity we will sometimes ignore this distinction here. Just know it exists.

It is generally given that a hash (or hash function) should satisfy the following criteria:

Hashes are similar to checksums but are different: checksums are simpler because their only purpose is for checking data integrity, they don't have to have a chaotic behavior, uniform mapping and they are often easy to reverse. Hashes are also different from database IDs: IDs are just sequentially assigned numbers that aren't derived from the data itself, they don't satisfy the hash properties and they have to be absolutely unique. The term pseudohash may also be encountered, it seems to be used for values similar to true hashes which however don't quite satisfy the definition.

{ I wasn't able to find an exact definition of pseudohash, but I've used the term myself e.g. when I needed a function to make a string into a corresponding fixed length string ID: I took the first N characters of the string and appended M characters representing some characteristic of the original string such as its length or checksum -- this is what I called the string's pseudohash. ~drummyfish }

Some common uses of hashes are:

Example

Let's say we want a hash function for string which for any ASCII string will output a 32 bit hash. How to do this? We need to make sure that every character of the string will affect the resulting hash.

First thought that may come to mind could be for example to multiply the ASCII values of all the characters in the string. However there are at least two mistakes in this: firstly short strings will result in small values as we'll get a product of fewer numbers (so similar strings such as "A" and "B" will give similar hashes, which we don't want). Secondly reordering the characters in a string (i.e. its permutations) will not change the hash at all (as with multiplication order is insignificant)! These violate the properties we want in a hash function. If we used this function to implement a hash table and then tried to store strings such as "abc", "bca" and "cab", all would map to the same hash and cause collisions that would negate the benefits of a hash table.

A better hash function for strings is shown in the section below.

Nice Hashes

{ Reminder: I make sure everything on this Wiki is pretty copy-paste safe, from the code I find on the Internet I only copy extremely short (probably uncopyrightable) snippets of public domain (or at least free) code and additionally also reformat and change them a bit, so don't be afraid of the snippets. ~drummyfish }

Here is a simple and pretty nice 8bit hash, it outputs all possible values and all its bits look quite random: { Made by me. ~drummyfish }

uint8_t hash(uint8_t n)
{
  n *= 23;
  n = ((n >> 4) | (n << 4)) * 11;
  n = ((n >> 1) | (n << 7)) * 9;

  return n;
}

The hash prospector project (unlicense) created a way for automatic generation of integer hash functions with nice statistical properties which work by XORing the input value with a bit-shift of itself, then multiplying it by a constant and repeating this a few times. The functions are of the format:

uint32_t hash(uint32_t n)
{
  n = A * (n ^ (n >> S1));
  n = B * (n ^ (n >> S2));
  return n ^ (n >> S3);
}

Where A, B, S1, S2 and S3 are constants specific to each function. Some nice constants found by the project are:

A B S1 S2 S3
303484085 985455785 15 15 15
88290731 342730379 16 15 16
2626628917 1561544373 16 15 17
3699747495 1717085643 16 15 15

The project also explores 16 bit hashes, here is a nice hash that doesn't even use multiplication!

uint16_t hash(uint16_t n)
{
  n = n + (n << 7); 
  n = n ^ (n >> 8);
  n = n + (n << 3); 
  n = n ^ (n >> 2);
  n = n + (n << 4);
  return n ^ (n >> 8);
}

Here is a nice string hash, works even for short strings, all bits look pretty random: { Made by me. ~drummyfish }

uint32_t strHash(const char *s)
{
  uint32_t r = 21;

  while (*s)
  {
    r = (r * 31) + *s;
    s++;
  }

  r = r * 4451;
  r = ((r << 19) | (r >> 13)) * 5059;

  return r;
}

TODO: more


All content available under CC0 1.0 (public domain). Send comments and corrections to drummyfish at disroot dot org.