Saturday, January 19, 2013

One-Way Functions for Unique Identifiers

A somewhat common but faulty pattern for programmers who need to hide information is to use unsalted hash functions. This does not work. Let me explain.

Background

While reading a site with Django design patterns, among some quite good advice, I came across this:

If you use PK [primary key] in urls you are giving away sensitive information, for example, the number of entries in your table. It also makes it trivial to guess other urls.

Use slugs in urls. This has the advantage of being both user and SEO friendly.

If slugs do not make sense, instead use a CRC algorithm.:

class Customer(models.Model):
    name = models.CharField(max_length = 100)

    def get_absolute_url(self):
        import zlib
        #Use permalink in real case
        return '/customer/%s/' % zlib.crc32(self.pk)

No, no, no, no.

If you don’t know what they are talking about in the beginning, don’t worry about it. The import point is that they do not want to show the self.pk value to the user. To hide it, they pass it through the zlib.crc32 function, which returns a number. This is actually wrong on multiple levels.

Unsalted Hash Function

The first problem I want to point out is a common mistake programmers make. I have seen this done especially with IP addresses, but also a number of other items.

The general problem is that there is some kind of unique item that needs to be passed to the user, but the user should not be able to identify which item that is. The solution seems obvious: One-way functions are functions that can be calculated quickly, but it’s not feasible to calculate the original value from the result.

This is a problem when the source domain is relatively small. In the case of the code above, even if you have millions of customers, a modern computer can calculate the hash values of the first million integers in under a second. So when they have the hash value of the user id, they can easily find out the original value by simply trying them all.

The same approach works for slightly larger search spaces like IPv4 addresses or even passwords. Here, an attacker will spend any amount of time pre-computing hashes for possible or likely values and store them in a database. When they get the hash value, they can just do a lookup and find the original. There even are such rainbow tables for common passwords available for download.

The solution here is to employ a salt for the function. That’s any kind of random data you add to your secret information before running the hash function. In the case above, you can simple add a random string:

salt = "".join("{:x}".format(random.randint(0, 15))
               for ignored in range(10))
hash("{}{}".format(self.pk, salt))

First, we construct the salt as a random string of ten hexadecimal digits. Then, we construct the hash value by appending the salt to the secret value. The ten hexadecimal values mean that for every possible value of the primary key, there are over a trillion possible hash values, making rainbow tables a lot less feasible. Storing a trillion SHA-256 values, for example, would require about 35 terabyte of space. And that’s for just a single primary key.

Your salt should be a fixed-width, to avoid for example (1, 11) and (11, 1) having the same hash value, which can reduce the domain and thereby the range of the hash function by a lot.

Take-away message: If you use hash functions for small values (integers, strings), you almost always want to use salts.

Error Detecting Codes

One more issue with the quote at the beginning of this post. The code example uses CRC32 as its hash function. That’s especially bad. CRC32 is designed to be an error-detecting code. The goal is to make sure that small changes in a message, e.g. from transmission errors, are detected, and that the code is very fast to execute so not to slow down transmission. This is done at the cost of being more prone to collissions for larger changes. CRC32 is not a cryptographically useful hash function.

If you need a one-way hash function, use SHA-2 or SHA-3.