Wilbert's website at SocSci


hash.html 2013-11-20

Proof of data integrity: Cryptographic Hash

Imagine publishing an article that depends heavily on the contents of a data file that you measured yourself. After publication you may be asked to reproduce the conclusions of your article. Of course you stored the data files in a safe location. Reproducing your data analysis can easily be done with the original R scripts or from the description of your graphs and tables, but how do you prove that this was indeed the data file that you used? How do you show that this precise file already existed when you wrote the article?

In the case of a review article you can simply point to the articles containing the data in your bibliography. Ideally you would want to put a reference to your data files in the bibliography of your article. This is possible if you have a unique reference to your data files. One way to accomplish this is to store your data files in a notary's vault and to refer to the entry number in the notary's register. This system is often used for data and notebooks that support Anglo-Saxon patents.

An easier method is available nowadays. It is called a cryptographic hash, or simply a hash. A hash is the fingerprint of a data file: a short, fixed-length value computed from the file's contents. If you change the data file by as much as one character, the hash will change in an unpredictable way. Deliberately making a new data file with the same hash as the old data file is not feasible in practice. Doing so by accident is extremely unlikely.
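The fingerprint behaviour described above can be sketched in a few lines of Python, assuming Python 3 and its standard hashlib module. The helper name sha256_hex and the two small data strings are made up for illustration; they are not taken from any real data file.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 hash of text (encoded as UTF-8) as 64 hex characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Two hypothetical data files that differ in a single character (37 vs 38).
original = "subject,score\n1,42\n2,37\n"
altered = "subject,score\n1,42\n2,38\n"

h1 = sha256_hex(original)
h2 = sha256_hex(altered)
print(h1)
print(h2)

# Count how many of the 256 bits differ between the two hashes.
diff_bits = bin(int(h1, 16) ^ int(h2, 16)).count("1")
print(diff_bits, "of 256 bits differ")
```

The two printed hashes should look unrelated even though the inputs differ in only one character, which is what makes the hash usable as a fingerprint.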

One particularly good and popular hash function is SHA-256. It is widely used for storing passwords, authenticating e-mail and software, and archiving. It has a legal status in both the US and the EU.

Users of Unix-like operating systems (Apple, Linux, IBM) usually have the sha256sum command installed by default. If you do not, install the GNU Core Utilities. If you use Microsoft Windows, install the md5deep utilities. Alternatively, you can use one of the many online calculators.
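If none of these tools is at hand but Python 3 is installed, the hash of a data file can also be computed with a short script. The following is a minimal sketch, not a replacement for the utilities above: the function name sha256_of_file and the chunk size are assumptions chosen for illustration. The file is read in chunks so that large data files do not have to fit in memory.

```python
import hashlib
import sys
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hash of the file at path as 64 hex characters."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Print one "hash  filename" line per file named on the command line.
    for name in sys.argv[1:]:
        print(f"{sha256_of_file(name)}  {name}")
```

Saved under a name of your choice, for example hashfile.py (a hypothetical name), it could be run as `python hashfile.py mydata.csv`. The resulting 64-character hash is the value you would cite alongside your article.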

Try it yourself

Calculate the SHA-256 hash of the text in the textbox below. Try changing one letter and calculating again. The hash will change in an unpredictable way.