Source code for bioutils.vmc_digest
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import hashlib
from .digest import Digest
ENC = "UTF-8"
DEFAULT_DIGEST_SIZE = 24
[docs]def vmc_digest(data, digest_size=DEFAULT_DIGEST_SIZE):
"""Returns the VMC Digest as a Digest object, which has both bytes and
string (URL-safe, Base 64) representations.
>>> d = vmc_digest("")
.. # I can't figure out how to make this test work on Py 2 and 3 :-(
.. >>> d # doctest: +SKIP
.. b'\xcf\x83\xe15~\xef\xb8\xbd\xf1T(P\xd6m\x80'
>>> str(d)
'z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXc'
>>> len(d), len(str(d))
(24, 32)
>>> str(vmc_digest("", 24))
'z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXc'
>>> vmc_digest("", 17)
Traceback (most recent call last):
...
ValueError: digest_size must be a multiple of 3
>>> vmc_digest("", 66)
Traceback (most recent call last):
...
ValueError: digest_size must be between 0 and 63 (bytes)
SHA-512 is 2x faster than SHA1 on modern 64-bit platforms.
However, few appliations require 512 bits (64 bytes) of keyspace.
That larger size translates into proportionally larger key size
requirements, with attendant performance implications. By
truncating the SHA-512 digest [1], users may obtain a tunable
level of collision avoidance.
The string returned by this function is Base 64 encoded with
URL-safe characters [2], making it suitable for use with URLs or
filesystem paths. Base 64 encoding results in an output string
that is 4/3 the size of the input. If the length of the input
string is not divisible by 3, the output is right-padded with
equal signs (=), which have no information content. Therefore,
this function requires that digest_size is evenly divisible by 3.
(The resulting vmc_digest will be 4/3*digest_size bytes.)
According to [3], the probability of a collision using b bits with
m messages (sequences) is:
``P(b, m) = m^2 / 2^(b+1)``.
Note that the collision probability depends on the number of
messages, but not their size. Solving for the number of messages:
``m(b, P) = sqrt(P * 2^(b+1))``
Solving for the number of *bits*:
``b(m, P) = log2(m^2/P) - 1``
For various values of ``m`` and ``P``, the number of *bytes* required
according to ``b(m,P)`` rounded to next multiple of 3 is:
+---------+-----------+-----------+-----------+-----------+-----------+-----------+
| #m |``P<1e-24``|``P<1e-21``|``P<1e-18``|``P<1e-15``|``P<1e-12``|``P<1e-09``|
+---------+-----------+-----------+-----------+-----------+-----------+-----------+
|``1e+06``| 15 | 12 | 12 | 9 | 9 | 9 |
+---------+-----------+-----------+-----------+-----------+-----------+-----------+
|``1e+09``| 15 | 15 | 12 | 12 | 9 | 9 |
+---------+-----------+-----------+-----------+-----------+-----------+-----------+
|``1e+12``| 15 | 15 | 15 | 12 | 12 | 9 |
+---------+-----------+-----------+-----------+-----------+-----------+-----------+
|``1e+15``| 18 | 15 | 15 | 15 | 12 | 12 |
+---------+-----------+-----------+-----------+-----------+-----------+-----------+
|``1e+18``| 18 | 18 | 15 | 15 | 15 | 12 |
+---------+-----------+-----------+-----------+-----------+-----------+-----------+
|``1e+21``| 21 | 18 | 18 | 15 | 15 | 15 |
+---------+-----------+-----------+-----------+-----------+-----------+-----------+
|``1e+24``| 21 | 21 | 18 | 18 | 15 | 15 |
+---------+-----------+-----------+-----------+-----------+-----------+-----------+
For example, given ``1e+18`` expected messages and a desired collision
probability ``< 1e-15``, we use ``digest_size = 15`` (bytes).
References:
- [1] http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf
- [2] https://tools.ietf.org/html/rfc3548#section-4
- [3] http://stackoverflow.com/a/4014407/342839
- [4] http://stackoverflow.com/a/22029380/342839
- [5] http://preshing.com/20110504/hash-collision-probabilities/
- [6] https://en.wikipedia.org/wiki/Birthday_problem
"""
# TODO: Consider relaxing %3 constraint and stripping padding
if digest_size % 3 != 0:
raise ValueError("digest_size must be a multiple of 3")
if not 0 <= digest_size <= 63:
raise ValueError("digest_size must be between 0 and 63 (bytes)")
sha512 = Digest(hashlib.sha512(data.encode(ENC)).digest())
return sha512[:digest_size]