bioutils.vmc_digest module

bioutils.vmc_digest.vmc_digest(data, digest_size=24)[source]

Returns the VMC Digest as a Digest object, which has both bytes and string (URL-safe, Base 64) representations.

>>> d = vmc_digest("")
>>> str(d)
'z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXc'
>>> len(d), len(str(d))
(24, 32)
>>> str(vmc_digest("", 24))
'z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXc'
>>> vmc_digest("", 17)
Traceback (most recent call last):
...
ValueError: digest_size must be a multiple of 3
>>> vmc_digest("", 66)
Traceback (most recent call last):
...
ValueError: digest_size must be between 0 and 63 (bytes)

SHA-512 is 2x faster than SHA1 on modern 64-bit platforms. However, few appliations require 512 bits (64 bytes) of keyspace. That larger size translates into proportionally larger key size requirements, with attendant performance implications. By truncating the SHA-512 digest [1], users may obtain a tunable level of collision avoidance.

The string returned by this function is Base 64 encoded with URL-safe characters [2], making it suitable for use with URLs or filesystem paths. Base 64 encoding results in an output string that is 4/3 the size of the input. If the length of the input string is not divisible by 3, the output is right-padded with equal signs (=), which have no information content. Therefore, this function requires that digest_size is evenly divisible by 3. (The resulting vmc_digest will be 4/3*digest_size bytes.)

According to [3], the probability of a collision using b bits with m messages (sequences) is:

P(b, m) = m^2 / 2^(b+1).

Note that the collision probability depends on the number of messages, but not their size. Solving for the number of messages:

m(b, P) = sqrt(P * 2^(b+1))

Solving for the number of bits:

b(m, P) = log2(m^2/P) - 1

For various values of m and P, the number of bytes required according to b(m,P) rounded to next multiple of 3 is:

#m

P<1e-24

P<1e-21

P<1e-18

P<1e-15

P<1e-12

P<1e-09

1e+06

15

12

12

9

9

9

1e+09

15

15

12

12

9

9

1e+12

15

15

15

12

12

9

1e+15

18

15

15

15

12

12

1e+18

18

18

15

15

15

12

1e+21

21

18

18

15

15

15

1e+24

21

21

18

18

15

15

For example, given 1e+18 expected messages and a desired collision probability < 1e-15, we use digest_size = 15 (bytes).

References