bioutils.vmc_digest module¶
- bioutils.vmc_digest.vmc_digest(data, digest_size=24)[source]¶
Returns the VMC Digest as a Digest object, which has both bytes and string (URL-safe, Base 64) representations.
>>> d = vmc_digest("")
>>> str(d) 'z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXc'
>>> len(d), len(str(d)) (24, 32)
>>> str(vmc_digest("", 24)) 'z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXc'
>>> vmc_digest("", 17) Traceback (most recent call last): ... ValueError: digest_size must be a multiple of 3
>>> vmc_digest("", 66) Traceback (most recent call last): ... ValueError: digest_size must be between 0 and 63 (bytes)
SHA-512 is 2x faster than SHA1 on modern 64-bit platforms. However, few appliations require 512 bits (64 bytes) of keyspace. That larger size translates into proportionally larger key size requirements, with attendant performance implications. By truncating the SHA-512 digest [1], users may obtain a tunable level of collision avoidance.
The string returned by this function is Base 64 encoded with URL-safe characters [2], making it suitable for use with URLs or filesystem paths. Base 64 encoding results in an output string that is 4/3 the size of the input. If the length of the input string is not divisible by 3, the output is right-padded with equal signs (=), which have no information content. Therefore, this function requires that digest_size is evenly divisible by 3. (The resulting vmc_digest will be 4/3*digest_size bytes.)
According to [3], the probability of a collision using b bits with m messages (sequences) is:
P(b, m) = m^2 / 2^(b+1).Note that the collision probability depends on the number of messages, but not their size. Solving for the number of messages:
m(b, P) = sqrt(P * 2^(b+1))Solving for the number of bits:
b(m, P) = log2(m^2/P) - 1For various values of
mandP, the number of bytes required according tob(m,P)rounded to next multiple of 3 is:#m
P<1e-24P<1e-21P<1e-18P<1e-15P<1e-12P<1e-091e+0615
12
12
9
9
9
1e+0915
15
12
12
9
9
1e+1215
15
15
12
12
9
1e+1518
15
15
15
12
12
1e+1818
18
15
15
15
12
1e+2121
18
18
15
15
15
1e+2421
21
18
18
15
15
For example, given
1e+18expected messages and a desired collision probability< 1e-15, we usedigest_size = 15(bytes).References