bioutils.normalize module
Provides functionality for normalizing alleles, ensuring comparable representations.
-
class bioutils.normalize.NormalizationMode(value)
Bases: enum.Enum
Enum passed to normalize to select the normalization mode.
-
EXPAND = 1
Normalize alleles to maximal extent both left and right.
-
LEFTSHUFFLE = 2
Normalize alleles to maximal extent left.
-
RIGHTSHUFFLE = 3
Normalize alleles to maximal extent right.
-
TRIMONLY = 4
Only trim the common prefix and suffix of alleles.
-
VCF = 5
Normalize with VCF.
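The mode may be given as an Enum member or as the corresponding string. The block below is a minimal sketch of that idea using a stand-in enum with the member names and values listed above. `ModeSketch` and `resolve_mode` are hypothetical names for illustration; how the library itself resolves strings is not shown here and is an assumption.

```python
from enum import Enum


class ModeSketch(Enum):
    """Stand-in for NormalizationMode (hypothetical, mirrors the documented members)."""

    EXPAND = 1
    LEFTSHUFFLE = 2
    RIGHTSHUFFLE = 3
    TRIMONLY = 4
    VCF = 5


def resolve_mode(mode):
    """Accept either a ModeSketch member or its name as a string.

    Assumption: a string is looked up by member name; an unknown name
    raises KeyError from the Enum lookup.
    """
    if isinstance(mode, ModeSketch):
        return mode
    return ModeSketch[mode]


print(resolve_mode("TRIMONLY"))        # ModeSketch.TRIMONLY
print(resolve_mode(ModeSketch.VCF))    # ModeSketch.VCF
```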
-
-
bioutils.normalize.normalize(sequence, interval, alleles, mode=<NormalizationMode.EXPAND: 1>, bounds=None, anchor_length=0)
Normalizes the alleles that co-occur on sequence at interval, ensuring comparable representations.
- Parameters
sequence (str or iterable) – The reference sequence; must support indexing via __getitem__.
interval (2-tuple of int) – The location of the alleles in the reference sequence as (start, end), in interbase coordinates.
alleles (iterable of str) – The sequences to be normalized. The first element corresponds to the unchanged reference sequence and must be None.
bounds (2-tuple of int, optional) – Maximal extent of normalization left and right. Must be provided if sequence doesn't support __len__. Defaults to (0, len(sequence)).
mode (NormalizationMode Enum or string, optional) – A NormalizationMode Enum or the corresponding string. Defaults to EXPAND.
anchor_length (int, optional) – Number of flanking residues left and right. Defaults to 0.
- Returns
(new_interval, [new_alleles])
- Return type
tuple
- Raises
ValueError – If normalization mode is VCF and anchor_length is nonzero.
ValueError – If the interval start is greater than the end.
ValueError – If the first (reference) allele is not None.
ValueError – If there are not at least two distinct alleles.
Examples
>>> sequence = "CCCCCCCCACACACACACTAGCAGCAGCA"
>>> normalize(sequence, interval=(22,25), alleles=(None, "GC", "AGCAC"), mode='TRIMONLY')
((22, 24), ('AG', 'G', 'AGCA'))
>>> normalize(sequence, interval=(22, 22), alleles=(None, 'AGC'), mode='RIGHTSHUFFLE')
((29, 29), ('', 'GCA'))
>>> normalize(sequence, interval=(22, 22), alleles=(None, 'AGC'), mode='EXPAND')
((19, 29), ('AGCAGCAGCA', 'AGCAGCAGCAGCA'))
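To make the TRIMONLY example concrete, the following is a minimal, self-contained sketch of trim-only normalization in interbase coordinates. It does not call the library: `trim_only_sketch` is a hypothetical helper, and the order of trimming (prefix first, then suffix) is an assumption. It fills in the reference allele by slicing the sequence with the interval, removes the common prefix and suffix, and shrinks the interval to match.

```python
def trim_only_sketch(sequence, interval, alleles):
    """Illustrative trim-only normalization (not the library implementation)."""
    start, end = interval
    if start > end:
        raise ValueError("interval start must not exceed end")
    if alleles[0] is not None:
        raise ValueError("first (reference) allele must be None")

    # Interbase coordinates: the reference allele is simply sequence[start:end].
    seqs = [sequence[start:end]] + list(alleles[1:])

    # Count and remove the common prefix, moving start to the right.
    n = 0
    shortest = min(len(a) for a in seqs)
    while n < shortest and all(a[n] == seqs[0][n] for a in seqs):
        n += 1
    seqs = [a[n:] for a in seqs]
    start += n

    # Count and remove the common suffix, moving end to the left.
    m = 0
    shortest = min(len(a) for a in seqs)
    while m < shortest and all(a[-1 - m] == seqs[0][-1 - m] for a in seqs):
        m += 1
    seqs = [a[: len(a) - m] for a in seqs]
    end -= m

    return (start, end), tuple(seqs)


sequence = "CCCCCCCCACACACACACTAGCAGCAGCA"
print(trim_only_sketch(sequence, (22, 25), (None, "GC", "AGCAC")))
# ((22, 24), ('AG', 'G', 'AGCA'))
```

Here the reference allele is `sequence[22:25]`, that is `AGC`; the three alleles share no prefix and a one-residue suffix (`C`), so only the end of the interval moves.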
-
bioutils.normalize.roll_left(sequence, alleles, ref_pos, bound)
Determines the common distance all alleles can be rolled (circularly permuted) left within the reference sequence without altering it.
- Parameters
sequence (str) – The reference sequence.
alleles (list of str) – The sequences to be normalized.
ref_pos (int) – The beginning index for rolling.
bound (int) – The lower bound index in the reference sequence for normalization, hence also for rolling.
- Returns
The distance that the alleles can be rolled.
- Return type
int
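The idea of rolling left can be sketched as follows. This is an illustrative re-implementation, not the library source: `roll_left_sketch` is a hypothetical name, and it assumes that ref_pos is the index of the first reference residue to compare (the one immediately left of the variant) and that bound is an inclusive lower index. At each step, every non-empty allele, read circularly from its last residue backwards, must match the reference residue being crossed.

```python
def roll_left_sketch(sequence, alleles, ref_pos, bound):
    """Count how far all alleles can be rolled left (illustrative sketch)."""
    d = 0
    while ref_pos - d >= bound:
        ref_char = sequence[ref_pos - d]
        # Empty alleles impose no constraint; others are read circularly
        # from the right end: a[-1], a[-2], ..., wrapping around.
        if any(a and a[-1 - (d % len(a))] != ref_char for a in alleles):
            break
        d += 1
    return d


sequence = "CCCCCCCCACACACACACTAGCAGCAGCA"
# Insertion of 'AGC' at interbase position 22; the residue to its left is index 21.
print(roll_left_sketch(sequence, ["", "AGC"], 21, 0))  # 3
```

A distance of 3 is consistent with the start coordinate 19 in the EXPAND example for normalize, since 22 - 3 = 19.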
-
bioutils.normalize.roll_right(sequence, alleles, ref_pos, bound)
Determines the common distance all alleles can be rolled (circularly permuted) right within the reference sequence without altering it.
- Parameters
sequence (str) – The reference sequence.
alleles (list of str) – The sequences to be normalized.
ref_pos (int) – The start index for rolling.
bound (int) – The upper bound index in the reference sequence for normalization, hence also for rolling.
- Returns
The distance that the alleles can be rolled.
- Return type
int
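A matching sketch for rolling right is shown below. As with any such illustration, this is not the library source: `roll_right_sketch` is a hypothetical name, and it assumes that ref_pos is the index of the first reference residue to the right of the variant and that bound is an inclusive upper index. Each non-empty allele is read circularly from its first residue forwards and compared with the reference residue being crossed.

```python
def roll_right_sketch(sequence, alleles, ref_pos, bound):
    """Count how far all alleles can be rolled right (illustrative sketch)."""
    d = 0
    while ref_pos + d <= bound:
        ref_char = sequence[ref_pos + d]
        # Empty alleles impose no constraint; others are read circularly
        # from the left end: a[0], a[1], ..., wrapping around.
        if any(a and a[d % len(a)] != ref_char for a in alleles):
            break
        d += 1
    return d


sequence = "CCCCCCCCACACACACACTAGCAGCAGCA"
dist = roll_right_sketch(sequence, ["", "AGC"], 22, len(sequence) - 1)
print(dist)  # 7

# Rotating the inserted allele by dist % len(allele) gives its shifted form.
allele = "AGC"
k = dist % len(allele)
print(allele[k:] + allele[:k])  # GCA
```

The distance 7 and the rotated allele `GCA` agree with the RIGHTSHUFFLE example for normalize, where the interval moves from (22, 22) to (29, 29).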
-
bioutils.normalize.trim_left(alleles)
Removes the common prefix of the given alleles.
- Parameters
alleles (list of str) – A list of alleles.
- Returns
(number_trimmed, [new_alleles])
- Return type
tuple
Examples
>>> trim_left(["","AA"])
(0, ['', 'AA'])
>>> trim_left(["A","AA"])
(1, ['', 'A'])
>>> trim_left(["AT","AA"])
(1, ['T', 'A'])
>>> trim_left(["AA","AA"])
(2, ['', ''])
>>> trim_left(["CAG","CG"])
(1, ['AG', 'G'])
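The behaviour in these examples can be reproduced with a short stand-alone function. The version below is a sketch under the assumption that the list is non-empty; `trim_left_sketch` is a hypothetical name and is not the library's implementation.

```python
def trim_left_sketch(alleles):
    """Remove the common prefix of all alleles (illustrative sketch).

    Returns (number_trimmed, new_alleles). Assumes a non-empty list.
    """
    n = 0
    shortest = min(len(a) for a in alleles)
    # Advance while every allele has the same residue at position n.
    while n < shortest and all(a[n] == alleles[0][n] for a in alleles):
        n += 1
    return n, [a[n:] for a in alleles]


print(trim_left_sketch(["CAG", "CG"]))  # (1, ['AG', 'G'])
```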
-
bioutils.normalize.trim_right(alleles)
Removes the common suffix of the given alleles.
- Parameters
alleles (list of str) – A list of alleles.
- Returns
(number_trimmed, [new_alleles])
- Return type
tuple
Examples
>>> trim_right(["","AA"])
(0, ['', 'AA'])
>>> trim_right(["A","AA"])
(1, ['', 'A'])
>>> trim_right(["AT","AA"])
(0, ['AT', 'AA'])
>>> trim_right(["AA","AA"])
(2, ['', ''])
>>> trim_right(["CAG","CG"])
(1, ['CA', 'C'])
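Suffix trimming mirrors prefix trimming, comparing residues from the right end. The function below is a sketch that reproduces the examples above, assuming a non-empty list; `trim_right_sketch` is a hypothetical name and is not the library's implementation.

```python
def trim_right_sketch(alleles):
    """Remove the common suffix of all alleles (illustrative sketch).

    Returns (number_trimmed, new_alleles). Assumes a non-empty list.
    """
    n = 0
    shortest = min(len(a) for a in alleles)
    # Advance while every allele has the same residue n places from the end.
    while n < shortest and all(a[-1 - n] == alleles[0][-1 - n] for a in alleles):
        n += 1
    # Slice with an explicit end index so that n == 0 leaves alleles intact.
    return n, [a[: len(a) - n] for a in alleles]


print(trim_right_sketch(["CAG", "CG"]))  # (1, ['CA', 'C'])
```

The explicit `len(a) - n` end index matters: slicing with `a[:-n]` would return an empty string when nothing is trimmed.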