bioutils.normalize module
Provides functionality for normalizing alleles, ensuring comparable representations.
- class bioutils.normalize.NormalizationMode(value)
Bases: Enum
Enum passed to normalize to select the normalization mode.
- EXPAND
Normalize alleles to maximal extent both left and right.
- LEFTSHUFFLE
Normalize alleles to maximal extent left.
- RIGHTSHUFFLE
Normalize alleles to maximal extent right.
- TRIMONLY
Only trim the common prefix and suffix of alleles. Deprecated – use mode=None with trim=True instead.
- VCF
Normalize with VCF.
- EXPAND = 1
- LEFTSHUFFLE = 2
- RIGHTSHUFFLE = 3
- TRIMONLY = 4
- VCF = 5
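The normalize function below is documented as accepting either an enum member or the corresponding string for mode. A minimal sketch of one way such a lookup could work, using a stand-in enum with the values listed above. ModeSketch and resolve_mode are hypothetical names for illustration, and lookup by member name is an assumption, not confirmed library behavior.

```python
from enum import Enum


class ModeSketch(Enum):
    """Stand-in mirroring the documented NormalizationMode values."""
    EXPAND = 1
    LEFTSHUFFLE = 2
    RIGHTSHUFFLE = 3
    TRIMONLY = 4
    VCF = 5


def resolve_mode(mode):
    """Return an enum member for a member, a member name, or None (hypothetical helper)."""
    if mode is None or isinstance(mode, ModeSketch):
        return mode
    # Enum classes support lookup by member name via subscription.
    return ModeSketch[mode]


print(resolve_mode("RIGHTSHUFFLE"))  # ModeSketch.RIGHTSHUFFLE
```

An unknown name raises KeyError in this sketch; how the library reports an invalid mode is not stated here.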
- bioutils.normalize.normalize(sequence, interval, alleles, mode: Optional[NormalizationMode] = NormalizationMode.EXPAND, bounds=None, anchor_length=0, trim: bool = True)
Normalizes the alleles that co-occur on sequence at interval, ensuring comparable representations.
Normalization performs three operations:
- trimming
- shuffling
- anchoring
- Parameters:
sequence (str or iterable) – The reference sequence; must support indexing via __getitem__.
interval (2-tuple of int) – The location of alleles in the reference sequence as (start, end). Interbase coordinates.
alleles (iterable of str) – The sequences to be normalized. The first element corresponds to the reference sequence being unchanged and must be None.
bounds (2-tuple of int, optional) – Maximal extent of normalization left and right. Must be provided if sequence doesn’t support __len__. Defaults to (0, len(sequence)).
mode (NormalizationMode Enum or string, optional) – A NormalizationMode Enum or the corresponding string. Defaults to EXPAND. Set to None to skip shuffling. Does not affect trimming or anchoring.
anchor_length (int, optional) – Number of flanking residues left and right. Defaults to 0.
trim (bool) – Indicates whether to trim the common prefix and suffix of alleles. Defaults to True. Set to False to skip trimming. Does not affect shuffling or anchoring.
- Returns:
(new_interval, [new_alleles])
- Return type:
tuple
- Raises:
ValueError – If normalization mode is VCF and anchor_length is nonzero.
ValueError – If the interval start is greater than the end.
ValueError – If the first (reference) allele is not None.
ValueError – If there are not at least two distinct alleles.
Examples
>>> sequence = "CCCCCCCCACACACACACTAGCAGCAGCA"
>>> normalize(sequence, interval=(22,25), alleles=(None, "GC", "AGCAC"), mode='TRIMONLY')
((22, 24), ('AG', 'G', 'AGCA'))
>>> normalize(sequence, interval=(22, 22), alleles=(None, 'AGC'), mode='RIGHTSHUFFLE')
((29, 29), ('', 'GCA'))
>>> normalize(sequence, interval=(22, 22), alleles=(None, 'AGC'), mode='EXPAND')
((19, 29), ('AGCAGCAGCA', 'AGCAGCAGCAGCA'))
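To make the trimming step concrete, here is a minimal, self-contained sketch that reproduces the TRIMONLY example above without calling the library. trim_only_sketch is a hypothetical name, and the approach (substitute the reference residues for None, then strip the shared prefix and suffix while adjusting the interval) is an assumption about the behavior, not the library's implementation.

```python
from os.path import commonprefix


def trim_only_sketch(sequence, interval, alleles):
    """Trim the shared prefix and suffix of alleles and adjust the interval (illustrative only)."""
    start, end = interval
    # The leading None stands for the unchanged reference residues.
    seqs = [sequence[start:end]] + list(alleles[1:])

    # Strip the shared prefix; the interval start moves right by that amount.
    n_left = len(commonprefix(seqs))
    seqs = [s[n_left:] for s in seqs]
    start += n_left

    # Strip the shared suffix (the common prefix of the reversed strings);
    # the interval end moves left by that amount.
    n_right = len(commonprefix([s[::-1] for s in seqs]))
    seqs = [s[:len(s) - n_right] for s in seqs]
    end -= n_right

    return (start, end), tuple(seqs)


sequence = "CCCCCCCCACACACACACTAGCAGCAGCA"
print(trim_only_sketch(sequence, (22, 25), (None, "GC", "AGCAC")))
# ((22, 24), ('AG', 'G', 'AGCA'))
```

The sketch skips input validation, shuffling, and anchoring, so it covers only the trim operation.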
- bioutils.normalize.roll_left(sequence, alleles, ref_pos, bound)
Determines the common distance that all alleles can be rolled (circularly permuted) left within the reference sequence without altering it.
- Parameters:
sequence (str) – The reference sequence.
alleles (list of str) – The sequences to be normalized.
ref_pos (int) – The beginning index for rolling.
bound (int) – The lower bound index in the reference sequence for normalization, hence also for rolling.
- Returns:
The distance that the alleles can be rolled.
- Return type:
int
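The idea of rolling left can be illustrated with a small stand-alone sketch. It assumes that ref_pos is the index of the reference residue immediately to the left of the alleles and that bound is the lowest index that may be consumed; the library's exact convention for ref_pos may differ. roll_left_sketch is a hypothetical name, not the library function.

```python
def roll_left_sketch(sequence, alleles, ref_pos, bound):
    """Count how many steps every non-empty allele can be rotated left (illustrative only)."""
    d = 0
    while ref_pos - d >= bound:
        ref_base = sequence[ref_pos - d]
        # After d steps, each non-empty allele must end (cyclically) with the
        # reference residue it is about to move past.
        if any(a and a[-(d % len(a)) - 1] != ref_base for a in alleles):
            break
        d += 1
    return d


sequence = "CCCCCCCCACACACACACTAGCAGCAGCA"
# Insertion of "AGC" at interbase position 22; the residue to its left is index 21.
print(roll_left_sketch(sequence, ["", "AGC"], ref_pos=21, bound=0))  # 3
```

A distance of 3 moves the insertion from position 22 to 19, which agrees with the left edge of the EXPAND example for normalize.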
- bioutils.normalize.roll_right(sequence, alleles, ref_pos, bound)
Determines the common distance that all alleles can be rolled (circularly permuted) right within the reference sequence without altering it.
- Parameters:
sequence (str) – The reference sequence.
alleles (list of str) – The sequences to be normalized.
ref_pos (int) – The start index for rolling.
bound (int) – The upper bound index in the reference sequence for normalization, hence also for rolling.
- Returns:
The distance that the alleles can be rolled.
- Return type:
int
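As a rough illustration of rolling right, the sketch below computes a roll distance and then applies the circular permutation to the allele. It assumes that ref_pos is the index of the reference residue immediately to the right of the alleles and that bound is an exclusive upper index; both are assumptions, and roll_right_sketch is a hypothetical name rather than the library function.

```python
def roll_right_sketch(sequence, alleles, ref_pos, bound):
    """Count how many steps every non-empty allele can be rotated right (illustrative only)."""
    d = 0
    while ref_pos + d < bound:
        ref_base = sequence[ref_pos + d]
        # After d steps, each non-empty allele must start (cyclically) with the
        # reference residue it is about to move past.
        if any(a and a[d % len(a)] != ref_base for a in alleles):
            break
        d += 1
    return d


sequence = "CCCCCCCCACACACACACTAGCAGCAGCA"
alleles = ["", "AGC"]  # insertion of "AGC" at interbase position 22
d = roll_right_sketch(sequence, alleles, ref_pos=22, bound=len(sequence))

# Apply the circular permutation: rotate each non-empty allele by d positions.
rolled = [a[d % len(a):] + a[:d % len(a)] if a else a for a in alleles]
print(d, rolled)  # 7 ['', 'GCA']
```

Shifting position 22 by 7 gives 29 with the allele rotated to "GCA", matching the RIGHTSHUFFLE example for normalize.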
- bioutils.normalize.trim_left(alleles)
Removes common prefix of given alleles.
- Parameters:
alleles (list of str) – A list of alleles.
- Returns:
(number_trimmed, [new_alleles]).
- Return type:
tuple
Examples
>>> trim_left(["","AA"])
(0, ['', 'AA'])
>>> trim_left(["A","AA"])
(1, ['', 'A'])
>>> trim_left(["AT","AA"])
(1, ['T', 'A'])
>>> trim_left(["AA","AA"])
(2, ['', ''])
>>> trim_left(["CAG","CG"])
(1, ['AG', 'G'])
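The behavior in these examples can be approximated in a few lines with the standard library. This is a sketch of the technique, not the library's code; trim_left_sketch is a hypothetical name, and using os.path.commonprefix (which compares strings character by character) is a choice made here for brevity.

```python
from os.path import commonprefix


def trim_left_sketch(alleles):
    """Strip the prefix shared by all alleles and report its length (illustrative only)."""
    n = len(commonprefix(list(alleles)))
    return n, [a[n:] for a in alleles]


print(trim_left_sketch(["CAG", "CG"]))  # (1, ['AG', 'G'])
```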
- bioutils.normalize.trim_right(alleles)
Removes common suffix of given alleles.
- Parameters:
alleles (list of str) – A list of alleles.
- Returns:
(number_trimmed, [new_alleles]).
- Return type:
tuple
Examples
>>> trim_right(["","AA"])
(0, ['', 'AA'])
>>> trim_right(["A","AA"])
(1, ['', 'A'])
>>> trim_right(["AT","AA"])
(0, ['AT', 'AA'])
>>> trim_right(["AA","AA"])
(2, ['', ''])
>>> trim_right(["CAG","CG"])
(1, ['CA', 'C'])
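Suffix trimming can be sketched by treating the common suffix as the common prefix of the reversed strings. trim_right_sketch is a hypothetical name, and this is an illustration of the idea under that assumption rather than the library's implementation.

```python
from os.path import commonprefix


def trim_right_sketch(alleles):
    """Strip the suffix shared by all alleles and report its length (illustrative only)."""
    n = len(commonprefix([a[::-1] for a in alleles]))
    # Slice with an explicit end index; a[:-n] would wrongly return '' when n == 0.
    return n, [a[:len(a) - n] for a in alleles]


print(trim_right_sketch(["CAG", "CG"]))  # (1, ['CA', 'C'])
```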