bioutils.normalize module

Provides functionality for normalizing alleles, ensuring comparable representations.

class bioutils.normalize.NormalizationMode(value)

Bases: Enum

Enum passed to normalize to select the normalization mode.

EXPAND

Normalize alleles to maximal extent both left and right.

LEFTSHUFFLE

Normalize alleles to maximal extent left.

RIGHTSHUFFLE

Normalize alleles to maximal extent right.

TRIMONLY

Only trim the common prefix and suffix of alleles. Deprecated – use mode=None with trim=True instead.

VCF

Normalize following VCF conventions.

EXPAND = 1
LEFTSHUFFLE = 2
RIGHTSHUFFLE = 3
TRIMONLY = 4
VCF = 5
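
Since normalize accepts the mode either as an enum member or as the corresponding string, a lookup by name is one way to reconcile the two forms. The following is a minimal sketch using a stand-in enum with the values listed above; NormalizationModeSketch and resolve_mode are hypothetical names for illustration and are not part of bioutils.

```python
from enum import Enum


class NormalizationModeSketch(Enum):
    """Stand-in mirroring the documented member names and values."""

    EXPAND = 1
    LEFTSHUFFLE = 2
    RIGHTSHUFFLE = 3
    TRIMONLY = 4
    VCF = 5


def resolve_mode(mode):
    """Accept None, an enum member, or a member name (hypothetical helper)."""
    if mode is None or isinstance(mode, NormalizationModeSketch):
        return mode
    # Enum classes support lookup by member name; unknown names raise KeyError.
    return NormalizationModeSketch[mode]


print(resolve_mode("RIGHTSHUFFLE"))  # NormalizationModeSketch.RIGHTSHUFFLE
```
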
bioutils.normalize.normalize(sequence, interval, alleles, mode: Optional[NormalizationMode] = NormalizationMode.EXPAND, bounds=None, anchor_length=0, trim: bool = True)

Normalizes the alleles that co-occur on sequence at interval, ensuring comparable representations.

Normalization performs three operations:
  • trimming

  • shuffling

  • anchoring

Parameters:
  • sequence (str or iterable) – The reference sequence; must support indexing (that is, implement __getitem__).

  • interval (2-tuple of int) – The location of alleles in the reference sequence as (start, end). Interbase coordinates.

  • alleles (iterable of str) – The sequences to be normalized. The first element corresponds to the reference sequence being unchanged and must be None.

  • bounds (2-tuple of int, optional) – Maximal extent of normalization left and right. Must be provided if sequence doesn’t support __len__. Defaults to (0, len(sequence)).

  • mode (NormalizationMode Enum or string, optional) – A NormalizationMode Enum or the corresponding string. Defaults to EXPAND. Set to None to skip shuffling. Does not affect trimming or anchoring.

  • anchor_length (int, optional) – Number of flanking residues to include left and right of the alleles. Defaults to 0.

  • trim (bool) – Whether to trim the common prefix and suffix of alleles. Defaults to True. Set to False to skip trimming. Does not affect shuffling or anchoring.

Returns:

(new_interval, [new_alleles])

Return type:

tuple

Raises:
  • ValueError – If normalization mode is VCF and anchor_length is nonzero.

  • ValueError – If the interval start is greater than the end.

  • ValueError – If the first (reference) allele is not None.

  • ValueError – If there are not at least two distinct alleles.

Examples

>>> sequence = "CCCCCCCCACACACACACTAGCAGCAGCA"
>>> normalize(sequence, interval=(22,25), alleles=(None, "GC", "AGCAC"), mode='TRIMONLY')
((22, 24), ('AG', 'G', 'AGCA'))
>>> normalize(sequence, interval=(22, 22), alleles=(None, 'AGC'), mode='RIGHTSHUFFLE')
((29, 29), ('', 'GCA'))
>>> normalize(sequence, interval=(22, 22), alleles=(None, 'AGC'), mode='EXPAND')
((19, 29), ('AGCAGCAGCA', 'AGCAGCAGCAGCA'))
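
The results above are different representations of the same change. The sketch below illustrates this with plain slicing in interbase coordinates: applying each (interval, allele) pair to the reference yields the same edited sequence. apply_allele is a hypothetical helper written for this illustration, not a bioutils function.

```python
def apply_allele(sequence, interval, allele):
    """Replace sequence[start:end] with allele (hypothetical helper).

    The interval is in interbase coordinates, so it maps directly onto a
    Python slice.
    """
    start, end = interval
    return sequence[:start] + allele + sequence[end:]


sequence = "CCCCCCCCACACACACACTAGCAGCAGCA"

# The reference allele for an interval is simply a slice of the sequence.
assert sequence[22:25] == "AGC"
assert sequence[22:22] == ""  # zero-width interval: an insertion site

# Three representations of the same insertion, taken from the examples above.
original = apply_allele(sequence, (22, 22), "AGC")
right_shuffled = apply_allele(sequence, (29, 29), "GCA")
expanded = apply_allele(sequence, (19, 29), "AGCAGCAGCAGCA")

assert original == right_shuffled == expanded
```
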
bioutils.normalize.roll_left(sequence, alleles, ref_pos, bound)

Determines the common distance that all alleles can be rolled (circularly permuted) left within the reference sequence without altering it.

Parameters:
  • sequence (str) – The reference sequence.

  • alleles (list of str) – The sequences to be normalized.

  • ref_pos (int) – The beginning index for rolling.

  • bound (int) – The lower bound index in the reference sequence for normalization, hence also for rolling.

Returns:

The distance that the alleles can be rolled.

Return type:

int
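
As an illustration of the idea, the following minimal sketch counts how far a set of alleles can roll left. It is not the library's implementation, and its argument conventions are assumptions made for this example: pos is an interbase position (the residue to its left is sequence[pos - 1]) and bound is the leftmost interbase position allowed. The name roll_left_sketch is hypothetical.

```python
def roll_left_sketch(sequence, alleles, pos, bound):
    """Count the steps all alleles can be rolled left from interbase pos."""
    d = 0
    while pos - d > bound and all(
        # Empty alleles never block rolling. For the others, the residue
        # that wraps around (read from the allele's end, cyclically) must
        # equal the reference residue being stepped over.
        not a or a[(-(d + 1)) % len(a)] == sequence[pos - d - 1]
        for a in alleles
    ):
        d += 1
    return d


sequence = "CCCCCCCCACACACACACTAGCAGCAGCA"
print(roll_left_sketch(sequence, ["", "AGC"], 22, 0))  # 3
```

The distance of 3 is consistent with the EXPAND example for normalize, where the insertion at position 22 extends left to position 19.
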

bioutils.normalize.roll_right(sequence, alleles, ref_pos, bound)

Determines the common distance that all alleles can be rolled (circularly permuted) right within the reference sequence without altering it.

Parameters:
  • sequence (str) – The reference sequence.

  • alleles (list of str) – The sequences to be normalized.

  • ref_pos (int) – The start index for rolling.

  • bound (int) – The upper bound index in the reference sequence for normalization, hence also for rolling.

Returns:

The distance that the alleles can be rolled.

Return type:

int
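
The same idea in the other direction can be sketched as follows. This is an illustrative reimplementation rather than the library's code, and it assumes that pos is an interbase position (the residue to its right is sequence[pos]) and that bound is an exclusive upper limit; the library's own conventions for ref_pos and bound may differ. The name roll_right_sketch is hypothetical.

```python
def roll_right_sketch(sequence, alleles, pos, bound):
    """Count the steps all alleles can be rolled right from interbase pos."""
    d = 0
    while pos + d < bound and all(
        # Empty alleles never block rolling. For the others, the residue
        # that wraps around (read from the allele's start, cyclically) must
        # equal the reference residue being stepped over.
        not a or a[d % len(a)] == sequence[pos + d]
        for a in alleles
    ):
        d += 1
    return d


sequence = "CCCCCCCCACACACACACTAGCAGCAGCA"
print(roll_right_sketch(sequence, ["", "AGC"], 22, len(sequence)))  # 7
```

The distance of 7 is consistent with the RIGHTSHUFFLE example for normalize, where the insertion moves from position 22 to position 29.
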

bioutils.normalize.trim_left(alleles)

Removes common prefix of given alleles.

Parameters:

alleles (list of str) – A list of alleles.

Returns:

(number_trimmed, [new_alleles]).

Return type:

tuple

Examples

>>> trim_left(["","AA"])
(0, ['', 'AA'])
>>> trim_left(["A","AA"])
(1, ['', 'A'])
>>> trim_left(["AT","AA"])
(1, ['T', 'A'])
>>> trim_left(["AA","AA"])
(2, ['', ''])
>>> trim_left(["CAG","CG"])
(1, ['AG', 'G'])
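
For readers who want to see the behavior spelled out, here is a minimal sketch of common-prefix trimming that reproduces the results above. It is an illustrative reimplementation under the name trim_left_sketch (hypothetical), not the library's source.

```python
def trim_left_sketch(alleles):
    """Strip the longest prefix shared by all alleles.

    Returns (number_trimmed, new_alleles).
    """
    shortest = min(len(a) for a in alleles)
    n = 0
    # Advance while every allele has the same residue at index n.
    while n < shortest and len({a[n] for a in alleles}) == 1:
        n += 1
    return n, [a[n:] for a in alleles]


print(trim_left_sketch(["CAG", "CG"]))  # (1, ['AG', 'G'])
```
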
bioutils.normalize.trim_right(alleles)

Removes common suffix of given alleles.

Parameters:

alleles (list of str) – A list of alleles.

Returns:

(number_trimmed, [new_alleles]).

Return type:

tuple

Examples

>>> trim_right(["","AA"])
(0, ['', 'AA'])
>>> trim_right(["A","AA"])
(1, ['', 'A'])
>>> trim_right(["AT","AA"])
(0, ['AT', 'AA'])
>>> trim_right(["AA","AA"])
(2, ['', ''])
>>> trim_right(["CAG","CG"])
(1, ['CA', 'C'])
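
Suffix trimming can be viewed as prefix trimming on reversed strings, and the number of residues trimmed tells the caller how far to move the end of the interval. The sketch below shows both points; trim_right_sketch is a hypothetical reimplementation for illustration, and the interval update is an assumption about how a caller might use the result, checked here against the TRIMONLY example for normalize.

```python
import os.path


def trim_right_sketch(alleles):
    """Strip the longest suffix shared by all alleles.

    Returns (number_trimmed, new_alleles).
    """
    # os.path.commonprefix compares strings character by character, so it
    # works on arbitrary strings, not only on file paths.
    n = len(os.path.commonprefix([a[::-1] for a in alleles]))
    return n, [a[: len(a) - n] for a in alleles]


sequence = "CCCCCCCCACACACACACTAGCAGCAGCA"
start, end = 22, 25
# Fill in the reference allele from the interval before trimming.
alleles = [sequence[start:end], "GC", "AGCAC"]

n, alleles = trim_right_sketch(alleles)
end -= n  # trimming on the right moves the interval end left
print((start, end), alleles)  # (22, 24) ['AG', 'G', 'AGCA']
```
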