bioutils.sequences module

Simple functions and lookup tables for nucleic acid and amino acid sequences.

class bioutils.sequences.StrEnum(value)

Bases: str, Enum

Utility base class for enums whose members are also strings.

class bioutils.sequences.TranslationTable(value)

Bases: StrEnum

An enum that controls switching between standard and selenocysteine translation tables.

selenocysteine = 'sec'
standard = 'standard'
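
Because of the str mixin, members can be compared with plain strings and looked up by value. A minimal sketch that re-creates the two classes locally from the documented bases and member values (the class bodies are an illustration, not the library source):

```python
from enum import Enum


class StrEnum(str, Enum):
    """Local stand-in: an Enum whose members are also str instances."""


class TranslationTable(StrEnum):
    standard = "standard"
    selenocysteine = "sec"


# Members compare equal to their string values...
is_equal = TranslationTable.standard == "standard"
# ...and a member can be recovered from its value.
member = TranslationTable("sec")
print(is_equal, member is TranslationTable.selenocysteine)
```
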
bioutils.sequences.aa1_to_aa3(seq)

Converts string of 1-letter amino acids to 3-letter amino acids.

Should only be used if the format of the sequence is known; otherwise use aa_to_aa3().

Parameters:

seq (str) – An amino acid sequence as 1-letter amino acids.

Returns:

The sequence as 3-letter amino acids.

Return type:

str

Raises:

KeyError – If the sequence is not of 1-letter amino acids.

Examples

>>> aa1_to_aa3("CATSARELAME")
'CysAlaThrSerAlaArgGluLeuAlaMetGlu'
>>> aa1_to_aa3(None)
bioutils.sequences.aa3_to_aa1(seq)

Converts string of 3-letter amino acids to 1-letter amino acids.

Should only be used if the format of the sequence is known; otherwise use aa_to_aa1().

Parameters:

seq (str) – An amino acid sequence as 3-letter amino acids.

Returns:

The sequence as 1-letter amino acids.

Return type:

str

Raises:

KeyError – If the sequence is not of 3-letter amino acids.

Examples

>>> aa3_to_aa1("CysAlaThrSerAlaArgGluLeuAlaMetGlu")
'CATSARELAME'
>>> aa3_to_aa1(None)
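
The two conversions above can be pictured as dictionary lookups in opposite directions. The sketch below is an assumption about the general approach, not the library's code: the table is hypothetical and covers only the residues used in the examples, and the function names carry a _sketch suffix to mark them as illustrations.

```python
# Hypothetical lookup table limited to the residues in the examples;
# a complete table would list every amino acid code.
AA1_TO_AA3 = {
    "A": "Ala", "C": "Cys", "E": "Glu", "L": "Leu",
    "M": "Met", "R": "Arg", "S": "Ser", "T": "Thr",
}
# The reverse mapping is derived by inverting the table.
AA3_TO_AA1 = {aa3: aa1 for aa1, aa3 in AA1_TO_AA3.items()}


def aa1_to_aa3_sketch(seq):
    """Map each 1-letter code to its 3-letter code; None passes through."""
    if seq is None:
        return None
    return "".join(AA1_TO_AA3[aa] for aa in seq)


def aa3_to_aa1_sketch(seq):
    """Split seq into 3-character chunks and map each to one letter."""
    if seq is None:
        return None
    return "".join(AA3_TO_AA1[seq[i:i + 3]] for i in range(0, len(seq), 3))


print(aa1_to_aa3_sketch("CATSARELAME"))
print(aa3_to_aa1_sketch("CysAlaThrSerAlaArgGluLeuAlaMetGlu"))
```

A symbol missing from the table raises KeyError from the dictionary lookup, which matches the documented failure mode.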
bioutils.sequences.aa_to_aa1(seq)

Coerces string of 1- or 3-letter amino acids to 1-letter representation.

Parameters:

seq (str) – An amino acid sequence.

Returns:

The sequence as 1-letter amino acids.

Return type:

str

Examples

>>> aa_to_aa1("CATSARELAME")
'CATSARELAME'
>>> aa_to_aa1("CysAlaThrSerAlaArgGluLeuAlaMetGlu")
'CATSARELAME'
>>> aa_to_aa1(None)
bioutils.sequences.aa_to_aa3(seq)

Coerces string of 1- or 3-letter amino acids to 3-letter representation.

Parameters:

seq (str) – An amino acid sequence.

Returns:

The sequence as 3-letter amino acids.

Return type:

str

Examples

>>> aa_to_aa3("CATSARELAME")
'CysAlaThrSerAlaArgGluLeuAlaMetGlu'
>>> aa_to_aa3("CysAlaThrSerAlaArgGluLeuAlaMetGlu")
'CysAlaThrSerAlaArgGluLeuAlaMetGlu'
>>> aa_to_aa3(None)
bioutils.sequences.complement(seq)

Retrieves the complement of a sequence.

Parameters:

seq (str) – A nucleotide sequence.

Returns:

The complement of the sequence.

Return type:

str

Examples

>>> complement("ATCG")
'TAGC'
>>> complement(None)
bioutils.sequences.elide_sequence(s, flank=5, elision='...')

Trims the middle of the sequence, leaving the left and right flanks. If eliding would not make the sequence shorter, the sequence is returned unchanged.

Parameters:
  • s (str) – A sequence.

  • flank (int, optional) – The length of each flank. Defaults to five.

  • elision (str, optional) – The symbol used to represent the part trimmed. Defaults to '...'.

Returns:

The sequence with the middle replaced by elision.

Return type:

str

Examples

>>> elide_sequence("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
'ABCDE...VWXYZ'
>>> elide_sequence("ABCDEFGHIJKLMNOPQRSTUVWXYZ", flank=3)
'ABC...XYZ'
>>> elide_sequence("ABCDEFGHIJKLMNOPQRSTUVWXYZ", elision="..")
'ABCDE..VWXYZ'
>>> elide_sequence("ABCDEFGHIJKLMNOPQRSTUVWXYZ", flank=12)
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> elide_sequence("ABCDEFGHIJKLMNOPQRSTUVWXYZ", flank=12, elision=".")
'ABCDEFGHIJKL.OPQRSTUVWXYZ'
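
The last two examples show that elision only happens when it shortens the string. A minimal sketch that reproduces the documented outputs (the function name and the exact length test are assumptions inferred from those examples, not the library source):

```python
def elide_sequence_sketch(s, flank=5, elision="..."):
    """Replace the middle of s with elision when the result is shorter."""
    if len(s) <= 2 * flank + len(elision):
        return s  # eliding would not shorten the string
    # Slice from len(s) - flank rather than -flank so that flank=0 works.
    return s[:flank] + elision + s[len(s) - flank:]


alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print(elide_sequence_sketch(alphabet))
print(elide_sequence_sketch(alphabet, flank=12))
print(elide_sequence_sketch(alphabet, flank=12, elision="."))
```
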
bioutils.sequences.looks_like_aa3_p(seq)

Indicates whether a string looks like a 3-letter AA string.

Parameters:

seq (str) – A sequence.

Returns:

Whether the string is of the format of a 3-letter AA string.

Return type:

bool
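
The text above does not say how the format is detected. One plausible heuristic, shown here purely as an assumption, is that a 3-letter string has a length divisible by three and a lowercase second character (as in "Cys"), whereas 1-letter strings are normally all uppercase. The function name is hypothetical.

```python
from string import ascii_lowercase


def looks_like_aa3_sketch(seq):
    """Guess whether seq is written with 3-letter amino acid codes.

    Assumed heuristic: length divisible by three and a lowercase
    second character. This is a format guess, not a validity check.
    """
    return (
        seq is not None
        and len(seq) % 3 == 0
        and (len(seq) == 0 or seq[1] in ascii_lowercase)
    )


print(looks_like_aa3_sketch("CysAlaThr"))
print(looks_like_aa3_sketch("CATSARELAME"))
```

A check like this only looks at the shape of the string, so an all-uppercase 1-letter sequence whose length happens to be a multiple of three is still classified as 1-letter because its second character is uppercase.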

bioutils.sequences.normalize_sequence(seq)

Converts sequence to normalized representation for hashing.

Essentially, removes whitespace and asterisks, and uppercases the string.

Parameters:

seq (str) – The sequence to be normalized.

Returns:

The sequence as a string of uppercase letters.

Return type:

str

Raises:

RuntimeError – If the sequence contains non-alphabetic characters (other than whitespace and ‘*’).

Examples

>>> normalize_sequence("ACGT")
'ACGT'
>>> normalize_sequence("  A C G T * ")
'ACGT'
>>> normalize_sequence("ACGT1")
Traceback (most recent call last):
...
RuntimeError: Normalized sequence contains non-alphabetic characters
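
The normalization described above (strip whitespace and asterisks, uppercase, then reject anything that is not a letter) can be sketched with two regular expressions. This is an illustrative re-implementation under a hypothetical name, written to match the documented examples rather than copied from the library:

```python
import re


def normalize_sequence_sketch(seq):
    """Strip whitespace and asterisks, uppercase, then require letters only."""
    nseq = re.sub(r"[\s*]", "", seq).upper()
    if re.search(r"[^A-Z]", nseq):
        raise RuntimeError("Normalized sequence contains non-alphabetic characters")
    return nseq


print(normalize_sequence_sketch("  a c g t * "))
```
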
bioutils.sequences.replace_t_to_u(seq)

Replaces the T’s in a sequence with U’s.

Parameters:

seq (str) – A nucleotide sequence.

Returns:

The sequence with the T’s replaced by U’s.

Return type:

str

Examples

>>> replace_t_to_u("ACGT")
'ACGU'
>>> replace_t_to_u(None)
bioutils.sequences.replace_u_to_t(seq)

Replaces the U’s in a sequence with T’s.

Parameters:

seq (str) – A nucleotide sequence.

Returns:

The sequence with the U’s replaced by T’s.

Return type:

str

Examples

>>> replace_u_to_t("ACGU")
'ACGT'
>>> replace_u_to_t(None)
bioutils.sequences.reverse_complement(seq)

Converts a sequence to its reverse complement.

Parameters:

seq (str) – A nucleotide sequence.

Returns:

The reverse complement of the sequence.

Return type:

str

Examples

>>> reverse_complement("ATCG")
'CGAT'
>>> reverse_complement(None)
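
Complementing and reverse-complementing can both be expressed with a character translation table. The sketch below handles only the four uppercase DNA bases; how the library treats other symbols is not stated here, so that part is deliberately left out. The names are hypothetical.

```python
# Translation table for the four unambiguous DNA bases (uppercase only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement_sketch(seq):
    """Swap each base for its partner; None passes through."""
    if seq is None:
        return None
    return seq.translate(_COMPLEMENT)


def reverse_complement_sketch(seq):
    """Complement the sequence, then reverse it."""
    if seq is None:
        return None
    return complement_sketch(seq)[::-1]


print(complement_sketch("ATCG"))
print(reverse_complement_sketch("ATCG"))
```
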
bioutils.sequences.translate_cds(seq, full_codons=True, ter_symbol='*', translation_table=TranslationTable.standard)

Translates a DNA or RNA sequence into a single-letter amino acid sequence.

Parameters:
  • seq (str) – A nucleotide sequence.

  • full_codons (bool, optional) – If True, requires the sequence length to be a multiple of 3 and raises an error otherwise. If False, ter_symbol is appended as the last amino acid when a partial codon remains. This corresponds to biopython’s behavior of padding the last codon with Ns. Defaults to True.

  • ter_symbol (str, optional) – Placeholder for the last amino acid if the sequence length is not divisible by three and full_codons is False. Defaults to '*'.

  • translation_table (TranslationTable, optional) – One of the options from TranslationTable, indicating which codon-to-amino-acid translation table to use. Defaults to the standard translation table for humans. To enable translation for selenoproteins, use TranslationTable.selenocysteine.

Returns:

The corresponding single letter amino acid sequence.

Return type:

str

Raises:
  • ValueError – If full_codons is True and the sequence length is not a multiple of three.

  • ValueError – If a codon is undefined in the table.

Examples

>>> translate_cds("ATGCGA")
'MR'
>>> translate_cds("AUGCGA")
'MR'
>>> translate_cds(None)
>>> translate_cds("")
''
>>> translate_cds("AUGCG")
Traceback (most recent call last):
...
ValueError: Sequence length must be a multiple of three
>>> translate_cds("AUGCG", full_codons=False)
'M*'
>>> translate_cds("ATGTAN")
'MX'
>>> translate_cds("CCN")
'P'
>>> translate_cds("TRA")
'*'
>>> translate_cds("TTNTA", full_codons=False)
'X*'
>>> translate_cds("CTB")
'L'
>>> translate_cds("AGM")
'X'
>>> translate_cds("GAS")
'X'
>>> translate_cds("CUN")
'L'
>>> translate_cds("AUGCGQ")
Traceback (most recent call last):
...
ValueError: Codon CGQ at position 4..6 is undefined in codon table
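
The examples with ambiguity codes are consistent with a simple rule: expand each ambiguous codon into every concrete codon it could stand for, and emit the amino acid only if all expansions agree, otherwise X. The sketch below implements that rule for the standard genetic code. It is one way to reproduce the documented outputs, not the library's implementation; the names are hypothetical and the selenocysteine table is not covered.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# IUPAC nucleotide codes and the concrete bases each one can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_codon_sketch(codon, start):
    """Translate one codon; ambiguous codons give X unless all expansions agree."""
    try:
        options = [IUPAC[base] for base in codon]
    except KeyError:
        raise ValueError(
            f"Codon {codon} at position {start + 1}..{start + 3} "
            "is undefined in codon table"
        ) from None
    amino_acids = {CODON_TABLE["".join(c)] for c in product(*options)}
    return amino_acids.pop() if len(amino_acids) == 1 else "X"


def translate_cds_sketch(seq, full_codons=True, ter_symbol="*"):
    """Translate DNA or RNA to 1-letter amino acids (standard table only)."""
    if seq is None:
        return None
    if full_codons and len(seq) % 3 != 0:
        raise ValueError("Sequence length must be a multiple of three")
    seq = seq.upper().replace("U", "T")  # treat RNA as DNA
    whole = len(seq) - len(seq) % 3      # length covered by complete codons
    protein = [translate_codon_sketch(seq[i:i + 3], i) for i in range(0, whole, 3)]
    if whole != len(seq):
        protein.append(ter_symbol)       # stand-in for the trailing partial codon
    return "".join(protein)


print(translate_cds_sketch("ATGCGA"))
print(translate_cds_sketch("TTNTA", full_codons=False))
```

Expanding ambiguous codons at call time keeps the table small; a precomputed table that already contains the ambiguous codons would trade memory for speed.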