bioutils.sequences module

Simple functions and lookup tables for nucleic acid and amino acid sequences.

bioutils.sequences.aa1_to_aa3(seq)[source]

Converts string of 1-letter amino acids to 3-letter amino acids.

Should only be used if the format of the sequence is known; otherwise use aa_to_aa3().

Parameters

seq (str) – An amino acid sequence as 1-letter amino acids.

Returns

The sequence as 3-letter amino acids.

Return type

str

Raises

KeyError – If the sequence contains characters that are not 1-letter amino acid codes.

Examples

>>> aa1_to_aa3("CATSARELAME")
'CysAlaThrSerAlaArgGluLeuAlaMetGlu'
>>> aa1_to_aa3(None)
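
The conversion amounts to a per-character table lookup. The following is a minimal sketch of that idea, not the library's implementation: AA1_TO_AA3 and aa1_to_aa3_sketch are hypothetical names, and the table is assumed to cover only the 20 standard amino acids.

```python
# Hypothetical lookup table covering only the 20 standard amino acids;
# the library's own table may contain more entries.
AA1_TO_AA3 = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def aa1_to_aa3_sketch(seq):
    """Map each 1-letter code to its 3-letter code; None passes through."""
    if seq is None:
        return None
    # A character missing from the table raises KeyError.
    return "".join(AA1_TO_AA3[aa] for aa in seq)
```

Because the lookup is a plain dictionary access, any character outside the table surfaces as a KeyError, which matches the documented failure mode.
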
bioutils.sequences.aa3_to_aa1(seq)[source]

Converts string of 3-letter amino acids to 1-letter amino acids.

Should only be used if the format of the sequence is known; otherwise use aa_to_aa1().

Parameters

seq (str) – An amino acid sequence as 3-letter amino acids.

Returns

The sequence as 1-letter amino acids.

Return type

str

Raises

KeyError – If the sequence is not a string of 3-letter amino acid codes.

Examples

>>> aa3_to_aa1("CysAlaThrSerAlaArgGluLeuAlaMetGlu")
'CATSARELAME'
>>> aa3_to_aa1(None)
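
Going in this direction requires splitting the input into 3-character chunks before the lookup. A minimal sketch under the same assumptions as any illustrative table (hypothetical names AA3_TO_AA1 and aa3_to_aa1_sketch, 20 standard amino acids only) might look like this:

```python
# Hypothetical lookup table covering only the 20 standard amino acids.
AA3_TO_AA1 = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def aa3_to_aa1_sketch(seq):
    """Map each 3-letter code to its 1-letter code; None passes through."""
    if seq is None:
        return None
    # Split into consecutive 3-character chunks; a trailing partial chunk
    # or an unknown chunk raises KeyError on lookup.
    chunks = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return "".join(AA3_TO_AA1[chunk] for chunk in chunks)
```

Passing a 1-letter sequence to this sketch fails on the first chunk, which illustrates why the function should only be used when the input format is known.
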
bioutils.sequences.aa_to_aa1(seq)[source]

Coerces string of 1- or 3-letter amino acids to 1-letter representation.

Parameters

seq (str) – An amino acid sequence.

Returns

The sequence as 1-letter amino acids.

Return type

str

Examples

>>> aa_to_aa1("CATSARELAME")
'CATSARELAME'
>>> aa_to_aa1("CysAlaThrSerAlaArgGluLeuAlaMetGlu")
'CATSARELAME'
>>> aa_to_aa1(None)
bioutils.sequences.aa_to_aa3(seq)[source]

Coerces string of 1- or 3-letter amino acids to 3-letter representation.

Parameters

seq (str) – An amino acid sequence.

Returns

The sequence as 3-letter amino acids.

Return type

str

Examples

>>> aa_to_aa3("CATSARELAME")
'CysAlaThrSerAlaArgGluLeuAlaMetGlu'
>>> aa_to_aa3("CysAlaThrSerAlaArgGluLeuAlaMetGlu")
'CysAlaThrSerAlaArgGluLeuAlaMetGlu'
>>> aa_to_aa3(None)
bioutils.sequences.complement(seq)[source]

Retrieves the complement of a sequence.

Parameters

seq (str) – A nucleotide sequence.

Returns

The complement of the sequence.

Return type

str

Examples

>>> complement("ATCG")
'TAGC'
>>> complement(None)
bioutils.sequences.elide_sequence(s, flank=5, elision='...')[source]

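Replaces the middle of a sequence with an elision marker, keeping the left and right flanks. As the examples show, the sequence is returned unchanged when the elided form would be longer than the original.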

Parameters
  • s (str) – A sequence.

  • flank (int, optional) – The length of each flank. Defaults to five.

  • elision (str, optional) – The symbol used to represent the part trimmed. Defaults to '...'.

Returns

The sequence with the middle replaced by elision.

Return type

str

Examples

>>> elide_sequence("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
'ABCDE...VWXYZ'
>>> elide_sequence("ABCDEFGHIJKLMNOPQRSTUVWXYZ", flank=3)
'ABC...XYZ'
>>> elide_sequence("ABCDEFGHIJKLMNOPQRSTUVWXYZ", elision="..")
'ABCDE..VWXYZ'
>>> elide_sequence("ABCDEFGHIJKLMNOPQRSTUVWXYZ", flank=12)
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> elide_sequence("ABCDEFGHIJKLMNOPQRSTUVWXYZ", flank=12, elision=".")
'ABCDEFGHIJKL.OPQRSTUVWXYZ'
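
The examples above are consistent with a simple length check followed by slicing. The sketch below is one way to reproduce them; elide_sequence_sketch is a hypothetical name, and the rule "return the input unchanged when eliding would not shorten it" is an assumption inferred from the examples rather than a documented guarantee.

```python
def elide_sequence_sketch(s, flank=5, elision="..."):
    """Keep `flank` characters on each side and put `elision` between them."""
    elided_len = 2 * flank + len(elision)
    # Assumed rule: if eliding would not make the string shorter, keep it.
    if len(s) <= elided_len:
        return s
    # len(s) - flank is used instead of -flank so that flank=0 works.
    return s[:flank] + elision + s[len(s) - flank:]
```

With the 26-letter alphabet, flank=12 and the default elision would give 27 characters, so the input comes back untouched; with elision="." the result is 25 characters and the middle is replaced.
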
bioutils.sequences.looks_like_aa3_p(seq)[source]

Indicates whether a string looks like a 3-letter AA string.

Parameters

seq (str) – A sequence.

Returns

Whether the string has the format of a 3-letter amino acid string.

Return type

bool
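
The documentation does not spell out the heuristic. One plausible check, shown here purely as an assumption, is that the length is a multiple of three and the second character is lowercase (as in "Cys" or "Ala"). The name looks_like_aa3_sketch is hypothetical.

```python
def looks_like_aa3_sketch(seq):
    """Guess whether `seq` is written in 3-letter amino acid codes."""
    if seq is None:
        return False
    # 3-letter codes always give a length divisible by three.
    if len(seq) % 3 != 0:
        return False
    # Assumed convention: codes are capitalized like "Cys", so the second
    # character is lowercase. An empty string is accepted trivially.
    return len(seq) == 0 or seq[1].islower()
```

A check like this is cheap but only a heuristic: it inspects one character and does not validate that every chunk is a real amino acid code.
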

bioutils.sequences.normalize_sequence(seq)[source]

Converts sequence to normalized representation for hashing.

Essentially, removes whitespace and asterisks, and uppercases the string.

Parameters

seq (str) – The sequence to be normalized.

Returns

The sequence as a string of uppercase letters.

Return type

str

Raises

RuntimeError – If the sequence contains non-alphabetic characters other than whitespace and ‘*’.

Examples

>>> normalize_sequence("ACGT")
'ACGT'
>>> normalize_sequence("  A C G T * ")
'ACGT'
>>> normalize_sequence("ACGT1")
Traceback (most recent call last):
...
RuntimeError: Normalized sequence contains non-alphabetic characters
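
The described behavior (strip whitespace and asterisks, uppercase, then reject anything non-alphabetic) can be sketched with two regular expressions. This is an illustration under those stated rules, not the library's code; normalize_sequence_sketch is a hypothetical name.

```python
import re


def normalize_sequence_sketch(seq):
    """Strip whitespace and '*', uppercase, and reject other characters."""
    # Remove every whitespace character and asterisk, then uppercase.
    nseq = re.sub(r"[\s*]", "", seq).upper()
    # Anything left that is not A-Z is treated as an error.
    if re.search(r"[^A-Z]", nseq):
        raise RuntimeError("Normalized sequence contains non-alphabetic characters")
    return nseq
```

Normalizing before hashing means that differences in case, spacing, or a trailing stop marker do not change the resulting digest.
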
bioutils.sequences.replace_t_to_u(seq)[source]

Replaces the T’s in a sequence with U’s.

Parameters

seq (str) – A nucleotide sequence.

Returns

The sequence with the T’s replaced by U’s.

Return type

str

Examples

>>> replace_t_to_u("ACGT")
'ACGU'
>>> replace_t_to_u(None)
bioutils.sequences.replace_u_to_t(seq)[source]

Replaces the U’s in a sequence with T’s.

Parameters

seq (str) – A nucleotide sequence.

Returns

The sequence with the U’s replaced by T’s.

Return type

str

Examples

>>> replace_u_to_t("ACGU")
'ACGT'
>>> replace_u_to_t(None)
bioutils.sequences.reverse_complement(seq)[source]

Converts a sequence to its reverse complement.

Parameters

seq (str) – A nucleotide sequence.

Returns

The reverse complement of the sequence.

Return type

str

Examples

>>> reverse_complement("ATCG")
'CGAT'
>>> reverse_complement(None)
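
A reverse complement is the complement read in the opposite direction. The sketch below shows the idea with a translation table; reverse_complement_sketch is a hypothetical name, and handling only A, C, G, and T (in either case) is an assumption, since the documentation does not say how other symbols are treated.

```python
# Assumed complement mapping for unambiguous DNA bases only.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_sketch(seq):
    """Complement each base, then reverse the string; None passes through."""
    if seq is None:
        return None
    return seq.translate(_COMPLEMENT)[::-1]
```

Characters that are not in the table are left as they are by str.translate, so this sketch silently passes through anything it does not recognize.
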
bioutils.sequences.translate_cds(seq, full_codons=True, ter_symbol='*')[source]

Translates a DNA or RNA sequence into a single-letter amino acid sequence.

Uses the NCBI standard translation table.

Parameters
  • seq (str) – A nucleotide sequence.

  • full_codons (bool, optional) – If True, requires the sequence length to be a multiple of 3 and raises an error otherwise. If False, ter_symbol is appended as the last amino acid when the length is not a multiple of 3. This corresponds to biopython’s behavior of padding the last codon with Ns. Defaults to True.

  • ter_symbol (str, optional) – Placeholder for the last amino acid if sequence length is not divisible by three and full_codons is False. Defaults to '*'.

Returns

The corresponding single letter amino acid sequence.

Return type

str

Raises
  • ValueError – If full_codons is True and the sequence length is not a multiple of three.

  • ValueError – If a codon is undefined in the table.

Examples

>>> translate_cds("ATGCGA")
'MR'
>>> translate_cds("AUGCGA")
'MR'
>>> translate_cds(None)
>>> translate_cds("")
''
>>> translate_cds("AUGCG")
Traceback (most recent call last):
...
ValueError: Sequence length must be a multiple of three
>>> translate_cds("AUGCG", full_codons=False)
'M*'
>>> translate_cds("ATGTAN")
'MX'
>>> translate_cds("CCN")
'P'
>>> translate_cds("TRA")
'*'
>>> translate_cds("TTNTA", full_codons=False)
'X*'
>>> translate_cds("CTB")
'L'
>>> translate_cds("AGM")
'X'
>>> translate_cds("GAS")
'X'
>>> translate_cds("CUN")
'L'
>>> translate_cds("AUGCGQ")
Traceback (most recent call last):
...
ValueError: Codon CGQ at position 4..6 is undefined in codon table
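
The ambiguity-code results above ('CCN' gives 'P', 'AGM' gives 'X') are consistent with expanding each ambiguous codon into every concrete codon it could represent and reporting an amino acid only when all of them agree. The sketch below illustrates that approach; it is an assumption about the logic, not the library's implementation, and the names CODON_TABLE, IUPAC, translate_codon, and translate_cds_sketch are hypothetical. Its error message omits the position reported by the real function.

```python
from itertools import product

# NCBI standard table (table 1), codons ordered with bases T, C, A, G.
_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), _AAS)}

# IUPAC nucleotide codes expanded to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_codon(codon):
    """Return the amino acid for a codon, or 'X' if the codon is ambiguous."""
    try:
        options = [IUPAC[base] for base in codon]
    except KeyError:
        raise ValueError(f"Codon {codon} is undefined in codon table") from None
    # Translate every concrete codon the ambiguous one could stand for.
    aas = {CODON_TABLE["".join(c)] for c in product(*options)}
    return aas.pop() if len(aas) == 1 else "X"


def translate_cds_sketch(seq, full_codons=True, ter_symbol="*"):
    """Translate DNA or RNA to 1-letter amino acids; None passes through."""
    if seq is None:
        return None
    if full_codons and len(seq) % 3 != 0:
        raise ValueError("Sequence length must be a multiple of three")
    # Treat RNA as DNA; uppercasing the input is an assumption of this sketch.
    seq = seq.upper().replace("U", "T")
    whole = len(seq) - len(seq) % 3
    protein = [translate_codon(seq[i:i + 3]) for i in range(0, whole, 3)]
    if whole != len(seq):
        # A trailing partial codon is represented by ter_symbol.
        protein.append(ter_symbol)
    return "".join(protein)
```

Expanding ambiguity codes keeps the codon table itself at 64 entries; the trade-off is up to 64 lookups for a fully ambiguous codon such as 'NNN'.
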