bioutils.sequences module
Simple functions and lookup tables for nucleic acid and amino acid sequences.
- class bioutils.sequences.TranslationTable(value)
Bases: StrEnum
An enum that selects between the standard and selenocysteine translation tables.
- selenocysteine = 'sec'
- standard = 'standard'
- bioutils.sequences.aa1_to_aa3(seq)
Converts a string of 1-letter amino acids to 3-letter amino acids.
Should only be used if the format of the sequence is known; otherwise use aa_to_aa3().
- Parameters:
seq (str) – An amino acid sequence as 1-letter amino acids.
- Returns:
The sequence as 3-letter amino acids.
- Return type:
str
- Raises:
KeyError – If the sequence contains characters that are not 1-letter amino acid codes.
Examples
>>> aa1_to_aa3("CATSARELAME")
'CysAlaThrSerAlaArgGluLeuAlaMetGlu'
>>> aa1_to_aa3(None)
- bioutils.sequences.aa3_to_aa1(seq)
Converts a string of 3-letter amino acids to 1-letter amino acids.
Should only be used if the format of the sequence is known; otherwise use aa_to_aa1().
- Parameters:
seq (str) – An amino acid sequence as 3-letter amino acids.
- Returns:
The sequence as 1-letter amino acids.
- Return type:
str
- Raises:
KeyError – If the sequence contains chunks that are not 3-letter amino acid codes.
Examples
>>> aa3_to_aa1("CysAlaThrSerAlaArgGluLeuAlaMetGlu")
'CATSARELAME'
>>> aa3_to_aa1(None)
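The two conversions above amount to table lookups. The following is a minimal sketch of that idea, not the library's implementation: the names AA1_TO_AA3, AA3_TO_AA1, sketch_aa1_to_aa3 and sketch_aa3_to_aa1 are hypothetical, and the table covers only the 20 standard amino acids, whereas the module's own lookup tables may contain more entries.

```python
# Illustrative lookup tables (hypothetical names, 20 standard amino acids only).
AA1_TO_AA3 = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
AA3_TO_AA1 = {aa3: aa1 for aa1, aa3 in AA1_TO_AA3.items()}


def sketch_aa1_to_aa3(seq):
    """Map each 1-letter code to its 3-letter code; None passes through."""
    if seq is None:
        return None
    # An unknown character raises KeyError, mirroring the documented behavior.
    return "".join(AA1_TO_AA3[aa] for aa in seq)


def sketch_aa3_to_aa1(seq):
    """Split into 3-character chunks and map each to its 1-letter code."""
    if seq is None:
        return None
    # An unknown chunk raises KeyError, mirroring the documented behavior.
    return "".join(AA3_TO_AA1[seq[i:i + 3]] for i in range(0, len(seq), 3))
```

Because the lookups are strict, passing a sequence in the wrong format fails with KeyError, which is why the coercing variants aa_to_aa1() and aa_to_aa3() are recommended when the format is not known.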
- bioutils.sequences.aa_to_aa1(seq)
Coerces a string of 1- or 3-letter amino acids to the 1-letter representation.
- Parameters:
seq (str) – An amino acid sequence.
- Returns:
The sequence as 1-letter amino acids.
- Return type:
str
Examples
>>> aa_to_aa1("CATSARELAME")
'CATSARELAME'
>>> aa_to_aa1("CysAlaThrSerAlaArgGluLeuAlaMetGlu")
'CATSARELAME'
>>> aa_to_aa1(None)
- bioutils.sequences.aa_to_aa3(seq)
Coerces a string of 1- or 3-letter amino acids to the 3-letter representation.
- Parameters:
seq (str) – An amino acid sequence.
- Returns:
The sequence as 3-letter amino acids.
- Return type:
str
Examples
>>> aa_to_aa3("CATSARELAME")
'CysAlaThrSerAlaArgGluLeuAlaMetGlu'
>>> aa_to_aa3("CysAlaThrSerAlaArgGluLeuAlaMetGlu")
'CysAlaThrSerAlaArgGluLeuAlaMetGlu'
>>> aa_to_aa3(None)
- bioutils.sequences.complement(seq)
Retrieves the complement of a sequence.
- Parameters:
seq (str) – A nucleotide sequence.
- Returns:
The complement of the sequence.
- Return type:
str
Examples
>>> complement("ATCG")
'TAGC'
>>> complement(None)
- bioutils.sequences.elide_sequence(s, flank=5, elision='...')
Trims the middle of the sequence, leaving the right and left flanks.
- Parameters:
s (str) – A sequence.
flank (int, optional) – The length of each flank. Defaults to five.
elision (str, optional) – The symbol used to represent the trimmed part. Defaults to '...'.
- Returns:
The sequence with the middle replaced by elision.
- Return type:
str
Examples
>>> elide_sequence("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
'ABCDE...VWXYZ'
>>> elide_sequence("ABCDEFGHIJKLMNOPQRSTUVWXYZ", flank=3)
'ABC...XYZ'
>>> elide_sequence("ABCDEFGHIJKLMNOPQRSTUVWXYZ", elision="..")
'ABCDE..VWXYZ'
>>> elide_sequence("ABCDEFGHIJKLMNOPQRSTUVWXYZ", flank=12)
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> elide_sequence("ABCDEFGHIJKLMNOPQRSTUVWXYZ", flank=12, elision=".")
'ABCDEFGHIJKL.OPQRSTUVWXYZ'
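The examples above suggest that the sequence is returned unchanged whenever eliding would not make it shorter. A minimal sketch consistent with those five examples follows; the function name sketch_elide and the exact length rule are assumptions inferred from the examples, not taken from the library source.

```python
def sketch_elide(s, flank=5, elision="..."):
    """Keep `flank` characters on each side and replace the middle with `elision`.

    Assumption: if the elided form would be no shorter than the input,
    the input is returned unchanged.
    """
    if len(s) <= 2 * flank + len(elision):
        return s
    # Slice from an explicit index so that flank=0 yields an empty right flank.
    return s[:flank] + elision + s[len(s) - flank:]
```

With flank=12 and the default three-character elision, the elided form would be 27 characters for a 26-character input, so the input comes back as is. A one-character elision gives 25 characters, so the middle is trimmed.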
- bioutils.sequences.looks_like_aa3_p(seq)
Indicates whether a string looks like a 3-letter AA string.
- Parameters:
seq (str) – A sequence.
- Returns:
Whether the string is of the format of a 3-letter AA string.
- Return type:
bool
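The reference does not state which test is applied. One plausible heuristic, shown here purely as an assumption, is that a 3-letter AA string has a length divisible by three and a lowercase second character (as in "Cys"). The name sketch_looks_like_aa3 is hypothetical, and the library's actual rule may differ.

```python
from string import ascii_lowercase


def sketch_looks_like_aa3(seq):
    """Guess whether `seq` is written with 3-letter amino acid codes.

    Assumed heuristic: length is a multiple of 3 and the second character
    is lowercase. An empty string trivially qualifies; None does not.
    """
    if seq is None or len(seq) % 3 != 0:
        return False
    return len(seq) == 0 or seq[1] in ascii_lowercase
```

A check of this kind inspects only the shape of the string and does not validate that each chunk is a real amino acid code.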
- bioutils.sequences.normalize_sequence(seq)
Converts sequence to normalized representation for hashing.
Essentially, removes whitespace and asterisks, and uppercases the string.
- Parameters:
seq (str) – The sequence to be normalized.
- Returns:
The sequence as a string of uppercase letters.
- Return type:
str
- Raises:
RuntimeError – If the sequence contains non-alphabetic characters other than whitespace and '*'.
Examples
>>> normalize_sequence("ACGT")
'ACGT'
>>> normalize_sequence(" A C G T * ")
'ACGT'
>>> normalize_sequence("ACGT1")
Traceback (most recent call last):
    ...
RuntimeError: Normalized sequence contains non-alphabetic characters
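The normalization described above (strip whitespace and asterisks, uppercase, then reject anything non-alphabetic) can be sketched with a regular expression. This is an illustration under the documented behavior only; sketch_normalize is a hypothetical name, and restricting "alphabetic" to A-Z is an assumption.

```python
import re


def sketch_normalize(seq):
    """Remove whitespace and '*', uppercase, and reject non-letters."""
    normalized = re.sub(r"[\s*]", "", seq).upper()
    # Assumption: only the letters A-Z count as alphabetic here.
    if re.search(r"[^A-Z]", normalized):
        raise RuntimeError(
            "Normalized sequence contains non-alphabetic characters"
        )
    return normalized
```

Normalizing before hashing means that differences in case, spacing, or a trailing stop symbol do not produce different digests for the same sequence.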
- bioutils.sequences.replace_t_to_u(seq)
Replaces the T’s in a sequence with U’s.
- Parameters:
seq (str) – A nucleotide sequence.
- Returns:
The sequence with the T’s replaced by U’s.
- Return type:
str
Examples
>>> replace_t_to_u("ACGT")
'ACGU'
>>> replace_t_to_u(None)
- bioutils.sequences.replace_u_to_t(seq)
Replaces the U’s in a sequence with T’s.
- Parameters:
seq (str) – A nucleotide sequence.
- Returns:
The sequence with the U’s replaced by T’s.
- Return type:
str
Examples
>>> replace_u_to_t("ACGU")
'ACGT'
>>> replace_u_to_t(None)
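Both replacement functions are simple character substitutions that pass None through. A minimal sketch follows; the names sketch_t_to_u and sketch_u_to_t are hypothetical, and the handling of lowercase letters is an assumption that the examples above do not confirm.

```python
def sketch_t_to_u(seq):
    """Replace T with U (DNA to RNA alphabet); None passes through."""
    if seq is None:
        return None
    # Lowercase handling is an assumption, not documented behavior.
    return seq.replace("T", "U").replace("t", "u")


def sketch_u_to_t(seq):
    """Replace U with T (RNA to DNA alphabet); None passes through."""
    if seq is None:
        return None
    return seq.replace("U", "T").replace("u", "t")
```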
- bioutils.sequences.reverse_complement(seq)
Converts a sequence to its reverse complement.
- Parameters:
seq (str) – A nucleotide sequence.
- Returns:
The reverse complement of the sequence.
- Return type:
str
Examples
>>> reverse_complement("ATCG")
'CGAT'
>>> reverse_complement(None)
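Complementing and reverse-complementing can be illustrated with a translation table built by str.maketrans. This sketch handles only A, C, G and T in either case, which is an assumption; the library may also map U or IUPAC ambiguity codes. The names COMPLEMENT_TABLE, sketch_complement and sketch_reverse_complement are hypothetical.

```python
# Base-pairing map for unambiguous DNA bases (assumed scope: ACGT only).
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def sketch_complement(seq):
    """Swap each base for its pairing partner; None passes through."""
    if seq is None:
        return None
    return seq.translate(COMPLEMENT_TABLE)


def sketch_reverse_complement(seq):
    """Complement the sequence, then reverse it; None passes through."""
    if seq is None:
        return None
    return seq.translate(COMPLEMENT_TABLE)[::-1]
```

Applying the reverse complement twice returns the original sequence, which makes a convenient sanity check.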
- bioutils.sequences.translate_cds(seq, full_codons=True, ter_symbol='*', translation_table=TranslationTable.standard)
Translates a DNA or RNA sequence into a single-letter amino acid sequence.
- Parameters:
seq (str) – A nucleotide sequence.
full_codons (bool, optional) – If True, requires the sequence length to be a multiple of 3 and raises an error otherwise. If False, ter_symbol is added as the last amino acid when a trailing partial codon remains. This corresponds to Biopython's behavior of padding the last codon with Ns. Defaults to True.
ter_symbol (str, optional) – Placeholder for the last amino acid if the sequence length is not divisible by three and full_codons is False. Defaults to '*'.
translation_table (TranslationTable, optional) – One of the TranslationTable options, indicating which codon-to-amino-acid translation table to use. Defaults to the standard translation table for humans. To enable translation of selenoproteins, use TranslationTable.selenocysteine.
- Returns:
The corresponding single letter amino acid sequence.
- Return type:
str
- Raises:
ValueError – If full_codons is True and the sequence length is not a multiple of three.
ValueError – If a codon is undefined in the table.
Examples
>>> translate_cds("ATGCGA")
'MR'
>>> translate_cds("AUGCGA")
'MR'
>>> translate_cds(None)
>>> translate_cds("")
''
>>> translate_cds("AUGCG")
Traceback (most recent call last):
    ...
ValueError: Sequence length must be a multiple of three
>>> translate_cds("AUGCG", full_codons=False)
'M*'
>>> translate_cds("ATGTAN")
'MX'
>>> translate_cds("CCN")
'P'
>>> translate_cds("TRA")
'*'
>>> translate_cds("TTNTA", full_codons=False)
'X*'
>>> translate_cds("CTB")
'L'
>>> translate_cds("AGM")
'X'
>>> translate_cds("GAS")
'X'
>>> translate_cds("CUN")
'L'
>>> translate_cds("AUGCGQ")
Traceback (most recent call last):
    ...
ValueError: Codon CGQ at position 4..6 is undefined in codon table
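The examples above show that an ambiguous codon yields a specific amino acid when every base it could stand for gives the same result, and 'X' otherwise. One way to reproduce that behavior is sketched below, assuming the standard genetic code and IUPAC nucleotide codes. It is not the library's implementation and does not cover the selenocysteine table; sketch_translate and the table names are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# IUPAC nucleotide codes expanded to the bases they can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def sketch_translate(seq, full_codons=True, ter_symbol="*"):
    """Translate DNA or RNA to 1-letter amino acids (standard table only)."""
    if seq is None:
        return None
    seq = seq.upper().replace("U", "T")  # treat RNA as DNA
    remainder = len(seq) % 3
    if full_codons and remainder:
        raise ValueError("Sequence length must be a multiple of three")

    protein = []
    for i in range(0, len(seq) - remainder, 3):
        codon = seq[i:i + 3]
        try:
            choices = [IUPAC[base] for base in codon]
        except KeyError:
            raise ValueError(
                f"Codon {codon} at position {i + 1}..{i + 3} "
                "is undefined in codon table"
            ) from None
        # Translate every concrete codon the ambiguous codon could represent.
        amino_acids = {CODON_TABLE["".join(c)] for c in product(*choices)}
        protein.append(amino_acids.pop() if len(amino_acids) == 1 else "X")

    if remainder:
        # Trailing partial codon: emit the placeholder symbol.
        protein.append(ter_symbol)
    return "".join(protein)
```

Expanding each ambiguous codon at call time keeps the sketch short. A precomputed table that includes the ambiguous codons would avoid the repeated expansion.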