bioutils.sequences module
Simple functions and lookup tables for nucleic acid and amino acid sequences.

bioutils.sequences.aa1_to_aa3(seq)
Converts a string of 1-letter amino acids to 3-letter amino acids.
Should only be used if the format of the sequence is known; otherwise use aa_to_aa3().
- Parameters
seq (str) – An amino acid sequence as 1-letter amino acids.
- Returns
The sequence as 3-letter amino acids.
- Return type
str
- Raises
KeyError – If the sequence is not of 1-letter amino acids.
Examples
>>> aa1_to_aa3("CATSARELAME")
'CysAlaThrSerAlaArgGluLeuAlaMetGlu'
>>> aa1_to_aa3(None)

bioutils.sequences.aa3_to_aa1(seq)
Converts a string of 3-letter amino acids to 1-letter amino acids.
Should only be used if the format of the sequence is known; otherwise use aa_to_aa1().
- Parameters
seq (str) – An amino acid sequence as 3-letter amino acids.
- Returns
The sequence as 1-letter amino acids.
- Return type
str
- Raises
KeyError – If the sequence is not of 3-letter amino acids.
Examples
>>> aa3_to_aa1("CysAlaThrSerAlaArgGluLeuAlaMetGlu")
'CATSARELAME'
>>> aa3_to_aa1(None)
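The two conversions above amount to table lookups, which can be sketched as follows. This is a minimal illustration, not the module's code: the names AA1_TO_AA3, one_to_three, and three_to_one are hypothetical, and the table covers only the 20 standard amino acids, whereas the lookup tables in bioutils.sequences may hold more entries.

```python
# Hypothetical lookup table limited to the 20 standard amino acids.
AA1_TO_AA3 = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
# Inverse table for the 3-letter to 1-letter direction.
AA3_TO_AA1 = {three: one for one, three in AA1_TO_AA3.items()}


def one_to_three(seq):
    """Map each 1-letter code to its 3-letter code; None passes through."""
    if seq is None:
        return None
    # An unknown letter raises KeyError, as the Raises field describes.
    return "".join(AA1_TO_AA3[aa] for aa in seq)


def three_to_one(seq):
    """Map each 3-character chunk to its 1-letter code; None passes through."""
    if seq is None:
        return None
    return "".join(AA3_TO_AA1[seq[i:i + 3]] for i in range(0, len(seq), 3))


print(one_to_three("CATSARELAME"))
print(three_to_one("CysAlaThrSerAlaArgGluLeuAlaMetGlu"))
```

Because the lookups are strict, input in the wrong format fails with a KeyError instead of being silently mangled, which is why the entries above point to aa_to_aa1() and aa_to_aa3() when the format is not known.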

bioutils.sequences.aa_to_aa1(seq)
Coerces a string of 1- or 3-letter amino acids to the 1-letter representation.
- Parameters
seq (str) – An amino acid sequence.
- Returns
The sequence as 1-letter amino acids.
- Return type
str
Examples
>>> aa_to_aa1("CATSARELAME")
'CATSARELAME'
>>> aa_to_aa1("CysAlaThrSerAlaArgGluLeuAlaMetGlu")
'CATSARELAME'
>>> aa_to_aa1(None)

bioutils.sequences.aa_to_aa3(seq)
Coerces a string of 1- or 3-letter amino acids to the 3-letter representation.
- Parameters
seq (str) – An amino acid sequence.
- Returns
The sequence as 3-letter amino acids.
- Return type
str
Examples
>>> aa_to_aa3("CATSARELAME")
'CysAlaThrSerAlaArgGluLeuAlaMetGlu'
>>> aa_to_aa3("CysAlaThrSerAlaArgGluLeuAlaMetGlu")
'CysAlaThrSerAlaArgGluLeuAlaMetGlu'
>>> aa_to_aa3(None)

bioutils.sequences.complement(seq)
Retrieves the complement of a sequence.
- Parameters
seq (str) – A nucleotide sequence.
- Returns
The complement of the sequence.
- Return type
str
Examples
>>> complement("ATCG")
'TAGC'
>>> complement(None)

bioutils.sequences.elide_sequence(s, flank=5, elision='...')
Trims the middle of the sequence, leaving the left and right flanks.
- Parameters
s (str) – A sequence.
flank (int, optional) – The length of each flank. Defaults to 5.
elision (str, optional) – The symbol used to represent the trimmed part. Defaults to '...'.
- Returns
The sequence with the middle replaced by elision.
- Return type
str
Examples
>>> elide_sequence("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
'ABCDE...VWXYZ'
>>> elide_sequence("ABCDEFGHIJKLMNOPQRSTUVWXYZ", flank=3)
'ABC...XYZ'
>>> elide_sequence("ABCDEFGHIJKLMNOPQRSTUVWXYZ", elision="..")
'ABCDE..VWXYZ'
>>> elide_sequence("ABCDEFGHIJKLMNOPQRSTUVWXYZ", flank=12)
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> elide_sequence("ABCDEFGHIJKLMNOPQRSTUVWXYZ", flank=12, elision=".")
'ABCDEFGHIJKL.OPQRSTUVWXYZ'
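The flank=12 examples suggest that the sequence is returned unchanged when the elided form would not be shorter than the original. A minimal sketch of that rule follows. The function name elide_sketch is hypothetical, and the handling of the equal-length case is an assumption, since no example above covers it.

```python
def elide_sketch(s, flank=5, elision="..."):
    """Keep `flank` characters on each side and replace the middle with `elision`."""
    elided_len = 2 * flank + len(elision)
    # Assumption: leave the sequence alone unless eliding makes it shorter.
    if len(s) <= elided_len:
        return s
    # Slice from len(s) - flank so that flank=0 yields an empty right flank.
    return s[:flank] + elision + s[len(s) - flank:]


alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print(elide_sketch(alphabet))
print(elide_sketch(alphabet, flank=12))
print(elide_sketch(alphabet, flank=12, elision="."))
```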

bioutils.sequences.looks_like_aa3_p(seq)
Indicates whether a string looks like a 3-letter AA string.
- Parameters
seq (str) – A sequence.
- Returns
Whether the string is of the format of a 3-letter AA string.
- Return type
bool
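The entry above does not say how the check works. The sketch below shows one plausible heuristic, offered as an assumption and not as the library's implementation: a 3-letter string has a length divisible by three, and its second character is lowercase (as in "Cys"). The name looks_like_three_letter is hypothetical.

```python
from string import ascii_lowercase


def looks_like_three_letter(seq):
    """Hypothetical heuristic: length divisible by 3 and a lowercase second character."""
    return (
        seq is not None
        and len(seq) % 3 == 0
        # An empty string trivially passes; otherwise inspect the second character.
        and (len(seq) == 0 or seq[1] in ascii_lowercase)
    )


print(looks_like_three_letter("CysAlaThr"))
print(looks_like_three_letter("CATSARELAME"))
print(looks_like_three_letter("CAT"))
```

A check of this kind is cheap but only inspects the start of the string, so it cannot validate that every chunk is a real amino acid code; that is left to the strict conversion functions.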

bioutils.sequences.normalize_sequence(seq)
Converts a sequence to a normalized representation for hashing.
Essentially, removes whitespace and asterisks, and uppercases the string.
- Parameters
seq (str) – The sequence to be normalized.
- Returns
The sequence as a string of uppercase letters.
- Return type
str
- Raises
RuntimeError – If the sequence contains non-alphabetic characters (besides ‘*’).
Examples
>>> normalize_sequence("ACGT")
'ACGT'
>>> normalize_sequence(" A C G T * ")
'ACGT'
>>> normalize_sequence("ACGT1")
Traceback (most recent call last):
...
RuntimeError: Normalized sequence contains non-alphabetic characters
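The normalization described above (strip whitespace and asterisks, uppercase, reject anything else that is not a letter) can be sketched with a regular expression. The function name normalize_sketch is hypothetical, and restricting the check to the ASCII letters A-Z is an assumption.

```python
import re


def normalize_sketch(seq):
    """Remove whitespace and '*', uppercase, and reject non-alphabetic characters."""
    nseq = re.sub(r"[\s*]", "", seq).upper()
    # Assumption: only the ASCII letters A-Z count as alphabetic here.
    if re.search(r"[^A-Z]", nseq):
        raise RuntimeError("Normalized sequence contains non-alphabetic characters")
    return nseq


print(normalize_sketch(" A C G T * "))
```

Normalizing before hashing means that two spellings of the same sequence, for example with different case or line wrapping, produce the same digest.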

bioutils.sequences.replace_t_to_u(seq)
Replaces the T’s in a sequence with U’s.
- Parameters
seq (str) – A nucleotide sequence.
- Returns
The sequence with the T’s replaced by U’s.
- Return type
str
Examples
>>> replace_t_to_u("ACGT")
'ACGU'
>>> replace_t_to_u(None)

bioutils.sequences.replace_u_to_t(seq)
Replaces the U’s in a sequence with T’s.
- Parameters
seq (str) – A nucleotide sequence.
- Returns
The sequence with the U’s replaced by T’s.
- Return type
str
Examples
>>> replace_u_to_t("ACGU")
'ACGT'
>>> replace_u_to_t(None)
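The two replacements above are inverses of each other, which a short round trip illustrates. This is a sketch with hypothetical names (t_to_u, u_to_t); preserving case for lowercase input is an assumption, since the examples above only show uppercase sequences.

```python
def t_to_u(seq):
    """Replace T with U (DNA to RNA alphabet); None passes through."""
    if seq is None:
        return None
    # Assumption: lowercase t/u are swapped as well, preserving case.
    return seq.replace("T", "U").replace("t", "u")


def u_to_t(seq):
    """Replace U with T (RNA to DNA alphabet); None passes through."""
    if seq is None:
        return None
    return seq.replace("U", "T").replace("u", "t")


rna = t_to_u("ACGT")
print(rna)
print(u_to_t(rna))
```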

bioutils.sequences.reverse_complement(seq)
Converts a sequence to its reverse complement.
- Parameters
seq (str) – A nucleotide sequence.
- Returns
The reverse complement of the sequence.
- Return type
str
Examples
>>> reverse_complement("ATCG")
'CGAT'
>>> reverse_complement(None)
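Complementing and reverse-complementing can be sketched with a translation table built by str.maketrans. The names below are hypothetical, and the table handles only A, C, G, and T in either case; whether the library also maps IUPAC ambiguity codes or U is not stated above, so those are left out.

```python
# Hypothetical translation table for the four unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement_sketch(seq):
    """Swap each base for its Watson-Crick partner; None passes through."""
    if seq is None:
        return None
    return seq.translate(_COMPLEMENT)


def reverse_complement_sketch(seq):
    """Complement the sequence, then reverse it; None passes through."""
    if seq is None:
        return None
    return complement_sketch(seq)[::-1]


print(complement_sketch("ATCG"))
print(reverse_complement_sketch("ATCG"))
```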

bioutils.sequences.translate_cds(seq, full_codons=True, ter_symbol='*')
Translates a DNA or RNA sequence into a single-letter amino acid sequence.
Uses the NCBI standard translation table.
- Parameters
seq (str) – A nucleotide sequence.
full_codons (bool, optional) – If True, requires the sequence length to be a multiple of 3 and raises an error otherwise. If False, ter_symbol is added as the last amino acid when the length is not a multiple of 3. This corresponds to biopython’s behavior of padding the last codon with Ns. Defaults to True.
ter_symbol (str, optional) – Placeholder for the last amino acid if the sequence length is not divisible by three and full_codons is False. Defaults to '*'.
- Returns
The corresponding single-letter amino acid sequence.
- Return type
str
- Raises
ValueError – If full_codons is True and the sequence length is not a multiple of three.
ValueError – If a codon is undefined in the table.
Examples
>>> translate_cds("ATGCGA")
'MR'
>>> translate_cds("AUGCGA")
'MR'
>>> translate_cds(None)
>>> translate_cds("")
''
>>> translate_cds("AUGCG")
Traceback (most recent call last):
...
ValueError: Sequence length must be a multiple of three
>>> translate_cds("AUGCG", full_codons=False)
'M*'
>>> translate_cds("ATGTAN")
'MX'
>>> translate_cds("CCN")
'P'
>>> translate_cds("TRA")
'*'
>>> translate_cds("TTNTA", full_codons=False)
'X*'
>>> translate_cds("CTB")
'L'
>>> translate_cds("AGM")
'X'
>>> translate_cds("GAS")
'X'
>>> translate_cds("CUN")
'L'
>>> translate_cds("AUGCGQ")
Traceback (most recent call last):
...
ValueError: Codon CGQ at position 4..6 is undefined in codon table
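The ambiguity-code examples above are consistent with a simple rule: expand each IUPAC code into the bases it stands for, translate every resulting codon, and report the amino acid if all of them agree or 'X' otherwise. The sketch below implements that rule under stated assumptions. The names CODON_TABLE, IUPAC, and translate_sketch are hypothetical, and the expansion strategy is inferred from the examples, not taken from the library's source.

```python
from itertools import product

# NCBI standard table (table 1), bases ordered T, C, A, G with the third base varying fastest.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}

# IUPAC nucleotide codes and the bases each one can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_sketch(seq, full_codons=True, ter_symbol="*"):
    """Translate DNA or RNA to 1-letter amino acids; ambiguous codons become 'X'."""
    if seq is None:
        return None
    seq = seq.upper().replace("U", "T")  # treat RNA as DNA
    remainder = len(seq) % 3
    if full_codons and remainder:
        raise ValueError("Sequence length must be a multiple of three")
    protein = []
    for i in range(0, len(seq) - remainder, 3):
        codon = seq[i:i + 3]
        try:
            choices = [IUPAC[base] for base in codon]
        except KeyError:
            raise ValueError(
                f"Codon {codon} at position {i + 1}..{i + 3} is undefined in codon table"
            ) from None
        # Translate every concrete codon the ambiguity codes allow.
        aas = {CODON_TABLE["".join(c)] for c in product(*choices)}
        protein.append(aas.pop() if len(aas) == 1 else "X")
    if remainder:
        protein.append(ter_symbol)  # placeholder for the trailing partial codon
    return "".join(protein)


print(translate_sketch("ATGCGA"))
print(translate_sketch("TTNTA", full_codons=False))
```

Expanding ambiguity codes keeps a call such as CCN unambiguous (every CCx codon encodes proline) while still flagging codons like AGM, whose expansions encode different amino acids.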