bioutils.accessions module
Simple routines to deal with accessions, identifiers, etc.
Biocommons terminology: an identifier is composed of a namespace and an accession. The namespace is a string composed of any characters other than the colon (:). The accession is a string without character set restriction. An accession is expected to be unique within its namespace; there is no expectation that accessions are unique across namespaces.
Identifier := <Namespace, Accession>
Namespace := [^:]+
Accession := \w+
Some sample serializations of Identifiers:
json: {"namespace": "RefSeq", "accession": "NM_000551.3"}
xml: <Identifier namespace="RefSeq" accession="NM_000551.3"/>
string: "RefSeq:NM_000551.3"
The string form may be used as a CURIE, in which case the document in which the CURIE is used must contain a map of {namespace : uri}.
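The terminology above maps onto a small data structure. The following is a minimal sketch, assuming a hypothetical `Identifier` class and `parse_identifier` helper that are not part of bioutils. It splits the string form at the first colon and emits the JSON form shown above. The accession is not validated, since the prose states that it has no character set restriction.

```python
import json
from dataclasses import dataclass


# Hypothetical names for illustration; bioutils does not provide these.
@dataclass(frozen=True)
class Identifier:
    namespace: str
    accession: str

    def __str__(self) -> str:
        # String form: "<namespace>:<accession>"
        return f"{self.namespace}:{self.accession}"

    def to_json(self) -> str:
        return json.dumps({"namespace": self.namespace, "accession": self.accession})


def parse_identifier(s: str) -> Identifier:
    """Split "namespace:accession" at the first colon.

    The namespace cannot contain a colon, so everything after the first
    colon is treated as the accession.
    """
    namespace, sep, accession = s.partition(":")
    if not sep or not namespace or not accession:
        raise ValueError(f"Not a namespace:accession identifier: {s!r}")
    return Identifier(namespace, accession)


ident = parse_identifier("RefSeq:NM_000551.3")
print(ident.namespace, ident.accession)
print(ident.to_json())
```

Splitting at the first colon, rather than the last, follows from the rule that only the namespace is forbidden from containing a colon.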
- bioutils.accessions.chr22XY(c)
Reformats a chromosome name to the form chr1, …, chr22, chrX, chrY, etc.
- Parameters:
c (str or int) – A chromosome.
- Returns:
The reformatted chromosome.
- Return type:
str
Examples
>>> chr22XY('1')
'chr1'
>>> chr22XY(1)
'chr1'
>>> chr22XY('chr1')
'chr1'
>>> chr22XY(23)
'chrX'
>>> chr22XY(24)
'chrY'
>>> chr22XY("X")
'chrX'
>>> chr22XY("23")
'chrX'
>>> chr22XY("M")
'chrM'
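The behavior shown in these examples can be reproduced in a few lines. The sketch below is a hypothetical re-implementation inferred only from the examples, not the library's source; the name `chr22xy_sketch` is an assumption made for illustration.

```python
def chr22xy_sketch(c):
    """Hypothetical re-implementation inferred from the documented examples."""
    c = str(c)  # accept ints such as 1 or 23
    if c.startswith("chr"):
        c = c[3:]  # drop an existing prefix so it is not doubled
    # 23 and 24 are treated as numeric aliases for the sex chromosomes.
    c = {"23": "X", "24": "Y"}.get(c, c)
    return "chr" + c


for value in ("1", 1, "chr1", 23, 24, "X", "23", "M"):
    print(repr(value), "->", chr22xy_sketch(value))
```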
- bioutils.accessions.coerce_namespace(ac)
Prefixes accession with inferred namespace if not present.
Intended to be used to promote consistent and unambiguous accession identifiers.
- Parameters:
ac (str) – The accession, with or without namespace prefixed.
- Returns:
An identifier of the form “{namespace}:{accession}”
- Return type:
str
- Raises:
ValueError – if the accession syntax does not match the syntax of any namespace.
Examples
>>> coerce_namespace("refseq:NM_01234.5")
'refseq:NM_01234.5'
>>> coerce_namespace("NM_01234.5")
'refseq:NM_01234.5'
>>> coerce_namespace("bogus:QQ_01234.5")
'bogus:QQ_01234.5'
>>> coerce_namespace("QQ_01234.5")
Traceback (most recent call last):
...
ValueError: Could not infer namespace for QQ_01234.5
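The examples suggest a simple control flow: an accession that already carries a prefix is returned unchanged, even when the prefix is unknown, and inference is attempted only for bare accessions. The sketch below illustrates that flow under stated assumptions. `coerce_namespace_sketch` and `toy_infer` are hypothetical names, and `toy_infer` is a deliberately tiny stand-in for real namespace inference.

```python
def coerce_namespace_sketch(ac, infer):
    """Return ac unchanged if it has a namespace prefix; otherwise add the inferred one.

    `infer` is any callable that maps an accession to a namespace or None.
    """
    if ":" in ac:
        return ac  # already namespaced; the prefix itself is not validated
    ns = infer(ac)
    if ns is None:
        raise ValueError(f"Could not infer namespace for {ac}")
    return f"{ns}:{ac}"


def toy_infer(ac):
    # Hypothetical stand-in for inference: recognizes only an "NM_" prefix.
    return "refseq" if ac.startswith("NM_") else None


print(coerce_namespace_sketch("NM_01234.5", toy_infer))
print(coerce_namespace_sketch("bogus:QQ_01234.5", toy_infer))
```

Passing the inference function as a parameter keeps the sketch self-contained; in practice the library's own inference would take its place.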
- bioutils.accessions.infer_namespace(ac)
Infers a unique namespace from an accession, if one exists.
- Parameters:
ac (str) – An accession, without the namespace prefix.
- Returns:
The unique namespace corresponding to the accession syntax, if exactly one is inferred.
None if the accession syntax does not match any namespace.
- Return type:
str or None
- Raises:
BioutilsError – If multiple namespaces match the syntax of the accession.
Examples
>>> infer_namespace("ENST00000530893.6")
'ensembl'
>>> infer_namespace("NM_01234.5")
'refseq'
>>> infer_namespace("A2BC19")
'uniprot'
Disabled because Python 2 and 3 handle exceptions differently.
>>> infer_namespace("P12345")
Traceback (most recent call last):
...
bioutils.exceptions.BioutilsError: Multiple namespaces possible for P12345
>>> infer_namespace("BOGUS99") is None
True
- bioutils.accessions.infer_namespaces(ac)
Infers namespaces possible for a given accession, based on syntax.
- Parameters:
ac (str) – An accession, without the namespace prefix.
- Returns:
A list of namespaces matching the accession, possibly empty.
- Return type:
list of str
Examples
>>> infer_namespaces("ENST00000530893.6")
['ensembl']
>>> infer_namespaces("ENST00000530893")
['ensembl']
>>> infer_namespaces("ENSQ00000530893")
[]
>>> infer_namespaces("NM_01234")
['refseq']
>>> infer_namespaces("NM_01234.5")
['refseq']
>>> infer_namespaces("NQ_01234.5")
[]
>>> infer_namespaces("A2BC19")
['uniprot']
>>> sorted(infer_namespaces("P12345"))
['insdc', 'uniprot']
>>> infer_namespaces("A0A022YWF9")
['uniprot']
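Syntax-based inference of this kind can be sketched as a table of regular expressions, one per namespace, with every matching namespace reported. The patterns below are simplified and hypothetical, chosen only to reproduce the examples above; the patterns bioutils uses are more complete and may differ. The function names and `AmbiguousNamespaceError` (a stand-in for `BioutilsError`) are likewise assumptions.

```python
import re

# Simplified, hypothetical patterns for illustration only.
NAMESPACE_PATTERNS = {
    "ensembl": re.compile(r"^ENS[GTPE]\d{11}(?:\.\d+)?$"),
    "refseq": re.compile(r"^(?:N[CGMPR]|X[MPR])_\d+(?:\.\d+)?$"),
    "uniprot": re.compile(
        r"^(?:[OPQ]\d[A-Z0-9]{3}\d|[A-NR-Z]\d(?:[A-Z][A-Z0-9]{2}\d){1,2})$"
    ),
    "insdc": re.compile(r"^[A-Z]\d{5}(?:\.\d+)?$"),
}


class AmbiguousNamespaceError(Exception):
    """Hypothetical stand-in for bioutils.exceptions.BioutilsError."""


def infer_namespaces_sketch(ac):
    """Return every namespace whose pattern matches ac (possibly none)."""
    return [ns for ns, pattern in NAMESPACE_PATTERNS.items() if pattern.match(ac)]


def infer_namespace_sketch(ac):
    """Return the single matching namespace, None if none match, else raise."""
    matches = infer_namespaces_sketch(ac)
    if not matches:
        return None
    if len(matches) > 1:
        raise AmbiguousNamespaceError(f"Multiple namespaces possible for {ac}")
    return matches[0]


print(infer_namespaces_sketch("NM_01234.5"))
print(sorted(infer_namespaces_sketch("P12345")))
```

An accession such as P12345 is ambiguous because it satisfies both the six-character UniProt form and the one-letter, five-digit INSDC form, which is why the list-returning function exists alongside the unique one.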
- bioutils.accessions.prepend_chr(chr)
Prepends chromosome with ‘chr’ if not present.
Users are strongly discouraged from using this function. Adding a ‘chr’ prefix results in a name that is not consistent with authoritative assembly records.
- Parameters:
chr (str) – The chromosome.
- Returns:
The chromosome with ‘chr’ prepended.
- Return type:
str
Examples
>>> prepend_chr('22')
'chr22'
>>> prepend_chr('chr22')
'chr22'