bioutils.accessions module

Simple routines to deal with accessions, identifiers, etc.

Biocommons terminology: an identifier is composed of a namespace and an accession. The namespace is a non-empty string of any characters other than a colon (:). The accession is a string without character set restriction. An accession is expected to be unique within its namespace; accessions are not expected to be unique across namespaces.

Identifier := <Namespace, Accession>

Namespace := [^:]+

Accession := \w+

Some sample serializations of Identifiers:

json: {"namespace": "RefSeq", "accession": "NM_000551.3"}

xml: <Identifier namespace="RefSeq" accession="NM_000551.3"/>

string: "RefSeq:NM_000551.3"

The string form may be used as a CURIE, in which case the document in which the CURIE is used must contain a map of {namespace : uri}.
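As a rough illustration of the string form, the sketch below splits an identifier at the first colon, serializes it to JSON, and expands it as a CURIE. The names `Identifier`, `parse_identifier`, `to_json`, and `expand_curie` are hypothetical and not part of bioutils, and the `example.org` URI is a placeholder. Because a namespace cannot contain a colon, everything after the first colon is treated as the accession.

```python
import json
from typing import Dict, NamedTuple


class Identifier(NamedTuple):
    """Hypothetical container for a <Namespace, Accession> pair."""
    namespace: str
    accession: str


def parse_identifier(s: str) -> Identifier:
    """Split a 'namespace:accession' string at the first colon."""
    namespace, sep, accession = s.partition(":")
    if not sep or not namespace or not accession:
        raise ValueError(f"Not a 'namespace:accession' string: {s!r}")
    return Identifier(namespace, accession)


def to_json(ident: Identifier) -> str:
    """Serialize in the JSON form shown above."""
    return json.dumps({"namespace": ident.namespace, "accession": ident.accession})


def expand_curie(curie: str, prefix_map: Dict[str, str]) -> str:
    """Expand a CURIE using a {namespace: uri} map supplied by the document."""
    ident = parse_identifier(curie)
    return prefix_map[ident.namespace] + ident.accession


ident = parse_identifier("RefSeq:NM_000551.3")
print(to_json(ident))
# Placeholder URI prefix, chosen only for illustration.
print(expand_curie("RefSeq:NM_000551.3", {"RefSeq": "https://example.org/refseq/"}))
```

A string with no colon, such as "NM_000551.3", raises ValueError in this sketch, since it carries no namespace.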

bioutils.accessions.chr22XY(c)

Reformats chromosome to be of the form chr1, …, chr22, chrX, chrY, etc.

Parameters

c (str or int) – A chromosome.

Returns

The reformatted chromosome.

Return type

str

Examples

>>> chr22XY('1')
'chr1'
>>> chr22XY(1)
'chr1'
>>> chr22XY('chr1')
'chr1'
>>> chr22XY(23)
'chrX'
>>> chr22XY(24)
'chrY'
>>> chr22XY("X")
'chrX'
>>> chr22XY("23")
'chrX'
>>> chr22XY("M")
'chrM'
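The behavior shown in these examples can be approximated as follows. This is a minimal sketch inferred from the documented inputs and outputs only, not the library's implementation, and `chr22xy_sketch` is a hypothetical name.

```python
def chr22xy_sketch(c) -> str:
    """Normalize a chromosome name to 'chr'-prefixed form, mapping 23/24 to X/Y."""
    c = str(c)                  # accept ints such as 1 or 23
    if c.startswith("chr"):
        c = c[3:]               # drop an existing prefix so it is not doubled
    if c == "23":
        c = "X"
    elif c == "24":
        c = "Y"
    return "chr" + c


for value in ["1", 1, "chr1", 23, 24, "X", "23", "M"]:
    print(repr(value), "->", chr22xy_sketch(value))
```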

bioutils.accessions.coerce_namespace(ac)

Prefixes accession with inferred namespace if not present.

Intended to be used to promote consistent and unambiguous accession identifiers.

Parameters

ac (str) – The accession, with or without namespace prefixed.

Returns

An identifier of the form “{namespace}:{accession}”

Return type

str

Raises

ValueError – If the accession syntax does not match the syntax of any namespace.

Examples

>>> coerce_namespace("refseq:NM_01234.5")
'refseq:NM_01234.5'
>>> coerce_namespace("NM_01234.5")
'refseq:NM_01234.5'
>>> coerce_namespace("bogus:QQ_01234.5")
'bogus:QQ_01234.5'
>>> coerce_namespace("QQ_01234.5")
Traceback (most recent call last):
...
ValueError: Could not infer namespace for QQ_01234.5
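The control flow these examples imply can be sketched as below: an accession that already contains a colon is returned unchanged, otherwise a namespace is inferred and prefixed. To keep the sketch self-contained, the inference step is passed in as a callable; `coerce_namespace_sketch` and `toy_infer` are hypothetical names, and `toy_infer` is a deliberately trivial stand-in for real syntax-based inference.

```python
from typing import Callable, Optional


def coerce_namespace_sketch(ac: str, infer: Callable[[str], Optional[str]]) -> str:
    """Return ac unchanged if it has a namespace, else prefix the inferred one."""
    if ":" in ac:
        return ac                       # already namespaced, even if unknown
    namespace = infer(ac)
    if namespace is None:
        raise ValueError(f"Could not infer namespace for {ac}")
    return f"{namespace}:{ac}"


def toy_infer(ac: str) -> Optional[str]:
    """Toy inference for illustration: recognizes only 'NM_' accessions."""
    return "refseq" if ac.startswith("NM_") else None


print(coerce_namespace_sketch("NM_01234.5", toy_infer))
print(coerce_namespace_sketch("bogus:QQ_01234.5", toy_infer))
```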

bioutils.accessions.infer_namespace(ac)

Infers a unique namespace from an accession, if one exists.

Parameters

ac (str) – An accession, without the namespace prefix.

Returns

The unique namespace corresponding to accession syntax, if only one is inferred.

None if the accession syntax does not match any namespace.

Return type

str or None

Raises

BioutilsError – If multiple namespaces match the syntax of the accession.

Examples

>>> infer_namespace("ENST00000530893.6")
'ensembl'
>>> infer_namespace("NM_01234.5")
'refseq'
>>> infer_namespace("A2BC19")
'uniprot'

The following example is disabled as a doctest because Python 2 and 3 handle exceptions differently.

>>> infer_namespace("P12345")  
Traceback (most recent call last):
...
bioutils.exceptions.BioutilsError: Multiple namespaces possible for P12345
>>> infer_namespace("BOGUS99") is None
True
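The three documented outcomes (one match, no match, several matches) amount to reducing a list of candidate namespaces to at most one. A minimal sketch of that reduction follows. It assumes the candidates have already been computed, and `unique_namespace` and `AmbiguousNamespaceError` are hypothetical names, the latter standing in for BioutilsError.

```python
from typing import List, Optional


class AmbiguousNamespaceError(Exception):
    """Hypothetical stand-in for bioutils.exceptions.BioutilsError."""


def unique_namespace(ac: str, candidates: List[str]) -> Optional[str]:
    """Reduce candidate namespaces to one, None, or an error."""
    if not candidates:
        return None                     # syntax matched no namespace
    if len(candidates) > 1:
        raise AmbiguousNamespaceError(f"Multiple namespaces possible for {ac}")
    return candidates[0]


print(unique_namespace("NM_01234.5", ["refseq"]))
print(unique_namespace("BOGUS99", []))
```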

bioutils.accessions.infer_namespaces(ac)

Infers namespaces possible for a given accession, based on syntax.

Parameters

ac (str) – An accession, without the namespace prefix.

Returns

A list of namespaces matching the accession, possibly empty.

Return type

list of str

Examples

>>> infer_namespaces("ENST00000530893.6")
['ensembl']
>>> infer_namespaces("ENST00000530893")
['ensembl']
>>> infer_namespaces("ENSQ00000530893")
[]
>>> infer_namespaces("NM_01234")
['refseq']
>>> infer_namespaces("NM_01234.5")
['refseq']
>>> infer_namespaces("NQ_01234.5")
[]
>>> infer_namespaces("A2BC19")
['uniprot']
>>> sorted(infer_namespaces("P12345"))
['insdc', 'uniprot']
>>> infer_namespaces("A0A022YWF9")
['uniprot']
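Syntax-based inference of this kind can be sketched as a table of regular expressions keyed by namespace. The patterns below are illustrative assumptions chosen to reproduce the examples above; the expressions bioutils actually uses may differ, and `NAMESPACE_PATTERNS` and `infer_namespaces_sketch` are hypothetical names.

```python
import re
from typing import List

# Illustrative patterns only; these are assumptions, not the library's table.
NAMESPACE_PATTERNS = {
    "ensembl": re.compile(r"^ENS[A-Z]*[EGPRT]\d+(?:\.\d+)?$"),
    "refseq": re.compile(r"^N[CGMPRTW]_\d+(?:\.\d+)?$"),
    "uniprot": re.compile(
        r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
    ),
    "insdc": re.compile(r"^[A-Z]\d{5}(?:\.\d+)?$"),
}


def infer_namespaces_sketch(ac: str) -> List[str]:
    """Return every namespace whose pattern matches the accession."""
    return [ns for ns, pattern in NAMESPACE_PATTERNS.items() if pattern.match(ac)]


for ac in ["ENST00000530893.6", "NM_01234.5", "A2BC19", "P12345", "NQ_01234.5"]:
    print(ac, sorted(infer_namespaces_sketch(ac)))
```

An accession such as "P12345" matches more than one pattern in this sketch, which is why the result is a list rather than a single value.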

bioutils.accessions.prepend_chr(chr)

Prepends chromosome with ‘chr’ if not present.

Users are strongly discouraged from using this function. Adding a ‘chr’ prefix results in a name that is not consistent with authoritative assembly records.

Parameters

chr (str) – The chromosome.

Returns

The chromosome with ‘chr’ prepended.

Return type

str

Examples

>>> prepend_chr('22')
'chr22'
>>> prepend_chr('chr22')
'chr22'

bioutils.accessions.strip_chr(chr)

Removes the ‘chr’ prefix if present.

Parameters

chr (str) – The chromosome.

Returns

The chromosome without a ‘chr’ prefix.

Return type

str

Examples

>>> strip_chr('22')
'22'
>>> strip_chr('chr22')
'22'
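Both prefix helpers are idempotent in the examples above: applying them to an already-normalized name changes nothing. A minimal sketch of that behavior, using the hypothetical names `strip_chr_sketch` and `prepend_chr_sketch` and assuming only what the examples show:

```python
def strip_chr_sketch(chrom: str) -> str:
    """Remove a leading 'chr' if present; otherwise return the input unchanged."""
    return chrom[3:] if chrom.startswith("chr") else chrom


def prepend_chr_sketch(chrom: str) -> str:
    """Add a leading 'chr' unless the input already has one."""
    return chrom if chrom.startswith("chr") else "chr" + chrom


print(strip_chr_sketch("chr22"), strip_chr_sketch("22"))
print(prepend_chr_sketch("22"), prepend_chr_sketch("chr22"))
```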