bioutils.accessions module

Simple routines to deal with accessions, identifiers, etc.

Biocommons terminology: an identifier is composed of a namespace and an accession. The namespace is a string composed of any characters other than a colon (:). The accession is a string without character set restriction. An accession is expected to be unique within its namespace; there is no expectation that accessions are unique across namespaces.

Identifier := <Namespace, Accession>

Namespace := [^:]+

Accession := \w+

Some sample serializations of Identifiers:

json: {"namespace": "RefSeq", "accession": "NM_000551.3"}

xml: <Identifier namespace="RefSeq" accession="NM_000551.3"/>

string: "RefSeq:NM_000551.3"

The string form may be used as a CURIE, in which case the document in which the CURIE is used must contain a map of {namespace : uri}.
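
The terminology above can be illustrated with a minimal sketch. The names `Identifier`, `parse_identifier`, and `expand_curie`, and the example URI prefix, are hypothetical and are not part of bioutils; the sketch only assumes that the first colon separates namespace from accession, which follows from the rule that a namespace cannot contain a colon.

```python
from typing import NamedTuple


class Identifier(NamedTuple):
    """Hypothetical container for a <Namespace, Accession> pair."""

    namespace: str
    accession: str

    def __str__(self):
        # String serialization: "{namespace}:{accession}"
        return f"{self.namespace}:{self.accession}"


def parse_identifier(s):
    """Split a string-form identifier at the first colon."""
    namespace, sep, accession = s.partition(":")
    if not sep or not namespace or not accession:
        raise ValueError(f"Not a namespaced identifier: {s!r}")
    return Identifier(namespace, accession)


def expand_curie(s, prefix_map):
    """Expand a CURIE using a {namespace: uri} map supplied by the document."""
    ident = parse_identifier(s)
    return prefix_map[ident.namespace] + ident.accession


ident = parse_identifier("RefSeq:NM_000551.3")
print(ident.namespace, ident.accession)
print(str(ident))

# The URI prefix below is a placeholder, not an authoritative RefSeq URL.
print(expand_curie("RefSeq:NM_000551.3", {"RefSeq": "https://example.org/refseq/"}))
```

Because the accession has no character set restriction, splitting at the first colon rather than the last keeps any colons inside the accession intact.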

bioutils.accessions.chr22XY(c)

Reformats chromosome to be of the form chr1, …, chr22, chrX, chrY, etc.

Parameters:

c (str or int) – A chromosome.

Returns:

The reformatted chromosome.

Return type:

str

Examples

>>> chr22XY('1')
'chr1'
>>> chr22XY(1)
'chr1'
>>> chr22XY('chr1')
'chr1'
>>> chr22XY(23)
'chrX'
>>> chr22XY(24)
'chrY'
>>> chr22XY("X")
'chrX'
>>> chr22XY("23")
'chrX'
>>> chr22XY("M")
'chrM'
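
The behavior shown in these examples could be reproduced with a sketch like the following. `chr22xy_sketch` is a hypothetical stand-in written only from the examples above; it is not the bioutils source, and the real function may handle inputs that this sketch does not.

```python
def chr22xy_sketch(c):
    """Illustrative reimplementation based only on the documented examples."""
    c = str(c)  # accept ints such as 1 or 23
    if c.startswith("chr"):
        c = c[3:]  # work on the bare name
    # Numeric aliases for the sex chromosomes, as in the examples.
    if c == "23":
        c = "X"
    elif c == "24":
        c = "Y"
    return "chr" + c


for value in ["1", 1, "chr1", 23, 24, "X", "23", "M"]:
    print(repr(value), "->", chr22xy_sketch(value))
```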

bioutils.accessions.coerce_namespace(ac)

Prefixes accession with inferred namespace if not present.

Intended to be used to promote consistent and unambiguous accession identifiers.

Parameters:

ac (str) – The accession, with or without namespace prefixed.

Returns:

An identifier of the form “{namespace}:{accession}”

Return type:

str

Raises:

ValueError – If the accession syntax does not match the syntax of any namespace.

Examples

>>> coerce_namespace("refseq:NM_01234.5")
'refseq:NM_01234.5'
>>> coerce_namespace("NM_01234.5")
'refseq:NM_01234.5'
>>> coerce_namespace("bogus:QQ_01234.5")
'bogus:QQ_01234.5'
>>> coerce_namespace("QQ_01234.5")
Traceback (most recent call last):
...
ValueError: Could not infer namespace for QQ_01234.5
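
The control flow implied by these examples can be sketched as follows. `coerce_namespace_sketch` and `toy_infer` are hypothetical names; the inference step is passed in as a callable and replaced here by a toy rule, so this is an illustration of the documented behavior, not the library's implementation.

```python
def coerce_namespace_sketch(ac, infer):
    """Return ac unchanged if it already has a namespace, else prefix an inferred one.

    `infer` is any callable that returns a namespace string or None.
    """
    if ":" in ac:
        # Already namespaced; no validation of the namespace is attempted,
        # which matches the "bogus:QQ_01234.5" example.
        return ac
    namespace = infer(ac)
    if namespace is None:
        raise ValueError(f"Could not infer namespace for {ac}")
    return f"{namespace}:{ac}"


def toy_infer(ac):
    """Toy inference rule for the demonstration only."""
    return "refseq" if ac.startswith("NM_") else None


print(coerce_namespace_sketch("NM_01234.5", toy_infer))
print(coerce_namespace_sketch("bogus:QQ_01234.5", toy_infer))
```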

bioutils.accessions.infer_namespace(ac)

Infers a unique namespace from an accession, if one exists.

Parameters:

ac (str) – An accession, without the namespace prefix.

Returns:

The unique namespace corresponding to accession syntax, if only one is inferred.

None if the accession syntax does not match any namespace.

Return type:

str or None

Raises:

BioutilsError – If multiple namespaces match the syntax of the accession.

Examples

>>> infer_namespace("ENST00000530893.6")
'ensembl'
>>> infer_namespace("NM_01234.5")
'refseq'
>>> infer_namespace("A2BC19")
'uniprot'

The following example is disabled because Python 2 and 3 handle exceptions differently.

>>> infer_namespace("P12345")  
Traceback (most recent call last):
...
bioutils.exceptions.BioutilsError: Multiple namespaces possible for P12345
>>> infer_namespace("BOGUS99") is None
True
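
The three documented outcomes (one match, no match, several matches) can be sketched as a thin layer over a function that lists candidate namespaces. Everything here is hypothetical: `AmbiguousNamespaceError` stands in for the library's exception, and `toy_candidates` is a lookup table used only for the demonstration.

```python
class AmbiguousNamespaceError(Exception):
    """Hypothetical stand-in for the library's BioutilsError."""


def infer_unique_namespace(ac, candidates_for):
    """Return the single matching namespace, None if there is none, or raise if ambiguous."""
    namespaces = candidates_for(ac)
    if not namespaces:
        return None
    if len(namespaces) > 1:
        raise AmbiguousNamespaceError(f"Multiple namespaces possible for {ac}")
    return namespaces[0]


# Toy candidate table for the demonstration only.
_TOY_TABLE = {
    "NM_01234.5": ["refseq"],
    "P12345": ["insdc", "uniprot"],
}


def toy_candidates(ac):
    return _TOY_TABLE.get(ac, [])


print(infer_unique_namespace("NM_01234.5", toy_candidates))
print(infer_unique_namespace("BOGUS99", toy_candidates))
```

Callers that want to tolerate ambiguity can catch the exception and fall back to the full candidate list instead.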

bioutils.accessions.infer_namespaces(ac)

Infers namespaces possible for a given accession, based on syntax.

Parameters:

ac (str) – An accession, without the namespace prefix.

Returns:

A list of namespaces matching the accession, possibly empty.

Return type:

list of str

Examples

>>> infer_namespaces("ENST00000530893.6")
['ensembl']
>>> infer_namespaces("ENST00000530893")
['ensembl']
>>> infer_namespaces("ENSQ00000530893")
[]
>>> infer_namespaces("NM_01234")
['refseq']
>>> infer_namespaces("NM_01234.5")
['refseq']
>>> infer_namespaces("NQ_01234.5")
[]
>>> infer_namespaces("A2BC19")
['uniprot']
>>> sorted(infer_namespaces("P12345"))
['insdc', 'uniprot']
>>> infer_namespaces("A0A022YWF9")
['uniprot']
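
Syntax-based inference of this kind can be sketched with a table of regular expressions, one per namespace. The patterns below are simplified assumptions chosen to agree with the examples above; they are not the patterns bioutils uses and will not cover every valid accession.

```python
import re

# Simplified, illustrative patterns. These are assumptions, NOT the bioutils patterns.
_PATTERNS_SKETCH = {
    "ensembl": re.compile(r"ENS[GTP]\d{11}(?:\.\d+)?"),
    "refseq": re.compile(r"(?:NM|NR|NP|NC|NG)_\d+(?:\.\d+)?"),
    "uniprot": re.compile(
        r"[OPQ]\d[A-Z0-9]{3}\d"
        r"|[A-NR-Z]\d[A-Z][A-Z0-9]{2}\d(?:[A-Z][A-Z0-9]{2}\d)?"
    ),
    "insdc": re.compile(r"[A-Z]\d{5}(?:\.\d+)?"),
}


def infer_namespaces_sketch(ac):
    """Return every namespace whose pattern matches the whole accession."""
    return [ns for ns, pattern in _PATTERNS_SKETCH.items() if pattern.fullmatch(ac)]


for ac in ["ENST00000530893.6", "NQ_01234.5", "A2BC19", "P12345"]:
    print(ac, sorted(infer_namespaces_sketch(ac)))
```

An accession such as P12345 fits more than one pattern, which is why the result is a list rather than a single value.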

bioutils.accessions.prepend_chr(chr)

Prepends chromosome with ‘chr’ if not present.

Users are strongly discouraged from using this function. Adding a ‘chr’ prefix results in a name that is not consistent with authoritative assembly records.

Parameters:

chr (str) – The chromosome.

Returns:

The chromosome with ‘chr’ prepended.

Return type:

str

Examples

>>> prepend_chr('22')
'chr22'
>>> prepend_chr('chr22')
'chr22'

bioutils.accessions.strip_chr(chr)

Removes the ‘chr’ prefix if present.

Parameters:

chr (str) – The chromosome.

Returns:

The chromosome without a ‘chr’ prefix.

Return type:

str

Examples

>>> strip_chr('22')
'22'
>>> strip_chr('chr22')
'22'
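
Prefix removal and its counterpart, prefix addition, can be sketched as two small helpers. `strip_chr_sketch` and `prepend_chr_sketch` are hypothetical names written from the documented examples, not the bioutils source.

```python
def prepend_chr_sketch(chrom):
    """Add a 'chr' prefix unless one is already present."""
    return chrom if chrom.startswith("chr") else "chr" + chrom


def strip_chr_sketch(chrom):
    """Remove a leading 'chr' prefix if present."""
    return chrom[3:] if chrom.startswith("chr") else chrom


# Both helpers are idempotent, so repeated application is harmless.
print(prepend_chr_sketch(prepend_chr_sketch("22")))
print(strip_chr_sketch(strip_chr_sketch("chr22")))
```

Because both helpers leave already-normalized input unchanged, they can be applied to mixed input without first checking which form each value uses.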