bioutils.assemblies module

Creates dictionaries of genome assembly data as provided by

ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/*.assembly.txt

Assemblies are stored as JSON files shipped with the package in _data/assemblies/. Those files are built with sbin/assembly-to-json, also in this package.

Definitions:

  • accession (ac): symbol used to refer to a sequence (e.g., NC_000001.10)

  • name: human-readable label (e.g., ‘1’, ‘MT’, ‘HSCHR6_MHC_APD_CTG1’) that refers to a sequence, unique within some domain (e.g., GRCh37.p10)

  • chromosome (chr): subset of names that refer to chromosomes 1..22, X, Y, MT

  • aliases: list of other names; uniqueness unknown
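
To make the distinction between a name and a chromosome concrete, here is a minimal sketch. The helper `is_chromosome_name` and the constant `CHROMOSOME_NAMES` are hypothetical and are not part of bioutils; they simply encode the definition above (1..22, X, Y, MT).

```python
# Hypothetical helper, not part of bioutils: encodes the definition above.
# Chromosome names are the subset of names 1..22, X, Y, MT (no 'chr' prefix).
CHROMOSOME_NAMES = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}


def is_chromosome_name(name):
    """Return True if name is exactly one of 1..22, X, Y, MT."""
    return name in CHROMOSOME_NAMES


print(is_chromosome_name("1"))                    # a chromosome name
print(is_chromosome_name("HSCHR6_MHC_APD_CTG1"))  # a name, but not a chromosome
print(is_chromosome_name("chr1"))                 # prefixed form is not a verbatim name
```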

Note

Some users prefer using a ‘chr’ prefix for chromosomes and some don’t. Some prefer upper case and others prefer lower. This rift is unfortunate and creates unnecessary friction in sharing data. “You say to-MAY-to, I say to-MAH-to” doesn’t apply here: this code favors using the authoritative names exactly as defined in the assembly records. Users are encouraged to use sequence names verbatim, without prefixes or case changes.

bioutils.assemblies.get_assemblies(names=[])

Retrieves data from multiple assemblies.

If assemblies are not specified, retrieves data from all available ones.

Parameters:

names (list of str, optional) – The names of the assemblies to retrieve data for.

Returns:

A dictionary of the form {assembly_name : assembly_data}, where the values are the dictionaries of assembly data as described in get_assembly().

Return type:

dict

Examples

>>> assemblies = get_assemblies(names=['GRCh37.p13'])
>>> assy = assemblies['GRCh37.p13']
>>> assemblies = get_assemblies()
>>> 'GRCh38.p2' in assemblies
True
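
Because the return value is a plain dictionary keyed by assembly name, it can be processed with ordinary dict operations. The sketch below is illustrative only: `count_sequences` is a hypothetical helper, and `sample_assemblies` is a trimmed stand-in for the result of get_assemblies(), not real package data.

```python
def count_sequences(assemblies):
    """Map each assembly name to its number of sequence records."""
    return {name: len(data["sequences"]) for name, data in assemblies.items()}


# Stand-in for get_assemblies() output; real records carry many more fields
# and many more sequences.
sample_assemblies = {
    "GRCh37.p13": {
        "name": "GRCh37.p13",
        "sequences": [{"name": "1"}, {"name": "MT"}],
    },
}

print(count_sequences(sample_assemblies))
```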

bioutils.assemblies.get_assembly(name)

Retrieves the assembly data for a given assembly.

Parameters:

name (str) – The name of the assembly to retrieve data for.

Returns:

A dictionary of the assembly data. See examples for details.

Return type:

dict

Examples

>>> assy = get_assembly('GRCh37.p13')
>>> assy['name']
'GRCh37.p13'
>>> assy['description']
'Genome Reference Consortium Human Build 37 patch release 13 (GRCh37.p13)'
>>> assy['refseq_ac']
'GCF_000001405.25'
>>> assy['genbank_ac']
'GCA_000001405.14'
>>> len(assy['sequences'])
297
>>> import pprint
>>> pprint.pprint(assy['sequences'][0])
{'aliases': ['chr1'],
 'assembly_unit': 'Primary Assembly',
 'genbank_ac': 'CM000663.1',
 'length': 249250621,
 'name': '1',
 'refseq_ac': 'NC_000001.10',
 'relationship': '=',
 'sequence_role': 'assembled-molecule'}
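
Each sequence record lists its aliases, so data that arrives with prefixed names such as ‘chr1’ can be mapped back to the authoritative name. The following is a sketch, assuming records shaped like the one printed above; `make_alias_name_map` is a hypothetical helper, not a bioutils function. Since alias uniqueness is unknown, each alias maps to a list of names.

```python
def make_alias_name_map(sequences):
    """Map each alias to the list of sequence names that declare it."""
    alias_map = {}
    for seq in sequences:
        for alias in seq.get("aliases", []):
            # Aliases are not guaranteed unique, so collect every match.
            alias_map.setdefault(alias, []).append(seq["name"])
    return alias_map


# The record shown in the example above, copied verbatim.
sample_sequences = [
    {
        "aliases": ["chr1"],
        "assembly_unit": "Primary Assembly",
        "genbank_ac": "CM000663.1",
        "length": 249250621,
        "name": "1",
        "refseq_ac": "NC_000001.10",
        "relationship": "=",
        "sequence_role": "assembled-molecule",
    },
]

print(make_alias_name_map(sample_sequences))
```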

bioutils.assemblies.get_assembly_names()

Retrieves available assemblies from the _data/assemblies directory.

Returns:

The names of the available assemblies.

Return type:

list of str

Examples

>>> assy_names = get_assembly_names()
>>> 'GRCh37.p13' in assy_names
True

bioutils.assemblies.make_ac_name_map(assy_name, primary_only=False)

Creates a map from accessions to sequence names for a given assembly.

Parameters:
  • assy_name (str) – The name of the assembly to make a map for.

  • primary_only (bool, optional) – Whether to include only primary sequences. Defaults to False.

Returns:

A dictionary of the form {accession : sequence_name} for accessions in the given assembly, where accession and sequence_name are strings.

Return type:

dict

Examples

>>> grch38p5_ac_name_map = make_ac_name_map('GRCh38.p5')
>>> grch38p5_ac_name_map['NC_000001.11']
'1'
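
Conceptually, such a map can be derived from the sequence records returned by get_assembly(). The sketch below is not the package's implementation. It assumes that the accession is the `refseq_ac` field (inferred from the example, which returns an NC_ accession) and that "primary" means `assembly_unit == 'Primary Assembly'`; both are assumptions. `sketch_ac_name_map` and the second sample record are hypothetical.

```python
def sketch_ac_name_map(sequences, primary_only=False):
    """Build {accession: name} from sequence records.

    Assumptions (not confirmed by the documentation): the accession is
    the 'refseq_ac' field, and primary sequences are those whose
    'assembly_unit' is 'Primary Assembly'.
    """
    return {
        seq["refseq_ac"]: seq["name"]
        for seq in sequences
        if not primary_only or seq["assembly_unit"] == "Primary Assembly"
    }


sample_sequences = [
    # Fields taken from the GRCh37.p13 record shown under get_assembly().
    {"name": "1", "refseq_ac": "NC_000001.10", "assembly_unit": "Primary Assembly"},
    # Hypothetical non-primary record with placeholder values.
    {"name": "EXAMPLE_PATCH", "refseq_ac": "NW_EXAMPLE.1", "assembly_unit": "PATCHES"},
]

print(sketch_ac_name_map(sample_sequences))
print(sketch_ac_name_map(sample_sequences, primary_only=True))
```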

bioutils.assemblies.make_name_ac_map(assy_name, primary_only=False)

Creates a map from sequence names to accessions for a given assembly.

Parameters:
  • assy_name (str) – The name of the assembly to make a map for.

  • primary_only (bool, optional) – Whether to include only primary sequences. Defaults to False.

Returns:

A dictionary of the form {sequence_name : accession} for sequences in the given assembly, where sequence_name and accession are both strings.

Return type:

dict

Examples

>>> grch38p5_name_ac_map = make_name_ac_map('GRCh38.p5')
>>> grch38p5_name_ac_map['1']
'NC_000001.11'
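
In keeping with the module's advice to use sequence names verbatim, a lookup against this map can fail loudly on prefixed or re-cased names rather than guess. This is a minimal sketch: `to_accession` is a hypothetical helper, and `sample_name_ac_map` is a one-entry stand-in taken from the example above, not the full map.

```python
def to_accession(name, name_ac_map):
    """Look up the accession for a verbatim sequence name."""
    try:
        return name_ac_map[name]
    except KeyError:
        # No prefix stripping or case folding: names must match exactly.
        raise KeyError(
            f"{name!r} is not a sequence name in this assembly; "
            "use names verbatim, without prefixes or case changes"
        ) from None


# Stand-in for make_name_ac_map('GRCh38.p5'); only the documented entry.
sample_name_ac_map = {"1": "NC_000001.11"}

print(to_accession("1", sample_name_ac_map))
```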