Specifying genomes¶
This section illustrates the use of genome files for use with BEDTools programs that need to know chromosome limits to prevent out-of-range coordinates.
Using BEDTools programs like slopBed
or shuffleBed
from the command
line requires “genome” or “chromsizes” files. pybedtools
comes with
common genome assemblies already set up as a dictionary with chromosomes as
keys and zero-based (start, stop) tuples as values:
>>> from pybedtools import genome_registry
>>> genome_registry.dm3['chr2L']
(0, 23011544)
The rules for specifying a genome for methods that require a genome are as follows (use whatever is most convenient):
- Use
g
to specify either a filename or a dictionary - Use
genome
to specify either an assembly name or a dictionary
Below are examples of each.
As a file¶
This is the typical way of using BEDTools programs, by specifying an existing genome
file with g
:
>>> a = pybedtools.example_bedtool('a.bed')
>>> b = a.slop(b=100, g='hg19.genome')
As a string¶
This is probably the most convenient way of specifying a genome. If the
genome exists in the genome registry it will be used directly; otherwise it
will automatically be downloaded from UCSC. You must use the genome
kwarg for this; if you use g
a string will be interpreted as a filename:
>>> c = a.slop(b=100, genome='hg19')
As a dictionary¶
This is a good way of providing custom coordinates; either g
or genome
will accept a dictionary:
>>> d = a.slop(b=100, g={'chr1':(1, 10000)})
>>> e = a.slop(b=100, genome={'chr1':(1,100000)})
Make sure that all these different methods return the same results
>>> b == c == d == e
True
Converting to a file¶
Since BEDTools programs operate on files, the fastest choice will be to
use an existing file. While the time to convert a dictionary to a file is
extremely small, over 1000’s of files (e.g., for Monte Carlo simulations),
the time may add up. The function pybedtools.chromsizes_to_file()
will create a file from a dictionary or string:
>>> # with no filename specified, a tempfile will be created
>>> pybedtools.chromsizes_to_file(pybedtools.chromsizes('dm3'), 'dm3.genome')
'dm3.genome'
>>> print(open('dm3.genome').read())
chr2L 23011544
chr2LHet 368872
chr2R 21146708
chr2RHet 3288761
chr3L 24543557
chr3LHet 2555491
chr3R 27905053
chr3RHet 2517507
chr4 1351857
chrM 19517
chrU 10049037
chrUextra 29004656
chrX 22422827
chrXHet 204112
chrYHet 347038