Filtering VCF files¶
The filter script: vcf_filter.py¶
Filtering a VCF file based on some properties of interest is a common enough
operation that PyVCF offers an extensible script. vcf_filter.py
does
the work of reading input, updating the metadata and filtering the records.
Existing Filters¶
-
class
vcf.filters.
VariantGenotypeQuality
(args)[source]¶ Filters sites with only low quality variants.
It is possible to have a high site quality with many low quality calls. This filter demands at least one call be above a threshold quality.
-
class
vcf.filters.
ErrorBiasFilter
(args)[source]¶ Filter sites that look like correlated sequencing errors.
Some sequencing technologies, notably pyrosequencing, produce mutation hotspots where there is a constant level of noise, producing some reference and some heterozygote calls.
This filter computes a Bayes Factor for each site by comparing the binomial likelihood of the observed allelic depths under:
- A model with constant error equal to the MAF.
- A model where each sample is the ploidy reported by the caller.
The test value is the log of the bayes factor. Higher values are more likely to be errors.
Note: this filter requires rpy2
Adding a filter¶
You can reuse this work by providing a filter class, rather than writing your own filter. For example, lets say I want to filter each site based on the quality of the site. I can create a class like this:
import vcf.filters
class SiteQuality(vcf.filters.Base):
'Filter sites by quality'
name = 'sq'
@classmethod
def customize_parser(self, parser):
parser.add_argument('--site-quality', type=int, default=30,
help='Filter sites below this quality')
def __init__(self, args):
self.threshold = args.site_quality
def __call__(self, record):
if record.QUAL < self.threshold:
return record.QUAL
This class subclasses vcf.filters.Base
which provides the interface for VCF filters.
The docstring and name
are metadata about the parser. The docstring provides
the help for the script, and the first line is included in the FILTER metadata when
applied to a file.
The customize_parser
method allows you to add arguments to the script.
We use the __init__
method to grab the argument of interest from the parser.
Finally, the __call__
method processes each record and returns a value if the
filter failed. The base class uses the name
and threshold
to create
the filter ID in the VCF file.
To make vcf_filter.py aware of the filter, you can either use the local script option or declare an entry point. To use a local script, simply call vcf_filter:
$ vcf_filter.py --local-script my_filters.py ...
To use an entry point, you need to declare a vcf.filters
entry point in your setup
:
setup(
...
entry_points = {
'vcf.filters': [
'site_quality = module.path:SiteQuality',
]
}
)
Either way, when you call vcf_filter.py, you should see your filter in the list of available filters:
usage: vcf_filter.py [-h] [--no-short-circuit] [--no-filtered]
[--output OUTPUT] [--local-script LOCAL_SCRIPT]
input filter [filter_args] [filter [filter_args]] ...
Filter a VCF file
positional arguments:
input File to process (use - for STDIN) (default: None)
optional arguments:
-h, --help Show this help message and exit. (default: False)
--no-short-circuit Do not stop filter processing on a site if any filter
is triggered (default: False)
--output OUTPUT Filename to output [STDOUT] (default: <open file
'<stdout>', mode 'w' at 0x1002841e0>)
--no-filtered Output only sites passing the filters (default: False)
--local-script LOCAL_SCRIPT
Python file in current working directory with the
filter classes (default: None)
sq:
Filter sites by quality
--site-quality SITE_QUALITY
Filter sites below this quality (default: 30)
Utilities¶
Utilities for VCF files.
Simultaneously iterate two or more files¶
-
vcf.utils.
walk_together
(*readers, **kwargs)[source]¶ Simultaneously iteratate over two or more VCF readers. For each genomic position with a variant, return a list of size equal to the number of VCF readers. This list contains the VCF record from readers that have this variant, and None for readers that don’t have it. The caller must make sure that inputs are sorted in the same way and use the same reference otherwise behaviour is undefined.
- Args:
- vcf_record_sort_key: function that takes a VCF record and returns a
- tuple that can be used as a key for comparing and sorting VCF records across all readers. This tuple defines what it means for two variants to be equal (eg. whether it’s only their position or also their allele values), and implicitly determines the chromosome ordering since the tuple’s 1st element is typically the chromosome name (or calculated from it).
Trim common suffix¶
-
vcf.utils.
trim_common_suffix
(*sequences)[source]¶ Trim a list of sequences by removing the longest common suffix while leaving all of them at least one character in length.
Standard convention with VCF is to place an indel at the left-most position, but some tools add additional context to the right of the sequences (e.g. samtools). These common suffixes are undesirable when comparing variants, for example in variant databases.
>>> trim_common_suffix('TATATATA', 'TATATA') ['TAT', 'T']
>>> trim_common_suffix('ACCCCC', 'ACCCCCCCC', 'ACCCCCCC', 'ACCCCCCCCC') ['A', 'ACCC', 'ACC', 'ACCCC']
vcf_melt¶
This script converts a VCF file from wide format (many calls per row) to a long format (one call per row). This is useful if you want to grep per sample or for really quick import into, say, a spreadsheet:
$ vcf_melt < vcf/test/gatk.vcf
SAMPLE AD DP GQ GT PL FILTER CHROM POS REF ALT ID info.AC info.AF info.AN info.BaseQRankSum info.DB info.DP info.DS info.Dels info.FS info.HRun info.HaplotypeScore info.InbreedingCoeff info.MQ info.MQ0 info.MQRankSum info.QD info.ReadPosRankSum
BLANK 6,0 6 18.04 0/0 0,18,211 . chr22 42522392 G [A] rs28371738 2 0.143 14 0.375 True 1506 True 0.0 0.0 0 123.5516 253.92 0 0.685 5.9 0.59
NA12878 138,107 250 99.0 0/1 1961,0,3049 . chr22 42522392 G [A] rs28371738 2 0.143 14 0.375 True 1506 True 0.0 0.0 0 123.5516 253.92 0 0.685 5.9 0.59
NA12891 169,77 250 99.0 0/1 1038,0,3533 . chr22 42522392 G [A] rs28371738 2 0.143 14 0.375 True 1506 True 0.0 0.0 0 123.5516 253.92 0 0.685 5.9 0.59
NA12892 249,0 250 99.0 0/0 0,600,5732 . chr22 42522392 G [A] rs28371738 2 0.143 14 0.375 True 1506 True 0.0 0.0 0 123.5516 253.92 0 0.685 5.9 0.59
NA19238 248,1 250 99.0 0/0 0,627,6191 . chr22 42522392 G [A] rs28371738 2 0.143 14 0.375 True 1506 True 0.0 0.0 0 123.5516 253.92 0 0.685 5.9 0.59
NA19239 250,0 250 99.0 0/0 0,615,5899 . chr22 42522392 G [A] rs28371738 2 0.143 14 0.375 True 1506 True 0.0 0.0 0 123.5516 253.92 0 0.685 5.9 0.59
NA19240 250,0 250 99.0 0/0 0,579,5674 . chr22 42522392 G [A] rs28371738 2 0.143 14 0.375 True 1506 True 0.0 0.0 0 123.5516 253.92 0 0.685 5.9 0.59
BLANK 13,4 17 62.64 0/1 63,0,296 . chr22 42522613 G [C] rs1135840 6 0.429 14 16.289 True 1518 True 0.03 0.0 0 142.5716 242.46 0 2.01 9.16 -1.731
NA12878 118,127 246 99.0 0/1 2396,0,1719 . chr22 42522613 G [C] rs1135840 6 0.429 14 16.289 True 1518 True 0.03 0.0 0 142.5716 242.46 0 2.01 9.16 -1.731
NA12891 241,0 244 99.0 0/0 0,459,4476 . chr22 42522613 G [C] rs1135840 6 0.429 14 16.289 True 1518 True 0.03 0.0 0 142.5716 242.46 0 2.01 9.16 -1.731
NA12892 161,85 246 99.0 0/1 1489,0,2353 . chr22 42522613 G [C] rs1135840 6 0.429 14 16.289 True 1518 True 0.03 0.0 0 142.5716 242.46 0 2.01 9.16 -1.731
NA19238 110,132 242 99.0 0/1 2561,0,1488 . chr22 42522613 G [C] rs1135840 6 0.429 14 16.289 True 1518 True 0.03 0.0 0 142.5716 242.46 0 2.01 9.16 -1.731
NA19239 106,135 242 99.0 0/1 2613,0,1389 . chr22 42522613 G [C] rs1135840 6 0.429 14 16.289 True 1518 True 0.03 0.0 0 142.5716 242.46 0 2.01 9.16 -1.731
NA19240 116,126 243 99.0 0/1 2489,0,1537 . chr22 42522613 G [C] rs1135840 6 0.429 14 16.289 True 1518 True 0.03 0.0 0 142.5716 242.46 0 2.01 9.16 -1.731