Couple Paired-End BLAT Alignments
While BLAT does not accept paired-end sequence files, the files can be aligned individually and then coupled with this utility after alignment. BLAT offers a number of advantages over other NGS aligners, one being its ability to report all alignments above a specified threshold. This allows for accurate identification of multihits, that is, sequences that align legitimately to multiple locations in the genome, rather than reporting only the best hit.
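The difference between reporting every alignment above a threshold and reporting only the best hit can be illustrated with a minimal Python sketch. It is not the utility's implementation: the function name `classify_hits`, the tuple layout, and the score values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def classify_hits(alignments, min_score):
    """Label each read as 'unique' or 'multihit'.

    `alignments` is an iterable of (read_name, chrom, position, score)
    tuples (a hypothetical layout). Only alignments scoring at least
    `min_score` are considered.
    """
    passing = defaultdict(list)
    for read, chrom, pos, score in alignments:
        if score >= min_score:
            passing[read].append((chrom, pos, score))

    # A read with exactly one passing alignment is unique; a read with
    # several passing alignments is a multihit. Reads with no passing
    # alignment are left out entirely.
    return {
        read: "unique" if len(hits) == 1 else "multihit"
        for read, hits in passing.items()
    }

example = [
    ("read1", "chr1", 1000, 98),
    ("read2", "chr2", 5000, 97),
    ("read2", "chr9", 7200, 96),  # second legitimate location for read2
    ("read3", "chr3", 300, 60),   # below threshold, dropped
]
labels = classify_hits(example, min_score=95)
```

A best-hit aligner would report only the chr2 alignment for `read2`. Keeping both passing alignments is what allows the read to be flagged as a multihit.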
Usage
nuckit couple testSeq-1.R2.psl.gz testSeq-1.R1.psl.gz \
-k testSeq-1.R2.key.csv testSeq-1.R1.key.csv \
-o testSeq-1.uniq.csv \
--condSites testSeq-1.cond.csv \
--chimera testSeq-1.chimera.rds \
--multihit testSeq-1.multihit.rds \
--refGenome hg38
Arguments
Positional arguments:
[anchorPSL] Alignment file (psl format) to couple, component of paired-end sequence that is anchored to a biological phenomenon.
[adriftPSL] Alignment file (psl format) to couple, other component of paired-end sequence that is generated by non-natural forces (shearing).
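As a rough illustration of what coupling means, the Python sketch below pairs anchor and adrift alignments that share a read name. The `Alignment` record, the `couple` function, and the pairing rules (same chromosome, opposite strands, 0-based half-open coordinates) are assumptions made for this example and may differ from what the utility actually does.

```python
from collections import namedtuple

# Hypothetical, simplified alignment record (not the full psl schema).
Alignment = namedtuple("Alignment", "read chrom strand start end")

def couple(anchor_alns, adrift_alns):
    """Pair anchor and adrift alignments that share a read name.

    A pair is kept when both mates align to the same chromosome on
    opposite strands. The template spans the outermost coordinates.
    """
    adrift_by_read = {}
    for aln in adrift_alns:
        adrift_by_read.setdefault(aln.read, []).append(aln)

    pairs = []
    for anchor in anchor_alns:
        for adrift in adrift_by_read.get(anchor.read, []):
            if anchor.chrom != adrift.chrom or anchor.strand == adrift.strand:
                continue  # not a plausible paired-end template
            start = min(anchor.start, adrift.start)
            end = max(anchor.end, adrift.end)
            pairs.append((anchor.read, anchor.chrom, start, end, end - start))
    return pairs

anchors = [Alignment("read1", "chr1", "+", 1000, 1100)]
adrifts = [
    Alignment("read1", "chr1", "-", 1250, 1350),
    Alignment("read1", "chr5", "-", 10, 110),  # different chromosome, skipped
]
pairs = couple(anchors, adrifts)
```

In this sketch the surviving pair spans chr1:1000-1350, giving a template length of 350 bp.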
Optional arguments:
[-h, --help] Show help message and exit.
[-k, --keys] Key files which denote the consolidation of reads to unique sequences. See README for requirements. Multiple key files should be input in the same order as their respective psl files.
[-o, --uniqOutput] Output file for unique alignments. File types supported: .rds, .RData, .csv, and .tsv
[--condSites] Output file for condensed sites, based on anchor alignments. Read counts and unique alignment length counts are reported for each unique site. Same file types supported as uniqOutput.
[--chimeras] Output file for chimeric alignments. Same file types supported as uniqOutput.
[--multihits] Output file for multihit alignments. Same file types supported as uniqOutput.
[--stat] Name of a file, written to the output directory, containing read counts for each sample. CSV file format, e.g. test.stat.csv.
[-g, --refGenome] Reference genome, needs to be installed through BSgenome (BioConductor).
[--maxAlignStart] Maximum allowable distance from the start of the sequence to keep the alignment. Default = 5.
[--minPercentIdentity] Minimal global (whole sequence) percent identity required to keep alignment. Default = 95 (0-100).
[--minTempLength] Minimum value for paired template length to consider. Default = 30 (bps).
[--maxTempLength] Maximum value for paired template length to consider. Default = 2500 (bps).
[--keepAltChr] By default, blatCoupleR will remove alignments from psl files aligning to alternative chromosome sequences, e.g. chr7_*_alt. Using this option will keep these alignments, which may increase multihit outputs.
[--readNamePattern] Regular expression for pattern matching read names. Should not contain R1/R2/I1/I2 specific components. Default is [\w:-]+
[--saveImage] Output file name for saved image. Include '.RData' (e.g. debug.RData).
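The filtering options above can be pictured with a short Python sketch. The default values are taken from the argument list, but everything else is an assumption made for illustration: the dictionary keys (`q_start`, `matches`, `rep_matches`, `q_size`), the percent identity formula (matching bases divided by query length), and the inclusive bounds are not confirmed by the utility's documentation.

```python
def passes_filters(aln, max_align_start=5, min_percent_identity=95):
    """Return True if one alignment survives the quality filters.

    `aln` is a dict with hypothetical keys modelled on psl columns:
    q_start, matches, rep_matches, q_size.
    """
    # Alignment must begin close to the start of the read.
    if aln["q_start"] > max_align_start:
        return False
    # Assumed definition of global percent identity: matching bases
    # over the whole query length, expressed on a 0-100 scale.
    pct = 100.0 * (aln["matches"] + aln["rep_matches"]) / aln["q_size"]
    return pct >= min_percent_identity

def template_ok(length, min_len=30, max_len=2500):
    """Return True if a paired template length (bp) is within bounds."""
    return min_len <= length <= max_len

good = {"q_start": 0, "matches": 98, "rep_matches": 0, "q_size": 100}
late = {"q_start": 12, "matches": 88, "rep_matches": 0, "q_size": 100}
poor = {"q_start": 0, "matches": 90, "rep_matches": 0, "q_size": 100}

results = [passes_filters(a) for a in (good, late, poor)]
```

Here only the first alignment passes: the second starts too far into the read and the third falls below 95 percent identity. A coupled pair would additionally need a template length for which `template_ok` returns True.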
Dependencies
This utility is written in R and was developed on v3.4.0, though it should run with earlier versions given the appropriate dependencies. The utility depends on the following R packages:
- argparse
- yaml
- stringr
- GenomicRanges
- igraph
- data.table
- Matrix
- BSgenome
- BSgenome.Hsapiens.UCSC.hg38 (for tests)