procReads.pl -- Processing a duplicate-removed SAM file (rmDup) of a chromosome to generate the chromosome specific SNV list and the bedgraph file. The output files are used as input files for the ASARP pipeline.
The new procReads.pl supports strand-specific paired-end RNA-Seq data (i.e. the SAM file strand information is reliable) for more accurate results.
This is part of the full pre-processing:
1. rmDup (removing PCR duplicates for SAM files (including Dr. JH Lee's SAM format); samtools/bedtools can be used for standard SAM files)
2. mergeSam (merging SAM files if there are independent duplicates)
3. procReads (processing SAM files to get SNV read counts and generate bedgraph files)
USAGE:
perl procReads.pl input_sam_file input_snvs output_snvs output_bedgraph is_paired_end
NOTE:
the read processing script is for a standard SAM file with 11 attributes. You can also use samtools and bedtools; or to use procReadsJ.pl on the special jsam files introduced by Dr. Jae-Hyung Lee for RNA-editin and allele specific expression (ASE) studies
There are some assumptions and requirements for the input SAM/JSAM files. See Files for more details.
ARGUMENTS:
chr chromosome to be investigated (correspond to the input_sam_file) input_sam_file SAM file input after duplicate removal (use rmDup.pl) intput_snvs input SNV list (without read counts) output_snvs output SNV candidates with read counts output_bedgraph output bedgraph file, see below for the details: genome.ucsc.edu is_paired_end 0: single-end; 1: paired-end For paired-end reads, all reads should be paired up, where pair-1 should be always followed by pair-2 in the next line.
OPTIONAL [strongly recommended to be input]:
is_strand_sp 0: non-strand specific (or no input); 1: strand-specific with pair 1 **sense**; 2: strand-specific with pair 1 **anti-sense** Be careful with the strand-specific setting as it will give totally opposite strand information if wrongly set. The strand-specific option is used for strand-specific RNA-Seq data. When set, specialized bedgraph files will be output (output_bedgraph) where there is a 5th extra attribute specifying the strand: + or - besides the standard ones: genome.ucsc.edu One can use grep and cut to get +/- strand only bedgraphs.
OPTIONAL [if input, must be input in order following is_strand_sp]:
bedgraph_title a short title for the output bedgraph files (will be put in description of the header line) if there are spaces in between it should be quoted e.g. "nbt.editing reads: distinct after dup removal" if not input, "default" will be used as the short title
input_sam_file
should contain only 1 chromosome, and it should be in SAM/JSAM format. If it is in JSAM format (check out www.ncbi.nlm.nih.gov for more details), procReadsJ can also be used.
The SNP list should be in a format like this:
chr1 20129 C>T rs12354148 2:0:0 chr1 118617 T>C na 1:0:0 chr1 237763 G>A rs79665216 1:0:0 chr1 565508 G>A rs9283150 0:1:0
Each line is space separated, with
chromosome location ref_allele>alt_allele dbSnp_id [read_counts ref:alt:others]
Only the first 4 fields will be parsed so
read_counts
are not needed and ignored which are from DNA genomic sequencing.
To avoid unnecessary computational time to read SNVs of other chrosomes than the one in inptu_sam_file
, it is suggested to keep SNVs of the same chromosome in one seperate file.
By default, the strand specific flag is_strand_sp
is unset, and the inptu_sam_file
strand information is considered unreliable and not used. The output files are standard bedgraph files genome.ucsc.edu with space as the dilimiter, and SNV files should follow the file format described in Files:
Example SNV output lists look like this:
chr10 1046712 G>A rs2306409 50:39:0 chr10 1054444 A>G rs11253567 2:0:0 chr10 1055866 A>T rs4880751 1:2:1 chr10 1055949 G>A rs12355506 7:2:0 chr10 1055968 G>A rs72478237 6:5:0 chr10 1060218 G>A rs3207775 42:37:0
When is_strand_sp
is set, the program outputs bedgraph and SNV files with the extra last strand attributes. Output SNV examples would look like:
chr10 1046712 G>A rs2306409 30:23:0 + chr10 1055866 A>T rs4880751 1:2:0 + chr10 1055949 G>A rs12355506 2:0:0 + chr10 1055968 G>A rs72478237 2:2:0 + chr10 1060218 G>A rs3207775 27:22:0 + ... chr10 1046712 G>A rs2306409 20:16:0 - chr10 1054444 A>G rs11253567 2:0:0 - chr10 1055949 G>A rs12355506 5:2:0 - chr10 1055968 G>A rs72478237 4:3:0 - chr10 1060218 G>A rs3207775 15:15:0 -
As a result, one SNV may appear twice if it has valid ref:alt:wrnt read counts on both + and - strands. In the example above, one can have more accurate information for the RNA read counts, especially when there are genes on the opposite strands.
The output bedgraph lines would look like:
chr10 181481 181482 7 + chr10 181482 181483 9 + chr10 181483 181499 10 + ... chr10 181479 181482 5 - chr10 181482 181483 8 - chr10 181483 181499 9 - ...
Note that all + lines are output before any - lines output. While the 5th strand attribute is not specified in the bedgraph standard, two additional bedgraph files are output, with suffixes .plus and .minus, to provide the + and - only standard bedgraph tracks respectively.
This pipeline is free software; you can redistribute it and/or modify it given that the related works and authors are cited and acknowledged.
This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.
Cyrus Tak-Ming CHAN
Xiao Lab, Department of Integrative Biology & Physiology, UCLA