rmDup.pl -- Removes duplicates from a single-chromosome SAM file (Dr. JH Lee's format), in which the extra 12th attribute (mapped read blocks) is used to identify distinct reads (or read pairs). Reads (read pairs) are considered duplicates only if all of their mapped read blocks have the same coordinates. Of each set of duplicates, only the read (pair) with the highest read quality is kept. The output SAM file can be used as input for merging multiple independent replicates (mergeSam) or for read processing to generate SNV and bedgraph files (procReads).
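The duplicate rule described above can be sketched in a few lines of Python. This is only an illustration and not the code in rmDup.pl. It assumes that the mapped read blocks are stored as a string in the 12th tab-separated column, and that "read quality" means the sum of Phred+33 base qualities in the QUAL column; the script may define quality differently. The function names and the "100-103" block notation in the usage lines are hypothetical.

```python
def read_quality(fields):
    # Assumption: quality = sum of Phred+33 base qualities in column 11 (QUAL).
    return sum(ord(ch) - 33 for ch in fields[10])


def remove_duplicates(lines, paired_end=False):
    """Keep one read (or read pair) per distinct set of mapped read blocks."""
    step = 2 if paired_end else 1
    # Header lines (starting with '@') and blank lines are skipped in this sketch.
    records = [ln.rstrip("\n") for ln in lines
               if ln.strip() and not ln.startswith("@")]
    best = {}  # block key -> (quality, lines of the read or pair)
    # In paired-end mode a dangling final read without a mate is ignored.
    for i in range(0, len(records) - step + 1, step):
        unit = records[i:i + step]
        cols = [rec.split("\t") for rec in unit]
        # Assumption: column 12 holds the mapped read blocks of each read.
        key = tuple(c[11] for c in cols)
        quality = sum(read_quality(c) for c in cols)
        # Replace the stored read (pair) only if this one has higher quality.
        if key not in best or quality > best[key][0]:
            best[key] = (quality, unit)
    # Dicts keep insertion order, so output follows first appearance of each key.
    return [line for _, unit in best.values() for line in unit]


def make_record(name, qual, blocks):
    # Hypothetical 12-column record used only for the usage example below.
    return "\t".join([name, "0", "chr1", "100", "60", "4M",
                      "*", "0", "0", "ACGT", qual, blocks])


reads = [make_record("r1", "!!!!", "100-103"),
         make_record("r2", "IIII", "100-103"),
         make_record("r3", "IIII", "200-203")]
kept = remove_duplicates(reads)
print([line.split("\t")[0] for line in kept])
```

Here r1 and r2 share the same mapped read blocks, so only the higher-quality r2 is retained, while r3 is kept because its blocks differ.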
This is part of the full pre-processing:
1. rmDup (removing PCR duplicates for SAM files (including Dr. JH Lee's SAM format); samtools/bedtools can be used for standard SAM files)
2. mergeSam (merging SAM files if there are independent replicates)
3. procReads (processing SAM files to get SNV read counts and generate bedgraph files)
USAGE:
perl rmDup.pl input_sam_file output_sam_file is_paired_end
NOTE:
This duplicate removal script works with standard SAM files and with Dr. Jae-Hyung Lee's 20-attribute SAM output format, as used in RNA-editing or allele-specific expression (ASE) studies.
ARGUMENTS:
is_paired_end
0: single-end; 1: paired-end. For paired-end input, every read must be paired, and pair-1 must always be immediately followed by pair-2 on the next line.
input_sam_file
Should contain reads from only one chromosome, in standard SAM format or Dr. Jae-Hyung Lee's SAM format (see www.ncbi.nlm.nih.gov for more details).
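The input requirements above (a single chromosome, and mates on consecutive lines for paired-end data) can be checked before running the script. The following Python sketch is one possible pre-flight check and is not part of rmDup.pl. It assumes that the chromosome is in column 3 (RNAME) and that both mates share the same read name in column 1, optionally with a "/1" or "/2" suffix; check_input and mate_name are hypothetical names.

```python
def mate_name(qname):
    # Assumption: mates share a read name, optionally suffixed with /1 or /2.
    return qname[:-2] if qname.endswith(("/1", "/2")) else qname


def check_input(lines, paired_end=False):
    """Return a list of problems found in the SAM records (empty if none)."""
    # Header lines (starting with '@') and blank lines are ignored.
    records = [ln.rstrip("\n").split("\t") for ln in lines
               if ln.strip() and not ln.startswith("@")]
    problems = []
    # Requirement: the file holds reads from exactly one chromosome (column 3).
    chroms = sorted({rec[2] for rec in records})
    if len(chroms) > 1:
        problems.append("more than one chromosome: " + ", ".join(chroms))
    if paired_end:
        # Requirement: pair-1 is immediately followed by pair-2.
        if len(records) % 2:
            problems.append("odd number of reads in paired-end input")
        for i in range(0, len(records) - 1, 2):
            if mate_name(records[i][0]) != mate_name(records[i + 1][0]):
                problems.append(
                    f"records {i + 1} and {i + 2} do not look like a pair")
    return problems
```

An empty list means no problem was detected under these assumptions. Record numbers in the messages count alignment records only, not header lines.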
This pipeline is free software; you can redistribute it and/or modify it, provided that the related works and authors are cited and acknowledged.
This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.
Cyrus Tak-Ming CHAN
Xiao Lab, Department of Integrative Biology & Physiology, UCLA