snp_distri.pl -- Calculates the distribution of SNV positions across exons, UTRs, introns, etc., according to transcript annotations. A SNV is discarded (it does not contribute to the total SNV count) if it is covered by < 2 RNA-Seq reads.
It also serves as an introductory application script for getting familiar with the ASARP pipeline (asarp).
perl snp_distri.pl output_file snv_file transcript_file [powerful_snv_cutoff pwr_snv_details ordinary_snv_details [is_detailed]]
snv_file
: a SNV list, see the format description in snpParser
powerful_snv_cutoff
: an optional cutoff to categorize SNVs into powerful (>= powerful_snv_cutoff) and non-powerful types. Default: 20
transcript_file
: Transcript and gene annotation file, see the format description in fileParser
The optional arguments must be given in the order shown.
pwr_snv_details and ordinary_snv_details are the output files for the detailed SNV categories of powerful and non-powerful (ordinary) SNVs, respectively.
is_detailed
: default: 0; when it is set to 1 (following both *_snv_details arguments), detailed information on the SNVs is output.
The detail format: chr;geneName;snvPos;type
output_file
: Tab-delimited counts of SNV positions in different gene regions. Headers are included, as illustrated in the following example:
  Type               Exon   Intron  5'UTR  3'UTR  Complex  In-gene  Intergenic  Total
  Powerful(>=20)     3788   972     586    4097   1464     7937     342         8279
  Non-powerful(<20)  3846   32143   1246   3679   1323     39566    6580        46146
  Overall            7634   33115   1832   7776   2787     47503    6922        54425
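As a quick sanity check on such a file, the following is a minimal sketch in Python (for illustration only; snp_distri.pl itself is a Perl script). It reads the tab-delimited summary and verifies that In-gene plus Intergenic equals Total in every row, which follows from the two being exclusive. The names `read_summary` and `is_consistent` are hypothetical helpers, not part of ASARP, and the sample text is the example table with columns joined by tabs.

```python
import csv
import io

# Example rows from the table above, with columns separated by tabs.
SAMPLE = (
    "Type\tExon\tIntron\t5'UTR\t3'UTR\tComplex\tIn-gene\tIntergenic\tTotal\n"
    "Powerful(>=20)\t3788\t972\t586\t4097\t1464\t7937\t342\t8279\n"
    "Non-powerful(<20)\t3846\t32143\t1246\t3679\t1323\t39566\t6580\t46146\n"
    "Overall\t7634\t33115\t1832\t7776\t2787\t47503\t6922\t54425\n"
)


def read_summary(handle):
    """Return {row label: {column name: count}} (hypothetical helper)."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    table = {}
    for row in reader:
        label = row.pop("Type")
        table[label] = {name: int(value) for name, value in row.items()}
    return table


def is_consistent(counts):
    """In-gene and Intergenic are exclusive, so they must add up to Total."""
    return counts["In-gene"] + counts["Intergenic"] == counts["Total"]


summary = read_summary(io.StringIO(SAMPLE))
for label, counts in summary.items():
    print(label, "consistent" if is_consistent(counts) else "INCONSISTENT")
```

To check a real output file, pass an open file handle (for example `open(path, newline="")`) instead of the `io.StringIO` wrapper.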
The application is oriented to SNV positions (reference genome locations). In other words, if one position contains multiple SNVs (rare, and not seen in our data), it is still counted as one SNV (position). A SNV may overlap multiple transcripts of multiple genes. The categories of a SNV are determined by the following rules:
If the SNV overlaps any transcript exon block, it is considered to be in (coding) exons, regardless of how many times it overlaps.
If the SNV overlaps any 5'/3' UTR, it is considered to be in 5'/3' UTRs (non-coding), regardless of how many times it overlaps.
Only when a SNV overlaps no exon and no 5'/3' UTR, and its genome location is within some transcript span, is it considered to be in introns, regardless of how many times it overlaps. As a result, intron SNVs are mutually exclusive with exon and 5'/3' UTR SNVs.
A SNV can fall into multiple categories, in which case it is denoted as complex. Complex is therefore the union of SNVs with any combination of exon and 5'/3' UTR types. In-gene SNVs are the union of all the categories: exon, 5' UTR, 3' UTR, and intron. Therefore, the sum of the intron, exon, 5' UTR and 3' UTR SNV counts will be larger than the total in-gene SNV count if complex SNVs exist. In-gene and intergenic SNVs are mutually exclusive.
In any of the above cases, a SNV is considered in-gene. If a SNV is not in-gene, it is considered to be in the intergenic regions.
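The rules above can be sketched as a small function. This is an illustrative Python sketch under stated assumptions, not the script's actual Perl code: the input is assumed to be the list of region labels that one SNV position hits across all overlapping transcripts, where the label "SPAN" (a hypothetical name) means the position lies inside a transcript span without hitting an exon or UTR. Treating any two or more of exon, 5' UTR and 3' UTR as complex is one reading of the rule.

```python
# Labels that count as exon or UTR hits; "SPAN" is a hypothetical label
# meaning "inside a transcript span, but not in any exon or UTR block".
EXON_OR_UTR = ("EXON", "5'UTR", "3'UTR")


def categorize(hits):
    """Return (categories, is_complex, in_gene) for one SNV position.

    hits: region labels collected over every transcript the SNV overlaps.
    Repeated labels do not matter: a category is counted once per position.
    """
    hits = set(hits)
    categories = {label for label in hits if label in EXON_OR_UTR}
    # Intron is assigned only if no exon or UTR was hit anywhere.
    if not categories and "SPAN" in hits:
        categories = {"INTRON"}
    is_complex = len(categories) > 1   # more than one of exon/5'UTR/3'UTR
    in_gene = bool(categories)         # otherwise the SNV is intergenic
    return categories, is_complex, in_gene


print(categorize(["EXON", "SPAN", "EXON"]))  # exon only, not complex
print(categorize(["EXON", "3'UTR"]))         # complex
print(categorize(["SPAN", "SPAN"]))          # intron
print(categorize([]))                        # intergenic
```

Because a complex SNV is counted once in each of its categories but only once in the in-gene total, the per-category counts can sum to more than the in-gene count.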
The application also outputs percentage summaries (over the total SNV positions) for the powerful, non-powerful and overall SNV distributions. Sample output:
  Calculating powerful SNV distribution...
  SNV Position Distribution:
  Exon: 3788 (45.75%)
  Intron: 972 (11.74%)
  5' UTR: 586 (7.08%)
  3' UTR: 4097 (49.49%)
  Complex (in-gene): 1464 (17.68%)
  In-gene total: 7937 (95.87%)
  Intergenic total: 342 (4.13%)
  Total SNV Positions: 8279

  Calculating non-powerful SNV distribution...
  SNV Position Distribution:
  Exon: 3846 (8.33%)
  Intron: 32143 (69.66%)
  5' UTR: 1246 (2.70%)
  3' UTR: 3679 (7.97%)
  Complex (in-gene): 1323 (2.87%)
  In-gene total: 39566 (85.74%)
  Intergenic total: 6580 (14.26%)
  Total SNV Positions: 46146
  ...
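Each percentage is the category count divided by the total number of SNV positions in that group. A minimal Python sketch of that arithmetic follows, using the powerful SNV counts from the sample output; `percent_of_total` is a hypothetical helper name, and two-decimal formatting is assumed from the sample.

```python
def percent_of_total(count, total):
    """Format count as a percentage of total with two decimals."""
    return f"{100.0 * count / total:.2f}%"


# Powerful SNV counts taken from the sample output.
total = 8279
counts = {"Exon": 3788, "Intron": 972, "In-gene total": 7937,
          "Intergenic total": 342}

for name, count in counts.items():
    print(f"{name}: {count} ({percent_of_total(count, total)})")
```

Since the exon, intron and UTR percentages are all taken over the same total, they can add up to more than the in-gene percentage whenever complex SNVs exist.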
If pwr_snv_details and ordinary_snv_details are given, two additional output files are produced, providing the detailed categories of all individual in-gene SNVs. E.g.
  chr   pos       category
  chr1  68586620  INTRON
  chr1  68591173  3'UTR;
  chr1  68590177  INTRON
  chr1  68608003  INTRON
  chr1  68591253  3'UTR;
  chr1  68591405  3'UTR;
  chr1  68624878  EXON;
  chr1  68589935  INTRON
  chr1  68585708  INTRON
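A details file of this shape can be tallied with a few lines of Python. The sketch below is illustrative only and makes two assumptions not confirmed by the text: that fields are separated by whitespace, and that the trailing ";" marks a ";"-separated list of categories (so a complex SNV would carry several labels). `parse_details` is a hypothetical helper name.

```python
from collections import Counter

# The example rows shown above.
SAMPLE = """chr pos category
chr1 68586620 INTRON
chr1 68591173 3'UTR;
chr1 68590177 INTRON
chr1 68608003 INTRON
chr1 68591253 3'UTR;
chr1 68591405 3'UTR;
chr1 68624878 EXON;
chr1 68589935 INTRON
chr1 68585708 INTRON
"""


def parse_details(text):
    """Return a list of (chromosome, position, [category labels])."""
    records = []
    lines = text.strip().splitlines()
    for line in lines[1:]:                      # skip the header row
        chrom, pos, category = line.split()
        # Assumption: ";" separates labels; drop the empty trailing piece.
        labels = [label for label in category.split(";") if label]
        records.append((chrom, int(pos), labels))
    return records


records = parse_details(SAMPLE)
tally = Counter(label for _, _, labels in records for label in labels)
print(tally)
```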
asarp, fileParser, snpParser, MyConstants
This pipeline is free software; you can redistribute it and/or modify it given that the related works and authors are cited and acknowledged.
This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.
Cyrus Tak-Ming CHAN
Xiao Lab, Department of Integrative Biology & Physiology, UCLA