NAME

snp_distri.pl -- To calculate the SNV (position) distribution in exons, UTRs, introns, etc., according to transcript annotations. A SNV is discarded (not contributing to the total SNV number) if it is covered by < 2 RNA-Seq reads.

It also serves as an introductory application script for getting familiar with the ASARP pipeline (asarp).

SYNOPSIS

  perl snp_distri.pl output_file snv_file transcript_file [powerful_snv_cutoff pwr_snv_details ordinary_snv_details [is_detailed]]

snv_file: a SNV list, see the format description in snpParser

powerful_snv_cutoff: an optional cutoff to categorize SNVs into powerful (>= powerful_snv_cutoff) and non-powerful types. Default: 20

transcript_file: Transcript and gene annotation file, see the format description in fileParser

The optional arguments must be given in the order shown.

pwr_snv_details and ordinary_snv_details are the output files for the detailed SNV categories of powerful and non-powerful (ordinary) SNVs respectively.

is_detailed: default: 0; when it is set to 1 (after both *_snv_details arguments), detailed information on the SNVs is output. The detail format: chr;geneName;snvPos;type

output_file: Tab-delimited counts of SNV positions in different gene regions. Headers are included, as illustrated in the following example:

	Type    Exon    Intron  5'UTR   3'UTR   Complex In-gene Intergenic      Total
	Powerful(>=20)  3788    972     586     4097    1464    7937    342     8279
	Non-powerful(<20)       3846    32143   1246    3679    1323    39566   6580    46146
	Overall 7634    33115   1832    7776    2787    47503   6922    54425
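
The summary file can be checked mechanically. The following is a minimal sketch in Python (an illustration only, not part of the Perl pipeline); the function names read_summary and check_summary are hypothetical. It assumes the columns are separated by tabs as stated above, and it tests two identities that the sample figures satisfy: In-gene + Intergenic = Total in every row, and the Overall row equals the column-wise sum of the other rows.

```python
import csv
import io

# Sample summary copied from the example above, with tabs as column separators.
SUMMARY_SAMPLE = (
    "Type\tExon\tIntron\t5'UTR\t3'UTR\tComplex\tIn-gene\tIntergenic\tTotal\n"
    "Powerful(>=20)\t3788\t972\t586\t4097\t1464\t7937\t342\t8279\n"
    "Non-powerful(<20)\t3846\t32143\t1246\t3679\t1323\t39566\t6580\t46146\n"
    "Overall\t7634\t33115\t1832\t7776\t2787\t47503\t6922\t54425\n"
)


def read_summary(handle):
    """Return {row label: {column name: count}} from a tab-delimited summary."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    table = {}
    for row in reader:
        label = row.pop("Type")
        table[label] = {column: int(value) for column, value in row.items()}
    return table


def check_summary(table):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    for label, counts in table.items():
        if counts["In-gene"] + counts["Intergenic"] != counts["Total"]:
            problems.append(f"{label}: In-gene + Intergenic != Total")
    overall = table.get("Overall")
    if overall is not None:
        parts = [counts for label, counts in table.items() if label != "Overall"]
        for column, value in overall.items():
            if sum(part[column] for part in parts) != value:
                problems.append(f"Overall {column} is not the sum of the other rows")
    return problems


summary = read_summary(io.StringIO(SUMMARY_SAMPLE))
print(check_summary(summary))
```

To check a real run, pass an open file handle for output_file to read_summary instead of the in-memory sample.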

DESCRIPTION

The application is SNV position (ref genome location) oriented. In other words, if (rarely, and not seen in our data) one position contains multiple SNVs, it is still counted as one SNV (position). A SNV may overlap multiple transcripts of multiple genes. The rules to determine the categories of a SNV are as follows:

If the SNV overlaps any transcript exon block, it is considered as in (coding) exons, regardless of how many times it overlaps.

If the SNV overlaps any 5'/3' UTR, it is considered as in 5'/3' UTRs (non-coding), regardless of how many times it overlaps.

Only when a SNV never overlaps any exon or 5'/3' UTR, and its genome location is within the span of some transcript, is it considered as in introns, regardless of how many times it overlaps. As a result, intron SNVs are mutually exclusive with exon and 5'/3' UTR SNVs.

A SNV that falls into more than one category is denoted as complex; that is, complex is the union of SNVs with any combination of exon and 5'/3' UTR types. In-gene SNVs are the union of all the categories: exon, 5' UTR, 3' UTR, intron. Therefore, the sum of the intron, exon, 5' UTR and 3' UTR SNV counts will be larger than the total in-gene SNV count if complex SNVs exist. In-gene and intergenic SNVs are mutually exclusive.

For any of the above cases, a SNV is considered in-gene. If a SNV is not in-gene, it is considered in the intergenic regions.
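
The rules above can be sketched as a small decision function. This is a Python illustration under stated assumptions, not the script's Perl code: the function classify_snv and its four boolean inputs are hypothetical, and they stand for the overlap tests already summarized over all transcripts. The labels EXON, INTRON and 3'UTR follow the sample details output; 5'UTR and INTERGENIC are assumed by analogy.

```python
def classify_snv(in_exon, in_utr5, in_utr3, in_transcript_span):
    """Return (categories, is_complex) for one SNV position.

    Each flag is True if the position overlaps at least one such region in
    any transcript; how often it overlaps does not matter.
    """
    categories = set()
    if in_exon:
        categories.add("EXON")
    if in_utr5:
        categories.add("5'UTR")
    if in_utr3:
        categories.add("3'UTR")
    # Intron applies only when no exon or UTR overlaps the position,
    # yet the position still lies inside some transcript span.
    if not categories and in_transcript_span:
        categories.add("INTRON")
    # Anything left without a category is outside every gene.
    if not categories:
        categories.add("INTERGENIC")
    # Complex means more than one of exon, 5' UTR and 3' UTR at once.
    is_complex = len(categories) > 1
    return categories, is_complex


print(classify_snv(True, False, True, True))
print(classify_snv(False, False, False, True))
```

Because INTRON and INTERGENIC are only assigned when the set is otherwise empty, a complex SNV can only combine exon and UTR labels, which matches the description of the Complex column.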

The application also outputs percentage (over the total SNV positions) summaries for powerful, non-powerful and overall SNV distributions. Sample output:

	Calculating powerful SNV distribution...
	SNV Position Distribution:
	Exon: 3788 (45.75%)
	Intron: 972 (11.74%)
	5' UTR: 586 (7.08%)
	3' UTR: 4097 (49.49%)
	Complex (in-gene): 1464 (17.68%)
	In-gene total: 7937 (95.87%)
	Intergenic total: 342 (4.13%)
	Total SNV Positions: 8279

	Calculating non-powerful SNV distribution...
	SNV Position Distribution:
	Exon: 3846 (8.33%)
	Intron: 32143 (69.66%)
	5' UTR: 1246 (2.70%)
	3' UTR: 3679 (7.97%)
	Complex (in-gene): 1323 (2.87%)
	In-gene total: 39566 (85.74%)
	Intergenic total: 6580 (14.26%)
	Total SNV Positions: 46146
	...
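
The percentages in the sample are each count divided by the total number of SNV positions in that group, shown to two decimal places. A minimal Python sketch of that arithmetic follows; the function format_distribution is hypothetical and only reproduces the layout of the sample, it is not taken from the script.

```python
def format_distribution(counts, total):
    """Format (label, count) pairs as 'label: count (xx.xx%)' lines."""
    lines = ["SNV Position Distribution:"]
    for label, count in counts:
        percent = 100.0 * count / total
        lines.append(f"{label}: {count} ({percent:.2f}%)")
    lines.append(f"Total SNV Positions: {total}")
    return lines


# Counts of the powerful SNVs from the sample output above.
powerful = [
    ("Exon", 3788),
    ("Intron", 972),
    ("5' UTR", 586),
    ("3' UTR", 4097),
    ("Complex (in-gene)", 1464),
    ("In-gene total", 7937),
    ("Intergenic total", 342),
]
report = format_distribution(powerful, 8279)
print("\n".join(report))
```

Since complex SNVs are counted in more than one category, the exon, intron and UTR percentages need not add up to the in-gene percentage.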

If pwr_snv_details and ordinary_snv_details are given, two additional output files are written, providing the detailed categories of all individual in-gene SNVs. E.g.

 chr	pos	category
 chr1	68586620	INTRON
 chr1	68591173	3'UTR;
 chr1	68590177	INTRON
 chr1	68608003	INTRON
 chr1	68591253	3'UTR;
 chr1	68591405	3'UTR;
 chr1	68624878	EXON;
 chr1	68589935	INTRON
 chr1	68585708	INTRON
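
A details file in this layout can be tallied per category. The Python sketch below is illustrative only, with a hypothetical function name count_categories. It assumes three tab-separated columns and, based solely on the trailing ";" seen in the sample, that the category column may hold several labels separated by ";".

```python
from collections import Counter

# Rows copied from the sample above, with tabs as column separators.
DETAILS_SAMPLE = """chr\tpos\tcategory
chr1\t68586620\tINTRON
chr1\t68591173\t3'UTR;
chr1\t68590177\tINTRON
chr1\t68608003\tINTRON
chr1\t68591253\t3'UTR;
chr1\t68591405\t3'UTR;
chr1\t68624878\tEXON;
chr1\t68589935\tINTRON
chr1\t68585708\tINTRON
"""


def count_categories(lines):
    """Count how many SNV rows carry each category label."""
    counts = Counter()
    for line in lines:
        fields = line.strip().split("\t")
        # Skip the header row and anything that is not a 3-column record.
        if len(fields) != 3 or fields[0] == "chr":
            continue
        # Assumption: labels are ';'-separated, with an optional trailing ';'.
        for label in fields[2].split(";"):
            if label:
                counts[label] += 1
    return counts


category_counts = count_categories(DETAILS_SAMPLE.splitlines())
print(category_counts)
```

For a real file, pass an open file handle for pwr_snv_details or ordinary_snv_details in place of the sample lines.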

SEE ALSO

asarp, fileParser, snpParser, MyConstants

COPYRIGHT

This pipeline is free software; you can redistribute it and/or modify it given that the related works and authors are cited and acknowledged.

This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.

AUTHOR

Cyrus Tak-Ming CHAN

Xiao Lab, Department of Integrative Biology & Physiology, UCLA