NAME

filePaser.pl -- All the sub-routines for getting and parsing input files (NOT involving SNV handling) in the ASARP pipeline.

SYNOPSIS

	require "fileParser.pl";

	# input arguments: $outputFile--output, 
	# $configs--input configuration file, $params--parameter file
	my ($outputFile, $configs, $params) = getArgs(@ARGV); 
	my ($snpF, $bedF, $rnaseqF, $xiaoF, $splicingF, $estF) = getRefFileConfig($configs);
	my ($POWCUTOFF, $SNVPCUTOFF, $FDRCUTOFF, $ASARPPCUTOFF, 
	$NEVCUTOFFLOWER, $NEVCUTOFFUPPER, $ALRATIOCUTOFF) = getParameters($params);

	# read the transcript annotation file
	my $transRef = readTranscriptFile($xiaoF);
	
	#get alternative initiation/termination (AI/AT) events from transcripts
	my $altRef = getGeneAltTransEnds($transRef); 

	# get indices of gene transcript starts and gene names (prepared also for SNVs)
	my ($genesRef, $geneNamesRef) = getGeneIndex($transRef); 

	# read all annotations, optionally rna-seq and est, events and compile them
	my $allEventsListRef = readAllEvents($splicingF, $rnaseqF, $estF, $transRef, $geneNamesRef);
	my $splicingRef = compileGeneSplicingEvents($genesRef, values %$allEventsListRef);

DESCRIPTION

This perl file contains all the sub-routines that handle the input arguments, read configuration files, and parse all the annotations and events that are input to be matched with SNVs to discover ASE/ASARP. SNVs are handled separately in snpParser.

After the processing shown above, the alternative processing events are in the form shown in the figure above, where SNVs (the red dot) will be matched and later Normalized Expression Values can be calculated (NEVs derived from the read counts in orange) to filter alternative patterns in snpParser.

Sub-routines (major)

These are the major (interface) sub-routines that will be used in correlation with SNVs in the whole pipeline. Read them one by one as they are quite procedural.

getRefFileConfig

get all input annotation/event file/folder paths contained in the configuration file.

 input: $configs --configuration file, check out default.config for the formats.
 
 output: ($snpF, $bedF, $rnaseqF, $xiaoF, $splicingF, $estF) 
 --SNV list file path ($snpF), 
 --the bedgraph folder path and the specified file extension separated by tab ($bedF), 
   user-specified bedgraph extension can be provided in the configuration file: 
   e.g. bedext<TAB>sam.bed
 --rna-seq.event file path ($rnaseqF, optional: '' returned if not provided in $configs), 
 --transcript annotation file path ($xiaoF), 
 --annotation.event path ($splicingF), 
 --est.event file path ($estF, optional: '' returned if not provided in $configs)

The default config file as an example can be found in the project default.config.

File formats:

Annotation file format: Transcript and gene annotation specified in $xiaoF

The example file by default, data/hg19.merged.to.ensg.all.tx.03.18.2011.txt(.gz), was created by merging ensembl Refseq, UCSC knowngene, Gencode gene, and Vegagene.

Format (tab delimited): ID, chr, strand, txStart, txEnd, cdsstart, cdsend, exoncount, exonstarts, exonends, genename, cdsstartstat,cdsendstat

IMPORTANT: all coordinates are hg19, 0-based start and 1-based end coordinates (UCSC convention) **in this file only???**.

Event (splicing event) file formats:

Annotation events specified in $splicingF (example: data/annotation.event)

The file contains splicing events as annotated in the above file ($xiaoF). The format is the same as that for rnaseq_event. (1-based start and end)

RNA-Seq events specified in $rnaseqF (example: data/rnaseq.event)

The file contains splicing events as determined by our RNA-seq data. It lists the events for each gene. The format of the events is, EVENT, chromosome, genename, strand, event_region, flanking_region_1, flanking_region_2, where *_region are in the format of starting_coordinate-ending_coordinate. (1-based start and end)

EST events specified in $estF (example: data/est.event)

The file contains splicing events as determined from hg19 EST and cDNA data. The format is tab-delimited as: event_type, event_name, starting_coordinate, ending_coordinate. (1-based start and end)

getParameters

get all the numeric parameters including p-value cutoffs, NEV lower/upper thresholds, and the allelic ratio difference threshold.

 input: $params --configuration file for the parameters

 output: ($POWCUTOFF, $SNVPCUTOFF, $FDRCUTOFF, $ASARPPCUTOFF, $NEVCUTOFFLOWER, $NEVCUTOFFUPPER, $ALRATIOCUTOFF)
 --read count cutoff for powerful SNVs ($POWCUTOFF),
 --Chi-Squared Test p-value cutoff for individual SNVs ($SNVPCUTOFF) 
   NOTE: it will be ignored if $FDRCUTOFF is provided, see below:
 --FDR cutoff ($FDRCUTOFF, e.g. 0.05 for FDR <= 5%), according to which 
   a p-value cutoff will be selected using the Benjamini-Hochberg method in R, 
   and thus any provided $SNVPCUTOFF will be ignored
 --Fisher's Exact Test p-value cutoff for target-control SNV pairs in ASARP ($ASARPPCUTOFF),
 --NEV lower and upper cutoffs (excl.) ($NEVCUTOFFLOWER, $NEVCUTOFFUPPER),
 --allelic ratio difference cutoff for target-control SNV pairs in ASARP ($ALRATIOCUTOFF)

The default parameter config file as an example can be found in default.param. All parameters are required and have to be set.

readTranscriptFile

read the transcript annotation file

 input: $xiaoF --file path of the transcript (also gene) annotation file. 

 output: $transRef --reference to an array of the parsed transcripts.

The results can be printed out using utility sub using key 'trans': printListByKey($transRef, 'trans');

getGeneIndex

intermediate sub to get indices of gene transcript starts and gene names and gene ends

 input: $transRef (see above)

 output: $genesRef --reference to the index for every gene's minimal transcript start

 	 $geneNamesRef --reference to the gene names

	 $geneEndsRef --reference to the gene ends (optionally used)

getGeneAltTransEnds

get alternative initiation/termination (AI/AT) events from transcripts

 input: $transRef (see above)

 output: $altRef--reference to the AI/AT events

The AI/AT results can be printed out using utility sub: printAltEnds($altRef);

readAllEvents

read all annotations, optionally rna-seq and est, splicing events

 input: ($splicingF, $rnaseqF, $estF, $transRef, $geneNamesRef);
 --event files (see above): anno ($splicingF), rna ($rnaseqF), est ($estF)
 --see above for $transRef, $geneNamesRef
 
 output: $allEventsListRef --the reference to a hash of 
 all 'anno', optionally 'rna', 'est' events parsed
 quoted are keys to access them in the hash

compileGeneSplicingEvents

compile events from different files and arrange them according to genes

 input: ($genesRef, values %$allEventsListRef)
 --see above for $genesRef
 --the 2nd argument is the event type(s) provided, e.g. 'anno', 'rna', 'est'
 output: $splicingRef --reference to all the splicing events

Utility subs

Utility sub-routines are implemented to get or display parsing annotations/events for intermediate usage or checking.

Parsing results are usually arranged in an internal data structure whose reference is returned.

The 1st layer is a hash with 3 keys: the data 'x', its companian index 'x_idx', and 'type' indicating the tag or telling briefly what the data is about.
The 2nd layer is an array with number of elements equal to the chromosome number (see constant $CHRNUM in MyConstants).
The 3rd layer (i.e. each element representing one chromosome) is the reference to a hash containing specific information of that structure. E.g. in the case of transcript annotation, the keys are the starting locations of the transcripts, and the values are the transcript information where multiple transcripts are separated by "\t".

printListByKey

the general print procedure for the arrays of hash+idx structures used

  input: ($ref, $key) --the reference to a structure and the corresponding key for the structure.
  output: print out the content of the structure chromosome by chromosome.

e.g. printListByKey($transRef, 'trans'); to print out all transcripts;

or printListByKey($snpRef, 'powSnps'); to print out all powerful SNVs (SNPs)

getListTag

get list tag (type of resources)

  input: the reference to the structure of events/annotations of a certain type
  output: the tagged type (or simply tag) for this list

getListByKeyChr

get the references to hash+idx of a particular chromsome number from an internal structure

  input: ($ref, $key, $chr) 
  --reference to the structure ($ref), 
  --the corresponding key of the structure ($key), 
  --the chromosome number specified ($chr in the range of 1 - $CHRNUM)
  
  output: ($hsArry[$chr], $idxArry[$chr]) 
  --the references to hash+idx of a particular chromsome number

This utility sub is frequently used in the intermediate processing of all kinds of events, annotations and SNVs.

binarySearch

serve as a general utility to get the location of an element in an ordered list (a good example is the indices _idx)

binary search for insert (or location), including left, right insert (location) cases, similar to bisectin Python. Assume the elements in the list are sorted in an ascending order and there is no duplicate.

  input: ($listRef, $x, $imin, $imax, $type) 
  --the reference to a list, the element to be searched ($x), 
  --starting index ($imin), e.g. 0, ending index ($imax), e.g. size of the list - 1,
  --$type: 'left' (default) or 'right' mimicing left or right bisect. 
  'left': if $x matches an element, the return location will be left to the element; 
  'right': the return location right to the element if matched

  output: ($loc, $flag) 
  --location of $x in the list (NOTE: range is [0, size of list], 0-based, 
  i.e. there can be size of list + 1 possible locations for an element)
  --unmatch flag: 1 means **unmatch**, 0 means **match**

Other utility subs

Kindly go to have a look at the source: fileParser.pl. There will be some comments around. Good luck!

COPYRIGHT

This pipeline is free software; you can redistribute it and/or modify it given that the related works and authors are cited and acknowledged.

This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.

AUTHOR

Cyrus Tak-Ming CHAN

Xiao Lab, Department of Integrative Biology & Physiology, UCLA