Files

This page describes the input and output files and their formats.

Input Files

confg_file and parameter_file contain the input and parameter settings for ASARP.

Configuration File

config_file is the input configuration file which contains all the input file keys and their paths. The format is <key>tab<path>. Line starting with # are comments. Example: ../default.config. The keys and paths are explained below ([opt] indicates an optional input file):

 bedfolder	the path to the bedgraph files chr*.bedgraph
 bedext		the extension after chr*. for the bedgraph files, 
 		default is bedgraph [opt]
 xiaofile	the gene annotation file, we provide a pre-processed file 
 		for hg19 merging ensembl Refseq, UCSC knowngene, etc.
 splicingfile	the annotated splicing events derived from xiaofile
 		we provide a pre-processed file for hg19
 estfile	splicing events from ESTs [opt]
 		we provide a pre-processed file for hg19

 strandflag	whether the RNA-Seq is strand-specific [opt]
 		0 or no input: non-strand-specific (default)
		1: strand-specific

 rnaseqfile	user specified splicing events [opt], 
 		which has the same format as splicingfile and
		can be derived from their RNA-Seq data and annotations

Parameter File

parameter_file is the parameter configuration file which contains all the thresholds and cutoffs, e.g. p-value cuttoffs and bounds for absolute allelic ratio difference. The format of each line is <parameter>tab<value>. Lines starting with # are comments. The default is: ../default.param

 powerful_snv	powerful SNV count cutoff
 fdr		FDR cutoff for ASE genes [opt]
 		If input, the ASE p-value threshold
 		p_chi_snv will be ignored
 p_chi_snv	p-value cutoff for Chi-squared test [opt]
 		to shortlist an ASE SNV
 		Either fdr or p_chi_snv needs to be input
 p_fisher_pair	p-value cutoff for Fisher exact test.

 # NEV upper cutoff for alternatively spliced regions
 nev_lower	NEV lower bound (>=0)
 nev_upper	NEV upper bound (<1)

 ratio_diff	absolute allelit ratio difference cutoff 
 		for a target-control SNV pair

For preparation of the input files used in config_file, see the pre-processing section: rmDup, mergeSam, procReads

SAM files

Both SAM and JSAM files are accepted. Besides the standard SAM attributes, JSAM files have different extra attributes from common aligners to store SNV and editing information. For more details of JSAM files, see procReadsJ.

For SAM/JSAM format, only SAM files for uniquely mapped reads (read pairs) should be used. Unmapped reads (0x4) should be removed pior to the preprocessing steps: rmDup, mergeSam, procReads. SAM file header lines are ignored. In paired-end cases, mapped read pair1 should be followed by mapped read pair2 immediately and their IDs should be identical or differ at most by /1 and /2. For reads with insertion/deletion, they will be discarded according to the CIGAR string parsing.

In a strand specific setting, 0x10 is used to identify the reads mapped in an anti-sense manner. However, if the strand flag is set to be 2 (i.e. pair 2 is sense and pair 1 is anti-sens), then 0x10 will be considered as sense. All other reads with 0x10 unset will be considered as the other strand accordingly. Note that for paired-end strand specific cases, typically the /1 and /2 pair will have complementary strand flags (e.g. 0 for /1 and 16 for /2). Therefore, it is important to make sure the SAM/JSAM files contain only uniquely mapped reads (pairs).

SNV list

SNV list can be either the input to the pre-processing program procReads, or the output of procReads containing allele read counts extracted from sam files. Each line is space separated, with the following attributes

 chromosome
 location 
 ref_allele>alt_allele 
 dbSnp_id (na if not available) 
 ref:alt:wrnt (read counts of reference, alternative, and other mismatch alleles)
 [strand] (optional: for strand-specific data)

When used as input to procReads, only the first 4 attributes are needed and the extra attributes are read through and ignored. Note that strand is not used even for strand-specific data because only one SNV is assumed at one locus. Example:

 chr10 1046712 G>A rs2306409

When used as output of procReads (also input of ASARP prediction), the read counts are required. In strand-specific cases, the strand is also required, and note that one SNV locus may have reads in both + and - strands and are treated as two different cases.

read_counts are RNA read counts obtained from the SAM (a.k.a the bedgraph) file. ref indicates the read count of the reference allele, alt the alternative allele, wrnt (wrong nt) indicates other alleles neither ref nor alt. It is required that alt > wrnt, otherwise that SNV is discarded (dicarded on a particular strand if strand-specific option is on). Output SNV examples in a strand-specific setting would look like:

 chr10 1046712 G>A rs2306409 30:23:0 +

The compatible format of SNV lists as input or output enables chaining multiple components in the ASARP pipeline to analyze specific types of SNVs. For example, powerful SNVs output by aseSnvs can be used as input for SNV distribution analysis by snp_dist. SNVs selected from certain cell-line or tissue can be input to procReads with different bedgraph track files to obtain their allele read counts in another cell-line or tissue, which can be further used for comparable ASARP analysis.

Gene (transcript) annotation file

This is the xiaofile in the config_file, representing all gene transcript annotation. It is almost the same as the UCSC RefSeq file format, except that the #bin field and header line are removed, and the last fields score, name2 and exonFrames are removed.

Format (tab delimited):

 ID, chr, strand, txStart, txEnd, cdsstart, cdsend, exoncount, exonstarts, exonends, genename, cdsstartstat,cdsendstat

IMPORTANT: all coordinates are hg19, 0-based start and 1-based end coordinates (UCSC tradition) in this file only.

Examples look like this:

 ENST00000237247 chr1    +       66999065        67210057        67000041        67208778        27      66999065,66999928,67091529,67098752,67099762,67105459,67108492,67109226,67126195,67133212,67136677,67137626,67138963,67142686,67145360,67147551,67149789,67154830,67155872,67161116,67184976,67194946,67199430,67205017,67206340,67206954,67208755,        66999090,67000051,67091593,67098777,67099846,67105516,67108547,67109402,67126207,67133224,67136702,67137678,67139049,67142779,67145435,67148052,67149870,67154958,67155999,67161176,67185088,67195102,67199563,67205220,67206405,67207119,67210057,   SGIP1   cmpl    cmpl
 ENST00000371039 chr1    +       66999274        67210768        67000041        67208778        22      66999274,66999928,67091529,67098752,67105459,67108492,67109226,67136677,67137626,67138963,67142686,67145360,67154830,67155872,67160121,67184976,67194946,67199430,67205017,67206340,67206954,67208755,     66999355,67000051,67091593,67098777,67105516,67108547,67109402,67136702,67137678,67139049,67142779,67145435,67154958,67155999,67160187,67185088,67195102,67199563,67205220,67206405,67207119,67210768,     SGIP1   cmpl    cmpl

In our evaluation experiments consistent with the previous work, we use data/hg19.merged.to.ensg.all.tx.03.18.2011.txt. It was created by merging ensembl Refseq, UCSC knowngene, Gencode gene, and Vegagene.

Pre-processed events

Events files represent all potential alternatively processed regions extracted from annotations or RNA-Seq data. For human (hg19), we have generated pre-processed event files bundled with the pipeline, namely splicingfile and estfile, from the gene annotations (xiaofile) and older expressed sequence tag (EST) analysis respectively. splicingfile is required as it is the most accurate, and estfile is optional to provide more sensitivity. User can also generate their own rnaseqfile using their RNA-Seq data (or even other resources) and existing annotations to increase sensitivity, as long as the format is compatible with splicingfile.

Their formats are illustrated as follows.

splicingfile and rnaseqfile contain splicing events (alternatively processed regions) as determined respectively by annotations and user specified resources such as RNA-seq data. They have the same format, while the former is required and the latter optional. Users can generate their own events to replace the pre-processed ones at their preference. For each gene, there is a header line where the gene symbol name and the constitutive exon coordinates are first listed. The format is >gene_symbol<tab>const_start1-const_end1;const_start2-const_end2;...const_startn-const_endn

The header line is followed by, if any, event lines starting with the keyword 'EVENT'. The format of the events is EVENT, chromosome, genename, strand (1 for + and -1 for -), event_region, flanking_region_1, flanking_region_2, where *_region are in the format of starting_coordinate-ending_coordinate (1-based start and end). For example (no events for CARTPT):

 >ADAR	154574680-154574724;154574861-154575102
 EVENT	chr1	ADAR	-1	154562660-154562737	154562233-154562404	154562738-154562885
 EVENT	chr1	ADAR	-1	154569415-154569471	154569281-154569414	154569599-154569743
 EVENT	chr1	ADAR	-1	154574725-154574860	154574680-154574724	154574861-154575102
 ...
 >CARTPT	71015707-71015790
 >CAST	96083049-96083096
 EVENT	chr5	CAST	1	95998056-95998201	95865525-95865584	96011243-96011305
 EVENT	chr5	CAST	1	95998056-95998201	95997778-95997869	96011243-96011305
 EVENT	chr5	CAST	1	96058343-96058402	96038561-96038619	96073553-96073651

estfile contains splicing events as determined from hg19 EST and cDNA data. The format is tab-delimited as: event_type (ASS/SE/RI, etc.), event_name (chr:pos:strand), starting_coordinate, ending_coordinate (1-based start and end). In the EST events, only the event regions are kept track of, and no flanking regions are recorded, so constitutive exons in splicingfile will be used in NEV calculation. Examples:

 ASS	chr20:61924538:+	61943772	61943775
 ASS	chr20:61924538:+	61946753	61946755
 SE	chr20:61924538:+	61956621	61956716
 SE	chr20:61924538:+	61953410	61953463
 RI	chr22:32058418:-	32017128	32017320
 RI	chr22:31795509:+	32014212	32014300

COPYRIGHT

This pipeline is free software; you can redistribute it and/or modify it given that the related works and authors are cited and acknowledged.

This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.

AUTHOR

Cyrus Tak-Ming CHAN

Xiao Lab, Department of Integrative Biology & Physiology, UCLA