This page describes the input and output files and their formats.
config_file and parameter_file contain the input and parameter settings for ASARP.
config_file is the input configuration file, which contains all the input file keys and their paths. The format of each line is <key>tab<path>. Lines starting with # are comments. Example: ../default.config. The keys and paths are explained below ([opt] indicates an optional input file):
bedfolder      the path to the bedgraph files chr*.bedgraph
bedext         the extension after chr*. for the bedgraph files, default is bedgraph [opt]
xiaofile       the gene annotation file; we provide a pre-processed file for hg19 merging ensembl, Refseq, UCSC knowngene, etc.
splicingfile   the annotated splicing events derived from xiaofile; we provide a pre-processed file for hg19
estfile        splicing events from ESTs [opt]; we provide a pre-processed file for hg19
strandflag     whether the RNA-Seq is strand-specific [opt]
               0 or no input: non-strand-specific (default)
               1: strand-specific
rnaseqfile     user specified splicing events [opt], which has the same format as splicingfile and can be derived from the user's RNA-Seq data and annotations
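As an illustration of the <key>tab<path> layout, the following is a minimal Python sketch of a reader for such a file. It is not part of the ASARP pipeline: the function name parse_config and all paths in the example are hypothetical, and treating bedfolder, xiaofile and splicingfile as the required keys is an assumption based on the [opt] markers above.

```python
def parse_config(text):
    """Parse <key>tab<path> lines; lines starting with # are comments."""
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, path = line.partition("\t")
        config[key.strip()] = path.strip()
    return config


# Hypothetical paths, for illustration only.
example = (
    "# input files\n"
    "bedfolder\t/data/sample.bedgraphs\n"
    "xiaofile\t/data/annotation.txt\n"
    "splicingfile\t/data/annotation.events\n"
    "strandflag\t1\n"
)
config = parse_config(example)

# Keys without an [opt] marker are assumed to be required here.
missing = [k for k in ("bedfolder", "xiaofile", "splicingfile") if k not in config]
print(config, missing)
```

A check like the one on the last lines can catch a missing required key before any pre-processing output is read.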
parameter_file is the parameter configuration file, which contains all the thresholds and cutoffs, e.g. p-value cutoffs and bounds for the absolute allelic ratio difference. The format of each line is <parameter>tab<value>. Lines starting with # are comments. The default is: ../default.param
powerful_snv    powerful SNV count cutoff
fdr             FDR cutoff for ASE genes [opt]
                If input, the ASE p-value threshold p_chi_snv will be ignored
p_chi_snv       p-value cutoff for Chi-squared test to shortlist an ASE SNV [opt]
                Either fdr or p_chi_snv needs to be input
p_fisher_pair   p-value cutoff for Fisher exact test
# NEV upper cutoff for alternatively spliced regions
nev_lower       NEV lower bound (>=0)
nev_upper       NEV upper bound (<1)
ratio_diff      absolute allelic ratio difference cutoff for a target-control SNV pair
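The constraints stated above (either fdr or p_chi_snv is given, nev_lower >= 0, nev_upper < 1) can be checked before a run. The sketch below is one possible way to do so in Python; parse_params and check_params are hypothetical helpers, and the numeric values are illustrative only, not the contents of ../default.param.

```python
def parse_params(text):
    """Parse <parameter>tab<value> lines; lines starting with # are comments."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("\t")
        params[name.strip()] = float(value)
    return params


def check_params(params):
    """Return a list of violated constraints (empty if none)."""
    problems = []
    if "fdr" not in params and "p_chi_snv" not in params:
        problems.append("either fdr or p_chi_snv needs to be input")
    lower = params.get("nev_lower")
    upper = params.get("nev_upper")
    if lower is not None and lower < 0:
        problems.append("nev_lower must be >= 0")
    if upper is not None and upper >= 1:
        problems.append("nev_upper must be < 1")
    return problems


# Illustrative values only (assumed, not the shipped defaults).
example = (
    "powerful_snv\t20\n"
    "p_chi_snv\t0.05\n"
    "p_fisher_pair\t0.05\n"
    "# NEV bounds\n"
    "nev_lower\t0.2\n"
    "nev_upper\t0.8\n"
    "ratio_diff\t0.2\n"
)
params = parse_params(example)
print(check_params(params))
```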
For preparation of the input files used in config_file, see the pre-processing section: rmDup, mergeSam, procReads
Both SAM and JSAM files are accepted. Besides the standard SAM attributes, JSAM files have different extra attributes from common aligners to store SNV and editing information. For more details of JSAM files, see procReadsJ.
For SAM/JSAM format, only SAM files for uniquely mapped reads (read pairs) should be used. Unmapped reads (0x4) should be removed prior to the pre-processing steps: rmDup, mergeSam, procReads. SAM file header lines are ignored. In paired-end cases, mapped read pair1 should be followed immediately by mapped read pair2, and their IDs should be identical or differ at most by /1 and /2. Reads with insertions or deletions are discarded during CIGAR string parsing.
In a strand-specific setting, 0x10 is used to identify the reads mapped in an anti-sense manner. However, if the strand flag is set to 2 (i.e. pair 2 is sense and pair 1 is anti-sense), then 0x10 will be considered as sense. All other reads, with 0x10 unset, will be considered as the other strand accordingly. Note that for paired-end strand-specific cases, typically the /1 and /2 reads of a pair will have complementary strand flags (e.g. 0 for /1 and 16 for /2). Therefore, it is important to make sure the SAM/JSAM files contain only uniquely mapped reads (pairs).
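The flag rules above can be written as a small Python sketch. It only restates the per-read interpretation of the 0x4 and 0x10 bits described in this section; how the pipeline combines the two reads of a pair is not modeled, and the function names are hypothetical.

```python
UNMAPPED = 0x4   # SAM flag bit: read is unmapped
REVERSE = 0x10   # SAM flag bit: read is mapped to the reverse strand


def is_unmapped(flag):
    """Unmapped reads should be removed before pre-processing."""
    return bool(flag & UNMAPPED)


def read_orientation(flag, strandflag=1):
    """Interpret 0x10 under the strand flag setting.

    strandflag 1: 0x10 set means anti-sense.
    strandflag 2: 0x10 set means sense (pair 2 is sense, pair 1 anti-sense).
    """
    reverse = bool(flag & REVERSE)
    if strandflag == 2:
        return "sense" if reverse else "antisense"
    return "antisense" if reverse else "sense"


print(read_orientation(16, strandflag=1))
print(read_orientation(16, strandflag=2))
```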
An SNV list can be either the input to the pre-processing program procReads, or the output of procReads containing allele read counts extracted from SAM files. Each line is space separated, with the following attributes:
chromosome
location
ref_allele>alt_allele
dbSnp_id (na if not available)
ref:alt:wrnt (read counts of reference, alternative, and other mismatch alleles)
[strand] (optional: for strand-specific data)
When used as input to procReads, only the first 4 attributes are needed and the extra attributes are read through and ignored. Note that strand is not used even for strand-specific data because only one SNV is assumed at one locus. Example:
chr10 1046712 G>A rs2306409
When used as output of procReads (also input of ASARP prediction), the read counts are required. In strand-specific cases, the strand is also required. Note that one SNV locus may have reads on both the + and - strands; these are treated as two different cases.
read_counts are RNA read counts obtained from the SAM (a.k.a. the bedgraph) file. ref indicates the read count of the reference allele, alt that of the alternative allele, and wrnt (wrong nt) that of other alleles which are neither ref nor alt. It is required that alt > wrnt; otherwise the SNV is discarded (discarded on a particular strand if the strand-specific option is on). Output SNV examples in a strand-specific setting would look like:
chr10 1046712 G>A rs2306409 30:23:0 +
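To make the attribute layout concrete, here is a minimal Python sketch that parses one SNV line in either form (with or without read counts and strand) and applies the alt > wrnt requirement. The Snv type and the function names are hypothetical and not part of procReads.

```python
from typing import NamedTuple, Optional, Tuple


class Snv(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    snp_id: str
    counts: Optional[Tuple[int, ...]]  # (ref, alt, wrnt) read counts, if present
    strand: Optional[str]              # "+" or "-", if present


def parse_snv(line):
    """Parse a space-separated SNV line; counts and strand are optional."""
    fields = line.split()
    chrom, pos, alleles, snp_id = fields[:4]
    ref, alt = alleles.split(">")
    counts = None
    strand = None
    if len(fields) > 4:
        counts = tuple(int(n) for n in fields[4].split(":"))
    if len(fields) > 5:
        strand = fields[5]
    return Snv(chrom, int(pos), ref, alt, snp_id, counts, strand)


def passes_wrnt_check(snv):
    """The SNV is kept only if alt > wrnt."""
    _ref_n, alt_n, wrnt_n = snv.counts
    return alt_n > wrnt_n


snv = parse_snv("chr10 1046712 G>A rs2306409 30:23:0 +")
print(snv, passes_wrnt_check(snv))
```

The same parser accepts the four-attribute input form, in which case counts and strand are left as None.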
The compatible format of SNV lists as input or output enables chaining multiple components in the ASARP pipeline to analyze specific types of SNVs. For example, powerful SNVs output by aseSnvs can be used as input for SNV distribution analysis by snp_dist. SNVs selected from certain cell-line or tissue can be input to procReads with different bedgraph track files to obtain their allele read counts in another cell-line or tissue, which can be further used for comparable ASARP analysis.
This is the xiaofile in the config_file, representing all gene transcript annotations. It is almost the same as the UCSC RefSeq file format, except that the #bin field and header line are removed, and the last fields score, name2 and exonFrames are removed.
Format (tab delimited):
ID, chr, strand, txStart, txEnd, cdsstart, cdsend, exoncount, exonstarts, exonends, genename, cdsstartstat, cdsendstat
IMPORTANT: all coordinates are hg19, 0-based start and 1-based end coordinates (UCSC tradition) in this file only.
Examples look like this:
ENST00000237247 chr1 + 66999065 67210057 67000041 67208778 27 66999065,66999928,67091529,67098752,67099762,67105459,67108492,67109226,67126195,67133212,67136677,67137626,67138963,67142686,67145360,67147551,67149789,67154830,67155872,67161116,67184976,67194946,67199430,67205017,67206340,67206954,67208755, 66999090,67000051,67091593,67098777,67099846,67105516,67108547,67109402,67126207,67133224,67136702,67137678,67139049,67142779,67145435,67148052,67149870,67154958,67155999,67161176,67185088,67195102,67199563,67205220,67206405,67207119,67210057, SGIP1 cmpl cmpl
ENST00000371039 chr1 + 66999274 67210768 67000041 67208778 22 66999274,66999928,67091529,67098752,67105459,67108492,67109226,67136677,67137626,67138963,67142686,67145360,67154830,67155872,67160121,67184976,67194946,67199430,67205017,67206340,67206954,67208755, 66999355,67000051,67091593,67098777,67105516,67108547,67109402,67136702,67137678,67139049,67142779,67145435,67154958,67155999,67160187,67185088,67195102,67199563,67205220,67206405,67207119,67210768, SGIP1 cmpl cmpl
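Because this file uses 0-based starts and 1-based ends while the event files use 1-based starts and ends, a reader usually has to convert exon coordinates. The following Python sketch shows one way to parse a tab-delimited annotation line and make that conversion; parse_transcript and the short record are hypothetical and only follow the field order listed above.

```python
def parse_transcript(line):
    """Parse one tab-delimited annotation line into a dict.

    Exon starts are 0-based in this file, so 1 is added to obtain
    1-based inclusive coordinates; exon ends are already 1-based.
    """
    f = line.rstrip("\n").split("\t")
    exon_count = int(f[7])
    # The comma-separated lists end with a trailing comma, hence the filter.
    starts = [int(x) for x in f[8].split(",") if x]
    ends = [int(x) for x in f[9].split(",") if x]
    if len(starts) != exon_count or len(ends) != exon_count:
        raise ValueError("exon count does not match the exon coordinate lists")
    return {
        "id": f[0],
        "chr": f[1],
        "strand": f[2],
        "gene": f[10],
        "exons_1based": [(s + 1, e) for s, e in zip(starts, ends)],
    }


# Hypothetical two-exon record, for illustration only.
record = "TX_EXAMPLE\tchr1\t+\t100\t500\t150\t450\t2\t100,300,\t200,500,\tGENE_X\tcmpl\tcmpl"
tx = parse_transcript(record)
print(tx["exons_1based"])
```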
In our evaluation experiments consistent with the previous work, we use data/hg19.merged.to.ensg.all.tx.03.18.2011.txt. It was created by merging ensembl, Refseq, UCSC knowngene, Gencode gene, and Vegagene.
Event files represent all potential alternatively processed regions extracted from annotations or RNA-Seq data. For human (hg19), we have generated pre-processed event files bundled with the pipeline, namely splicingfile and estfile, from the gene annotations (xiaofile) and older expressed sequence tag (EST) analysis respectively. splicingfile is required as it is the most accurate, and estfile is optional and provides more sensitivity. Users can also generate their own rnaseqfile from their RNA-Seq data (or even other resources) and existing annotations to increase sensitivity, as long as the format is compatible with splicingfile.
Their formats are illustrated as follows.
splicingfile and rnaseqfile contain splicing events (alternatively processed regions) as determined respectively by annotations and user specified resources such as RNA-Seq data. They have the same format, while the former is required and the latter optional. Users can generate their own events to replace the pre-processed ones at their preference.
For each gene, there is a header line where the gene symbol name and the constitutive exon coordinates are first listed. The format is >gene_symbol<tab>const_start1-const_end1;const_start2-const_end2;...const_startn-const_endn
The header line is followed by, if any, event lines starting with the keyword 'EVENT'. The format of the events is EVENT, chromosome, genename, strand (1 for + and -1 for -), event_region, flanking_region_1, flanking_region_2, where *_region are in the format of starting_coordinate-ending_coordinate (1-based start and end). For example (no events for CARTPT):
>ADAR 154574680-154574724;154574861-154575102
EVENT chr1 ADAR -1 154562660-154562737 154562233-154562404 154562738-154562885
EVENT chr1 ADAR -1 154569415-154569471 154569281-154569414 154569599-154569743
EVENT chr1 ADAR -1 154574725-154574860 154574680-154574724 154574861-154575102
...
>CARTPT 71015707-71015790
>CAST 96083049-96083096
EVENT chr5 CAST 1 95998056-95998201 95865525-95865584 96011243-96011305
EVENT chr5 CAST 1 95998056-95998201 95997778-95997869 96011243-96011305
EVENT chr5 CAST 1 96058343-96058402 96038561-96038619 96073553-96073651
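The header and EVENT line layout can be read with a short Python sketch such as the one below. It is not the pipeline's own parser: parse_events and parse_region are hypothetical names, and splitting fields on any whitespace is an assumption made so that the example lines above parse as shown.

```python
def parse_region(text):
    """Turn 'start-end' into a (start, end) tuple of 1-based coordinates."""
    start, end = text.split("-")
    return int(start), int(end)


def parse_events(lines):
    """Group constitutive exons and EVENT lines by gene symbol."""
    genes = {}
    current = None
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue
        if fields[0].startswith(">"):
            # Header line: >gene_symbol, then ';'-separated constitutive exons.
            symbol = fields[0][1:]
            consts = []
            if len(fields) > 1:
                consts = [parse_region(r) for r in fields[1].split(";") if r]
            current = {"constitutive": consts, "events": []}
            genes[symbol] = current
        elif fields[0] == "EVENT" and current is not None:
            _, chrom, _gene, strand, event, flank1, flank2 = fields
            current["events"].append({
                "chr": chrom,
                "strand": "+" if int(strand) == 1 else "-",
                "event": parse_region(event),
                "flanks": (parse_region(flank1), parse_region(flank2)),
            })
    return genes


sample = [
    ">CARTPT 71015707-71015790",
    ">CAST 96083049-96083096",
    "EVENT chr5 CAST 1 95998056-95998201 95865525-95865584 96011243-96011305",
]
genes = parse_events(sample)
print(genes["CAST"]["events"][0])
```

A gene with a header but no EVENT lines, like CARTPT, simply ends up with an empty event list.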
estfile contains splicing events as determined from hg19 EST and cDNA data. The format is tab-delimited as: event_type (ASS/SE/RI, etc.), event_name (chr:pos:strand), starting_coordinate, ending_coordinate (1-based start and end). In the EST events, only the event regions are kept track of, and no flanking regions are recorded, so constitutive exons in splicingfile will be used in NEV calculation. Examples:
ASS chr20:61924538:+ 61943772 61943775
ASS chr20:61924538:+ 61946753 61946755
SE chr20:61924538:+ 61956621 61956716
SE chr20:61924538:+ 61953410 61953463
RI chr22:32058418:- 32017128 32017320
RI chr22:31795509:+ 32014212 32014300
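An EST event line can be split into its parts as in the Python sketch below, which also tallies events by type. The function name parse_est_event is hypothetical, and fields are split on any whitespace as an assumption so that the example lines above can be used directly.

```python
from collections import Counter


def parse_est_event(line):
    """Parse event_type, event_name (chr:pos:strand), start and end."""
    event_type, name, start, end = line.split()
    chrom, pos, strand = name.split(":")
    return {
        "type": event_type,
        "chr": chrom,
        "name_pos": int(pos),
        "strand": strand,
        "region": (int(start), int(end)),  # 1-based start and end
    }


sample = [
    "ASS chr20:61924538:+ 61943772 61943775",
    "SE chr20:61924538:+ 61956621 61956716",
    "RI chr22:32058418:- 32017128 32017320",
    "RI chr22:31795509:+ 32014212 32014300",
]
events = [parse_est_event(line) for line in sample]
by_type = Counter(e["type"] for e in events)
print(by_type)
```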
See also: Overview, procReads, aseSnvs, snp_dist, asarp
This pipeline is free software; you can redistribute it and/or modify it given that the related works and authors are cited and acknowledged.
This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.
Cyrus Tak-Ming CHAN
Xiao Lab, Department of Integrative Biology & Physiology, UCLA