HyLiTE manual¶
HyLiTE (Hybrid Lineage Transcriptome Explorer) analyzes high-throughput transcriptome data from allopolyploid species.
Allopolyploidy describes the formation of a new hybrid organism from the union of two or more different parents. Allopolyploid species carry multiple copies of each gene (homeologs), which often exhibit unusual expression patterns. In principle, homeolog expression levels can be determined from next-generation sequencing data (RNA-seq), but in practice, assigning reads to one homeolog rather than another is extremely challenging, particularly on a whole-genome scale. HyLiTE automates this process and allows gene expression patterns to be explored even in very complex allopolyploid species.
This page will guide you through installing HyLiTE and performing several kinds of HyLiTE analysis.
Restoring a crashed HyLiTE run
Using HacknHyLiTE and HyLiTE-merge
HyLiTE - the --alternative_params option
Installation¶
HyLiTE has been designed to be easy to install.
It comes as a Python wheel, which is a common way to distribute a Python package as a single file comprising its code, resources, and metadata.
First, download the latest wheel distribution from sourceforge.
Once you have the wheel, you can use it directly as a python script to execute HyLiTE:
python3 HyLiTE-2.0.2-py3-none-any.whl --help
If the wheel has downloaded correctly, the command above should display the help for HyLiTE.
You can also use pip:
sudo pip install -Iv http://sourceforge.net/projects/hylite/files/HyLiTE-2.0.2-py3-none-any.whl/download
or easy_install:
sudo easy_install /path/to/wheel/HyLiTE-2.0.2-py3-none-any.whl
In both cases, an executable will be created and you should be able to call HyLiTE directly:
HyLiTE --help
HyLiTE can be installed locally; for instance, using easy_install:
easy_install --prefix=/path/to/my/preferred/directory/ /path/to/wheel/HyLiTE-2.0.2-py3-none-any.whl
The executable should be installed in the /path/to/my/preferred/directory/ folder, which you may then want to add to your UNIX PATH variable.
Dependency requirements¶
In order to fully run HyLiTE, you will need:
Warning
The current release of HyLiTE is not designed to run under Python 2. Please upgrade to Python 3.
Note
bowtie2 and samtools are not necessary once a .pileup file has been obtained. Please read the following tutorials to learn more.
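Before a full run, it can help to confirm that the external tools are reachable. The following is a minimal sketch, not part of HyLiTE: the function name `check_hylite_dependencies` is hypothetical, and it covers only the requirements named in the notes above (Python 3, bowtie2 and samtools).

```python
import shutil
import sys


def check_hylite_dependencies(tools=("bowtie2", "samtools")):
    """Return a dict mapping each requirement to True (found) or False."""
    # HyLiTE is not designed to run under Python 2.
    status = {"python3": sys.version_info.major >= 3}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        status[tool] = shutil.which(tool) is not None
    return status


if __name__ == "__main__":
    for requirement, found in check_hylite_dependencies().items():
        print(f"{requirement}: {'found' if found else 'MISSING'}")
```

A missing bowtie2 or samtools only matters if you start from .fastq or .sam files; it does not matter once a .pileup file is available.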
Examples of HyLiTE analyses¶
1. Basic considerations¶
Consider two haploid organisms: parents P1 and P2. Through a process of allopolyploidization, these two organisms have formed a new diploid child species: Ch.
From an RNA-seq experiment, we have obtained reads in the .fastq format for each organism. Imagine that these reads are contained in the files:
Ch.fastq
P1.fastq
P2.fastq
Also consider a fourth organism, Ref, which is related to the three other species (specifically, close enough that we can align the reads of the three other species to it), and for which we have genomic information (specifically, a set of gene sequences). Imagine that these gene sequences are contained in the .fasta file:
Ref.fasta
Note
The reference file will be indexed using samtools. As a consequence, the user must be able to write to the directory containing the reference file.
The reference sequence does not have to be one of the studied organisms, but it cannot be too divergent. The limiting step is read mapping, which can tolerate around 5% divergence.
2. A first HyLiTE analysis¶
The dataset described above, along with everything needed to perform this example analysis, can be downloaded here: tutorial dataset. The archive also includes a directory of expected result files, so that you can compare them with your own.
Starting from the setup described in Basic considerations, we first need to create a protocol file. The general structure of the protocol file is described here: Format of the protocol file
In this simple case, the protocol file contains:
simple_protocol_file.txt:
Ch 2 sample1 RNAseq Ch.fastq
P1 1 sample1 RNAseq P1.fastq
P2 1 sample1 RNAseq P2.fastq
This file is already present in the tutorial dataset provided.
Note
The child allopolyploid must always be the first organism listed.
Even in this simple case, a sample name must be provided.
The current version of HyLiTE does not support parents that are more than diploid.
In this simple example, all .fastq files, the reference .fasta file and the protocol file are all located in the same folder. If this is not the case, the absolute paths to the files must be listed in the protocol file.
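For larger experiments, the protocol file can be generated rather than typed by hand. The sketch below is illustrative only: `format_protocol` is a hypothetical helper, and the column names used in it (organism, ploidy, sample, data type, read file) are our reading of the example above rather than official HyLiTE terminology. It joins columns with a single tab and can optionally expand read files to absolute paths.

```python
import os


def format_protocol(rows, absolute=False):
    """Format (organism, ploidy, sample, data_type, read_file) rows as protocol text."""
    lines = []
    for organism, ploidy, sample, data_type, read_file in rows:
        if absolute:
            # Absolute paths are needed when files are not in the working folder.
            read_file = os.path.abspath(read_file)
        # Columns are separated by a single tab, not by spaces.
        lines.append("\t".join([organism, str(ploidy), sample, data_type, read_file]))
    return "\n".join(lines) + "\n"


# The child allopolyploid is listed first, followed by the parents.
rows = [
    ("Ch", 2, "sample1", "RNAseq", "Ch.fastq"),
    ("P1", 1, "sample1", "RNAseq", "P1.fastq"),
    ("P2", 1, "sample1", "RNAseq", "P2.fastq"),
]
protocol_text = format_protocol(rows)
```

The resulting text can be written to disk with `open("simple_protocol_file.txt", "w").write(protocol_text)`.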
To try HyLiTE, first move to the folder containing the protocol file (simple_protocol_file.txt), the reference file (.fasta file), and the read files (.fastq files). We then only have to type the following command:
HyLiTE -v -f simple_protocol_file.txt -r Ref.fasta -n my_first_HyLiTE
Details of the command line:
-v turns on verbose runtime comments
-f specifies the protocol file
-r specifies the .fasta file containing the reference gene sequences
-n allows the user to provide a name for the HyLiTE analysis
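When HyLiTE is launched from a script, the same command can be assembled as an argument list. This is a sketch using only the flags described above; `build_hylite_command` is a hypothetical helper name, not something HyLiTE provides.

```python
import shlex


def build_hylite_command(protocol, reference, name, verbose=True):
    """Assemble the basic HyLiTE command line as a list of arguments."""
    command = ["HyLiTE"]
    if verbose:
        command.append("-v")  # verbose runtime comments
    command += ["-f", protocol]   # protocol file
    command += ["-r", reference]  # reference gene sequences (.fasta)
    command += ["-n", name]       # name of the analysis
    return command


command = build_hylite_command("simple_protocol_file.txt", "Ref.fasta", "my_first_HyLiTE")
print(shlex.join(command))
# To actually launch the run: subprocess.run(command, check=True)
```

Passing a list rather than a single string avoids quoting problems when file names contain spaces.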
If installed correctly, HyLiTE should launch. In the first step, bowtie2 indexes the reference .fasta file.
By default, HyLiTE creates a directory using the analysis name (from the -n flag). All results files will be written to this directory. If you use the downloadable dataset (tutorial dataset), you can compare your results with the ones included in the archive.
When HyLiTE completes (which, depending on file sizes, may take quite some time), the analysis directory (here, my_first_HyLiTE) should contain:
.bt2 index files of the reference .fasta file
.sam, .bam, .sorted.bam, .sorted.bai files for each organism
my_first_HyLiTE.pileup, the pileup file written by samtools and used by the main HyLiTE algorithm
my_first_HyLiTE.expression.txt, a text file containing gene and homeolog expression information
my_first_HyLiTE.snp.txt, a text file containing all information about SNPs identified
my_first_HyLiTE.snp.summary.txt, a text file showing SNP information by gene
my_first_HyLiTE.Ch.sample1.read.txt, a text file containing information about reads in the hybrid and how they are assigned to parental genomes
my_first_HyLiTE.Ch.sample1.read.summary.txt, a text file summarizing the ancestral origin of reads by gene
my_first_HyLiTE.run.summary.txt, a text file summarizing the total child reads mapped to the reference and how they were assigned (e.g. to P1), the total number of SNPs identified and the total number of child unique SNPs
More information about the output formats of these files can be found in HyLiTE output formats. The examples of results files shown in HyLiTE output formats actually come from this example.
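A quick way to confirm that a run completed is to check for the result files listed above. The sketch below is a hypothetical helper, not part of HyLiTE. It assumes that, with several samples, the read files follow the same `<name>.<child>.<sample>` naming pattern seen in this single-sample example.

```python
import os


def expected_result_files(name, child, samples):
    """List the main result files expected in the analysis directory."""
    files = [
        f"{name}.pileup",
        f"{name}.expression.txt",
        f"{name}.snp.txt",
        f"{name}.snp.summary.txt",
        f"{name}.run.summary.txt",
    ]
    for sample in samples:
        # One read file and one read summary file per child sample (assumed pattern).
        files.append(f"{name}.{child}.{sample}.read.txt")
        files.append(f"{name}.{child}.{sample}.read.summary.txt")
    return files


def missing_result_files(directory, name, child, samples):
    """Return the expected files that are absent from the analysis directory."""
    return [
        filename
        for filename in expected_result_files(name, child, samples)
        if not os.path.exists(os.path.join(directory, filename))
    ]
```

For the tutorial run, `missing_result_files("my_first_HyLiTE", "my_first_HyLiTE", "Ch", ["sample1"])` should return an empty list once the analysis has finished.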
3. Using the different entry points¶
In the previous example (A first HyLiTE analysis), HyLiTE coordinates all the calls to external software: bowtie2 and samtools.
Note
If the -v option is activated, HyLiTE will print the commands it uses for these calls to screen. This can be useful for understanding how HyLiTE works. However, users can generate these files themselves, and in this case, HyLiTE will skip these external calls.
In the following section, we will explore three ways to circumvent parts of the HyLiTE pipeline and illustrate how to generate the correct data files to run these truncated analyses.
3.1 If the bowtie2 index already exists: option -b¶
It is easy to generate a bowtie2 index (.bt2) of the reference gene sequences using the command:
bowtie2-build Ref.fasta Ref
(for further details, please see the bowtie2 manual)
During the first example (A first HyLiTE analysis), a bowtie2 index was generated and can be found in the folder “my_first_HyLiTE/”.
The -b flag tells HyLiTE to skip the building of the index. Relative to our simple example (A first HyLiTE analysis), the command line would be:
HyLiTE -v -f simple_protocol_file.txt -r Ref.fasta -b my_first_HyLiTE/Ref -n my_second_HyLiTE
Note
Only the addition of the -b flag followed by the index name (minus the file suffixes) has changed. The reference file must still be specified as it is used later to generate the pileup file.
3.2 If the alignment files already exist: option -S¶
We can also ask HyLiTE to skip the alignment step, provided that we supply HyLiTE with the corresponding files in the .sam format.
For instance, we can obtain alignment files by using bowtie2, as shown in the following command:
bowtie2 -x Ref -U P1.fastq -S P1.sam -N 1
Running bowtie2 ourselves lets us specify a set of options that are better suited to our data set. It also lets us use a different alignment program (such as bwa), with the only constraint being that the resulting alignment files must be converted to the standard .sam format.
Once all our alignment files have been generated, we have to modify the protocol file to specify the path to the .sam files rather than the .fastq files. During the first example (A first HyLiTE analysis), .sam alignment files were generated and can be found in the folder “my_first_HyLiTE/”. So the new protocol file should be:
sam_protocol_file.txt:
Ch 2 sample1 RNAseq my_first_HyLiTE/Ch.sam
P1 1 sample1 RNAseq my_first_HyLiTE/P1.sam
P2 1 sample1 RNAseq my_first_HyLiTE/P2.sam
Note
Columns are separated by a single tab, not by spaces.
We can now launch the HyLiTE analysis, specifying the -S flag to indicate that the protocol file contains .sam files:
HyLiTE -v -S -f sam_protocol_file.txt -r Ref.fasta -n my_third_HyLiTE
Note
The main reason for using this option is that you can specify your own mapping options. Mainly:
The parallel processing option of bowtie2 to greatly reduce the alignment time
Options to adjust the sensitivity of the alignment
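The two steps above (aligning with custom options, then pointing the protocol file at the .sam files) can be sketched as follows. Both helper names are hypothetical. The bowtie2 flags used (`-x`, `-U`, `-S`, `-p` for threads, `-N`) are standard bowtie2 options, but the values shown are only examples, and the sketch assumes single-end reads with one .fastq file per protocol row.

```python
import os


def build_bowtie2_command(index, fastq, sam, threads=1, extra=("-N", "1")):
    """Assemble a single-end bowtie2 alignment command as an argument list."""
    # -p sets the number of parallel threads; extra holds sensitivity options.
    return ["bowtie2", "-x", index, "-U", fastq, "-S", sam, "-p", str(threads), *extra]


def sam_protocol_row(row, sam_dir):
    """Rewrite one tab-separated protocol row so that it points to a .sam file."""
    fields = row.rstrip("\n").split("\t")
    stem = os.path.splitext(os.path.basename(fields[4]))[0]
    fields[4] = f"{sam_dir}/{stem}.sam"
    return "\t".join(fields)


print(build_bowtie2_command("Ref", "P1.fastq", "P1.sam", threads=4))
print(sam_protocol_row("P1\t1\tsample1\tRNAseq\tP1.fastq", "my_first_HyLiTE"))
```

The rewritten rows, saved as sam_protocol_file.txt, are what the -S flag expects.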
3.3 If the pileup file already exists: option -p¶
After the alignment step, HyLiTE would typically perform a series of operations on the .sam files to prepare them for generating the pileup file. All of these steps are just very simple samtools commands.
First, we have to index the reference .fasta file:
samtools faidx Ref.fasta
Note
This step only needs to be performed once, and if desired, the index file can be saved for later analyses.
For each .sam file, the following commands must also be run:
samtools view -Sb P1.sam > P1.bam
samtools sort P1.bam P1.sorted
samtools index P1.sorted.bam
Once this has been completed for each .sam file, we need to create a new text file listing the names of the .sorted.bam files (one file per line). For our example, we would call this file my_first_HyLiTE.txt:
Ch.sorted.bam
P1.sorted.bam
P2.sorted.bam
Note
It is very important that the order of names in this file exactly follows the order in the protocol file. Remember: the allopolyploid child must always be listed first.
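Because the order of this list matters, one option is to derive it from the protocol file instead of writing it by hand. The following sketch is a hypothetical helper. It assumes single-end data (one read file per protocol row) and that each .sorted.bam file is named after the stem of the corresponding read file, as in the example above.

```python
import os


def sorted_bam_list(protocol_text):
    """Return .sorted.bam names in exactly the order of the protocol rows."""
    names = []
    for line in protocol_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        read_file = line.split("\t")[4]
        stem = os.path.splitext(os.path.basename(read_file))[0]
        names.append(f"{stem}.sorted.bam")
    return names


protocol_text = (
    "Ch\t2\tsample1\tRNAseq\tCh.fastq\n"
    "P1\t1\tsample1\tRNAseq\tP1.fastq\n"
    "P2\t1\tsample1\tRNAseq\tP2.fastq\n"
)
bam_names = sorted_bam_list(protocol_text)
print("\n".join(bam_names))
```

Writing `"\n".join(bam_names)` to my_first_HyLiTE.txt gives a list whose order follows the protocol file, with the child first.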
Once these steps have finished, we can finally generate the .pileup file:
samtools mpileup -BQ 0 -d 1000000 -f Ref.fasta -b my_first_HyLiTE.txt > my_first_HyLiTE.pileup
During the first example (A first HyLiTE analysis), a .pileup file has been generated and can be found in the folder “my_first_HyLiTE/”.
HyLiTE can now be launched using the following command:
HyLiTE -v -f simple_protocol_file.txt -p my_first_HyLiTE/my_first_HyLiTE.pileup -n my_fourth_HyLiTE -r Ref.fasta
Note
This command line launches HyLiTE alone and does not require any external software. For many users, this may be the preferred starting point for HyLiTE analyses.
In later examples, we will often assume that the .pileup file has already been created outside HyLiTE and hence use the -p flag.
4. Using data from paired-end RNA sequencing¶
In most cases, paired-end reads are not particularly useful for gene expression studies: once the first read maps to a gene, the second read provides no additional information on expression. Because of the added complexity, we therefore discourage the use of paired-end data with HyLiTE.
However, if paired-end reads are available for one or more of our organisms, the files can be included under the same sample name listing in the protocol file.
In the following example, the child organism has paired-end reads, which are found, respectively, in files Ch_1.fastq and Ch_2.fastq:
simple_protocol_with_paired_end.txt:
Ch 2 sample1 RNAseq Ch_1.fastq
Ch 2 sample1 RNAseq Ch_2.fastq
P1 1 sample1 RNAseq P1.fastq
P2 1 sample1 RNAseq P2.fastq
Note
The paired-end files must immediately follow each other in the protocol file.
If we use the -S flag, there must only be one .sam file per sample (i.e., the .sam file must contain paired-end mappings).
Any HyLiTE analysis can then be launched normally.
5. Using several biological replicates¶
Most RNA-seq studies will contain biological replicates. This information can be combined to improve SNP calling (the basis of read assignment), but must be kept separate for calculations of gene expression.
Consider an example where two biological replicates have been generated for each organism. The .fastq files are labelled with the corresponding replicate (sample) names.
biological_replicate_protocol.txt:
Ch 2 sample1 RNAseq Ch.sample1.fastq
Ch 2 sample2 RNAseq Ch.sample2.fastq
P1 1 sample1 RNAseq P1.sample1.fastq
P1 1 sample2 RNAseq P1.sample2.fastq
P2 1 sample1 RNAseq P2.sample1.fastq
P2 1 sample2 RNAseq P2.sample2.fastq
Note
Files must still be grouped by organism and sample.
Paired-end read files (if used) must still be grouped together.
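As protocol files grow, ordering mistakes become easy to make. The sketch below is a hypothetical checker, not part of HyLiTE, and HyLiTE's own validation may differ. It tests three of the rules stated in the notes: five tab-separated columns, the child listed first, and rows grouped by organism and sample.

```python
def validate_protocol(protocol_text, child):
    """Return a list of human-readable problems found in a protocol file."""
    problems = []
    rows = [line.split("\t") for line in protocol_text.splitlines() if line.strip()]
    for number, fields in enumerate(rows, start=1):
        if len(fields) != 5:
            problems.append(f"row {number}: expected 5 tab-separated columns, found {len(fields)}")
    rows = [fields for fields in rows if len(fields) == 5]
    if rows and rows[0][0] != child:
        problems.append(f"the child {child} must be the first organism listed")

    # An organism or sample is 'closed' once a different one has followed it.
    closed_organisms, closed_samples = set(), set()
    previous = None
    for fields in rows:
        key = (fields[0], fields[2])  # (organism, sample)
        if previous is not None and key != previous:
            closed_samples.add(previous)
            if key[0] != previous[0]:
                closed_organisms.add(previous[0])
        if key[0] in closed_organisms:
            problems.append(f"rows for organism {key[0]} are not grouped together")
        elif key in closed_samples:
            problems.append(f"rows for {key[0]} {key[1]} are not grouped together")
        previous = key
    return problems
```

An empty list means that none of these checks failed. It does not guarantee that HyLiTE will accept the file.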
Any HyLiTE analysis can then be launched normally.
Note
When several biological replicates are specified, each will be given its own read and read summary result file.
6. Including genomic information¶
Genomic information can be very useful for calling SNPs. This is especially true for genes with relatively low expression, particularly in the parent species.
To call SNPs, HyLiTE requires a minimum level of coverage (see 2. Detecting SNPs). Furthermore, if any one of the organisms has poor coverage at a given position, any SNP at that position must necessarily be masked out for all organisms. These SNPs therefore cannot be used to assign reads to homeologs.
However, this problem can largely be circumvented by including reads from genomic DNA.
genomic_DNA_protocol.txt:
Ch 2 sample1 RNAseq Ch.sample1.fastq
Ch 2 sample2 RNAseq Ch.sample2.fastq
Ch 2 sample3 gDNA Ch.sample3.fastq
P1 1 sample1 RNAseq P1.sample1.fastq
P1 1 sample2 RNAseq P1.sample2.fastq
P1 1 sample3 gDNA P1.sample3.fastq
P2 1 sample1 RNAseq P2.sample1.fastq
P2 1 sample2 RNAseq P2.sample2.fastq
P2 1 sample3 gDNA P2.sample3.fastq
Note
Where genomic data is included, these files must be labeled as ‘gDNA’ instead of ‘RNAseq’ in the protocol file.
Obviously, no expression files are produced for gDNA samples.
7. Comments on input data types¶
HyLiTE is designed to work with RNAseq data generated by Illumina sequencing technology. By default, HyLiTE runs Bowtie2 on your RNAseq data assuming phred+33 quality encoding. If your data is encoded in phred+64, use the --phred64 argument with HyLiTE. HyLiTE does not accept reads with quality scores from Illumina pipeline versions earlier than 1.3 as initial input. For such data, run Bowtie2 yourself with the appropriate option (--solexa-quals) to generate *.sam files, and then start HyLiTE from those files. If you do not know which quality score version your data uses, SolexaQA++ can identify it, and can also clean the data (http://solexaqa.sourceforge.net/).
If you are not using Illumina-generated data, Bowtie2 may not be the most appropriate mapping software for you. Consider starting HyLiTE from *.sam files created by the mapper of your choice. HyLiTE’s SNP detection also assumes the error rate expected from Illumina data; this value can be changed in the parameter file (see the --alternative_params option).
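If a quality-checking tool is not at hand, the encoding can often be guessed from the range of quality characters in the .fastq file. The sketch below is a common heuristic, not something HyLiTE provides: both function names are hypothetical, it assumes four-line FASTQ records, and a result of 64 can be wrong for a small sample of uniformly high-quality phred+33 reads.

```python
def fastq_quality_lines(path, max_reads=10000):
    """Yield the quality line of the first max_reads four-line FASTQ records."""
    with open(path) as handle:
        for number, line in enumerate(handle):
            if number >= 4 * max_reads:
                break
            if number % 4 == 3:  # fourth line of each record
                yield line.rstrip("\n")


def guess_phred_offset(quality_lines):
    """Return 33, 64, or None (Solexa-style or undetermined) from quality strings."""
    codes = [ord(char) for line in quality_lines for char in line.strip()]
    if not codes:
        return None
    lowest = min(codes)
    if lowest < 59:
        return 33  # characters below ';' only occur in phred+33 data
    if lowest >= 64:
        return 64  # consistent with phred+64 (Illumina 1.3 and later)
    return None    # 59 to 63: consistent with older Solexa-scaled scores


print(guess_phred_offset(["IIIIHHGG#"]))
```

A result of 64 suggests adding --phred64 to the HyLiTE command, while None suggests generating the .sam files with Bowtie2 directly.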
Restoring a crashed HyLiTE run¶
If HyLiTE fails unexpectedly for any reason, the run can be restarted from the pickle (‘checkpointing’) file. By default, this file is written every 50 genes. To restart HyLiTE, simply use the --restore option as follows:
HyLiTE --restore my_first_HyLiTE.pickle
Note
The frequency of the pickling operation can be changed with the option --pickling <int>, where <int> is the number of genes between each pickling operation. Pickling too frequently may negatively affect runtime.
If HyLiTE crashes before any pickle file is produced, the different entry points (options -b, -S, -p) can be used to avoid wasting runtime.
Note that even if HyLiTE crashes, result files contain complete information about any genes processed before the crash occurred.
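The recovery choices described above can be expressed as a small decision function. This is a sketch only: `restart_command` is a hypothetical helper, and it assumes that the pickle and pileup files are named after the analysis and sit in the given directory, as in the examples on this page.

```python
import os


def restart_command(name, protocol, reference, directory=".", exists=os.path.exists):
    """Choose a HyLiTE command that reuses as much earlier work as possible."""
    pickle_path = os.path.join(directory, f"{name}.pickle")
    pileup_path = os.path.join(directory, f"{name}.pileup")
    if exists(pickle_path):
        # A checkpoint is available: resume from it.
        return ["HyLiTE", "--restore", pickle_path]
    if exists(pileup_path):
        # No checkpoint, but the pileup exists: skip bowtie2 and samtools.
        return ["HyLiTE", "-v", "-f", protocol, "-p", pileup_path, "-n", name, "-r", reference]
    # Nothing reusable was found: run the full pipeline.
    return ["HyLiTE", "-v", "-f", protocol, "-r", reference, "-n", name]
```

The `exists` argument is there only so that the logic can be tried without real files; in normal use the default is sufficient.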
Using HacknHyLiTE and HyLiTE-merge¶
Warning
HacknHyLiTE and HyLiTE-merge are experimental features. Please use these options with caution.
The operations that HyLiTE undertakes are not fast. However, HyLiTE has been heavily optimized so that its runtime increases only linearly with the size of the pileup file. Because all genes are treated independently, a simple way to reduce runtime is to divide the pileup file into n subsets. Each subset can then run as an independent HyLiTE analysis on a different core of a computing cluster, reducing the execution time by a factor of up to n.
To implement this feature, we have developed two executables that:
Cleanly slice a pileup file into n subsets and generate PBS-friendly bash scripts for each HyLiTE analysis (HacknHyLiTE)
Browse the directory of each sub-analysis created by HacknHyLiTE, merge the results file and generate single summary files (HyLiTE-merge)
Another executable HyLiTE-split is also available. It just cleanly slices a pileup file into subsets, without creating any bash scripts.
Considering our first example (A first HyLiTE analysis), HacknHyLiTE and HyLiTE-merge would be used as follows:
HacknHyLiTE -n 2 -o test_slicing --name slicing -p my_first_HyLiTE/my_first_HyLiTE.pileup -f simple_protocol_file.txt
bash test_slicing/slicing.subset0.sh
bash test_slicing/slicing.subset1.sh
cd test_slicing
HyLiTE-merge
Note
HyLiTE-merge is only intended to be used on analyses generated by HacknHyLiTE. Proceed with caution if you use this function in other ways.
HacknHyLiTE, HyLiTE-merge and HyLiTE-split are automatically installed along with HyLiTE if you use pip or easy_install.
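To illustrate what slicing "cleanly" means, the sketch below splits pileup lines into n subsets without ever dividing a gene between two subsets. It is not the algorithm used by HacknHyLiTE or HyLiTE-split, which may balance subsets differently. It relies only on the fact that the first tab-separated column of a samtools pileup line is the reference sequence (here, gene) name.

```python
from itertools import groupby


def split_pileup_by_gene(lines, n):
    """Distribute pileup lines over n subsets, keeping each gene in one subset."""
    subsets = [[] for _ in range(n)]
    # Consecutive lines sharing the first column belong to the same gene.
    genes = groupby(lines, key=lambda line: line.split("\t", 1)[0])
    for index, (_gene, gene_lines) in enumerate(genes):
        # Round-robin assignment: simple, but not balanced by gene size.
        subsets[index % n].extend(gene_lines)
    return subsets


pileup_lines = [
    "gene1\t1\tA\t3\t...\tIII",
    "gene1\t2\tC\t3\t...\tIII",
    "gene2\t1\tG\t2\t..\tII",
    "gene3\t1\tT\t1\t.\tI",
]
subset0, subset1 = split_pileup_by_gene(pileup_lines, 2)
```

Here gene1 and gene3 end up in the first subset and gene2 in the second, so every gene can still be analysed independently.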
HyLiTE - the --alternative_params option¶
This option is unconventional and we discourage its use. However, we deem it potentially useful for future developers. This command can also be used by non-root users to define a personalised parameters file (details about these parameters can be found in the Parameters file page) or to set up an alternative set of parameters.
The command allows the user to specify a file that will be interpreted by python just after the parameter file has been read and before the HyLiTE analysis has begun. It allows the user to, for example, redefine a variable or a function.