HyLiTE manual

HyLiTE (Hybrid Lineage Transcriptome Explorer) analyzes high-throughput transcriptome data from allopolyploid species.

Allopolyploidy describes the formation of a new hybrid organism from the union of two or more different parents. Allopolyploid species carry multiple copies of each gene (homeologs), which often exhibit unusual expression patterns. Homeolog expression levels can technically be determined from next generation sequencing data (RNA-seq), but in practice, assigning reads to one homeolog over another is extremely challenging, particularly on a whole-genome scale. HyLiTE automates this process, and allows gene expression patterns to be explored even in very complex allopolyploid species.

This page will guide you through installing HyLiTE and performing several kinds of HyLiTE analysis.

Installation

Dependency requirements

Examples of HyLiTE analyses

  1. Basic considerations

  2. A first HyLiTE analysis

  3. Using the different entry points

  4. Using data from paired-end RNA sequencing

  5. Using several biological replicates

  6. Including genomic information

  7. Comments on input data types

Restoring a crashed HyLiTE run

Using HacknHyLiTE and HyLiTE-merge

HyLiTE - the --alternative_params option

Installation

HyLiTE has been designed to be easy to install.

It comes as a Python egg, which is a common way to distribute a Python package as a single file comprising its code, resources, and metadata.

First, download the latest egg distribution from sourceforge.

Once you have the egg, you can use it directly as a python script to execute HyLiTE:

python /path/to/egg/HyLiTE-1.7.1-py2.7.egg --help

If the egg has downloaded correctly, the command above should display the help for HyLiTE.

You can also use pip:

sudo pip install -Iv http://sourceforge.net/projects/hylite/files/HyLiTE-1.7.1-py2.7.egg/download

or easy_install:

sudo easy_install /path/to/egg/HyLiTE-1.7.1-py2.7.egg

In both cases, an executable will be created and you should be able to call HyLiTE directly:

HyLiTE --help

HyLiTE can also be installed into a local directory of your choice (for example, when you do not have root access); for instance, using easy_install:

easy_install --prefix=/path/to/my/preferred/directory/  /path/to/egg/HyLiTE-1.7.1-py2.7.egg

The executable should be installed in the /path/to/my/preferred/directory/ folder, which you may then want to add to your UNIX PATH variable.

Dependency requirements

In order to fully run HyLiTE, you will need:

  • Python 2 (the egg is built for Python 2.7)
  • bowtie2
  • samtools

Warning

HyLiTE is not designed to run under python 3

Note

bowtie2 and samtools are not necessary once a .pileup file has been obtained. Please read the following tutorials to learn more.

Examples of HyLiTE analyses

1. Basic considerations

Consider two haploid organisms: parents P1 and P2. Through a process of allopolyploidization, these two organisms have formed a new diploid child species: Ch.

From an RNA-seq experiment, we have obtained reads in the .fastq format for each organism. Imagine that these reads are contained in the files:

  • Ch.fastq
  • P1.fastq
  • P2.fastq

Also consider a fourth organism Ref, which is related to the three other species (specifically, close enough that the reads of the three other species can be aligned to it), and for which we have genomic information (specifically, a set of gene sequences). Imagine that these gene sequences are contained in the .fasta file:

  • Ref.fasta

Note

  • The reference file will be indexed using samtools. As a consequence, the user must be able to write to the directory containing the reference file.
  • The reference sequence does not have to be one of the studied organisms, but it cannot be too divergent. The limiting step is read mapping, which can tolerate around 5% divergence.

2. A first HyLiTE analysis

The dataset described above, along with everything needed to perform this example analysis, can be downloaded at this address: tutorial dataset. The archive also includes a directory containing the expected result files, so that you can compare them with yours.

Starting from the setup described in Basic considerations, we first need to create a protocol file. The general structure of the protocol file is described here: Format of the protocol file

In this simple case, the protocol file contains:

simple_protocol_file.txt:

Ch    2    sample1    RNAseq    Ch.fastq
P1    1    sample1    RNAseq    P1.fastq
P2    1    sample1    RNAseq    P2.fastq

This file is already present in the tutorial dataset provided.

Note

  • The child allopolyploid must always be the first organism listed.
  • Even in this simple case, a sample name must be provided.
  • The current version of HyLiTE does not support parents that are more than diploid.
  • In this simple example, the .fastq files, the reference .fasta file and the protocol file are all located in the same folder. If this is not the case, the absolute paths to the files must be listed in the protocol file.
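As a rough illustration of the layout above, the following Python sketch writes the three-row protocol file with tab-separated columns and reads it back, checking that every line has five fields and a sample name. The helper names (write_protocol, read_protocol) are hypothetical and not part of HyLiTE; only the column layout is taken from the example.

```python
# Rows of the simple example: organism, ploidy, sample name, data type, reads file.
ROWS = [
    ("Ch", 2, "sample1", "RNAseq", "Ch.fastq"),
    ("P1", 1, "sample1", "RNAseq", "P1.fastq"),
    ("P2", 1, "sample1", "RNAseq", "P2.fastq"),
]


def write_protocol(path, rows):
    """Write one tab-separated line per row (hypothetical helper)."""
    with open(path, "w") as handle:
        for row in rows:
            handle.write("\t".join(str(field) for field in row) + "\n")


def read_protocol(path):
    """Read the file back as a list of 5-field lists, skipping blank lines."""
    with open(path) as handle:
        rows = [line.rstrip("\n").split("\t") for line in handle if line.strip()]
    for row in rows:
        if len(row) != 5:
            raise ValueError("expected 5 tab-separated columns, got: {!r}".format(row))
        if not row[2]:
            raise ValueError("a sample name is required: {!r}".format(row))
    return rows
```

Calling write_protocol("simple_protocol_file.txt", ROWS) would produce a file equivalent to the one shown above; the child organism is simply kept as the first row.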

To try HyLiTE, first move to the folder containing the protocol file (simple_protocol_file.txt), the reference file (.fasta file), and the reads files (.fastq files). Then type the following command:

HyLiTE -v -f simple_protocol_file.txt -r Ref.fasta -n my_first_HyLiTE

Details of the command line:

  • -v turns on verbose runtime comments
  • -f specifies the protocol file
  • -r specifies the .fasta file containing the reference gene sequences
  • -n allows the user to provide a name for the HyLiTE analysis

If installed correctly, HyLiTE should launch. In the first step, bowtie2 indexes the reference .fasta file.

By default, HyLiTE creates a directory using the analysis name (from the -n flag). All results files will be written to this directory. If you use the downloadable dataset (tutorial dataset), you can compare your results with the ones included in the archive.

When HyLiTE completes (which, depending on file sizes, may take quite some time), the analysis directory (here, my_first_HyLiTE) should contain:

  • .bt2 index files of the reference .fasta file
  • .sam, .bam, .sorted.bam, .sorted.bai files for each organism
  • my_first_HyLiTE.pileup, the pileup file written by samtools and used by the main HyLiTE algorithm
  • my_first_HyLiTE.expression.txt, a text file containing gene and homeolog expression information
  • my_first_HyLiTE.snp.txt, a text file containing all information about SNPs identified
  • my_first_HyLiTE.snp.summary.txt, a text file showing SNP information by gene
  • my_first_HyLiTE.Ch.sample1.read.txt, a text file containing information about reads in the hybrid and how they are assigned to parental genomes
  • my_first_HyLiTE.Ch.sample1.read.summary.txt, a text file summarizing the ancestral origin of reads by gene
  • my_first_HyLiTE.run.summary.txt, a text file summarizing the total child reads mapped to the reference and how they were assigned (e.g. to P1), the total number of SNPs identified and the total number of child unique SNPs

More information about the output formats of these files can be found in HyLiTE output formats. The examples of results files shown in HyLiTE output formats actually come from this example.
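To check a finished run against the list above, one could build the expected file names from the analysis name and look for them in the analysis directory. The sketch below is only an illustration: the function names are hypothetical, the index and alignment files are left out, and the naming pattern for several samples is an assumption extrapolated from the single-sample example.

```python
import os


def expected_result_files(analysis, child, samples):
    """List the result file names described above for one analysis."""
    names = [
        "{0}.pileup".format(analysis),
        "{0}.expression.txt".format(analysis),
        "{0}.snp.txt".format(analysis),
        "{0}.snp.summary.txt".format(analysis),
        "{0}.run.summary.txt".format(analysis),
    ]
    # One read file and one read summary file per child sample (assumed pattern).
    for sample in samples:
        prefix = "{0}.{1}.{2}".format(analysis, child, sample)
        names.append(prefix + ".read.txt")
        names.append(prefix + ".read.summary.txt")
    return names


def missing_files(directory, names):
    """Return the names that are not present as files in the directory."""
    return [n for n in names if not os.path.isfile(os.path.join(directory, n))]
```

For the tutorial run, missing_files("my_first_HyLiTE", expected_result_files("my_first_HyLiTE", "Ch", ["sample1"])) should return an empty list once the analysis has completed.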

3. Using the different entry points

In the previous example (A first HyLiTE analysis), HyLiTE coordinates all the calls to external software: bowtie2 and samtools.

Note

If the -v option is activated, HyLiTE will print the commands it uses for these calls to the screen. This can be useful for understanding how HyLiTE works. Users can also generate the intermediate files (bowtie2 index, alignment files, pileup file) themselves, in which case HyLiTE will skip the corresponding external calls.

In the following section, we will explore three ways to circumvent parts of the HyLiTE pipeline and illustrate how to generate the correct data files to run these truncated analyses.

3.1 If the bowtie2 index already exists: option -b

It is easy to generate a bowtie2 index (.bt2) of the reference gene sequences using the command:

bowtie2-build Ref.fasta Ref

(for further details, please see the bowtie2 manual)

During the first example (A first HyLiTE analysis), a bowtie2 index has been generated and can be found in the folder "my_first_HyLiTE/".

The -b flag tells HyLiTE to skip the building of the index. Relative to our simple example (A first HyLiTE analysis), the command line would be:

HyLiTE -v -f simple_protocol_file.txt -r Ref.fasta -b my_first_HyLiTE/Ref -n my_second_HyLiTE

Note

Only the addition of the -b flag followed by the index name (minus the file suffixes) has changed. The reference file must still be specified as it is used later to generate the pileup file.

3.2 If the alignment files already exist: option -S

We can also ask HyLiTE to skip the alignment step, provided that we supply HyLiTE with the corresponding files in the .sam format.

For instance, we can obtain alignment files by using bowtie2, as shown in the following command:

bowtie2 -x Ref -U P1.fastq -S P1.sam -N 1

Running bowtie2 ourselves lets us specify a set of options that are better suited to our data set. It also lets us use a different alignment program (such as bwa), the only constraint being that the resulting alignment files must be in the standard .sam format.

Once all our alignment files have been generated, we have to modify the protocol file to specify the paths to the .sam files rather than the .fastq files. During the first example (A first HyLiTE analysis), .sam alignment files have been generated and can be found in the folder "my_first_HyLiTE/". The new protocol file should therefore be:

sam_protocol_file.txt:

Ch     2    sample1    RNAseq    my_first_HyLiTE/Ch.sam
P1     1    sample1    RNAseq    my_first_HyLiTE/P1.sam
P2     1    sample1    RNAseq    my_first_HyLiTE/P2.sam

Note

Columns are separated by single tabs.

We can now launch the HyLiTE analysis, specifying the -S flag to indicate that the protocol file contains .sam files:

HyLiTE -v -S -f sam_protocol_file.txt -r Ref.fasta -n my_third_HyLiTE

Note

The main reason for using this option is that you can specify your own mapping options. Mainly:

  • The parallel processing option of bowtie2 to greatly reduce the alignment time
  • Options to adjust the sensitivity of the alignment

3.3 If the pileup file already exists: option -p

After the alignment step, HyLiTE would typically perform a series of operations on the .sam files to prepare them for generating the pileup file. All of these steps are just very simple samtools commands.

First, we have to index the reference .fasta file:

samtools faidx Ref.fasta

Note

This step only needs to be performed once, and if desired, the index file can be saved for later analyses.

For each .sam file, the following commands must also be run:

samtools view -Sb P1.sam > P1.bam
samtools sort P1.bam  P1.sorted
samtools index P1.sorted.bam

Once this has been completed for each .sam file, we need to create a new text file listing the names of the .sorted.bam files (one file per line). For our example, we would call this file my_first_HyLiTE.txt:

Ch.sorted.bam
P1.sorted.bam
P2.sorted.bam

Note

It is very important that the order of names in this file exactly follows the order in the protocol file. Remember: the allopolyploid child must always be listed first.

Once these steps have finished, we can finally generate the .pileup file:

samtools mpileup -BQ 0 -d 1000000 -f Ref.fasta -b my_first_HyLiTE.txt > my_first_HyLiTE.pileup

During the first example (A first HyLiTE analysis), a .pileup file has been generated and can be found in the folder "my_first_HyLiTE/".
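The samtools steps above can be collected into a small planning helper. The sketch below only assembles the command strings and the list of .sorted.bam names in protocol order; it does not run anything. The function name is hypothetical, and the command syntax is copied from this section (newer samtools releases may expect a different syntax for sort).

```python
def pileup_plan(sam_files, reference, analysis):
    """Return (commands, bam_list) reproducing the manual steps above.

    sam_files must be given in protocol order, with the child first.
    """
    commands = ["samtools faidx {0}".format(reference)]
    bam_list = []
    for sam in sam_files:
        if not sam.endswith(".sam"):
            raise ValueError("expected a .sam file, got: {0}".format(sam))
        stem = sam[:-4]
        commands.append("samtools view -Sb {0}.sam > {0}.bam".format(stem))
        commands.append("samtools sort {0}.bam {0}.sorted".format(stem))
        commands.append("samtools index {0}.sorted.bam".format(stem))
        bam_list.append("{0}.sorted.bam".format(stem))
    # The list file named <analysis>.txt must contain bam_list, one name per line.
    commands.append(
        "samtools mpileup -BQ 0 -d 1000000 -f {0} -b {1}.txt > {1}.pileup".format(
            reference, analysis))
    return commands, bam_list
```

Writing "\n".join(bam_list) to my_first_HyLiTE.txt keeps the order of the .sorted.bam files identical to the order passed in, which is the property the note above insists on.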

HyLiTE can now be launched using the following command:

HyLiTE -v -f simple_protocol_file.txt -p my_first_HyLiTE/my_first_HyLiTE.pileup -n my_fourth_HyLiTE -r Ref.fasta

Note

  • This command line launches HyLiTE alone and does not require any external software. For many users, this may be the preferred starting point for HyLiTE analyses.
  • In later examples, we will often assume that the .pileup file has already been created outside HyLiTE and hence use the -p flag.
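The four ways of starting HyLiTE described in this section differ only in one extra flag. As a hedged summary, the following Python sketch assembles the argument list for each entry point using only the flags shown above (-v, -f, -r, -n, -b, -S, -p). The function and its entry names are hypothetical, and the flag order differs slightly from the examples, which is assumed not to matter.

```python
def hylite_command(protocol, reference, name, entry="full", path=None):
    """Assemble a HyLiTE argument list for one entry point.

    entry is one of "full", "index" (-b), "sam" (-S) or "pileup" (-p);
    path is the index prefix or the pileup file where one is needed.
    """
    base = ["HyLiTE", "-v", "-f", protocol, "-r", reference, "-n", name]
    if entry == "full":
        return base
    if entry == "sam":
        # The protocol file is expected to list .sam files in this case.
        return base + ["-S"]
    if entry in ("index", "pileup"):
        if path is None:
            raise ValueError("entry point '{0}' needs a path".format(entry))
        flag = {"index": "-b", "pileup": "-p"}[entry]
        return base + [flag, path]
    raise ValueError("unknown entry point: {0}".format(entry))
```

The returned list can be handed to subprocess.call, or joined with spaces to print the command before running it by hand.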

4. Using data from paired-end RNA sequencing

In most cases, paired-end reads are not particularly useful for gene expression studies: once the first read maps to a gene, the second read provides no additional information on expression. Because paired-end data adds complexity, we discourage its use with HyLiTE.

However, if paired-end reads are available for one or more of our organisms, the files can be included under the same sample name listing in the protocol file.

In the following example, the child organism has paired-end reads, which are found, respectively, in files Ch_1.fastq and Ch_2.fastq:

simple_protocol_with_paired_end.txt:

Ch    2    sample1    RNAseq    Ch_1.fastq
Ch    2    sample1    RNAseq    Ch_2.fastq
P1    1    sample1    RNAseq    P1.fastq
P2    1    sample1    RNAseq    P2.fastq

Note

  • The paired-end files must immediately follow each other in the protocol file.
  • If we use the -S flag, there must only be one .sam file per sample (i.e., the .sam file must contain paired-end mappings).

Any HyLiTE analysis can then be launched normally.

5. Using several biological replicates

Most RNA-seq studies will contain biological replicates. This information can be combined to improve SNP calling (the basis of read assignment), but must be kept separate for calculations of gene expression.

Consider an example where two biological replicates have been generated for each organism. The .fastq files are labelled with the corresponding replicate (sample) names.

biological_replicate_protocol.txt:

Ch    2    sample1    RNAseq    Ch.sample1.fastq
Ch    2    sample2    RNAseq    Ch.sample2.fastq
P1    1    sample1    RNAseq    P1.sample1.fastq
P1    1    sample2    RNAseq    P1.sample2.fastq
P2    1    sample1    RNAseq    P2.sample1.fastq
P2    1    sample2    RNAseq    P2.sample2.fastq

Note

  • Files must still be grouped by organism and sample.
  • Paired-end read files (if used) must still be grouped together.
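The ordering rules for replicates and paired-end files can be checked mechanically before launching a run. The sketch below is a hypothetical helper, not part of HyLiTE: it takes protocol rows as tuples (organism, ploidy, sample, data type, file) and reports rows that break the grouping by organism or by sample.

```python
def grouping_problems(rows):
    """Return a list of messages; an empty list means the ordering rules hold."""
    problems = []
    organisms_seen = []   # organisms in order of first appearance
    keys_seen = set()     # (organism, sample) pairs already encountered
    previous_key = None
    for row in rows:
        organism, sample = row[0], row[2]
        key = (organism, sample)
        if not organisms_seen or organisms_seen[-1] != organism:
            if organism in organisms_seen:
                problems.append(
                    "organism {0} is not listed in one block".format(organism))
            organisms_seen.append(organism)
        # Files of one sample (e.g. both paired-end files) must be adjacent.
        if key != previous_key and key in keys_seen:
            problems.append(
                "files of {0}/{1} are not adjacent".format(organism, sample))
        keys_seen.add(key)
        previous_key = key
    return problems
```

A row whose organism label was mistyped (for instance, a second-parent file labelled with the first parent's name) shows up as an organism that is not listed in one block.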

Any HyLiTE analysis can then be launched normally.

Note

When several biological replicates are specified, each will be given its own read and read summary result file.

6. Including genomic information

Genomic information can be very useful for calling SNPs. This is especially true for genes with relatively low expression, particularly in the parent species.

To call SNPs, HyLiTE requires a minimum level of coverage (see 2. Detecting SNPs). Furthermore, if any one of the organisms has poor coverage at a given position, any SNP at that position must necessarily be masked out for all organisms. These SNPs therefore cannot be used to assign reads to homeologs.

However, this problem can largely be circumvented by including reads from genomic DNA.

genomic_DNA_protocol.txt:

Ch    2    sample1    RNAseq    Ch.sample1.fastq
Ch    2    sample2    RNAseq    Ch.sample2.fastq
Ch    2    sample3    gDNA      Ch.sample3.fastq
P1    1    sample1    RNAseq    P1.sample1.fastq
P1    1    sample2    RNAseq    P1.sample2.fastq
P1    1    sample3    gDNA      P1.sample3.fastq
P2    1    sample1    RNAseq    P2.sample1.fastq
P2    1    sample2    RNAseq    P2.sample2.fastq
P2    1    sample3    gDNA      P2.sample3.fastq

Note

  • Where genomic data is included, these files must be labeled as 'gDNA' instead of 'RNAseq' in the protocol file.
  • Obviously, no expression files are produced for gDNA samples.
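Protocol files with replicates and genomic samples quickly become repetitive, so they are a natural candidate for generation by script. The following sketch builds the rows for such a design; the function name and the file naming pattern (organism.sample.fastq) are assumptions made for illustration, following the example above.

```python
def protocol_rows(organisms, rna_samples, gdna_samples=()):
    """Build protocol rows grouped by organism, RNAseq samples before gDNA.

    organisms is a list of (name, ploidy) pairs with the child listed first.
    """
    rows = []
    for name, ploidy in organisms:
        for sample in rna_samples:
            rows.append((name, ploidy, sample, "RNAseq",
                         "{0}.{1}.fastq".format(name, sample)))
        for sample in gdna_samples:
            rows.append((name, ploidy, sample, "gDNA",
                         "{0}.{1}.fastq".format(name, sample)))
    return rows


def protocol_text(rows):
    """Render rows as tab-separated lines, one per file."""
    return "".join("\t".join(str(f) for f in row) + "\n" for row in rows)
```

For example, protocol_rows([("Ch", 2), ("P1", 1), ("P2", 1)], ["sample1", "sample2"], ["sample3"]) yields nine rows in the same order as the genomic DNA protocol file of this section.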

7. Comments on input data types

HyLiTE is designed to work with RNA-seq data generated by Illumina sequencing technology. By default, HyLiTE maps your RNA-seq reads with Bowtie2 assuming phred+33 quality encoding. If your data is encoded in phred+64, use the --phred64 argument with HyLiTE. HyLiTE does not accept reads with Illumina quality scores from versions earlier than 1.3 as initial input. For such data, run Bowtie2 yourself with the appropriate option (--solexa-quals) and supply the resulting .sam files to HyLiTE. If you do not know which quality score version your data uses, SolexaQA++ can report it and also clean the data (http://solexaqa.sourceforge.net/).
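If no dedicated tool is at hand, the encoding can often be guessed from the lowest quality character in a .fastq file. The sketch below is a commonly used heuristic, not HyLiTE's own logic: it assumes four-line FASTQ records and the usual ASCII ranges (phred+33 starting at 33, Solexa at 59, phred+64 at 64). A small or uniformly high-quality file can be misclassified, so treat the result as a hint only.

```python
def guess_quality_encoding(fastq_lines):
    """Guess the quality encoding from an iterable of FASTQ lines."""
    lowest = None
    for number, line in enumerate(fastq_lines):
        if number % 4 != 3:          # the quality string is the 4th line of a record
            continue
        quality = line.rstrip("\r\n")
        if not quality:
            continue
        low = min(ord(char) for char in quality)
        lowest = low if lowest is None else min(lowest, low)
    if lowest is None:
        raise ValueError("no quality lines found")
    if lowest < 59:
        return "phred+33"            # HyLiTE default
    if lowest < 64:
        return "solexa+64"           # map with bowtie2 --solexa-quals, then use .sam files
    return "phred+64"                # use --phred64 with HyLiTE
```

A file handle can be passed directly, for example guess_quality_encoding(open("Ch.fastq")); reading only the first few thousand records is usually enough for a first impression.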

If you are not using Illumina-generated data, Bowtie2 may not be the most appropriate mapping software for you. Consider starting from .sam files created by the mapper of your choice. Also note that HyLiTE's SNP detection assumes the error rate expected from Illumina data; this can be changed in the parameter file (see the --alternative_params option).

Restoring a crashed HyLiTE run

If HyLiTE fails unexpectedly for any reason, it is possible to restart the run using the pickle ('checkpointing') file. By default, this file is written every 50 genes. To restart HyLiTE, simply use the option --restore as follows:

HyLiTE --restore my_first_HyLiTE.pickle

Note

  • It is possible to change the rate of the pickling operation using the option --pickling <int>, where <int> is the number of genes between each pickling operation. Pickling too frequently may negatively affect runtime.
  • If HyLiTE crashes before any pickle file is produced, the different entry points (options -b, -S, -p) can be used to avoid wasting runtime.
  • Note that even if HyLiTE crashes, result files contain complete information about any genes processed before the crash occurred.

Using HacknHyLiTE and HyLiTE-merge

The operations that HyLiTE undertakes are not fast. However, HyLiTE has been heavily optimized so that its run time increases only linearly with the size of the pileup file. Because all genes are treated independently, a simple way to reduce runtime is to divide the pileup file into n subsets. Each subset can then run as an independent HyLiTE analysis on a different core of a computing cluster, reducing the execution time by a factor of up to n.

To implement this feature, we have developed three executables:

  • HacknHyLiTE cleanly slices a pileup file into n subsets and generates a PBS-friendly bash script for each HyLiTE analysis.
  • HyLiTE-merge browses the directory of each sub-analysis created by HacknHyLiTE, merges the result files and generates single summary files.
  • HyLiTE-split only slices a pileup file cleanly into subsets, without creating any bash scripts.
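To make the idea of a clean slice concrete: because genes are processed independently, a slice is safe as long as all pileup lines of one gene end up in the same subset. The sketch below illustrates this principle only; it is not the algorithm used by HacknHyLiTE or HyLiTE-split. It assumes that the first tab-separated column of each pileup line is the reference sequence name and that lines of the same gene are contiguous, as in samtools output.

```python
from itertools import groupby


def split_pileup_lines(lines, n):
    """Distribute pileup lines over n subsets without splitting a gene."""
    if n < 1:
        raise ValueError("n must be at least 1")
    subsets = [[] for _ in range(n)]
    by_gene = groupby(lines, key=lambda line: line.split("\t", 1)[0])
    for _gene, block in by_gene:
        block = list(block)
        # Greedy balancing: give the whole gene to the currently smallest subset.
        smallest = min(range(n), key=lambda i: len(subsets[i]))
        subsets[smallest].extend(block)
    return subsets
```

Balancing by line count is a simple proxy for run time; a real implementation might weight genes differently.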

Considering our first example (A first HyLiTE analysis), HacknHyLiTE and HyLiTE-merge would be used as follows:

HacknHyLiTE -n 2 -o test_slicing --name slicing -p test_import_2/test_import_2.pileup -f simple_protocol_file.txt
bash test_slicing/slicing.subset0.sh
bash test_slicing/slicing.subset1.sh
cd test_slicing
HyLiTE-merge

Note

  • HyLiTE-merge is only intended to be used on analyses generated by HacknHyLiTE. Proceed with caution if you use this function in other ways.
  • HacknHyLiTE, HyLiTE-merge and HyLiTE-split are automatically installed along with HyLiTE if you use pip or easy_install.

HyLiTE - the --alternative_params option

This option is unconventional and we discourage its use. However, we deem it potentially useful for future developers. This command can also be used by non-root users to define a personalised parameters file (details about these parameters can be found in the Parameters file page) or to set up an alternative set of parameters.

The command allows the user to specify a file that will be interpreted by python just after the parameter file has been read and before the HyLiTE analysis has begun. It allows the user to, for example, redefine a variable or a function.