HyLiTE output formats
During an analysis, HyLiTE outputs various results in different files. It generally does so in a dedicated directory.
We distinguish seven types of output file, all written to the HyLiTE directory:
The HyLiTE directory
Due to the large number of result files (and there can be many more if you use the full pipeline), HyLiTE requires a directory dedicated to each analysis. By default, HyLiTE creates a new directory in the current working directory. This directory typically has the name of the corresponding HyLiTE analysis. A different name (and path) for the results directory can be specified with the option -o.
1. .pickle file
Python users will be familiar with this kind of file. A pickle file saves the in-memory state of a Python object so that it can be loaded again later. In our case, the pickle file lets you restore a crashed HyLiTE session using the option --restore (more information can be found in the HyLiTE manual: Restoring a crashed HyLiTE run).
Note
If you have tried the example dataset from the HyLiTE manual: 2. A first HyLiTE analysis, you will not have a .pickle file. This is because the pickling step is only done every 50 genes by default and the tutorial dataset only comprises 10 genes.
The number of genes between pickling steps can be changed using the option --pickling (default: 50).
2. .snp.txt file
One of these files is created per HyLiTE analysis. This file contains specific information about every SNP that HyLiTE found. It should look like:
GENE POS REF ALT Ch P1 P2
EfM3.000010 207 A C 1,0 0 0
EfM3.000010 220 C A 0 0 1
EfM3.000010 228 T C 0 0 1
EfM3.000010 268 A C 0 0 1
EfM3.000010 280 C T 0 0 1
EfM3.000010 288 G C 0 0 1
EfM3.000010 294 G T 0 0 1
EfM3.000010 396 T A 1,0 1 -1
EfM3.000010 397 T A 1,0 1 -1
...
Where:
GENE is the ID of the SNP bearing gene
POS is the position of the SNP on the gene
REF is the reference allele
ALT is the alternative allele (i.e. the nucleotide characterising the SNP)
The following columns are named after each organism (beginning with the child) and give information about the presence or absence of the SNP in that organism.
Absence or presence of the SNP is encoded as:
-1 means that the organism has a poor coverage at that position (and thus, the SNP will be masked)
0 means that the organism has the reference allele at that position
1 means that the organism has the alternative allele (at least once for polyploid organisms) at that position
a comma separates the potentially different alleles detected at that position in a polyploid organism
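The encoding above can be decoded with a few lines of Python. This is only a minimal sketch: it assumes the columns are separated by whitespace (tabs or spaces), as in the excerpt, and the helper names decode_call and decode_row are hypothetical, not part of HyLiTE.

```python
# Meaning of each code, as listed above.
CODES = {"-1": "masked", "0": "ref", "1": "alt"}

def decode_call(call):
    """Turn a cell such as '1,0' into ['alt', 'ref'] (one entry per detected allele)."""
    return [CODES[code] for code in call.split(",")]

def decode_row(header, line):
    """Decode one .snp.txt data line, using the header line to name the columns."""
    names = header.split()
    fields = line.split()
    row = dict(zip(names[:4], fields[:4]))      # GENE, POS, REF, ALT
    row["POS"] = int(row["POS"])
    # Every column after ALT is an organism, beginning with the child.
    for organism, call in zip(names[4:], fields[4:]):
        row[organism] = decode_call(call)
    return row

header = "GENE POS REF ALT Ch P1 P2"
row = decode_row(header, "EfM3.000010 396 T A 1,0 1 -1")
print(row["Ch"], row["P1"], row["P2"])
# prints: ['alt', 'ref'] ['alt'] ['masked']
```

Here the child carries both alleles, P1 carries the alternative allele, and P2 has poor coverage, so this SNP would be masked.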
3. .snp.summary.txt file
One of these files is created per HyLiTE analysis. This file contains the same information as the .snp.txt file, but summarised by gene. It should look like:
GENE Ch P1 P2 MASKED COMMON Ch+P2 Ch+P1
EfM3.000010 1 0 6 4 0 0 0
EfM3.000020 0 0 0 19 0 0 0
EfM3.000030 0 0 0 12 0 0 0
EfM3.000040 0 0 0 22 0 0 0
EfM3.000050 8 0 8 4 1 49 2
EfM3.000060 23 0 1 0 4 22 7
EfM3.000070 2 0 3 5 1 58 3
EfM3.000080 5 0 7 0 1 44 0
EfM3.000090 1 0 21 11 2 29 3
EfM3.000100 22 0 39 104 6 172 9
Where:
GENE is the gene ID.
The other columns contain the SNP count for each category for the corresponding gene.
MASKED means that at least one organism has poor coverage at the SNP position (thus no information about presence or absence can be obtained and categorization is impossible).
COMMON means that the SNP is shared between every organism in the study.
The other columns are named after the organisms sharing a SNP. For example, the column P1 lists the number of SNPs specific to the organism P1 for each gene, and the column Ch+P1 lists the number of SNPs shared between these two organisms for each gene.
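To make the relationship between the two SNP files concrete, the sketch below derives summary-style categories from .snp.txt rows. It is an illustration of the rules described above, not HyLiTE's own code: the function names are hypothetical, whitespace-separated columns are assumed, and the order of organisms inside a combined label (for example Ch+P1) simply follows the header here and may differ from HyLiTE's column names.

```python
from collections import Counter

# Rows copied from the .snp.txt excerpt (the real file continues after these).
SNP_TEXT = """\
GENE POS REF ALT Ch P1 P2
EfM3.000010 207 A C 1,0 0 0
EfM3.000010 220 C A 0 0 1
EfM3.000010 228 T C 0 0 1
EfM3.000010 268 A C 0 0 1
EfM3.000010 280 C T 0 0 1
EfM3.000010 288 G C 0 0 1
EfM3.000010 294 G T 0 0 1
EfM3.000010 396 T A 1,0 1 -1
EfM3.000010 397 T A 1,0 1 -1
"""

def categorise(calls):
    """Map {organism: call} to a summary-style label (hypothetical helper)."""
    codes = {name: call.split(",") for name, call in calls.items()}
    # Poor coverage in any organism masks the SNP.
    if any("-1" in alleles for alleles in codes.values()):
        return "MASKED"
    carriers = [name for name, alleles in codes.items() if "1" in alleles]
    if len(carriers) == len(codes):
        return "COMMON"
    return "+".join(carriers)

def summarise(text):
    """Count SNPs per gene and per category."""
    lines = text.splitlines()
    header = lines[0].split()
    organisms = header[4:]
    counts = {}
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            continue                      # skip blank or truncated lines
        label = categorise(dict(zip(organisms, fields[4:])))
        counts.setdefault(fields[0], Counter())[label] += 1
    return counts

summary = summarise(SNP_TEXT)
print(dict(summary["EfM3.000010"]))
```

On these nine rows the sketch counts one Ch-specific SNP and six P2-specific SNPs, which agrees with the first line of the summary table. It finds only two masked SNPs rather than four, because the .snp.txt excerpt is truncated.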
4. .read.txt files
There is one file per child sample included in the study (the child and the sample name are both included in the file name). This file contains specific information about every read in the child for a given biological replicate. It should look like:
GENE ORGANISM SAMPLE START STOP CAT NEW
EfM3.000010 sample1 Ch 108 190 UNK False
EfM3.000010 sample1 Ch 111 196 UNK False
EfM3.000010 sample1 Ch 102 197 UNK False
EfM3.000010 sample1 Ch 117 200 UNK False
EfM3.000010 sample1 Ch 181 275 P1 False
EfM3.000010 sample1 Ch 192 277 P1 False
EfM3.000010 sample1 Ch 189 278 P1 False
EfM3.000010 sample1 Ch 192 278 P1 False
EfM3.000010 sample1 Ch 192 280 P1 False
...
Where:
ORGANISM is the name of the child organism
SAMPLE is the name of the sample (or biological replicate)
GENE is the ID of the gene where the read mapped
START is the starting position of the read on the gene
STOP is the ending position of the read on the gene (please note that START < STOP)
CAT is the category of the read; i.e. its parental association
NEW is a boolean specifying if the read contains at least one child-specific SNP
The categories are noted as follows:
UNK means that it was not possible to categorize the read because it contains only masked SNPs and/or only child-specific SNPs
P1 means that the read contains P1-specific SNPs
P2 means that the read contains P2-specific SNPs
UNINFORMATIVE means that the read could come from either P1 or P2; the simplified default output file combines all such reads into this category
With the --full_output flag, the UNINFORMATIVE category no longer appears in the file. Instead, a more detailed output is produced, which includes UNK, (P1), (P2) and categories built with the following symbols:
+ means that the read appears to contain SNPs specific to both the left part and the right part (for example: (P1+P2) means a read containing elements from both P1 and P2; i.e. a chimeric read)
| means that the read can be categorised as either the left part or the right part (for example: (P1)|(P2) means a read that could come from P1 or P2)
Thus a category (P2) means that the read comes from the P2 copy of the gene inside the hybrid.
A category of (P1)|(P2) means that the read contains SNPs, but none are specific to either parent and the read could come from either of the parental copies of the gene.
A category of (P1+P2) means that the read seems to be chimeric: it contains SNPs specific to both parents.
Note
For the example of two parents only, the brackets are not very useful. However, their use is vital when dealing with more than two parents.
Please note that UNK and (P1)|(P2) are fundamentally different. Both mean that the read could come from either parent. However, UNK means that it might be possible to assign a single parent if the coverage were better in this region, while (P1)|(P2) means that reads coming from the two parental copies of the gene are indistinguishable in this region.
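The category notation can be split mechanically into its parts. The following is a minimal sketch, assuming category strings of the forms shown above (with or without brackets); parse_category and describe are hypothetical helper names, and the four descriptive labels are only illustrative.

```python
def parse_category(cat):
    """Return a list of alternatives; each alternative is a tuple of parent names.

    '(P1)|(P2)' -> [('P1',), ('P2',)]   the read could come from P1 or from P2
    '(P1+P2)'   -> [('P1', 'P2')]       the read has SNPs specific to both parents
    'UNK'       -> []                   no parental information
    """
    if cat in ("UNK", "UNINFORMATIVE"):
        return []
    alternatives = []
    for part in cat.split("|"):                 # '|' separates alternatives
        names = part.strip().strip("()")        # drop the surrounding brackets
        alternatives.append(tuple(names.split("+")))   # '+' joins parents
    return alternatives

def describe(cat):
    """Give a rough, illustrative description of a category string."""
    alternatives = parse_category(cat)
    if not alternatives:
        return "unassigned"
    if len(alternatives) > 1:
        return "ambiguous"
    if len(alternatives[0]) > 1:
        return "chimeric"
    return "assigned"

for cat in ["P1", "(P2)", "(P1)|(P2)", "(P1+P2)", "UNK"]:
    print(cat, parse_category(cat), describe(cat))
```

Keeping the alternatives as a list of tuples means the same function also handles bracketed expressions involving more than two parents, where the brackets become necessary.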
5. .read.summary.txt files
There is one file per child sample included in the study (the child and the sample names are both included in the file name). This file contains condensed information about the parental origin of every read in a given biological replicate, grouped by gene. It should look like:
GENE P1 P1+N P2 P2+N UNINFORMATIVE UNINFORMATIVE+N UNK UNK+N
EfM3.000010 877 7 6 0 0 0 89 0
EfM3.000020 0 0 0 0 0 0 2063 0
EfM3.000030 0 0 0 0 0 0 502 0
EfM3.000040 0 0 0 0 0 0 171 0
EfM3.000050 4420 45 1037 93 0 0 260 11
EfM3.000060 4294 165 1547 715 20 0 610 170
EfM3.000070 12878 0 3248 14 0 0 2502 0
EfM3.000080 4908 42 1287 45 0 0 228 1
EfM3.000090 589 0 30 0 0 0 174 6
EfM3.000100 15922 44 17380 145 14 0 2529 120
Where GENE is the gene ID and the other columns give the count of reads for each category. The categories are named in the same fashion as in the .read.txt files in the full output. The default file shown above combines uninformative reads such as (P1)|(P2) into the UNINFORMATIVE categories. Potentially chimeric reads (P2+P1) and (P2+P1)+N are not displayed.
For a full output, use the --full_output flag when executing HyLiTE. With multiple parent and child samples in a tetraploid case, this output can be complex. For the example above, your .read.summary.txt file would look like:
GENE (P1) (P1)+N (P2) (P2)+N (P2)|(P1) (P2+P1) (P2+P1)+N UNK
EfM3.000010 877 7 6 0 0 0 0 89
EfM3.000020 0 0 0 0 0 0 0 2063
EfM3.000030 0 0 0 0 0 0 0 502
EfM3.000040 0 0 0 0 0 0 0 171
EfM3.000050 4420 45 1037 93 0 222 11 38
EfM3.000060 4294 165 1547 715 20 136 170 474
EfM3.000070 12878 0 3248 14 0 746 0 1756
EfM3.000080 4908 42 1287 45 0 169 1 59
EfM3.000090 589 0 30 0 0 5 6 169
EfM3.000100 15922 44 17380 145 14 599 120 1930
Note
The order of the columns (apart from GENE, which is always first) can vary from one HyLiTE analysis to another.
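Because the column order is not fixed, scripts that read this file are safer when they look columns up by name rather than by position. The sketch below shows one way to do this, assuming whitespace-separated columns; read_summary is a hypothetical helper, and the second table is the same row from the excerpt with its columns deliberately shuffled.

```python
def read_summary(text):
    """Return {gene: {category: count}} from the content of a .read.summary.txt file."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    table = {}
    for line in lines[1:]:
        fields = line.split()
        # Pair every count with its column name, whatever the column order is.
        table[fields[0]] = dict(zip(header[1:], map(int, fields[1:])))
    return table

FULL = """\
GENE (P1) (P1)+N (P2) (P2)+N (P2)|(P1) (P2+P1) (P2+P1)+N UNK
EfM3.000050 4420 45 1037 93 0 222 11 38
"""

# Same data, columns in a different order.
REORDERED = """\
GENE UNK (P2) (P1) (P2)+N (P1)+N (P2+P1) (P2)|(P1) (P2+P1)+N
EfM3.000050 38 1037 4420 93 45 222 0 11
"""

counts = read_summary(FULL)["EfM3.000050"]
print(counts["(P1)"], counts["(P2)"], counts["UNK"])
print(read_summary(FULL) == read_summary(REORDERED))
```

Both inputs produce the same dictionary, so downstream code does not depend on the order in which HyLiTE happened to write the columns.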
6. .expression.txt file
One of these files is created per HyLiTE analysis. This file contains information about the expression level of the different organisms and biological replicates for each gene. It should look like:
GENE Ch%sample1 P1%sample1 P2%sample1
EfM3.000010 979 90 5
EfM3.000020 2063 262 3
EfM3.000030 502 146 3
EfM3.000040 171 64 40
EfM3.000050 5866 569 2523
EfM3.000060 7521 2164 7451
EfM3.000070 18642 1993 5365
EfM3.000080 6511 626 2527
EfM3.000090 799 85 130
EfM3.000100 36154 1998 32802
Where GENE is the gene ID and the other columns list the number of reads that map to this gene for each organism and sample (the column names have the form organism%sample).
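The organism%sample column names can be split back into their two parts when loading the file. This is a minimal sketch, assuming whitespace-separated columns and names that contain no "%" other than the separator; parse_expression is a hypothetical helper.

```python
def parse_expression(text):
    """Return {gene: {(organism, sample): read_count}} from .expression.txt content."""
    lines = [line for line in text.splitlines() if line.strip()]
    # 'Ch%sample1' -> ('Ch', 'sample1'); split on the first '%' only.
    columns = [tuple(name.split("%", 1)) for name in lines[0].split()[1:]]
    table = {}
    for line in lines[1:]:
        fields = line.split()
        table[fields[0]] = dict(zip(columns, map(int, fields[1:])))
    return table

EXPRESSION = """\
GENE Ch%sample1 P1%sample1 P2%sample1
EfM3.000010 979 90 5
EfM3.000020 2063 262 3
"""

expression = parse_expression(EXPRESSION)
print(expression["EfM3.000010"][("Ch", "sample1")])
```

Keying the counts by an (organism, sample) pair makes it straightforward to select all replicates of one organism once a study contains several samples.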
7. .run.summary.txt file
A single .run.summary.txt file should be created when running a HyLiTE analysis. It describes the number of successfully mapped reads for the child polyploid (from .sam files) and how they are allocated among major categories. The total number of SNPs identified and the number of child-specific SNPs are also reported in this file.
It should look like:
Total number of child reads mapping on the reference: 79208
Number of child reads unambiguously assigned to a parent: 69738
Number of child reads unambiguously assigned to P1: 44191
Number of child reads unambiguously assigned to P2: 25547
Number of child reads with uninformative assignment: 34
Number of child reads with unknown or ambiguous assignement: 9436
Total number of SNPs identified: 741
Total number of child unique SNPs: 62
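The figures in this file can be cross-checked with a short script. The sketch below assumes that each line has the form "label: integer", as in the example above; parse_run_summary is a hypothetical helper, and the labels are copied verbatim from the example, including its spelling.

```python
RUN_SUMMARY = """\
Total number of child reads mapping on the reference: 79208
Number of child reads unambiguously assigned to a parent: 69738
Number of child reads unambiguously assigned to P1: 44191
Number of child reads unambiguously assigned to P2: 25547
Number of child reads with uninformative assignment: 34
Number of child reads with unknown or ambiguous assignement: 9436
Total number of SNPs identified: 741
Total number of child unique SNPs: 62
"""

def parse_run_summary(text):
    """Return {label: count} for every 'label: integer' line."""
    stats = {}
    for line in text.splitlines():
        label, sep, value = line.rpartition(":")
        if sep and value.strip().isdigit():
            stats[label.strip()] = int(value)
    return stats

stats = parse_run_summary(RUN_SUMMARY)
total = stats["Total number of child reads mapping on the reference"]
assigned = stats["Number of child reads unambiguously assigned to a parent"]
p1 = stats["Number of child reads unambiguously assigned to P1"]
p2 = stats["Number of child reads unambiguously assigned to P2"]
uninformative = stats["Number of child reads with uninformative assignment"]
unknown = stats["Number of child reads with unknown or ambiguous assignement"]

# In this example the categories partition the mapped child reads.
parents_add_up = (p1 + p2 == assigned)
categories_add_up = (assigned + uninformative + unknown == total)
print(parents_add_up, categories_add_up)
```

In the example above, 44191 + 25547 = 69738 reads are assigned to a parent, and 69738 + 34 + 9436 = 79208 accounts for every mapped child read.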