HyLiTE output formats

During an analysis, HyLiTE writes its results to several files, generally inside a directory dedicated to that analysis.

The HyLiTE directory

We distinguish 7 different types of output format:

  1. .pickle file
  2. .snp.txt file
  3. .snp.summary.txt file
  4. .read.txt files
  5. .read.summary.txt files
  6. .expression.txt file
  7. .run.summary.txt file

The HyLiTE directory

Due to the large number of result files (and there can be many more if you use the full pipeline), HyLiTE requires a directory dedicated to each analysis. By default, HyLiTE creates a new directory in the current working directory. This directory typically has the name of the corresponding HyLiTE analysis. It is possible to specify a different name (and path) for the results directory with the option -o.

1. .pickle file

Python users should be familiar with this kind of file. A pickle file saves the in-memory content of a Python object so that it can be loaded again later. In our case, the pickle file lets you restore a crashed HyLiTE session using the option --restore (more information can be found in the HyLiTE manual: Restoring a crashed HyLiTE run).

Note

  • If you have tried the example dataset from the HyLiTE manual: 2. A first HyLiTE analysis, you will not have a .pickle file. This is because the pickling step is only done every 50 genes by default and the tutorial dataset only comprises 10 genes.
  • The number of genes between pickling can be changed using the option --pickling (default: 50).

2. .snp.txt file

One of these files is created per HyLiTE analysis. This file contains specific information about every SNP that HyLiTE found. It should look like:

GENE         POS     REF     ALT     Ch      P1      P2
EfM3.000010  207     A       C       1,0     0       0
EfM3.000010  220     C       A       0       0       1
EfM3.000010  228     T       C       0       0       1
EfM3.000010  268     A       C       0       0       1
EfM3.000010  280     C       T       0       0       1
EfM3.000010  288     G       C       0       0       1
EfM3.000010  294     G       T       0       0       1
EfM3.000010  396     T       A       1,0     1       -1
EfM3.000010  397     T       A       1,0     1       -1
...

Where:

  • GENE is the ID of the SNP bearing gene
  • POS is the position of the SNP on the gene
  • REF is the reference allele
  • ALT is the alternative allele (i.e. the nucleotide characterising the SNP)
  • The following columns are named after each organism (beginning with the child) and give information about the presence or absence of the SNP in that organism.

Absence or presence of the SNP is encoded as:

  • -1 means that the organism has a poor coverage at that position (and thus, the SNP will be masked)
  • 0 means that the organism has the reference allele at that position
  • 1 means that the organism has the alternative allele (at least once for polyploid organisms) at that position
  • a comma separates the potentially different alleles detected at that position in a polyploid organism
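
As an illustration of the encoding above, here is a minimal Python sketch that decodes one row of a .snp.txt file. It assumes the columns are separated by whitespace (as in the sample) and that organism names contain no spaces; the function names decode_genotype and parse_snp_line are hypothetical helpers, not part of HyLiTE.

```python
def decode_genotype(field):
    """Decode one organism column of a .snp.txt row (e.g. "0", "1", "-1", "1,0")."""
    # A comma separates the alleles detected in a polyploid organism.
    codes = [int(code) for code in field.split(",")]
    masked = -1 in codes  # -1: poor coverage, the SNP is masked
    alleles = set() if masked else set(codes)
    return {"masked": masked, "has_ref": 0 in alleles, "has_alt": 1 in alleles}


def parse_snp_line(header, line):
    """Pair each whitespace-separated value with its column name (assumed layout)."""
    names = header.split()
    values = line.split()
    row = dict(zip(names[:4], values[:4]))  # GENE, POS, REF, ALT
    row["POS"] = int(row["POS"])
    # Remaining columns are named after the organisms, child first.
    row["genotypes"] = {
        name: decode_genotype(value) for name, value in zip(names[4:], values[4:])
    }
    return row


header = "GENE         POS     REF     ALT     Ch      P1      P2"
line = "EfM3.000010  396     T       A       1,0     1       -1"
row = parse_snp_line(header, line)
print(row["genotypes"])
```

For the row at position 396, the child carries both alleles, P1 carries the alternative allele, and P2 is masked.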

3. .snp.summary.txt file

One of these files is created per HyLiTE analysis. This file contains the same information as the .snp.txt file, but summarised by gene. It should look like:

GENE         Ch      P1      P2      MASKED  COMMON  Ch+P2   Ch+P1
EfM3.000010  1       0       6       4       0       0       0
EfM3.000020  0       0       0       19      0       0       0
EfM3.000030  0       0       0       12      0       0       0
EfM3.000040  0       0       0       22      0       0       0
EfM3.000050  8       0       8       4       1       49      2
EfM3.000060  23      0       1       0       4       22      7
EfM3.000070  2       0       3       5       1       58      3
EfM3.000080  5       0       7       0       1       44      0
EfM3.000090  1       0       21      11      2       29      3
EfM3.000100  22      0       39      104     6       172     9

Where:

  • GENE is the gene ID.
  • The other columns contain the SNP count for each category for the corresponding gene.
  • MASKED means that at least one organism has poor coverage at the SNP position (thus no information about presence or absence can be obtained and categorization is impossible).
  • COMMON means that the SNP is shared between every organism in the study.
  • The other columns are named after the organisms sharing a SNP. For example, the column P1 lists the number of SNPs specific to the organism P1 for each gene. Likewise, the column Ch+P1 lists the number of SNPs shared between these two organisms for each gene.
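
The per-gene counts can be loaded by column name, for example to see what share of a gene's SNPs is masked. The sketch below is only an illustration: it assumes whitespace-separated columns and that each SNP is counted in exactly one category, so that a row sums to the gene's SNP total. The names read_snp_summary and masked_fraction are hypothetical.

```python
SAMPLE = """\
GENE         Ch      P1      P2      MASKED  COMMON  Ch+P2   Ch+P1
EfM3.000010  1       0       6       4       0       0       0
EfM3.000050  8       0       8       4       1       49      2
"""


def read_snp_summary(text):
    """Return {gene: {category: count}} from the text of a .snp.summary.txt file."""
    lines = text.strip().splitlines()
    names = lines[0].split()
    table = {}
    for line in lines[1:]:
        values = line.split()
        # values[0] is the gene ID; the rest are integer SNP counts.
        table[values[0]] = dict(zip(names[1:], map(int, values[1:])))
    return table


def masked_fraction(counts):
    """Fraction of a gene's SNPs that are masked (assumes disjoint categories)."""
    total = sum(counts.values())
    return counts["MASKED"] / total if total else 0.0


summary = read_snp_summary(SAMPLE)
for gene, counts in summary.items():
    print(gene, sum(counts.values()), round(masked_fraction(counts), 3))
```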

4. .read.txt files

There is one file per child sample included in the study (the child and the sample name are both included in the file name). This file contains specific information about every read in the child for a given biological replicate. It should look like:

GENE         ORGANISM        SAMPLE  START   STOP    CAT     NEW
EfM3.000010  sample1 Ch      108     190     UNK     False
EfM3.000010  sample1 Ch      111     196     UNK     False
EfM3.000010  sample1 Ch      102     197     UNK     False
EfM3.000010  sample1 Ch      117     200     UNK     False
EfM3.000010  sample1 Ch      181     275     P1      False
EfM3.000010  sample1 Ch      192     277     P1      False
EfM3.000010  sample1 Ch      189     278     P1      False
EfM3.000010  sample1 Ch      192     278     P1      False
EfM3.000010  sample1 Ch      192     280     P1      False
...

Where:

  • ORGANISM is the name of the child organism
  • SAMPLE is the name of the sample (or biological replicate)
  • GENE is the ID of the gene where the read mapped
  • START is the starting position of the read on the gene
  • STOP is the ending position of the read on the gene (please note that START < STOP)
  • CAT is the category of the read; i.e. its parental association
  • NEW is a boolean specifying if the read contains at least one child-specific SNP

The categories are noted as follows:

  • UNK means that it was not possible to categorize the read because it contains only masked SNPs and/or only child-specific SNPs
  • P1 means that the read contains P1-specific SNPs
  • P2 means that the read contains P2-specific SNPs
  • UNINFORMATIVE means that the read could come from either P1 or P2; the simplified default output file combines all such reads into this category

With the --full_output flag, the UNINFORMATIVE category no longer appears in the file. Instead, a more detailed output is produced, which includes UNK, (P1), (P2) and categories built with the following symbols:

  • + means that the read appears to contain SNPs specific to both the left part and the right part (for example: (P1+P2) means a read containing elements from both P1 and P2; i.e. a chimeric read)
  • | means that the read can be categorised as either the left part or the right part (for example: (P1)|(P2) means a read that could come from P1 or P2)

Thus a category of (P2) means that the read comes from the P2 copy of the gene inside the hybrid.

A category of (P1)|(P2) means that the read contains SNPs, but none are specific to either parent and the read could come from either of the parental copies of the gene.

A category of (P1+P2) means that the read seems to be chimeric: it contains SNPs specific to both parents.

Note

  • For the example of two parents only, the brackets are not very useful. However, their use is vital when dealing with more than two parents.
  • Please note that UNK and (P1)|(P2) are fundamentally different. Both mean that the read could come from either parent. However, UNK means that it might be possible to assign a single parent if the coverage were better in this region, while (P1)|(P2) means that reads coming from the two parental copies of the gene are indistinguishable in this region.
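
The bracket notation can be split mechanically: | separates alternative origins and + joins the parents within one origin. The following Python sketch illustrates this under the assumption that organism names contain none of the characters (, ), | or +. The functions parse_category and describe are hypothetical helpers, not part of HyLiTE.

```python
def parse_category(cat):
    """Split a CAT value into alternative origins, each a list of parents.

    "(P1)|(P2)" -> [["P1"], ["P2"]]   (either origin is possible)
    "(P1+P2)"   -> [["P1", "P2"]]     (one origin mixing both parents)
    """
    if cat in ("UNK", "UNINFORMATIVE"):
        return []  # no parental assignment is recorded
    # strip("()") also accepts the unbracketed names of the default output.
    return [part.strip("()").split("+") for part in cat.split("|")]


def describe(cat):
    """Give a coarse, human-readable label for a CAT value."""
    alternatives = parse_category(cat)
    if not alternatives:
        return "unassigned"
    if len(alternatives) > 1:
        return "ambiguous"
    if len(alternatives[0]) > 1:
        return "chimeric"
    return "single parent"


for cat in ("UNK", "(P2)", "(P1)|(P2)", "(P1+P2)"):
    print(cat, "->", describe(cat))
```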

5. .read.summary.txt files

There is one file per child sample included in the study (the child and the sample names are both included in the file name). This file contains condensed information about the parental origin of every read in a given biological replicate, grouped by gene. It should look like:

GENE    P1      P1+N    P2      P2+N    UNINFORMATIVE   UNINFORMATIVE+N UNK     UNK+N
EfM3.000010     877     7       6       0       0       0       89      0
EfM3.000020     0       0       0       0       0       0       2063    0
EfM3.000030     0       0       0       0       0       0       502     0
EfM3.000040     0       0       0       0       0       0       171     0
EfM3.000050     4420    45      1037    93      0       0       260     11
EfM3.000060     4294    165     1547    715     20      0       610     170
EfM3.000070     12878   0       3248    14      0       0       2502    0
EfM3.000080     4908    42      1287    45      0       0       228     1
EfM3.000090     589     0       30      0       0       0       174     6
EfM3.000100     15922   44      17380   145     14      0       2529    120

Where GENE is the gene ID and the other columns give the count of reads for each category. The categories are named in the same fashion as in the .read.txt files. The default file shown above combines uninformative reads such as (P1)|(P2) into the UNINFORMATIVE categories. Potentially chimeric reads, (P2+P1) and (P2+P1)+N, are not displayed as separate columns.

For a full output, use the --full_output flag when executing HyLiTE. With multiple parent and child samples in a tetraploid case, this output can be complex. In this example, your .read.summary.txt file would look like:

GENE         (P1)    (P1)+N  (P2)    (P2)+N  (P2)|(P1)       (P2+P1) (P2+P1)+N       UNK
EfM3.000010  877     7       6       0       0               0       0               89
EfM3.000020  0       0       0       0       0               0       0               2063
EfM3.000030  0       0       0       0       0               0       0               502
EfM3.000040  0       0       0       0       0               0       0               171
EfM3.000050  4420    45      1037    93      0               222     11              38
EfM3.000060  4294    165     1547    715     20              136     170             474
EfM3.000070  12878   0       3248    14      0               746     0               1756
EfM3.000080  4908    42      1287    45      0               169     1               59
EfM3.000090  589     0       30      0       0               5       6               169
EfM3.000100  15922   44      17380   145     14              599     120             1930

Note

The order of the columns (apart from GENE, which is always first) can vary from one HyLiTE analysis to another.
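
Because the column order can vary, it is safer to look counts up by column name than by position. The Python sketch below illustrates this on the default output, assuming whitespace-separated columns. The names read_summary and parent_share are hypothetical, and the "share" computed here is only one possible way to summarise the counts.

```python
SAMPLE = """\
GENE    P1      P1+N    P2      P2+N    UNINFORMATIVE   UNINFORMATIVE+N UNK     UNK+N
EfM3.000010     877     7       6       0       0       0       89      0
EfM3.000050     4420    45      1037    93      0       0       260     11
"""


def read_summary(text):
    """Return {gene: {column name: read count}}, whatever the column order."""
    lines = text.strip().splitlines()
    names = lines[0].split()
    rows = {}
    for line in lines[1:]:
        record = dict(zip(names, line.split()))
        gene = record.pop("GENE")
        rows[gene] = {name: int(value) for name, value in record.items()}
    return rows


def parent_share(counts, parent, parents):
    """Share of parent-assigned reads (with or without +N) attributed to one parent."""
    def assigned(name):
        return counts.get(name, 0) + counts.get(name + "+N", 0)

    total = sum(assigned(name) for name in parents)
    return assigned(parent) / total if total else None


rows = read_summary(SAMPLE)
for gene, counts in rows.items():
    print(gene, parent_share(counts, "P1", ["P1", "P2"]))
```

With a --full_output file, the same functions can be used by passing the bracketed column names, for example "(P1)" and "(P2)".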

6. .expression.txt file

One of these files is created per HyLiTE analysis. This file contains information about the expression level of the different organisms and biological replicates for each gene. It should look like:

GENE         Ch%sample1      P1%sample1      P2%sample1
EfM3.000010  979             90              5
EfM3.000020  2063            262             3
EfM3.000030  502             146             3
EfM3.000040  171             64              40
EfM3.000050  5866            569             2523
EfM3.000060  7521            2164            7451
EfM3.000070  18642           1993            5365
EfM3.000080  6511            626             2527
EfM3.000090  799             85              130
EfM3.000100  36154           1998            32802

Where GENE is the gene ID and the other columns list the number of reads that map to this gene for each organism and sample; each column is named organism%sample.
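
The organism%sample column names can be split back into their two parts when loading the file. The Python sketch below assumes whitespace-separated columns and that organism names do not themselves contain a % character. The names split_columns and parse_expression_line are hypothetical.

```python
def split_columns(header):
    """Turn "Ch%sample1" style column names into (organism, sample) pairs."""
    names = header.split()[1:]  # skip the GENE column
    return [tuple(name.split("%", 1)) for name in names]


def parse_expression_line(columns, line):
    """Return the gene ID and a {(organism, sample): read count} mapping."""
    values = line.split()
    counts = {column: int(value) for column, value in zip(columns, values[1:])}
    return values[0], counts


header = "GENE         Ch%sample1      P1%sample1      P2%sample1"
columns = split_columns(header)
gene, counts = parse_expression_line(
    columns, "EfM3.000010  979             90              5"
)
print(gene, counts)
```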

7. .run.summary.txt file

A single .run.summary.txt file should be created when running a HyLiTE analysis. It reports the number of successfully mapped reads for the child polyploid (from .sam files) and how they are allocated among the major categories. The total number of SNPs identified and the number of child-specific SNPs are also given in this file.

It should look like:

Total number of child reads mapping on the reference:    79208
Number of child reads unambiguously assigned to a parent:        69738
Number of child reads unambiguously assigned to P1:      44191
Number of child reads unambiguously assigned to P2:      25547
Number of child reads with uninformative assignment:     34
Number of child reads with unknown or ambiguous assignement:     9436
Total number of SNPs identified:        741
Total number of child unique SNPs:      62
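
Each line of this file is a label followed by a colon and a number, so it can be read into a dictionary. The Python sketch below does so and checks that the read categories add up, which holds for the sample above but is not stated as a guarantee. The function parse_run_summary is a hypothetical helper, and the labels are copied verbatim from the sample.

```python
SAMPLE = """\
Total number of child reads mapping on the reference:    79208
Number of child reads unambiguously assigned to a parent:        69738
Number of child reads unambiguously assigned to P1:      44191
Number of child reads unambiguously assigned to P2:      25547
Number of child reads with uninformative assignment:     34
Number of child reads with unknown or ambiguous assignement:     9436
Total number of SNPs identified:        741
Total number of child unique SNPs:      62
"""


def parse_run_summary(text):
    """Return {label: integer value} for every "label: value" line."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        label, value = line.rsplit(":", 1)
        stats[label.strip()] = int(value)
    return stats


stats = parse_run_summary(SAMPLE)
prefix = "Number of child reads "
assigned = stats[prefix + "unambiguously assigned to a parent"]
per_parent = (
    stats[prefix + "unambiguously assigned to P1"]
    + stats[prefix + "unambiguously assigned to P2"]
)
others = (
    stats[prefix + "with uninformative assignment"]
    + stats[prefix + "with unknown or ambiguous assignement"]
)
total = stats["Total number of child reads mapping on the reference"]
print(per_parent == assigned, assigned + others == total)
```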