Welcome to Epi-Speller


Introduction:

Epi-Speller is a program for analyzing multiple genome-wide epigenomic profiling data sets. It includes:

  • Signal discretization based on automatic inference of cut-offs for genome-wide signal levels.
  • Clustering based on letter-representation.
  • Summarizing frequent signals as sequence logos using WebLogo.

  • Availability:

    Source code:

    Epi-Speller package

    Installation:

    Unzip the package and compile the epi_letter.cpp program with the following command (in the same folder):

    g++ epi_letter.cpp -o epi_letter


    How to use?

    Input data:

    A list of genomic coordinates (e.g. windows, bins, tiles, ...) and the corresponding signals (microarray intensity or number of mapped short reads, with or without normalization) for the chromatin marks, in the following tab-separated format:

    - First row: Names of chromatin marks

    - Each following row: the genomic coordinate (format start:end) first, followed by the corresponding signals for the chromatin marks named in the first row.

    - Example input file: example.txt consists of 21K tiles from 12 tiling arrays for histone modification marks and DNA methylation in Arabidopsis.
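    The input format above can be read with a few lines of code. The following is only a sketch under assumptions: the function name read_signal_table, the mark names and the signal values are invented for illustration and are not part of the Epi-Speller package. It tolerates a header with or without a label cell above the coordinate column, since the format description does not say which is used.

```python
import io

def read_signal_table(handle):
    """Parse a tab-separated table: header of mark names, then one row per tile."""
    rows = [line.rstrip("\r\n").split("\t") for line in handle if line.strip()]
    header, data = rows[0], rows[1:]
    tiles = []
    for fields in data:
        # The first field is the coordinate in start:end form.
        start, end = fields[0].split(":")
        signals = [float(value) for value in fields[1:]]
        tiles.append(((int(start), int(end)), signals))
    # Assumption: the header may or may not carry a cell for the coordinate
    # column, so keep only as many trailing names as there are signal columns.
    n_marks = len(tiles[0][1]) if tiles else len(header)
    marks = header[-n_marks:]
    return marks, tiles

# Hypothetical two-mark, two-tile input (values are made up).
demo = "H3K4me3\tH3K27me3\n1000:1500\t0.8\t-1.2\n1500:2000\t2.1\t0.3\n"
marks, tiles = read_signal_table(io.StringIO(demo))
print(marks)
print(tiles)
```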

    Running the Epi-Speller step by step

    1. Grouping of epigenetic signatures with input data example.txt

        perl dictionary_maker.pl -i <example.txt> -k <number_of_epi_letter>

        Example of running with 3 letters (e.g. Low, Medium, High): perl dictionary_maker.pl -i example.txt -k 3

        The program will output a list of files, including the automatically inferred cutoffs for each individual mark (with extension .cutoff).

    2. Assigning epi-letters
        R --vanilla < alphabet_chrom.R --f <input file> --k <number_of_epi_letter> --d <dictionary_file> --r <0> --o <multiple_epigenome_filename>

        Example: R --vanilla < alphabet_chrom.R --f example.txt --k 3 --d epi_letter.dict --r 0 --o example.epi

        Create a text file with the acronyms of your choice for the epi-letters, one letter per row, and pass it with the --d parameter (e.g. epi_letter.dict). The --r parameter controls whether random epi-letter-represented epigenomes are created (0 = no, 1 = yes; default 0).

        This step writes the multiple epigenomes for all chromatin marks, in epi-letter representation, to a single file (the --o parameter sets the output file).

        It also creates the look-up dictionary (.dict) listing all the tiles with their coordinates, signals and assigned letter_ID, and the epi-letter string file (.dna) for each individual mark. The coordinate file (.coor) is created for use in the next step.
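        To make the idea of epi-letter assignment concrete, here is a minimal sketch of discretizing signals against a set of cutoffs. It is not the code of alphabet_chrom.R: the function name assign_letters and the cutoff values are hypothetical, and a signal equal to a cutoff is (by assumption) placed in the higher class. In the real workflow the cutoffs come from the .cutoff files of step 1.

```python
from bisect import bisect_right

def assign_letters(signals, cutoffs, letters):
    """Map each signal to a letter; cutoffs must be sorted in ascending order."""
    if len(letters) != len(cutoffs) + 1:
        raise ValueError("need exactly one more letter than cutoffs")
    # bisect_right counts how many cutoffs are <= the signal,
    # which is the index of the letter class the signal falls into.
    return "".join(letters[bisect_right(cutoffs, s)] for s in signals)

# Hypothetical cutoffs for one mark with 3 letters (Low, Medium, High).
cutoffs = [0.0, 1.5]
signals = [-1.2, 0.3, 2.1, 1.5]
epi_string = assign_letters(signals, cutoffs, "LMH")
print(epi_string)
```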

    3. Searching/Clustering for epigenetic signatures: either by using conventional profiling signals or by using epi-letter representation as follows
        3.1 Scanning for the epigenetic patterns

        perl epimotif_scanning.pl -f <multiple_epigenome_filename> (currently only column patterns are supported)

        Example: perl epimotif_scanning.pl -f example.epi

        It will create a file with the extension ".cols" that lists all column patterns, and a file with the extension ".cols.freq" that gives the frequency of appearance of each pattern.

        3.2 Use R to make a file of unique column patterns. Removing the repeated patterns makes the computation of Hamming distances between patterns more efficient. For example:
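        The sketch below illustrates what a column pattern and its frequency are. It assumes, for illustration only, that the multiple epigenome consists of one epi-letter string per mark, all of the same length, so that a column pattern is the string of letters that all marks carry at one tile. The function name column_patterns and the toy strings are hypothetical; the actual file layout used by epimotif_scanning.pl may differ.

```python
from collections import Counter

def column_patterns(mark_strings):
    """Return one pattern per tile: the letters of all marks at that position."""
    if len({len(s) for s in mark_strings}) != 1:
        raise ValueError("all marks must cover the same number of tiles")
    # zip(*...) transposes rows (marks) into columns (tiles).
    return ["".join(column) for column in zip(*mark_strings)]

# Toy epigenome: 3 marks (rows) over 4 tiles (columns).
strings = ["LLHM", "HHLM", "MMLM"]
cols = column_patterns(strings)
freq = Counter(cols)
print(cols)
print(freq.most_common())
```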

        write.table(unique(read.table("example.epi.cols.freq")), "example.epi.cols.freq.uniq", sep = "\t", quote=F, row.names=F, col.names=F)

        Or use the shell command line as follows:

        sort example.epi.cols.freq | uniq > example.epi.cols.freq.uniq

        example.epi.cols.freq.uniq is the file of unique column patterns. The original pattern file (example.epi.cols) is still necessary for tracing back the corresponding locations in the genome.

        3.3 Computing Hamming distance matrix for clustering

        perl hamming_distance.pl -f <column_pattern_file>

        Example: perl hamming_distance.pl -f example.epi.cols.freq.uniq

        It will output the .hamming file, which can be used for clustering, for example with the k-means method in R in the next step.
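        For readers unfamiliar with the measure, the Hamming distance between two patterns of equal length is the number of positions at which they differ. The following is a minimal sketch of a pairwise distance matrix, not the code of hamming_distance.pl; the function names and the toy patterns are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length patterns differ."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same length")
    return sum(x != y for x, y in zip(a, b))

def hamming_matrix(patterns):
    """Symmetric matrix of pairwise Hamming distances."""
    n = len(patterns)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Distance is symmetric, so compute it once and mirror it.
            matrix[i][j] = matrix[j][i] = hamming(patterns[i], patterns[j])
    return matrix

unique_patterns = ["LHM", "HLL", "MMM"]
dist = hamming_matrix(unique_patterns)
for row in dist:
    print("\t".join(str(d) for d in row))
```

        Because the matrix grows quadratically with the number of patterns, working on unique patterns only (step 3.2) keeps it as small as possible.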

        3.4 Clustering

        R --vanilla < try_clustering.R --f <hamming_distance_file> --u <unique_pattern_file> --c <column_pattern_file> --k <number_of_cluster>

        Example: R --vanilla < try_clustering.R --f example.epi.cols.freq.uniq.hamming --u example.epi.cols.freq.uniq --c example.epi.cols --k 4

        It will output one file per cluster (named cluster_xx, where xx is the cluster_id) consisting of the pattern, coordinates and cluster_id. It also extracts the pattern (the 2nd column in the file .logo) for the logo representation in the next step.
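        As an illustration of clustering patterns by their rows of the Hamming distance matrix, here is a small k-means sketch (Lloyd's algorithm). It is not the code of try_clustering.R, whose internals are not described here. The function names are hypothetical, and the deterministic initialization (the first k distinct rows) is an assumption chosen so that the example is reproducible.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, max_iterations=50):
    """Cluster numeric rows with Lloyd's algorithm; returns one label per row."""
    # Assumption: start from the first k distinct rows (deterministic).
    centers = []
    for row in rows:
        if list(row) not in centers:
            centers.append(list(row))
        if len(centers) == k:
            break
    if len(centers) < k:
        raise ValueError("fewer distinct rows than clusters")
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each row goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(row, centers[c]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hamming distance matrix of the toy patterns LLL, LLM, HHH, HHM.
toy_matrix = [
    [0, 1, 3, 3],
    [1, 0, 3, 2],
    [3, 3, 0, 1],
    [3, 2, 1, 0],
]
labels = kmeans(toy_matrix, k=2)
print(labels)
```

        In this toy case the two L-rich patterns end up in one cluster and the two H-rich patterns in the other.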

    4. Logo representation using the WebLogo 3.2 program (download the source code; you have to unzip the files to use it)
        Example: ./weblogo-3.2/weblogo --format pdf --ylabel '' --show-xaxis no --alphabet 'LMH' --errorbars no --color red H 'High' --color green L 'Low' --color blue M 'Middle' <cluster_1.logo >cluster_1.pdf

        If everything works out, it will produce the logo for the input cluster 1 which looks like cluster_1.pdf.
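        A sequence logo is built from the letters observed at each position across the patterns of a cluster. The sketch below only tabulates those per-position letter counts for a toy cluster; WebLogo itself computes and draws the logo. The function name position_counts and the toy patterns are assumptions made for illustration.

```python
from collections import Counter

def position_counts(patterns, alphabet="LMH"):
    """Count how often each letter occurs at each position of the patterns."""
    counts = []
    for column in zip(*patterns):
        tally = Counter(column)
        counts.append({letter: tally[letter] for letter in alphabet})
    return counts

# Toy cluster of three patterns over three marks.
cluster = ["LHM", "LHH", "MHM"]
table = position_counts(cluster)
for position, row in enumerate(table, start=1):
    print(position, row)
```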

     


    References:

    Dinh HQ, Mittelsten Scheid O, von Haeseler A. Epi-Speller - a bioinformatic tool for epigenomic signature discovery. (submitted)

    Crooks et al., WebLogo: a sequence logo generator. Genome Res. 2004 Jun;14(6):1188-90.

    Roudier et al., Integrative epigenomic mapping defines four main chromatin states in Arabidopsis. EMBO J. 2011 May 18;30(10):1928-38.

    Note:
    • Please, let us know if you download this program by sending an email to {huy.dinh,arndt.von.haeseler}@univie.ac.at