Epi-Speller is a program for analyzing multiple genome-wide epigenomic profiling data sets. It includes:
Unzip the package and compile the epi_letter.cpp program with the following command (run in the same folder):
g++ epi_letter.cpp -o epi_letter
A list of genomic coordinates (e.g. windows, bins, tiles, ...) and the corresponding signals (microarray intensity or number of mapped short reads, with or without normalization) for the chromatin marks, in the following tab-separated format:
- First row: Names of chromatin marks
- Each following row: the genomic coordinate (format start:end), followed by the corresponding signals for the chromatin marks named in the first row.
- Example input file: example.txt consists of 21K tiles from 12 tiling arrays for histone modification marks and DNA methylation in Arabidopsis.
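The input format described above can be read with a few lines of Python. This is only a sketch for checking your own files, not part of the package: the function name read_signal_table and the two mark names in the demo table are made up, and it is an assumption that the header may or may not carry a label above the coordinate column.

```python
import csv
import io

def read_signal_table(handle):
    """Parse the tab-separated input: a header of mark names, then one row per tile."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    marks = None
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        if marks is None:
            # Assumption: drop the first header cell only when the header is
            # as wide as a data row (i.e. it labels the coordinate column).
            marks = header[1:] if len(header) == len(fields) else header
        start, end = (int(x) for x in fields[0].split(":"))
        signals = [float(x) for x in fields[1:]]
        if len(signals) != len(marks):
            raise ValueError(f"expected {len(marks)} signals, got {len(signals)}")
        rows.append(((start, end), signals))
    return (marks if marks is not None else header), rows

# Made-up two-mark table, for illustration only.
demo = "H3K4me3\tH3K27me3\n1000:1500\t0.8\t-1.2\n1500:2000\t2.1\t0.3\n"
marks, rows = read_signal_table(io.StringIO(demo))
print(marks)
print(rows[0])
```

To check a real file, pass an open file handle instead of the io.StringIO demo, e.g. read_signal_table(open("example.txt")).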
perl dictionary_maker.pl -i <example.txt> -k <number_of_epi_letter>
Example run with 3 letters (e.g. Low, Medium, High): perl dictionary_maker.pl -i example.txt -k 3
The program will output a list of files, including the automatically inferred cutoffs for each individual mark (with extension .cutoff).
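To illustrate what the cutoffs are for, the sketch below maps signals to letters given k - 1 cutoffs. How dictionary_maker.pl infers the cutoffs is not described here, so the cutoff values are hypothetical, and so is the choice that a signal equal to a cutoff takes the higher letter.

```python
from bisect import bisect_right

def to_letters(signals, cutoffs, letters):
    """Map each signal to a letter: k letters need k - 1 ascending cutoffs."""
    if len(cutoffs) != len(letters) - 1:
        raise ValueError("need exactly one fewer cutoff than letters")
    if list(cutoffs) != sorted(cutoffs):
        raise ValueError("cutoffs must be ascending")
    # bisect_right counts how many cutoffs are <= the signal, which is the
    # index of the interval (and so of the letter) the signal falls into.
    return "".join(letters[bisect_right(cutoffs, s)] for s in signals)

# Hypothetical cutoffs for one mark with k = 3; real values would come from
# the .cutoff file written by dictionary_maker.pl.
cutoffs = [-0.5, 1.0]
signals = [-1.2, 0.3, 2.1, 1.0]
encoded = to_letters(signals, cutoffs, "LMH")
print(encoded)  # LMHH
```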
R --vanilla < alphabet_chrom.R --f <input file> --k <number_of_epi_letter> --d <dictionary_file> --r <0> --o <multiple_epigenome_filename>
Example: R --vanilla < alphabet_chrom.R --f example.txt --k 3 --d epi_letter.dict --r 0 --o example.epi
Create a text file with the acronyms you want for the epi-letters, one letter per row, and pass it with --d (e.g. epi_letter.dict). The --r parameter controls whether random epi-letter-represented epigenomes are created (0 = no, 1 = yes; default 0).
The script creates the multiple epigenomes for all chromatin marks, in epi-letter representation, in a single file (named by the --o parameter).
It also creates the look-up dictionary (.dict), listing all tiles with their coordinates, signals and assigned letter_ID, and the epi-letter string file (.dna) for each individual mark. The coordinate file (.coor) is created for use in the next step.
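The acronym file passed with --d can be written by hand or with a short script. The sketch below assumes single-character acronyms L, M, H listed from lowest to highest signal; both the acronyms and their order are assumptions, and the helper name write_letter_dictionary is made up.

```python
import os
import tempfile

def write_letter_dictionary(path, acronyms):
    """Write one epi-letter acronym per row, as the --d file is described."""
    with open(path, "w") as out:
        for acronym in acronyms:
            out.write(acronym + "\n")

# Assumption: acronyms L, M, H for Low, Medium, High (k = 3).
# A temporary folder is used so the sketch does not touch real files.
dict_path = os.path.join(tempfile.mkdtemp(), "epi_letter.dict")
write_letter_dictionary(dict_path, ["L", "M", "H"])
with open(dict_path) as handle:
    written = handle.read().splitlines()
print(written)
```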
3.1 Scanning for the epigenetic patterns
perl epimotif_scanning.pl -f <multiple_epigenome_filename> (currently only column patterns are supported)
Example: perl epimotif_scanning.pl -f example.epi
It will create a .cols file listing all column patterns, and a .cols.freq file giving the frequency of appearance of each pattern.
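As a rough picture of what a column pattern is: if each mark is an epi-letter string over the same tiles, reading the stacked strings column by column gives one pattern per tile. The sketch below is not the logic of epimotif_scanning.pl; the letter strings are made up and the layout of the .epi file is not assumed.

```python
from collections import Counter

def column_patterns(mark_strings):
    """Read the stacked per-mark letter strings column by column."""
    lengths = {len(s) for s in mark_strings}
    if len(lengths) > 1:
        raise ValueError("all marks must cover the same number of tiles")
    return ["".join(column) for column in zip(*mark_strings)]

# Made-up letter strings for three marks over five tiles.
mark_strings = ["LLHHM", "HHLLM", "LLHHM"]
cols = column_patterns(mark_strings)
freq = Counter(cols)  # how often each column pattern appears
print(cols)
print(freq.most_common())
```

Here the pattern for the first tile is LHL (the first letter of each mark), and it appears twice.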
3.2 Using R to make a file of unique column patterns (removing repeated patterns makes the computation of Hamming distances between patterns more efficient), for example:
write.table(unique(read.table("example.epi.cols.freq")), "example.epi.cols.freq.uniq", sep = "\t", quote=F, row.names=F, col.names=F)
Or using the shell command line as follows:
sort example.epi.cols.freq | uniq > example.epi.cols.freq.uniq
example.epi.cols.freq.uniq is the file of unique column patterns. The original pattern file (example.epi.cols) is still needed to trace patterns back to their locations in the genome.
3.3 Computing Hamming distance matrix for clustering
perl hamming_distance.pl -f <column_pattern_file>
Example: perl hamming_distance.pl -f example.epi.cols.freq.uniq
It will output the .hamming file, which can be used for clustering in the next step, for example with the k-means method in R.
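For reference, the Hamming distance between two patterns of equal length is the number of positions at which their letters differ. A minimal sketch of the pairwise matrix follows; the patterns are made up and the layout of the .hamming file written by hamming_distance.pl is not assumed.

```python
def hamming(a, b):
    """Number of positions at which two equal-length patterns differ."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same length")
    return sum(x != y for x, y in zip(a, b))

def hamming_matrix(patterns):
    """Symmetric matrix of pairwise Hamming distances."""
    n = len(patterns)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = hamming(patterns[i], patterns[j])
    return matrix

# Made-up unique column patterns over three marks.
patterns = ["LHL", "HLH", "MMM", "LHM"]
dist = hamming_matrix(patterns)
for row in dist:
    print(row)
```

The work grows with the square of the number of patterns, which is why repeated patterns are removed before this step.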
3.4 Clustering
R --vanilla < try_clustering.R --f <hamming_distance_file> --u <unique_pattern_file> --c <column_pattern_file> --k <number_of_cluster>
Example: R --vanilla < try_clustering.R --f example.epi.cols.freq.uniq.hamming --u example.epi.cols.freq.uniq --c example.epi.cols --k 4
It will output one file per cluster (named cluster_xx, where xx is the cluster_id) consisting of the patterns, coordinates and cluster_id. It also extracts the patterns (the 2nd column in the .logo file) for the logo representation in the next step.
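The idea of clustering patterns by their rows of the Hamming distance matrix can be sketched as below. This is a plain Lloyd-style k-means with a fixed start, written for illustration; it is not the code of try_clustering.R, whose exact method and initialization are not described here. The distance matrix is hard-coded for the made-up patterns LLL, LLM, HHH and HHM.

```python
def kmeans(points, k, iterations=100):
    """Plain Lloyd-style k-means; returns one cluster id per point."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Deterministic start (an assumption): the first k points are the centers.
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assign each point to the nearest center (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hamming distance matrix for the made-up patterns LLL, LLM, HHH, HHM.
dist = [
    [0, 1, 3, 3],
    [1, 0, 3, 2],
    [3, 3, 0, 1],
    [3, 2, 1, 0],
]
labels = kmeans(dist, k=2)
print(labels)
```

With this input the two L-rich patterns end up in one cluster and the two H-rich patterns in the other. Real runs usually use several random starts, since k-means results depend on the initial centers.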
Example: ./weblogo-3.2/weblogo --format pdf --ylabel '' --show-xaxis no --alphabet 'LMH' --errorbars no --color red H 'High' --color green L 'Low' --color blue M 'Middle' <cluster_1.logo >cluster_1.pdf
If everything works, it will produce the logo for the input cluster 1, which should look like cluster_1.pdf.
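A logo summarizes, for each position in the pattern, how often each letter occurs among the patterns of a cluster. The sketch below computes only these per-position frequencies for made-up patterns; WebLogo additionally scales each stack (by information content, by default), which is not reproduced here.

```python
from collections import Counter

def position_frequencies(patterns, alphabet="LMH"):
    """Relative frequency of each letter at each position of the patterns."""
    if not patterns:
        return []
    width = len(patterns[0])
    if any(len(p) != width for p in patterns):
        raise ValueError("patterns must have the same length")
    table = []
    for column in zip(*patterns):
        counts = Counter(column)
        table.append({letter: counts[letter] / len(patterns) for letter in alphabet})
    return table

# Made-up patterns from one cluster (one letter per chromatin mark).
cluster_patterns = ["LHL", "LHM", "LHL", "MHL"]
freqs = position_frequencies(cluster_patterns)
for position, row in enumerate(freqs, start=1):
    print(position, row)
```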
Dinh HQ, Mittelsten Scheid O, von Haeseler A. Epi-Speller - a bioinformatic tool for epigenomic signature discovery. (submitted)
Crooks et al., WebLogo: a sequence logo generator. Genome Res. 2004 Jun;14(6):1188-90.
Roudier et al., Integrative epigenomic mapping defines four main chromatin states in Arabidopsis. EMBO J. 2011 May 18;30(10):1928-38.