Seeder is an exact discriminative seeding DNA motif discovery algorithm designed for fast and reliable prediction of cis-regulatory elements in eukaryotic promoters.
The algorithm starts by enumerating all words of a given length. For each word, it calculates the Hamming distance (HD) between the word and its best matching subsequence in each sequence of a background set; we call this distance the substring minimal distance (SMD). These distances are used to produce a word-specific background probability distribution for the SMD. For each word, the algorithm then calculates the sum of SMDs to the sequences in a positive set. The P-value for this sum is calculated using the word-specific background probability distribution. The word with the minimal P-value is retained, and a seed PWM is built from the closest matches to this word found in every positive sequence. The seed PWM is extended to full motif width, and the sites maximizing the score to the extended PWM are selected, one in each positive sequence. A new PWM is built from those sites, and the process is iterated until convergence or until a maximum number of iterations is reached.
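The first steps described above can be illustrated with a minimal Python sketch. It is not Seeder's own code: the function names (`hamming`, `smd`, `smd_sum`, `closest_match`, `seed_counts`, `enumerate_words`) are hypothetical, the scan is a naive sliding window, every sequence is assumed to be at least as long as the word, and the seed matrix is kept as raw base counts without normalization or pseudocounts.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def smd(word, sequence):
    """Substring minimal distance: smallest Hamming distance between
    `word` and any window of the same length in `sequence`."""
    k = len(word)
    return min(hamming(word, sequence[i:i + k])
               for i in range(len(sequence) - k + 1))


def smd_sum(word, sequences):
    """Sum of SMDs from `word` to every sequence in a set."""
    return sum(smd(word, s) for s in sequences)


def closest_match(word, sequence):
    """First window of `sequence` at minimal Hamming distance from `word`."""
    k = len(word)
    windows = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return min(windows, key=lambda w: hamming(word, w))


def seed_counts(word, positives, alphabet="ACGT"):
    """Per-position base counts over the closest match in each positive
    sequence (an unnormalized stand-in for the seed PWM)."""
    sites = [closest_match(word, s) for s in positives]
    return [{b: sum(site[i] == b for site in sites) for b in alphabet}
            for i in range(len(word))]


def enumerate_words(k, alphabet="ACGT"):
    """Lazily yield all words of length k over the alphabet."""
    return ("".join(p) for p in product(alphabet, repeat=k))
```

For example, `smd("ACG", "TTACGTT")` is 0 because the word occurs exactly, while `smd("AAA", "CCCAC")` is 2. Enumerating every word guarantees that no candidate seed is missed, at the cost of scanning 4^k words for word length k.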
Key features of the algorithm:
• Optimal seed selection, guaranteed by exhaustive enumeration;
• A background model based on empirical distribution of SMDs;
• Efficient data structures that make computations relatively fast;
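
One way to turn an empirical SMD distribution into a P-value for a sum of SMDs is sketched below. This is an illustration under stated assumptions, not a description of Seeder's internals: it assumes the SMDs to different positive sequences are independent and identically distributed according to the background distribution, obtains the distribution of their sum by repeated convolution, and takes the P-value as the lower tail (small sums indicate unusually close matches). The function names are hypothetical.

```python
from collections import Counter


def smd_distribution(smds, k):
    """Empirical probabilities P(SMD = d) for d = 0..k, estimated from
    the SMDs of one word to the background sequences."""
    counts = Counter(smds)
    n = len(smds)
    return [counts.get(d, 0) / n for d in range(k + 1)]


def sum_distribution(dist, n):
    """Distribution of the sum of n independent SMDs drawn from `dist`,
    computed by convolving `dist` with itself n times."""
    total = [1.0]  # a sum of zero terms equals 0 with probability 1
    for _ in range(n):
        new = [0.0] * (len(total) + len(dist) - 1)
        for i, p in enumerate(total):
            for j, q in enumerate(dist):
                new[i + j] += p * q
        total = new
    return total


def p_value(observed_sum, dist, n):
    """Lower-tail probability P(sum <= observed_sum) under the background."""
    return sum(sum_distribution(dist, n)[:observed_sum + 1])
```

With background SMDs `[0, 1, 1, 2]` for a word of length 2, `smd_distribution` gives probabilities 0.25, 0.5 and 0.25 for distances 0, 1 and 2. For two positive sequences, `p_value(1, dist, 2)` is then 0.3125, the probability that the two SMDs sum to 0 or 1.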
