DNA sequences, Computer Engineering

The dataset provided in this assignment contains a collection of real DNA sequences. The number of true binding sites is quite limited, which makes the problem challenging. In the machine learning community, this is known as an imbalanced dataset. Some techniques for imbalanced data classification, such as sampling or filtering, can be applied to the biological data. It is a good idea to look for relevant publications to see how you can build effective classifiers for motif recognition.
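As one illustration of the sampling idea mentioned above, the sketch below randomly undersamples the majority (non-binding) class. It is only a minimal example under stated assumptions, not the method the assignment prescribes: the function name `undersample`, the `ratio` parameter, and the 1/0 label encoding are all hypothetical choices made for this example.

```python
import random

def undersample(examples, labels, ratio=1.0, seed=0):
    """Randomly drop negatives so that #negatives <= ratio * #positives.

    Assumes labels are 1 (binding site) or 0 (non-binding site);
    this encoding is an assumption, not something the assignment fixes.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Keep every positive and a random subset of the negatives.
    n_keep = min(len(neg), int(ratio * len(pos)))
    keep = sorted(pos + rng.sample(neg, n_keep))
    return [examples[i] for i in keep], [labels[i] for i in keep]

# Toy data: 2 positives among 10 kmers (placeholder strings, not real data).
kmers = ["k%d" % i for i in range(10)]
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
bal_kmers, bal_labels = undersample(kmers, labels)
print(bal_kmers, bal_labels)
```

Resampling of this kind would normally be applied to the training portion only, so that the test data keeps the original class distribution.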

The whole dataset should be partitioned into a training dataset, used to build the learner models, and a testing dataset, used to evaluate the generalization capability of the classification systems. System performance will be evaluated by looking at the recall, precision, F-measure and recognition rate on both the training dataset and the testing dataset.
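The four measures named above can be computed from the counts of true and false positives and negatives. The sketch below is one possible way to do it; it assumes that "recognition rate" means overall accuracy and that the F-measure is the balanced F1 score, neither of which the assignment text states explicitly. The function name `evaluate` is hypothetical.

```python
def evaluate(y_true, y_pred):
    """Return recall, precision, F-measure and recognition rate.

    Labels are assumed to be 1 (binding site) or 0 (non-binding site).
    "Recognition rate" is taken to mean accuracy (an assumption).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    rate = (tp + tn) / len(pairs) if pairs else 0.0
    return {"recall": recall, "precision": precision,
            "f_measure": f_measure, "recognition_rate": rate}

# Toy imbalanced example: 2 positives out of 10 (made-up labels).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = evaluate(y_true, y_pred)
print(scores)
```

On imbalanced data the recognition rate can look high even when half the binding sites are missed, which is why recall, precision and F-measure are reported alongside it.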

It is very important to note that, unlike the traditional way of evaluating a classifier's performance, here a kmer is classified as a motif instance if its location has at least 50% overlap with a true binding site in the DNA sequence. For example, consider the two true binding sites, both ACACGGGA (shown in uppercase), in the following DNA sequence.

ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa (1)

Suppose that the 8mers acaaACAC and ACGGGAtc are classified as binding sites by a learner model. We then count them as correct predictions because they have 50% and 75% overlap with the true binding sites in sequence (1), respectively. Conversely, if a classifier labels them as non-binding sites, we count them as incorrect predictions because they have at least 50% overlap with the true binding sites. Take another 8mer, GAgaatta, in (1). If a learner model classifies it as a binding site, it is counted as a misclassification because it has only 25% overlap with the true binding site ACACGGGA.
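The overlap rule can be checked against sequence (1) with a short script. This is a minimal sketch under two assumptions: true binding sites are marked by uppercase runs, as in sequence (1), and the overlap percentage is measured relative to the kmer length (here the kmer and the site are both 8 bases long, so the choice of denominator does not matter). The function names are hypothetical.

```python
import re

def true_sites(seq):
    """Half-open (start, end) spans of uppercase runs, assumed to be true sites."""
    return [m.span() for m in re.finditer(r"[A-Z]+", seq)]

def overlap_fraction(start, k, sites):
    """Largest overlap of the kmer at `start` with any true site, as a fraction of k."""
    end = start + k
    best = 0
    for s, e in sites:
        # Disjoint intervals give a negative value, which max() discards.
        best = max(best, min(end, e) - max(start, s))
    return best / k

def is_motif_instance(start, k, sites, threshold=0.5):
    """Apply the 'at least 50% overlap' rule to decide the true label."""
    return overlap_fraction(start, k, sites) >= threshold

SEQ = "ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa"
SITES = true_sites(SEQ)

# The three 8mers discussed above, given by their 0-based start positions.
for start in (6, 28, 16):
    kmer = SEQ[start:start + 8]
    print(kmer, overlap_fraction(start, 8, SITES),
          is_motif_instance(start, 8, SITES))
```

The loop prints the overlap fraction and the resulting label for acaaACAC, ACGGGAtc and GAgaatta, which can be compared with the 50%, 75% and 25% figures given in the text.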

Posted Date: 3/29/2013 5:36:08 AM | Location : United States






