DNA sequences, Computer Engineering

The dataset provided in this assignment contains a collection of real DNA sequences. The number of true binding sites is quite limited, which makes the problem challenging. In the machine learning community, this is known as the imbalanced dataset problem. Techniques for imbalanced data classification, such as sampling or filtering, can be applied to the biological data. It is a good idea to find some relevant publications to see how you can build effective classifiers for motif recognition.
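As one illustration of the sampling idea, the sketch below randomly undersamples the majority (non-binding) class. It is a minimal example, not a method prescribed by the assignment: the function name `undersample`, the `ratio` parameter, and the 0/1 label encoding are assumptions made for this illustration.

```python
import random

def undersample(samples, labels, ratio=1.0, seed=0):
    """Randomly drop negatives so that at most ratio * (#positives) remain.

    Assumption for this sketch: label 1 marks a binding-site k-mer,
    label 0 marks a non-binding k-mer.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(ratio * len(pos)))
    # Keep every positive plus a random subset of negatives, in original order.
    kept = sorted(pos + rng.sample(neg, n_keep))
    return [samples[i] for i in kept], [labels[i] for i in kept]
```

Undersampling should be applied to the training partition only, so that the test partition keeps its original class distribution.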

The whole dataset should be partitioned into a training dataset, used to build the learner models, and a testing dataset, used to evaluate the generalization capability of the classification systems. System performance will be evaluated by looking at the recall, precision, F-measure and recognition rate on both the training dataset and the testing dataset.
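The four measures can be computed from the counts of true and false positives and negatives. The sketch below assumes 0/1 labels and assumes that "recognition rate" means overall accuracy (the fraction of k-mers labeled correctly); the function name `evaluate` is hypothetical.

```python
def evaluate(y_true, y_pred):
    """Return precision, recall, F-measure and recognition rate for 0/1 labels.

    Assumption: recognition rate is taken to be overall accuracy.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    rate = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "recognition_rate": rate}
```

On an imbalanced dataset the recognition rate alone can look high even when few binding sites are found, which is why precision, recall and F-measure are reported alongside it.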

It is very important to note that, unlike the traditional way of evaluating a classifier's performance, here a k-mer is counted as a motif instance if its location overlaps a true binding site in the DNA sequence by at least 50%. For example, consider the two true binding sites, both ACACGGGA, shown in uppercase in the following DNA sequence.

ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa (1)

Suppose that the 8-mers acaaACAC and ACGGGAtc are classified as binding sites by a learner model. Then we count them as correct predictions, because they have 50% and 75% overlap with the first and second true binding sites in sequence (1), respectively. Conversely, if a classifier labels them as non-binding sites, we count them as incorrect predictions, because each has at least 50% overlap with a true binding site. Take another 8-mer, GAgaatta, in (1). If a learner model classifies it as a binding site, it is counted as a misclassification, because it has only 25% overlap with the true binding site ACACGGGA.
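The overlap rule can be checked directly against sequence (1). The sketch below is a minimal illustration: the function names are hypothetical, positions are 0-based, and the overlap is measured as a fraction of the k-mer length (an assumption that makes no difference here, since the k-mers and the binding sites are all 8 bases long).

```python
def overlap_fraction(start, k, sites):
    """Largest fraction of the k-mer [start, start + k) covered by one true site.

    sites is a list of (site_start, site_length) pairs, 0-based.
    """
    best = 0
    for site_start, site_len in sites:
        shared = min(start + k, site_start + site_len) - max(start, site_start)
        best = max(best, shared)  # negative values mean no overlap
    return best / k

def is_motif_instance(start, k, sites, threshold=0.5):
    """Apply the assignment's rule: at least 50% overlap counts as a motif."""
    return overlap_fraction(start, k, sites) >= threshold

seq = "ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa"
sites = [(10, 8), (26, 8)]  # the two uppercase ACACGGGA occurrences

for kmer in ("acaaACAC", "ACGGGAtc", "GAgaatta"):
    pos = seq.find(kmer)  # case-sensitive, so each k-mer matches exactly once
    print(kmer, overlap_fraction(pos, 8, sites), is_motif_instance(pos, 8, sites))
```

The three k-mers give overlap fractions of 0.5, 0.75 and 0.25, matching the percentages in the example, so only the first two are treated as true motif instances.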

Posted Date: 3/29/2013 5:36:08 AM | Location : United States
