DNA sequences, Computer Engineering

The dataset provided in this assignment contains a collection of real DNA sequences. The number of true binding sites is quite limited, which makes the problem challenging. In the machine learning community, such data is called an imbalanced dataset. Techniques for imbalanced data classification, such as sampling or filtering, can be applied to the biological data. It is a good idea to find relevant publications to see how you can build effective classifiers for motif recognition.
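As one illustration of the sampling idea mentioned above, here is a minimal sketch of random undersampling of the majority (non-binding) class. The assignment does not prescribe a particular technique; the function name `undersample`, the `ratio` parameter, and the 0/1 label encoding (1 = binding site) are assumptions made for this example.

```python
import random

def undersample(examples, labels, ratio=1.0, seed=0):
    """Keep every positive example and a random subset of negatives.

    ratio is the number of negatives kept per positive (assumed parameter).
    Labels are assumed to be 1 for binding sites and 0 otherwise.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Never ask for more negatives than are available.
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    keep = sorted(pos + rng.sample(neg, n_keep))
    return [examples[i] for i in keep], [labels[i] for i in keep]
```

Resampling of this kind is normally applied to the training data only, so that the test data keeps its original class distribution.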

The whole dataset should be partitioned into a training dataset, used to build the learner models, and a testing dataset, used to evaluate the generalization capability of the classification systems. System performance will be evaluated using recall, precision, F-measure and recognition rate on both the training dataset and the testing dataset.
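The four measures can be computed from the counts of true/false positives and negatives. The following is a minimal sketch, assuming labels are encoded as 1 (binding site) and 0 (non-binding site), that "recognition rate" means overall accuracy, and that F-measure means the balanced F1 score; the function name `evaluate` is hypothetical. The true labels are assumed to have been derived with the overlap rule this assignment describes.

```python
def evaluate(y_true, y_pred):
    """Return precision, recall, F-measure and recognition rate (accuracy)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Guard against empty denominators (e.g. no positive predictions).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "recognition_rate": accuracy}
```

On an imbalanced dataset the recognition rate alone can look high even when few binding sites are found, which is why precision, recall and F-measure are reported alongside it.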

It is very important to notice that, unlike the traditional way of evaluating a classifier's performance, here a kmer is counted as a motif instance if its location has at least 50% overlap with a true binding site in the DNA sequences. For example, consider the two occurrences of the true binding site ACACGGGA (shown in upper case) in the following DNA sequence.

ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa (1)

Suppose that the 8mers acaaACAC and ACGGGAtc are classified as binding sites by a learner model. Then we count them as correct predictions, because they have 50% and 75% overlap with the true binding sites in sequence (1), respectively. Conversely, if a classifier labels them as non-binding sites, we count them as incorrect predictions, because they have at least 50% overlap with the true binding sites. Take another 8mer, GAgaatta, in (1). If a learner model classifies it as a binding site, it is counted as misclassified, because it has only 25% overlap with the true binding site ACACGGGA.
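The overlap rule can be checked mechanically. Below is a minimal sketch, assuming that the upper-case runs in sequence (1) mark the true binding sites and that overlap is measured as a fraction of the kmer length (here kmer and site are both 8 long, so the choice of denominator does not matter). The function names are hypothetical.

```python
SEQ = "ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa"  # sequence (1)

def find_sites(seq):
    """Return half-open (start, end) intervals of upper-case runs.

    Assumption: upper case marks a true binding site, as in sequence (1).
    """
    sites, start = [], None
    for i, ch in enumerate(seq):
        if ch.isupper() and start is None:
            start = i
        elif not ch.isupper() and start is not None:
            sites.append((start, i))
            start = None
    if start is not None:
        sites.append((start, len(seq)))
    return sites

def overlap_fraction(kmer_start, k, sites):
    """Largest overlap of the kmer with any site, as a fraction of k."""
    kmer_end = kmer_start + k
    best = 0
    for s, e in sites:
        best = max(best, min(kmer_end, e) - max(kmer_start, s))
    return best / k

def is_motif_instance(kmer_start, k, sites, threshold=0.5):
    """True if the kmer has at least `threshold` overlap with a true site."""
    return overlap_fraction(kmer_start, k, sites) >= threshold

sites = find_sites(SEQ)
for kmer in ("acaaACAC", "ACGGGAtc", "GAgaatta"):
    pos = SEQ.find(kmer)  # first case-sensitive occurrence
    print(kmer, overlap_fraction(pos, len(kmer), sites),
          is_motif_instance(pos, len(kmer), sites))
```

For the three 8mers discussed above this gives overlaps of 0.5, 0.75 and 0.25, so the first two are treated as motif instances and the third is not.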

Posted Date: 3/29/2013 5:36:08 AM | Location : United States






