DNA sequences, Computer Engineering

The dataset provided in this assignment contains a collection of real DNA sequences. The number of true binding sites is quite limited, which makes the problem challenging. In the machine learning community, this is known as the imbalanced dataset problem. Techniques for imbalanced data classification, such as sampling or filtering, can be applied to the biological data. It is a good idea to consult relevant publications to see how effective classifiers for motif recognition can be built.
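As one illustration of the sampling idea, here is a minimal sketch of random undersampling of the majority (non-binding) class. The function name `undersample`, the `ratio` parameter and the toy kmers and labels are assumptions made for this example and are not part of the assignment. Resampling of this kind would normally be applied to the training partition only.

```python
import random

def undersample(kmers, labels, ratio=1.0, seed=0):
    """Keep every positive example and a random subset of the negatives.

    ratio is the number of negatives kept per positive (hypothetical parameter).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(ratio * len(pos)))
    keep = sorted(pos + rng.sample(neg, n_keep))
    return [kmers[i] for i in keep], [labels[i] for i in keep]

# Toy data for illustration only: one binding site among eight 8mers.
kmers = ["ACACGGGA", "ccttacac", "cttacaca", "ttacacaa",
         "gaattaat", "tcagatca", "cagatcaa", "agatcaat"]
labels = [1, 0, 0, 0, 0, 0, 0, 0]

balanced_kmers, balanced_labels = undersample(kmers, labels)
print(balanced_kmers, balanced_labels)
```

With `ratio=1.0` the result holds as many negatives as positives. A larger ratio keeps more of the original negatives at the cost of a less balanced set.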

The whole dataset should be partitioned into a training dataset, used to build the learner models, and a testing dataset, used to evaluate the generalization capability of the classification systems. System performance will be evaluated by the recall, precision, F-measure and recognition rate on both the training dataset and the test dataset.
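The four measures can be computed from the counts of true and false positives and negatives. The sketch below assumes that "recognition rate" means overall accuracy and that the F-measure is the balanced F1 score. Both are assumptions, since the assignment text does not define them. The function name `evaluate` and the example labels are also made up for illustration.

```python
def evaluate(y_true, y_pred):
    """Return precision, recall, F-measure and recognition rate for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Assumption: F-measure is F1, the harmonic mean of precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    # Assumption: recognition rate is the fraction of all kmers labelled correctly.
    recognition_rate = (tp + tn) / len(y_true) if y_true else 0.0
    return precision, recall, f_measure, recognition_rate

# Made-up labels: 1 = binding site, 0 = non-binding site.
y_true = [1, 1, 0, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0, 0, 1]
precision, recall, f_measure, recognition_rate = evaluate(y_true, y_pred)
print(precision, recall, f_measure, recognition_rate)
```

On imbalanced data the recognition rate can look high even when few binding sites are found, which is why precision, recall and F-measure are reported alongside it.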

It is very important to note that, unlike the traditional way of evaluating a classifier's performance, here a kmer is classified as a motif instance if its location has at least 50% overlap with a true binding site in the DNA sequences. For example, consider the two true binding sites, both ACACGGGA, in the following DNA sequence.

ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa (1)

Suppose that the 8mers acaaACAC and ACGGGAtc are classified as binding sites by a learner model. We then count them as correct predictions, because they have 50% and 75% overlap with the true binding sites in sequence (1), respectively. Conversely, if a classifier labels them as non-binding sites, we count them as incorrect predictions, because they have at least 50% overlap with the true binding sites. Take another 8mer, GAgaatta, in (1). If a learner model classifies it as a binding site, it is counted as a misclassification, because it has only 25% overlap with the true binding site ACACGGGA.
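The overlap rule can be checked with a short sketch on sequence (1). The helper names `overlap_fraction` and `is_motif_instance` are hypothetical. The overlap is measured here as a fraction of the kmer length, which is an assumption. In this example the kmers and the binding sites are both 8 bases long, so measuring against the site length gives the same values.

```python
def overlap_fraction(kmer_start, k, site_start, site_len):
    """Fraction of the kmer's positions that lie inside one true binding site."""
    kmer_end = kmer_start + k
    site_end = site_start + site_len
    shared = max(0, min(kmer_end, site_end) - max(kmer_start, site_start))
    return shared / k

def is_motif_instance(kmer_start, k, sites, threshold=0.5):
    """True if the kmer overlaps any true site by at least the threshold."""
    return any(overlap_fraction(kmer_start, k, s, n) >= threshold
               for s, n in sites)

sequence = "ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa"
site = "ACACGGGA"

# Locate every occurrence of the true binding site (case-sensitive search).
sites = []
pos = sequence.find(site)
while pos != -1:
    sites.append((pos, len(site)))
    pos = sequence.find(site, pos + 1)

# Best overlap of each example 8mer with any true site.
results = {}
for kmer in ["acaaACAC", "ACGGGAtc", "GAgaatta"]:
    start = sequence.find(kmer)
    best = max(overlap_fraction(start, len(kmer), s, n) for s, n in sites)
    results[kmer] = (best, is_motif_instance(start, len(kmer), sites))

print(results)
```

The three 8mers come out at 0.5, 0.75 and 0.25 overlap, so only the first two count as motif instances under the 50% rule.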

Posted Date: 3/29/2013 5:36:08 AM | Location : United States






