DNA sequences, Computer Engineering

The dataset provided in this assignment contains a collection of real DNA sequences. The number of true binding sites is quite limited, which makes the problem challenging. In the machine learning community, this is known as an imbalanced dataset. Techniques for imbalanced data classification, such as sampling or filtering, can be applied to the biological data. It is a good idea to look for relevant publications to see how effective classifiers for motif recognition can be built.
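As one illustration of the sampling idea, here is a minimal sketch of random undersampling of the majority (non-binding) class. The function name `undersample`, the 0/1 label encoding, and the `ratio` parameter are assumptions made for this example; the assignment does not prescribe a particular sampling scheme.

```python
import random

def undersample(examples, labels, ratio=1.0, seed=0):
    """Randomly drop negative examples so that at most
    ratio * (number of positives) negatives remain.

    Assumes labels are 1 (binding site) or 0 (non-binding site).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(ratio * len(pos)))
    # Keep every positive plus a random subset of the negatives,
    # preserving the original order of the examples.
    kept = sorted(pos + rng.sample(neg, n_keep))
    return [examples[i] for i in kept], [labels[i] for i in kept]
```

Resampling of this kind would normally be applied to the training data only, so that the test data keeps the original class distribution.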

The whole dataset should be partitioned into a training dataset, used to build the learner models, and a testing dataset, used to evaluate the generalization capability of the classification systems. System performance will be evaluated by the recall, precision, F-measure and recognition rate on both the training dataset and the test dataset.
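The four measures can be computed from true and predicted labels roughly as follows. This is a sketch under two assumptions that the assignment text does not state: labels are encoded as 1 (binding site) and 0 (non-binding site), and "recognition rate" means overall accuracy. The function name `evaluate` is hypothetical.

```python
def evaluate(y_true, y_pred):
    """Return precision, recall, F-measure and recognition rate
    (assumed here to mean accuracy) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    rate = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "recognition_rate": rate}
```

On an imbalanced dataset the recognition rate alone can look high even when few binding sites are found, which is why precision, recall and F-measure are reported alongside it.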

It is very important to note that, unlike the traditional way of evaluating a classifier's performance, a kmer is classified here as a motif instance if its location overlaps a true binding site in the DNA sequence by at least 50%. For example, consider the two true binding sites, both ACACGGGA, in the following DNA sequence.

ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa (1)

Suppose that a learner model classifies the 8mers acaaACAC and ACGGGAtc as binding sites. Both count as correct predictions, because they overlap the true binding sites in sequence (1) by 50% and 75%, respectively. Conversely, if a classifier labels them as non-binding sites, both count as incorrect predictions, because each overlaps a true binding site by at least 50%. Now take another 8mer in (1), GAgaatta. If a learner model classifies it as a binding site, it counts as a misclassification, because it overlaps the true binding site ACACGGGA by only 25%.
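The overlap rule can be checked against sequence (1) with a short sketch. The helper names `overlap_fraction` and `is_motif_instance` are hypothetical, positions are 0-based, and the overlap is assumed to be measured as a fraction of the kmer's length (here the kmer and the binding sites are all 8 long, so the choice of denominator does not matter).

```python
SEQ = "ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa"
# True binding sites as (start, length), 0-based; both are ACACGGGA.
SITES = [(10, 8), (26, 8)]

def overlap_fraction(start, k, site_start, site_len):
    """Fraction of the kmer's k positions that lie inside one binding site."""
    shared = min(start + k, site_start + site_len) - max(start, site_start)
    return max(0, shared) / k

def is_motif_instance(start, k, sites, threshold=0.5):
    """True if the kmer overlaps any true binding site by at least threshold."""
    return any(overlap_fraction(start, k, s, n) >= threshold for s, n in sites)

for kmer in ("acaaACAC", "ACGGGAtc", "GAgaatta"):
    start = SEQ.find(kmer)  # first occurrence of the kmer in sequence (1)
    best = max(overlap_fraction(start, len(kmer), s, n) for s, n in SITES)
    print(kmer, start, best, is_motif_instance(start, len(kmer), SITES))
```

Under these assumptions the three 8mers from the example have overlaps of 0.5, 0.75 and 0.25, so only the first two are treated as motif instances.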

Posted Date: 3/29/2013 5:36:08 AM | Location : United States






