DNA sequences, Computer Engineering

The dataset provided in this assignment contains a collection of real DNA sequences. The number of true binding sites is quite limited, which makes the problem challenging. In the machine learning community, this is known as an imbalanced dataset. Techniques for imbalanced data classification, such as sampling or filtering, can be applied to the biological data. It is a good idea to consult relevant publications to see how effective classifiers for motif recognition can be built.
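As one illustration of the sampling idea mentioned above, the sketch below randomly undersamples the majority (non-binding) class. The function name `undersample`, the `ratio` parameter and the 0/1 label encoding are assumptions made for this example; the assignment does not prescribe a particular technique.

```python
import random

def undersample(examples, labels, ratio=1.0, seed=0):
    """Keep every positive and roughly `ratio` negatives per positive.

    Assumes labels are 1 for binding sites and 0 for everything else.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    # Never ask for more negatives than are available.
    n_keep = min(len(negatives), int(round(ratio * len(positives))))
    kept = positives + rng.sample(negatives, n_keep)
    rng.shuffle(kept)
    return [examples[i] for i in kept], [labels[i] for i in kept]

# Toy data: 3 positive kmers against 20 negative ones.
kmers = ["ACACGGGA"] * 3 + ["ccttacac"] * 20
labels = [1] * 3 + [0] * 20
balanced_kmers, balanced_labels = undersample(kmers, labels)
```

Resampling of this kind is normally applied to the training data only, so that the testing data keeps its original class distribution.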

The whole dataset should be partitioned into a training dataset, used to build the learner models, and a testing dataset, used to evaluate the generalization capability of the classification systems. System performance will be evaluated by the recall, precision, F-measure and recognition rate on both the training dataset and the testing dataset.
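The four measures can be computed from the counts of true positives, false positives, false negatives and true negatives. The sketch below assumes that "recognition rate" means overall accuracy and that "F-measure" means the balanced F1 score; both readings are assumptions, and the function name `evaluate` is hypothetical.

```python
def evaluate(tp, fp, fn, tn):
    """Return precision, recall, F-measure and recognition rate from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Balanced F-measure (F1): harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    total = tp + fp + fn + tn
    # Assumed here to be plain accuracy over all classified kmers.
    recognition_rate = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "recognition_rate": recognition_rate,
    }

# Made-up counts, for illustration only.
scores = evaluate(tp=8, fp=2, fn=4, tn=86)
```

With heavily imbalanced data the recognition rate can be high even when few binding sites are found, which is why precision, recall and F-measure are reported alongside it.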

It is very important to note that, unlike the traditional way of evaluating a classifier's performance, here a kmer is classified as a motif instance if its location has at least 50% overlap with a true binding site in the DNA sequence. For example, consider the two true binding sites, both ACACGGGA, in the following DNA sequence.

ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa (1)

Suppose that the 8mers acaaACAC and ACGGGAtc are classified as binding sites by a learner model. Then we count them as correct predictions because they have 50% and 75% overlap with the true binding sites in sequence (1), respectively. Conversely, if a classifier labels them as non-binding sites, we count them as incorrect predictions because they have at least 50% overlap with the true binding sites. Take another 8mer, GAgaatta, in (1). If it is classified by a learner model as a binding site, it is counted as misclassified because it has only 25% overlap with the true binding site ACACGGGA.
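The overlap rule can be checked against sequence (1) with a short sketch. It assumes that true binding sites are marked in upper case, as in the example, and that overlap is measured as a fraction of the kmer length. The helper names (`true_sites`, `best_overlap`, `outcome`) are hypothetical.

```python
import re

SEQ = "ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa"

def true_sites(seq):
    """Return (start, end) spans of upper-case runs, assumed to be the true sites."""
    return [m.span() for m in re.finditer(r"[A-Z]+", seq)]

def best_overlap(start, k, sites):
    """Largest fraction of the kmer at [start, start + k) covered by any one site."""
    end = start + k
    best = 0
    for s, e in sites:
        best = max(best, min(end, e) - max(start, s))
    return best / k

def outcome(predicted_site, start, k, sites, threshold=0.5):
    """Label one prediction as TP, FP, FN or TN under the overlap rule."""
    actual = best_overlap(start, k, sites) >= threshold
    if predicted_site:
        return "TP" if actual else "FP"
    return "FN" if actual else "TN"

sites = true_sites(SEQ)
for kmer in ("acaaACAC", "ACGGGAtc", "GAgaatta"):
    start = SEQ.find(kmer)  # first occurrence of the kmer in sequence (1)
    print(kmer, best_overlap(start, len(kmer), sites), outcome(True, start, len(kmer), sites))
# acaaACAC 0.5 TP
# ACGGGAtc 0.75 TP
# GAgaatta 0.25 FP
```

The resulting TP, FP, FN and TN counts are what the precision, recall and F-measure calculations would then be based on.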

Posted Date: 3/29/2013 5:36:08 AM | Location : United States






