DNA sequences, Computer Engineering

The dataset provided in this assignment contains a collection of real DNA sequences. The number of true binding sites is quite limited, which makes the problem challenging. In the machine learning community, this is known as an imbalanced dataset. Techniques for imbalanced data classification, such as sampling or filtering, can be applied to biological data. It is a good idea to find relevant publications to see how effective classifiers for motif recognition can be built.
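As one illustration of the sampling idea mentioned above, the following is a minimal sketch of random undersampling of the majority (non-binding) class. The function name `undersample`, the `ratio` parameter, and the 0/1 label encoding are assumptions made for this example, not part of the assignment; other techniques (oversampling, filtering) may suit the data better.

```python
import random

def undersample(samples, labels, ratio=1.0, seed=0):
    """Keep every positive (label 1) and a random subset of negatives (label 0).

    ratio is the number of negatives kept per positive (hypothetical parameter).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Never ask for more negatives than are available.
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    kept = pos + rng.sample(neg, n_keep)
    rng.shuffle(kept)
    return [samples[i] for i in kept], [labels[i] for i in kept]

# Toy data: 5 binding-site kmers among 100 candidates.
toy_samples = list(range(100))
toy_labels = [1] * 5 + [0] * 95
balanced_x, balanced_y = undersample(toy_samples, toy_labels, ratio=1.0)
```

Resampling of this kind would normally be applied to the training dataset only, so that the test dataset keeps the original class distribution.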

The whole dataset should be partitioned into a training dataset, used to build the learner models, and a testing dataset, used to evaluate the generalization capability of the classification systems. System performance will be evaluated using the recall, precision, F-measure and recognition rate on both the training dataset and the test dataset.
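The partitioning and the four measures can be sketched as follows. This is a minimal sketch under stated assumptions: labels are encoded as 1 (binding site) and 0 (non-binding site), "recognition rate" is taken to mean overall accuracy, the F-measure is the balanced F1, and the names `stratified_split` and `evaluate` and the 30% test fraction are hypothetical choices for illustration.

```python
import random

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = int(round(test_fraction * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

def evaluate(y_true, y_pred):
    """Recall, precision, F-measure (F1) and recognition rate (accuracy)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    rate = (tp + tn) / len(pairs) if pairs else 0.0
    return {"recall": recall, "precision": precision,
            "f_measure": f_measure, "recognition_rate": rate}

# Toy example: 10 positives among 100 kmers.
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels, test_fraction=0.3)
scores = evaluate([1, 1, 0, 0, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0, 0, 0])
```

With imbalanced data the recognition rate alone can look high even when few binding sites are found, which is why recall, precision and F-measure are reported alongside it.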

It is very important to note that, unlike the traditional way of evaluating a classifier's performance, here a kmer is classified as a motif instance if its location has at least 50% overlap with a true binding site in the DNA sequences. For example, consider the two occurrences of the true binding site ACACGGGA in the following DNA sequence.

ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa (1)

Suppose that the 8mers acaaACAC and ACGGGAtc are classified as binding sites by a learner model. We then count them as correct predictions, because they have 50% and 75% overlap, respectively, with the true binding sites in sequence (1). Conversely, if a classifier labels them as non-binding sites, we count them as incorrect predictions, because each has at least 50% overlap with a true binding site. Take another 8mer, GAgaatta, in (1). If a learner model classifies it as a binding site, it is counted as misclassified, because it has only 25% overlap with the true binding site ACACGGGA.
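The overlap rule can be checked on sequence (1) with a short sketch. It assumes that true binding sites are marked in uppercase as in the example, that overlap is measured as a fraction of the kmer length (here equal to the site length, 8), and that positions are 0-based; the helper names `find_sites`, `overlap_fraction` and `is_correct` are hypothetical.

```python
SEQ = "ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa"
MOTIF = "ACACGGGA"

def find_sites(seq, motif):
    """Return [start, end) intervals of every exact, case-sensitive match."""
    sites, pos = [], seq.find(motif)
    while pos != -1:
        sites.append((pos, pos + len(motif)))
        pos = seq.find(motif, pos + 1)
    return sites

def overlap_fraction(start, k, sites):
    """Largest fraction of the kmer [start, start + k) covered by one site."""
    end = start + k
    best = 0
    for s, e in sites:
        best = max(best, max(0, min(end, e) - max(start, s)))
    return best / k

def is_correct(predicted_as_site, start, k, sites, threshold=0.5):
    """A prediction is correct when it agrees with the 50%-overlap label."""
    is_instance = overlap_fraction(start, k, sites) >= threshold
    return predicted_as_site == is_instance

SITES = find_sites(SEQ, MOTIF)  # [(10, 18), (26, 34)]
for kmer in ("acaaACAC", "ACGGGAtc", "GAgaatta"):
    start = SEQ.find(kmer)
    print(kmer, overlap_fraction(start, len(kmer), SITES))
# prints 0.5, 0.75 and 0.25 for the three kmers, matching the text
```

Under these assumptions, predicting acaaACAC or ACGGGAtc as a binding site is correct, while predicting GAgaatta as a binding site is not.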

Posted Date: 3/29/2013 5:36:08 AM | Location : United States






