DNA sequences, Computer Engineering

The dataset provided in this assignment contains a collection of real DNA sequences. The number of true binding sites is quite limited, which makes the problem challenging. In the machine learning community, this is known as an imbalanced dataset. Techniques for imbalanced data classification, such as sampling or filtering, can be applied to biological data. It is a good idea to find relevant publications to see how effective classifiers for motif recognition can be built.
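As one illustration of the sampling idea, the sketch below randomly undersamples the majority (non-binding) class. It is a minimal example under stated assumptions, not the method the assignment prescribes: the function name `undersample`, the `ratio` parameter, and the 0/1 label encoding are all hypothetical choices made for this example.

```python
import random

def undersample(samples, labels, ratio=1.0, seed=0):
    """Randomly undersample the negative class (hypothetical helper).

    Assumes labels are 1 for binding sites and 0 for non-binding kmers.
    Keeps every positive and at most ratio * (number of positives) negatives.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(ratio * len(pos)))
    # Sort the kept indices so the original order of the data is preserved.
    kept = sorted(pos + rng.sample(neg, n_keep))
    return [samples[i] for i in kept], [labels[i] for i in kept]

# Toy data: 2 positives among 10 kmers (made up for illustration only).
kmers = ["k%d" % i for i in range(10)]
labels = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
small_kmers, small_labels = undersample(kmers, labels, ratio=1.0)
print(len(small_kmers), sum(small_labels))
```

Undersampling should be applied to the training partition only, so that the test partition still reflects the real class distribution.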

The whole dataset should be partitioned into a training dataset, used to build the learner models, and a testing dataset, used to evaluate the generalization capability of the classification systems. System performance will be evaluated by recall, precision, F-measure and recognition rate on both the training dataset and the test dataset.
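The four measures can be computed from the confusion counts, as in the sketch below. Two assumptions are made here that the assignment text does not state: "recognition rate" is taken to mean overall accuracy, and the F-measure is taken to be the balanced F1 score. The function name `evaluate` and the 0/1 label encoding are hypothetical.

```python
def evaluate(y_true, y_pred):
    """Return recall, precision, F-measure and recognition rate.

    Assumes 1 = binding site, 0 = non-binding site, and that
    "recognition rate" means overall accuracy (an assumption).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    rate = (tp + tn) / len(pairs) if pairs else 0.0
    return {"recall": recall, "precision": precision,
            "f_measure": f_measure, "recognition_rate": rate}

# Toy labels, made up for illustration: TP=1, FN=1, FP=1, TN=2.
scores = evaluate([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
print(scores)
```

On an imbalanced dataset the recognition rate alone can look high even when few binding sites are found, which is why recall, precision and F-measure are reported alongside it.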

It is very important to note that, unlike the traditional way of evaluating a classifier's performance, here a kmer is classified as a motif instance if its location has at least 50% overlap with a true binding site in the DNA sequences. For example, consider the two occurrences of the true binding site ACACGGGA in the following DNA sequence:

ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa (1)

Suppose that the 8mers acaaACAC and ACGGGAtc are classified as binding sites by a learner model. Then we count them as correct predictions, because they have 50% and 75% overlap with the true binding sites in sequence (1), respectively. Conversely, if a classifier labels them as non-binding sites, we count them as incorrect predictions, because they have at least 50% overlap with the true binding sites. Take another 8mer, GAgaatta, in (1). If a learner model classifies it as a binding site, it is counted as misclassified, because it has only 25% overlap with the true binding site ACACGGGA.
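The overlap rule can be checked on sequence (1) with a short sketch. The helper names `find_sites` and `best_overlap` are hypothetical, and the overlap is assumed to be measured as a fraction of the kmer length; the text does not say which length is the denominator, but here the kmers and the binding sites are both 8 bases long, so the choice does not matter.

```python
SEQ = "ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa"  # sequence (1)
SITE = "ACACGGGA"  # true binding site, upper case in the sequence

def find_sites(seq, site):
    """Return half-open (start, end) spans of every exact occurrence of site."""
    spans = []
    i = seq.find(site)
    while i != -1:
        spans.append((i, i + len(site)))
        i = seq.find(site, i + 1)
    return spans

def best_overlap(start, k, sites):
    """Largest fraction of the kmer at [start, start + k) covered by any site."""
    best = 0
    for s, e in sites:
        shared = min(start + k, e) - max(start, s)  # negative if disjoint
        best = max(best, shared)
    return best / k

sites = find_sites(SEQ, SITE)
for kmer in ("acaaACAC", "ACGGGAtc", "GAgaatta"):
    frac = best_overlap(SEQ.find(kmer), len(kmer), sites)
    print(kmer, frac, "motif instance" if frac >= 0.5 else "background")
# acaaACAC 0.5 motif instance
# ACGGGAtc 0.75 motif instance
# GAgaatta 0.25 background
```

In an evaluation loop, the label produced by this rule (overlap of at least 50%) would serve as the ground truth against which each kmer prediction is compared.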

Posted Date: 3/29/2013 5:36:08 AM | Location : United States






