DNA sequences, Computer Engineering

The dataset provided in this assignment contains a collection of real DNA sequences. The number of true binding sites is quite limited, which makes the problem challenging. In the machine learning community, this is known as an imbalanced dataset. Techniques for imbalanced data classification, such as sampling or filtering, can be applied to the biological data. It is a good idea to consult relevant publications to see how effective classifiers for motif recognition can be built.
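As one illustration of the sampling idea, the sketch below performs random undersampling of the majority (non-binding) class. It is only a minimal sketch: the function name `undersample`, the `ratio` parameter and the label convention (1 for a binding site, 0 otherwise) are assumptions made for this example and are not part of the assignment.

```python
import random


def undersample(examples, labels, ratio=1.0, seed=0):
    """Keep every positive example and a random subset of the negatives.

    ratio is the desired number of negatives per positive (an assumed
    knob for this sketch). Labels are assumed to be 1 (binding site)
    or 0 (non-binding site).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Never ask for more negatives than are available.
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    kept = sorted(pos + rng.sample(neg, n_keep))
    return [examples[i] for i in kept], [labels[i] for i in kept]
```

Resampling of this kind would normally be applied to the training portion only, so that the test data keeps its original class balance.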

The whole dataset should be partitioned into a training dataset, used to build the learner models, and a testing dataset, used to evaluate the generalization capability of the classification systems. System performance will be evaluated using recall, precision, F-measure and recognition rate on both the training dataset and the test dataset.
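The four measures can be computed from the counts of true positives, false positives, false negatives and true negatives. The sketch below assumes that "recognition rate" means overall accuracy and that F-measure means the balanced F1 score; both readings are assumptions, and the function name `evaluate` is hypothetical.

```python
def evaluate(tp, fp, fn, tn):
    """Return precision, recall, F-measure and recognition rate.

    Assumptions of this sketch: F-measure is the balanced F1 score and
    recognition rate is overall accuracy. Undefined ratios are reported
    as 0.0 instead of raising ZeroDivisionError.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    total = tp + fp + fn + tn
    recognition_rate = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "recognition_rate": recognition_rate,
    }
```

With so few true binding sites, the recognition rate can be high even when most sites are missed, so precision, recall and F-measure are the more informative numbers here.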

It is very important to note that, unlike the traditional way of evaluating a classifier's performance, here a kmer is classified as a motif instance if its location has at least 50% overlap with a true binding site in the DNA sequences. For example, consider the two true binding sites, both ACACGGGA, shown in upper case in the following DNA sequence.

ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa (1)

Suppose that the 8mers acaaACAC and ACGGGAtc are classified as binding sites by a learner model. We then count them as correct predictions, because they have 50% and 75% overlap, respectively, with the true binding sites in sequence (1). Conversely, if a classifier labels them as non-binding sites, we count them as incorrect predictions, because they have at least 50% overlap with the true binding sites. Take another 8mer, GAgaatta, in (1). If a learner model classifies it as a binding site, it is counted as misclassified, because it has only 25% overlap with the true binding site ACACGGGA.
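The overlap rule can be checked mechanically. The sketch below reads the true binding sites of sequence (1) from its upper-case runs and measures overlap as a fraction of the kmer length; that denominator is an assumption, although in this example kmers and sites are both 8 bases long, so the choice does not matter. The function names are hypothetical.

```python
import re

# Sequence (1); true binding sites are written in upper case.
SEQ = "ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa"


def true_sites(seq):
    """Return half-open (start, end) intervals of the upper-case runs."""
    return [m.span() for m in re.finditer(r"[A-Z]+", seq)]


def best_overlap(start, k, sites):
    """Largest fraction of the kmer [start, start + k) covered by one site."""
    end = start + k
    best = 0
    for s, e in sites:
        # Length of the intersection; negative when the intervals are disjoint.
        best = max(best, min(end, e) - max(start, s))
    return best / k


def is_motif_instance(start, k, sites, threshold=0.5):
    """Ground-truth label for a kmer under the 50% overlap rule."""
    return best_overlap(start, k, sites) >= threshold


sites = true_sites(SEQ)
for start in (6, 28, 16):  # acaaACAC, ACGGGAtc, GAgaatta
    kmer = SEQ[start:start + 8]
    print(kmer, best_overlap(start, 8, sites), is_motif_instance(start, 8, sites))
```

A classifier's output for each kmer would then be compared against `is_motif_instance` rather than against exact site positions.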

Posted Date: 3/29/2013 5:36:08 AM | Location : United States






