DNA sequences, Computer Engineering

The dataset provided in this assignment contains a collection of real DNA sequences. The number of true binding sites is quite limited, which makes the problem challenging. In the machine learning community, this is known as an imbalanced dataset. Techniques for imbalanced data classification, such as sampling or filtering, can be applied to the biological data. It is a good idea to consult relevant publications to see how effective classifiers for motif recognition can be built.
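As one illustration of the sampling idea, the sketch below performs random undersampling of the majority (non-binding) class. It is only a minimal example under stated assumptions: labels are 1 for a binding-site kmer and 0 otherwise, and the function name `undersample` and its `ratio` parameter are hypothetical, not part of the assignment.

```python
import random

def undersample(samples, labels, ratio=1.0, seed=0):
    """Keep every positive and a random subset of the negatives.

    ratio is the number of negatives kept per positive (an assumed knob).
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    kept = sorted(pos + rng.sample(neg, n_keep))  # preserve original order
    return [samples[i] for i in kept], [labels[i] for i in kept]

# Toy data: 5 positives among 100 kmers (placeholder strings, not real data).
kmers = ["kmer%d" % i for i in range(100)]
labels = [1 if i < 5 else 0 for i in range(100)]
balanced_kmers, balanced_labels = undersample(kmers, labels, ratio=2.0)
```

Undersampling should be applied to the training partition only, so that the testing dataset keeps its original class distribution.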

The whole dataset should be partitioned into a training dataset, used to build the learner models, and a testing dataset, used to evaluate the generalization capability of the classification systems. System performance will be evaluated using recall, precision, F-measure, and recognition rate on both the training dataset and the testing dataset.
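The four measures can be computed from the confusion counts, as in the rough sketch below. It assumes binary labels (1 for binding site, 0 otherwise), takes "recognition rate" to mean overall accuracy, and uses the balanced F1 form of the F-measure; the function name `evaluate` is hypothetical.

```python
def evaluate(y_true, y_pred):
    """Return precision, recall, F-measure and recognition rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Guard against empty denominators, which occur when nothing is
    # predicted positive or when there are no true positives at all.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    rate = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "recognition_rate": rate}

# Toy example: 2 true sites among 10 kmers, one found, one false alarm.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = evaluate(y_true, y_pred)
```

In this toy example the recognition rate is 0.8 while precision and recall are both 0.5, which shows why accuracy alone can be misleading on imbalanced data.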

It is very important to note that, unlike the traditional way of evaluating a classifier's performance, here a kmer is classified as a motif instance if its location has at least 50% overlap with a true binding site in the DNA sequences. For example, consider the two true binding sites, both ACACGGGA, in the following DNA sequence.

ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa (1)

Suppose that the 8mers acaaACAC and ACGGGAtc are classified as binding sites by a learner model. We then count them as correct predictions, because they have 50% and 75% overlap with the true binding sites in sequence (1), respectively. Conversely, if a classifier labels them as non-binding sites, we count them as incorrect predictions, because they have at least 50% overlap with the true binding sites. Now take another 8mer, GAgaatta, in (1). If a learner model classifies it as a binding site, it is counted as misclassified, because it has only 25% overlap with the true binding site ACACGGGA.
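The overlap rule can be checked on sequence (1) with a short sketch like the one below. It assumes that overlap is measured as shared positions divided by the binding-site length (here the kmer and the site are both 8 long, so the choice of denominator does not matter), and that true sites are the upper-case ACACGGGA occurrences. The helper names are hypothetical.

```python
SEQ = "ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa"
SITE = "ACACGGGA"

def find_sites(seq, site):
    """Return (start, end) half-open intervals of every exact occurrence."""
    sites, pos = [], seq.find(site)
    while pos != -1:
        sites.append((pos, pos + len(site)))
        pos = seq.find(site, pos + 1)
    return sites

def max_overlap(start, k, sites):
    """Largest fraction of any true site covered by the kmer at start."""
    best = 0.0
    for s, e in sites:
        shared = min(start + k, e) - max(start, s)
        best = max(best, max(0, shared) / (e - s))
    return best

def is_motif_instance(start, k, sites, threshold=0.5):
    """True label of a kmer under the at-least-50%-overlap rule."""
    return max_overlap(start, k, sites) >= threshold

sites = find_sites(SEQ, SITE)
for kmer in ("acaaACAC", "ACGGGAtc", "GAgaatta"):
    start = SEQ.find(kmer)  # case-sensitive, so each kmer matches once
    print(kmer, max_overlap(start, len(kmer), sites),
          is_motif_instance(start, len(kmer), sites))
```

Run on sequence (1), this reports overlaps of 0.5, 0.75 and 0.25 for the three 8mers, so the first two are labelled as motif instances and the third is not, matching the worked example.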

Posted Date: 3/29/2013 5:36:08 AM | Location : United States






