Implement a simple utility for drawing dotplots

Reference no: EM131262808

Coding for biologists:

SUBMISSION INSTRUCTIONS

You should submit a single zipped file containing the entire work directory for the assignment.

This should include all FASTA files, all of your code, and an iPython notebook with the details of your work. All code should either be included in the notebook or be written in separate files that are imported or run from the notebook via the %run iPython magic (see https://ipython.org/ipython-doc/rel-0.10.2/html/interactive/tutorial.html).

All output and all comments should appear in the notebook. It should be possible to run the entire notebook by running the cells sequentially from the beginning to the end (check that this works by restarting the kernel and working through the notebook from the top). Code and graphic output not linked to (directly or indirectly) from the notebook will not be marked. Only the notebook, Python code, text files required by the software and graphic output produced by the software will be marked. Comment your code thoroughly and format it properly.

MARKING CRITERIA

Your work will be marked based on:
- completeness and correctness: 60%
- quality of the algorithmic solutions (including appropriate use of data and control structures, use of functions, etc.): 30%
- coding style (comments, variable names, readability of code): 10%

Outline

For this assignment, you will implement a simple utility for drawing dotplots comparing two proteins. You can refer to the dotter program and the lecture notes for the Computational Genomics module for inspiration. The assignment is presented as a sequence of stages.

Attempt all questions in the "Required functionality" part before implementing any features marked as "Optional functionality". You can implement any subset you like of the optional functionality. Check that your program runs correctly in the terminal, then use the iPython %run magic to run it from within a notebook. Include sample output for each functionality you implement and any other relevant information in the notebook.

Indicate clearly near the top of the notebook which of the questions you have attempted.

Required functionality

a) Write a dotplot program that reads two proteins from FASTA files specified on the command line (see sys.argv in the Python documentation). The program should output a simple dotplot to the terminal. The dotplot should involve only the first 70 residues of the sequence displayed horizontally and the first 20 residues of the sequence displayed vertically, so as to fit in the standard terminal screen. The first row and the first column should display the two sequences. In the dotplot proper, an asterisk (*) should mark locations corresponding to matching entries, while the rest should be left empty. A sample output (limited here for convenience to 10 residues from one sequence and 5 from the other) should look like:

     TSLWWAPQQR
    A     *
    K
    Q       **
    P      *
    R         *
Include a sample output in your notebook.
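
The behaviour described in (a) can be sketched roughly as follows. This is only one possible approach, not a reference solution: the function names read_fasta and text_dotplot are hypothetical, the reader assumes that only the first record of each FASTA file is wanted, and trailing blanks are stripped from each row as a cosmetic choice.

```python
def read_fasta(path):
    """Return the first sequence in a FASTA file as an upper-case string.

    Assumption: only the first record is used; later records are ignored.
    """
    residues = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if residues:  # a second record starts here, so stop reading
                    break
                continue
            residues.append(line.upper())
    return "".join(residues)


def text_dotplot(horizontal, vertical, width=70, height=20):
    """Build the dotplot as a list of text rows (identity matching only)."""
    top = horizontal[:width]
    side = vertical[:height]
    rows = [" " + top]  # blank corner cell, then the horizontal sequence
    for v in side:
        cells = "".join("*" if h == v else " " for h in top)
        rows.append((v + cells).rstrip())
    return rows


# The 10 x 5 example from the text, using literal strings instead of files.
print("\n".join(text_dotplot("TSLWWAPQQR", "AKQPR")))
```

In the real program the two strings would come from read_fasta applied to the file names taken from sys.argv.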

b) Code a simple help message to be displayed when the program is invoked with wrong or insufficient arguments or with the string help on the command line. Run your program from within the notebook to display the help message. To allow for easy modification and translation, the help message should be stored in a separate text file and loaded and displayed upon request.
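
A minimal sketch of the help logic in (b), under stated assumptions: the file name help.txt and the function names needs_help and load_help are hypothetical, and "wrong or insufficient arguments" is interpreted here simply as "not exactly two arguments". Once the menu from (c) exists, the no-argument case would show the menu instead.

```python
HELP_FILE = "help.txt"  # hypothetical name for the separate help text file


def needs_help(args):
    """Decide whether to show the help message.

    args is the argument list without the program name (sys.argv[1:]).
    Assumption: exactly two file names are expected at this stage.
    """
    if "help" in args:
        return True
    return len(args) != 2


def load_help(path=HELP_FILE):
    """Read the help message from its own text file so it is easy to edit."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

A caller might then write something like: if needs_help(sys.argv[1:]): print(load_help()).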

c) Program a simple menu system of the type found in clustalw that allows the user to specify the names of the input files, obtain help, and quit the program. The menu should be displayed if the program is invoked without command-line arguments, or in any case after a dotplot is produced. You should wait for the user to press the enter key before reverting to the menu, to avoid wiping out the dotplot immediately when running in a terminal. For clarity, print the following line just below the dotplot: Hit <enter> to return to menu:
Include a screenshot of the menu in the notebook.

d) Implement panning through the sequences to visualise the rest of the dotplot. When a dotplot is displayed, the user should have a choice to press one of five keys to "page" forwards or backwards through either sequence, or return to the main menu. Following this a different portion of the dotplot should be displayed, or the user should be returned to the main menu. For example, a text line printed just below the dotplot should read:

Enter [r]ight, [l]eft, [u]p, [d]own or [m]enu:

The system should be able to handle sequences with a number of residues that isn't a multiple of 20 or 70. Demonstrate this feature in the notebook.
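
One way to think about the paging in (d) is as a window origin (column, row) that moves in steps of 70 and 20. The sketch below is an illustration only; the names clamp_offset and pan are hypothetical, and it assumes that a key press which would move past either end of a sequence simply leaves the view unchanged.

```python
def clamp_offset(offset, step, length):
    """Shift a window start by one page; keep the old start if out of range."""
    new = offset + step
    if new < 0 or new >= length:
        return offset  # there is no page in that direction
    return new


def pan(key, col, row, n_cols, n_rows, width=70, height=20):
    """Return the new (col, row) window origin after a key press.

    n_cols and n_rows are the lengths of the horizontal and vertical
    sequences. Unknown keys (including "m") leave the origin unchanged.
    """
    if key == "r":
        col = clamp_offset(col, width, n_cols)
    elif key == "l":
        col = clamp_offset(col, -width, n_cols)
    elif key == "d":
        row = clamp_offset(row, height, n_rows)
    elif key == "u":
        row = clamp_offset(row, -height, n_rows)
    return col, row
```

Because Python slices are clipped at the end of a string, slicing a sequence as seq[col:col + 70] on the last page yields a shorter piece rather than an error, which covers lengths that are not multiples of 20 or 70.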

Optional functionality

e) Use a scoring matrix rather than a simple identity check to score corresponding amino acids. Only plot a (*) if the score is above a threshold. The scoring matrix should be stored in a separate file that is loaded as required. The user should be able to select the threshold with a command line option and through the menu; for example, mydotplot -t0.3 proteinA.fasta proteinB.fasta should select a threshold value of 0.3. Include sample output in the notebook and comment on the difference with respect to the simpler scoring scheme, if any (you can return to identity matching by choosing the identity matrix as your scoring scheme).
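
The thresholded scoring in (e) might look roughly like the sketch below. The assignment does not fix a file format, so this assumes a whitespace-separated layout with a header row of residue letters followed by one row per residue; the names load_matrix and is_dot are hypothetical, and treating unknown residue pairs as score 0.0 is also an assumption. The example data is a three-residue identity matrix, not a real substitution matrix.

```python
def load_matrix(lines):
    """Parse a scoring matrix from an iterable of text lines.

    Assumed layout: a header row of residue letters, then one row per
    residue starting with its letter. Lines starting with '#' are skipped.
    """
    rows = [ln.split() for ln in lines if ln.strip() and not ln.startswith("#")]
    columns = rows[0]
    scores = {}
    for row in rows[1:]:
        residue, values = row[0], row[1:]
        for col, value in zip(columns, values):
            scores[(residue, col)] = float(value)
    return scores


def is_dot(a, b, scores, threshold):
    """True if the pair scores strictly above the threshold."""
    return scores.get((a, b), 0.0) > threshold  # unknown pairs score 0.0


# Identity matrix over three residues, written as it might appear in a file.
IDENTITY_LINES = [
    "  A K Q",
    "A 1 0 0",
    "K 0 1 0",
    "Q 0 0 1",
]
scores = load_matrix(IDENTITY_LINES)
print(is_dot("A", "A", scores, 0.3), is_dot("A", "K", scores, 0.3))
```

With a file on disk, the open file handle could be passed straight to load_matrix, since it iterates over lines.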

f) Implement filtering with a window of length w.

If you are not implementing (e): only draw a (*) at position (i,j) on the dotplot if the number of matching residues in corresponding positions within windows of length w centred at positions i (respectively j) on the two sequences is above a threshold t. So for instance if w=5 and t=3 a (*) should appear at any given position only if at least 3 corresponding residues within windows of length 5 match (both in the sense that they are the same residue, and that they are in the same position within the window; so for example if the two filtering windows contain "APKTR" and "AKQWR" then A and R count as matches but K does not).
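
The window count can be sketched as below. This is an illustration under assumptions: the function names are hypothetical, w is assumed to be odd so that the window has a well-defined centre, window positions that fall outside either sequence are skipped, and the comparison follows the worked example ("at least 3" when t=3), that is, count >= t.

```python
def window_matches(seq_a, seq_b, i, j, w):
    """Count identical residues at the same offset in two windows.

    The windows have length w (assumed odd) and are centred on seq_a[i]
    and seq_b[j]. Offsets outside either sequence are ignored.
    """
    half = w // 2
    count = 0
    for k in range(-half, half + 1):
        a, b = i + k, j + k
        if 0 <= a < len(seq_a) and 0 <= b < len(seq_b) and seq_a[a] == seq_b[b]:
            count += 1
    return count


def filtered_dot(seq_a, seq_b, i, j, w, t):
    """True if at least t residues match within the two windows."""
    return window_matches(seq_a, seq_b, i, j, w) >= t


# The example from the text: only A and R match, so the count is 2.
print(window_matches("APKTR", "AKQWR", 2, 2, 5))
```

With w=5 and t=3 this pair of windows therefore produces no dot at the centre position.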

If you are implementing (e): For each position (i,j) in the two sequences, pairs of amino acids in corresponding positions in the filtering windows should be scored using the scoring matrix. These scores should be averaged and compared against the threshold. A (*) should then be printed only if the resulting average score is above the threshold.
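
For the averaged variant, a rough sketch follows. It assumes the scoring matrix is held as a dictionary keyed by residue pairs, that unknown pairs score 0.0, that w is odd, and that the average is taken only over window positions lying inside both sequences; all of these are choices made for illustration, and the function names are hypothetical.

```python
def identity_scores(alphabet):
    """Build an identity scoring scheme: 1.0 on the diagonal, nothing else."""
    return {(x, x): 1.0 for x in alphabet}


def window_average(seq_a, seq_b, i, j, w, scores):
    """Average the pair scores over windows of length w centred at i and j."""
    half = w // 2
    total, n = 0.0, 0
    for k in range(-half, half + 1):
        a, b = i + k, j + k
        if 0 <= a < len(seq_a) and 0 <= b < len(seq_b):
            total += scores.get((seq_a[a], seq_b[b]), 0.0)  # unknown pair: 0.0
            n += 1
    return total / n if n else 0.0


def averaged_dot(seq_a, seq_b, i, j, w, scores, threshold):
    """True if the average window score is strictly above the threshold."""
    return window_average(seq_a, seq_b, i, j, w, scores) > threshold
```

With identity scores, the windows "APKTR" and "AKQWR" give two matches out of five positions, so a threshold of 0.3 would draw a dot at the centre and a threshold of 0.5 would not.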

In either case you should implement a command line option -f to allow the user to request the use of the filter and specify the length of the window, and an option -t for threshold selection. For instance, mydotplot -f5 -t2.0 proteinA.fasta proteinB.fasta should produce a dotplot of protein A vs protein B, filtered with a window of length 5 and a threshold of 2.0. The same functionality should also be accessible through the menu. Invoke your dotplot program on two sample sequences, without and with filtering, include the output in the notebook and comment on the differences.
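
Options of the form -f5 and -t2.0 can be read by hand from sys.argv, but the standard argparse module also accepts a value attached directly to a short option. The sketch below uses argparse as a substitute for manual parsing; the destination names window and threshold and the function name build_parser are hypothetical. Note that argparse exits with its own usage message when the arguments do not fit, so the "help" keyword from (b) would need to be checked before calling it.

```python
import argparse


def build_parser():
    """Create a parser for: mydotplot [-fW] [-tT] fileA fileB."""
    parser = argparse.ArgumentParser(prog="mydotplot")
    parser.add_argument("-f", dest="window", type=int, default=None,
                        metavar="W", help="filter with a window of length W")
    parser.add_argument("-t", dest="threshold", type=float, default=None,
                        metavar="T", help="score threshold for drawing a dot")
    parser.add_argument("fasta", nargs=2, help="the two FASTA files")
    return parser


# Parse the example command line from the text.
args = build_parser().parse_args(
    ["-f5", "-t2.0", "proteinA.fasta", "proteinB.fasta"]
)
print(args.window, args.threshold, args.fasta)
```

In the program itself, parse_args() would be called without an argument list so that it reads sys.argv[1:].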

g) Give the user the option to display the dotplot for the entire sequences using a graphic library. I suggest the imshow function from the matplotlib library, but other equivalent choices are also fine (if this library is not present on your system, use the software installer to install python-matplotlib). https://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.imshow

Note that you will not be able to display the sequences with imshow, only the dots will be displayed as an image.

For this to work, you will need to create a two-dimensional array of the appropriate size and set each single entry to 0.0 (black) or 1.0 (white) to differentiate between dots and background. You can pass the keyword argument cmap='gray' to imshow to select a grey scale colormap. If you have implemented point (e) and/or (f), you may want to display the matching score itself as a grey level, instead of creating a black-and-white (two-level) dotplot. It is still useful to set a threshold below which the point is set to white. Depending on your scoring scheme, you may need to rescale/normalize the thresholded scores for display with imshow (read the function description and the example carefully). Graphic output should be selectable from the command line (via option -g) and from the menu of your program. Include sample output in the notebook (you can get matplotlib images to display directly in the notebook by running the magic %matplotlib inline).
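
A minimal sketch of the two-level image for (g), assuming identity matching and plain nested lists (imshow accepts any array-like input). The function names dot_matrix and show_dotplot are hypothetical. matplotlib is imported inside show_dotplot so that the matrix can be built and inspected even where the library is not installed; vmin and vmax are set explicitly so that a plot with no dots still renders as white.

```python
def dot_matrix(horizontal, vertical):
    """Build the image rows: 0.0 (black) for a dot, 1.0 (white) otherwise.

    Each row corresponds to one residue of the vertical sequence.
    """
    return [[0.0 if h == v else 1.0 for h in horizontal] for v in vertical]


def show_dotplot(matrix):
    """Display the matrix as a grey-scale image with matplotlib."""
    import matplotlib.pyplot as plt  # lazy import: only needed for display

    plt.imshow(matrix, cmap="gray", vmin=0.0, vmax=1.0,
               interpolation="nearest")
    plt.show()


matrix = dot_matrix("TSLWWAPQQR", "AKQPR")
print(len(matrix), len(matrix[0]))
```

Calling show_dotplot(matrix) in a notebook after %matplotlib inline should place the image directly below the cell.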

SUBMISSION CHECKLIST:
- Notebook contains links to all relevant code and all output required
- Notebook runs in a sequence from the first cell to the last with a fresh kernel
- Notebook and software include name of author and/or student number
- No Microsoft Word or other files other than Python code, text and a notebook file, and images generated by the code (with links in the notebook)
- All relevant files are included in the submission as a single .zip file
