7089CEM Introduction to Statistical Methods for Data Science

Reference no: EM132539187

7089CEM Introduction to Statistical Methods for Data Science - Coventry University

Modelling and analysis of gene expression data

Learning Outcome 1: Demonstrate knowledge of underlying concepts in probability and statistics used in Data Science.

Learning Outcome 2: Select and apply appropriate statistical methods or techniques to solve problems or analyse data sets.

Learning Outcome 3: Use modern software to solve real world problems and analyse large data sets.

Learning Outcome 4: Interpret the results of their analyses and communicate those results accurately.

Coursework Description:

The aim of this assignment is to fit a nonlinear time series model to a gene expression data set. Gene expression is one of the most important biological processes: information from a gene is used to synthesize a functional gene product, such as a protein. The expression of a gene can be controlled (or regulated) by one or several other genes, through a gene product (protein) called a transcription factor. Understanding how genes regulate each other, i.e. gene regulation, is important for investigating complex diseases and how cells respond to environmental stimuli.

Data:
The 'simulated' time-series expression data for 5 genes are given in the CSV file gene_data.csv. The first column contains the sampling time in minutes; the remaining 5 columns are the time-course expression data of the 5 genes $\{x_1, x_2, x_3, x_4, x_5\}$, respectively. All 5 genes are subject to additive noise, assumed to be independent and identically distributed (i.i.d.) Gaussian with zero mean and unknown variance.

Part 1: Preliminary data analysis
You should first perform an initial exploratory data analysis, by investigating:

• Time series plots
• Distribution for each gene
• Correlation and scatter plots (for each pair of genes) to examine their dependencies
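
The exploratory steps above could be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed solution: the data are synthetic stand-ins for gene_data.csv, the `np.loadtxt` call in the comment assumes a single header row, and the helper name `plot_overview` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for gene_data.csv (made-up values, NOT the coursework data).
# With the real file, something like
#   data = np.loadtxt("gene_data.csv", delimiter=",", skiprows=1)
# may work, assuming one header row and purely numeric columns.
t = np.arange(0.0, 300.0, 2.0)  # sampling time in minutes
signals = np.column_stack([np.sin(t / 40.0 + k) for k in range(5)])
genes = signals + 0.1 * rng.standard_normal(signals.shape)

# Per-gene summary statistics (a numeric view of each distribution)
means = genes.mean(axis=0)
stds = genes.std(axis=0, ddof=1)

# Histogram counts for each gene on a common set of 20 bins
bin_edges = np.linspace(genes.min(), genes.max(), 21)
hist_counts = np.column_stack(
    [np.histogram(genes[:, j], bins=bin_edges)[0] for j in range(5)]
)

# Pearson correlation between every pair of genes (5 x 5 matrix)
corr = np.corrcoef(genes, rowvar=False)


def plot_overview(t, genes):
    """Draw time series and histograms; matplotlib is imported lazily."""
    import matplotlib.pyplot as plt

    fig, (ax_ts, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))
    for j in range(genes.shape[1]):
        ax_ts.plot(t, genes[:, j], label=f"gene {j + 1}")
        ax_hist.hist(genes[:, j], bins=20, alpha=0.5)
    ax_ts.set_xlabel("time (min)")
    ax_ts.set_ylabel("expression")
    ax_ts.legend()
    ax_hist.set_xlabel("expression")
    return fig
```

Pairwise scatter plots follow the same pattern: plot `genes[:, i]` against `genes[:, j]` for each pair and compare the visual trend with the corresponding entry of `corr`.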

Part 2: Dimension reduction

• We would like to reduce the time dimension (for all 5 genes) to two using PCA. You may use either the eigen-decomposition method or the singular value decomposition method.
• Plot the 5 genes in the reduced 2-dimensional space, using different markers or colours.
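
One way to read this task is to treat each gene as a single observation whose features are its values at every time point, and then project the 5 genes onto the first two principal components. The sketch below uses the singular value decomposition on synthetic data; the array names and the data themselves are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in: 150 time points (rows) x 5 genes (columns)
n_times = 150
t = np.linspace(0.0, 300.0, n_times)
X = np.column_stack([np.sin(t / 40.0 + k) for k in range(5)])
X = X + 0.05 * rng.standard_normal(X.shape)

# Each gene becomes one observation with n_times features
G = X.T                                # shape (5, n_times)
G_centered = G - G.mean(axis=0)        # remove the across-gene mean at each time

# Thin SVD: G_centered = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(G_centered, full_matrices=False)

# Coordinates of the 5 genes in the space of the first two principal components
scores = U[:, :2] * s[:2]              # shape (5, 2)

# Fraction of total variance captured by each component
explained = s**2 / np.sum(s**2)
```

The eigen-decomposition route gives the same subspace: the rows of `Vt` are eigenvectors of `G_centered.T @ G_centered`, and `s**2` are the matching eigenvalues. The two columns of `scores` are the x and y coordinates for the 2-dimensional plot.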

Part 3: Nonlinear regression - modelling gene regulation

We know that one of the genes, $x_3$, is regulated by two other genes, $x_4$ and $x_5$. However, we do not know whether this regulation is activation or repression, or whether the regulatory interaction is linear or nonlinear. Therefore, we will fit a generic nonlinear polynomial regression model (with 2 inputs) to the data, with the following exemplar structure:

$$x_3 = \alpha_0 + \beta_1 x_4 + \beta_2 x_4^2 + \beta_3 x_4^3 + \dots + \gamma_1 x_5 + \gamma_2 x_5^2 + \gamma_3 x_5^3 + \dots + \varepsilon$$

Here $\alpha_0$ is a bias term (denoting the basal transcription rate); $\{\beta_1, \beta_2, \beta_3, \dots, \gamma_1, \gamma_2, \gamma_3, \dots\}$ are the parameters of the regression model to be estimated; and $\varepsilon$ denotes additive, zero-mean Gaussian noise.
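
For any fixed choice of polynomial terms, a model of this kind is linear in its parameters, so ordinary least squares applies directly. The sketch below fits one hypothetical structure (bias, a linear term in the first input, a squared term in the second) to synthetic data, then estimates the parameter covariance and approximate 95% confidence intervals for the mean prediction. The input names `u1` and `u2`, the chosen structure, the coefficient values and the 1.96 normal quantile are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for the two regulator genes and the regulated gene
n = 200
u1 = rng.uniform(-2.0, 2.0, n)
u2 = rng.uniform(-2.0, 2.0, n)
# Hypothetical structure used only to generate demo data
y = 0.5 + 1.2 * u1 - 0.8 * u2**2 + 0.2 * rng.standard_normal(n)

# Design matrix: one column per model term
X = np.column_stack([np.ones(n), u1, u2**2])

# Least squares estimate of the parameters
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Unbiased noise variance estimate and parameter covariance matrix
residuals = y - X @ theta_hat
p = X.shape[1]
sigma2_hat = residuals @ residuals / (n - p)
cov_theta = sigma2_hat * np.linalg.inv(X.T @ X)

# Mean prediction and its variance x_i^T Cov x_i for every sample i
y_hat = X @ theta_hat
pred_var = np.einsum("ij,jk,ik->i", X, cov_theta, X)

# Approximate 95% interval for the mean prediction (normal approximation)
half_width = 1.96 * np.sqrt(pred_var)
ci_lower = y_hat - half_width
ci_upper = y_hat + half_width
```

The matrix `cov_theta` is also what a parameter-uncertainty contour plot would be built from: each pair of parameters has a bivariate Gaussian density centred on the matching entries of `theta_hat`, with the corresponding 2 x 2 block of `cov_theta` as its covariance.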

The main objective of this Part is to identify the (polynomial) model structure, estimate model parameters from the training data, and use the identified model to predict the response/output signal.
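
As a rough illustration of structure identification, the sketch below enumerates every candidate model with at most 3 terms drawn from a bias and powers 1 to 4 of two inputs, fits each by least squares, and keeps the one with the lowest BIC. The data are synthetic, the names `u1`, `u2` and `bic_for` are hypothetical, and the BIC is written in its Gaussian form up to an additive constant.

```python
import itertools

import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in data generated from a hypothetical 3-term structure
n = 200
u1 = rng.uniform(-2.0, 2.0, n)
u2 = rng.uniform(-2.0, 2.0, n)
y = 0.5 + 1.2 * u1 - 0.8 * u2**2 + 0.2 * rng.standard_normal(n)

# Candidate terms: bias plus powers 1..4 of each input (9 terms in total)
candidates = {"1": np.ones(n)}
for k in range(1, 5):
    candidates[f"u1^{k}"] = u1**k
    candidates[f"u2^{k}"] = u2**k


def bic_for(terms):
    """Fit the given terms by least squares; return (BIC, parameters)."""
    X = np.column_stack([candidates[name] for name in terms])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ theta) ** 2))
    bic = n * np.log(rss / n) + len(terms) * np.log(n)
    return bic, theta


# Exhaustive search over all subsets of 1, 2 or 3 terms (129 models)
results = []
for size in (1, 2, 3):
    for terms in itertools.combinations(candidates, size):
        bic, theta = bic_for(terms)
        results.append((bic, terms, theta))

best_bic, best_terms, best_theta = min(results, key=lambda r: r[0])
```

Replacing the penalty `len(terms) * np.log(n)` with `2 * len(terms)` turns the criterion into AIC. A forward-selection variant would instead start from an empty model and add, one at a time, the term that most reduces the test-set MSE.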

To do this, identify the nonlinear regression model structure and estimate its parameters as follows:

• Identify the correct model structure using a model selection approach (e.g. subset selection, AIC/BIC, or an exhaustive search over all candidate model structures), so that the model gives a low mean squared error (MSE) and the model residuals/errors are close to Gaussian. You can either:

i) Split the input and output dataset into two parts: one used to train the model and the other used for testing (e.g. 80% for training, 20% for testing). Apply the forward subset selection approach to select the best model structure iteratively: in each iteration, select the most significant term (the one that most reduces the MSE on the testing data) and add it to the current model.
ii) Or select the best model using the BIC or AIC goodness-of-fit criteria, by exploring all possible combinations of terms (or a chosen set of candidate model structures).

The underlying nonlinear polynomial model may contain a bias term, a linear term, and one or a few nonlinear (input) terms. The nonlinear terms have a maximum order of 4, and the model contains no more than 3 terms in total (including bias, linear and nonlinear terms).
• Estimate the model parameters using the least squares method. This step is embedded within the model structure identification process above, since for each candidate model structure you need to estimate its parameters in order to evaluate the model's performance against the observed data.
• Once the best model structure is selected and its parameters are estimated, estimate the parameter covariance matrix and plot the corresponding parameter uncertainty p.d.f. as a 3D surface and/or as contours (similar to the example given in the lecture/lab notes). If the selected model has more than 2 parameters, plot all pair-wise combinations of parameters.
• Compute the model's output/prediction on the training data, compute the 95% confidence intervals, and plot them (with error bars) together with the mean values of the model prediction.
• Validate the model using a train-test split (you may use a different split ratio from the one used in the subset model selection stage) to check whether the identified model provides good predictions on the testing dataset.
• Use the "Approximate Bayesian Computation (ABC)" method to compute the posterior distribution of the regression model parameters (using rejection ABC and assuming a Uniform prior). Plot the marginal posterior distribution for each parameter, and the joint posterior distribution for all pair-wise combinations of parameters.
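
Rejection ABC can be sketched in a few lines: draw parameters from the Uniform prior, simulate the model output for each draw, and keep only the draws whose output is close enough to the observations. The example below uses a hypothetical two-parameter model on synthetic data. The prior bounds, the MSE distance, the noise-free simulator and the rule of accepting the closest 1% of draws are all illustrative assumptions rather than requirements of the brief.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic observations from a hypothetical model y = a + b * u + noise
n = 100
u = rng.uniform(-2.0, 2.0, n)
y_obs = 0.5 + 1.2 * u + 0.2 * rng.standard_normal(n)

# Step 1: draw candidate parameters (a, b) from independent Uniform priors
n_draws = 20000
prior_low = np.array([-1.0, 0.0])    # assumed lower bounds for (a, b)
prior_high = np.array([2.0, 3.0])    # assumed upper bounds for (a, b)
theta = rng.uniform(prior_low, prior_high, size=(n_draws, 2))

# Step 2: simulate the model mean for every draw (noise-free simulator)
y_sim = theta[:, [0]] + theta[:, [1]] * u[None, :]   # shape (n_draws, n)

# Step 3: distance between simulated and observed outputs (MSE)
distance = np.mean((y_sim - y_obs) ** 2, axis=1)

# Step 4: accept draws within the tolerance (here: the closest 1%)
epsilon = np.quantile(distance, 0.01)
accepted = theta[distance <= epsilon]

# Accepted draws approximate the posterior; summarise them
posterior_mean = accepted.mean(axis=0)
marginal_a, edges_a = np.histogram(accepted[:, 0], bins=20)
marginal_b, edges_b = np.histogram(accepted[:, 1], bins=20)
```

The histograms give the marginal posteriors, and a scatter plot of `accepted[:, 0]` against `accepted[:, 1]` shows the joint posterior for the pair. A smaller tolerance sharpens the approximation but leaves fewer accepted samples.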

Attachment:- Introduction to Statistical Methods for Data Science.rar


All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd