Complete-linkage clustering algorithm

Reference no: EM132488935

Activity: Clustering genes (Part B)

Unlike the leukemia data in the first activity, which is very high-dimensional, the YeastGalactose dataset has only moderate dimensionality (20 dimensions), so density-based clustering algorithms may work in this scenario. In this activity we will experiment with the HDBSCAN* algorithm.

Important note: One problem with the HDBSCAN* implementation that we are familiar with, available in the package dbscan, is that the version available when this assignment was prepared states that "Euclidean distance is required". So, although the theoretical HDBSCAN* model works with any distance, in principle we should not run HDBSCAN* directly with Pearson using this package. Other R implementations of the algorithm may exist and could be used instead, but we will stick with the package dbscan here and use a mathematical workaround. Specifically, it can be shown that there is a relation between Pearson similarity and Euclidean distance when the observations are normalised as unit vectors, that is, when the rows of the data matrix are rescaled so that each row is a vector with magnitude one (i.e., length = 1). Clustering the normalised data with Euclidean distance is expected to provide results that are similar to those that would be obtained by clustering the original data with Pearson similarity.
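The relation mentioned above can be checked numerically. For two rows x and y of length 1, the squared Euclidean distance is ||x - y||^2 = 2(1 - x·y), where x·y is their cosine similarity. If each row is additionally centred to mean zero before rescaling, x·y equals the Pearson correlation, so the identity becomes exact for Pearson. The following is a minimal sketch in plain Python (the assignment itself uses R), with made-up toy profiles and hypothetical helper names, not the YeastGalactose data.

```python
import math

def unit_rows(rows):
    """Divide each row by its Euclidean norm so that it has length 1."""
    return [[v / math.sqrt(sum(w * w for w in row)) for v in row] for row in rows]

def centre(row):
    """Subtract the row mean from every element."""
    mean = sum(row) / len(row)
    return [v - mean for v in row]

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    cx, cy = centre(x), centre(y)
    num = sum(a * b for a, b in zip(cx, cy))
    den = math.sqrt(sum(a * a for a in cx) * sum(b * b for b in cy))
    return num / den

def sq_euclidean(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

# Hypothetical toy profiles (assumption: not taken from the dataset).
x = [1.0, 2.0, 4.0, 3.0]
y = [2.0, 1.0, 5.0, 4.0]

# Unit-length rows: squared distance equals 2 * (1 - cosine similarity).
ux, uy = unit_rows([x, y])
cosine = sum(a * b for a, b in zip(ux, uy))
gap_cosine = abs(sq_euclidean(ux, uy) - 2 * (1 - cosine))

# Centred, then unit-length rows: squared distance equals 2 * (1 - Pearson).
cx, cy = unit_rows([centre(x), centre(y)])
gap_pearson = abs(sq_euclidean(cx, cy) - 2 * (1 - pearson(x, y)))

print(gap_cosine, gap_pearson)  # both gaps are at floating-point rounding level
```

Without the centring step the identity holds for cosine similarity only, which is consistent with the results being described as similar rather than identical.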

You are asked to:

4. Use the distance matrix as input to call the Single-Linkage clustering algorithm available from the base R package stats and plot the resulting dendrogram. Do not use any class labels to perform this step.

5. Use the distance matrix as input to call the Complete-Linkage clustering algorithm available from the base R package stats and plot the resulting dendrogram. Do not use any class labels to perform this step.

6. Use the distance matrix as input to call the Average-Linkage clustering algorithm available from the base R package stats and plot the resulting dendrogram. Do not use any class labels to perform this step.

7. Use the distance matrix as input to call Ward's clustering algorithm available from the base R package stats and plot the resulting dendrogram. Do not use any class labels to perform this step.

8. Compare the dendrograms plotted in Items 4 to 7. Visually, the dendrograms suggest that some clustering algorithm(s) generate clearer clusters than the others. In your opinion, which algorithm(s) may we be referring to and why? In particular, in which aspects do the results produced by this/these algorithm(s) look clearer? Perform Item 9 below only for this/those algorithm(s).

9. Redraw the dendrogram(s) for the selected algorithm(s) in Item 8, now using the class labels that you stored separately in Item 2 to label the observations (as arranged along the horizontal axis of the dendrogram). Do some prominent clusters in the dendrogram(s) correspond approximately to the classes (that is, the two subtypes of leukemia)?
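To make the difference between the linkage rules in Items 4 to 6 concrete, here is a small illustrative sketch in plain Python rather than the R function the assignment asks for. It performs naive agglomerative clustering on a precomputed distance matrix. The function names and the toy 1-D points are hypothetical, and Ward's method is left out because it uses a different merge criterion.

```python
from itertools import combinations

def linkage_distance(a, b, dist, method):
    """Distance between clusters a and b (lists of indices) under a linkage rule."""
    pair = [dist[i][j] for i in a for j in b]
    if method == "single":      # closest pair of members
        return min(pair)
    if method == "complete":    # farthest pair of members
        return max(pair)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(pair) / len(pair)
    raise ValueError(f"unknown linkage: {method}")

def agglomerate(dist, method="complete"):
    """Return the merge history as (members_a, members_b, height) tuples."""
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of current clusters with the smallest linkage distance.
        a, b = min(combinations(clusters, 2),
                   key=lambda p: linkage_distance(p[0], p[1], dist, method))
        height = linkage_distance(a, b, dist, method)
        merges.append((sorted(a), sorted(b), height))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(a + b)
    return merges

# Toy 1-D points (assumption: illustrative only, not gene-expression data).
points = [0.0, 1.0, 3.0, 7.0]
dist = [[abs(p - q) for q in points] for p in points]

single_heights = [h for _, _, h in agglomerate(dist, "single")]
complete_heights = [h for _, _, h in agglomerate(dist, "complete")]
print(single_heights)    # [1.0, 2.0, 4.0]
print(complete_heights)  # [1.0, 3.0, 7.0]
```

The merge heights are what a dendrogram draws on its vertical axis. On the same points, complete linkage merges at greater heights than single linkage, which is one reason the dendrograms in Item 8 can look quite different.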

15. Rescale the 205 x 20 data frame in a row-wise fashion so that each rescaled row has magnitude 1. You can achieve this by dividing each element of a row by the magnitude of the row.

16. Run HDBSCAN* (with Euclidean distance) on the rescaled version of the data frame obtained in Item 15. You can (optionally) try different values for the parameter MinPts, but MinPts = 5 is required. Plot the resulting HDBSCAN* dendrograms with and without the class labels along the horizontal axis, just like in Items 4-9 (Activity 1) and Item 14 (Activity 2).

17. Plot a contingency table. By setting MinPts = 5, the automatic cluster extraction method provided by HDBSCAN* extracts four clusters from the resulting hierarchy. Plot a contingency table of these clusters (labelled '0', '1', '2', '3' and '4', where '0' means objects left unclustered as noise/outliers) against the ground truth class labels that you stored separately in Item 12 (a factor with levels 'cluster1', 'cluster2', 'cluster3', 'cluster4').

18. Interpret the contingency table. In particular: (a) What is the best correspondence between the four found clusters and the clusters according to the ground truth, that is, the best association between cluster labels '1', '2', '3' and '4' as named by HDBSCAN* and the four known functional categories 'cluster1', 'cluster2', 'cluster3' and 'cluster4' as named in the ground truth? (b) What is the functional category for which most genes have been labelled as noise/outliers?

19. Plot the genes grouped by their class labels (that is, functional categories 'cluster1', 'cluster2', 'cluster3' and 'cluster4'), in such a way that all the genes belonging to the same class are plotted in a separate sub-figure (four sub-figures in total, each one in a different colour). Plot each gene as a time-series with 20 data points (where each point is connected by lines to its adjacent points in the series).

20. Plot a figure analogous to the one in Item 19, but now with genes grouped in separate sub-figures according to their cluster as assigned by HDBSCAN* ('1', '2', '3' and '4'), rather than by class labels. Do not plot genes that were left unclustered as noise by HDBSCAN* (labelled '0'). Use the best class-to-cluster association, as in your answer to Item 18, in order to assign each sub-figure of a cluster the same colour used in the sub-figure of the corresponding class in Item 19. For instance, supposing that the best association of class 'clusterX' in the ground truth is with HDBSCAN* cluster 'Y', according to the contingency table in Item 18, then if the genes belonging to class 'clusterX' have been plotted in red in Item 19, then the genes belonging to HDBSCAN* cluster 'Y' should also be plotted in red.
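The contingency table of Items 17 and 18, and the class-to-cluster association reused in Item 20, can be sketched as follows. This is a plain-Python illustration under stated assumptions: the label vectors are invented for the example and are not results from the dataset, and the "best association" shown is a simple per-cluster majority rule, which is only one possible reading of Item 18(a).

```python
from collections import Counter

# Hypothetical labels for eight genes; these are NOT results from the dataset.
# 0 marks noise, following the HDBSCAN* labelling described in Item 17.
clusters = [1, 1, 2, 0, 2, 1, 0, 2]
classes = ["cluster1", "cluster1", "cluster2", "cluster2",
           "cluster2", "cluster3", "cluster2", "cluster2"]

# Contingency table: counts of each (cluster, class) pair.
table = Counter(zip(clusters, classes))

# Item 18(a): for each non-noise cluster, the class it overlaps most.
# Ties are broken alphabetically, and two clusters may pick the same class.
class_names = sorted(set(classes))
best = {c: max(class_names, key=lambda k: table[(c, k)])
        for c in sorted(set(clusters) - {0})}

# Item 18(b): the class contributing the most noise points.
noisiest = max(class_names, key=lambda k: table[(0, k)])

# Print the table with clusters as rows and classes as columns.
print("cluster", *class_names)
for c in sorted(set(clusters)):
    print(c, *(table[(c, k)] for k in class_names))
```

The `best` mapping is what Item 20 needs for colouring: each cluster's sub-figure takes the colour of its associated class.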

Attachment:- data.rar

