This programming assignment is about computing topological properties of Protein-Protein Interaction (PPI) networks. Recall that a PPI network is represented by a graph G=(V,E) where nodes of V represent proteins and an edge of E connecting two nodes represents interacting proteins (either physically or functionally). The properties include clustering coefficient and node centrality. A detailed description of these properties follows:

** Clustering coefficient**

In many networks, if node *u* is connected to *v*, and *v* is connected to *w*, then it is highly probable that *u* also has a direct link to *w*. This phenomenon can be quantified using the clustering coefficient

*C*_{u}** = 2***n*_{u}/* k*_{u }(*k*_{u }-1)

where *n*_{u} is the number of edges connecting the *k*_{u }neighbors of node *u* to each other. In other words, *C*_{u} gives the number of 'triangles' that go through node i*, *whereas *k*_{u}(*k*_{u} -1)/2 is the total number of triangles that could pass through node u*, *should all of node u's neighbors be connected to each other.

The average clustering coefficient,<*C*>, characterizes the overall tendency of nodes to form clusters or groups. An important measure of the network's structure is the function *C*(*h*),which is defined as the average clustering coefficient of all nodes of degree *h*, i.e. with *h *adjacent vertices.

**Closeness centrality of a node**

Closeness-centrality is a measure of node centrality and uses information about the length of the shortest paths within a network; it uses the sum of the shortest distances of a node to all other nodes. The closeness-centrality of node *u* is defined as the reciprocal of this sum:

*C*_{clo}**(***u*) = 1/(**Σ**_{v}_{∈}_{V}** dist (u , v)).**

**Implementation**

Write 5 subroutines:

1. Clustering-Coefficient

2. All-Pairs-Shortest-Paths

3. Create-Adjacency-Matrix

4. Node-Centrality

5. Get-Shortest-Paths

The subroutine **Clutering-Coefficient** outputs 1) the average clustering coefficient <*C*>, 2) *C*(*H*), for all H values, and 3) the top 5 nodes (proteins) with highest cluster coefficient. The protein ids should be those used in the .sif file.

The procedure **Create-Adjacency-Matrix** takes as input argument the .sif representation of a graph and generates the adjacency matrix representation of the graph. It uses hashes (as defined in perl) to map the protein ids into indexes of the adjacency matrix.

The procedure **All-Pairs-Shortest-Paths** has as input argument the adjacency matrix representation of a graph. The subroutine returns a two-dimensional matrix, A, with A[i,j] giving the length of the shortest path between nodes i and j.

** **The subroutine **Node-Centrality** outputs the top 5 nodes according to the centrality measure, that is the 5 proteins that are most central. The protein ids should be those used in the .sif file.

** **The subroutine **Get-Shortest-Path** takes in input a pair of nodes and the adjacency matrix representation of a graph and returns a shortest path between the two nodes. Call this subroutine from the main using as input parameters the two nodes with highest value of node-centrality.

** Data**

In this assignment you analyze the Protein-Protein Interaction (PPI) graph of the herpes Kaposi virus. The file kshv.cys (available at t-square, Resources) contains such a graph in cytoscape format.

You select the subgraph of this network consisting of the nodes with degree **equal or less than k**,(k=7).

This will be the input to your perl program Properties-PPI-Networks.pl that calls the subroutines described above.

Electronically submit the following:

a. the file Properties-PPI-Networks.pl containing the main program and all subroutines.

b. A document where you state the time complexity of each subroutine as a function of the number of nodes and/or edges and explain how you obtained it.

** **