# Principal Component Analysis

Principal Component Analysis (PCA) uses spot expression levels across gels to determine the principal axes of expression variation. Transforming and plotting the expression data in principal component space allows us to separate the gel samples according to expression variation. This is useful for identifying gel outliers.

Consider a simple experiment with 2 gels and 15 spots on each gel. We can plot the normalised volumes of the spots in a 2-dimensional graph:

The first step in PCA is to draw a new axis representing the direction of maximum variation through the data. This is known as the first principal component.

Next, another axis is added orthogonal to the first and positioned to represent the next highest variation through the data. This is the second principal component.

The data is then transformed (rotated) to view the points on the new axes.
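The three steps above (find the first axis, add an orthogonal second axis, rotate) can be sketched with NumPy. This is a minimal illustration rather than the software's own implementation: the volumes are simulated, the variable names are hypothetical, and the axes are obtained by an eigendecomposition of the covariance matrix, which is one common way to compute PCA.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical normalised volumes: 15 spots (rows) measured on 2 gels (columns).
gel1 = rng.normal(loc=1.0, scale=0.5, size=15)
gel2 = 0.8 * gel1 + rng.normal(scale=0.1, size=15)
volumes = np.column_stack([gel1, gel2])

# Centre each gel so the new axes pass through the middle of the data.
centred = volumes - volumes.mean(axis=0)

# Eigenvectors of the covariance matrix give the directions of variation;
# eigenvalues give the variance along each direction.
cov = np.cov(centred, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns ascending order, so reverse to put the first PC first.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
components = eigvecs[:, order]  # column 0 = first PC, column 1 = second PC

# Rotate the data onto the new axes.
rotated = centred @ components
```

After the rotation, the variance of the first column of `rotated` equals the largest eigenvalue, and the two columns are uncorrelated.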

Obviously, with just 2 gels this seems fairly pointless, as our brains can quite easily see the relationships between points in a two-dimensional space. However, with 3 gels the points are plotted in a 3-dimensional space, with 4 gels the points are plotted in a 4-dimensional space, and so on. In these cases, the process of adding more principal components continues, each one orthogonal to all of the previous ones and each one accounting for less and less of the variance in the data set.
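To illustrate how each successive component accounts for a smaller share of the variance, the sketch below computes the fraction of variance per component for a hypothetical 4-gel experiment. The data are simulated and the function name `explained_variance_ratio` is chosen here for illustration; the singular value decomposition is used because it returns the components already sorted from largest to smallest.

```python
import numpy as np

def explained_variance_ratio(data):
    """Fraction of total variance captured by each principal component."""
    centred = data - data.mean(axis=0)
    # Singular values are returned in descending order.
    singular_values = np.linalg.svd(centred, compute_uv=False)
    variances = singular_values**2 / (data.shape[0] - 1)
    return variances / variances.sum()

rng = np.random.default_rng(1)
# Hypothetical data: 15 spots (rows) on 4 gels (columns).
volumes = rng.normal(size=(15, 4))
ratios = explained_variance_ratio(volumes)
```

The fractions in `ratios` never increase from one component to the next and sum to 1, which is why the first two or three components are usually the ones plotted.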

The result of this is that we can visualise spots (and gels) in two- or three-dimensional space in such a way that spots that behave similarly (i.e. show little variation relative to one another) appear close together on the PCA plot, while spots that behave differently appear far apart. By displaying gels as well as spots on the same graph (called a biplot), we can show which spots are contributing to the differences between gels. This can then be used to determine which spots are most important in distinguishing a particular gel or group from the other gels or groups.

# The PCA biplot

We can graph both transformed spot and gel data on a biplot. The biplot contains a lot of information and can be helpful in interpreting relationships between experimental groups and spots. It can also help to identify outlier gels, i.e. gels that have different properties to other gels in the same group. In the biplot shown below, we can see that gels from each group (the coloured dots) are close to each other. However, the gels in group "Drug C" (the orange dots) are not as close to each other as the gels in the other three groups. The spots are also shown and appear to form two distinct groups.
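As a rough sketch of how the two sets of biplot coordinates can be obtained, the example below uses one common convention: gels are plotted by their principal component scores and spots by their loadings, both taken from a singular value decomposition of the centred gels-by-spots matrix. The scaling used by any particular software package may differ, and the data, group layout and the function name `biplot_coordinates` are assumptions made for illustration.

```python
import numpy as np

def biplot_coordinates(expression, n_components=2):
    """Return (gel_scores, spot_loadings) for a gels-by-spots matrix."""
    centred = expression - expression.mean(axis=0)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    gel_scores = u[:, :n_components] * s[:n_components]  # one row per gel
    spot_loadings = vt[:n_components].T                  # one row per spot
    return gel_scores, spot_loadings

rng = np.random.default_rng(2)
# Hypothetical data: 8 gels (two groups of 4) and 30 spots, where the
# first 10 spots are more highly expressed in the second group.
group_effect = np.repeat([0.0, 2.0], 4)[:, None] * (np.arange(30) < 10)
expression = rng.normal(scale=0.3, size=(8, 30)) + group_effect

gel_scores, spot_loadings = biplot_coordinates(expression)
```

Plotting the rows of `gel_scores` and `spot_loadings` on the same axes gives a biplot; in this simulated case the two groups of gels fall on opposite sides of the first component.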

It is important to realise that if only those spots that are significant (e.g. p-value < 0.05) are chosen, the PCA plot will be more likely to cluster gels according to their group. This is because a significant spot is one which exhibits differences between groups, and PCA captures those differences. Therefore, a PCA restricted to significant spots will almost always show some sort of grouping. On the other hand, if we select all spots and look at the biplot, we would still hope to see the groupings we expect. This can be a better indication of whether we have any gel outliers. Finally, if all spots are used in the biplot, it may be more useful to look at the second and third principal components. This is because PCA captures whatever variation exists in the spot data, and when all spots are included most of them will show no significant change between groups (i.e. little variation), so some other underlying source of variation may be captured in the first component.
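The selection effect described above can be demonstrated on pure noise. In the sketch below no spot truly differs between two hypothetical groups of 4 gels, yet restricting the PCA to spots that pass a two-sample t-test still separates the groups on the first component. The data, group sizes and names are assumptions for illustration; the t statistic is computed by hand, and the threshold 2.447 is the two-sided 5% critical value for 6 degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(4)
n_per_group, n_spots = 4, 2000

# Pure noise: no spot truly differs between the two hypothetical groups.
expression = rng.normal(size=(2 * n_per_group, n_spots))
a, b = expression[:n_per_group], expression[n_per_group:]

# Two-sample t statistic per spot (equal group sizes, pooled variance).
pooled_var = (a.var(axis=0, ddof=1) + b.var(axis=0, ddof=1)) / 2
t_stat = (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(pooled_var * 2 / n_per_group)

# |t| > 2.447 corresponds to p < 0.05 (two-sided) with 6 degrees of freedom.
significant = np.abs(t_stat) > 2.447

def first_component_scores(data):
    """Gel coordinates on the first principal component."""
    centred = data - data.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, 0] * s[0]

pc1_selected = first_component_scores(expression[:, significant])
pc1_all = first_component_scores(expression)
```

Comparing `pc1_selected` with `pc1_all` shows why a grouping seen with significant spots only is weak evidence on its own: the selection step builds the group difference into the data before PCA is applied.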

# Interpretation of spot position

The spot positions can be interpreted as follows. Consider spot number 20, shown in red on the biplot. Imagine a line going from the (0,0) position to the spot and continuing in the opposite direction. We can think of this as the spot axis. Like all axes, it has a positive side, in the direction of the spot, and a negative side, on the other side of the (0,0) point.

Now, gels on the positive side of the axis will have a high expression value for the spot, while gels on the negative side will have a low expression value for this spot. The closer the gel is to the axis, the greater the influence of this spot on that gel's position. Bear in mind, however, that gel positions are determined by all spots. Looking at the expression profile for spot 20, we see that this interpretation holds.

Drug C has a low expression value for spot 20, while Drug A and Drug B have higher expression values for the spot. So, in general, we can say that spots which are close to a gel group on the biplot will have a higher expression value in that group than spots further away. Also, spots that are clustered together on the biplot should have similar expression profiles and therefore should be clustered together in the dendrogram.
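The spot-axis reading can also be checked numerically. Assuming the biplot uses principal component scores for gels and loadings for spots (one common convention, not necessarily the one used by the software), projecting each gel onto a spot's axis approximates that spot's centred expression value in the gel. The data below are simulated, and spot index 0 is a hypothetical stand-in for a spot such as spot 20.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data: 6 gels (rows) by 20 spots (columns); spot index 0 is
# high in the first three gels and low in the last three.
expression = rng.normal(scale=0.2, size=(6, 20))
expression[:3, 0] += 3.0

centred = expression - expression.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)
gel_scores = u[:, :2] * s[:2]  # gel positions on the biplot
spot_loadings = vt[:2].T       # spot positions on the biplot

# The spot axis runs from (0, 0) through the spot's position.
spot_axis = spot_loadings[0]

# Positive projection: gel lies on the spot's side of the origin.
projection = gel_scores @ spot_axis
```

Gels with a positive value in `projection` lie on the positive side of the spot axis and have above-average expression for the spot; gels with a negative value lie on the other side. Because only two components are kept, the projection is an approximation of the centred expression, not an exact reconstruction.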