Source: Manly, Bryan F.J. Multivariate Statistical Methods: A Primer, Third Edition, CRC Press, 07/2004.
Tutorial Solutions – Week 11 (MDS)
Question 1:
Describe the process of MDS.
Solution:
• Define the original positions of the objects in multidimensional space (these give the observed distances between objects).
• Specify the number m of reduced dimensions (typically 2).
• Construct an initial configuration of the objects in m dimensions.
• Regress the distances in this configuration against the observed (measured) distances.
• Determine the stress (the disagreement between the configuration distances and the values predicted by the regression).
• If the stress is high, reposition the points in m dimensions in the direction of decreasing stress, and repeat until the stress falls below some threshold.
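The regression and stress steps above can be sketched in base R. This is only an illustration under stated assumptions: the data are made up, the initial configuration is taken from cmdscale(), the regression is monotone (isoreg()), and the stress is Kruskal's stress formula 1. It is not the code that monoMDS() runs internally.

```r
# Toy data (made up for illustration): 10 objects measured on 4 variables
set.seed(1)
X <- matrix(rnorm(40), nrow = 10)

obs  <- as.vector(dist(X))           # observed distances
conf <- cmdscale(dist(X), k = 2)     # initial configuration in m = 2 dimensions
cfg  <- as.vector(dist(conf))        # distances within the configuration

# Monotone regression of the configuration distances on the observed ones
o    <- order(obs)
dhat <- isoreg(obs[o], cfg[o])$yf    # predicted distances

# Kruskal's stress 1: disagreement between configuration and predicted distances
stress1 <- sqrt(sum((cfg[o] - dhat)^2) / sum(cfg[o]^2))
stress1
```

An iterative MDS routine would now move the points so as to reduce stress1 and repeat the two steps until the improvement becomes negligible.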
Question 2:
What are the similarities and differences between MDS and CA and PCA?
Solution:
MDS and CA are both based on distances, and neither requires a priori groups.
They differ in how they use the distance matrix to reduce dimensions.
• CA sequentially pairs objects and recalculates distances according to the chosen clustering algorithm (nearest neighbour etc.). The objects/cases/samples are then displayed in a 2D dendrogram.
• MDS tries to represent the distance matrix spatially, on a 2D ordination plot. All distances are altered (ideally only slightly) so that the plot gives the closest possible approximation of the true distances.
A metric MDS based on Euclidean distances and a linear regression process (a linear MDS) should produce very similar results to a PCA, which also reduces dimensions by finding a linear approximation of the original variables in lower dimensions.
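The link between metric MDS and PCA can be checked numerically. The sketch below uses made-up data and assumes classical scaling (cmdscale()) on Euclidean distances and an unscaled PCA (prcomp() with its defaults); the two sets of scores should then agree up to the sign of each axis.

```r
# Toy data (made up for illustration): 15 objects, 4 variables
set.seed(42)
X <- matrix(rnorm(60), nrow = 15)

mds <- cmdscale(dist(X), k = 2)    # metric MDS on Euclidean distances
pca <- prcomp(X)$x[, 1:2]          # scores on the first two PCs

# Axis signs are arbitrary, so compare absolute values
max(abs(abs(mds) - abs(pca)))      # effectively zero
```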
Question 3:
Complete the exercise at the end of Chapter 11 of Manly using the ‘europeemploy.txt’ file available
on the StudyDesk.
Solution:
> em <- ...   # data read from 'europeemploy.txt'
> str(em)
‘data.frame’: 30 obs. of 9 variables:
$ AGR: num 2.6 5.6 5.1 3.2 22.2 13.8 8.4 3.3 4.2 11.5 …
$ MIN: num 0.2 0.1 0.3 0.7 0.5 0.6 1.1 0.1 0.1 0.5 …
$ MAN: num 20.8 20.4 20.2 24.8 19.2 19.8 21.9 19.6 19.2 23.6 …
$ PS : num 0.8 0.7 0.9 1 1 1.2 0 0.7 0.7 0.7 …
$ CON: num 6.3 6.4 7.1 9.4 6.8 7.1 9.1 9.9 0.6 8.2 …
$ SER: num 16.9 14.5 16.7 17.2 18.2 17.8 21.6 21.2 18.5 19.8 …
$ FIN: num 8.7 9.1 10.2 9.6 5.3 8.4 4.6 8.7 11.5 6.3 …
$ SPS: num 36.9 36.3 33.1 28.4 19.8 25.5 28 29.6 38.3 24.6 …
$ TC : num 6.8 7 6.4 5.6 6.9 5.8 5.3 6.8 6.8 4.8 …
> distem <- dist(em, method = "euclidian", diag = TRUE, upper = FALSE)
> attr(distem, "Labels") <- ...   # country names
> distem
            Belgium Denmark France Germany Greece Ireland  Italy Luxembourg
Belgium       0.000
Denmark       3.938   0.000
France        4.915   4.184  0.000
Germany      10.042  10.319  7.319   0.000
Greece       26.321  24.037 22.292  22.233  0.000
Ireland      16.102  14.073 11.773  12.367 10.702   0.000
Italy        12.826  12.701 10.091   9.017 16.918   8.639  0.000
Luxembourg    9.311  10.400  6.814   6.888 21.996  12.169  7.430      0.000
Netherlands   7.070   7.927  8.733  14.663 27.267  17.613 16.255     13.352
.
. (subset only)
There are 30 countries and 9 workforce types (variables). Therefore the distance matrix between
countries is 30×30.
Using monotonic non-metric MDS
> library(vegan)
> (cmmds <- monoMDS(distem, k = 2))

Call:
monoMDS(dist = distem, k = 2)
Non-metric Multidimensional Scaling
30 points, dissimilarity ‘euclidean’, call ‘dist(x = em, method = “euclidian”, diag
= TRUE, upper = FALSE)’
Dimensions: 2
Stress: 0.05355538
Stress type 1, weak ties
Scores scaled to unit root mean square, rotated to principal components
Stopped after 59 iterations: Stress nearly unchanged (ratio > sratmax)
> (cmmds <- monoMDS(distem, k = 3))

Call:
monoMDS(dist = distem, k = 3)
Non-metric Multidimensional Scaling
30 points, dissimilarity ‘euclidean’, call ‘dist(x = em, method = “euclidian”, diag
= TRUE, upper = FALSE)’
Dimensions: 3
Stress: 0.02255763
Stress type 1, weak ties
Scores scaled to unit root mean square, rotated to principal components
Stopped after 158 iterations: Stress nearly unchanged (ratio > sratmax)
Even with only 2 dimensions the stress is 0.054, which is very good. Adding a third dimension lowers the stress by 0.031 (to 0.023); however, it is debatable whether this reduction is worth the additional complexity in interpretation. We will look at some plots to help us decide.
> plot(cmmds,choices = c(1,2))
> plot(cmmds,choices = c(1,3))
Albania and Turkey are distant from the other countries on dimension 1. Czech, Hungary, Yugoslavia, Bulgaria and Romania stand out from the other countries along dimension 2. Gibraltar is distant from the others along both dimensions 2 and 3.
Question 4:
The data file ‘plantsp.txt’ contains data for 17 variables measured on 25 plant species.
a) Create a distance matrix for these data based on Euclidean distances.
Solution:
> ps <- ...   # data read from 'plantsp.txt'
> head(ps)
Species X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17
1 Festuca_ovina 38 43 43 30 10 11 20 0 0 5 4 1 1 0 0 0 0
2 Anemone_nemorosa 0 0 0 4 10 7 21 14 13 19 20 19 6 10 12 14 21
3 Stallaria_holostea 0 0 0 0 0 6 8 21 39 31 7 12 0 16 11 6 9
4 Agrostis_tenuis 10 12 19 15 16 9 0 9 28 8 0 4 0 0 0 0 0
5 Ranunculus_ficaria 0 0 0 0 0 0 0 0 0 0 13 0 0 21 20 21 37
6 Mercurialis_perennis 0 0 0 0 0 0 0 0 0 0 1 0 0 0 11 45 45
> str(ps)
‘data.frame’: 25 obs. of 18 variables:
$ Species: Factor w/ 25 levels “Achillea_millefolium”,..: 8 4 22 2 19 16 18 20 24
6 …
$ X1 : int 38 0 0 10 0 0 1 0 0 0 …
$ X2 : int 43 0 0 12 0 0 0 7 0 0 …
$ X3 : int 43 0 0 19 0 0 5 0 1 0 …
$ X4 : int 30 4 0 15 0 0 6 10 4 0 …
$ X5 : int 10 10 0 16 0 0 2 9 6 0 …
$ X6 : int 11 7 6 9 0 0 8 9 9 8 …
$ X7 : int 20 21 8 0 0 0 10 3 9 0 …
$ X8 : int 0 14 21 9 0 0 15 9 9 14 …
$ X9 : int 0 13 39 28 0 0 12 8 11 2 …
$ X10 : int 5 19 31 8 0 0 15 9 11 14 …
$ X11 : int 4 20 7 0 13 1 4 2 6 3 …
$ X12 : int 1 19 12 4 0 0 5 5 5 9 …
$ X13 : int 1 6 0 0 0 0 6 5 4 8 …
$ X14 : int 0 10 16 0 21 0 7 1 1 7 …
$ X15 : int 0 12 11 0 20 11 0 7 7 7 …
$ X16 : int 0 14 6 0 21 45 0 0 0 2 …
$ X17 : int 0 21 9 0 37 45 0 0 0 1 …
> d <- dist(ps[, 2:18])
> d
           1          2          3          4          5          6          7
2  88.780629
3  97.846819  40.669399
4  62.337790  55.856960  50.596443
5  97.066987  47.623524  65.711491  70.228199
6 104.259292  61.057350  79.221209  79.271685  36.138622
7  77.265775  38.418745  38.832976  35.242020  58.068925  71.512237
8  74.363970  42.755117  48.062459  32.403703  55.641711  68.293484  19.183326
.
. (subset only)
There are 25 plant species with measurements on 17 variables for each. The distance matrix is
therefore 25×25. The first column is species name and needs to be excluded for the distance
matrix.
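As a small illustration of excluding a label column before calling dist(), the sketch below uses a made-up three-row data frame (the names toy and d_toy are hypothetical and are not part of the tutorial data).

```r
# Made-up data: a label column followed by two numeric variables
toy <- data.frame(Species = c("sp_a", "sp_b", "sp_c"),
                  X1 = c(0, 3, 0),
                  X2 = c(0, 4, 1))

d_toy <- dist(toy[, -1])                          # drop the label column
attr(d_toy, "Labels") <- as.character(toy$Species)
d_toy

dim(as.matrix(d_toy))   # 3 objects give a 3 x 3 distance matrix
```

The distance between sp_a and sp_b is sqrt(3^2 + 4^2) = 5, which is easy to confirm by hand.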
b) Perform metric MDS and discuss the GoF test and the number of dimensions you think
would be useful. Plot dimensions 1 and 2 and then 1 and 3 and discuss the relationships
among species, particularly the position of Stallaria holostea.
> (fit2 <- cmdscale(d, eig = TRUE, k = 2))
$points
             [,1]        [,2]
 [1,]  60.4850221 -28.3156608
 [2,] -20.7374748   1.8342349
 [3,] -18.5655061  24.7037206
 [4,]  20.3210729   5.6372771
 [5,] -32.1398123 -28.9827810
 [6,] -37.6545200 -43.5452577
 [7,]   0.1855776  14.2840618
 [8,]   3.3652378   7.6634749
 [9,]  -1.6367840  11.9407654
[10,]  -7.5146840  10.5929759
[11,] -10.1085755   0.4505646
[12,]   5.9047583   2.9353344
[13,]  -0.8124214   9.4202416
[14,]  13.6299304  -3.9186646
[15,]  14.1650637  -4.3547278
[16,]  -0.1661296   9.9131133
[17,]  16.2330364  -7.4864704
[18,]  -8.2162971  -2.3563177
[19,]  -1.9096582  10.3937218
[20,]  -3.4548870   6.7744489
[21,]  -4.3606353   7.3178109
[22,]  -5.1331735   3.5356719
[23,]  14.0967222  -6.0313082
[24,]  -5.6337291  -0.6331857
[25,]   9.6578666  -1.7730440

$eig
 [1]  8.607303e+03  5.252585e+03  4.053839e+03  1.439755e+03  9.671400e+02  8.112199e+02
 [7]  6.324233e+02  3.846597e+02  1.624553e+02  1.385690e+02  1.108017e+02  5.813185e+01
[13]  4.994791e+01  4.916706e+01  2.950288e+01  1.528035e+01  7.539100e+00  4.502666e-13
[19]  3.872906e-13  2.905052e-13  4.561327e-14 -1.008677e-13 -1.464934e-13 -1.566099e-13
[25] -1.133859e-12

$GOF
[1] 0.6086822 0.6086822
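The GoF with two dimensions is only moderate (0.61), so try 3D and 4D:
> fit3 <- cmdscale(d, eig = TRUE, k = 3)
> fit3$GOF
[1] 0.7867139 0.7867139
> fit4 <- cmdscale(d, eig = TRUE, k = 4)
> fit4$GOF
[1] 0.8499433 0.8499433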
GoF increases as dimensions are added. Going from 2D to 3D improves the GoF by 0.18 (0.61 to 0.79), whereas going from 3D to 4D improves it by only 0.06 (0.79 to 0.85). I think 3D would be the best compromise between fit and ease of interpretation.
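The GoF values can be reproduced by hand from the eigenvalues, since the GoF for k dimensions is the sum of the first k eigenvalues divided by the sum of all of them. The sketch below copies the 17 non-negligible eigenvalues from the $eig output; treating the remaining eight (all of order 1e-13) as zero is an assumption made for simplicity.

```r
# Non-negligible eigenvalues from the $eig output of the 2D metric MDS
eig <- c(8607.303, 5252.585, 4053.839, 1439.755, 967.1400, 811.2199,
         632.4233, 384.6597, 162.4553, 138.5690, 110.8017, 58.13185,
         49.94791, 49.16706, 29.50288, 15.28035, 7.539100)

gof <- cumsum(eig) / sum(eig)   # GoF for 1, 2, 3, ... dimensions

round(gof[2:4], 3)              # 0.609 0.787 0.850
round(diff(gof[2:4]), 3)        # gains from adding a 3rd and a 4th dimension: 0.178 0.063
```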
> x <- fit3$points[, 1]
> y <- fit3$points[, 2]
> y2 <- fit3$points[, 3]
> plot(x, y, xlab = "Dimension 1", ylab = "Dimension 2", main = "Metric MDS", type = "n")
> text(x, y, labels = ps$Species, cex = .7)
> plot(x, y2, xlab = "Dimension 1", ylab = "Dimension 3", main = "Metric MDS", type = "n")
> text(x, y2, labels = ps$Species, cex = .7)
Dimension 1 has the largest scale and represents most of the differences among species.
Several of the species that separate along dimension 2 also separate along dimension 3, although to different degrees.
From the dataset we can see that Stallaria_holostea is the 3rd species.
Species X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17
1 Festuca_ovina 38 43 43 30 10 11 20 0 0 5 4 1 1 0 0 0 0
2 Anemone_nemorosa 0 0 0 4 10 7 21 14 13 19 20 19 6 10 12 14 21
3 Stallaria_holostea 0 0 0 0 0 6 8 21 39 31 7 12 0 16 11 6 9
From the MDS coordinates we can see that it is well separated from the main group on dimension 2 (where it has the maximum positive coordinate, 24.7) and is also the most distant species on dimension 3 (maximum positive coordinate, 34.16):
$points
[,1] [,2] [,3]
[1,] 60.4850221 -28.3156608 25.9083930
[2,] -20.7374748 1.8342349 26.8002953
[3,] -18.5655061 24.7037206 34.1638978
[4,] 20.3210729 5.6372771 14.5269066
c) Perform monotonic non-metric MDS. Discuss the stress value and plot dimensions 1 and 2.
Solution:
> library(vegan)
> (psmds <- monoMDS(d, k = 2))

Call:
monoMDS(dist = d, k = 2)
Non-metric Multidimensional Scaling
25 points, dissimilarity ‘euclidean’, call ‘dist(x = ps[, 2:18])’
Dimensions: 2
Stress: 0.07061471
Stress type 1, weak ties
Scores scaled to unit root mean square, rotated to principal components
Stopped after 80 iterations: Stress nearly unchanged (ratio > sratmax)
The stress value of 0.071 is fairly good. Note that non-metric MDS fits the ranks of the differences, not the magnitudes of the differences.
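To illustrate what fitting ranks rather than magnitudes means, the sketch below (made-up data, hypothetical names) applies a monotone transformation to a set of distances. The magnitudes change but the rank order does not, so a rank-based method is given the same information in both cases.

```r
# Made-up data: distances among 10 objects on 3 variables
set.seed(7)
dv <- as.vector(dist(matrix(rnorm(30), nrow = 10)))

dv_sq <- dv^2                          # monotone transformation of the distances

identical(rank(dv), rank(dv_sq))       # the rank order is unchanged
isTRUE(all.equal(dv, dv_sq))           # the magnitudes are not
```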
> plot(psmds,choices = c(1,2)) #simple plot
> #to add labels to plot first extract coordinates
> names(psmds)
[1] “nobj” “nfix” “ndim” “ndis” “ngrp”
[6] “diss” “iidx” “jidx” “xinit” “istart”
[11] “isform” “ities” “iregn” “iscal” “maxits”
[16] “sratmx” “strmin” “sfgrmn” “dist” “dhat”
[21] “points” “stress” “grstress” “iters” “icause”
[26] “call” “model” “distmethod” “distcall”
From the names() function we can see that there is a section of the output called ‘points’. This
should hold the co-ordinates, but let’s check:
> psmds$points
MDS1 MDS2
1 -2.87749686 0.633550270
2 1.16313325 -0.569749557
3 0.87872781 -1.442779502
4 -0.88849627 -0.513720874
5 1.44200580 0.818388938
6 1.55893907 1.627973932
7 0.10416314 -0.441481136
8 -0.09443011 -0.137461848
9 0.06892845 -0.247692661
10 0.29527112 -0.181017874
11 0.27390054 0.074388009
12 -0.18709870 0.060661213
13 -0.08377407 -0.853051008
14 -0.36883408 0.139612798
15 -0.41628623 0.287403007
16 -0.20036420 -0.332983865
17 -0.54416885 0.215378496
18 0.19071568 0.194766380
19 0.05967630 -0.154004028
20 0.06985502 -0.008426629
21 0.12392560 -0.016878282
22 0.13010259 0.088686937
23 -0.46567226 0.208534651
24 0.12560049 0.167525075
25 -0.35832323 0.382377559
The labels currently used on the plot are just the row labels of each set of co-ordinates.
We can use the species labels from the original data:
> x <- psmds$points[, 1]
> y <- psmds$points[, 2]
> plot(x, y, xlab = "Dimension 1", ylab = "Dimension 2", main = "non-Metric MDS", type = "n")
> text(x, y, labels = ps$Species, cex=.7)
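The non-metric MDS plot is very similar to the metric one, although it is the mirror image. The orientation of MDS axes is arbitrary, so a reflection does not change the interpretation.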
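A mirror image carries the same information because reflecting a configuration leaves every inter-point distance unchanged. A minimal check on a made-up configuration (the names conf and mirrored are hypothetical):

```r
# Made-up 2D configuration of 10 points
set.seed(3)
conf <- matrix(rnorm(20), ncol = 2)

# Reflect the configuration by reversing the sign of the first axis
mirrored <- conf %*% diag(c(-1, 1))

# The inter-point distances are identical
all.equal(as.vector(dist(conf)), as.vector(dist(mirrored)))
```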
d) Perform PCA on this data and produce a biplot. Compare this plot to the metric MDS plot
from part b).
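> ps.prcomp <- prcomp(ps[, 2:18])   # call assumed: PCA on the 17 numeric columns
> biplot(ps.prcomp, cex = c(1, 0.7))
The biplot labels the species by row number, so to compare like with like we redraw the metric MDS plot with the row labels in place of the species names:
> x <- fit2$points[, 1]   # metric MDS coordinates from part b)
> y <- fit2$points[, 2]
> plot(x, y, xlab = "Dimension 1", ylab = "Dimension 2",
+ main = "Metric MDS", type = "n")
> text(x, y, labels = rownames(ps))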
A metric MDS based on Euclidean distances (a linear MDS) should produce very similar results to a PCA.
Comparing the PCA biplot with the metric MDS ordination plot of the species data, species 1, 2, 3, 4 and 6 are in relatively similar positions on both plots. We know that the 2D MDS was not a perfect fit (GoF = 0.61), and the eigenvalues for the PCA suggest that at least the first five PCs should be considered. These 2D plots must therefore be interpreted with caution, and it should be stated that they represent only a moderate amount of the overall variation, or of the distances, among species.
The axes of an MDS cannot be related back to the individual original variables, whereas in PCA the variable loadings (and biplot vectors) relate directly to the PCs.
Because it was hard to see any relationships among species in the central cluster, it may be worth removing the ‘outlier’ species from this analysis and rerunning the distance matrix and MDS to see whether any relationships are currently hidden among the ‘centre’ species.
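The effect of dropping outliers before rerunning the analysis can be sketched on made-up data. Everything here is an assumption for illustration: the toy data, the choice of rows 21 and 22 as the outliers, and the use of cmdscale(). For the plant data the rows to drop would have to be chosen from the ordination plot.

```r
# Made-up data: 20 similar objects plus 2 distant ones (rows 21 and 22)
set.seed(11)
centre   <- matrix(rnorm(40), nrow = 20)
outliers <- matrix(rnorm(4, mean = 15), nrow = 2)
toy      <- rbind(centre, outliers)

full <- cmdscale(dist(toy), k = 2)               # MDS on all objects

keep    <- setdiff(seq_len(nrow(toy)), 21:22)    # hypothetical outlier rows
reduced <- cmdscale(dist(toy[keep, ]), k = 2)    # distance matrix and MDS rerun

# Spread of each axis: without the outliers the plot is no longer
# dominated by the gap between them and the central cluster
apply(full, 2, function(v) diff(range(v)))
apply(reduced, 2, function(v) diff(range(v)))
```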