SOLUTION: Assignment 1 | My Assignment Tutor

Assignment 1
Total Marks: 100; Weighting: 20%
Due: 13/4/21 11.55pm

Instructions:
• Please ensure you have completed the Academic Integrity Course.
• Submit only one file in pdf format to the link on the Study Desk.
• Assume that your report will be read by someone familiar with the data set but with limited statistical knowledge. Fully explain all plots, statistics and results: explain what they mean statistically AND in the context of the data.
• Presentation should be neat, consistent, spell-checked and proofread. All questions should be clearly labelled, and all answers should clearly and concisely address the questions.
• If you convert a Word document to pdf for submission, check that all symbols, equations etc. have converted correctly, i.e. proofread your work.
• All answers must be typed; do not include handwritten/scanned or stylus/tablet-written responses in your document.
• If you do not use knitr to compile your submission, where asked to provide R code, paste relevant code within the assignment document and italicise (or otherwise highlight or distinguish from other content). Do not include code in an appendix. Do not include an appendix at all. Any work included in an appendix will not be marked.
• Please note that referencing text books and other resources is not the goal of this assessment. This work requires students to demonstrate their understanding of the analysis and interpretation, not provide quotes from resources.
• When interpreting output, you are expected to do so in the context of the data and the method (i.e. ensure that you comment on aspects of the method that affect your interpretation with respect to the variables and sample).
• A maximum of 10 marks will be deducted from your total marks for poor presentation.
Marks: Question 1: 25 Question 2: 20 Question 3: 30 Question 4: 25Page 2 of 6Data File:The same data set will be used for all four questions in this assignment.The data file ‘countries.dat’ contains data compiled in 2005 on 6 variables related to‘population health’ indicators of 30 countries from 4 Regions. Region:Irrigated:Population:Under.14:Life.expectancy:Literacy.Rate:Unemployment:The region of the country: Africa, Asia, Europe or South.America.The area of irrigated land (in square kilometres)The population (in millions)The population under 14 (in percent)The life expectancy at birth (in years)The reported literacy rate (in percent)The unemployment rate (in percent) Although you may not find these data to be MVN in Question 1, you should proceed withall analysis requested in Questions 2 to 4 assuming MVN, and comment on this limitationwhere relevant.Question 1 (25 marks):Provide R code, output and written interpretation for parts a) to d) of this question.Provide only output that is directly relevant to address each section.Test for multivariate normality (MVN) by:a) Describe the structure of the ‘countries.dat’ data. (1 mark total)b) Produce (2 marks) and interpret (2 marks) univariate QQ plots and histograms andunivariate Shapiro-Wilks tests of normality for each of the six population healthvariables. Which are the most non-normally distributed variables (1 mark)? (5 markstotal)c) Produce (1 mark) and interpret (1 mark) perspective and contour plots for theUnder.14 and Life.expectancy variables. What is an inherent problem with usingthese plots to assess MVN (1 mark)? (3 marks total)d) Perform the analysis necessary to provide the results of the Mardia, Henze-Zirklerand Royston tests of MVN based on all six population health variables. 
Include in your interpretation: (8 marks total)
i. The Chi-Square QQ plot (1 mark) and interpretation (1 mark)
ii. Describe how the QQ plot is constructed and its relationship to the univariate normal QQ plots (2 marks).
iii. Output and interpretation for the 3 tests (3 marks).
iv. What is a key limitation of these MVN statistical tests (1 mark)?
e) One way to try and meet the MVN assumption could be to remove some of the variables from the multivariate analysis. Which two variables would you choose to remove to see if that helped meet the MVN assumption (do not perform this analysis) (1 mark)? Suggest three additional ways that you might improve univariate and multivariate normality for data sets in general (3 marks). (4 marks total)
f) In part e) we suggested removing some variables to try and help the data approach MVN. Suggest one other reason why reducing the number of variables used in multivariate analysis may be important (this question does not necessarily relate specifically to this particular data set)? (2 marks total)
g) If we were to use some form of transformation on the population health variables to make them meet univariate normality, would this ensure multivariate normality? Briefly discuss (max 50 words). (2 marks total)

Question 2 (20 marks):
For all of Question 2, use only the population health variables: Under.14, Life.expectancy, Literacy.Rate and Unemployment. For the purposes of this assignment, assume that the MVN assumption has been met.
Provide R code, output and written interpretation for parts b) and d) of this question.
a) Is the data balanced or unbalanced across the 4 regions? Discuss including a frequency table from R showing the sample sizes in each Region. (2 marks total)
b) Produce a draftsman display for the population health variables.
Use the function scatterplotMatrix (from week 2) and check the help documentation (?scatterplotMatrix) to help you produce a plot with observations grouped by regional group using different colours, making sure that you include the associated legend. Your plot should not include smoothing, regression lines, or distribution curves in the diagonal panels of the plot (1 mark). Interpret these plots, relating back to the original data where it may add to the interpretation (2 marks). What are the y and x axes on plot [3,2] of the scatterplotMatrix (1 mark)? (4 marks total)
Hint: to move the legend in scatterplotMatrix try something like: legend=list(coords="bottomleft").
c) In the context of MANOVA, list the dependent and independent variables (1 mark) and define the relationship that the MANOVA would test (1 mark). (2 marks total)
d) Using MANOVA in R, test for differences in ‘population health’ between the four country regions. Include tests using all four test statistics covered in this course (2 marks) and interpret output (3 marks). (5 marks total)
e) Which of the four tests used in part d) would be the best to interpret if there are concerns about multivariate normality or covariance equality? (1 mark total)
f) Produce output that specifically compares each of the Regions with each other (you should have 6 comparisons) using Hotelling’s T² test and a significance level of 0.05 (2 marks). Determine the multiple test corrected significance level (1 mark). Do not provide R output; instead reproduce and complete the following table for all comparisons and interpret. How may sample sizes have affected these results and those in part d) (2 marks)? Will deviation from MVN influence these results (1 mark)? (6 marks total)

Comparison: Region 1 | Comparison: Region 2 | Hotelling’s p-value | Significant (Y/N) | Significant after correction (Y/N)

Question 3 (30 marks):
For all of Question 3, use only the population health variables: Under.14, Life.expectancy, Literacy.Rate and Unemployment.
For the purposes of this assignment, assume that the MVN assumption has been met.
Provide R code, output and written interpretation for parts a) to e) of this question.
a) Produce (2 marks) and interpret (2 marks) the correlation and covariance matrices. Explain the difference between these matrices in detail (i.e. explain clearly how the values are adjusted mathematically and the effect of these changes) (2 marks). Would using the covariance matrix in PCA on this data be appropriate (1 mark)? Why (1 mark)? (8 marks total)
b) Perform PCA on the 4 population health variables using the prcomp function. Provide the eigenvalues (1 mark), %variation (1 mark) and scree plot (1 mark). Interpret each of these results (3 marks) and discuss how they influence your decision on how many PCs to interpret from this analysis (2 marks). Remember to keep in mind the overall purpose of PCA. (8 marks total)
c) Interpret (2 marks) the first PC. Include the Z equation (1 mark) and a plot of the loadings on the first PC in your answer (1 mark). (4 marks total)
d) What is the correlation between the first and second PCs and what does this tell you? (2 marks total)
e) Produce (1 mark) and interpret (2 marks) a biplot based on the first 2 PCs. In particular, explain your interpretation of the population health variables in Kenya compared to Hong Kong (1 mark). Relate your interpretation back to the original data (1 mark). (5 marks total)
f) Was this a useful analysis for this data set? Explain with specific reference to the results of your prior analysis in this question. (3 marks total)

Question 4 (25 marks):
For all of Question 4, use only the population health variables: Under.14, Life.expectancy, Literacy.Rate and Unemployment. For the purposes of this assignment, assume that the MVN assumption has been met.
Provide R code, output and written interpretation for parts a), b) and d) of this question.
a) Perform a Factor Analysis using the factanal function.
Initially try using 2 components and apply no rotation. You will get an error message. In order to problem solve this issue and make further decisions about your analysis you will need to have read the additional notes available in the Week 6 block on the Study Desk called “notes on df limiting number of factors.pdf”. Provide your initial line of code, subsequent error message and your final line of code that successfully performs the factanal analysis (2 marks). What did you need to change and why (2 marks)? (4 marks total)
b) From your successful factanal analysis in part a) provide output and interpretation for (8 marks total):
• Variance explained (2 marks)
• Chi-square test (2 marks)
• Variable loadings (2 marks)
• Difference in uniqueness values for the variables FIN and SPS (2 marks)
c) How would your results change if you applied a rotation? Explain your reasoning. (4 marks total)
d) Perform parallel analysis using a seed value of 245 and 500 iterations. Produce the scree plot for the PC results only (1 mark). Discuss how many PCs are recommended by this analysis and use the plot to help you explain these results (2 marks). As part of your explanation provide the values for the 95th percentile for components 1 and 2 (1 mark). (4 marks total)
e) Explain in your own words how the parallel analysis works. (5 marks total)

************** End of Assignment 1 *****************