Assignment 3 – CSC2062 “AIDA”

Assignment is marked out of 100 marks. Assignment is worth 30% of the module assessment.
Deadline: 11pm Friday, 30th April 2021.
This version: 2021-03-14.

Introduction

In this assignment, we will use the features you developed in Assignment 2 to solve classification problems using machine learning. Specifically, you will fit classifiers to your image data, in order to build and evaluate useful models that can predict the class labels for unseen images. This assignment is to be completed individually.

This assignment must be completed in R. You may not use Excel for any calculations in the assignment, or for figures (using Excel to construct tables for your report is OK).

Convenient and commonly used machine learning packages are available for R, such as “class”, “caret” and “randomForest”. When you use a procedure that has an element of randomness (e.g. creating cross-validation folds) please use the seed value 42 (your code should give the same results each time it runs). The seed only needs to be set once at the top of each source file: set.seed(42).

You should briefly interpret all results.

Section 1 (30 marks)

For this section, you will make use of the dataset you created for Assignment 2. Your feature file STUDENTNR_features.csv, which was part of your work on Assignment 2, is the starting point for this task. At a minimum, your “features” file should contain the non-custom features specified at the beginning of Assignment 2 (all students should at least have these features; if you do not, then please contact the lecturer, quoting your student number). The file should be placed in the main directory for the assignment.

1.1. Using the nr_pix feature only, and all 168 items, fit a single logistic regression model to predict the probability of belonging to the “math symbol” category of images.
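
As an illustration of this kind of fit, here is a minimal sketch on synthetic data, not on the assignment's feature file. The column name nr_pix comes from the brief; the 0/1 label `is_math` and the numbers used to generate the data are assumptions made up for the example.

```r
set.seed(42)  # seed value requested in the brief

# Synthetic stand-in for the feature file: 168 items, one predictor.
# `is_math` is a hypothetical 0/1 label (1 = math symbol); it is not a
# column taken from the real data.
n <- 168
nr_pix <- round(runif(n, min = 20, max = 120))
is_math <- rbinom(n, size = 1, prob = plogis(2 - 0.04 * nr_pix))
doodles <- data.frame(nr_pix = nr_pix, is_math = is_math)

# Logistic regression: a binomial GLM with the default logit link.
fit <- glm(is_math ~ nr_pix, data = doodles, family = binomial)

# Coefficient table with columns for the estimate, standard error,
# z value and p-value of each term.
coef_table <- summary(fit)$coefficients
print(coef_table)

# Fitted probability of the positive class for every item.
p_hat <- predict(fit, type = "response")
```

With real data, the data frame would come from read.csv() on the feature file, and the label would need to be derived from whichever column holds the class.
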
Present the results table for the logistic regression, including the coefficient estimates, the z-scores and associated p-values. Briefly visualise and interpret the results of the logistic regression.

1.2. Use the same model as in 1.1 as a classifier to predict whether an item is a math symbol or not. Use a decision threshold of 0.5. Evaluate the model using 5-fold cross-validation (there is no need to create a separate testing set; perform cross-validation over all 168 items). Report the accuracy, true positive rate, false positive rate, precision, recall and F1-score for the cross-validated model (hint: savePredictions = T in trainControl() in the caret package may be useful).

1.3. Plot an ROC curve for the classifier.

Section 2 (30 marks)

In this section, you will perform 3-way classification for letters, numbers and digits.

2.1. Perform k-nearest-neighbour classification with all odd values of k between 1 and 25 (inclusive) using the first 6 features in the “*_features.csv” file (note that the first 2 columns are just the label and the index; the features start from the third column). It is recommended that you use knn from the class package for this section. Report the accuracy over the full set of 168 items for each value of k (use all 168 items in this subsection as training data and do not worry in this subsection about overfitting to the training data; i.e., do not use cross-validation).

2.2. Perform k-nearest-neighbour classification with all odd values of k between 1 and 25 (inclusive), using 5-fold cross-validation, using the same 6 features as in 2.1. Report the cross-validated accuracy for each value of k. Create a figure similar to FIGURE 2.17 of the ISLR textbook, showing the classification error rate (or accuracy rate) over the training set and the cross-validated classification error rate (or accuracy rate) for each value of 1/k.
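
The comparison between training and cross-validated error in 2.1 and 2.2 can be sketched roughly as follows. This is an illustration on synthetic three-class data with six made-up numeric features, not the assignment's data. The hand-rolled fold assignment and the pooling of correct predictions across folds are assumptions of this sketch; caret's resampling would be an alternative.

```r
library(class)  # provides knn()

set.seed(42)

# Synthetic stand-in: 3 classes of 56 items, 6 numeric features whose
# mean depends on the class (0, 1 or 2).
cls <- factor(rep(c("a", "b", "c"), each = 56))
n <- length(cls)
class_mean <- as.integer(cls) - 1
x <- matrix(rnorm(n * 6, mean = rep(class_mean, times = 6)), ncol = 6)

ks <- seq(1, 25, by = 2)  # odd k from 1 to 25

# Training accuracy: classify the same items used as neighbours.
# At k = 1 each item is its own nearest neighbour, so accuracy is 1.
train_acc <- sapply(ks, function(k) {
  pred <- knn(train = x, test = x, cl = cls, k = k)
  mean(pred == cls)
})

# 5-fold cross-validated accuracy, pooled over the held-out folds.
fold <- sample(rep(1:5, length.out = n))
cv_acc <- sapply(ks, function(k) {
  correct <- 0
  for (f in 1:5) {
    held_out <- fold == f
    pred <- knn(train = x[!held_out, , drop = FALSE],
                test = x[held_out, , drop = FALSE],
                cl = cls[!held_out], k = k)
    correct <- correct + sum(pred == cls[held_out])
  }
  correct / n
})

# Error rate against 1/k on a log axis: flexibility grows to the right.
plot(1 / ks, 1 - train_acc, type = "b", log = "x", ylim = c(0, 1),
     xlab = "1/k", ylab = "Error rate")
lines(1 / ks, 1 - cv_acc, type = "b", lty = 2)
legend("topright", legend = c("Training", "5-fold CV"), lty = c(1, 2))
```

The same fold vector is reused for every k, so differences between values of k are not confounded with differences in how the data were split.
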
Briefly interpret the results of 2.1 and 2.2 with reference to your graph.

Section 3 (40 marks)

Larger sets of doodle data have been created for you. You can download the data at the URL that will be emailed to you. These data are for your use only and should not be shared with others (to use other people’s data will be considered collusion).

These data consist of a dataset of 100 training items for each of the 21 doodle types (2100 training items in total). The features are in the same format as Assignment 2. The custom feature has been omitted (not needed for the assignment).

In this section, you are to perform classification with respect to the 21 image categories.

3.1. Perform classification with random forests using 5-fold cross-validation. Calculate multiple random forest solutions using number of trees, Nt, between 25 and 400 (in increments of 25) and number of predictors considered at each node, Np = {2, 4, 6, 8}. Find the combination of tree number (Nt) and predictor number (Np) giving the best cross-validated accuracy (this is called a “grid search” of the two hyper-parameters, number of trees and number of predictors). Briefly visualise, explain and interpret the results for this set of models.

3.2. Random forests and cross-validation have an element of randomness, so let’s see how variable the accuracy is across different independent runs. For the best model in 3.1 (i.e. best values of tree number and predictor number) refit the model 15 times, to obtain 15 cross-validated accuracy scores. Report the mean and standard deviation of the accuracies.

3.3. Build the best model you can to predict the 21 image categories. You may use any method (knn, random forest, or other) and any features, justifying your choices. You should fit at least two models in this section, evaluating using 5-fold cross-validation.
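
The grid search of 3.1 and the repeated refits of 3.2 can be sketched roughly as follows. This is a small illustration on synthetic data with three classes and eight made-up features, using a deliberately tiny grid and only five repeats to keep it quick. The helper `cv_accuracy` and the hand-rolled folds are hypothetical choices for this sketch, not part of the brief.

```r
library(randomForest)

set.seed(42)

# Synthetic stand-in: 3 classes, 8 numeric features (the real task has
# 21 classes and a different feature set).
n <- 150
y <- factor(rep(c("a", "b", "c"), each = n / 3))
x <- as.data.frame(matrix(rnorm(n * 8, mean = rep(as.integer(y), times = 8)),
                          ncol = 8))

# 5-fold cross-validated accuracy for one (ntree, mtry) pair, given a
# vector assigning each item to a fold.
cv_accuracy <- function(ntree, mtry, fold) {
  correct <- 0
  for (f in sort(unique(fold))) {
    held_out <- fold == f
    fit <- randomForest(x = x[!held_out, ], y = y[!held_out],
                        ntree = ntree, mtry = mtry)
    pred <- predict(fit, newdata = x[held_out, ])
    correct <- correct + sum(pred == y[held_out])
  }
  correct / length(fold)
}

# Grid search: every combination of Nt and Np, scored on the same folds.
fold <- sample(rep(1:5, length.out = n))
grid <- expand.grid(Nt = c(25, 50), Np = c(2, 4))
grid$accuracy <- mapply(cv_accuracy, grid$Nt, grid$Np,
                        MoreArgs = list(fold = fold))
best <- grid[which.max(grid$accuracy), ]  # first row wins on ties
print(grid)

# Repeated refits of the best cell, each with freshly drawn folds, to
# see how much the accuracy varies from run to run.
repeat_acc <- replicate(5, cv_accuracy(best$Nt, best$Np,
                                       sample(rep(1:5, length.out = n))))
cat("mean:", mean(repeat_acc), "sd:", sd(repeat_acc), "\n")
```

Because both the forests and the folds are random, differences between grid cells that are smaller than the run-to-run standard deviation should be read with caution.
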
Report the best model that you have found. Briefly discuss your model’s performance, and any further experiments you might do to further develop your classifier.

Assessment criteria and marking process

The most important criterion in marking is the quality and clarity of your report, including the correctness and accuracy of your models (approximately 75% weighting). In your report, you should demonstrate that you understand the methods used in each sub-task. Explain your reasoning, assumptions and steps of the procedures used. You should explain and interpret your results. What are your results telling you? Are the results what you would expect? If you ran into difficulties, explain what they were and the efforts you made to try to overcome them.

Code has a weighting in marking of approximately 15%. Your code should be clear and logically organised, and do what is required, but code efficiency and code sophistication are not important (this assignment does not require complex programming). Logical organisation of code includes appropriate use of variables, iteration, functions, etc., rather than repetition of the same steps with “hard-coded” values. If you use freely licensed code, packages, or libraries (which is encouraged), these should be appropriately referenced (e.g. by citing a URL in a comment). The code must be easy to use and the comments must include information about the required steps to replicate the results that you have obtained and are presenting in your report (transparency and replicability are essential in data analysis). Do not upload unnecessary code (e.g. the entire codebase of some third-party library you are using).

Attention to detail and following the assignment instructions accurately will also be considered in marking (approximately 10% weighting). Each sub-task has a precise specification.
Make sure you carefully follow the instructions, and use the features specified for each task, and the specified procedures (number of cross-validation folds, seed value, etc). Make sure you upload your deliverable files in the specified formats.

Deliverables

You must submit your assignment online, using Canvas, by 11pm Friday, 30th April 2021.

The online uploaded file must be a ZIP file called assignment3_STUDENTNR.zip, containing multiple files and directories. The contents of the zip file are specified below (code is a folder; all other entries are files):

• STUDENTNR_assignment3_report.pdf
• STUDENTNR_features.csv
• STUDENTNR_2100items_features.csv
• code
  o section_1.r
  o section_2.r
  o section_3.r

The current working directory should be the location of the source file. Your code should use relative paths; i.e. it should read the training data from “../doodle_data”.

A RAR file is not a ZIP file. A broken or corrupt ZIP file is not a ZIP file. It is your responsibility to ensure the assignment is uploaded and double-checked before the deadline.

Please use the provided report template for preparing your report (or create an equivalent LaTeX format). Ensure that the header and footer information (student name, student number) is clearly visible on the printout. The word limit for the report is 4000 words (excluding tables and figures).

By submitting this assignment you acknowledge that it is your own work and that you are aware of university regulations regarding academic offences, including (but not restricted to) plagiarism and collusion.

Standard university penalties apply for late submission.