ITECH 2303 – Data Analytics
Semester 1-2022
Final Test 2
Writing Time: 60 minutes Total Marks: 30
Reading time: 0 minutes
This is an open book test worth 30% of the marks for the course. Accessing internet resources during the test is permitted. Write your answers in the Word file and, when you have finished, upload the file to the Final Test Submission Link (TurnItIn) in Moodle within 60 minutes of your download.
Student Name: Student ID: Name: Date:
Answer all questions.
Question 1
A data scientist runs the following lines of Python code. Identifier names may be misleading. Add comments, in your own words, that succinctly describe what each line of code does. Include in your comments a description of the arguments passed.
- import seaborn as pd
- print(df.dtypes)
- lm.fit(df[['highway-mpg']], df['price'])
- df.loc[10:25, ['rank', 'age', 'points']]
- nnn.regplot(x='age', y='salary', data=f)
- lre.score(x_train[['horsepower']], y_train)
[1*6=6 marks]
Question 2
Name and describe the chart above. How would you describe the relationship between the variable depicted on the horizontal axis and the one on the vertical axis? Explain how well you would expect a linear regression model to predict values on the vertical axis from those on the horizontal axis.
[6 marks]
Question 3
A data scientist aims to train a machine learning model that will predict whether a patient who attends a hospital's emergency department is likely to be suffering from the effects of a heatwave. She extracts data on 30 features from the hospital records of 500,000 patients over the last 20 years, of whom 2,000 have conditions caused by heatwaves. She trains a neural network model for 2,000 epochs and is frustrated that the model correctly classifies only 500 of the patients with heatwave-related conditions. Provide an explanation for the poor predictive accuracy. Include in your answer what you would do to rectify the problem.
[6 marks]
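As a rough aid to reasoning about the figures quoted in Question 3, the sketch below works out the proportion of heatwave-related cases and the accuracy of a hypothetical baseline that labels every patient as not heatwave-related. The variable names are illustrative, and the class-weight formula shown (n_samples / (n_classes * class_count)) is one common heuristic, not necessarily the remedy the question expects.

```python
# Figures taken from the question text
total_patients = 500_000
heatwave_patients = 2_000

# Share of records belonging to the heatwave-related class
positive_rate = heatwave_patients / total_patients

# Hypothetical baseline: predict "not heatwave-related" for every patient
baseline_accuracy = (total_patients - heatwave_patients) / total_patients

# "Balanced" class weights: n_samples / (n_classes * class_count)
n_classes = 2
weight_negative = total_patients / (n_classes * (total_patients - heatwave_patients))
weight_positive = total_patients / (n_classes * heatwave_patients)

print(f"positive rate:     {positive_rate:.2%}")
print(f"baseline accuracy: {baseline_accuracy:.2%}")
print(f"class weights:     negative={weight_negative:.3f}, positive={weight_positive:.1f}")
```

The point of the arithmetic is that a model can score very high overall accuracy on data this skewed while still missing most of the rare class.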
Question 4
The presence of missing values in data can reduce the effectiveness of a data analytics exercise. Provide a brief explanation of the approaches that can be applied to deal with missing values. Include in your answer possible sources of missing values and an indication of the advantages and disadvantages of each approach to deal with them. Use examples to illustrate your answer.
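For illustration only, two of the approaches Question 4 alludes to (deleting incomplete rows and imputing a column mean) might be sketched in pandas as follows. The DataFrame and its column names are hypothetical, and other approaches are not shown.

```python
import pandas as pd

# Hypothetical records; None marks a missing value
df = pd.DataFrame({
    "age": [34, None, 51, 29],
    "salary": [52000.0, 61000.0, None, 48000.0],
})

# Approach 1: listwise deletion, drop every row that has any missing value
dropped = df.dropna()

# Approach 2: mean imputation, fill each gap with that column's mean
imputed = df.fillna(df.mean())

print(dropped)
print(imputed)
```

Deletion keeps only fully observed rows at the cost of sample size, whereas mean imputation keeps every row but reduces the variance of the imputed column.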
Question 5
The first step in 2-means clustering is to randomly assign 2 cluster centres, as represented in this figure. Each iteration of the k-means algorithm moves the cluster centres. Illustrate where you expect the two cluster centres to be after the next iteration. Explain why.
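A single iteration of the kind Question 5 describes can be sketched as below: each point is assigned to its nearest centre, then each centre moves to the mean of its assigned points. The points and starting centres are made up for illustration and do not correspond to the figure; `kmeans_step` is a hypothetical helper, not a library function.

```python
def kmeans_step(points, centres):
    """Run one k-means iteration: assign points, then recompute centres."""
    clusters = [[] for _ in centres]
    for p in points:
        # Squared Euclidean distance from this point to every centre
        dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centres]
        clusters[dists.index(min(dists))].append(p)

    new_centres = []
    for members, old in zip(clusters, centres):
        if members:
            mean_x = sum(x for x, _ in members) / len(members)
            mean_y = sum(y for _, y in members) / len(members)
            new_centres.append((mean_x, mean_y))
        else:
            # Empty cluster: keep the old centre (one common convention)
            new_centres.append(old)
    return new_centres


# Hypothetical 2-D data with two visible groups
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centres = [(0.0, 0.0), (5.0, 5.0)]  # arbitrary starting centres

new_centres = kmeans_step(points, centres)
print(new_centres)
```

With these made-up values, each centre moves to the middle of the group of points closest to it.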