COMPSCI5100

DEGREES OF MSc, MSci, MEng, BEng, BSc, MA and MA (Social Sciences)

Machine Learning & Artificial Intelligence for Data Scientists

Question 1: Regression (Total marks: 20)

Consider using regression to predict the size of training data used in machine learning models, using the data shown in the following figure:

Figure 1.1. Size of training data used in machine learning models from 1950-2023. Source: Modified from https://ourworldindata.org/grapher/artificial-intelligence-number-training-datapoints

(a) A rescaling method was used to rescale the years to values displayed in Figure 1.2. Describe which rescaling method was used (with enough details of the procedure), and why you think it was applied.

Figure 1.2: Size of training data used in machine learning models with rescaled years.

[4 marks]

(b) Consider fitting the data with a polynomial regression of order 2. Identify the two data points most likely to be poorly fitted (use the years in Figure 1.1 as reference) and explain why. [6 marks]

(c) Outline one advantage and one disadvantage of using this radial basis function over polynomials with the data in Figure 1.1. [4 marks]

(d) Suppose we use the radial basis function in (c), with μk set to xn and s = 10, to fit the data. We used two fitting strategies, namely ridge regression and lasso, and obtained the fitted models shown in Figure 1.3 A and B. Identify which fitting strategy is used in each figure, and explain why and how the chosen fitting method could have generated the result. (Note: each method is used only once.) [6 marks]
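For reference, the two fitting strategies in (d) differ only in their penalty term. The block below is a sketch of the standard objectives. The definition of the basis function from part (c) is not reproduced in this text, so the Gaussian form of h_k is an assumption, as are the symbols t_n (targets), w (weights) and \lambda (regularisation strength).

```latex
% Assumed Gaussian radial basis function, centred at \mu_k with width s
h_k(x) = \exp\left( -\frac{(x - \mu_k)^2}{2 s^2} \right)

% Ridge regression: squared-error loss with an L2 penalty on the weights
\mathcal{L}_{\text{ridge}}(\mathbf{w}) = \sum_{n=1}^{N} \Big( t_n - \sum_{k} w_k h_k(x_n) \Big)^2 + \lambda \sum_{k} w_k^2

% Lasso: the same loss with an L1 penalty on the weights
\mathcal{L}_{\text{lasso}}(\mathbf{w}) = \sum_{n=1}^{N} \Big( t_n - \sum_{k} w_k h_k(x_n) \Big)^2 + \lambda \sum_{k} |w_k|
```

The L2 penalty shrinks all weights smoothly towards zero, whereas the L1 penalty tends to set some weights exactly to zero.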

Question 2: Classification (Total marks: 20)

(a) Assume the following training data in the two-dimensional plane of X1 and X2 is available (Figure 2.1). The target variables for the points in red and blue are +1 and -1, respectively. We summarise the data as the following tuples: <(2,0), 1>, <(0,2), -1>, <(0,-2), -1>, <(3,0), 1>, and <(-1,0), -1>. (Note: you can use LaTeX notation, for example X_1, X_2, \alpha_1, \alpha_2, etc.)

i. Design a k-NN classifier with k=1 and write down the equations that specify the decision boundary between the two classes. [6 marks]

ii. Using the classifier above, determine the class variables C1, C2, and C3 for the following test data points: <(0,0), C1>, <(0.7,0), C2>, and <(0.3,0), C3>. [3 marks]
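An answer to parts (i) and (ii) can be sanity-checked with a brute-force 1-NN sketch. It assumes Euclidean distance and the five training points as listed in part (b)(ii); the names `TRAIN` and `predict_1nn` are illustrative only.

```python
import math

# Training tuples ((X1, X2), target), using the points listed in part (b)(ii).
TRAIN = [((2, 0), 1), ((0, 2), -1), ((0, -2), -1), ((3, 0), 1), ((-1, 0), -1)]


def predict_1nn(point, train=TRAIN):
    """Return the target of the training point nearest to `point` (Euclidean distance)."""
    nearest = min(train, key=lambda item: math.dist(point, item[0]))
    return nearest[1]


# Classify the three test points from part (ii).
for test_point in [(0, 0), (0.7, 0), (0.3, 0)]:
    print(test_point, predict_1nn(test_point))
```

With k=1 the decision boundary is made of pieces of perpendicular bisectors between training points of opposite classes, which is what the equations in part (i) should describe.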

(b) In the same data set in Figure 2.1, we apply a linear SVM model with the predictor y(X1, X2) for classification.

i. Which data points are the support vectors? Write down the equation for y(X1, X2) (Hint: First visually assess the data to determine the decision boundary and the support vectors. Observe the constraints for the margin and SVM classifier.) [4 marks]

ii. Specify the Lagrange multipliers a1, a2, a3, a4, a5 for each of the data points in the training data (2,0), (0,2), (0,-2), (3,0), and (-1, 0), respectively. [5 marks]

iii. Which of the classifiers designed above, k-NN or SVM, will be more accurate? Explain your answer in up to two sentences. [2 marks]

Question 3: Unsupervised learning (Total marks: 20)

Consider using the K-means algorithm to perform clustering on the scenario shown in Figure 3.1 A. We expect to form three clusters, as shown in Figure 3.1 B.

(a) Outline what would happen if we directly apply K-means with Euclidean distance to this data. Can it achieve the clustering objective? How will it split/group the data and why? [3 marks]
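As a reminder of what "K-means with Euclidean distance" does, here is a minimal sketch of Lloyd's algorithm. Figure 3.1 is not reproduced in this text, so the toy points and the initial centroids are hypothetical and only illustrate the mechanics; the function name `kmeans` is likewise illustrative.

```python
import math


def kmeans(points, centroids, n_iter=100):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid (Euclidean distance).
        assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                mean = tuple(sum(coord) / len(members) for coord in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, assignments


# Hypothetical toy data: two compact, well-separated groups.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
final_centroids, labels = kmeans(toy_points, centroids=[(0, 0), (10, 10)])
print(final_centroids, labels)
```

Because every point is assigned to its nearest centroid, the boundary between any two clusters is a straight line (the perpendicular bisector between their centroids), which is the property to consider when judging whether the grouping in Figure 3.1 B is reachable.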

(b) An alternative approach is to use kernel K-means. Would kernel K-means help with this dataset, and why? [2 marks]

(c) An alternative approach is to use mixture models. Would mixture models cluster this dataset better than K-means, and why? [3 marks]

(d) The plot in Figure 3.2 shows some 2D data. PCA is applied to this data. Explain how the first principal component would look if it is overlaid on the plot. Explain your reasoning. (Hint: if you cannot draw, indicate points/axis on the grid)

[2 marks]

(e) Similar to the previous question, explain what the second principal component would look like and why. (Hint: if you cannot draw, you could refer to the x, y-axis for reference) [2 marks]
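For parts (d) and (e), the principal components of 2D data can be computed in closed form from the 2x2 covariance matrix. The sketch below assumes nothing about Figure 3.2, which is not reproduced in this text; the sample points are hypothetical and the function name `principal_components_2d` is illustrative.

```python
import math


def principal_components_2d(points):
    """Return [(variance, unit_direction), ...] ordered by decreasing variance."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / n
    syy = sum((y - mean_y) ** 2 for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2
    radius = math.hypot((sxx - syy) / 2, sxy)
    lam1, lam2 = half_trace + radius, half_trace - radius
    if abs(sxy) < 1e-12:
        # No correlation: the components are the coordinate axes themselves.
        if sxx >= syy:
            dir1, dir2 = (1.0, 0.0), (0.0, 1.0)
        else:
            dir1, dir2 = (0.0, 1.0), (1.0, 0.0)
    else:
        # (sxy, lam - sxx) is an eigenvector for eigenvalue lam when sxy != 0.
        norm1 = math.hypot(sxy, lam1 - sxx)
        norm2 = math.hypot(sxy, lam2 - sxx)
        dir1 = (sxy / norm1, (lam1 - sxx) / norm1)
        dir2 = (sxy / norm2, (lam2 - sxx) / norm2)
    return [(lam1, dir1), (lam2, dir2)]


# Hypothetical points scattered around the line y = x.
sample = [(0, 0.1), (1, 0.9), (2, 2.2), (3, 2.8), (4, 4.1)]
for variance, direction in principal_components_2d(sample):
    print(variance, direction)
```

The first component points along the direction of greatest variance, and the second is orthogonal to it and carries the remaining variance.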

(f) Explain how you would choose the number of clusters in an application of mixture models. [3 marks]

(g) Explain how you could detect an outlier point with mixture models. Write high-level pseudocode and describe each step. (Hint: start with the expectation-maximization algorithm and how it can facilitate the detection.) [5 marks]
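One possible shape for such a procedure, sketched in Python rather than pseudocode: fit a Gaussian mixture by EM, evaluate the mixture density at every point, and flag the points with the lowest density. This is a sketch under stated assumptions, not a model answer. It is restricted to 1D data, the initial means, the initial variance, and the "flag the n lowest-density points" rule are all illustrative choices, and the data is hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, init_means, init_var=1.0, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture by EM, starting from the given component means."""
    means = list(init_means)
    k = len(means)
    variances = [init_var] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = max(sum(joint), 1e-300)  # guard against underflow to zero
            resp.append([j / total for j in joint])
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for j in range(k):
            n_j = sum(r[j] for r in resp)
            if n_j == 0:
                continue  # leave an unused component unchanged
            weights[j] = n_j / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / n_j
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / n_j
            variances[j] = max(var_j, min_var)  # floor to avoid a collapsed component
    return weights, means, variances


def mixture_density(x, params):
    """Evaluate the fitted mixture density p(x) = sum_k w_k N(x | mean_k, var_k)."""
    weights, means, variances = params
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


def lowest_density_points(data, params, n_flag=1):
    """Return the n_flag points that the fitted mixture considers least likely."""
    return sorted(data, key=lambda x: mixture_density(x, params))[:n_flag]


# Hypothetical data: two groups near 0 and 10, plus one distant point.
data = [-0.5, -0.2, 0.0, 0.1, 0.4, 9.6, 9.9, 10.0, 10.2, 10.5, 30.0]
params = fit_gmm_1d(data, init_means=[0.0, 10.0])
print(lowest_density_points(data, params))
```

A fixed density threshold could replace the "n lowest" rule. Either way, an outlier included in the fit inflates the variance of the component that absorbs it, so the cut-off is a judgement call.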