You have just been hired as a Senior Business Analyst at Bank of Universe (BOU). BOU is a private company that was founded twenty years ago and now has more than 5,000 employees. Twenty years ago, BOU did all the research, chose the insurance company, and picked the plan options for its employees. The new CEO, Mr. Buffet, wants to make changes and offer a self-funded Health Plan (SHP) starting next year. An SHP is cheaper for BOU because, by taking on some of the risk itself, BOU does not have to pay a separate insurance carrier. BOU has received several years’ worth of medical costs in the file insurance.csv from the current insurance carrier.

Data source: https://www.kaggle.com/mirichoi0218/insurance/home

It contains the following columns:

· age: age of primary beneficiary

· sex: insurance contractor gender, female, male

· bmi: body mass index, an objective measure of body weight relative to height (weight divided by height squared, kg / m ^ 2) that indicates whether weight is relatively high or low for a given height; ideally 18.5 to 24.9

· children: Number of children covered by health insurance / Number of dependents

· smoker: whether the beneficiary smokes, yes, no

· region: the beneficiary’s residential area in the US, northeast, southeast, southwest, northwest.

· charges: individual medical costs billed by health insurance

Assignment Tasks

You are asked to perform the following tasks by writing a script in R. Submit both your R code and a Word document.

1. Load the dataset insurance.csv into memory.

2. Convert the following predictors to factors using the function factor():

a. sex

b. smoker

c. region
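Tasks 1 and 2 can be sketched as follows. This is a minimal illustration, not a reference solution: the four data rows are made up for the example (they are not taken from the real file) and are written to a temporary CSV only so the snippet runs without insurance.csv. The names csv_path and insurance are illustrative choices.

```r
# Illustrative rows only (made up, NOT from the real dataset). They are written
# to a temporary CSV so the example runs without insurance.csv being present.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "age,sex,bmi,children,smoker,region,charges",
  "25,female,22.5,0,no,northeast,3200.50",
  "41,male,30.1,2,yes,southeast,27500.00",
  "33,male,27.8,1,no,southwest,5100.75",
  "58,female,35.2,3,no,northwest,13400.20"
), csv_path)

# Task 1: load the data (with the real file: read.csv("insurance.csv"))
insurance <- read.csv(csv_path)

# Task 2: convert the categorical predictors to factors
insurance$sex    <- factor(insurance$sex)
insurance$smoker <- factor(insurance$smoker)
insurance$region <- factor(insurance$region)

# Inspect the column types to confirm the conversion
str(insurance)
```

With the real data, only the read.csv() call changes; the factor() conversions stay the same.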

3. Build a multiple linear regression model.

a. Perform multiple linear regression with charges as the response and age, sex, bmi, children, smoker, and region as the predictors. Print out the results using the summary() function.

b. Is there a relationship between the predictors and the response?

c. Does sex have a statistically significant relationship to the response?

d. Perform best subset selection using the bestglm() function based on BIC. What’s the best model based on BIC?

e. Compute the test error of the best model in #3d based on BIC using LOOCV.

f. Calculate the test error of the best model in #3d based on BIC using 10-fold CV.
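One possible shape for task 3 is sketched below. It runs on a synthetic stand-in for insurance.csv (hypothetical values with the same column names), so the numbers it prints say nothing about the real data. It assumes cv.glm() from the boot package for both LOOCV and 10-fold CV, and the formula used for fit_best is a placeholder that should be replaced with whatever predictors bestglm() actually selects.

```r
library(boot)  # provides cv.glm(); boot ships with standard R distributions

# Synthetic stand-in for insurance.csv (hypothetical values, same column names)
set.seed(1)
n <- 300
insurance <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
insurance$charges <- 250 * insurance$age + 300 * insurance$bmi +
  20000 * (insurance$smoker == "yes") + rnorm(n, sd = 3000)

# 3a: full model. The overall F-statistic speaks to 3b, and the p-value on the
# sex coefficient speaks to 3c.
fit_full <- lm(charges ~ age + sex + bmi + children + smoker + region,
               data = insurance)
print(summary(fit_full))

# 3d: best subset by BIC. bestglm() expects the response as the LAST column.
# Guarded so the rest of the sketch still runs if bestglm is not installed.
if (requireNamespace("bestglm", quietly = TRUE)) {
  best <- bestglm::bestglm(insurance, IC = "BIC")
  print(best$BestModel)
}

# 3e and 3f: refit the chosen model with glm() so cv.glm() can be used.
# Placeholder formula: substitute the predictors selected in 3d.
fit_best <- glm(charges ~ age + bmi + smoker, data = insurance)

loocv_mse <- cv.glm(insurance, fit_best)$delta[1]          # K defaults to n (LOOCV)
set.seed(1)                                                # folds are random
cv10_mse  <- cv.glm(insurance, fit_best, K = 10)$delta[1]  # 10-fold CV

print(c(LOOCV = loocv_mse, CV10 = cv10_mse))
```

The first element of delta is the raw cross-validation estimate of prediction error (mean squared error for a Gaussian glm()).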

4. Build a random forest model using function randomForest(), where charges is the response and the predictors are age, sex, bmi, children, smoker, and region.

a. Split the dataset into a training set containing 80% of the original data and the test set containing the remaining 20%.

b. Compute the test error using the test data set.

c. Extract variable importance measure using the importance() function.

d. Plot the variable importance using the varImpPlot() function. Which are the 3 most important predictors in this model?
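A rough sketch of task 4 follows, assuming the randomForest package is installed. The data are a synthetic stand-in for insurance.csv (hypothetical values), so the importance ranking shown here is not the answer for the real dataset. The seed and the names train_idx, train, test, and rf_fit are illustrative.

```r
library(randomForest)

# Synthetic stand-in for insurance.csv (hypothetical values, same column names)
set.seed(1)
n <- 300
insurance <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
insurance$charges <- 250 * insurance$age + 300 * insurance$bmi +
  20000 * (insurance$smoker == "yes") + rnorm(n, sd = 3000)

# 4a: 80/20 train/test split
train_idx <- sample(nrow(insurance), size = round(0.8 * nrow(insurance)))
train <- insurance[train_idx, ]
test  <- insurance[-train_idx, ]

# Fit the forest on the training set; importance = TRUE stores both measures
rf_fit <- randomForest(charges ~ age + sex + bmi + children + smoker + region,
                       data = train, importance = TRUE)

# 4b: test error as mean squared error on the held-out 20%
rf_pred <- predict(rf_fit, newdata = test)
rf_mse  <- mean((test$charges - rf_pred)^2)
print(rf_mse)

# 4c and 4d: importance table and plot
print(importance(rf_fit))
varImpPlot(rf_fit)
```

For regression forests, importance() reports %IncMSE and IncNodePurity; either column can be used to rank predictors, and the two rankings need not agree.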

5. Build a support vector machine model

a. Split the dataset into a training set containing 80% of the original data and the test set containing the remaining 20%.

b. The response is charges and the predictors are age, sex, bmi, smoker, and region. Please use the svm() function with a radial kernel, gamma = 5, and cost = 50.

c. Perform a grid search on the training dataset to find the best model, using a radial kernel with candidate cost values 1, 10, 50, and 100 and candidate gamma values 1, 3, and 5.

d. Print out the model results. What are the best model parameters?

e. Forecast charges using the test dataset and the best model found in #5c.

f. Get the true observations of charges in the test dataset.

g. Compute the MSE (Mean Squared Error) on the test data.
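Task 5 could look roughly like the sketch below, assuming the e1071 package is installed. It uses svm() for the fixed-parameter fit and tune() for the grid search; tune() runs 10-fold cross-validation by default. The data are a synthetic stand-in for insurance.csv (hypothetical values), so the selected parameters here are not the answer for the real dataset.

```r
library(e1071)

# Synthetic stand-in for insurance.csv (hypothetical values, same column names)
set.seed(1)
n <- 300
insurance <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
insurance$charges <- 250 * insurance$age + 300 * insurance$bmi +
  20000 * (insurance$smoker == "yes") + rnorm(n, sd = 3000)

# 5a: 80/20 train/test split
train_idx <- sample(nrow(insurance), size = round(0.8 * nrow(insurance)))
train <- insurance[train_idx, ]
test  <- insurance[-train_idx, ]

# 5b: single SVM with the parameters given in the task
svm_fit <- svm(charges ~ age + sex + bmi + smoker + region, data = train,
               kernel = "radial", gamma = 5, cost = 50)

# 5c: grid search over cost and gamma on the training set
set.seed(1)  # tune() assigns cross-validation folds at random
tuned <- tune(svm, charges ~ age + sex + bmi + smoker + region, data = train,
              kernel = "radial",
              ranges = list(cost = c(1, 10, 50, 100), gamma = c(1, 3, 5)))

# 5d: results for every grid point, then the winning combination
print(summary(tuned))
print(tuned$best.parameters)

# 5e to 5g: predict on the test set with the best model and compute the MSE
svm_pred <- predict(tuned$best.model, newdata = test)
observed <- test$charges
svm_mse  <- mean((observed - svm_pred)^2)
print(svm_mse)
```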

6. Perform the k-means cluster analysis.

a. Remove sex, smoker, and region, since they are not numeric variables.

b. Determine the optimal number of clusters. Justify your answer. This step may take a while to run because the dataset is large.

c. Perform k-means clustering using 3 clusters.

d. Visualize the clusters in different colors.
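Task 6 can be approached with base R alone, as in this sketch. Two choices here are assumptions rather than requirements of the task: the variables are standardized with scale() before clustering so that charges does not dominate the distance calculation, and the number of clusters is judged with an elbow plot of the total within-cluster sum of squares. The data are a synthetic stand-in for insurance.csv (hypothetical values).

```r
# Synthetic stand-in for insurance.csv (hypothetical values, same column names)
set.seed(1)
n <- 300
insurance <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
insurance$charges <- 250 * insurance$age + 300 * insurance$bmi +
  20000 * (insurance$smoker == "yes") + rnorm(n, sd = 3000)

# 6a: keep only the numeric columns
num_data <- insurance[, !(names(insurance) %in% c("sex", "smoker", "region"))]

# Assumption: standardize so every variable contributes on the same scale
scaled <- scale(num_data)

# 6b: elbow method, total within-cluster sum of squares for k = 1..10
set.seed(1)
wss <- sapply(1:10, function(k) kmeans(scaled, centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")

# 6c: k-means with 3 clusters, as the task specifies
km <- kmeans(scaled, centers = 3, nstart = 20)

# 6d: color each observation by its cluster assignment
plot(num_data$age, num_data$charges, col = km$cluster, pch = 19,
     xlab = "age", ylab = "charges", main = "k-means clusters (k = 3)")
```

The elbow is the value of k after which adding clusters stops reducing the within-cluster sum of squares by much; nstart = 20 reruns the algorithm from several random starts to avoid a poor local solution.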

7. Build a neural network model.

a. Remove sex, smoker, and region, since they are not numeric variables.

b. Standardize the inputs using the scale() function.

c. Convert the standardized inputs to a data frame using the as.data.frame() function.

d. Split the dataset into a training set containing 80% of the original data and the test set containing the remaining 20%.

e. The response is charges and the predictors are age, bmi, and children. Please use 1 hidden layer with 1 neuron.

f. Plot the neural network.

g. Forecast the charges in the test dataset.

h. Get the observed charges of the test dataset.

i. Compute test error (MSE).
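A possible outline for task 7 is given below, assuming the neuralnet package is installed. The task does not name a package, so neuralnet() and compute() are one choice among several. The sketch standardizes all four numeric columns, including charges, so the resulting MSE is on the standardized scale. The data are a synthetic stand-in for insurance.csv (hypothetical values).

```r
library(neuralnet)

# Synthetic stand-in for insurance.csv (hypothetical values, same column names)
set.seed(1)
n <- 300
insurance <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
insurance$charges <- 250 * insurance$age + 300 * insurance$bmi +
  20000 * (insurance$smoker == "yes") + rnorm(n, sd = 3000)

# 7a: keep only the numeric columns
num_data <- insurance[, !(names(insurance) %in% c("sex", "smoker", "region"))]

# 7b and 7c: standardize, then convert the resulting matrix to a data frame
scaled <- as.data.frame(scale(num_data))

# 7d: 80/20 train/test split
train_idx <- sample(nrow(scaled), size = round(0.8 * nrow(scaled)))
train <- scaled[train_idx, ]
test  <- scaled[-train_idx, ]

# 7e: one hidden layer with one neuron; linear.output = TRUE for regression
nn <- neuralnet(charges ~ age + bmi + children, data = train,
                hidden = 1, linear.output = TRUE)

# 7f: plot the fitted network
plot(nn)

# 7g to 7i: predict on the test set, collect observed values, compute the MSE
nn_pred  <- compute(nn, test[, c("age", "bmi", "children")])$net.result
observed <- test$charges
nn_mse   <- mean((observed - nn_pred)^2)
print(nn_mse)
```

To report the error in original currency units, the predictions and observations would need to be transformed back using the mean and standard deviation of charges before computing the MSE.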