The intention of this project is to allow students to apply the data mining techniques they learnt in class to a real business or economic classification problem.

The response variable is either 1 or 0, i.e., an observation either belongs to a certain group or does not. The main objective of this project is to develop a useful model that classifies observations into the 1 or 0 group based on the predictor variables.

Project Report 1 will include problem definition, data collection, model building and analysis using XLMiner, and a conclusion. The report should be typed, double-spaced, and presented in a professional manner. In addition to the correctness of the analysis, presentation and organization also count.

Format of Project Report 1:

1) Author’s name and e-mail address

2) The title and purpose of your study.

Discuss why you are examining this topic. Clearly state the problems to be investigated and the objective of the study.

3) A list of variables, with the sources (URL web addresses) of your data clearly stated.

* Define the target variable, and use at least 4 predictor variables (x1, x2, x3, x4).

* If you are collecting your own data and it is hard to collect a large number of observations, a sample size of at least 30 to 100 is acceptable.

If the dataset is from a data science website (e.g., www.kaggle.com), the number of observations may be much larger. If the dataset is too large, you can select a subset of it (e.g., a thousand or several hundred observations) for your study.

4) Analysis.

Training and validation data: Divide the data randomly into training (60%) and validation (40%) partitions. The training data is used to develop a model, and the model's usefulness is then tested on the validation data. (Note: If the sample size is small, you can consider partitioning the data into 80% training and 20% validation so that there are at least 20 observations in the training data to build the model.)
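The random partition described above can be sketched in a few lines of Python. This is only an illustration of the idea and not a substitute for the XLMiner partitioning step. The function name `partition`, the seed value, and the placeholder list of 100 records are assumptions made for this example.

```python
import random


def partition(rows, train_fraction=0.6, seed=42):
    """Randomly split rows into (training, validation) lists.

    train_fraction=0.6 gives the 60%/40% split; use 0.8 for a small sample.
    The seed is arbitrary and only makes the split reproducible.
    """
    shuffled = list(rows)                  # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)  # random ordering of the observations
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder data: 100 observation IDs standing in for real records.
records = list(range(100))
train, valid = partition(records)
print(len(train), len(valid))

# Small-sample variant: 30 observations split 80%/20%.
small_train, small_valid = partition(list(range(30)), train_fraction=0.8)
print(len(small_train), len(small_valid))
```

Fixing the seed means the same partition can be reproduced when the analysis is rerun, which keeps the training and validation results comparable across candidate models.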

Model building:

Use the training data to develop a classification model using the logistic regression technique.

You can start with the full model with all the X’s. Based on the p-values, remove insignificant predictors. (You can use 20% or 30% as the significance level.) Try a few models and select the one that you think is best overall, based on the classification confusion matrix and metrics such as the false positives, false negatives, and overall percentage error, as well as the accuracy, sensitivity, and specificity of the model. Comment on the model performance for both the training and validation data.
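The metrics listed above all follow from the four cells of the confusion matrix. The Python sketch below shows one way to compute them by hand, for example to check the figures that XLMiner reports. The function name `classification_metrics` and the short label lists are hypothetical and are not taken from any real dataset.

```python
def classification_metrics(actual, predicted):
    """Compute confusion-matrix counts and summary metrics for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    total = tp + tn + fp + fn

    def ratio(numerator, denominator):
        # Return None instead of failing when a class is absent.
        return numerator / denominator if denominator else None

    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": ratio(tp + tn, total),
        "error_rate": ratio(fp + fn, total),   # overall percentage error
        "sensitivity": ratio(tp, tp + fn),     # share of actual 1s classified as 1
        "specificity": ratio(tn, tn + fp),     # share of actual 0s classified as 0
    }


# Illustrative labels only (made up for this example).
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(actual, predicted)
print(metrics)
```

Running the same function on the training predictions and on the validation predictions gives the two sets of figures to compare. A model that does much better on training data than on validation data may be overfitting.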

5) Summary and Conclusion

Write a summary, conclusion, and insights about the data you studied, based on the analysis you have done.
