NIT6160: Data Warehousing and Mining
The goal of this project is to apply classification and clustering methods to the Mushroom data set. For detailed information about the data set, refer to the UCI Machine Learning Repository, maintained by the University of California, Irvine.
Task 1: Data Pre-processing
1. Read the xlsx file with pandas.read_excel.
2. For the clustering experiments, the column for class labels needs to be removed.
3. Clean the data: for example, replace missing values, normalize attribute ranges, and convert numerical or string values to nominal values.
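The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed solution: the file name mushrooms.xlsx, the label column name "class", and the "?" missing-value marker are all assumptions about how the spreadsheet is laid out, and the toy frame stands in for the real data so the sketch runs on its own. Filling with the most frequent value and one-hot encoding are only one possible choice of cleaning strategy.

```python
import numpy as np
import pandas as pd


def preprocess(df, label_col="class", missing_marker="?"):
    """Return (X, y): one-hot encoded features and the class labels.

    label_col and missing_marker are assumptions about the file layout.
    """
    # Treat the marker as a real missing value.
    df = df.replace(missing_marker, np.nan)
    # Fill each incomplete column with its most frequent value (the mode).
    for col in df.columns:
        if df[col].isna().any():
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    y = df[label_col]
    # Remove the label column; the clustering experiments must not see it.
    features = df.drop(columns=[label_col])
    # One-hot encode the nominal attributes; every resulting column is 0/1,
    # so no further range normalization is needed.
    X = pd.get_dummies(features).astype(float)
    return X, y


# With the real file (hypothetical name):
# df = pd.read_excel("mushrooms.xlsx")

# Toy stand-in so the sketch is runnable without the spreadsheet.
toy = pd.DataFrame({
    "class": ["e", "p", "e", "p"],
    "odor": ["n", "f", "n", "f"],
    "stalk-root": ["b", "?", "b", "c"],
})
X, y = preprocess(toy)
print(X.columns.tolist())
```

For the classification experiments, keep y alongside X; for the clustering experiments, pass X only.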
Task 2: Data Mining
For the experiments below, try different parameter settings to fine-tune the results.
- Classification experiments: Construct classifiers on the mushroom data set. Randomly split the data set into a training set and a test set (80% vs. 20%). Select at least one classifier from each of the following categories of classifiers: tree-based models, Bayes classifiers, and rule-based classifiers. Compare the results of the chosen classifiers.
- Clustering experiments: Use the k-Means clustering algorithm to cluster the data samples and visualize the results.
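Both experiments can be sketched with scikit-learn, which is an assumption here since the brief does not name a toolkit. The sketch uses a small synthetic table in place of the cleaned mushroom data, so its accuracies say nothing about the real data set. It covers a tree-based model and a Bayes classifier only; scikit-learn has no built-in rule-based learner, so that category would need a separate library. The PCA projection and the plot_clusters helper are one hypothetical way to visualize the clusters, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned data (NOT the real mushroom table):
# "odor" follows the class exactly, "cap-color" is pure noise.
rng = np.random.default_rng(0)
n = 400
y = rng.choice(["e", "p"], size=n)
raw = pd.DataFrame({
    "odor": np.where(y == "e", "n", "f"),
    "cap-color": rng.choice(["b", "g", "w"], size=n),
})
X = pd.get_dummies(raw).astype(float)

# Classification: random 80% / 20% split, then compare two classifiers.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
classifiers = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": BernoulliNB(),
}
results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: test accuracy = {results[name]:.3f}")

# Clustering: k-Means on the features only (no class labels).
cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Project to two dimensions so the clusters can be drawn.
coords = PCA(n_components=2).fit_transform(X)


def plot_clusters(coords, labels, path="clusters.png"):
    """Save a scatter plot of the 2-D projection, coloured by cluster."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display required
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="viridis", s=12)
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    ax.set_title("k-Means clusters (PCA projection)")
    fig.savefig(path)
    plt.close(fig)
```

Calling plot_clusters(coords, cluster_labels) writes the figure to clusters.png. To fine-tune the outcome, vary parameters such as max_depth for the tree, alpha for the Bayes classifier, and n_clusters for k-Means, and compare the results.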