MIS Excel Data Analytics Capstone Project
This project requires that you use the theory and tools learned throughout this course:
· Big Data
· Business Intelligence and the Decision-Making Model
· Customer Relationship Management Systems
· Defining Data Problems
· Data Modeling
· Data Cleansing
· Data Mining with Classification and Prediction
· Mining analysis using error and lift-chart reports
· Data Forecasting
Identify a Business Problem for Classification Mining
Using the Stanley A Milner Library data referenced in the assignments, analyze the data available to define one real-world Classification business problem to solve. Here are some examples:
· Can we classify the frequency of visits to the Stanley A Milner Library from a combination of library survey variables? If so, what are the variables, and how can they be used to improve the efficient use of library resources?
· Can we classify the primary mode of transportation based on a combination of library survey variables? If so, we can collaborate with Edmonton Transit Services and share data to improve popular bus routes to the downtown library.
· Can we correctly determine if library visitors own a home? If so, we can have a more efficient marketing campaign to homeowners.
· Can we correctly determine if library visitors are newcomers to Canada? If so, the library can leverage the information to create learning programs for the target demographic.
As illustrated by the bolded words in each problem statement, there will be one specific output for each business problem.
Please have a look at the “Legend” tab in the library worksheet to see all available variables.
Creating a Classification Model and Appendix with Experiment Summary (Appendix Using Appendix Template)
As you go through this process, you must document the changes you make to the data model with each iteration and the reasoning behind each change. Developing your model is the most important part of this project, so ensure you are making improvements and documenting the reasoning and impact (screenshots and point form). Generally, this will involve reducing or changing the inputs (the tree viewer can help identify columns that are less relevant to the problem) and checking whether the error reports and lift charts have improved. Note that the columns do not need to be deleted from the spreadsheet to execute another run. Instead, remove the variable from the Orange “Select Columns” widget.
Use the included appendix template to build your appendix. Don’t forget to attach the appendix file to your submission.
For example, after obtaining the first set of error reports and lift charts of homeownership from the full set of data, the configuration below will disregard the consideration of whether a person is an employee of the City of Edmonton on the next trial:
Complete a total of 6 data mining trials (including the initial run with all data) for your defined Classification business problem. Change/Remove variables between each run and check the performance scores of Tree, kNN, Logistic Regression, and Naïve Bayes.
Minimum Guidelines for Running the Simulations:
For your Classification problem:
· Run 6 Classification Tree mining simulations for the stated business problem to get a variety of results to analyze.
· For each run, report on the numerical and lift chart performance of Tree, kNN, Logistic Regression, and Naïve Bayes.
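To make the lift chart numbers easier to interpret, the following is a minimal sketch of how cumulative lift can be computed by hand from a model's predicted probabilities. It is not how Orange produces its chart internally; the function name `cumulative_lift` and the sample values are illustrative assumptions only.

```python
def cumulative_lift(y_true, scores, positive=1, n_bins=10):
    """Return (fraction_of_sample, cumulative_lift) points for a lift chart.

    y_true  -- actual class labels, one per case
    scores  -- predicted probability of the positive class, one per case
    """
    if len(y_true) != len(scores):
        raise ValueError("y_true and scores must have the same length")
    total_pos = sum(1 for y in y_true if y == positive)
    if total_pos == 0:
        raise ValueError("no positive cases in y_true")

    # Rank cases from most to least likely positive according to the model.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    base_rate = total_pos / n  # hit rate you would get by picking at random

    points = []
    for b in range(1, n_bins + 1):
        k = round(b * n / n_bins)  # number of top-ranked cases in this cut
        if k == 0:
            continue
        hits = sum(1 for _, y in ranked[:k] if y == positive)
        # Lift = hit rate among the top k cases divided by the base rate.
        points.append((k / n, (hits / k) / base_rate))
    return points


# Illustrative placeholder data (not taken from the library survey).
y_true = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
for fraction, lift in cumulative_lift(y_true, scores, n_bins=5):
    print(f"top {fraction:.0%} of cases: lift {lift:.2f}")
```

A lift above 1 for the top-ranked cases means the model finds the target class more often than random selection would; the lift always returns to 1 once the whole sample is included.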
After executing the initial run with all other variables as input, record the results in the Appendix Word template (see Tip in point 2). This is where experimentation comes into play. For example:
Decide which 5 or 6 variables can be taken out of the next data mining run (move them from the “Features” column to the “Available Variables” column in the “Select Columns” widget) because they logically have no relationship to your output variable. This can be a logical guess, or you can use the tree viewer to assist. The example of moving City_Employee out of the model run can be seen on page 3 of the project description document.
After moving the variables and once Orange completes the next mining calculation, record the results in the Appendix Word template and repeat the process. Please include the confusion matrix, accuracy metrics, and lift charts for all mining methods (Tree, kNN, Logistic Regression, and Naïve Bayes). Tip: Use the summary “Evaluation Results” in Test and Score and the summary lift chart to minimize the number of screenshots taken.
Repeat until a minimum of 6 runs is complete. Between runs, you can experiment by swapping variables between the “Features” and “Available Variables” columns in the “Select Columns” widget, or by reducing the number of “Features” even further, to a minimum of 4 (do not go below 4 features in a run).
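If it helps to keep the appendix notes consistent, the run-by-run bookkeeping can be sketched in a few lines of Python. This is only an optional aid under stated assumptions, not part of the Orange workflow: the helper names are hypothetical, and every column name except City_Employee is a made-up placeholder rather than a variable from the “Legend” tab.

```python
MIN_FEATURES = 4  # from the guidelines: do not go below 4 features in a run
MIN_RUNS = 6      # from the guidelines: at least 6 runs, including the initial one


def next_feature_set(current, remove=(), add=()):
    """Return the feature list for the next run after moving variables."""
    unknown = [name for name in remove if name not in current]
    if unknown:
        raise ValueError(f"cannot remove features that are not selected: {unknown}")
    # Keep the original order, drop the removed names, then append new ones.
    updated = [name for name in current if name not in remove]
    updated += [name for name in add if name not in updated]
    if len(updated) < MIN_FEATURES:
        raise ValueError(f"a run needs at least {MIN_FEATURES} features")
    return updated


def describe_change(previous, current):
    """Summarize what changed between two runs, for the appendix notes."""
    return {
        "removed": sorted(set(previous) - set(current)),
        "added": sorted(set(current) - set(previous)),
    }


def runs_remaining(run_log):
    """Number of runs still needed to reach the required minimum."""
    return max(0, MIN_RUNS - len(run_log))


# Hypothetical feature names; only City_Employee appears in the project text.
run_log = [["Age", "Income", "City_Employee", "Visit_Frequency",
            "Transport_Mode", "Newcomer"]]
run_log.append(next_feature_set(run_log[-1], remove=["City_Employee"]))
print(describe_change(run_log[0], run_log[1]))
print("runs still required:", runs_remaining(run_log))
```

Recording the removed and added variables for each run gives you the "changes between each trial" that the appendix asks for.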
After completing all runs (minimum 6 as per template), use the analyzing technique from the data mining assignment (confusion matrix, accuracy metrics, and lift charts) to select the single best result among all runs.
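As a rough illustration of that comparison, the sketch below derives the usual accuracy metrics from a two-class confusion matrix and picks the highest-scoring run and model. The function names and all counts are placeholder assumptions, not actual results, and ranking by a single metric is a simplification: the final choice should also weigh the lift charts.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def best_result(results, key="f1"):
    """Return the run/model entry with the highest value for the chosen metric."""
    return max(results, key=lambda r: binary_metrics(*r["cm"])[key])


# Placeholder counts in (tp, fp, fn, tn) order; replace with your own results.
results = [
    {"run": 1, "model": "Tree", "cm": (40, 10, 20, 30)},
    {"run": 3, "model": "Logistic Regression", "cm": (50, 10, 10, 30)},
    {"run": 3, "model": "kNN", "cm": (45, 20, 15, 20)},
]
winner = best_result(results, key="f1")
print(winner["run"], winner["model"], binary_metrics(*winner["cm"]))
```

Changing `key` to `"accuracy"` or `"recall"` shows whether the same run still wins under a different metric, which is worth noting in the report when it does not.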
Note that you may discover a new trend that solves an entirely different business problem. Recall that discovering new trends is a key concept of Business Intelligence. If such a trend is found, document the new discovery in the report and determine if it is valuable information that can be added to the business recommendation in the next section.
Write a Business Recommendation based on the Mining Results
As IT business consultants, write a recommendation to the Edmonton Public Library on how they should use the mining results to improve their business or, if the results are deemed to be insufficient, recommend how the library can use the various systems to improve their data. Be sure to indicate the level of confidence using the error reports and lift charts from the best classification run.
For example, if our model is accurate at categorizing volunteers from a combination of variables, how can the Library make use of the information to conduct better business?
Other business application examples can be found in the Rainer Textbook for reference. The chapter slides are a good reference for ideas regarding potential topics to discuss:
· Chapter 5, Pages 127-130 (Big Data)
· Chapter 11, Customer Relationship Management and Supply Chain Management
· Chapter 12, Business Intelligence and Analytics
· Chapter 12, Pages 338-341 (Data Mining)
The format of the business recommendation should adhere to the following guidelines:
· 12pt Arial font, 1.5 spacing, page numbers.
· Sections of the document:
o Cover Page
o Table of Contents
o 1 Page Executive Summary. This section should condense the main paper into a single page for executives to read. Do not have more than 1 paragraph discussing the mining data results in this section.
o 4 Pages outlining the business opportunity/opportunities with the trends discovered. Only include content from the best classification and prediction models and how it supports your recommendation. Do not have more than a half-page discussing the mining data results in this section.
The deliverables for this project are:
· A paper containing:
o Your recommendation for the identified business problem based on the single best classification mining result.
o A classification business problem
o The technical results of your modeling exercise including
§ the initial run
§ each trial run, with a report on the changes between each trial run
o Identifying which single run and model is the best out of all other runs
· Your Orange workflow file (.ows) with the best mining result.
Rubric: 3 Deliverables
Best-Scored Orange Workflow with Data Cleaned
Classification Business Problem Statement
Classification Mining Trials
Appendix Report: Technical Explanation of Process
Report: Executive Summary
Primary Word Document
Report: Recommendation based on findings (main paper)
Primary Word Document
Overall Quality of Presentation of Report
Cover page, table of contents, page numbers, headers, logical presentation of information