Classification analysis
MA609 Business Analytics and Data Intelligence Week 10: TUT 10
- How do businesses collect and produce data? They do it via:
- Sales and returns transactions
- Bar code scans
- Credit card transactions
- GPS and RFID tracking
- Clicks on a webpage
- Define data mining
- Data mining is the process of finding and extracting useful information and insights from large datasets
- Like geological mining
- It is often hard, dirty work
- It takes the right tools
- Explain the data mining process
- Identify Opportunity
- Don’t dig randomly
- Begin with the end in mind
- What is the business problem/opportunity?
- Collect Data
- Decide where to dig
- Get the right data – internally or externally
- Millions of records aren’t required – use samples
- 10p to 15p records are usually enough (where p = number of variables)
- Understand, Explore & Prepare the Data
- Know what the data represents
- Make sure it is clean & complete
- Eliminate unneeded/redundant variables
- Transform variables as needed
- You might spend most of your data mining time here!
- Identify Task & Tools
- Classification (supervised)
- Prediction (supervised)
- Segmentation/Clustering (unsupervised)
- Partition Data
- Training
- Validation
- Testing (optional)
- Build & Evaluate Models
- Try different models
- Try different parameter settings
- Avoid overfitting
- Deploy Models
- Integrate models in operational systems
- Train users
- Monitor results
- Look for opportunities for continuous improvement
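The sample-size rule of thumb and the partitioning step can be sketched in a few lines of Python. This is only an illustration: the function names `min_sample_size` and `partition` are made up for this example, and the 60/20/20 split is an assumed ratio, not one prescribed in these notes.

```python
import random

def min_sample_size(num_variables, factor=10):
    """Rule of thumb from the notes: 10p to 15p records, p = number of variables."""
    return factor * num_variables

def partition(records, train=0.6, validation=0.2, seed=42):
    """Shuffle the records, then split them into training, validation and test sets.

    The 60/20/20 default is an assumption for illustration only.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validation)
    training_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_valid]
    test_set = shuffled[n_train + n_valid:]  # whatever is left over
    return training_set, validation_set, test_set

# A dataset with 8 variables needs roughly 80 to 120 records.
print(min_sample_size(8), min_sample_size(8, factor=15))

records = list(range(100))  # stand-in for 100 real records
training_set, validation_set, test_set = partition(records)
print(len(training_set), len(validation_set), len(test_set))
```

The model is fitted on the training set, competing models and parameter settings are compared on the validation set, and the optional test set gives a final check on data the model has never seen, which helps to reveal overfitting.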
- Define classification and give a few examples of its application.
Classification determines which of m mutually exclusive groups an observation of unknown origin belongs to. Some areas where classification is applied are:
- Character/target recognition
- Oil/gold exploration
- Loan approval
- Diagnose diseases
- Identify defects
- Predict bond ratings
- Fraud detection (credit card, tax, trading, etc.)
- Predict winners of sports events
- What are the steps to classify a record using the Full Bayes Classifier? What problems can arise in this case?
- To classify a new record
- Find all matching records
- Put new record in most frequently occurring matching group
- Problem
- Continuous variables are unlikely to match exactly
- Even with nominal variables, there might not be a match
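The steps and the problem above can be illustrated with a minimal Python sketch. The function name `exact_bayes_classify` and the small loan dataset are hypothetical, made up purely for this example.

```python
from collections import Counter

def exact_bayes_classify(training, new_record):
    """Classify new_record by majority vote among exactly matching records.

    training is a list of (predictors, group) pairs, where predictors is a tuple.
    Returns None when no training record matches exactly.
    """
    # Step 1: find all records whose predictor values match the new record
    matches = [group for predictors, group in training if predictors == new_record]
    if not matches:
        return None  # the problem: no matching records, so no classification
    # Step 2: put the new record in the most frequently occurring matching group
    return Counter(matches).most_common(1)[0][0]

# Hypothetical data: (employment status, home ownership) -> loan decision
training = [
    (("employed", "owner"), "approve"),
    (("employed", "owner"), "approve"),
    (("employed", "owner"), "reject"),
    (("unemployed", "renter"), "reject"),
]

print(exact_bayes_classify(training, ("employed", "owner")))    # approve
print(exact_bayes_classify(training, ("unemployed", "owner")))  # None
```

The second call returns `None` because no training record has that exact combination of values. With a continuous predictor such as income, almost every new record would fail to match in the same way.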
The remainder of today’s session is allocated to the group project. When you have finished the tutorial, please work on the project.