CISC333 Data Mining: Fall 2013
This course is offered in Fall 2013 in slot 2 in Ontario 207. The prerequisites are CISC121 and a statistics course. Although the course is numbered at the 3rd-year level, second-year students who wish to take it may do so (this was more common when the course was offered in the Winter Term).
This course is available on Moodle, so look for most of the information there.
The tool we will use for most of the practical work in the course is Rapidminer, a development from the Weka toolkit. You can find extensive tutorial material on the Rapid-I website.
Rapidminer is available for free download at http://rapid-i.com/; you need the Rapidminer Community Edition (this should be Rapidminer 5).
Rapidminer is also available on the Caslab machines, under either Windows or Linux.
A tutorial on using Matlab (useful for visualization) and on the SVD and ICA matrix decompositions is also available.
Exercises are a chance for you to get some hands-on experience. The exercise questions will often be open-ended. You might expect to spend 3 or 4 hours on these sheets each week. Each one is marked on this scale: acceptable; inadequate; or not seriously attempted. There will be five exercise sheets in the first half of the term.
(Exercise sheets will appear a week or ten days before they are due.)
This table describes what we will cover, keyed to the modules and text.
Avail means that the basic PowerPoint slides are available. Done means that the marked-up PowerPoint slides are available.
Tan, Steinbach, Kumar, Introduction to Data Mining, Addison-Wesley, 2006, ISBN 0-321-32136-7 ($105 at Amazon, $97.95 at Campus Bookstore).
There are 3 deliverables.
Note that the assessment in the course is backloaded, so please take this into account when planning your procrastination.
Instructor
David Skillicorn
528 Goodwin Hall
skill cs queensu ca
533 6065
Questions? Try asking me before or after class, or come and find me at my office any time I'm there. I have a schedule posted by my door.
I will schedule an office hour once term has started.
Teaching Assistant
Study skills. You probably know all of the conventional wisdom about how to learn, but perhaps you don't actually use it. Here is an excellent link: Study Hacks.
You may also want to subscribe to, or read, KD Nuggets.
Module | Content | Text Refs |
Module 0 | Introduction | Chapter 1 |
Module 1 | Prediction and Exploration | Bits of Chapter 2 |
Module 2 | Data preparation and model quality | Chapter 2 |
Module 3 | Simple predictors | |
Module 4 | Decision trees | Chapter 4 |
Module 5 | More decision trees | |
Module 6 | Neural networks | Chapter 5.4 |
Module 7 | Support Vector Machines | Chapter 5.5 |
Module 8 | Rule based systems | Chapter 5.1 |
Module 9 | Object selection: sampling, ensemble techniques | Chapter 5.6 |
Module 10 | Attribute selection | Section 2.3.4 |
Module 11 | Prediction case studies | |
Module 12 | Clustering I: similarities, k-means | Chapter 8.2 |
Module 13 | Clustering II: Expectation-Maximization | Section 9.2.2 |
Module 14 | Clustering III: Top-down and bottom-up clustering | Chapter 8.3 |
Module 15 | Clustering IV: Matrix decompositions | |
Module 16 | Clustering V: Clustering Large Datasets | Section 9.5.2, Section 8.4.2 |
Module 17 | Visualization | Chapter 3.3 |
Module 18 | Clustering Case Study | |
Module 19 | Clustering VI: Biclustering | |
Module 20 | Biclustering Case Study: Topic Detection | |
Module 21 | Mining the Web | |
Module 22 | Collaborative Filtering | |
Module 23 | Adversarial Data Mining | |
Module 24 | Summary | |
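Several of the clustering modules above (Module 12 in particular) center on k-means. As a rough preview of what that algorithm does, here is a minimal sketch in plain Python; it is illustrative only (the course's practical work uses Rapidminer, not hand-written code), and the data and function names are made up for the example:

```python
# Minimal k-means sketch (illustrative only; course work uses Rapidminer).
import random

def kmeans(points, k, iters=20, seed=0):
    """Cluster 2-D points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # pick k distinct starting centers
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        for i, (x, y) in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: (x - centroids[c][0]) ** 2 +
                                          (y - centroids[c][1]) ** 2)
        # Update step: move each centroid to the mean of its cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return centroids, labels

# Two obvious blobs: k-means should separate them into two clusters.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
        (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
cents, labs = kmeans(data, k=2)
```

The two steps (assign, then update) repeat until the centroids stop moving; Module 12 discusses how the choice of k and of the starting centers affects the result.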