Professor Harry Wechsler
Department of Computer Science
Fairfax, VA 22030
e-mail: wechsler@cs.gmu.edu
web: http://cs.gmu.edu/~wechsler/
(703) 993-1533 (office)
(703) 993-1530 (sec)
(703) 993-1710 (fax)
FALL 2006
CS 750 Theory and Applications of Data Mining
Class Information
001 70627 R 7:20 – 10:00 p.m. T 110
Prerequisites
CS 450 (“databases”), CS 580 (“AI”) or permission of instructor
Office Hours
Thursday 6:15 p.m. – 7:00 p.m. or by appointment (SITE II - Rm. 461)
Textbook
Introduction to Data Mining, Tan, Steinbach and Kumar, Pearson (Addison Wesley), 2006
web site for textbook slides: http://www-users.cs.umn.edu/~kumar/dmbook/
Reference
Data Mining: Concepts and Techniques (2nd edition), Han and Kamber, Elsevier, 2006
web site for textbook slides: http://www-faculty.cs.uiuc.edu/~hanj/bk2/
WEKA web site for data mining software
http://www.togaware.com/datamining/survivor/Weka.html
Background for Pattern Recognition and Classification
http://research.cs.tamu.edu/prism/lectures.htm
UCI Machine Learning Repository Content Summary
http://www.ics.uci.edu/~mlearn/MLSummary.html
Course Description
Concepts and techniques of data mining and their multidisciplinary applications. Topics include a review of databases and data warehousing, data cleaning and transformation, dimensionality reduction and data compression, concept description and rule classifiers, associations and rule generation, data classification and predictive modeling, learning ensembles, clustering, and performance analysis. Emerging themes and future challenges related to biometrics, intrusion detection, and social networking are also discussed. A team term project and a topical review are required.
Motivation
The explosive growth in the generation, collection and storage of data has created an urgent need for new techniques and automated tools that can intelligently assist in transforming vast amounts of data into useful information and knowledge. Data mining is a multidisciplinary field, drawing from areas including AI, database technology, data visualization, information retrieval, high performance computing, machine learning, mathematical programming, neural networks, pattern recognition, statistical learning theory, and statistics. The course gives graduate students the opportunity to learn about the management and use of large data repositories through a multidisciplinary approach.
Goals
The objective of this course is to introduce graduate students to current research, technological advances and trends in data mining. Data mining, which supports knowledge discovery in databases (KDD), helps with the automated extraction of patterns representing knowledge implicitly stored in large databases, data warehouses, and other massive information repositories. The course focuses on issues related to the feasibility, usefulness, efficiency, and scalability of automated techniques for the discovery of patterns hidden in large databases. Students will be exposed to the above topics via lectures and reading assignments, including recent journal and conference papers. Students are expected to complete a term project and to make an in-depth presentation on a topic related to data mining. As data mining has matured, the field is now advancing on three new fronts: (i) the ability to mine data in real time; (ii) predictive analysis rather than mere explanation of past trends; and (iii) the analysis of messy “unstructured” data, e.g., video.
Follow-Up Studies with Professor Wechsler:
1. CS 667 – Biometrics – Summer Session 2007
2. CS 775 / IT 844 – Pattern Recognition – Spring 2007
3. Certificate in Biometrics
4. PhD dissertation
Grading
Homework → 20%
Midterm → 25%
(Team) Term Project → 40%
Science and Technology REVIEW (5%) and Class Participation (10%) → 15%
Term Project
Students work in teams on the term project.
The scope and range of the project must be agreed on with the instructor as soon as possible.
The task must involve meaningful data mining functionality and significant amounts of data.
The project includes the following STEPS:
1. Problem definition, requirements analysis and conceptual design.
2. Data selection / sampling // visualization //
3. Cleaning and integration / Preprocessing // visualization //
4. Data transformation / Data and Dimensionality Reduction // visualization //
5. Data Mining // visualization //
6. Model selection, testing & evaluation, and performance assessment // visualization //
7. Knowledge discovery // visualization //
Use domain knowledge and visualization for all the steps.
Iteratively refine the quality and scope of your project.
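To make the flow of STEPS 2 through 6 concrete, here is a minimal sketch in plain Python. It is an illustration only, not a project template prescribed by the course: the function names, the toy data, and the choice of a 1-nearest-neighbor classifier are all assumptions made for the example. STEP 1, STEP 7 and the visualization at each step are omitted, and for brevity the scaling is done before the train/test split, which a careful evaluation would avoid.

```python
import random

def select_sample(records, k, seed=0):
    """STEP 2: draw a reproducible random sample of at most k records."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))

def clean(records):
    """STEP 3: drop records that contain a missing (None) feature value."""
    return [(x, y) for x, y in records if all(v is not None for v in x)]

def min_max_scale(records):
    """STEP 4: rescale every feature to [0, 1] (a simple transformation)."""
    columns = list(zip(*(x for x, _ in records)))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for x, y in records:
        row = tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                    for v, lo, hi in zip(x, lows, highs))
        scaled.append((row, y))
    return scaled

def predict_1nn(train, x):
    """STEP 5: classify x with the label of its nearest training record."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(train, key=lambda rec: sq_dist(rec[0], x))[1]

def holdout_accuracy(records, test_fraction=0.3, seed=0):
    """STEP 6: estimate accuracy on a held-out test split."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    hits = sum(predict_1nn(train, x) == y for x, y in test)
    return hits / len(test)

# Hypothetical toy data: two well-separated classes, plus one record
# with a missing value that the cleaning step should remove.
raw = [((float(i), float(i % 5)), "low") for i in range(20)]
raw += [((100.0 + i, 50.0 + i % 5), "high") for i in range(20)]
raw.append(((None, 1.0), "low"))

sample = select_sample(raw, k=30)
prepared = min_max_scale(clean(sample))
accuracy = holdout_accuracy(prepared)
print(f"records kept: {len(prepared)}, hold-out accuracy: {accuracy:.2f}")
```

Each function maps to one STEP, so a draft of a single step can be replaced or refined without touching the others, which fits the iterative refinement asked for above.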
Reviews and class presentations are conducted stepwise throughout the course (see tentative schedule). A first draft of each step is expected the week that STEP is listed in the tentative schedule below. Based upon feedback received in class, the same step is completed and presented again the following week.
Final (In Class) Project Presentation (SLIDES) (about 30 – 45 minutes)
1. Survey / Literature Review of (a) application and (b) task / functionality - data mining (STEP 5) and model selection (“training strategy”).
2. Brief Description of STEPS 1 – 7.
3. Performance Evaluation and Assessment of your project.
Final Project Report (HARD COPY) (at most 15 pages)
Submit a Technical Report (TR) that covers your Final Project Presentation.
Tentative Schedule
August 31 | Ch. 1: Introduction – Databases and Data Warehouses, Data Mining and Knowledge Discovery, Knowledge Management, and the Semantic Web (http://www.w3.org/2001/sw); Appendix C – Probability and Statistics; Appendix D – Regression
September 7 | Performance Evaluation and Model Selection (Notes) (see also Sect. 5.7); STEP 1; Appendix A – Linear Algebra
September 14 | Feature extraction and selection; dimensionality reduction; Appendix B – Dimensionality Reduction
September 21 – September 28 | Appendix E – Optimization; STEPS 2 – 3 <September 21>
October 5 |
October 12 | STEP 4
October 19 | Review for Midterm
October 26 | Midterm (covers August 31 – October 12 lectures) (bring exam bluebook and calculator) (closed books)
November 2 - 9 | Ch. 8 - 9: Cluster Analysis; STEP 5 <November 2>
November 16 | Mining data streams; social network analysis; STEPS 6 - 7
November 23 | Thanksgiving
November 30 | Biometrics
December 7 | FINAL PROJECT PRESENTATION
December 14 | FINAL PROJECT PRESENTATION