Professor Harry Wechsler

Department of Computer Science

George Mason University

Fairfax, VA 22030

e-mail: wechsler@cs.gmu.edu

web: http://cs.gmu.edu/~wechsler/

           (703) 993-1533 (office)

(703) 993-1530 (sec)

(703) 993-1710 (fax)

 

GEORGE MASON UNIVERSITY

       FALL 2006

       CS 750 Theory and Applications of Data Mining

      

      Class Information

001  70627   R   7:20 – 10:00 p.m.   T   110

Prerequisites

CS 450 (“databases”), CS 580 (“AI”), or permission of instructor

Office Hours

Thursday 6:15 p.m. – 7:00 p.m. or by appointment (SITE II - Rm. 461)

 

Textbook

Introduction to Data Mining, Tan, Steinbach and Kumar,

Pearson (Addison Wesley), 2006

web site for textbook slides: http://www-users.cs.umn.edu/~kumar/dmbook/

 

            Reference

Data Mining: Concepts and Techniques (2nd edition), Han and Kamber, Elsevier, 2006

web site for textbook slides: http://www-faculty.cs.uiuc.edu/~hanj/bk2/

 

 WEKA web site for data mining software

 

http://www.togaware.com/datamining/survivor/Weka.html

 

Background for Pattern Recognition and Classification

 

http://research.cs.tamu.edu/prism/lectures.htm

 

UCI Machine Learning Repository Content Summary

 

http://www.ics.uci.edu/~mlearn/MLSummary.html

 

          Course Description

Concepts and techniques of data mining and their multidisciplinary applications. Topics include a review of databases and data warehousing, data cleaning and transformation, dimensionality reduction and data compression, concept description and rule classifiers, associations and rule generation, data classification and predictive modeling, learning ensembles, clustering, and performance analysis. Emerging themes and future challenges related to biometrics, intrusion detection, and social networking are also discussed. A team term project and a topical review are required.

Motivation

The explosive growth in generating, collecting, and storing data has created an urgent need for new techniques and automated tools that can intelligently assist in transforming vast amounts of data into useful information and knowledge. Data mining is a multidisciplinary field, drawing from areas including AI, database technology, data visualization, information retrieval, high performance computing, machine learning, mathematical programming, neural networks, pattern recognition, statistical learning theory, and statistics. The course gives graduate students the opportunity to learn about the management and use of large data repositories through a multidisciplinary approach.

 

Goals

The objective of this course is to introduce graduate students to current research, technological advances, and trends in data mining. Data mining, which supports knowledge discovery in databases (KDD), helps with the automated extraction of patterns representing knowledge implicitly stored in large databases, data warehouses, and other massive information repositories. The course focuses on issues related to the feasibility, usefulness, efficiency, and scalability of automated techniques for the discovery of patterns hidden in large databases. Students will be exposed to these topics via lectures and reading assignments, including recent journal and conference papers. Students are expected to complete a term project and to make an in-depth presentation on a topic related to data mining. As data mining has matured, the field is now advancing on three new fronts: (i) mining data in real time; (ii) predicting future trends rather than merely explaining past ones; and (iii) analyzing messy “unstructured” data, e.g., video.

 

Follow-Up Studies with Professor Wechsler: 1. CS 667 – Biometrics – Summer Session 2007; 2. CS 775 / IT 844 – Pattern Recognition – Spring 2007; 3. Certificate in Biometrics; 4. PhD dissertation.

 

Grading

Homework → 20%

Midterm → 25%

(Team) Term Project → 40%

Science and Technology REVIEW (5%) and Class Participation (10%) → 15%

Term Project

Students work in teams on the term project.
The scope and range of the project must be agreed upon with the instructor as soon as possible.
The task must involve meaningful data mining functionality and significant amounts of data.
The project includes the following STEPS:


1. Problem definition, requirements analysis and conceptual design.
2. Data selection / sampling // visualization //
3. Cleaning and integration / Preprocessing // visualization //
4. Data transformation / Data and Dimensionality Reduction // visualization //
5. Data Mining // visualization //
6. Model selection, testing & evaluation, and performance assessment // visualization //
7. Knowledge discovery // visualization //

Use domain knowledge and visualization for all the steps.
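
For orientation only, STEPS 2 – 6 might be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not a required design: the toy records, the function names, the choice of a k-nearest-neighbour classifier for STEP 5, and the hold-out estimate for STEP 6 are all hypothetical stand-ins. STEPS 1 and 7 and all visualization are left out.

```python
import random
from collections import Counter

# Hypothetical toy data: (features, label) pairs forming two well-separated
# groups, plus one record per group with a missing value (None).
_rng = random.Random(42)
records = (
    [((_rng.uniform(0, 2), _rng.uniform(0, 2)), "a") for _ in range(14)]
    + [((_rng.uniform(8, 10), _rng.uniform(8, 10)), "b") for _ in range(14)]
    + [((None, 1.0), "a"), ((9.0, None), "b")]
)

def select_sample(data, fraction, seed=0):
    """STEP 2: simple random sample without replacement."""
    k = max(1, int(len(data) * fraction))
    return random.Random(seed).sample(data, k)

def clean(data):
    """STEP 3: drop records that have a missing feature value."""
    return [(x, y) for x, y in data if all(v is not None for v in x)]

def min_max_scale(data):
    """STEP 4: rescale every feature to the range [0, 1]."""
    cols = list(zip(*(x for x, _ in data)))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    def scale(x):
        return tuple((v - l) / (h - l) if h > l else 0.0
                     for v, l, h in zip(x, lo, hi))
    return [(scale(x), y) for x, y in data]

def knn_predict(train, x, k=3):
    """STEP 5: majority vote among the k nearest training records."""
    nearest = sorted(
        train, key=lambda r: sum((a - b) ** 2 for a, b in zip(r[0], x))
    )[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

def holdout_accuracy(data, test_fraction=0.25, seed=0, k=3):
    """STEP 6: estimate accuracy on a held-out portion of the data."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / n_test

sample = select_sample(records, fraction=0.8)
cleaned = clean(sample)
# Simplification: scaling is computed on all cleaned data before the split.
# In a real project, fit the scaling on the training portion only.
scaled = min_max_scale(cleaned)
accuracy = holdout_accuracy(scaled)
print(f"{len(sample)} sampled, {len(cleaned)} after cleaning, "
      f"hold-out accuracy {accuracy:.2f}")
```

Keeping each step as a separate function makes it easy to revise a single step after in-class feedback without touching the others.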

Iteratively refine the quality and scope of your project.

Reviews and class presentations are conducted stepwise
throughout the course (see tentative schedule). A first draft of each step is due
the week that STEP appears in the tentative schedule below.
Based on the feedback received in class, the same step is then completed and
presented again the following week.

Final (In Class) Project Presentation (SLIDES) (about 30 – 45 minutes)

1. Survey / Literature Review of (a) application
and (b) task / functionality - data mining (STEP 5)
and model selection (“training strategy”).

2. Brief Description of STEPS 1 – 7.

3. Performance Evaluation and Assessment of your project.

Final Project Report (HARD COPY) (at most 15 pages)

         Submit a Technical Report (TR) that covers your Final Project Presentation.

 

Tentative Schedule

August 31

Ch. 1: Introduction – Databases and Data Warehouses, Data Mining and Knowledge Discovery, Knowledge Management, and the Semantic Web (http://www.w3.org/2001/sw)

Appendix C – Probability and Statistics

Appendix D – Regression

September 7

Performance Evaluation and Model Selection (Notes) (see also Sect. 5.7)

STEP 1

Appendix A – Linear Algebra

September 14

Ch. 2: Data; Ch. 3: Exploring Data

- feature extraction and selection; dimensionality reduction -

Appendix B – Dimensionality Reduction

September 21 – 28

Ch. 4:  Classification; Ch. 5: Classification (Part I)

Appendix E – Optimization

STEPS 2 – 3 <September 21>

October 5

            Ch. 5:  Classification (Part II)

 

October 12

Ch. 6: Association Analysis

STEP 4

October 19

Review for Midterm

 

October 26

Midterm

(covers August 31 – October 12 lectures)

(bring exam bluebook and calculator)

(closed books)

November 2 - 9

Ch. 8 - 9: Cluster Analysis.

STEP 5 <November 2>

November 16

Ch. 10: Anomaly (“intrusion”) Detection;

mining data streams; social network analysis

STEPS  6 - 7

November 23

Thanksgiving

November 30

Biometrics

December 7

FINAL PROJECT PRESENTATION

December 14

FINAL PROJECT PRESENTATION