CS412 Introduction to Data Mining (Fall 2007)

Assignment 3

(40 marks)

Due time and date: Nov. 16 (Friday), in class


Submission Requirements

·         All answers must be computer generated (including text and diagrams).

·         The hand-in version must be ordered correctly and stapled in the top left corner.

·         The hand-in version must include a header page (or with sufficient space) indicating: student name, student number, net id, course number and assignment number.

·        Question 1: It is often interesting to set up different min_support thresholds for different group of items and/or for different levels of concepts at mining transaction databases. For example, digital camera is usually more rarely purchased in comparison with milk. Also, the frequency that a particular kind of Milk and a particular kind of Bread purchased together is much lower than that of Milk and Bread. Modify the FP-growth algorithm in each of the above two cases, that is

1.    Propose an efficient extended FPgrowth algorithm to mine frequent patterns where min_support is set differently for different groups of items.

2.    Propose another efficient extended FPgrowth algorithm to mine multi-level frequent patterns where min_support is set lower at lower level of concepts in comparison with the higher level ones.

·        Question 2: In association mining, there are four null-invariant measures to judge whether two items ai and aj in a transaction database are correlated: all-confidence, Kulczynski, cosine, and max-confidence.

1.    Generalize the definitions of these four measures from two itemsets to k-itemsets for any k.

2.    Prove for any data set of size two (i.e., k = 2), all-confidence =< cosine =< Kulczynski =< max-confidence.

3.    Based on this property, work out an efficient mining algorithm that mines all the strongly (positively) correlated k-itemsets for any k that satisfy both min_support and min_cosine constraints.

·        Question 3: Scalability is an important issue in classification of large data sets.

1.    Explain why RainForest is a scalable method for decision-tree induction in large databases.

2.    Workout a similar scalable induction method for Naive-Bayesian algorithm.

3.    Outline a scalable method for support vector machines.

·        Question 4. Frequent pattern-based classification is an interesting classification method

1.    Reason why frequent pattern-based classification method may lead to high classification accuracy.

2.    Most frequent pattern-based classification method need to first mine all of the frequent (or closed) patterns before classification, which is inefficient and wasteful. Outline a method that may lead to high classification accuracy but avoid mining all of the frequent (or closed) patterns.


Page maintained by: Jiawei Han
Last update: November 5, 2007