·
All answers must be computer generated (including text and diagrams).
·
The hand-in version must be ordered
correctly and stapled in the top left corner.
·
The hand-in version must include a
header page (or with sufficient space) indicating: student name, student
number, net id, course number and assignment number.
·
Question
1: It is often interesting to set up different min_support thresholds for
different group of items and/or for different levels of concepts at mining
transaction databases. For example, digital camera is usually more rarely
purchased in comparison with milk. Also, the frequency that a particular kind
of Milk and a particular kind of Bread purchased together is much lower than
that of Milk and Bread. Modify the FP-growth algorithm in each of the above two
cases, that is
1.
Propose
an efficient extended FPgrowth algorithm to mine
frequent patterns where min_support
is set differently for different groups of items.
2.
Propose
another efficient extended FPgrowth algorithm to mine
multi-level frequent patterns where min_support is set lower at lower level of concepts in
comparison with the higher level ones.
·
Question
2: In association mining, there are four null-invariant measures to judge
whether two items ai and aj in a transaction database are correlated: all-confidence, Kulczynski,
cosine, and max-confidence.
1. Generalize the definitions of these four measures
from two itemsets to k-itemsets
for any k.
2. Prove for any data set of size two (i.e., k = 2), all-confidence =< cosine =< Kulczynski =< max-confidence.
3. Based on this property, work out an efficient mining
algorithm that mines all the strongly (positively) correlated k-itemsets for any k that satisfy both min_support
and min_cosine constraints.
·
Question
3: Scalability is an important issue in classification of large data sets.
1. Explain why RainForest is
a scalable method for decision-tree induction in large databases.
2. Workout a similar scalable induction method for
Naive-Bayesian algorithm.
3. Outline a scalable method for support vector
machines.
·
Question
4. Frequent pattern-based classification is an interesting classification
method
1.
Reason
why frequent pattern-based classification method may lead to high
classification accuracy.
2.
Most
frequent pattern-based classification method need to first mine all of the frequent (or closed) patterns before classification, which is
inefficient and wasteful. Outline a method that may lead to high classification
accuracy but avoid mining all of the frequent (or closed) patterns.
Page maintained by: Jiawei Han
Last update: November 5, 2007