CS412 Introduction to Data Mining (Fall 2007)

Assignment 4

(30 marks)

Due time and date: Dec. 7 (Friday), in class

(Due to the short of time for working on the homework, please answer any three out of four questions in this homework)


Submission Requirements

All answers must be computer generated (including text and diagrams).

The hand-in version must be ordered correctly and stapled in the top left corner.

The hand-in version must include a header page (or with sufficient space) indicating: student name, student number, net id, course number and assignment number.

·        Question 1: Given a set of 14 points as follows:

 (0, 1), (1, 2), (3, 5), (4, 6), (5, 3), (6, 4), (7, 5), (2, 1), (3, 2), (4, 3), (2, 3), (5, 4), (7, 3), (2, 2)

Outline how the following algorithms will derive a set of clusters for the above set of points:

1.       The k-means algorithm, where k = 2,

2.       The k-medoids algorithm, where k = 2,

3.       DBSCAN, a density-based algorithm, where MinPts = 2, MaxRadius, i.e., Eps = 1.5

4.       AGNES, an agglomerative algorithm.

Discuss the advantages and disadvantages of each of the above algorithms.

·        Question 2: Links can be explored for effective cluster analysis. For example, SimRank has been used to cluster authors based on their publications in different or similar conferences.

1.       Describe how the SimRank algorithm works for effective cluster analysis, and what is its computational complexity, and

2.       Can you design another link-based clustering algorithm that can effectively reduce the computational complexity without reducing the quality of clustering?

·        Question 3: For clustering high-dimensional data set, such as micro-array data set, a frequent pattern-based clustering algorithm has been developed.

1.    Design such an algorithm and show how such an algorithm works.

2.    To further improve its computational efficiency, there have been studies on using pattern-growth-based method in clustering. Outline one such algorithm, and explain why it is efficient.

·        Question 4. There are many variants of constraint-based clustering. Three of them are as follows: (1) semi-supervised clustering where a user needs to specify a small set of object must-link (i.e., must be in the same cluster) or cannot link (i.e., cannot be in the same cluster; (2) user-guided clustering, where a user may specify an attribute which may influence the result of clustering (e.g., clustering students somehow guided by GPA instead of by birth-country); and (3) (user-specified) constraint-based clustering (e.g., each cluster must contain at least 100 objects).

1.    Outline each of such clustering methods,

2.    Discuss the major differences among these clustering methods, and what are the advantages and disadvantages of each method; and

3.    Give one application example that makes one method the most desirable one in comparison with the other two.

Page maintained by: Jiawei Han
Last update: December 4, 2007