All answers
must be computer generated (including text and diagrams).
The
hand-in version must be ordered correctly and stapled in the top left corner.
The
hand-in version must include a header page (or with sufficient space)
indicating: student name, student number, net id, course number and assignment
number.
·
Question
1: Given a set of 14 points as follows:
(0, 1), (1, 2), (3, 5), (4, 6), (5, 3), (6,
4), (7, 5), (2, 1), (3, 2), (4, 3), (2, 3), (5, 4), (7, 3), (2, 2)
Outline how the
following algorithms will derive a set of clusters for the above set of points:
1.
The
k-means algorithm, where k = 2,
2.
The
k-medoids algorithm, where k = 2,
3.
DBSCAN,
a density-based algorithm, where MinPts = 2, MaxRadius, i.e., Eps = 1.5
4.
AGNES,
an agglomerative algorithm.
Discuss the advantages
and disadvantages of each of the above algorithms.
·
Question
2: Links can be explored for effective cluster analysis. For example, SimRank has been used to cluster authors based on their
publications in different or similar conferences.
1.
Describe
how the SimRank algorithm works for effective cluster
analysis, and what is its computational complexity, and
2.
Can
you design another link-based clustering algorithm that can effectively reduce
the computational complexity without reducing the quality of clustering?
·
Question
3: For clustering high-dimensional data set, such as micro-array data set, a
frequent pattern-based clustering algorithm has been developed.
1. Design such an algorithm and show how such an
algorithm works.
2. To further improve its computational efficiency, there
have been studies on using pattern-growth-based method in clustering. Outline
one such algorithm, and explain why it is efficient.
·
Question
4. There are many variants of constraint-based clustering. Three of them are as
follows: (1) semi-supervised
clustering where a user needs to
specify a small set of object must-link (i.e., must be in the same cluster) or
cannot link (i.e., cannot be in the same cluster; (2) user-guided clustering, where a user may specify an attribute which may
influence the result of clustering (e.g., clustering students somehow guided by
GPA instead of by birth-country); and (3) (user-specified)
constraint-based clustering
(e.g., each cluster must contain at least 100 objects).
1.
Outline
each of such clustering methods,
2. Discuss the major differences among these clustering
methods, and what are the advantages and disadvantages of each method; and
3. Give one application example that makes one method
the most desirable one in comparison with the other two.
Page maintained by: Jiawei Han
Last update: December 4, 2007