CS412 Introduction to Data Warehousing and Data Mining
(Fall 2007)
Assignment 2 (mainly, Programming Assignment)
(30 marks)
Due time and date: In class, Wed. Oct. 31, 2007
Please read: Guidelines
to Assignment 2
Submission Requirements
There are three typical data cube computation methods:
multi-way array aggregation (Zhao, Despande and Naughton, SIGMOD’97), BUC (Beyer and Ramakrishnan, SIGMOD’99), and
star-cubing (Xin, Han, Li, and Wah, VLDB’03).
1. Implement any one of these cube computation
algorithms and describe your implementation, experimentation, and
performance. Then find another student who has implemented a different
algorithm on the same platform (e.g., C++ on Linux) and compare your algorithm
performance with his/hers.
Input:
i.
An n-dimensional fact table (for n < 20), which is essentially a relational
table with n attributes;
ii.
An iceberg condition: count (C) ≥ k where k is a positive integer as a
parameter.
i.
The set of computed cuboids that satisfy the iceberg condition, in the order of
your output generation;
ii.
Summary of the set of cuboid in the form of “cuboid id: the number of nonempty
cells”, sorted in alphabetical order of cuboids, e.g., a:155, ab:120, abc:22,
abcd:4, abce:6, abd:36, … (where the number after : represents the number of
nonempty cells). ---- This is used to quickly check the correctness of
your results.
2. Based on your
implementation, discuss the following questions:
·
What are the
challenging computation problems if the number of dimensions grows large?
·
Why iceberg
cubing may solve the problems for some data sets? Characterize such data
sets.
i.
Give one simple example to show that sometimes iceberg cube cannot provide a
good solution.
ii.
In many cases, one may like to materialize the cuboids with only a small number
of dimension combinations in a high dimensional cube (such a cube is called shell
cube). For example, compute only the cuboids with every 5-dimension
combination in a 30-dimensional data cube. Discuss how easy or hard to
modify your cube computation algorithm to facilitate such computation.
3. (Extra credit, maximum 10
points) Develop an algorithm (with convincing
performance, reasoning and clear presentation) that
computes closed iceberg cubes efficiently.