CS412 Introduction to Data Warehousing and Data Mining (Fall 2007)

 

Assignment 2 (mainly, Programming Assignment)

 

(30 marks)

 

Due time and date: In class, Wed. Oct. 31, 2007

Please read: Guidelines to Assignment 2


Submission Requirements

 

There are three typical data cube computation methods: multi-way array aggregation (Zhao, Despande and Naughton, SIGMOD97), BUC (Beyer and Ramakrishnan, SIGMOD99), and star-cubing (Xin, Han, Li, and Wah, VLDB03).

1.   Implement any one of these cube computation algorithms and describe your implementation, experimentation, and performance.  Then find another student who has implemented a different algorithm on the same platform (e.g., C++ on Linux) and compare your algorithm performance with his/hers.

Input: 

                 i.      An n-dimensional fact table (for n < 20), which is essentially a relational table with n attributes;

                 ii.      An iceberg condition: count (C) ≥ k where k is a positive integer as a parameter.

Output:

                 i.      The set of computed cuboids that satisfy the iceberg condition, in the order of your output generation;

                ii.      Summary of the set of cuboid in the form of “cuboid id: the number of nonempty cells”, sorted in alphabetical order of cuboids, e.g., a:155, ab:120, abc:22, abcd:4, abce:6, abd:36, … (where the number after : represents the number of nonempty cells).  ---- This is used to quickly check the correctness of your results.

2.   Based on your implementation, discuss the following questions:

·        What are the challenging computation problems if the number of dimensions grows large?

·        Why iceberg cubing may solve the problems for some data sets? Characterize such data sets. 

i.      Give one simple example to show that sometimes iceberg cube cannot provide a good solution.

ii.      In many cases, one may like to materialize the cuboids with only a small number of dimension combinations in a high dimensional cube (such a cube is called shell cube).  For example, compute only the cuboids with every 5-dimension combination in a 30-dimensional data cube.  Discuss how easy or hard to modify your cube computation algorithm to facilitate such computation.

3. (Extra credit, maximum 10 points) Develop an algorithm (with convincing performance, reasoning and clear presentation) that computes closed iceberg cubes efficiently.