CS512 “Data Mining: Principles and Algorithms” (Spring 2007)

 Assignment 1

(20 marks)

Due time and date: Feb. 22 (Thursday), in class


Submission Requirements

·         All answers must be computer generated (including text and diagrams).

·         The hand-in version must be ordered correctly and stapled in the top left corner.

·         The hand-in version must include a header page (or with sufficient space) indicating: student name, student number, user id, course number and assignment number.

1.   Multi-dimensional stream data outlier analysis: An important task in stream data analysis is to detect outliers in a multidimensional environment, such as detect unusual power surge in a multi-dimensional space, including time (i.e., comparing with the normal duration), region (i.e., comparing with surrounding regions), sector (i.e., university, residence, government, and so on), etc.  Outline an efficient stream OLAP method that may detect outliers effectively and efficiently in data streams, and reason why your design can ensure such quality.

2.    Frequency counts for frequent items over data streams: The lossy counting algorithm ensures a theoretical error bound for maintaining counts of frequent items over data streams.  However, it does not consider how to maintain the approximate counts for maximal number of items over data streams given a limited amount of main memory.  Moreover, it does not give more weights to more recent items than older ones.   Outline two efficient stream counting methods that overcome these deficiencies, respectively, that is, one method that takes limited memory into consideration, and the second one considers the assigned weights of items arriving at different times.  Reason why your design has such advantages.

3.    Proposing an effective method for mining botnet: Botnet is a serious kind of attack that uses a large pool of compromised hosts called zombie machines to launch malicious activities, such as DOS attacks, illegal data hosting, valuable information gathering, etc. These machines are infected with a bot, which is typically an executable program, acting as Trojan horses, backdoors, or worms, under the control of the attacker. The collection of bots, running on thousands of zombie machines, is therefore formed a botnet. An attacker could remotely control the large number of infected zombie machines via the botnet to attack against other target machines, such as school, enterprise, or government resources, or use it for other malicious activities. Propose a data mining method that may automatically detect as many as possible infected machines in the botnet and eventually uncover the whole botnet as well as the hidden controlled attackers.

4.    Sequential patterns in the same sequence: Sequential pattern mining algorithm introduced by Srikant and Agrawal (1996) is to find sequential patterns among a set of sequences. Although there has been interesting follow up studies to develop interesting mining algorithms, such as PrefixSpan (Pei, et al. 2001), Spade (Zaki 2001), and CloSpan (Yan et al, 2003), the basic definition of sequential pattern has not been changed. However, sometimes one may like to find sequential patterns (i.e., frequent subsequences) within the same sequence (suppose gap is not allowed, i.e., we do not consider AG is a subsequence of the sequence ATG). For example, a string ATGCTCGAGCT may contain a substring GCT with support 2. Derive an efficient algorithm that finds the complete set of subsequences satisfying a minimum support threshold, explain how your algorithm works using a small example, and show some performance results of your algorithm implementation.


Page maintained by: Jiawei Han
Last update: January 31, 2007