CS512: Data Mining: Principles and Algorithms (Spring 2007)

Assignment 2

(20 marks)

Due time and date: April 10 (Tuesday)


Submission Requirements

·         All answers must be computer generated (including text and diagrams).

·         The hand-in version must be ordered correctly and stapled in the top left corner.

·         The hand-in version must include a header page (or with sufficient space) indicating: student name, student number, user id, course number and assignment number.

1.   Constraint-based mining of graph patterns: Constraints often plays an important role in efficient graph mining. There are many potential constraints based on user requests in graph mining. For example, one may want graph patterns containing or excluding certain vertices (or edges), with minimal or maximal size, containing certain subgraphs, with certain summation values, etc. Based on how a constraint behaves in graph mining, give a classification of constraints and work out the rules on how to do constraint-based graph mining.

2.   Mining approximate graph patterns in a single graph: The graph algorithms that we introduced mine graph patterns in a set of graphs. Many applications may need to mine subgraph patterns within one huge interconnected graph such as the Web, publication-citation graphs, transportation road networks, or large biological networks. Moreover, one may like to find approximate graph patterns, such as ignoring minor differences on edge weights or label differences. Design an efficient algorithm that may perform such mining effectively and reason why this algorithm may have good performance.

3.   Mining multiple social networks: An object or a person can be involved in multiple social networks. For example, a student could be in a class, a research project group, a family member, member of a neighborhood, etc. It is often beneficial to consider their joint effect or interaction in data mining. Design an efficient method in social network analysis that may incorporate multiple social networks in data mining.

4.   Link analysis for data integration: Many data sets contain semantically rich implicit or explicit links. However, some different names may imply the same entity (e.g., Bill Clinton and William Clinton) but some same names may imply different entities (e.g., the John Smith problem). Design an efficient data integration method that resolves such problems by link analysis.


Page maintained by: Jiawei Han
Last update: March 7, 2007