NSF/SGER: CS-BibCube: OLAPing and Mining of Computer Science Literature

National Science Foundation Award Number: NSF IIS 08-42769 (September 1, 2008 to Feb. 28, 2010)

 

 

Contact Information

 

Jiawei Han,  PI
Department of Computer Science
University of Illinois, Urbana-Champaign
1304 West Springfield Ave., Urbana, Illinois 61801 U.S.A.
Office: (217) 333-6903, Fax: (217) 265-6494

E-mail: hanj at cs.uiuc.edu, URL: http://www.cs.uiuc.edu/~hanj

 

List of Supported Students and Staff

 

         Zhijun Yin, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign

         Yintao Yu, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign

         Bo Zhao, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign

Project Award Information

  • Award Number: NSF IIS 08-42769
  • Duration: September 1, 2008 to Feb. 28, 2010
  • Title: NSF/SGER: CS-BibCube: OLAPing and Mining of Computer Science Literature
  • Keywords: text databases, text mining, information network analysis, online analytical processing, scalable OLAP and mining algorithms, data mining applications

Project Summary

This research project investigates issues in the design and development of CS-BibCube, a multidimensional text data cube built on categorical dimensions (e.g., author list, venue, and date) and unstructured text attributes (e.g., title, abstract, and contents), to facilitate multidimensional online analytical processing (OLAP) and mining of computer science literature. The data cube has become an essential engine in the data warehouse industry and has been extended to handle relatively structured non-relational data, including spatiotemporal data, sequences, graphs, and data streams. However, handling unstructured text data remains challenging. This project explores the possibilities and alternatives in the design, multidimensional modeling, implementation, performance improvement, and deployment of text-cubing and text-OLAP. The work will integrate approaches from multiple disciplines, including data cube and OLAP, information retrieval, text mining, and machine learning, and further study is expected to expand to other multidimensional text databases with broad applications in business, industry, government agencies, scientific research, and education. The research results are to be published in research forums on information retrieval, data mining, and database systems, and integrated into the educational program at UIUC. The progress of the project and the research results will be disseminated via the project Web site (http://www.cs.uiuc.edu/~hanj/projs/csbibcube.htm).
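The idea of a text cube described above can be illustrated with a minimal sketch. This is not the CS-BibCube implementation: the records, the `build_text_cube` function, and the choice of raw term frequency over titles as the cell measure are all assumptions made for illustration. Each cuboid groups documents by a subset of the categorical dimensions, and each cell holds an aggregated text measure.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy records: categorical dimensions plus one free-text attribute.
DOCS = [
    {"venue": "ICDM", "year": 2008, "title": "text cube for text database analysis"},
    {"venue": "ICDM", "year": 2008, "title": "graph olap on graphs"},
    {"venue": "SDM", "year": 2009, "title": "topic cube for olap on text databases"},
]
DIMENSIONS = ("venue", "year")


def build_text_cube(docs, dimensions):
    """Return {cuboid: {cell_key: Counter}} for every subset of dimensions.

    A cuboid is a tuple of group-by dimensions; a cell key is the tuple of
    dimension values; the cell measure is a term-frequency Counter over the
    titles of the documents that fall into that cell.
    """
    cube = {}
    for k in range(len(dimensions) + 1):
        for group_by in combinations(dimensions, k):
            cells = defaultdict(Counter)
            for doc in docs:
                key = tuple(doc[d] for d in group_by)
                cells[key].update(doc["title"].split())
            cube[group_by] = dict(cells)
    return cube


cube = build_text_cube(DOCS, DIMENSIONS)

# Roll-up to the apex cuboid (no group-by): term counts over all documents.
apex_counts = cube[()][()]
# Drill-down to a single (venue, year) cell.
sdm_2009_counts = cube[("venue", "year")][("SDM", 2009)]
```

Materializing every cuboid, as this sketch does, is only feasible for a handful of dimensions; it is meant to show the roll-up and drill-down structure rather than a scalable cubing method.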

Publications and Products:

Journal articles (including accepted)

 

  • Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han, and Philip S. Yu, "Graph OLAP: A Multi-Dimensional Framework for Graph Data Analysis", Knowledge and Information Systems (KAIS) (Special Issue of Selected Papers from ICDM'08), 2009.
  • Jing Gao, Bolin Ding, Wei Fan, Jiawei Han, and Philip S. Yu, "Classifying Data Streams with Skewed Class Distribution and Concept Drifts", IEEE Internet Computing (Special Issue on Data Stream Management), 12(6):37-49, 2008.
  • Chao Liu, Xiangyu Zhang, and Jiawei Han, "A Systematic Study of Failure Proximity", IEEE Transactions on Software Engineering, 34(6):826-843, 2008.

 

Book and Book Chapters

 

  1. H. J. Miller and J. Han (eds.), Geographic Data Mining and Knowledge Discovery, 2nd ed., Springer Verlag, 2009.
  2. Hillol Kargupta, Jiawei Han, Philip S. Yu, and Rajeev Motwani (eds.), Next Generation of Data Mining, (Chapman & Hall/CRC Data Mining and Knowledge Discovery Series), 2009 (605 + xxiv pages).
  3. Jiawei Han, Y. Dora Cai, Yixin Chen, Guozhu Dong, Jian Pei, Benjamin W. Wah, and Jianyong Wang, "Multi-Dimensional Analysis of Data Streams Using Stream Cubes", in C. C. Aggarwal (ed.), Data Streams: Models and Algorithms, Kluwer Academic Publishers, pp. 103-126, 2006.
  4. Jiawei Han, "Data Mining", in M. Tamer Ozsu and Ling Liu (eds.), Encyclopedia of Database Systems, Springer, 2009.
  5. Hong Cheng and Jiawei Han, "Frequent Itemsets and Association Rules", in M. Tamer Ozsu and Ling Liu (eds.), Encyclopedia of Database Systems, Springer, 2009.
  6. Hong Cheng and Jiawei Han, "Pattern-Growth Methods", in M. Tamer Ozsu and Ling Liu (eds.), Encyclopedia of Database Systems, Springer, 2009.
  7. Jiawei Han and Bolin Ding, "Stream Mining", in M. Tamer Ozsu and Ling Liu (eds.), Encyclopedia of Database Systems, Springer, 2009.
  8. Ronnie Alves, Joel Ribeiro, Orlando Belo, and Jiawei Han, "Ranking Gradients in Multi-Dimensional Spaces", in T. M. Nguyen (ed.), Complex Data Warehousing and Knowledge Discovery for Advanced Retrieval Development: Innovative Methods and Applications, IGI Global, 2009.
  9. Jiawei Han and Jing Gao, "Research Challenges for Data Mining in Science and Engineering", in H. Kargupta, et al., (eds.), Next Generation of Data Mining, Chapman & Hall/CRC, 2009, pp. 3-28.
  10. Feida Zhu, Xifeng Yan, Jiawei Han, and Philip S. Yu, "Mining Frequent Approximate Sequential Patterns", in H. Kargupta, et al., (eds.), Next Generation of Data Mining, Chapman & Hall/CRC, 2009, pp. 69-90.
  11. Jiawei Han and Xiaolei Li, "Classification and Clustering for Homeland Security", in John G. Voeller (ed.), Wiley Handbook of Science and Technology for Homeland Security, John Wiley & Sons, 2009.
  12. Jiawei Han, "OLAP, Spatial", in Shashi Shekhar and Hui Xiong (eds.), Encyclopedia of GIS, Springer, 2008.

 

Refereed Conference Publications

 

  1. Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham, "Integrating Novel Class Detection with Classification for Concept-Drifting Data Streams", Proc. 2009 European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD'09), Bled, Slovenia, Sept. 2009.
  2. Min-Soo Kim and Jiawei Han, "A Particle-and-Density Based Evolutionary Clustering Method for Dynamic Networks", Proc. 2009 Int. Conf. on Very Large Data Bases (VLDB'09), Lyon, France, Aug. 2009.
  3. Tianyi Wu, Dong Xin, Qiaozhu Mei, and Jiawei Han, "Promotion Analysis in Multi-Dimensional Space", Proc. 2009 Int. Conf. on Very Large Data Bases (VLDB'09), Lyon, France, Aug. 2009.
  4. Chen Chen, Cindy Lin, Matt Fredrikson, Mihai Christodorescu, Xifeng Yan, and Jiawei Han, "Mining Graph Patterns Efficiently via Randomized Summaries", Proc. 2009 Int. Conf. on Very Large Data Bases (VLDB'09), Lyon, France, Aug. 2009.
  5. Yintao Yu, Cindy X. Lin, Yizhou Sun, Chen Chen, Jiawei Han, Binbin Liao, Tianyi Wu, ChengXiang Zhai, Duo Zhang, and Bo Zhao, "iNextCube: Information Network-Enhanced Text Cube", Proc. 2009 Int. Conf. on Very Large Data Bases (VLDB'09) (system demo), Lyon, France, Aug. 2009.
  6. David Lo, Hong Cheng, Jiawei Han, Siau-Cheng Khoo, and Chengnian Sun, "Classification of Software Behaviors for Failure Detection: A Discriminative Pattern Mining Approach", Proc. 2009 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'09), Paris, France, June 2009.
  7. Yizhou Sun, Yintao Yu, and Jiawei Han, "Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema", Proc. 2009 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'09), Paris, France, June 2009.
  8. Zhijun Yin, Rui Li, Qiaozhu Mei, and Jiawei Han, "Exploring Social Tagging Graph for Web Object Classification", Proc. 2009 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'09), Paris, France, June 2009.
  9. Jing Gao, Wei Fan, Yizhou Sun, and Jiawei Han, "Heterogeneous Source Consensus Learning via Decision Propagation and Negotiation", Proc. 2009 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'09), Paris, France, June 2009.
  10. Deng Cai, Xiaofei He, Xuanhui Wang, Hujun Bao, and Jiawei Han, "Locality Preserving Nonnegative Matrix Factorization", Proc. 2009 Int. Joint Conf. on Artificial Intelligence (IJCAI-09), Pasadena, CA, July 2009.
  11. Mohammad Maifi Hasan Khan, Tarek Abdelzaher, Jiawei Han, and Hossein Ahmadi, "Finding Symbolic Bug Patterns in Sensor Networks", Proc. 2009 IEEE Int. Conf. on Distributed Computing in Sensor Systems (DCOSS '09), Marina Del Rey, CA, June 2009.
  12. Jing Gao, Guofei Jiang, Haifeng Chen, and Jiawei Han, "Modeling Probabilistic Measurement Correlations for Problem Determination in Large-Scale Distributed Systems", Proc. 2009 Int. Conf. on Distributed Computing Systems (ICDCS'09), Montreal, Quebec, Canada, June 2009.
  13. Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham, "A Multi-Partition Multi-Chunk Ensemble Technique to Classify Concept-Drifting Data Streams", Proc. 2009 Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD'09), Bangkok, Thailand, Apr. 2009.
  14. Xin Jin, Sangkyum Kim, Jiawei Han, Liangliang Cao, and Zhijun Yin, "GAD: General Activity Detection for Fast Clustering on Large Data", Proc. 2009 SIAM Int. Conf. on Data Mining (SDM'09), Sparks, NV, April 2009.
  15. Marisa Thoma, Hong Cheng, Arthur Gretton, Jiawei Han, Hans-Peter Kriegel, Alexander J. Smola, Le Song, Philip S. Yu, Xifeng Yan, and Karsten M. Borgwardt, "Near-Optimal Supervised Feature Selection among Frequent Subgraphs", Proc. 2009 SIAM Int. Conf. on Data Mining (SDM'09), Sparks, NV, April 2009.
  16. Duo Zhang, Chengxiang Zhai, and Jiawei Han, "Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases", Proc. 2009 SIAM Int. Conf. on Data Mining (SDM'09), Sparks, NV, April 2009. (One of "Best of SDM'09")
  17. Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng, and Tianyi Wu, "RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis", Proc. 2009 Int. Conf. on Extending Data Base Technology (EDBT'09), Saint-Petersburg, Russia, Mar. 2009.
  18. Jiawei Han, Xifeng Yan, and Philip S. Yu, "Scalable OLAP and Mining of Information Networks", 2009 Int. Conf. on Extending Data Base Technology (EDBT'09), Saint-Petersburg, Russia, Mar. 2009.
  19. Bolin Ding, David Lo, Jiawei Han, and Siau-Cheng Khoo, "Efficient Mining of Closed Repetitive Gapped Subsequences from a Sequence Database", Proc. 2009 Int. Conf. on Data Engineering (ICDE'09), Shanghai, China, Mar. 2009.
  20. Xiaolei Li, Zhenhui Li, Jiawei Han, and Jae-Gil Lee, "Temporal Outlier Detection in Vehicle Traffic Data", Proc. 2009 Int. Conf. on Data Engineering (ICDE'09), Shanghai, China, Mar. 2009.
  21. Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han, and Philip S. Yu, "Graph OLAP: Towards Online Analytical Processing on Graphs", Proc. 2008 Int. Conf. on Data Mining (ICDM'08), Pisa, Italy, Dec. 2008.
  22. Deng Cai, Xiaofei He, Xiaoyun Wu, and Jiawei Han, "Non-negative Matrix Factorization on Manifold", Proc. 2008 Int. Conf. on Data Mining (ICDM'08), Pisa, Italy, Dec. 2008.
  23. Cindy Xide Lin, Bolin Ding, Jiawei Han, Feida Zhu, and Bo Zhao, "Text Cube: Computing IR Measures for Multidimensional Text Database Analysis", Proc. 2008 Int. Conf. on Data Mining (ICDM'08), Pisa, Italy, Dec. 2008.
  24. Luiz Mendes, Bolin Ding, and Jiawei Han, "Stream Sequential Pattern Mining with Precise Error Bounds", Proc. 2008 Int. Conf. on Data Mining (ICDM'08), Pisa, Italy, Dec. 2008.
  25. Mohammad Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham, "A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data", Proc. 2008 Int. Conf. on Data Mining (ICDM'08), Pisa, Italy, Dec. 2008.
  26. Mohammad Maifi Hasan Khan, Hieu Le, Hossein Ahmadi, Tarek Abdelzaher, and Jiawei Han, "DustMiner: Troubleshooting Interactive Complexity Bugs in Sensor Networks", Proc. 2008 ACM Int. Conf. on Embedded Networked Sensor Systems (Sensys'08), Raleigh, NC, Nov. 2008.
  27. Chen Chen, Cindy Xide Lin, Xifeng Yan, and Jiawei Han, "On Effective Presentation of Graph Patterns: A Structural Representative Approach", Proc. 2008 ACM Conf. on Information and Knowledge Management (CIKM'08), Napa Valley, CA, Oct. 2008.
  28. Deng Cai, Qiaozhu Mei, Jiawei Han, and ChengXiang Zhai, "Modeling Hidden Topics on Document Manifold", Proc. 2008 ACM Conf. on Information and Knowledge Management (CIKM'08), Napa Valley, CA, Oct. 2008.

Project Impact

 

         Education: Parts of the new research results are used in the Data Mining courses (CS412, CS512) taught to undergraduate and graduate students in the Department of Computer Science at the University of Illinois at Urbana-Champaign. Moreover, the research results have been, and will continue to be, published promptly in international conferences and journals and distributed worldwide for education and research. The new progress will also be integrated into the new edition of our data mining textbook and other research collections.

         Collaborations: For this project we have established collaborations with NASA, HP Labs, IBM T.J. Watson Research Center, Yahoo! Research, Microsoft Research, Boeing, and NCSA (National Center for Supercomputing Applications). Through these collaborations we expect to gain access to real datasets and applications and to produce more research results.

 

Current and Future Activities

The following are some highlights of our ongoing work. Please refer to the Publications and Products section for related references.

         Development of efficient and scalable mechanisms for OLAP and mining of information networks: see the ICDM’08, EDBT’09, SDM’09, KDD’09, and VLDB’09 papers.

         Development of multi-dimensional text database analysis techniques: see the ICDM’08 paper (text cube), the SDM’09 paper (topic cube), and the VLDB’09 demo (iNextCube).

         Development of efficient methods for data-intensive knowledge discovery and data mining: see the SDM’09, KDD’09, and VLDB’09 papers.

 Area Background

 

This project builds on previous research on data mining, text data analysis, and data cube and multidimensional analysis. Many research papers have been published on these themes. Several textbooks on data mining, information retrieval, and information network analysis provide good overviews of the principles and algorithms, including (Han and Kamber, 2006), (Hastie, Tibshirani, and Friedman, 2001), and (Manning, Raghavan, and Schutze, 2008).

 

Area References

 

  1. Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han, and Philip S. Yu, "Graph OLAP: Towards Online Analytical Processing on Graphs", Proc. 2008 Int. Conf. on Data Mining (ICDM'08), Pisa, Italy, Dec. 2008.
  2. Cindy Xide Lin, Bolin Ding, Jiawei Han, Feida Zhu, and Bo Zhao, "Text Cube: Computing IR Measures for Multidimensional Text Database Analysis", Proc. 2008 Int. Conf. on Data Mining (ICDM'08), Pisa, Italy, Dec. 2008.
  3. J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann, 2006.
  4. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, 2001.
  5. C. D. Manning, P. Raghavan, and H. Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008.
  6. Duo Zhang, Chengxiang Zhai, and Jiawei Han, "Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases", Proc. 2009 SIAM Int. Conf. on Data Mining (SDM'09), Sparks, NV, April 2009.

 

Potential Related Projects

This project is related to most research on data mining, text databases, and OLAP. In particular, it is related to the PI's NSF IIS-02-09199 (Mining Sequential and Structured Patterns: Scalability, Flexibility, Extensibility and Applicability), the PI's NSF IIS-03-08215 (Mining Dynamics of Data Streams in Multi-Dimensional Space), and the PI's NASA project NNX08AC35A (Event Cube: An Organized Approach for Mining and Understanding Anomalous Aviation Events). We wish to collaborate or exchange research ideas with research projects related to knowledge discovery in databases, text information systems, and OLAP analysis, and their applications.

Project Web site URL:  http://www.cs.uiuc.edu/~hanj/projs/csbibcube.htm

Online software:  Online software related to this project can be downloaded at www.illimine.cs.uiuc.edu

Online resources:  Research publications related to this project can be downloaded at Selected Publications