Datasets used in KDD-2011

Title: Probabilistic Topic Models with Biased Propagation on Heterogeneous Information Networks
Authors: Hongbo Deng, Jiawei Han, Bo Zhao, Yintao Yu, Cindy Xide Lin

DBLP subset

Description

The Digital Bibliography and Library Project (DBLP) is a collection of bibliographic information on major computer
science journals and proceedings, which can be used to build a heterogeneous information network with multi-typed objects
along with rich text data, as shown in Figure 1(a). Each paper is represented by a bag of words that appear in the abstract
and title of the paper. Besides the rich-text documents, we also obtain two other types of objects: author and venue
(i.e., conference). In this experiment, we use a subset of the DBLP records that belong to four areas: database, data
mining, information retrieval, and artificial intelligence. The subset contains 28,569 documents, 28,702 authors, and 20 conferences.
The abstract is collected to represent each document, and this data collection has 11,771 unique terms.
Within the heterogeneous information network, we observe two explicit types of relationships: paper-author and paper-venue,
which together comprise 103,201 links. Moreover, we use a labeled data set [22] with 4,057 authors,
100 papers and all 20 conferences for quantitative accuracy evaluation.
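
As a minimal sketch of the bag-of-words representation described above, the following Python snippet counts the terms in a paper's title and abstract and derives the number of unique terms. The two toy papers, the lowercase whitespace tokenizer, and the function name bag_of_words are illustrative assumptions; the preprocessing actually used for this dataset (tokenization, stop-word removal, stemming) is not specified here.

```python
from collections import Counter


def bag_of_words(title, abstract):
    """Count term occurrences in the title and abstract (lowercased, whitespace-split)."""
    tokens = (title + " " + abstract).lower().split()
    return Counter(tokens)


# Toy papers as (title, abstract) pairs; hypothetical, not taken from the dataset.
papers = [
    ("Topic models", "topic models for text mining"),
    ("Graph mining", "mining heterogeneous information networks"),
]

bags = [bag_of_words(title, abstract) for title, abstract in papers]

# The vocabulary is the set of unique terms across all papers.
vocabulary = sorted(set().union(*bags))

print(bags[0]["topic"])  # occurrences of "topic" in the first paper
print(len(vocabulary))   # number of unique terms in this toy collection
```

Each Counter corresponds to one row of a paper-term matrix once the vocabulary fixes a column order.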

Distribution of words, documents, authors, venues, and labels

NSF-Awards subset

Description

The NSF Research Awards Abstracts (NSF-Awards) consists of 129,000 abstracts describing NSF awards for basic
research from 1990 to 2003, which are grouped into more than 640 research programs. For each NSF award, we obtain
the abstract represented by a bag of words, and the affiliated investigator(s), forming a heterogeneous information
network. In our test, we extract a subset of documents that belong to the largest 10 research programs, such as applied
mathematics, economics and geophysics, thus leaving us with 16,405 documents and 9,989 associated investigators.
Within the heterogeneous information network, there are a total of 20,717 links between documents and investigators.
Moreover, this data collection has 18,674 unique terms across all the abstracts.
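
The subset selection described above (keeping only documents from the largest research programs) could be sketched as follows. This is an illustration under stated assumptions, not the script used to build the dataset: the (award id, program) pairs are toy values, the function name keep_largest_programs is hypothetical, and "largest" is taken to mean "having the most awards".

```python
from collections import Counter


def keep_largest_programs(awards, k):
    """Return the awards whose program is among the k programs with the most awards."""
    counts = Counter(program for _, program in awards)
    largest = {program for program, _ in counts.most_common(k)}
    return [(award_id, program) for award_id, program in awards if program in largest]


# Toy (award id, program) pairs; hypothetical values.
awards = [
    (1, "applied mathematics"),
    (2, "economics"),
    (3, "applied mathematics"),
    (4, "geophysics"),
    (5, "economics"),
    (6, "applied mathematics"),
]

subset = keep_largest_programs(awards, k=2)
print([award_id for award_id, _ in subset])
```

With k=10 on the full collection, the same filter would keep the documents of the ten largest programs. Ties between equally sized programs are broken by first appearance, which is a property of Counter.most_common rather than of the original procedure.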

Distribution of words, documents, investigators, and labels

Code

Description

Code and dataset can be downloaded here.

@inproceedings{DBLP:conf/kdd/DengHZYL11,
  author    = {Hongbo Deng and
               Jiawei Han and
               Bo Zhao and
               Yintao Yu and
               Cindy Xide Lin},
  title     = {Probabilistic topic models with biased propagation on heterogeneous
               information networks},
  booktitle = {KDD},
  year      = {2011},
  pages     = {1271-1279},
  ee        = {http://doi.acm.org/10.1145/2020408.2020600},
  crossref  = {DBLP:conf/kdd/2011},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

If you use any content from this package, please cite the paper above.


**** Package Contents ****

./code
    ./step1 compile the mex files, only needs to be done once
        - mex mex_Pw_d.c
        - mex mex_EMstep.c
        - mex mex_logL.c (from the plsa.zip code provided by Peter Gehler at http://people.kyb.tuebingen.mpg.de/pgehler/code/index.html)

    ./step2 run sample code for DBLP dataset
        - sampleRun_TMBP_DBLP.m
        - sampleRun_RW-DBLP.m

    other files:
    ./TMBP.m Probabilistic topic models with biased propagation
    ./RandomWalkNormal.m TMBP-RW
    ./CalcMetrics.m calculate the evaluation results
    ./WriteResult.m write the results to file
    ./WriteTopics_new.m output the topic information


./dataset
    ./dblp_4area_abstract.mat
        - Mda: Paper-Author matrix (value is 1 or 0, Mda(i,j) = 1 means the paper i is written by author j)
        - Mdc: Paper-Conference(Venue) matrix (value is 1 or 0, Mdc(i,j) = 1 means the paper i is published in conf j)
        - Mdt: Paper-Term matrix (Mdt(i,j) is the number of occurrences of term j in paper i)
        - name: contains the original names of authors/confs/papers/terms
        - label: labels for authors/confs/papers
 
./ReadMe.txt 
    This file itself.
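
To make the meaning of the matrices in dblp_4area_abstract.mat concrete, here is a small Python sketch that uses toy stand-ins for Mda, Mdc and Mdt. The values and the helper names nonzero_columns and count_links are hypothetical; the real matrices are stored in MATLAB format and are far larger. Note that Python indices start at 0, whereas Mda(i,j) in MATLAB starts at 1.

```python
# Toy stand-ins for the matrices (one row per paper); hypothetical values.
Mda = [  # paper x author: 1 if author j wrote paper i
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
]
Mdc = [  # paper x conference: 1 if paper i was published in conference j
    [1, 0],
    [1, 0],
    [0, 1],
]
Mdt = [  # paper x term: number of occurrences of term j in paper i
    [2, 0, 1, 0],
    [0, 3, 0, 0],
    [1, 0, 0, 4],
]


def nonzero_columns(matrix, i):
    """Return the 0-based column indices that hold a nonzero entry in row i."""
    return [j for j, value in enumerate(matrix[i]) if value != 0]


def count_links(matrix):
    """Count the nonzero entries, i.e. the links encoded by an incidence matrix."""
    return sum(1 for row in matrix for value in row if value != 0)


authors_of_paper_0 = nonzero_columns(Mda, 0)        # authors of the first paper
venue_of_paper_2 = nonzero_columns(Mdc, 2)          # conference of the third paper
total_links = count_links(Mda) + count_links(Mdc)   # paper-author plus paper-venue links
length_of_paper_2 = sum(Mdt[2])                     # total term occurrences in the third paper

print(authors_of_paper_0, venue_of_paper_2, total_links, length_of_paper_2)
```

Under this reading, the nonzero entries of Mda and Mdc are the paper-author and paper-venue links of the heterogeneous information network, and each row of Mdt is the bag of words of one paper.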


Date Created: Feb 18, 2011