We provide here a subset of the original TDT2 corpus. The TDT2 corpus (NIST Topic Detection and Tracking corpus) consists of data collected during the first half of 1998 from 6 sources: 2 newswires (APW, NYT), 2 radio programs (VOA, PRI), and 2 television programs (CNN, ABC). It contains 11,201 on-topic documents classified into 96 semantic categories. In this subset, documents appearing in two or more categories were removed and only the largest 30 categories were kept, leaving 9,394 documents in total.
Data File:
The data file contains the variables 'fea' and 'gnd'.
'fea' is the term-document matrix, in which each row is a document; 'gnd' holds the label of each document.
Random Clusters Index Files:
2 Classes | 3 Classes | 4 Classes | 5 Classes | 6 Classes | 7 Classes | 8 Classes | 9 Classes | 10 Classes
For each number of clusters, there are 50 random cases. Each case file contains the variables 'sampleIdx' and 'zeroIdx'.
The following MATLAB code can be used to generate the subset for a particular case:
%===========================================
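fea = fea(sampleIdx,:);   % keep only the sampled documents (rows)
gnd = gnd(sampleIdx,:);   % keep the matching labels
fea(:,zeroIdx) = [];      % drop the feature columns listed in zeroIdx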
%===========================================
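As a minimal sketch of what these three statements do, the toy example below runs them on made-up data. The matrix, the labels, and the way 'sampleIdx' and 'zeroIdx' are built here are assumptions for illustration only. In particular, it assumes that 'zeroIdx' lists the terms that never occur in the selected documents, which the name suggests but this page does not state.
%===========================================
% Toy stand-ins (assumption): 6 documents, 5 terms, 3 classes.
fea = sparse([1 0 2 0 0; 0 1 0 0 0; 3 0 0 0 1; 0 0 0 2 0; 1 1 0 0 0; 0 0 4 0 0]);
gnd = [1; 2; 1; 3; 2; 3];
% Hypothetical case: keep the documents of classes 1 and 2.
sampleIdx = find(gnd == 1 | gnd == 2);
% Assumed meaning of zeroIdx: columns that are all zero in the selected rows.
zeroIdx = find(sum(fea(sampleIdx,:) ~= 0, 1) == 0);
% Same three statements as above.
fea = fea(sampleIdx,:);
gnd = gnd(sampleIdx,:);
fea(:,zeroIdx) = [];
disp(size(fea))   % 4 documents remain, with 4 of the 5 terms
%===========================================
On the real data, 'sampleIdx' and 'zeroIdx' are loaded from the case file rather than computed.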
The homepage of the 20 Newsgroups data set can be found here. We use the version sorted by date (20news-bydate.tar.gz). The original website reports 18941 documents, which is not correct: there are only 18846 documents, of which 11314 (60%) are for training and 7532 (40%) are for testing.
The original provider recommends the bydate version: "I recommend the "bydate" version since cross-experiment comparison is easier (no randomness in train/test set selection), newsgroup-identifying information has been removed and it's more realistic because the train and test sets are separated in time."
Data File:
The data file contains the variables 'fea', 'gnd', 'trainIdx' and 'testIdx'.
'fea' is the term-document matrix, in which each row is a document; 'gnd' holds the label of each document; 'trainIdx' and 'testIdx' are the row indices of the training and test documents.
Feature File:
The feature file lists the word corresponding to each dimension (column of 'fea'). The number given with each word is its document frequency (df).
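Assuming 'df' has its usual meaning (the number of documents in which a word occurs at least once), it could be recomputed from 'fea' as sketched below. The toy matrix is an assumption for illustration and is not taken from the data file.
%===========================================
% Toy term-document counts (assumption): 4 documents (rows), 3 terms (columns).
fea = sparse([2 0 1; 0 0 3; 1 1 0; 0 0 5]);
% Document frequency: count the rows with a nonzero entry in each column.
df = full(sum(fea > 0, 1));
disp(df)   % df is [2 1 3]
%===========================================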
The following MATLAB code can be used to generate the training and test sets:
%===========================================
feaTrain = fea(trainIdx,:);   % training documents
gndTrain = gnd(trainIdx,:);   % training labels
feaTest = fea(testIdx,:);     % test documents
gndTest = gnd(testIdx,:);     % test labels
%===========================================
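A quick sanity check on a split might look like the sketch below. The toy index vectors are assumptions for illustration. With the real data file, 'trainIdx' and 'testIdx' come from the file, 'nSmp' would be size(fea,1), and the counts stated above (11314 and 7532 out of 18846) should give roughly 60% and 40%.
%===========================================
% Toy stand-ins (assumption): 10 documents, first 6 for training, last 4 for testing.
nSmp = 10;
trainIdx = (1:6)';
testIdx = (7:10)';
% The two index sets should not overlap and should together cover every document.
assert(isempty(intersect(trainIdx, testIdx)));
assert(isequal(sort([trainIdx; testIdx]), (1:nSmp)'));
% Fraction of documents in each part of the split.
trainRatio = numel(trainIdx) / nSmp;
testRatio = numel(testIdx) / nSmp;
fprintf('train: %.0f%%, test: %.0f%%\n', 100*trainRatio, 100*testRatio);
%===========================================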
Besides the original (60%, 40%) split, we also provide other splits (5%, 10%, ... training). Unlike the original split, these splits are purely random (not separated in time).
5% Training | 10% Training | 20% Training | 30% Training | 40% Training | 50% Training
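For illustration only, a purely random split of this kind could be drawn as in the sketch below. This is not necessarily how the provided index files were produced, and for comparable experiments the provided files should be used. The document count 'nSmp' and the variable 'ratio' are hypothetical values chosen for the example.
%===========================================
nSmp = 200;     % toy number of documents (assumption)
ratio = 0.10;   % fraction of documents used for training
% Shuffle the document indices and cut the permutation at the training size.
perm = randperm(nSmp);
nTrain = round(ratio * nSmp);
trainIdx = sort(perm(1:nTrain))';
testIdx = sort(perm(nTrain+1:end))';
fprintf('%d training and %d test documents\n', numel(trainIdx), numel(testIdx));
%===========================================
Because the permutation ignores document order, the resulting training and test sets are not separated in time.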