Reuters-21578 text categorization collection 
(after preprocessing by Gytis Karciauskas)


The original Reuters-21578 text categorization collection is available at the UCI repository.
The files below contain the Reuters data as preprocessed by Gytis Karciauskas.

The data are split into training and testing sets according to the time of publication of the documents (the ModApte split). Classes containing only one document are eliminated, together with the corresponding documents. The resulting training set contains 7769 documents and the testing set 3018 documents. After removing function words and words that appear in only one document, the feature set contains 15715 words. Additionally, only the ten classes containing the most documents are selected for testing. Finally, for each class only the 500 most informative words (measured by information gain) are selected. Note that the last column in each dataset denotes whether the record belongs to the corresponding class or not.
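
To illustrate the word selection step, here is a minimal Python sketch of ranking words by information gain for one class. It is not the code used to produce these files. It assumes each document is reduced to the set of words it contains (binary presence) and that the class label is 0 or 1; the exact variant of information gain used in the preprocessing is not documented here. The function names entropy, information_gain and top_k_words are hypothetical.

```python
import math


def entropy(pos, neg):
    """Entropy in bits of a group with `pos` positive and `neg` negative documents."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(docs, labels, word):
    """Information gain of the presence of `word` with respect to a 0/1 class label.

    docs   -- list of sets of words (assumption: binary word presence)
    labels -- list of 0/1 values, one per document
    """
    n = len(docs)
    if n == 0:
        return 0.0
    with_pos = with_neg = without_pos = without_neg = 0
    for words, label in zip(docs, labels):
        if word in words:
            if label:
                with_pos += 1
            else:
                with_neg += 1
        else:
            if label:
                without_pos += 1
            else:
                without_neg += 1
    n_with = with_pos + with_neg
    n_without = n - n_with
    # Class entropy before the split, minus the weighted entropy after
    # splitting the documents on whether they contain the word.
    h_class = entropy(with_pos + without_pos, with_neg + without_neg)
    h_cond = (n_with / n) * entropy(with_pos, with_neg) \
        + (n_without / n) * entropy(without_pos, without_neg)
    return h_class - h_cond


def top_k_words(docs, labels, k):
    """Return the k words with the highest information gain (ties broken alphabetically)."""
    vocabulary = set().union(*docs)
    ranked = sorted(vocabulary,
                    key=lambda w: (-information_gain(docs, labels, w), w))
    return ranked[:k]
```

Under these assumptions, calling top_k_words with k = 500 on the training documents and the labels of one class would give a word list of the same size as the one described above, though not necessarily the same words.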

Class     Documents of this class   Training set            Testing set
          in the testing set
earn      1087                      TRAIN_EARN_DAT.GZ       TEST_EARN_DAT.GZ
acq       719                       TRAIN_ACQ_DAT.GZ        TEST_ACQ_DAT.GZ
crude     189                       TRAIN_CRUDE_DAT.GZ      TEST_CRUDE_DAT.GZ
money-fx  179                       TRAIN_MONEY_FX_DAT.GZ   TEST_MONEY_FX_DAT.GZ
grain     149                       TRAIN_GRAIN_DAT.GZ      TEST_GRAIN_DAT.GZ
interest  131                       TRAIN_INTEREST_DAT.GZ   TEST_INTEREST_DAT.GZ
trade     117                       TRAIN_TRADE_DAT.GZ      TEST_TRADE_DAT.GZ
ship      89                        TRAIN_SHIP_DAT.GZ       TEST_SHIP_DAT.GZ
wheat     71                        TRAIN_WHEAT_DAT.GZ      TEST_WHEAT_DAT.GZ
corn      56                        TRAIN_CORN_DAT.GZ       TEST_CORN_DAT.GZ
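
A possible way to read one of these files in Python is sketched below. The file layout is an assumption, not something stated on this page: the sketch supposes one record per line, with numeric fields separated by whitespace or commas, and the class membership flag (0 or 1) in the last column. The function name load_dataset is hypothetical; adjust the parsing if the actual files differ.

```python
import gzip


def load_dataset(path):
    """Read a gzip-compressed dataset file into (features, labels).

    Assumed layout (not confirmed by the source): one record per line,
    numeric fields separated by whitespace or commas, with the last
    column holding the 0/1 class membership flag.
    """
    features, labels = [], []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.replace(",", " ").split()
            if not fields:
                continue  # skip blank lines
            values = [float(field) for field in fields]
            features.append(values[:-1])    # word features
            labels.append(int(values[-1]))  # last column: class membership
    return features, labels
```

With files in that layout, load_dataset would be called once for the training file and once for the testing file of a class, giving two feature matrices with the same columns.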