Reuters-21578 text categorization collection
(after preprocessing by Gytis Karciauskas)

The original Reuters-21578 text categorization collection is available at the UCI repository.
Below we make available the Reuters data as preprocessed by Gytis Karciauskas.

The data are split into training and testing sets according to the time of publication of the documents (the ModApte split). Classes containing only one document are eliminated, together with the corresponding documents. The resulting training set contains 7769 documents and the testing set 3018 documents. After removing function words and words that appear in only one document, the feature set contains 15715 words. Additionally, only the ten classes containing the most documents are selected for testing. Finally, for each class, only the 500 most informative words (measured by information gain) are selected. Note that the last column in each dataset denotes whether the record belongs to the corresponding class or not.
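
To make the last selection step concrete, here is a minimal sketch of ranking words by information gain for one binary class. It is not the preprocessing code actually used for these files. It assumes each document is represented as a set of words and each label is 1 or 0 for class membership, and the names entropy, information_gain and top_k_words are hypothetical helpers introduced only for illustration.

```python
import math


def entropy(pos, total):
    """Binary entropy in bits of a group with `pos` positives out of `total`."""
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(docs, labels, word):
    """IG = H(class) - H(class | word present or absent).

    Assumption: `docs` is a list of sets of words and `labels` is a list
    of 0/1 class-membership flags of the same length.
    """
    n = len(docs)
    with_word = [y for d, y in zip(docs, labels) if word in d]
    without_word = [y for d, y in zip(docs, labels) if word not in d]
    conditional = (
        len(with_word) / n * entropy(sum(with_word), len(with_word))
        + len(without_word) / n * entropy(sum(without_word), len(without_word))
    )
    return entropy(sum(labels), n) - conditional


def top_k_words(docs, labels, k=500):
    """Return the k words with the highest information gain (ties broken alphabetically)."""
    vocabulary = set().union(*docs)
    ranked = sorted(vocabulary, key=lambda w: (-information_gain(docs, labels, w), w))
    return ranked[:k]


# Toy example: two documents in the class, two outside it.
docs = [{"profit", "net"}, {"profit", "dividend"}, {"oil", "barrel"}, {"oil", "net"}]
labels = [1, 1, 0, 0]
best = top_k_words(docs, labels, k=2)
```

In the toy example, "profit" and "oil" each separate the two groups perfectly, so they rank first, while "net" occurs equally often in both groups and carries no information about the class.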

Class	Number of documents of this class in the testing set	Training set	Testing set
earn	1087	TRAIN_EARN_DAT.GZ	TEST_EARN_DAT.GZ
acq	719	TRAIN_ACQ_DAT.GZ	TEST_ACQ_DAT.GZ
crude	189	TRAIN_CRUDE_DAT.GZ	TEST_CRUDE_DAT.GZ
money-fx	179	TRAIN_MONEY_FX_DAT.GZ	TEST_MONEY_FX_DAT.GZ
grain	149	TRAIN_GRAIN_DAT.GZ	TEST_GRAIN_DAT.GZ
interest	131	TRAIN_INTEREST_DAT.GZ	TEST_INTEREST_DAT.GZ
trade	117	TRAIN_TRADE_DAT.GZ	TEST_TRADE_DAT.GZ
ship	89	TRAIN_SHIP_DAT.GZ	TEST_SHIP_DAT.GZ
wheat	71	TRAIN_WHEAT_DAT.GZ	TEST_WHEAT_DAT.GZ
corn	56	TRAIN_CORN_DAT.GZ	TEST_CORN_DAT.GZ
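
The only property of the file layout stated above is that the last column of each record is the class-membership flag. A reader for the files might therefore look like the following sketch. It assumes, without confirmation from the source, that each file is gzip-compressed text with one record per line and fields separated by whitespace (or by the separator passed as sep). The function name load_records is hypothetical, and fields are kept as strings because their encoding is not described here.

```python
import gzip


def load_records(path, sep=None):
    """Read a gzip-compressed text file into (features, label) pairs.

    Assumptions (not confirmed by the dataset description): one record
    per line, fields separated by `sep` (any whitespace when sep is None).
    The last field is taken as the class-membership column.
    """
    records = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.strip().split(sep)
            if not fields or fields == [""]:
                continue  # skip blank lines
            *features, label = fields
            records.append((features, label))
    return records
```

If the actual files use a different delimiter or a header line, the sep argument or the loop body would need to be adjusted accordingly.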
