Classify documents by topic using the Reuters-21578 dataset.
The required packages are listed in requirements.txt.
To install these packages, run the following command inside a virtualenv.
$ pip install -r requirements.txt
The data comes from the Reuters-21578 files.
They are available in SGM format in
classification/data/
The topics of the training data are listed in
classification/data/all-topics-strings.lc.txt
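
As a rough illustration of what the SGM files contain, here is a minimal sketch that pulls topics and body text out of Reuters-21578 style markup using only the standard library. The sample string, the regular expressions, and the function name parse_sgm are assumptions made for this example; the repository's own loader may work differently.

```python
import re

# A tiny inline sample in the Reuters-21578 SGML layout.
# The article text is made up for illustration, not copied from the dataset.
SAMPLE_SGM = """<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" NEWID="1">
<TOPICS><D>cocoa</D></TOPICS>
<TEXT>
<TITLE>COCOA PRICES RISE</TITLE>
<BODY>Cocoa prices rose sharply this week.</BODY>
</TEXT>
</REUTERS>
<REUTERS TOPICS="YES" LEWISSPLIT="TEST" NEWID="2">
<TOPICS><D>grain</D><D>wheat</D></TOPICS>
<TEXT>
<TITLE>WHEAT EXPORTS FALL</TITLE>
<BODY>Wheat exports fell in March.</BODY>
</TEXT>
</REUTERS>
"""

ARTICLE_RE = re.compile(
    r"<REUTERS(?P<attrs>[^>]*)>(?P<content>.*?)</REUTERS>", re.DOTALL
)
TOPICS_RE = re.compile(r"<TOPICS>(.*?)</TOPICS>", re.DOTALL)
TOPIC_RE = re.compile(r"<D>(.*?)</D>")
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)
SPLIT_RE = re.compile(r'LEWISSPLIT="([^"]*)"')


def parse_sgm(text):
    """Return one dict per article with its split, topics and body text."""
    articles = []
    for match in ARTICLE_RE.finditer(text):
        attrs = match.group("attrs")
        content = match.group("content")
        split = SPLIT_RE.search(attrs)
        topics = TOPICS_RE.search(content)
        body = BODY_RE.search(content)
        articles.append({
            "split": split.group(1) if split else None,
            # Only <D> entries inside <TOPICS> count as topic labels.
            "topics": TOPIC_RE.findall(topics.group(1)) if topics else [],
            "body": body.group(1).strip() if body else "",
        })
    return articles


articles = parse_sgm(SAMPLE_SGM)
for article in articles:
    print(article["split"], article["topics"], article["body"])
```

When reading the real files, opening them with a permissive encoding such as latin-1 may be necessary, since not every file in the collection is valid UTF-8.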
To train and test, run the following from classification/
$ python train_and_classify_reuters_data.py
Flags
--no-stemming # don't use stemming when transforming raw data
# or
--no-stopwords # don't remove stopwords when transforming raw data
The last flag, if given, selects the classifier
--svm # use Support Vector Machine classifier
--naive-bayes # use Naive-Bayes
--perceptron # use Perceptron
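
To make the effect of these flags concrete, here is a minimal sketch of how they could be parsed and how the two preprocessing switches could change tokenization. It is not the actual code of train_and_classify_reuters_data.py: the names build_parser, preprocess and crude_stem, the short stopword list, and the suffix-stripping rule are all hypothetical stand-ins. It also assumes that the two preprocessing flags can be combined and that the classifier flags are mutually exclusive, neither of which is stated above.

```python
import argparse
import re

# Tiny illustrative stopword list; a real run would use a much fuller one.
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def crude_stem(word):
    """Strip a few common suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(text, use_stemming=True, remove_stopwords=True):
    """Lowercase and tokenize, then optionally drop stopwords and stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if use_stemming:
        tokens = [crude_stem(t) for t in tokens]
    return tokens


def build_parser():
    """Hypothetical parser mirroring the flags listed in this README."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--no-stemming", action="store_true")
    parser.add_argument("--no-stopwords", action="store_true")
    # Assumption: at most one classifier flag may be given.
    group = parser.add_mutually_exclusive_group()
    for name in ("svm", "naive-bayes", "perceptron"):
        group.add_argument(
            "--" + name, dest="classifier", action="store_const", const=name
        )
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--no-stemming", "--svm"])
sentence = "The traders were selling wheat in the markets"

tokens_default = preprocess(sentence)
tokens_from_flags = preprocess(
    sentence,
    use_stemming=not args.no_stemming,
    remove_stopwords=not args.no_stopwords,
)
print(args.classifier, tokens_from_flags)
```

With --no-stemming the tokens keep their original endings, while the default path reduces them with the crude suffix rule, which is the kind of difference the flags are meant to let you compare across classifiers.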