document-classification-reuters21578

Classify documents by topic, using the Reuters-21578 dataset.


Project maintained by marius92mc. Hosted on GitHub Pages.

Document Classification Reuters-21578

Classify documents by topic, using the Reuters-21578 data.

Requirements

Please see requirements.txt.
To install these packages, run the following command inside a virtualenv.

$ pip install -r requirements.txt

Training data

The training data comes from the Reuters-21578 files.
It is available in SGML (.sgm) format in

classification/data/ 

The topics of the training data are listed in

classification/data/all-topics-strings.lc.txt
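As a rough illustration of how the data files above could be read, here is a minimal sketch that uses only the Python standard library. It relies on the standard Reuters-21578 markup (`<REUTERS>`, `<TOPICS>`, `<D>`, `<BODY>`) and assumes the topics file holds one topic name per line. The function names `parse_sgm` and `load_topics` are hypothetical and are not taken from this project's code.

```python
import re

# Patterns for the standard Reuters-21578 SGML markup.
ARTICLE_RE = re.compile(r"<REUTERS(?P<attrs>[^>]*)>(?P<content>.*?)</REUTERS>", re.DOTALL)
TOPICS_RE = re.compile(r"<TOPICS>(.*?)</TOPICS>", re.DOTALL)
TOPIC_RE = re.compile(r"<D>(.*?)</D>")
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)


def parse_sgm(text):
    """Yield a (topics, body) pair for each article in the text of one .sgm file."""
    for match in ARTICLE_RE.finditer(text):
        content = match.group("content")
        topics_block = TOPICS_RE.search(content)
        # Each topic is wrapped in <D>...</D>; articles without topics give [].
        topics = TOPIC_RE.findall(topics_block.group(1)) if topics_block else []
        body = BODY_RE.search(content)
        # Some articles have no <BODY>; return an empty string for those.
        yield topics, (body.group(1).strip() if body else "")


def load_topics(path):
    """Read a topics file, assuming one topic name per line (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

A file from classification/data/ could then be read into a string and passed to `parse_sgm`. The .sgm files are not guaranteed to be valid UTF-8, so opening them with `encoding="latin-1"` is probably the safer choice.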

To train and test, run the following from classification/

Train

$ python train_and_classify_reuters_data.py 

Flags

--no-stemming  # don't apply stemming when transforming the raw data
# or
--no-stopwords # don't remove stopwords when transforming the data

The last flag, if given, selects the classifier

--svm         # use Support Vector Machine classifier 
--naive-bayes # use Naive-Bayes
--perceptron  # use Perceptron
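To make the flag combinations concrete, the sketch below shows one way such a command line could be parsed with `argparse`. It is not the project's actual parser. The function name `build_parser`, the `classifier` attribute, and the `None` default are assumptions. It also treats the two preprocessing flags as independent and does not require the classifier flag to come last.

```python
import argparse


def build_parser():
    """Build a parser for the documented flags (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        description="Train and classify Reuters-21578 documents."
    )
    # Preprocessing switches: both steps are on unless disabled.
    parser.add_argument("--no-stemming", action="store_true",
                        help="don't apply stemming when transforming the raw data")
    parser.add_argument("--no-stopwords", action="store_true",
                        help="don't remove stopwords when transforming the data")
    # At most one classifier flag may be given.
    group = parser.add_mutually_exclusive_group()
    for name in ("svm", "naive-bayes", "perceptron"):
        group.add_argument("--" + name, dest="classifier",
                           action="store_const", const=name)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # classifier is None when no classifier flag is given; the project's
    # real default is not stated in this README.
    print(args.no_stemming, args.no_stopwords, args.classifier)
```

With this sketch, `--no-stemming --svm` gives `no_stemming=True`, `no_stopwords=False`, and `classifier="svm"`.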

Learning methods used

Support Vector Machine
Naive-Bayes
Perceptron
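For readers who want to try the three methods side by side, here is a minimal sketch assuming scikit-learn is installed. This README does not say which library or feature representation the project uses, so the TF-IDF features, the `build_model` helper, and the toy documents are illustrative assumptions. Reuters-21578 articles can carry several topics, but this sketch assigns one topic per document for simplicity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Map each documented method to a scikit-learn estimator (assumed choice).
CLASSIFIERS = {
    "svm": LinearSVC,
    "naive-bayes": MultinomialNB,
    "perceptron": Perceptron,
}


def build_model(name):
    """Return a TF-IDF + classifier pipeline for the given method name."""
    return make_pipeline(TfidfVectorizer(stop_words="english"), CLASSIFIERS[name]())


# Toy data for illustration only, not taken from Reuters-21578.
train_docs = [
    "wheat harvest exports rise",
    "corn and wheat prices fall",
    "oil prices climb as opec cuts output",
    "crude oil barrel futures slip",
]
train_topics = ["grain", "grain", "crude", "crude"]

if __name__ == "__main__":
    for name in CLASSIFIERS:
        model = build_model(name).fit(train_docs, train_topics)
        print(name, model.predict(["wheat exports expected to rise"]))
```

Wrapping the vectorizer and the classifier in one pipeline keeps the text transformation identical at training and prediction time, whichever method is selected.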