Kaggle Keyword Extraction competition by Facebook. Implemented in Python.
Features: known keywords, multi-word keywords, unknown keywords in title
Scoring: Naive Bayes probability, tf-idf
Heuristics: #Tags >= 1, 'tag-dash-kws' == 'tag dash kws', c#
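
The heuristics above can be illustrated with a minimal sketch. This is one reading of them, not the repository's code: a dashed tag is matched against its space-separated form in the title, the tokenizer keeps '#' so that c# is not reduced to c, and at least one tag is always returned. The names tokenize, find_known_tags, predict_tags, and fallback_tag are hypothetical.

```python
import re

# Hypothetical tokenizer: keeps '#', '+', '.' and '-' inside tokens
# so that tags such as 'c#' or 'node.js' survive tokenization.
TOKEN_RE = re.compile(r"[a-z0-9][a-z0-9#+.\-]*")


def tokenize(text):
    """Lowercase the text and return tokens, dropping trailing '.' or '-'."""
    return [tok.rstrip(".-") for tok in TOKEN_RE.findall(text.lower())]


def find_known_tags(title, known_tags):
    """Return known tags present in the title.

    A tag such as 'ruby-on-rails' also matches the phrase 'ruby on rails'.
    """
    joined = " " + " ".join(tokenize(title)) + " "
    found = []
    for tag in known_tags:
        spaced = tag.replace("-", " ")
        if f" {tag} " in joined or f" {spaced} " in joined:
            found.append(tag)
    return found


def predict_tags(title, known_tags, fallback_tag):
    """Always return at least one tag (assumed meaning of '#Tags >= 1')."""
    tags = find_known_tags(title, known_tags)
    return tags if tags else [fallback_tag]
```

For example, predict_tags("How do I use Ruby on Rails with C#?", ["ruby-on-rails", "c#", "java"], "java") picks up both the multi-word tag and c#.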
- Clone repository
- Create virtual environment
virtualenv env
- Install required libraries
pip install -r requirements.txt
- Download Train.zip from the Kaggle site to the local data/ folder
- Split training data into new train and test files using split_data.py
  - creates train and test csvs (94 sec)
- Train models using train.py
  - creates json models in classifiers/ (161 sec)
- Generate predictions using predict.py
  - creates prediction csv data/Pred.csv
- Evaluate predictions using evaluate.py
  - outputs mean precision, recall, and F1 score
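
The split step above holds out a random sample of the training rows as a local test set. A minimal sketch of such a split follows. It does not reproduce split_data.py: the function name split_rows, the test_fraction default, and the seed are assumptions, since the sample size used in this project is not stated here.

```python
import random


def split_rows(rows, test_fraction=0.1, seed=0):
    """Randomly split rows into (train, test), keeping the original row order.

    test_fraction and seed are illustrative defaults, not the project's values.
    """
    rng = random.Random(seed)          # local RNG so the split is reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])   # first n_test shuffled indices form the test set
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test
```

Fixing the seed keeps the train and test files identical across runs, which makes scores from separate runs comparable.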
- Input data files (Train, Test, Pred) are in the local data/ folder
- Test file generated from random sample of Train file due to inability to score Test file on Kaggle
- Mean F1 scoring based on Kaggle's wiki
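
As a rough illustration of the metric, the sketch below computes precision, recall, and F1 per question from the true and predicted tag sets, then averages each over all questions. It assumes that per-question averaging is what mean F1 refers to, and it is not the code in evaluate.py. The function names are hypothetical.

```python
def f1_per_row(true_tags, pred_tags):
    """Return (precision, recall, f1) for one question's tag sets."""
    true_set, pred_set = set(true_tags), set(pred_tags)
    tp = len(true_set & pred_set)      # tags predicted correctly
    if tp == 0:
        return 0.0, 0.0, 0.0           # also covers empty predictions
    precision = tp / len(pred_set)
    recall = tp / len(true_set)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def mean_scores(true_rows, pred_rows):
    """Average precision, recall, and F1 over all questions."""
    rows = [f1_per_row(t, p) for t, p in zip(true_rows, pred_rows)]
    if not rows:
        raise ValueError("no rows to score")
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))
```

Note that the mean of per-question F1 scores generally differs from the F1 computed from mean precision and mean recall.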