Scalable Document Classification
-
Removing Stopwords
Removing stopwords partially addresses a weakness of the bag-of-words model. Stopwords are words that carry little "informational content" for the model, so they do not contribute to successful classification. By removing them, the more "meaningful" words receive higher weight, which in turn helps assign each document an accurate label.
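As a minimal sketch (the stopword set below is a small hand-picked example, not the project's actual list; in practice a library list such as NLTK's English stopwords is common), removal might look like:

```python
# Stopword-removal sketch; the stopword set is a small hand-picked
# example for illustration, not an exhaustive list.
STOPWORDS = {"a", "an", "the", "to", "of", "and", "is", "in", "too"}

def remove_stopwords(tokens):
    """Keep only tokens that are not in the stopword set."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("John likes to watch movies".split()))
# ['John', 'likes', 'watch', 'movies']
```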
-
Stemming
Different forms of a word appear in documents for grammatical reasons, such as work, works, working, and worked. In addition, words such as democracy, democratic, and democratization are related words with similar meanings. It is useful to treat such words as a single term, which is what stemming does: it reduces each variant to a common root form.
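As a sketch, using NLTK's Porter stemmer (one common choice; the stemmer actually used here is an assumption):

```python
# Stemming sketch with NLTK's Porter stemmer (an assumed choice;
# other stemmers or lemmatizers behave similarly).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["work", "works", "working", "worked"]:
    print(word, "->", stemmer.stem(word))
# Each variant maps to the common stem "work".
```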
-
Removing Punctuation
Punctuation normally appears immediately before, after, or between words, so it cannot be removed simply by splitting on spaces. Splitting alone produces many noisy tokens (for example, "movies." and "movies"), and each of these would be treated as a different word, which could cause information loss in classification. Therefore, removing punctuation is essential.
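A minimal sketch using only the Python standard library (the project's actual approach is an assumption):

```python
import string

# Strip punctuation before tokenizing, so that "movies." and
# "movies" become the same token.
def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

print(remove_punctuation("John likes to watch movies. Mary likes movies too.").split())
# ['John', 'likes', 'to', 'watch', 'movies', 'Mary', 'likes', 'movies', 'too']
```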
-
Bag-of-words
Bag-of-words, or BOW for short, is a simple feature extraction method. It is a representation of text that describes the occurrence of words within a document. It is called a "bag" of words because it ignores the order of words and cares only about which words appear in the document and how often.
For example, for the two text documents as below:
(1) John likes to watch movies. Mary likes movies too.
(2) John also likes to watch football games.
Based on these documents, a list of distinct words is created:
[ "John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games" ]
From here, document vectors can be created from the count of each word in each document. In this case, the vectors look like:
(1) [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
(2) [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
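The following sketch reproduces the example above from scratch (a minimal illustration; a real pipeline would typically use a library vectorizer):

```python
# Bag-of-words sketch: build the vocabulary in first-seen order,
# then count word occurrences per document. Punctuation is assumed
# to be already removed, as described earlier.
docs = [
    "John likes to watch movies Mary likes movies too",
    "John also likes to watch football games",
]

vocab = []
for doc in docs:
    for word in doc.split():
        if word not in vocab:
            vocab.append(word)

vectors = [[doc.split().count(word) for word in vocab] for doc in docs]
print(vocab)    # ['John', 'likes', 'to', 'watch', 'movies', 'Mary', 'too', 'also', 'football', 'games']
print(vectors)  # [[1, 2, 1, 1, 2, 1, 1, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]]
```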
-
TF-IDF
A problem with the simple bag-of-words model is that words appearing very frequently start to dominate the document representation with large scores, even though such words may not carry much "informational content" for the model.
TF-IDF rescales the frequency of each word by how often it appears across all documents, which weakens or removes this effect.
TF-IDF is short for Term Frequency – Inverse Document Frequency, where:
Term Frequency: a score for how frequently the word occurs in the current document.
Inverse Document Frequency: a score for how rare the word is across documents. IDF is calculated as log(N / n_t), where N is the total number of documents and n_t is the number of documents in which the term t appears.
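A small sketch of these formulas, reusing the two example documents above (an illustration of the definitions, not the project's implementation):

```python
import math

# TF-IDF sketch following the formulas above: tf-idf(t, d) = tf * idf,
# with idf(t) = log(N / n_t), where N is the number of documents and
# n_t the number of documents containing term t.
docs = [
    ["John", "likes", "to", "watch", "movies", "Mary", "likes", "movies", "too"],
    ["John", "also", "likes", "to", "watch", "football", "games"],
]
N = len(docs)

def idf(term):
    n_t = sum(1 for doc in docs if term in doc)
    return math.log(N / n_t)

def tf_idf(term, doc):
    return doc.count(term) * idf(term)

print(tf_idf("movies", docs[0]))  # 2 * log(2/1) ~ 1.386: frequent here, rare elsewhere
print(tf_idf("likes", docs[0]))   # 2 * log(2/2) = 0.0: appears in every document
```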
There are several possible models for document classification, for example Naive Bayes, Logistic Regression, Random Forest, and K-Nearest Neighbors; a small sketch of one such pipeline follows.
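As a hedged sketch (scikit-learn is an assumed library choice, and the tiny training set and labels below are hypothetical; the project's actual implementation may differ), a TF-IDF pipeline with one of these classifiers could look like:

```python
# TF-IDF + Naive Bayes pipeline sketch using scikit-learn (an assumed
# choice; LogisticRegression, RandomForestClassifier, or
# KNeighborsClassifier could be swapped in for MultinomialNB).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = ["John likes to watch movies", "John also likes to watch football games"]
train_labels = ["movies", "sports"]  # hypothetical labels for illustration

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)
print(model.predict(["Mary likes movies too"]))  # ['movies']
```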