Abstract: |
We describe a Text Categorization (TC) classifier that does not require a target function. In TC, documents are assigned to a set of predefined, labeled categories. Automated TC can be done either by defining fixed classification rules or by applying machine learning. Machine-learning-based TC is usually supervised: the learner is trained on example document-to-category assignments (the target function). When TC is first introduced for an application, or when new topics emerge, such examples are not easy to obtain because they are time-intensive to create and can require domain experts. Unsupervised document classification eliminates the need for such training examples. We describe a method capable of performing unsupervised machine-learning-based TC. Our method provides quick, tangible classification results that allow for interactive user feedback and result validation. After uploading a document, the user can confirm or correct the category assignment. This allows our system to incrementally create a target function that a regular supervised learning classifier can use to produce better results than the initial unsupervised system. For this to work, classifications must be performed in a time acceptable to the user uploading documents. We based our method on word embedding semantics with three different implementation approaches, each evaluated using the reuters21578 benchmark (Lewis, 2004), the MAUI citeulike180 benchmark (Medelyan et al., 2009), and a self-compiled corpus of 925 scientific documents taken from the Cornell University Library arXiv.org digital library (Cornell University Library, 2016). Our method has the following advantages: Compared to keyword extraction techniques, our system can assign documents to categories that are labeled with words that do not literally occur in the document. Compared to usual supervised learning classifiers, no target function is required.
Because no target function is required, the system cannot overfit. Compared to document clustering algorithms, our method assigns documents to predefined categories and does not create unlabeled groupings of documents. In our experiments, the system achieves up to 66.73 % precision, 41.8 % recall and 41.09 % F1 (all on reuters21578) using macroaveraging. Using microaveraging, similar effectiveness is obtained. Even though these results are below those of contemporary supervised classifiers, the system can be adopted in situations where no training data is available or where text needs to be assigned to new categories capturing freshly emerging knowledge. It requires no manually collected resources and works fast enough to gather feedback interactively, thereby creating a target function for a regular classifier. |
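
To illustrate the general idea of assigning a document to a predefined category label through word embedding semantics, the following is a minimal sketch and not one of the three implementation approaches evaluated here, which the abstract does not specify. It assumes a document is represented by the mean of its word vectors and is assigned to the category whose label vector has the highest cosine similarity. The `EMBEDDINGS` table, its toy values, and the function names `mean_vector`, `cosine`, and `classify` are all hypothetical; a real system would load pretrained embeddings.

```python
import math

# Toy 3-dimensional word vectors, invented purely for illustration.
# Assumption: a real system would load pretrained word embeddings instead.
EMBEDDINGS = {
    "stock":   [0.9, 0.1, 0.0],
    "shares":  [0.8, 0.2, 0.1],
    "market":  [0.7, 0.3, 0.0],
    "finance": [0.9, 0.2, 0.0],
    "wheat":   [0.1, 0.9, 0.1],
    "harvest": [0.0, 0.8, 0.2],
    "grain":   [0.1, 0.9, 0.0],
}


def mean_vector(words):
    """Average the vectors of all known words; None if no word is known."""
    vecs = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, categories):
    """Return the category label most similar to the document, or None."""
    doc_vec = mean_vector(text.lower().split())
    if doc_vec is None:
        return None
    scores = {}
    for label in categories:
        label_vec = mean_vector(label.lower().split())
        if label_vec is not None:
            scores[label] = cosine(doc_vec, label_vec)
    return max(scores, key=scores.get) if scores else None
```

In this sketch, `classify("stock shares market", ["finance", "grain"])` selects `finance` even though that word does not occur in the text, which mirrors the advantage over keyword extraction noted above. No labeled training examples are used at any point.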