ISSN : 2319-7323





INTERNATIONAL JOURNAL OF COMPUTER SCIENCE ENGINEERING


Open Access

ABSTRACT

Title : A Machine Learning Approach for Text and Document Mining
Authors : Hrishikesh Deka, Parismita Sarma
Keywords : Text classification, KNN classification algorithm, Cluster, Decision tree
Issue Date : May 2017
Abstract : Text Classification, or Text Categorization, automatically assigns each document in a collection to its respective category. With the rapid growth of the World Wide Web and the increasing volume of electronic documents, Text Categorization has become an essential method for knowledge discovery and for organizing information. Different tools from Information Retrieval (IR) and Machine Learning (ML) are used in the classification process. This paper reviews machine learning approaches to text classification and proposes a new system for efficient classification. KNN (K Nearest Neighbor) is one of the most promising classification methods for information retrieval. Its main disadvantage is very high computational complexity, because it compares each document to be classified against all the training samples. A combination of the traditional KNN classification algorithm and the K-Means clustering algorithm is proposed to overcome this difficulty. After the preprocessing steps are performed, the terms are weighted using Term Frequency-Inverse Document Frequency (TF-IDF). The K-Means clustering algorithm is then used to group the training samples, and the cluster centers are taken as the new training samples. The KNN classification algorithm is then applied to determine the category of each document, and a Decision Tree algorithm is applied to determine its subcategory. Accuracy will be evaluated using precision, recall, and F-measure.
Page(s) : 115-123
Source : Vol. 6, No. 5
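
Illustrative sketch : The pipeline outlined in the abstract (TF-IDF weighting, K-Means reduction of the training set, then KNN over the cluster centers) might look roughly like the following pure-Python code. This is not the authors' implementation. Several details are assumptions made only for illustration: whitespace tokenization stands in for the preprocessing steps, K-Means is run separately inside each category so that every center inherits a category label, cosine similarity on length-normalized vectors is used as the distance, and the first k vectors seed the clusters. All function names, the parameter k_per_class, and the toy documents are hypothetical. The Decision Tree stage for subcategories and the precision/recall/F-measure evaluation are omitted.

```python
import math
from collections import Counter, defaultdict

def weigh(tokens, idf):
    """Turn a token list into a unit-length TF-IDF vector (dict term -> weight)."""
    tf = Counter(t for t in tokens if t in idf)  # terms unseen in training are dropped
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def tfidf_vectors(docs):
    """Return (vectors, idf). Assumption: lowercase whitespace tokens, smoothed IDF."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [weigh(toks, idf) for toks in tokenized], idf

def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def centroid(vectors):
    """Length-normalized sum of the vectors (the direction of their mean)."""
    total = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}

def kmeans(vectors, k, iterations=10):
    """K-Means with cosine similarity; assumption: first k vectors are the seeds."""
    centers = [dict(v) for v in vectors[:k]]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in vectors:
            best = max(range(len(centers)), key=lambda i: cosine(v, centers[i]))
            groups[best].append(v)
        # An empty cluster keeps its previous center.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    return centers

def reduce_training_set(vectors, labels, k_per_class):
    """Replace each category's samples by its cluster centers (assumed per-category)."""
    by_label = defaultdict(list)
    for v, y in zip(vectors, labels):
        by_label[y].append(v)
    reduced = []
    for y, vs in by_label.items():
        reduced += [(c, y) for c in kmeans(vs, k_per_class)]
    return reduced

def knn_predict(query, reduced, k):
    """Majority vote among the k labeled centers most similar to the query."""
    nearest = sorted(reduced, key=lambda cy: cosine(query, cy[0]), reverse=True)[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

# Toy data, invented for this sketch.
train_docs = [
    "football match goal team", "team wins football cup", "goal scored in match",
    "stock market shares fall", "bank shares rise market", "market investors buy stock",
]
train_labels = ["sports"] * 3 + ["finance"] * 3

vectors, idf = tfidf_vectors(train_docs)
reduced = reduce_training_set(vectors, train_labels, k_per_class=2)
query = weigh("football team scored a goal".split(), idf)
prediction = knn_predict(query, reduced, k=1)
print(prediction)
```

With this reduction, KNN compares a query against len(reduced) centers (four in the toy data) instead of all six training documents, which is the source of the computational saving the abstract describes; on a realistic corpus the gap would be far larger.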