Document Type : Original Article
Authors
NA
Abstract
Objective:
This study develops a data-driven model that automatically identifies the subject of a book using machine learning and text mining techniques. It uses a structured dataset of book titles, summaries, and subjects. The goal is a model that is both accurate and practical for real-world applications such as book recommendation engines and digital reading platforms.
Method:
Data were collected by web scraping and crawling from Goodreads, Ketabrah, and Fidibo. The raw data were preprocessed by removing special characters, applying morphological stemming, and eliminating stop words. Features were then extracted from the cleaned summaries using TF-IDF weighting. Classifiers, including Logistic Regression and Support Vector Machines (SVM), were trained to learn the relationship between a book's summary and its subject.
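The preprocessing and TF-IDF steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the study's actual pipeline: the stop-word list, the toy suffix-stripping stemmer, the function names `preprocess` and `tfidf`, and the plain tf * log(N / df) weighting are all assumptions, and the ASCII-only cleaning rule assumes English text and would need adapting for other languages.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the study's actual list is not given.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}


def preprocess(text):
    """Lowercase, strip special characters, drop stop words, apply a toy stemmer."""
    # Keep only ASCII letters and whitespace (assumes English text).
    tokens = re.sub(r"[^a-z\s]", " ", text.lower()).split()
    cleaned = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue
        # Toy stemmer (assumption): strip a trailing "s" from longer words.
        if len(tok) > 3 and tok.endswith("s"):
            tok = tok[:-1]
        cleaned.append(tok)
    return cleaned


def tfidf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


summaries = [
    "A detective solves murders in London.",
    "The history of the Roman empire.",
    "A detective story of the empire.",
]
vectors = tfidf(summaries)
```

Terms that appear in fewer summaries receive larger weights, so in this toy corpus "london" outweighs "detective" in the first summary. Production libraries typically add smoothing and normalization on top of this basic formula.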
Findings:
With an 80/20 train-test split, Logistic Regression, Linear SVM, and RBF SVM achieved accuracies of 80%, 79%, and 79.7%, respectively. With a 90/10 split, the accuracies were 82.2%, 78.6%, and 79.3%, respectively.
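A comparison of this kind could be set up along the following lines. This is a hedged sketch assuming scikit-learn is available; the helper name `compare_models`, the toy data, the hyperparameters, the stratified split, and the random seed are illustrative assumptions, not the study's configuration, and the toy data will not reproduce the reported accuracies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC, LinearSVC

# Toy stand-in for the scraped dataset (assumption): 10 summaries per subject.
toy_summaries = (
    [f"detective investigates murder case {i}" for i in range(10)]
    + [f"history of ancient empire volume {i}" for i in range(10)]
)
toy_subjects = ["crime"] * 10 + ["history"] * 10


def compare_models(summaries, subjects, test_sizes=(0.2, 0.1), seed=0):
    """Return {(model name, test fraction): accuracy} for each model and split."""
    # Factories so that every split trains a fresh, unfitted classifier.
    model_factories = {
        "Logistic Regression": lambda: LogisticRegression(max_iter=1000),
        "Linear SVM": lambda: LinearSVC(),
        "RBF SVM": lambda: SVC(kernel="rbf"),
    }
    results = {}
    for test_size in test_sizes:
        X_train, X_test, y_train, y_test = train_test_split(
            summaries,
            subjects,
            test_size=test_size,
            stratify=subjects,  # keep subject proportions equal in both parts
            random_state=seed,
        )
        for name, make_model in model_factories.items():
            # TF-IDF is fitted on the training part only, inside the pipeline.
            clf = make_pipeline(TfidfVectorizer(), make_model())
            clf.fit(X_train, y_train)
            results[(name, test_size)] = clf.score(X_test, y_test)
    return results


results = compare_models(toy_summaries, toy_subjects)
```

Wrapping the vectorizer and classifier in one pipeline keeps vocabulary and IDF statistics from leaking from the test set into training, which matters when accuracies differ by only a point or two.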
Conclusion:
Logistic Regression achieved the highest accuracy under both splits. Combined with its fast training and prediction times, this makes it a suitable choice for text analysis and multi-class classification tasks such as book subject identification.