Classifying Spams Using Apache Spark MLlib

Mayowa Timothy Adesina 1, * and Joshua Mayokun Adesina 2

1 College of Business Administration, Kansas State University, Manhattan, KS 66502
2 Sociology Department, Federal University, Oye-Ekiti, Nigeria
 
Research Article
International Journal of Science and Research Archive, 2024, 12(02), 2091-2112.
Article DOI: 10.30574/ijsra.2024.12.2.1332
Publication history: 
Received on 26 June 2024; revised on 11 August 2024; accepted on 13 August 2024
 
Abstract: 
This paper surveys machine learning algorithms, including Logistic Regression, Decision Trees, and Random Forests, with a focus on their application to predictive modeling. The discussion emphasizes feature selection, feature engineering, and model evaluation techniques such as cross-validation as the means of building robust, generalizable models. Using the Spambase dataset from the UCI Machine Learning Repository, the algorithms are compared on metrics such as accuracy, precision, recall, and F1-score. The paper also highlights how understanding dataset characteristics and feature importance helps optimize model performance. The findings indicate that, although each algorithm has strengths and limitations, Random Forests generally provide the best predictive performance, especially on complex, high-dimensional datasets. The work is intended as a resource for data scientists and researchers who want to understand the practical implications of different machine learning techniques and their impact on real-world data.
 
Keywords: 
Machine Learning; Artificial Intelligence; Spam Messages; Logistic Regression; Decision Tree; Random Forest; Apache Spark
 