Stacked ensemble improvement of phishing Email corpus detection based on frequency-based count vector embedding

Olayemi Olasehinde 1, Olayemi Olufunke Catherine 2 and Peter Adetola Adetunji 3, *

1 Department of Computing and Engineering, University of Huddersfield, UK.
2 Department of Computing and Games, Teesside University, Middlesbrough, UK.
 
Research Article
International Journal of Science and Research Archive, 2024, 13(02), 3774-3788.
Article DOI: 10.30574/ijsra.2024.13.2.1830
Publication history: 
Received on 18 November 2024; revised on 26 December 2024; accepted on 28 December 2024
 
Abstract: 
Email users are at risk from phishing attacks, which combine technological and social engineering techniques to obtain sensitive information from targets and cause significant financial loss. Phishing is the fastest-rising online crime for stealing personal and financial data. In this work, natural language processing was applied to process an unstructured email corpus and convert it into a word vector matrix suitable for building machine learning models, which were implemented in the Python programming language. The test corpus was evaluated using four base models. The random forest (RF) model recorded the highest accuracy (92.71%), followed by the logistic regression (LR) model (89.01%), the Naive Bayes (NB) model (83.52%), and the k-nearest neighbours (KNN) model, which had the lowest accuracy (79.95%). A stacked ensemble of the base model predictions yielded an accuracy of 97.14%, a notable improvement in classification accuracy and a decrease in false alarm rate relative to every base model. It recorded a classification improvement of 21.5%, 5.4%, 16.3%, and 9.1% over the KNN, RF, NB, and LR models, respectively, and a drop in false alarm rate of 79.0%, 36.0%, 76.4%, and 64.0% over the KNN, RF, NB, and LR models, respectively. This approach can be implemented on a mail server to filter incoming phishing emails.
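The improvement percentages quoted in the abstract can be checked with a short calculation. The sketch below assumes that "classification improvement" means the gain of the stacked accuracy relative to each base model's accuracy; the abstract does not define it, and the function name `relative_improvement` is illustrative only. Under that assumption the KNN, NB, and LR figures are reproduced from the stated accuracies, while the RF figure computes to about 4.8% rather than the reported 5.4%, so the reported RF value may rest on a different definition or a different underlying accuracy.

```python
# Accuracies (%) exactly as stated in the abstract.
base_accuracy = {"KNN": 79.95, "RF": 92.71, "NB": 83.52, "LR": 89.01}
stacked_accuracy = 97.14


def relative_improvement(base: float, stacked: float) -> float:
    """Gain of the stacked accuracy relative to a base accuracy, in percent.

    Assumed definition: (stacked - base) / base * 100.
    """
    return (stacked - base) / base * 100.0


# Round to one decimal place to match the precision used in the abstract.
improvements = {
    name: round(relative_improvement(acc, stacked_accuracy), 1)
    for name, acc in base_accuracy.items()
}

for name, gain in improvements.items():
    print(f"{name}: {gain:.1f}%")
```

The same formula with the sign reversed would apply to the drop in false alarm rate, but the abstract does not give the underlying false alarm rates, so those figures cannot be recomputed here.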
 
Keywords: 
Identity Theft; Corpus Embedding; Phishing Detection; Meta-Learners
 