ADDRESSING EMAIL MALICIOUS CORPORA USING THE LATENT DIRICHLET ALLOCATION ALGORITHM IN NATURAL LANGUAGE PROCESSING

1Nankang Gabriel Garba, 2Chai Dakun Mang & 3Luka Panpe Yakubu

 1Dept of Computer Science, Plateau State Polytechnic Barkin Ladi, Plateau State

1Naankang.garba@aun.edu.ng

2Dept of Computer Science, Plateau State Polytechnic Barkin Ladi, Plateau State

2Naankang.garba@aun.edu.ng

3Dept of Public Administration, Plateau State Polytechnic, Barkin Ladi, Plateau State

panpeerr@gmail.com

 

Abstract

In recent years, several organizations and companies have been affected by malicious threats originating from both internal and external sources. Unauthorized users have become potential threats to the wellbeing of organizational data. Existing threat detection methods, such as the rule-based approach, rely on expert knowledge, which makes them less robust. To overcome this limitation, a threat detection method based on email user behaviour and the Latent Dirichlet Allocation (LDA) algorithm is proposed. An email content corpus for the IT administrator role is constructed from the CERT r6.2 dataset using tokenisation, stop word removal and stemming. Topic modeling is performed on the dataset to generate a vector space, which serves as input to anomaly detection algorithms that detect malicious email content. The experimental results demonstrate that the proposed model achieves 93.2% detection compared with the baseline model.

Keywords: Outsider-Threat; Anomaly Detection; Latent Dirichlet Allocation; Topic Modeling; Natural Language Processing.

1.0 Introduction

Email has become the world’s leading text messaging format. With about 306.4 billion messages sent daily (Johnson, 2021), it remains the ‘numero uno’ for many users around the world (Dada et al., 2019). The presence of unsolicited and unwanted content among these messages remains a major concern.

Various techniques built around batch processing and artificial intelligence have been deployed to address this dominant threat, such as extracting harmful content (payload) by analysing email headers for sender addresses and delivery paths (Wang and Chen, 2007). The drawback of these techniques is their inability to analyse the content of the message itself: relying on metadata alone becomes a weakness, leaving the user vulnerable (Abd Razak and Mohamad, 2013).

To address this issue, a novel technique is proposed that identifies suspicious emails by analysing their textual content. Using topic modeling with the Latent Dirichlet Allocation algorithm, a natural language processing technique from machine learning, email contents are tokenized and anomalies in them are detected.
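The pipeline described above (preprocessing, topic modeling, anomaly scoring) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the implementation used in this work: the sample emails are invented, the stop-word list and `crude_stem` are simplistic stand-ins for a full stop-word list and a real stemmer, LDA inference is done with a small collapsed Gibbs sampler, and the anomaly score is simply the distance of each email's topic vector from the corpus centroid, since the specific anomaly detection algorithms are not named here.

```python
import random
import re

# Tiny illustrative stop-word list (assumption; real lists are far longer).
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "is", "in", "for",
              "on", "your", "please", "this", "at", "we", "be", "will"}

def crude_stem(word):
    """Rough suffix stripping; a stand-in for a real stemmer (e.g. Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenise, remove stop words, then stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    Returns one topic-proportion vector (theta) per document.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]           # counts per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # counts per (topic, word)
    topic_total = [0] * n_topics                         # tokens per topic
    assignments = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)
    # Resample each token's topic conditioned on all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, k = word_id[w], assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1
    # Smoothed topic proportions for each document.
    return [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]

def anomaly_scores(vectors):
    """Euclidean distance of each topic vector from the corpus centroid."""
    dim = len(vectors[0])
    centroid = [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]
    return [sum((v[j] - centroid[j]) ** 2 for j in range(dim)) ** 0.5
            for v in vectors]

# Hypothetical emails, invented purely for illustration.
emails = [
    "Please reset the server password and update the backup schedule",
    "The backup server is scheduled for patching and password rotation",
    "Server updates and backups will be installed on the network tonight",
    "Click this link to claim your free prize and win cash instantly",
]
docs = [preprocess(e) for e in emails]
thetas = lda_gibbs(docs)
scores = anomaly_scores(thetas)
for email, score in zip(emails, scores):
    print(f"{score:.3f}  {email[:50]}")
```

A higher score marks an email whose topic mixture differs more from the rest of the corpus. On a corpus this small the topics are unstable, so the scores only demonstrate the data flow from raw text to topic vectors to an anomaly ranking.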

An outsider threat is a malicious email emanating from outside an organization. The senders are external to the organization, and the sent mail often serves as the payload carrier for distributing malicious code (Walton, 2006).