In this article I would like to illustrate how Stanford’s Natural Language Processing (NLP) library and Java 9 can be used to create a spam filter for an email account.
The goal is that all incoming messages will be scanned, and any message that contains spam content will be moved to a spam folder.
First, we download the following:
- Stanford NLP 3.9.1
- Java EE: we will use the JavaMail API to connect to and manipulate the email account.
- Eclipse Oxygen 4.7.2
Next we create a project called emailspamfilter in Eclipse Oxygen. We will create an application with the following architecture:
This is similar to the MVC pattern, except that instead of a model and a view we have the EmailListener and MessageNLP. In the source code, MessageNLP is represented by the EmailTextClassifier class. The controller carries out all the orchestration, which keeps the concerns cleanly separated.
The project structure is as follows:
The EmailController will run in an infinite loop, checking the inbox for new mail at a given interval. Here I have set it to five seconds. For personal use, the interval can be much larger, such as every few hours.
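The polling behaviour described above can be sketched with a scheduled executor instead of a hand-written loop. This is only an illustration and not necessarily how the repository implements it: the class name PollingSketch, the methods pollOnce and startPolling, and the use of plain strings for messages are all assumptions made for this example.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical stand-in for the polling part of the EmailController.
class PollingSketch {

    // One polling pass: fetch whatever is new and hand each message on.
    // Returns the number of messages seen in this pass.
    static int pollOnce(Supplier<List<String>> fetchNewMessages,
                        Consumer<String> handleMessage) {
        List<String> messages = fetchNewMessages.get();
        messages.forEach(handleMessage);
        return messages.size();
    }

    // Repeats pollOnce with a fixed delay between the end of one pass
    // and the start of the next, until the returned scheduler is shut down.
    static ScheduledExecutorService startPolling(Supplier<List<String>> fetchNewMessages,
                                                 Consumer<String> handleMessage,
                                                 long interval, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                pollOnce(fetchNewMessages, handleMessage);
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future passes,
                // so log it and keep polling.
                System.err.println("Polling pass failed: " + e.getMessage());
            }
        }, 0, interval, unit);
        return scheduler;
    }
}
```

Calling startPolling with an interval of 5 and TimeUnit.SECONDS would match the five-second setting mentioned above; calling shutdown() on the returned scheduler stops the polling.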
Note that since we are using Java 8 or 9, the “stream” can be changed to a parallel stream for better performance on multicore systems. The beauty is that a threading or concurrency model can be superimposed on the controller, as it delegates functionality to the EmailTextClassifier and EmailListener classes.
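To make the stream versus parallel stream point concrete, here is a minimal sketch. It assumes the classification step is stateless and thread-safe; the class ParallelSketch, the method findSpam and the isSpam predicate are hypothetical names for this example and do not come from the repository.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical illustration of switching between sequential and parallel streams.
class ParallelSketch {

    // Returns the message bodies that the predicate flags as spam.
    // For an ordered source such as a List, collect keeps the original
    // order even when the stream runs in parallel.
    static List<String> findSpam(List<String> bodies,
                                 Predicate<String> isSpam,
                                 boolean parallel) {
        return (parallel ? bodies.parallelStream() : bodies.stream())
                .filter(isSpam)
                .collect(Collectors.toList());
    }
}
```

The only difference between the two modes is the call that creates the stream, which is why the change is cheap to try. Whether it actually helps depends on how expensive the classification of a single message is.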
Next we train our application to detect spam. To do this we will implement Named Entity Recognition (NER). All we need is the Sentence class from the edu.stanford.nlp.simple package.
The commented-out code illustrates an extension to the mailLanguageClassifier in which we process the email subject, the email body and the email text attachments. We can then pass these around as a list of string triples, using flatMap to create spamEmails. For this example I analyse only the text in the email body.
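The flatMap idea can be sketched as follows. The EmailParts holder class and the allText method are assumptions made for this example, not code from the repository; the sketch only shows how the subject, body and attachment texts of several emails can be flattened into one list of strings to classify.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration of flattening email parts with flatMap.
class EmailPartsSketch {

    // Hypothetical holder for the three kinds of text in an email.
    static final class EmailParts {
        final String subject;
        final String body;
        final List<String> attachmentTexts;

        EmailParts(String subject, String body, List<String> attachmentTexts) {
            this.subject = subject;
            this.body = body;
            this.attachmentTexts = attachmentTexts;
        }
    }

    // Flattens every email into its individual text fragments:
    // subject first, then body, then each attachment text.
    static List<String> allText(List<EmailParts> emails) {
        return emails.stream()
                .flatMap(e -> Stream.concat(
                        Stream.of(e.subject, e.body),
                        e.attachmentTexts.stream()))
                .collect(Collectors.toList());
    }
}
```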
The emailspamfilter_ner.txt contains the spam items that we will look out for in emails. Here is an example:
If I get any mail containing “Buy peanuts” or “Sale on biscuits”, I can now classify it as spam. Note that this can be extended with Stanford’s NLC, which you can train to look out for certain phrases or words. In addition, you could just look for NERs like “Sale on” or “Get discounted”. As you receive more mail that you don’t like, you can add the corresponding NERs to this list, and over time your spam filter becomes more intelligent.
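As a simplified illustration of matching a message against a list of spam phrases, the sketch below uses plain string matching rather than Stanford NER, so it is a stand-in for the idea and not the article's actual classifier. The class SpamPhraseSketch and the method containsSpamPhrase are hypothetical names; the phrases are the ones mentioned above.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical plain-Java stand-in for the phrase lookup; it does not use Stanford NLP.
class SpamPhraseSketch {

    // Returns true if the body contains any of the spam phrases,
    // ignoring case and skipping blank entries in the phrase list.
    static boolean containsSpamPhrase(String body, List<String> spamPhrases) {
        String normalized = body.toLowerCase(Locale.ROOT);
        return spamPhrases.stream()
                .map(phrase -> phrase.trim().toLowerCase(Locale.ROOT))
                .filter(phrase -> !phrase.isEmpty())
                .anyMatch(normalized::contains);
    }
}
```

Adding a new phrase to the list is all it takes to extend the filter, which mirrors the way the list of NERs grows over time.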
The EmailListener contains the getEmails method, which retrieves all new emails; the controller sends these to the EmailTextClassifier. The second important method is moveEmailToSpamFolder. If a mail is spam, it is moved from the inbox folder to a spam folder that I call MySpam (I created a new folder called MySpam in my Gmail inbox for this article).
Note that not all the code in EmailListener uses Java 8 or 9 features. This is because some of the methods were originally written for Java 6, and I reused them here. However, moveEmailToSpamFolder uses Optional, which was introduced in Java 8.
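To show how Optional can guard a move between folders, here is a minimal sketch that models the mailbox as an in-memory map instead of using the JavaMail API. The class OptionalMoveSketch, the methods findFolder and moveToSpam, and the folder name "INBOX" are assumptions for this example; only the MySpam folder name comes from the article, and the real moveEmailToSpamFolder may look quite different.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory model of moving a message to a spam folder.
class OptionalMoveSketch {

    // Wraps the lookup so that a missing folder becomes an empty Optional
    // rather than a null that callers might forget to check.
    static Optional<List<String>> findFolder(Map<String, List<String>> mailbox, String name) {
        return Optional.ofNullable(mailbox.get(name));
    }

    // Moves the message from "INBOX" to "MySpam".
    // Returns false if either folder is missing or the message is not in the inbox.
    static boolean moveToSpam(Map<String, List<String>> mailbox, String message) {
        Optional<List<String>> inbox = findFolder(mailbox, "INBOX");
        Optional<List<String>> spam = findFolder(mailbox, "MySpam");
        if (!inbox.isPresent() || !spam.isPresent()) {
            return false;
        }
        if (!inbox.get().remove(message)) {
            return false;
        }
        spam.get().add(message);
        return true;
    }
}
```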
All of the code can be found on GitHub in the repository: https://github.com/gabrieljeremiahcampbell/emailspamfilter