Tutorial: spam detection
FSharpML is a functional-friendly, lightweight wrapper around the powerful ML.Net library. It is designed to let users explore ML.Net in a scriptable manner while maintaining the functional style of F#.
After installing the package via NuGet, we can load the reference script that ships with it and start using ML.Net in conjunction with FSharpML.
To get a feel for how this library handles ML.Net operations, we rebuild the Spam Detection tutorial given by ML.Net. We start by instantiating an MLContext, the heart of the ML.Net API, which is intended to serve as a method catalog. We then use it to set a scope on data stored in a text file. The method name might be misleading: ML.Net readers are lazy, and the actual reading only starts when the data is processed (see).
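The lazy behaviour described above can be illustrated with an ordinary F# sequence. This is only an analogy in plain F#, not ML.Net or FSharpML code; the names `readCount` and `lazyRows` and the two sample lines are made up for the sketch.

```fsharp
// Plain F# analogy for a lazy reader; this is NOT ML.Net or FSharpML code.
// `readCount` and `lazyRows` are illustrative names, the two lines are made up.
let readCount = ref 0

// Defining the sequence only describes how rows would be produced.
let lazyRows =
    seq {
        for line in [ "ham\tSee you at lunch"; "spam\tWin a free prize now" ] do
            // The side effect marks the moment a row is actually "read".
            readCount.Value <- readCount.Value + 1
            yield line.Split '\t'
    }

// Nothing has been read at this point.
let countBefore = readCount.Value

// Consuming the sequence is what triggers the reads.
let labels = lazyRows |> Seq.map (fun cols -> cols.[0]) |> Seq.toList

let countAfter = readCount.Value
```

Declaring `lazyRows` performs no work; the rows are only produced once `Seq.toList` consumes them, which mirrors the idea that putting data in scope does not read the file yet.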
Now that we have told our interactive environment about the data, we can start thinking about the model (an EstimatorChain in ML.Net jargon) we want to build. As the MLContext serves as a catalog, we use it to draw transformations that can be appended to form an estimator chain. This is where FSharpML comes into play, enabling us to use the beloved pipelining style familiar to F# users. We now create an EstimatorChain that converts the text label to a bool, featurizes the text, and adds a linear trainer.
This is already pretty F#-friendly, but we thought we could get even closer by relieving ourselves of carrying around our instance of the MLContext explicitly. For this we created the type EstimatorModel, which contains our EstimatorChain and the context. By calling append by, we only have to provide a lambda expression in which we pick the method we want from our context.
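The design idea of bundling the chain with its context can be sketched with a toy model. The types and functions below (`Catalog`, `Step`, `ToyModel`, `appendBy`, `run`) are hypothetical stand-ins written for this illustration; they are not the FSharpML or ML.Net types and do not reproduce their signatures.

```fsharp
// Toy model of the idea only; these are NOT the FSharpML or ML.Net types.
// `Catalog`, `Step`, `ToyModel`, `appendBy`, and `run` are hypothetical names.
type Step = string -> string

// Stand-in for a method catalog such as MLContext.
type Catalog() =
    member this.Lowercase : Step = fun (s: string) -> s.ToLowerInvariant()
    member this.Trim : Step = fun (s: string) -> s.Trim()

// Bundles the chain of steps with the catalog they are drawn from.
type ToyModel = { Context : Catalog; Steps : Step list }

module ToyModel =
    let create (context: Catalog) : ToyModel =
        { Context = context; Steps = [] }

    // The lambda receives the stored context, so callers never pass it around.
    let appendBy (pick: Catalog -> Step) (model: ToyModel) : ToyModel =
        { model with Steps = model.Steps @ [ pick model.Context ] }

    // Applies the steps in the order in which they were appended.
    let run (input: string) (model: ToyModel) : string =
        model.Steps |> List.fold (fun acc step -> step acc) input

let result =
    ToyModel.create (Catalog())
    |> ToyModel.appendBy (fun c -> c.Trim)
    |> ToyModel.appendBy (fun c -> c.Lowercase)
    |> ToyModel.run "  FREE Prize  "
```

Because the model value carries its context through the pipeline, each step only needs a lambda that selects a catalog entry, which is the convenience the paragraph above describes.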
Way better. Now we can concentrate on machine learning. So let's start by fitting our EstimatorModel to the complete data set. The return value of this process is a so-called TransformerModel, which contains a trained EstimatorChain and can be used to transform unseen data. To have such unseen data available, we split the data we previously put in scope into two fractions: one to train the model and a remainder to evaluate it.
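What splitting into two fractions means can be shown in plain F#. This is a minimal sketch, not the splitting routine the tutorial uses; `splitByFraction`, the seed, and the 20 % test fraction are assumptions chosen for illustration.

```fsharp
// Plain F# sketch of a train/test split; it does not reproduce the library's
// own splitting routine. `splitByFraction` is an illustrative name.
let splitByFraction (testFraction: float) (seed: int) (rows: 'a list) =
    let rng = System.Random(seed)
    // Shuffle first so that the split does not depend on the original order.
    let shuffled = rows |> List.sortBy (fun _ -> rng.Next())
    let testCount = int (System.Math.Round(float rows.Length * testFraction))
    let testRows, trainRows = shuffled |> List.splitAt testCount
    trainRows, testRows

// Assumed values: 20 % of ten rows are held back for evaluation.
let trainRows, testRows = splitByFraction 0.2 1 [ 1 .. 10 ]
```

Every row ends up in exactly one of the two fractions, so the model is evaluated only on rows it has not seen during training.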
We can now examine the metrics of our model evaluation. We see an accuracy of 0.99 and might be tempted to use the model in production, so let's first test it with some examples.
As we see, even though the accuracy on the test data set was very high, the model does not assign the correct label, true, to the second and the fourth message, which look a lot like spam. Let's examine our training data set:
The chart clearly shows that the data we learned from is highly imbalanced. We have a lot more ham than spam, which is generally preferable, but our model's labeling threshold is clearly too high. Let's have a look at the precision-recall curves of our model. For this we will evaluate the model with different thresholds and plot both precision and recall.
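How the threshold trades precision against recall on imbalanced data can be worked through by hand. The sketch below is plain F# with invented scores (three spam and seven ham messages); the numbers are not the tutorial's results, and `metricsAt` is an illustrative helper, not a library function.

```fsharp
// Hypothetical scored examples: (predicted spam probability, is actually spam).
// The numbers are made up for illustration; they are not the tutorial's results.
let scored =
    [ 0.95, true; 0.40, true; 0.30, true
      0.20, false; 0.10, false; 0.05, false; 0.05, false
      0.02, false; 0.02, false; 0.01, false ]

// Precision, recall, and accuracy when every score >= threshold is labeled spam.
let metricsAt (threshold: float) (data: (float * bool) list) =
    let count pred = data |> List.filter pred |> List.length |> float
    let tp = count (fun (p, y) -> p >= threshold && y)
    let fp = count (fun (p, y) -> p >= threshold && not y)
    let fn = count (fun (p, y) -> p < threshold && y)
    let tn = count (fun (p, y) -> p < threshold && not y)
    // By convention, report 1.0 when a denominator would be zero.
    let precision = if tp + fp = 0.0 then 1.0 else tp / (tp + fp)
    let recall = if tp + fn = 0.0 then 1.0 else tp / (tp + fn)
    let accuracy = (tp + tn) / float data.Length
    precision, recall, accuracy

for t in [ 0.5; 0.25; 0.15 ] do
    let p, r, a = metricsAt t scored
    printfn "threshold %.2f  precision %.2f  recall %.2f  accuracy %.2f" t p r a
```

With these invented scores, a threshold of 0.5 catches only one of the three spam messages while accuracy still looks respectable, because the many correctly labeled ham messages dominate the count. Lowering the threshold recovers the missed spam, and lowering it further starts to cost precision.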
