Tutorial: spam detection

FSharpML is a functional-friendly, lightweight wrapper around the powerful ML.Net library. It is designed to enable users to explore ML.Net in a scriptable manner while maintaining the functional style of F#.

After installing the package via NuGet, we can load the delivered reference script and start using ML.Net in conjunction with FSharpML.

#load "../../FSharpML.fsx"


open System
open Microsoft.ML
open Microsoft.ML.Data
open FSharpML
open TransformerModel

To get a feel for how this library handles ML.Net operations, we rebuild the Spam Detection tutorial given by ML.Net. We start by instantiating an MLContext, the heart of the ML.Net API, which is intended to serve as a method catalog. We then use it to set a scope on data stored in a text file. The method name might be misleading, but ML.Net readers are lazy: the reading process only starts when the data is processed.

let mlContext = MLContext(seed = Nullable 1)

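// Build the path with Path.Combine so that a directory separator is inserted correctly.
let trainDataPath  = System.IO.Path.Combine(__SOURCE_DIRECTORY__, "data", "SMSSpamCollection.txt")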
    
let data = 
    mlContext.Data.ReadFromTextFile( 
            path = trainDataPath,
            columns = 
                [|
                    TextLoader.Column("LabelText" , Nullable DataKind.Text, 0)
                    TextLoader.Column("Message" , Nullable DataKind.Text, 1)
                |],
            hasHeader = false,
            separatorChar = '\t')
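
The deferred reading described above behaves much like an ordinary lazy F# sequence. The following is a minimal sketch using only the F# core library, not ML.Net; the names `readCount` and `lazyRows` and the two sample lines are hypothetical and serve only to show that defining a lazy source does no work until it is consumed.

```fsharp
// Illustration only: a plain F# sequence standing in for a lazy data reader.
// `readCount` and `lazyRows` are hypothetical names, not part of ML.Net or FSharpML.
let readCount = ref 0

let lazyRows =
    seq {
        for line in [ "ham\tSee you later"; "spam\tFree entry now" ] do
            // Count how many rows were actually read.
            readCount.Value <- readCount.Value + 1
            yield line.Split('\t')
    }

// Defining the sequence has not read anything yet.
let readBefore = readCount.Value

// Consuming the sequence triggers the reading.
let rows = lazyRows |> Seq.toList
let readAfter = readCount.Value
```

Here `readBefore` is still zero, and the counter only advances once `Seq.toList` forces the sequence.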

Now that we have told our interactive environment about the data, we can start thinking about the model (an EstimatorChain in ML.Net jargon) we want to build. As the MLContext serves as a catalog, we use it to draw transformations that can be appended to form an estimator chain. At this point FSharpML comes into play, enabling us to use the beloved pipelining style familiar to F# users. We now create an EstimatorChain that converts the text label to a bool, then featurizes the text, and finally adds a linear trainer.

let estimatorChain = 
    EstimatorChain()
    |> Estimator.append (mlContext.Transforms.Conversion.ValueMap(["ham"; "spam"],[false; true],[| struct (DefaultColumnNames.Label, "LabelText") |]))
    |> Estimator.append (mlContext.Transforms.Text.FeaturizeText(DefaultColumnNames.Features, "Message"))
    |> (Estimator.appendCacheCheckpoint mlContext)
    |> Estimator.append (mlContext.BinaryClassification.Trainers.StochasticDualCoordinateAscent(DefaultColumnNames.Label, DefaultColumnNames.Features))

This is already pretty F#-friendly, but we thought we could get even closer by relieving ourselves of carrying around our MLContext instance explicitly. For this, we created the type EstimatorModel, which contains our EstimatorChain and the context. By calling appendBy, we only have to provide a lambda expression in which we define which method of our context we want to use.

let estimatorModel = 
    EstimatorModel.create mlContext
    |> EstimatorModel.appendBy (fun mlc -> mlc.Transforms.Conversion.ValueMap(["ham"; "spam"],[false; true],[| struct (DefaultColumnNames.Label, "LabelText") |]))
    |> EstimatorModel.appendBy (fun mlc -> mlc.Transforms.Text.FeaturizeText(DefaultColumnNames.Features, "Message"))
    |> EstimatorModel.appendCacheCheckpoint
    |> EstimatorModel.appendBy (fun mlc -> mlc.BinaryClassification.Trainers.StochasticDualCoordinateAscent(DefaultColumnNames.Label, DefaultColumnNames.Features))
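
The idea of bundling the context with the chain can be sketched with a toy record. This is not the FSharpML implementation: `ToyContext`, `ToyModel`, and the string "steps" are hypothetical stand-ins chosen so the example runs without ML.Net. It only shows why a lambda over the context is enough once the model carries the context itself.

```fsharp
// Toy sketch of the "carry the context inside the model" idea.
// All names here (ToyContext, ToyModel, ...) are hypothetical.
type ToyContext = { Seed : int }

type ToyModel =
    { Steps   : string list   // stands in for the estimator chain
      Context : ToyContext }

module ToyModel =
    // Start with an empty chain that remembers its context.
    let create context = { Steps = []; Context = context }

    // The caller supplies a function of the context;
    // the model supplies the context itself.
    let appendBy (pick : ToyContext -> string) (model : ToyModel) =
        { model with Steps = model.Steps @ [ pick model.Context ] }

let toyModel =
    ToyModel.create { Seed = 1 }
    |> ToyModel.appendBy (fun ctx -> sprintf "mapLabel(seed=%d)" ctx.Seed)
    |> ToyModel.appendBy (fun _ -> "featurizeText")
```

Because every step receives the context from the model, the pipeline never has to mention the context after `create`.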

Way better. Now we can concentrate on machine learning. Let's start by fitting our EstimatorModel to the data. The return value of this process is a so-called TransformerModel, which contains the trained chain and can be used to transform unseen data. To be able to evaluate it, we first split the data we previously put in scope into two fractions: one to train the model and a remainder to evaluate it.

let trainTestSplit = 
    data
    |> Data.BinaryClassification.initTrainTestSplit(estimatorModel.Context,Testfraction=0.1) 

let trainedModel = 
    estimatorModel
    |> EstimatorModel.fit trainTestSplit.TrainingData

let evaluationMetrics = 
    trainedModel
    |> Evaluation.BinaryClassification.evaluate trainTestSplit.TestData

let scoredData = 
    trainedModel
    |> TransformerModel.transform data 

evaluationMetrics.Accuracy
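
To picture what a seeded train/test split does, here is a small sketch on an in-memory list. It is an assumption-laden illustration, not how FSharpML or ML.Net split an IDataView: `splitTrainTest` is a hypothetical helper, and shuffling by random sort keys is just one simple way to obtain a reproducible partition.

```fsharp
// Sketch of a seeded train/test split on an in-memory list.
// `splitTrainTest` is a hypothetical helper, not the FSharpML implementation.
let splitTrainTest (seed : int) (testFraction : float) (items : 'a list) =
    let rng = System.Random(seed)
    // Shuffle by sorting on random keys drawn from the seeded generator.
    let shuffled =
        items
        |> List.map (fun x -> rng.Next(), x)
        |> List.sortBy fst
        |> List.map snd
    // The first `testCount` shuffled items become the test set.
    let testCount = int (System.Math.Round(testFraction * float items.Length))
    let test, train = shuffled |> List.splitAt testCount
    train, test

// Hold out 10 % of 50 items, mirroring the test fraction of 0.1 used above.
let train, test = splitTrainTest 1 0.1 [ 1 .. 50 ]
```

Every item ends up in exactly one of the two parts, and the same seed reproduces the same partition within one runtime.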

Now we can examine the metrics of our model evaluation. We see an accuracy of 0.99 and might be tempted to use the model in production, so let's first test it with some examples.

type SpamInput = 
    {
        LabelText : string
        Message : string
    }

let exampleData = 
    [
        "That's a great idea. It should work."
        "free medicine winner! congratulations"
        "Yes we should meet over the weekend!"
        "you win pills and free entry vouchers"
    ] 
    |> List.map (fun message ->{LabelText = ""; Message = message})

exampleData 
|> (Prediction.BinaryClassification.predictDefaultCols trainedModel)
|> Array.ofSeq

As we see, even though our accuracy on the test data set was very high, the model does not assign the correct label, true, to the second and fourth messages, which look a lot like spam. Let's examine the labels in our data set:

scoredData.GetColumn<bool>(mlContext,DefaultColumnNames.Label)
|> Seq.countBy id
|> Seq.unzip
|> Chart.Doughnut
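
Why a high accuracy can coexist with missed spam is easiest to see on a small imbalanced example. The labels below are made up for illustration (nine ham, one spam) and are not the SMS data; the sketch compares them with a "classifier" that always answers ham.

```fsharp
// Hypothetical labels (true = spam); NOT the real SMS data.
let labels = [ false; false; false; false; false; false; false; false; false; true ]

// Class counts, analogous to the Seq.countBy call above.
let counts = labels |> List.countBy id |> Map.ofList

// A trivial "classifier" that predicts ham (false) for every message.
let predictions = labels |> List.map (fun _ -> false)

// Fraction of messages labeled correctly.
let baselineAccuracy =
    let correct =
        List.zip predictions labels
        |> List.filter (fun (predicted, actual) -> predicted = actual)
        |> List.length
    float correct / float labels.Length

// Fraction of actual spam messages that were caught.
let spamRecall =
    let spam = List.zip predictions labels |> List.filter snd
    let caught = spam |> List.filter fst |> List.length
    float caught / float spam.Length
```

With these made-up labels the always-ham baseline reaches an accuracy of 0.9 while catching no spam at all, which is why accuracy alone says little on imbalanced data.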

[
    scoredData.GetColumn<bool>(mlContext,DefaultColumnNames.Label) 
    |> Seq.zip (scoredData.GetColumn<float32>(mlContext,DefaultColumnNames.Probability))
    |> Seq.groupBy snd 
    |> Seq.map (fun (label,x) ->
                    x
                    |> Seq.map fst
                    |> Chart.Histogram
                )
    |> Chart.Combine



    scoredData.GetColumn<bool>(mlContext,DefaultColumnNames.Label) 
    |> Seq.zip (scoredData.GetColumn<float32>(mlContext,DefaultColumnNames.Score))
    |> Seq.groupBy snd 
    |> Seq.map (fun (label,x) ->
                    x
                    |> Seq.map fst
                    |> Chart.Histogram
                )
    |> Chart.Combine
]
|> Chart.Stack(2)

The chart clearly shows that the data we learned upon is highly imbalanced. We have a lot more ham than spam, which is generally preferable, but our model's labeling threshold is clearly too high. Let's have a look at how precision and recall of our model depend on that threshold. For this, we evaluate the model with a range of thresholds; the code below rebuilds the last transformer for each threshold and, as written, records the accuracy of each evaluation.

// Score the complete data set once; its schema is needed to rebuild the last transformer.
let idea = 
    trainedModel
    |> TransformerModel.transform data

// For each threshold, swap the last transformer for one using that threshold,
// evaluate on the test data and pair the threshold with the resulting accuracy.
let thresholdVSPrecicionAndRecall = 
    [-0.05 .. 0.05 .. 0.95]
    |> List.map (fun threshold ->
                    let newModel = 
                        // Same predictor, but with an explicit threshold on the probability column.
                        let lastTransformer = 
                            BinaryPredictionTransformer<IPredictorProducing<float32>>(
                                trainedModel.Context, 
                                trainedModel.TransformerChain.LastTransformer.Model, 
                                trainedModel.TransformerChain.GetOutputSchema(idea.Schema), 
                                trainedModel.TransformerChain.LastTransformer.FeatureColumn, 
                                threshold = float32 threshold, 
                                thresholdColumn = DefaultColumnNames.Probability)
                        // All transformers of the trained chain except the last one.
                        let parts = 
                            trainedModel.TransformerChain 
                            |> Seq.toArray
                            |> fun x ->x.[..x.Length-2] 
                        
                        printfn "%A, %A" lastTransformer.Threshold lastTransformer.ThresholdColumn
                        TransformerChain<Microsoft.ML.Core.Data.ITransformer>(parts).Append(lastTransformer)
                    // Evaluate the re-thresholded model on the held-out test data.
                    let newModel' = 
                        {TransformerModel.TransformerChain = newModel;Context=trainedModel.Context}
                        |> Evaluation.BinaryClassification.evaluate trainTestSplit.TestData
                    
                    threshold,newModel'.Accuracy
                    //threshold,
                    //exampleData 
                    //|> (Prediction.BinaryClassification.predictDefaultCols newModel')
                    //|> Array.ofSeq
                )
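
How precision and recall move with the threshold can be sketched independently of ML.Net. The scored pairs below are invented for illustration and are not output of the model above; `precisionRecall` is a hypothetical helper, and returning 1.0 when a denominator is zero is one common convention, assumed here.

```fsharp
// Hypothetical (probability of spam, true label) pairs; NOT output of the model above.
let scored =
    [ (0.05, false); (0.10, false); (0.20, false); (0.30, true)
      (0.40, false); (0.60, true);  (0.80, true);  (0.90, true) ]

// Precision and recall of the positive (spam) class at a given threshold.
let precisionRecall (threshold : float) (scored : (float * bool) list) =
    // A message is predicted as spam when its probability reaches the threshold.
    let predicted = scored |> List.map (fun (p, label) -> p >= threshold, label)
    let count f = predicted |> List.filter f |> List.length |> float
    let tp = count (fun (pred, label) -> pred && label)        // true positives
    let fp = count (fun (pred, label) -> pred && not label)    // false positives
    let fn = count (fun (pred, label) -> not pred && label)    // false negatives
    // Assumed convention: report 1.0 when there is nothing to divide by.
    let precision = if tp + fp = 0.0 then 1.0 else tp / (tp + fp)
    let recall    = if tp + fn = 0.0 then 1.0 else tp / (tp + fn)
    precision, recall

// Evaluate a few thresholds: (threshold, (precision, recall)).
let curve =
    [ 0.25; 0.5; 0.75 ]
    |> List.map (fun t -> t, precisionRecall t scored)
```

On this toy data, raising the threshold from 0.25 to 0.75 lifts precision from 0.8 to 1.0 while recall drops from 1.0 to 0.5, the trade-off a too-high threshold produces.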
