BioFSharp


Binary classification: Spam Detection for Text Messages

| ML.NET version | API type | Status | App Type | Data type | Scenario | ML Task | Algorithms |
|---|---|---|---|---|---|---|---|
| v1.40 | Dynamic API | Up-to-date | Console app | .tsv files | Spam detection | Two-class classification | SDCA (linear learner) |

In this sample, you'll see how to use FSharpML on top of ML.NET to predict whether a text message is spam. In the world of machine learning, this type of prediction is known as binary classification.

Problem

Our goal here is to predict whether a text message is spam (an irrelevant/unwanted message). We will use the SMS Spam Collection Data Set from UCI, which contains close to 6000 messages that have been classified as either "spam" or "ham" (not spam). We will use this dataset to train a model that can take in new messages and predict whether they are spam or not.

This is an example of binary classification, as we are classifying the text messages into one of two categories.

Solution

To solve this problem, we first build an estimator that defines the ML pipeline we want to use. We then train this estimator on existing data, evaluate how good it is, and finally consume the model to predict whether a few example messages are spam.

Build -> Train -> Evaluate -> Consume

1. Build and train the model

FSharpML contains two complementary parts, EstimatorModel and TransformerModel, which together cover the full machine learning workflow. To build an ML model and fit it to the training data we use EstimatorModel. Applying the 'fit' function in EstimatorModel to the training data yields a TransformerModel, which represents the trained model. It can transform other data of the same shape and is used in the second part to evaluate and consume the model.
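The estimator/transformer relationship can be illustrated without any ML library. The following is a minimal sketch under stated assumptions: the `Estimator` and `Transformer` records and the `centering` value are hypothetical stand-ins invented for this illustration, not FSharpML's actual types. They only mirror the idea that fitting an estimator on training data produces a transformer that can be applied to other data.

```fsharp
// Hypothetical stand-ins, NOT FSharpML's actual types: they only mirror the
// estimator -> fit -> transformer relationship.
type Transformer = { Transform : float list -> float list }
type Estimator = { Fit : float list -> Transformer }

// An "estimator" describes what to learn; here, how to centre data on its mean.
let centering : Estimator =
    { Fit = fun training ->
        let mean = List.average training   // the parameter learned from training data
        { Transform = List.map (fun x -> x - mean) } }

// Fitting on training data yields the trained "transformer" ...
let trained = centering.Fit [ 1.0; 2.0; 3.0 ]

// ... which can then be applied to other data of the same shape.
let centred = trained.Transform [ 2.0; 5.0 ]
printfn "%A" centred
```

The learned mean is captured inside the returned transformer, which is why the trained model can be reused on new data without access to the training set.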

To build the estimator we will:

  • Define how to read the spam dataset that will be downloaded from https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection.

  • Apply several data transformations:

    • Convert the label ("spam" or "ham") to a boolean ("true" represents spam) so we can use it with a binary classifier.
    • Featurize the text message into a numeric vector so a machine learning trainer can use it.
  • Add a trainer (here SDCA, stochastic dual coordinate ascent, through SdcaLogisticRegression).
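To make the two data transformations in the list above more concrete, here is a toy sketch in plain F#. It is not how ML.NET implements MapValue or FeaturizeText (the real text featurizer does considerably more than count words); the helper names `toLabel`, `tokenize`, and `featurize` and the small vocabulary are assumptions made up for this illustration.

```fsharp
// Toy stand-ins for the two transformations; ML.NET's MapValue and
// FeaturizeText are considerably more capable than this.

/// Map the raw label text to a boolean ("true" represents spam).
let toLabel (labelText : string) =
    match labelText with
    | "spam" -> true
    | "ham"  -> false
    | other  -> failwithf "Unexpected label: %s" other

/// Split a message into lower-case word tokens.
let tokenize (message : string) =
    message.ToLowerInvariant().Split([| ' '; '.'; ','; '!'; '?' |], System.StringSplitOptions.RemoveEmptyEntries)
    |> Array.toList

/// Turn a message into a numeric vector: one word count per vocabulary entry.
let featurize (vocabulary : string list) (message : string) =
    let counts = tokenize message |> List.countBy id |> Map.ofList
    vocabulary
    |> List.map (fun word -> counts |> Map.tryFind word |> Option.defaultValue 0 |> float)

// Hypothetical vocabulary and message, for illustration only.
let vocabulary = [ "free"; "win"; "meet"; "weekend" ]
let features = featurize vocabulary "Free entry! Win a FREE voucher"
printfn "%b %A" (toLabel "spam") features
```

A trainer only ever sees the boolean label and the numeric vector, which is why both conversions have to happen before the trainer is appended to the pipeline.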

#load "../../FSharpML.fsx"


open System
open Microsoft.ML
open Microsoft.ML.Data
open FSharpML
open FSharpML.EstimatorModel
open FSharpML.TransformerModel


/// Type representing the Message to run analysis on.
[<CLIMutable>] 
type SpamInput = 
    { 
        [<LoadColumn(0)>] LabelText : string
        [<LoadColumn(1)>] Message : string 
    }




//Create the MLContext to share across components for deterministic results
let mlContext = MLContext(seed = Nullable 1) // Seed set to any number so you
                                             // have a deterministic environment

// STEP 1: Common data loading configuration
// The file is tab-separated and has no header row.
let fullData = 
    System.IO.Path.Combine(__SOURCE_DIRECTORY__, "data", "SMSSpamCollection.txt")
    |> DataModel.fromTextFileWith<SpamInput> mlContext '\t' false 

let trainingData, testingData = 
    fullData
    |> DataModel.trainTestSplit 0.1 


//STEP 2: Process data, create and train the model 
let model = 
    EstimatorModel.create mlContext
    // Process data transformations in pipeline
    |> EstimatorModel.appendBy (fun mlc -> mlc.Transforms.Conversion.MapValue(DefaultColumnNames.Label, dict ["ham", false; "spam", true], "LabelText") )
    |> EstimatorModel.appendBy (fun mlc -> mlc.Transforms.Text.FeaturizeText(DefaultColumnNames.Features, "Message"))
    |> EstimatorModel.appendCacheCheckpoint
    // Create the model
    |> EstimatorModel.appendBy (fun mlc -> mlc.BinaryClassification.Trainers.SdcaLogisticRegression(DefaultColumnNames.Label, DefaultColumnNames.Features))
    // Train the model
    |> EstimatorModel.fit trainingData.Dataview

// STEP 3: Run the prediction on the test data
let predictions =
    model
    |> TransformerModel.transform testingData.Dataview

2. Evaluate and consume the model

TransformerModel is used to evaluate the model and to make predictions on independent data.

// STEP4: Evaluate accuracy of the model
let metrics = 
    model
    |> Evaluation.BinaryClassification.evaluate testingData.Dataview

metrics.Accuracy





//// STEP5: Create prediction engine function related to the loaded trained model
//let predict = 
//    TransformerModel.createPredictionEngine<_,SpamInput,SpamInput> model

//// Score
//let prediction = predict sampleStatement

//// Test a few examples
//[
//    "That's a great idea. It should work."
//    "free medicine winner! congratulations"
//    "Yes we should meet over the weekend!"
//    "you win pills and free entry vouchers"
//] 
//|> List.iter (classify predictor)
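
The accuracy value read from the evaluation metrics is the fraction of test messages whose predicted label matches the true label. The sketch below computes that fraction by hand in plain F#; the `accuracy` helper and the label lists are made up for illustration and are not part of FSharpML or ML.NET.

```fsharp
/// Fraction of predictions that match the true label.
/// Both lists are assumed to have the same, non-zero length.
let accuracy (actual : bool list) (predicted : bool list) =
    let correct =
        List.zip actual predicted
        |> List.filter (fun (a, p) -> a = p)
        |> List.length
    float correct / float (List.length actual)

// Made-up labels and predictions, for illustration only (true = spam).
let actual    = [ true; false; false; true; false ]
let predicted = [ true; false; true;  true; false ]

// 4 of the 5 predictions match the true label.
printfn "Accuracy: %.2f" (accuracy actual predicted)
```

Accuracy alone can look flattering when one class is much rarer than the other, so it is usually worth reading it alongside the other values exposed by the metrics object, such as the F1 score or the area under the ROC curve.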