Type representing a numerical sequence. See Tokenizer for a usage example.
Type representing a string-to-number mapping. See Tokenizer for a usage example.
Name of the configuration file containing the Config used to train the model.
Default configuration for training models.
See the spam detection paper in deliverables 3 and 4 for a justification of these values.
Name of the file containing the statistics from training the model, such as losses and accuracy.
Path to save metadata.
Name of the file used to save the model information, such as layers.
Names of all available models.
Path to save models.
Name of the file to save the sentence length statistics.
Name of the file to save the testing results for non-spam posts.
Name of the file to save the testing results for spam posts.
Name of the file to save the serialised tokenizer.
Name of the file to save the word frequency statistics.
Converts a 2D array to a CSV.
This is a very naive implementation, which does not consider commas in the actual values, however it is good enough for this purpose.
The type of values contained in the 2D array.
A 2D array of values to convert to a CSV. The first level of the array are treated as rows. The values in the second level are treated as columns.
The converted CSV value.
Creates a directory.
If the directory already exists, this is a noop.
The path to create the directory at.
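With Node's built-in `fs` module, the "no-op if it already exists" behaviour falls out of the `recursive` option to `mkdir`. A minimal sketch (the function name and example path are assumptions, not the package's API):

```typescript
// Create a directory, doing nothing if it already exists.
// With { recursive: true }, fs.mkdir does not raise EEXIST.
import { mkdir } from "node:fs/promises";

async function ensureDirectory(path: string): Promise<void> {
  await mkdir(path, { recursive: true });
}

// Calling it twice is safe; the second call is a no-op.
await ensureDirectory("/tmp/example-dir");
await ensureDirectory("/tmp/example-dir");
```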
Fits a given model to a set of training and validation data.
The model to fit.
The training sentences (i.e. messages/posts).
The training labels (i.e. spam/non-spam classifications).
The testing sentences (i.e. messages/posts), which are used for validation.
The testing labels (i.e. spam/non-spam classifications).
The Config to use when fitting the model.
The trained model and training statistics.
Flattens an array of Message objects into an array of sentences and an array of labels, returned as a SentenceMapping.
The messages to flatten.
A flattened representation of the messages, as two arrays with corresponding indexes.
Converts an array of labels into a tensor.
The labels to convert to a tensor.
The tensor resulting from conversion of the labels.
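Before tensor creation, the spam/non-spam labels have to become numbers. The sketch below shows just that encoding step in plain TypeScript; the label values and helper name are assumptions, and in the real package the resulting array would then be handed to a tensor constructor such as TensorFlow.js's `tf.tensor1d`.

```typescript
// Hypothetical sketch: encode labels numerically ahead of tensor
// creation (1 = spam, 0 = non-spam). Names here are illustrative.
type Label = "spam" | "ham";

const labelsToNumbers = (labels: Label[]): number[] =>
  labels.map((label) => (label === "spam" ? 1 : 0));

console.log(labelsToNumbers(["spam", "ham", "spam"])); // [ 1, 0, 1 ]
```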
Converts an array of sentences into a Tensor, which can be used for training a model.
An array of sentences to be converted to a tensor.
The tokenizer to use when converting sentences to numerical values.
The length that sequences should be normalised to (making up one of the tensor's dimensions).
The tensor resulting from conversion of the sentences.
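The two steps implied above — mapping words to numbers via the tokenizer, then normalising each sequence to a fixed length — can be sketched as follows. All names are assumptions rather than the package's real API, and the final tensor-building step (e.g. with TensorFlow.js) is omitted; the sketch stops at the padded numeric sequences.

```typescript
// Illustrative sketch: tokenize sentences with a string-to-number
// mapping, then truncate or zero-pad each sequence to a fixed length.
const sentencesToSequences = (
  sentences: string[],
  tokenizer: Map<string, number>,
  sequenceLength: number,
): number[][] =>
  sentences.map((sentence) => {
    const sequence = sentence
      .toLowerCase()
      .split(/\s+/)
      .map((word) => tokenizer.get(word) ?? 0) // 0 = unknown word
      .slice(0, sequenceLength); // truncate long sequences
    // Zero-pad short sequences up to the fixed length.
    while (sequence.length < sequenceLength) {
      sequence.push(0);
    }
    return sequence;
  });

const tokenizer = new Map([["buy", 1], ["now", 2], ["hello", 3]]);
console.log(sentencesToSequences(["buy now", "hello"], tokenizer, 3));
// [ [ 1, 2, 0 ], [ 3, 0, 0 ] ]
```

Fixing the sequence length is what gives the resulting tensor a uniform second dimension.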
Predicts the likelihood that a given sentence is spam.
The sentence to generate a prediction for.
A number between 0 and 1, indicating how likely the given sentence is to be non-spam or spam respectively.
Predicts whether or not a given sentence is 'toxic'.
Toxic is defined as being any of:
The sentence to classify.
A boolean indicating if the sentence was classified as toxic.
Splits an array into two parts using the given ratio.
The type of elements in the array.
The array to split.
The ratio used to split the array. This should be a number between 0 and 1.
A tuple of length two containing the split array.
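A minimal sketch of such a ratio split, assuming the split index is taken as `floor(length * ratio)` (the real implementation may round differently):

```typescript
// Split an array into two parts at floor(length * ratio).
// E.g. ratio 0.8 gives roughly an 80/20 train/test split.
const split = <T>(items: T[], ratio: number): [T[], T[]] => {
  const index = Math.floor(items.length * ratio);
  return [items.slice(0, index), items.slice(index)];
};

console.log(split([1, 2, 3, 4, 5], 0.8)); // [ [ 1, 2, 3, 4 ], [ 5 ] ]
```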
Iterates through all entries in a given ZIP archive.
Path to the ZIP archive.
An asynchronous generator yielding ZIPFileEntry objects.
Saves a trained model.
The output files are the:
The trained model.
The path to save the trained model.
Generates and saves statistical information about the training data.
The sentences used for training the models.
The path to save the statistical information.
Tests a set of messages against the default model and writes the results to a file.
The file path to write the results at.
Trains models with the given data and given configuration.
Creates a generator that yields once each model is trained. This is so that they can be saved individually, in case training is interrupted after many hours.
The names of the models to be trained.
The data used to train the models.
The Config used to train the models.
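The generator pattern described above — yielding after each model finishes so the caller can save it before the next long run starts — can be sketched as below. The `trainModel` function, types, and model names here are placeholders, not the package's real API.

```typescript
// Hedged sketch of yield-per-trained-model, so each model can be
// persisted immediately in case a later training run is interrupted.
type TrainedModel = { name: string; accuracy: number };

// Placeholder standing in for the real (possibly hours-long)
// training step.
async function trainModel(name: string): Promise<TrainedModel> {
  return { name, accuracy: 0.9 };
}

async function* trainModels(
  names: string[],
): AsyncGenerator<TrainedModel> {
  for (const name of names) {
    // Control returns to the caller here, between models, which is
    // where the caller would save the freshly trained model.
    yield await trainModel(name);
  }
}

for await (const model of trainModels(["dense", "dense-pooling"])) {
  console.log(`trained ${model.name}`);
}
```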
@unifed/backend-ml
This package contains the project's machine-learning-related code.
Unifed currently utilises two forms of machine learning:
The spam detection models are created and trained in this package, whereas the text toxicity classifier utilises the pre-trained @tensorflow-models/toxicity model.

Spam Detection
The majority of the code in this package is for training a spam detection model.
Training data located in the data directory is converted into a common form, using the parsers located in src/parsers.
The training data is then tokenized, using src/tokenizer.ts.
A tensor is created using this data with the code in src/tensor.ts.
The models used in src/models are trained with the data.
src/train.ts provides a command line utility for training the models, whereas src/test-model.ts provides a command line utility for assessing the performance of models.
An API to utilise the models is exposed in src/index.ts, which can be used by other packages.

Training Data
Training data is located in the data directory. The sources for the training data are as follows:
enron.zip - Source
sms.zip - Source
spam-assasin.zip - Source
testing.zip - Source

Models
The models used have been taken from the following sources:
dense (trained) - Source
dense-pooling (trained) - Source
twilio-dense (trained) - Source
lstm (not trained) - Source
bi-directional-lstm (not trained) - Source

Some models have not been trained, as we did not have the computing resources to do so in a reasonable amount of time. Training and evaluating these would be an interesting project extension.
Artifacts
The models directory contains the trained models. All configuration information is stored within here. These models take time to train and are checked into the repository.
The meta directory contains statistics about the training data, used in the report. This directory is not committed, as it contains hundreds of thousands of lines.

Development and Evaluation
A detailed report outlining the development and evaluation of the spam detection filter is available in both the 3rd and 4th deliverables.
Text Toxicity
The text toxicity classifier utilises the pre-trained @tensorflow-models/toxicity model.
This package provides a simple API around the model in order to classify single pieces of text.