
@unifed/backend-ml

This package contains machine learning related code.

Unifed currently utilises two forms of machine learning:

  1. A spam detection filter.
  2. A text toxicity classifier.

The spam detection models are created and trained in this package, whereas the text toxicity classifier utilises the pre-trained @tensorflow-models/toxicity model.

Spam Detection

The majority of the code in this package is for training a spam detection model.

  1. Training data located in the data directory is converted into a common form, using the parsers located in src/parsers.

  2. The training data is then tokenized, using src/tokenizer.ts.

  3. A tensor is created using this data with the code in src/tensor.ts.

  4. The models used in src/models are trained with the data.
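Steps 2 and 3 above can be sketched in miniature. The following is a hypothetical simplification for illustration, not the actual code in src/tokenizer.ts: it builds a word-to-index vocabulary and converts sentences into numerical sequences.

```typescript
// Hypothetical sketch of tokenization: map each distinct word to an
// integer index, then convert sentences into numerical sequences.
type Sequence = number[];

function buildVocabulary(sentences: string[]): Map<string, number> {
  const vocabulary = new Map<string, number>();
  for (const sentence of sentences) {
    for (const word of sentence.toLowerCase().split(/\s+/)) {
      if (word && !vocabulary.has(word)) {
        // Index 0 is reserved for out-of-vocabulary words.
        vocabulary.set(word, vocabulary.size + 1);
      }
    }
  }
  return vocabulary;
}

function tokenize(sentence: string, vocabulary: Map<string, number>): Sequence {
  return sentence
    .toLowerCase()
    .split(/\s+/)
    .filter((word) => word.length > 0)
    .map((word) => vocabulary.get(word) ?? 0);
}

const corpus = ["win a free prize", "meeting at noon"];
const vocabulary = buildVocabulary(corpus);
const sequence = tokenize("free prize at noon", vocabulary);
```

The resulting sequences are what the tensor-building code consumes.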

src/train.ts provides a command line utility for training the models, whereas src/test-model.ts provides a command line utility for assessing the performance of trained models.

An API to utilise the models is exposed in src/index.ts, which can be used by other packages.

Training Data

Training data is located in the data directory. The sources for the training data are as follows:

Models

The models used have been taken from the following sources:

  • dense (trained) - Source
  • dense-pooling (trained) - Source
  • twilio-dense (trained) - Source
  • lstm (not trained) - Source
  • bi-directional-lstm (not trained) - Source

Some models have not been trained, as we did not have the computing resources to do so in a reasonable amount of time. Training and evaluating these would be an interesting project extension.

Artifacts

The models directory contains the trained models, along with all of their configuration information. Because these models take time to train, they are checked into the repository.

The meta directory contains statistics about the training data, used in the report. This directory is not committed, as it contains hundreds of thousands of lines.

Development and Evaluation

A detailed report outlining the development and evaluation of the spam detection filter is available in both the 3rd and 4th deliverables.

Text Toxicity

The text toxicity classifier utilises the pre-trained @tensorflow-models/toxicity model.

This package provides a simple API around the model in order to classify single pieces of text.

Index

Type aliases

Sequence

Sequence: number[]

Type representing a numerical sequence.

See Tokenizer for usage example.

internal

StringNumberMapping

StringNumberMapping: [string, number][]

Type representing a string to number mapping.

See Tokenizer for usage example.

internal

TrainedModelWithMeta

TrainedModelWithMeta: TrainedModel & ModelMeta
internal

Variables

Const configName

configName: "configuration.json" = "configuration.json"

Name of configuration file containing the Config used to train the model.

internal

Const defaultConfig

defaultConfig: Config = ...

Default configuration for training models.

See the spam detection paper in deliverables 3 and 4 for a justification of these values.

internal

Const historyName

historyName: "history.json" = "history.json"

Name of the file containing the statistics from training the model, such as losses and accuracy.

internal

Const metaPath

metaPath: string = ...

Path to save metadata.

internal

Const modelName

modelName: "model.json" = "model.json"

Name of the file used to save the model information, such as layers.

internal

Const modelNames

modelNames: string[] = ...

Names of all available models.

internal

Const modelsPath

modelsPath: string = ...

Path to save models.

internal

Const sentenceLengthsName

sentenceLengthsName: "sentence-lengths.dat" = "sentence-lengths.dat"

Name of the file used to save the sentence length statistics.

internal

Const testingResultsHamName

testingResultsHamName: "testing-results-ham.dat" = "testing-results-ham.dat"

Name of the file used to save the testing results for non-spam (ham) posts.

internal

Const testingResultsSpamName

testingResultsSpamName: "testing-results-spam.dat" = "testing-results-spam.dat"

Name of the file used to save the testing results for spam posts.

internal

Const tokenizerName

tokenizerName: "tokenizer.json" = "tokenizer.json"

Name of the file to save the serialised tokenizer.

internal

Const wordFrequenciesName

wordFrequenciesName: "word-frequencies.dat" = "word-frequencies.dat"

Name of the file used to save the word frequency statistics.

internal

Functions

arrayToCSV

  • arrayToCSV<T, R>(values: R[]): string
  • Converts a 2D array to a CSV.

    This is a very naive implementation that does not consider commas in the actual values; however, it is good enough for this purpose.

    internal

    Type parameters

    • T

      The type of values contained in the 2D array.

    • R: T[]

    Parameters

    • values: R[]

      A 2D array of values to convert to a CSV. The first level of the array is treated as rows; the values in the second level are treated as columns.

    Returns string

    The converted CSV value.
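    The naive behaviour described can be sketched as follows. This is a hypothetical reimplementation for illustration, not the package's actual code: rows are joined by newlines and columns by commas, with no escaping of commas inside values.

    ```typescript
    // Naive 2D-array-to-CSV conversion: each inner array becomes a
    // row, joined by commas; rows are joined by newlines. Commas
    // inside values are not escaped.
    function arrayToCSV<T, R extends T[]>(values: R[]): string {
      return values.map((row) => row.join(",")).join("\n");
    }

    const csv = arrayToCSV([
      ["epoch", "loss"],
      [1, 0.52],
      [2, 0.31],
    ]);
    ```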

createDirectory

  • createDirectory(path: string): Promise<void>
  • Creates a directory.

    If the directory already exists, this is a no-op.

    internal

    Parameters

    • path: string

      The path to create the directory at.

    Returns Promise<void>

fitModel

  • fitModel(model: Model, trainingSentences: Tensor, trainingLabels: Tensor, testingSentences: Tensor, testingLabels: Tensor, config: Config): Promise<TrainedModel>
  • Fits a given model to a set of training and validation data.

    internal

    Parameters

    • model: Model

      The model to fit.

    • trainingSentences: Tensor

      The training sentences (i.e. messages/posts).

    • trainingLabels: Tensor

      The training labels (i.e. spam/non-spam classifications).

    • testingSentences: Tensor

      The testing sentences (i.e. messages/posts), which are used for validation.

    • testingLabels: Tensor

      The testing labels (i.e. spam/non-spam classifications).

    • config: Config

      The Config to use when fitting the model.

    Returns Promise<TrainedModel>

    The trained model and training statistics.

flattenMessages

getLabelsTensor

  • getLabelsTensor(labels: Sequence): Tensor
  • Converts an array of labels into a tensor.

    internal

    Parameters

    • labels: Sequence

      The labels to convert to a tensor.

    Returns Tensor

    The tensor resulting from conversion of the labels.

getModel

getSentencesTensor

  • getSentencesTensor(sentences: string[], tokenizer: Tokenizer, maxSequenceLength: number): Tensor
  • Converts an array of sentences into a Tensor, which can be used for training a model.

    internal

    Parameters

    • sentences: string[]

      An array of sentences to be converted to a tensor.

    • tokenizer: Tokenizer

      The tokenizer to use when converting sentences to numerical values.

    • maxSequenceLength: number

      The length that sequences should be normalised to (making up one of the tensor's dimensions).

    Returns Tensor

    The tensor resulting from conversion of the sentences.
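    The sequence normalisation mentioned above can be sketched in isolation. This is a hypothetical helper for illustration (the real code builds a tensor from the normalised sequences): long sequences are truncated and short ones are zero-padded, so every row has the same length.

    ```typescript
    type Sequence = number[];

    // Normalise a sequence to a fixed length by truncating long
    // sequences and zero-padding short ones, so that every row of
    // the resulting tensor has the same dimension.
    function padSequence(sequence: Sequence, maxSequenceLength: number): Sequence {
      const normalised = sequence.slice(0, maxSequenceLength);
      while (normalised.length < maxSequenceLength) {
        normalised.push(0);
      }
      return normalised;
    }

    const padded = padSequence([4, 8, 15], 5);            // [4, 8, 15, 0, 0]
    const truncated = padSequence([1, 2, 3, 4, 5, 6], 5); // [1, 2, 3, 4, 5]
    ```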

getSpamFactor

  • getSpamFactor(sentence: string): Promise<number>
  • Predicts the likelihood that a given sentence is spam.

    Parameters

    • sentence: string

      The sentence to generate a prediction for.

    Returns Promise<number>

    A number between 0 and 1, where values close to 0 indicate the sentence is likely non-spam and values close to 1 indicate it is likely spam.
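    A hypothetical usage sketch: the caller turns the continuous factor into a binary decision. The threshold value here is an illustrative choice, not taken from the package.

    ```typescript
    // In real usage, the factor would come from the package's API:
    //   const factor = await getSpamFactor("Claim your free prize now!");
    // Here we interpret such a factor with an illustrative threshold.
    function isSpam(spamFactor: number, threshold = 0.5): boolean {
      return spamFactor >= threshold;
    }

    const spam = isSpam(0.93); // close to 1: classified as spam
    const ham = isSpam(0.04);  // close to 0: classified as non-spam
    ```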

getToxicityClassification

  • getToxicityClassification(sentence: string): Promise<boolean>
  • Predicts whether or not a given sentence is 'toxic'.

    Toxic is defined as being any of:

    • An identity attack
    • Insulting
    • Obscene
    • Sexually explicit
    • Threatening

    Parameters

    • sentence: string

      The sentence to classify.

    Returns Promise<boolean>

    A boolean indicating if the sentence was classified as toxic.

mergeParsers

ratioSplitArray

  • ratioSplitArray<T>(data: T[], ratio: number): [T[], T[]]
  • Splits an array into two parts using the given ratio.

    internal

    Type parameters

    • T

      The type of elements in the array.

    Parameters

    • data: T[]

      The array to split.

    • ratio: number

      The ratio used to split the array. This should be a number between 0 and 1.

    Returns [T[], T[]]

    A tuple of length two containing the split array.
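    The split described can be sketched as follows. This is a hypothetical reimplementation for illustration: the array is divided at the index given by ratio × length, as used for a train/test split.

    ```typescript
    // Split an array into two parts at index ratio * length, e.g.
    // a ratio of 0.8 yields an 80/20 train/test split.
    function ratioSplitArray<T>(data: T[], ratio: number): [T[], T[]] {
      const splitIndex = Math.floor(data.length * ratio);
      return [data.slice(0, splitIndex), data.slice(splitIndex)];
    }

    const [training, testing] = ratioSplitArray([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 0.8);
    ```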

readZIPFile

  • readZIPFile(path: string): AsyncGenerator<ZIPFileEntry, void>

saveModel

  • Saves a trained model.

    The output files are the:

    • model structure (i.e. layers);
    • trained weightings;
    • configuration;
    • tokenizer;
    • history statistics.
    internal

    Parameters

    • trainedModel: TrainedModelWithMeta

      The trained model.

    • modelsPath: string

      The path to save the trained model.

    Returns Promise<void>

saveSentencesMeta

  • saveSentencesMeta(sentences: string[], path: string): Promise<void>
  • Generates and saves statistical information about the training data.

    internal

    Parameters

    • sentences: string[]

      The sentences used for training the models.

    • path: string

      The path to save the statistical information.

    Returns Promise<void>

testModel

  • testModel(messages: Message[], outputPath: string): Promise<void>
  • Tests a set of messages against the default model and writes the results to a file.

    internal

    Parameters

    • messages: Message[]
    • outputPath: string

      The file path to write the results at.

    Returns Promise<void>

trainModels

  • Trains models with the given data and given configuration.

    Creates a generator that yields once each model is trained. This is so that they can be saved individually, in case training is interrupted after many hours.

    internal

    Parameters

    • modelNames: string[]

      The names of the models to be trained.

    • messages: Message[]

      The data used to train the models.

    • config: Config

      The Config used to train the models.

    Returns AsyncGenerator<TrainedModelWithMeta, void>
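The yield-per-model pattern described above can be illustrated with a simplified synchronous sketch (the real function is an async generator over actual training runs): each model is yielded as soon as it is trained, so the caller can save it immediately rather than waiting for the whole batch.

```typescript
interface TrainedModelStub {
  name: string;
}

// Train models one at a time, yielding each result as soon as it
// is ready. If training is interrupted part-way through, the
// models yielded earlier have already been saved by the caller.
function* trainModelsStub(modelNames: string[]): Generator<TrainedModelStub> {
  for (const name of modelNames) {
    // ...expensive training would happen here...
    yield { name };
  }
}

const saved: string[] = [];
for (const model of trainModelsStub(["dense", "dense-pooling"])) {
  saved.push(model.name); // stand-in for saving the model to disk
}
```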

Generated using TypeDoc