Type representing a numerical sequence. See Tokenizer for a usage example.
Type representing a string-to-number mapping. See Tokenizer for a usage example.
Name of the configuration file containing the Config used to train the model.
Default configuration for training models.
See the spam detection paper in deliverables 3 and 4 for a justification of these values.
Name of the file containing the statistics from training the model, such as losses and accuracy.
Path to save metadata.
Name of the file used to save the model information, such as layers.
Names of all available models.
Path to save models.
Name of the file to save the sentence length statistics.
Name of the file to save the testing results for non-spam posts.
Name of the file to save the testing results for spam posts.
Name of the file to save the serialised tokenizer.
Name of the file to save the word frequency statistics.
Converts a 2D array to a CSV.
This is a very naive implementation, which does not consider commas in the actual values, however it is good enough for this purpose.
The type of values contained in the 2D array.
A 2D array of values to convert to a CSV. The first level of the array are treated as rows. The values in the second level are treated as columns.
The converted CSV value.
Creates a directory.
If the directory already exists, this is a noop.
The path to create the directory at.
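With Node's built-in `fs` module, the "no-op if it already exists" behaviour falls out of the `recursive` option to `mkdir`. A minimal sketch (the function name and example path are assumptions, not the package's API):

```typescript
// Create a directory, doing nothing if it already exists.
// With { recursive: true }, fs.mkdir does not raise EEXIST.
import { mkdir } from "node:fs/promises";

async function ensureDirectory(path: string): Promise<void> {
  await mkdir(path, { recursive: true });
}

// Calling it twice is safe; the second call is a no-op.
await ensureDirectory("/tmp/example-dir");
await ensureDirectory("/tmp/example-dir");
```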
Fits a given model to a set of training and validation data.
The model to fit.
The training sentences (i.e. messages/posts).
The training labels (i.e. spam/non-spam classifications).
The testing sentences (i.e. messages/posts), which are used for validation.
The testing labels (i.e. spam/non-spam classifications).
The Config to use when fitting the model.
The trained model and training statistics.
Flattens an array of Message objects into an array of sentences and an array of labels, returned as a SentenceMapping.
The messages to flatten.
A flattened representation of the messages, as two arrays with corresponding indexes.
Converts an array of labels into a tensor.
The labels to convert to a tensor.
The tensor resulting from conversion of the labels.
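Before tensor creation, the spam/non-spam labels have to become numbers. The sketch below shows just that encoding step in plain TypeScript; the label values and helper name are assumptions, and in the real package the resulting array would then be handed to a tensor constructor such as TensorFlow.js's `tf.tensor1d`.

```typescript
// Hypothetical sketch: encode labels numerically ahead of tensor
// creation (1 = spam, 0 = non-spam). Names here are illustrative.
type Label = "spam" | "ham";

const labelsToNumbers = (labels: Label[]): number[] =>
  labels.map((label) => (label === "spam" ? 1 : 0));

console.log(labelsToNumbers(["spam", "ham", "spam"])); // [ 1, 0, 1 ]
```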
Converts an array of sentences into a Tensor, which can be used for training a model.
An array of sentences to be converted to a tensor.
The tokenizer to use when converting sentences to numerical values.
The length that sequences should be normalised to (making up one of the tensor's dimensions).
The tensor resulting from conversion of the sentences.
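The two steps implied above — mapping words to numbers via the tokenizer, then normalising each sequence to a fixed length — can be sketched as follows. All names are assumptions rather than the package's real API, and the final tensor-building step (e.g. with TensorFlow.js) is omitted; the sketch stops at the padded numeric sequences.

```typescript
// Illustrative sketch: tokenize sentences with a string-to-number
// mapping, then truncate or zero-pad each sequence to a fixed length.
const sentencesToSequences = (
  sentences: string[],
  tokenizer: Map<string, number>,
  sequenceLength: number,
): number[][] =>
  sentences.map((sentence) => {
    const sequence = sentence
      .toLowerCase()
      .split(/\s+/)
      .map((word) => tokenizer.get(word) ?? 0) // 0 = unknown word
      .slice(0, sequenceLength); // truncate long sequences
    // Zero-pad short sequences up to the fixed length.
    while (sequence.length < sequenceLength) {
      sequence.push(0);
    }
    return sequence;
  });

const tokenizer = new Map([["buy", 1], ["now", 2], ["hello", 3]]);
console.log(sentencesToSequences(["buy now", "hello"], tokenizer, 3));
// [ [ 1, 2, 0 ], [ 3, 0, 0 ] ]
```

Fixing the sequence length is what gives the resulting tensor a uniform second dimension.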
Predicts the likelihood that a given sentence is spam.
The sentence to generate a prediction for.
A number between 0 and 1, indicating how likely the given sentence is to be non-spam or spam respectively.
Predicts whether or not a given sentence is 'toxic'.
Toxic is defined as being any of:
The sentence to classify.
A boolean indicating if the sentence was classified as toxic.
Splits an array into two parts using the given ratio.
The type of elements in the array.
The array to split.
The ratio used to split the array. This should be a number between 0 and 1.
A tuple of length two containing the split array.
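A minimal sketch of such a ratio split, assuming the split index is taken as `floor(length * ratio)` (the real implementation may round differently):

```typescript
// Split an array into two parts at floor(length * ratio).
// E.g. ratio 0.8 gives roughly an 80/20 train/test split.
const split = <T>(items: T[], ratio: number): [T[], T[]] => {
  const index = Math.floor(items.length * ratio);
  return [items.slice(0, index), items.slice(index)];
};

console.log(split([1, 2, 3, 4, 5], 0.8)); // [ [ 1, 2, 3, 4 ], [ 5 ] ]
```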
Iterates through all entries in a given ZIP archive.
Path to the ZIP archive.
An asynchronous generator yielding ZIPFileEntry objects.
Saves a trained model.
The output files are the:
The trained model.
The path to save the trained model.
Generates and saves statistical information about the training data.
The sentences used for training the models.
The path to save the statistical information.
Tests a set of messages against the default model and writes the results to a file.
The file path to write the results at.
Trains models with the given data and given configuration.
Creates a generator that yields once each model is trained. This is so that they can be saved individually, in case training is interrupted after many hours.
The names of the models to be trained.
The data used to train the models.
The Config used to train the models.
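The generator pattern described above — yielding after each model finishes so the caller can save it before the next long run starts — can be sketched as below. The `trainModel` function, types, and model names here are placeholders, not the package's real API.

```typescript
// Hedged sketch of yield-per-trained-model, so each model can be
// persisted immediately in case a later training run is interrupted.
type TrainedModel = { name: string; accuracy: number };

// Placeholder standing in for the real (possibly hours-long)
// training step.
async function trainModel(name: string): Promise<TrainedModel> {
  return { name, accuracy: 0.9 };
}

async function* trainModels(
  names: string[],
): AsyncGenerator<TrainedModel> {
  for (const name of names) {
    // Control returns to the caller here, between models, which is
    // where the caller would save the freshly trained model.
    yield await trainModel(name);
  }
}

for await (const model of trainModels(["dense", "dense-pooling"])) {
  console.log(`trained ${model.name}`);
}
```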
@unifed/backend-ml
This package contains the project's machine-learning-related code.
Unifed currently utilises two forms of machine learning:
The spam detection models are created and trained in this package, whereas the text toxicity classifier utilises the pre-trained @tensorflow-models/toxicity model.

Spam Detection
The majority of the code in this package is for training a spam detection model.
Training data located in the data directory is converted into a common form, using the parsers located in src/parsers.
The training data is then tokenized, using src/tokenizer.ts.
A tensor is created using this data with the code in src/tensor.ts.
The models used in src/models are trained with the data.
src/train.ts provides a command line utility for training the models, whereas src/test-model.ts provides a command line utility for assessing the performance of models.
An API to utilise the models is exposed in src/index.ts, which can be used by other packages.

Training Data
Training data is located in the data directory. The sources for the training data are as follows:
enron.zip - Source
sms.zip - Source
spam-assasin.zip - Source
testing.zip - Source

Models
The models used have been taken from the following sources:
dense (trained) - Source
dense-pooling (trained) - Source
twilio-dense (trained) - Source
lstm (not trained) - Source
bi-directional-lstm (not trained) - Source

Some models have not been trained, as we did not have the computing resources to do so in a reasonable amount of time. Training and evaluating these would be an interesting project extension.
Artifacts
The models directory contains the trained models. All configuration information is stored within here. These models take time to train and are checked into the repository.
The meta directory contains statistics about the training data, used in the report. This directory is not committed, as it contains hundreds of thousands of lines.

Development and Evaluation
A detailed report outlining the development and evaluation of the spam detection filter is available in both the 3rd and 4th deliverables.
Text Toxicity
The text toxicity classifier utilises the pre-trained @tensorflow-models/toxicity model.
This package provides a simple API around the model in order to classify single pieces of text.