Implementation of a tokenizer derived from here.

Converts a string into a sequence of numbers, where smaller numbers indicate more frequently occurring tokens.

internal

Hierarchy

  • Tokenizer

Constructors

constructor

  • new Tokenizer(vocabSize: number): Tokenizer

Properties

wordCounts

wordCounts: Map<string, number> = ...

A mapping between words (tokens) and their corresponding frequencies.

wordIndex

wordIndex: Map<string, number> = ...

A mapping between words (tokens) and their corresponding index.

Methods

fitOnTexts

  • fitOnTexts(texts: string[]): void
  • Fits the tokenizer to a given set of strings. The frequency with which tokens appear in these strings is used when generating sequences for previously unseen strings.

    Parameters

    • texts: string[]

      The strings to fit the tokenizer with.

    Returns void
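
A minimal sketch of how fitting could work, based only on the descriptions of wordCounts, wordIndex, and vocabSize on this page. The class name FrequencyIndexSketch is hypothetical, and several details are assumptions rather than documented behaviour: tokens are split on whitespace only, indices start at 1, and only the vocabSize most frequent tokens receive an index.

```typescript
// Hypothetical stand-in for the documented class; not the actual implementation.
class FrequencyIndexSketch {
  // Token -> number of occurrences seen during fitting.
  readonly wordCounts = new Map<string, number>();
  // Token -> rank; a smaller number means a more frequent token.
  readonly wordIndex = new Map<string, number>();
  private readonly vocabSize: number;

  constructor(vocabSize: number) {
    this.vocabSize = vocabSize;
  }

  fitOnTexts(texts: string[]): void {
    // Count every token across all input strings.
    for (const text of texts) {
      // Assumption: plain whitespace splitting stands in for the static tokenize method.
      const tokens = text.split(/\s+/).filter((t) => t.length > 0);
      for (const token of tokens) {
        this.wordCounts.set(token, (this.wordCounts.get(token) ?? 0) + 1);
      }
    }

    // Rank tokens by descending frequency.
    const ranked = Array.from(this.wordCounts.entries()).sort(
      (a, b) => b[1] - a[1],
    );

    // Assumption: indices start at 1 and are capped at vocabSize entries.
    this.wordIndex.clear();
    ranked.slice(0, this.vocabSize).forEach(([word], position) => {
      this.wordIndex.set(word, position + 1);
    });
  }
}

const fitted = new FrequencyIndexSketch(2);
fitted.fitOnTexts(["the cat sat", "the cat", "the"]);
```

In this sketch "the" is the most frequent token and receives index 1, "cat" receives index 2, and "sat" is counted but falls outside the two-entry vocabulary.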

fromJSON

textToSequence

  • textToSequence(text: string): Sequence
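
This page gives no description for textToSequence, so the following is only an illustrative sketch of the lookup the class summary implies: each token is replaced by its index, with smaller numbers for more frequent tokens. The names SequenceSketch, UNKNOWN_INDEX, and textToSequenceSketch are hypothetical. The real Sequence type is not shown here, and mapping unknown tokens to 0 is an assumption.

```typescript
// Hypothetical stand-in: the actual Sequence type is not documented on this page.
type SequenceSketch = number[];

// Assumption: tokens missing from the index map to 0.
const UNKNOWN_INDEX = 0;

function textToSequenceSketch(
  text: string,
  wordIndex: Map<string, number>,
): SequenceSketch {
  return text
    .split(/\s+/) // assumption: whitespace splitting stands in for tokenize
    .filter((token) => token.length > 0)
    .map((token) => wordIndex.get(token) ?? UNKNOWN_INDEX);
}

// A small index such as one produced by fitting: smaller number = more frequent.
const exampleIndex = new Map<string, number>([
  ["the", 1],
  ["cat", 2],
]);
const exampleSequence = textToSequenceSketch("the cat sat", exampleIndex);
```

Here exampleSequence contains the indices for "the" and "cat", followed by the placeholder index for the unseen token "sat".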

toJSON

Static tokenize

  • tokenize(text: string): string[]
  • Converts a string into an array of tokens.

    The following rules are applied:

    • Multiple spaces are collapsed into a single space.
    • Special characters are removed.
    • Numbers are replaced with a single token.
    • URLs are replaced with a single token.
    • Tokens are split at white space.

    Parameters

    • text: string

      The raw string.

    Returns string[]

    An array of tokens.
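
The rules listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: the placeholder tokens URL_TOKEN and NUMBER_TOKEN, the http(s)-only URL pattern, and the reading of "special characters" as anything outside ASCII letters and digits are all hypothetical choices made for this example.

```typescript
// Hypothetical placeholder tokens; the real replacement tokens are not documented here.
const URL_TOKEN = "<url>";
const NUMBER_TOKEN = "<number>";

function tokenizeSketch(text: string): string[] {
  return text
    .trim()
    // Splitting on runs of whitespace both collapses repeated spaces
    // and splits the text into raw tokens.
    .split(/\s+/)
    .map((raw) => {
      // Assumption: only http:// and https:// URLs are recognised.
      if (/^https?:\/\/\S+$/i.test(raw)) {
        return URL_TOKEN;
      }
      // Assumption: "special characters" means anything that is not an
      // ASCII letter or digit.
      const cleaned = raw.replace(/[^A-Za-z0-9]/g, "");
      // A token consisting only of digits becomes the number placeholder.
      if (/^[0-9]+$/.test(cleaned)) {
        return NUMBER_TOKEN;
      }
      return cleaned;
    })
    // Drop tokens that were made up entirely of special characters.
    .filter((token) => token.length > 0);
}

const exampleTokens = tokenizeSketch("Visit   https://example.com now, 42 times!");
```

URLs are checked before special characters are stripped, because stripping first would remove the punctuation that identifies a URL.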

Generated using TypeDoc