ivis.Ivis

class ivis.Ivis(embedding_dims=2, k=150, distance='pn', batch_size=128, epochs=1000, n_epochs_without_progress=20, margin=1, ntrees=50, search_k=-1, precompute=True, model='szubert', supervision_metric='sparse_categorical_crossentropy', supervision_weight=0.5, annoy_index_path=None, callbacks=[], build_index_on_disk=None, neighbour_matrix=None, verbose=1)

Bases: sklearn.base.BaseEstimator

Ivis is a technique that uses an artificial neural network for dimensionality reduction, often useful for visualization. The network trains on triplets of data points, pulling positive points together while pushing more distant points apart. Triplets are sampled from the original data using approximate KNN search provided by the Annoy library.
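
To make the triplet idea concrete, the sketch below computes a textbook triplet margin loss with Euclidean distance for a single (anchor, positive, negative) triplet. It is an illustration only, not ivis's own loss code: the function name triplet_margin_loss is hypothetical, and the default 'pn' loss used by ivis may differ in detail.

```python
import math


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Textbook triplet margin loss for one triplet (illustrative only).

    The loss is zero once the negative point is at least `margin`
    further from the anchor than the positive point is.
    """
    d_pos = math.dist(anchor, positive)  # anchor-to-neighbour distance
    d_neg = math.dist(anchor, negative)  # anchor-to-distant-point distance
    return max(d_pos - d_neg + margin, 0.0)


# Negative is far away: the margin is already satisfied, so no penalty.
far = triplet_margin_loss((0.0, 0.0), (0.0, 1.0), (3.0, 4.0))

# Negative is only slightly further than the positive: penalised.
near = triplet_margin_loss((0.0, 0.0), (0.0, 1.0), (0.0, 1.5))
```

Minimising a loss of this shape over many triplets is what pulls neighbours together and pushes non-neighbours apart in the embedding.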

Parameters:
  • embedding_dims (int) – Number of dimensions in the embedding space
  • k (int) – The number of neighbours to retrieve for each point. Must be less than the number of rows in the dataset minus one.
  • distance (str) – The loss function used to train the neural network. One of “pn”, “euclidean”, “manhattan_pn”, “manhattan”, “chebyshev”, “chebyshev_pn”, “softmax_ratio_pn”, “softmax_ratio”, “cosine”, “cosine_pn”.
  • batch_size (int) – The size of mini-batches used during gradient descent while training the neural network. Must be less than the number of rows in the dataset.
  • epochs (int) – The maximum number of epochs to train the model for. Each epoch the network will see a triplet based on each data-point once.
  • n_epochs_without_progress (int) – Terminate training early after this many consecutive epochs without an improvement in the loss.
  • margin (float) – The distance that is enforced between points by the triplet loss functions.
  • ntrees (int) – The number of random projection trees built by Annoy to approximate KNN. The more trees, the higher the memory usage, but the better the accuracy of results.
  • search_k (int) – The maximum number of nodes inspected during a nearest neighbour query by Annoy. The higher, the more computation time required, but the higher the accuracy. The default (-1) resolves to ntrees * k, where k is the number of neighbours to retrieve. If this is set too low, a variable number of neighbours may be retrieved per data-point.
  • precompute (bool) – Whether to pre-compute the nearest neighbours. Pre-computing is a little faster, but requires more memory. If memory is limited, try setting this to False.
  • model (str) – str or keras.models.Model. The keras model to train using triplet loss. If a model object is provided, an embedding layer of size ‘embedding_dims’ will be appended to the end of the network. If a string, a pre-defined network by that name will be used. Possible options are: ‘szubert’, ‘hinton’, ‘maaten’. By default the ‘szubert’ network will be created, which is a selu network composed of 3 dense layers of 128 neurons each, followed by an embedding layer of size ‘embedding_dims’.
  • supervision_metric (str) – str or function. The supervision metric to optimize when training keras in supervised mode. Supports all of the classification or regression losses included with keras, so long as the labels are provided in the correct format. A list of keras’ loss functions can be found at https://keras.io/losses/ .
  • supervision_weight (float) – Float between 0 and 1 denoting the weighting to give to classification vs triplet loss when training in supervised mode. The higher the weight, the more classification influences training. Ignored if using Ivis in unsupervised mode.
  • annoy_index_path (str) – The filepath of a pre-trained annoy index file saved on disk. If provided, the annoy index file will be used. Otherwise, a new index will be generated and saved to disk in the current directory as ‘annoy.index’.
  • callbacks (list[keras.callbacks.Callback]) – List of keras Callbacks to pass to the model during training, such as the TensorBoard callback. A set of ivis-specific callbacks is provided in the ivis.nn.callbacks module.
  • build_index_on_disk (bool) – Whether to build the annoy index directly on disk. Building on disk should allow for bigger datasets to be indexed, but may cause issues. If None, on-disk building will be enabled for Linux, but not Windows due to issues on Windows.
  • neighbour_matrix (np.array) – A pre-computed KNN matrix can be provided. The KNNs can be retrieved using any method, and will cause Ivis to skip computing the Annoy KNN index.
  • verbose (int) – Controls the volume of logging output the model produces when training. A value of 0 silences output; any value above 0 prints output.
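
The size constraints on k and batch_size are easy to violate on small datasets, since the defaults are 150 and 128. The following is a minimal sketch of how they might be checked before constructing the estimator, reading the constraint on k as k < n_rows - 1. The helper names clamp_to_dataset, effective_search_k and make_ivis are hypothetical and not part of ivis; make_ivis imports ivis lazily, so the helpers can be used without it installed.

```python
def clamp_to_dataset(n_rows, k=150, batch_size=128):
    """Shrink k and batch_size so they respect the documented limits."""
    if n_rows < 3:
        raise ValueError("need at least 3 rows so that k >= 1 and k < n_rows - 1")
    safe_k = min(k, n_rows - 2)                # enforces k < n_rows - 1
    safe_batch = min(batch_size, n_rows - 1)   # enforces batch_size < n_rows
    return safe_k, safe_batch


def effective_search_k(ntrees, k):
    """Number of nodes inspected per query when search_k is left at -1."""
    return ntrees * k


def make_ivis(n_rows, **kwargs):
    """Build an Ivis estimator sized for a dataset with n_rows rows."""
    from ivis import Ivis  # imported lazily; requires ivis to be installed

    k, batch_size = clamp_to_dataset(n_rows)
    return Ivis(embedding_dims=2, k=k, batch_size=batch_size, **kwargs)


small = clamp_to_dataset(100)       # defaults are too large for 100 rows
large = clamp_to_dataset(10_000)    # defaults fit comfortably
nodes = effective_search_k(ntrees=50, k=150)
```

With ivis installed, make_ivis(X.shape[0]).fit_transform(X) would then return the two-dimensional embedding of X.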
fit(X, Y=None, shuffle_mode=True)

Fit an ivis model.

X : array, shape (n_samples, n_features)
Data to be embedded.
Y : array, shape (n_samples)
Optional array for supervised dimensionality reduction. If Y contains -1 labels, and ‘sparse_categorical_crossentropy’ is the loss function, semi-supervised learning will be used.

returns an instance of self
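
As an illustration of the -1 convention for semi-supervised training, the sketch below hides a share of the labels by replacing them with -1. The helper mask_labels is hypothetical and not part of ivis; it assumes the true class labels are non-negative integers.

```python
import random


def mask_labels(labels, labelled_fraction, seed=0):
    """Return a copy of `labels` with all but a random fraction set to -1."""
    rng = random.Random(seed)  # seeded so the same points are kept each run
    n_keep = round(labelled_fraction * len(labels))
    keep = set(rng.sample(range(len(labels)), n_keep))
    return [y if i in keep else -1 for i, y in enumerate(labels)]


labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
masked = mask_labels(labels, labelled_fraction=0.3)  # 3 labelled, 7 unlabelled

# Hypothetical usage, assuming ivis and numpy are installed:
#   import numpy as np
#   from ivis import Ivis
#   model = Ivis(k=15, supervision_metric='sparse_categorical_crossentropy')
#   model.fit(X, Y=np.array(masked))
```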

fit_transform(X, Y=None, shuffle_mode=True)

Fit to data then transform

X : array, shape (n_samples, n_features)
Data to be embedded.
Y : array, shape (n_samples)
Optional array for supervised dimensionality reduction. If Y contains -1 labels, and ‘sparse_categorical_crossentropy’ is the loss function, semi-supervised learning will be used.
X_new : transformed array, shape (n_samples, embedding_dims)
Embedding of the new data in low-dimensional space.
get_params(deep=True)

Get parameters for this estimator.

deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
params : mapping of string to any
Parameter names mapped to their values.
load_model(folder_path)

Load ivis model

folder_path : string
Path to serialised model files and metadata

returns an ivis instance

save_model(folder_path, overwrite=False)

Save an ivis model

folder_path : string
Path to serialised model files and metadata
score_samples(X)

Passes X through classification network to obtain predicted supervised values. Only applicable when trained in supervised mode.

X : array, shape (n_samples, n_features)
Data to be passed through classification network.
X_new : array, shape (n_samples, embedding_dims)
Softmax class probabilities of the data.
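
Since score_samples returns class probabilities rather than labels, a hard prediction can be obtained by taking the most probable class per row. The helper predicted_classes below is a hypothetical sketch, not part of ivis; it assumes each row holds one probability per class.

```python
def predicted_classes(probabilities):
    """Return the index of the most probable class for each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


# Two samples, three classes: rows as they might come back from score_samples.
example = [[0.1, 0.7, 0.2],
           [0.6, 0.3, 0.1]]
classes = predicted_classes(example)

# Hypothetical usage with a model trained in supervised mode:
#   classes = predicted_classes(model.score_samples(X_test))
```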
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

**params : dict
Estimator parameters.
self : object
Estimator instance.
transform(X)

Transform X into the existing embedded space and return that transformed output.

X : array, shape (n_samples, n_features)
New data to be transformed.
X_new : array, shape (n_samples, embedding_dims)
Embedding of the new data in low-dimensional space.
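
Because fit returns the estimator itself, fitting and transforming can be chained. The sketch below fits on one dataset and then projects previously unseen rows into the same embedding space; the helper embed_new_data is hypothetical and works with any object exposing fit and transform as documented above.

```python
def embed_new_data(estimator, X_train, X_new):
    """Fit on X_train, then embed X_new in the space learned from X_train."""
    # fit() returns the estimator, so transform() can be called on its result.
    return estimator.fit(X_train).transform(X_new)


# Hypothetical usage, assuming ivis is installed and X_train / X_new are
# arrays of shape (n_samples, n_features) with the same number of features:
#   from ivis import Ivis
#   X_new_embedded = embed_new_data(Ivis(embedding_dims=2, k=15), X_train, X_new)
#   X_new_embedded.shape  # (X_new.shape[0], 2)
```

Unlike fit_transform, this keeps the embedding fixed after training, so new points are placed relative to the structure learned from X_train.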

ivis.data.knn

KNN retrieval using an Annoy index.

ivis.data.knn.build_annoy_index(X, path, ntrees=50, build_index_on_disk=True, verbose=1)

Build a standalone annoy index.

Parameters:
  • X (array) – numpy array with shape (n_samples, n_features)
  • path (str) – The filepath of a trained annoy index file saved on disk.
  • ntrees (int) – The number of random projection trees built by Annoy to approximate KNN. The more trees, the higher the memory usage, but the better the accuracy of results.
  • build_index_on_disk (bool) – Whether to build the annoy index directly on disk. Building on disk should allow for bigger datasets to be indexed, but may cause issues. If None, on-disk building will be enabled for Linux, but not Windows due to issues on Windows.
  • verbose (int) – Controls the volume of logging output produced while building the index. A value of 0 silences output; any value above 0 prints output.
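
A standalone index is useful when the same KNN structure should be shared across several Ivis runs. The sketch below shows one way this might be wired together using the signatures listed on this page. Both helper names are hypothetical. resolve_build_on_disk mirrors the documented None behaviour, with the added assumption that every non-Windows platform is treated like Linux; the ivis imports are done lazily inside build_and_reuse_index.

```python
import platform


def resolve_build_on_disk(build_index_on_disk=None, system=None):
    """Pick an explicit True/False for on-disk index building.

    Assumption: any platform other than Windows behaves like Linux here.
    """
    if build_index_on_disk is not None:
        return build_index_on_disk
    system = system or platform.system()
    return system != "Windows"


def build_and_reuse_index(X, path="annoy.index", ntrees=50):
    """Build an Annoy index once, then hand its path to a new Ivis estimator."""
    from ivis import Ivis                           # requires ivis
    from ivis.data.knn import build_annoy_index     # module path as documented above

    build_annoy_index(X, path, ntrees=ntrees,
                      build_index_on_disk=resolve_build_on_disk(), verbose=1)
    # ntrees is passed again so the estimator's settings match the saved index.
    return Ivis(annoy_index_path=path, ntrees=ntrees)


on_linux = resolve_build_on_disk(None, system="Linux")
on_windows = resolve_build_on_disk(None, system="Windows")
```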