Ivis

class ivis.Ivis(embedding_dims=2, *, k=150, distance='pn', batch_size=128, epochs=1000, n_epochs_without_progress=20, n_trees=50, ntrees=None, knn_distance_metric='angular', search_k=-1, precompute=True, model='szubert', supervision_metric='sparse_categorical_crossentropy', supervision_weight=0.5, annoy_index_path=None, callbacks=None, build_index_on_disk=True, neighbour_matrix=None, verbose=1)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Ivis is a technique that uses an artificial neural network for dimensionality reduction, often useful for the purposes of visualization. The network trains on triplets of data-points at a time and pulls positive points together, while pushing more distant points away from each other. Triplets are sampled from the original data using approximate KNN retrieval provided by the Annoy library.

Parameters:
  • embedding_dims (int) – Number of dimensions in the embedding space
  • k (int) – The number of neighbours to retrieve for each point. Must be less than the number of rows in the dataset minus one.
  • distance (Union[str,Callable]) –

    The loss function used to train the neural network.

    • If string: a registered loss function name. Predefined losses are: “pn”, “euclidean”, “manhattan_pn”, “manhattan”, “chebyshev”, “chebyshev_pn”, “softmax_ratio_pn”, “softmax_ratio”, “cosine”, “cosine_pn”.
    • If Callable, must have two parameters, (y_true, y_pred). y_pred denotes the batch of triplets, and y_true are any corresponding labels. y_pred is expected to be of shape: (3, batch_size, embedding_dims).
      • When loading a model that was saved with a custom loss, provide the loss to the constructor of the new Ivis instance before loading the saved model.
  • batch_size (int) – The size of mini-batches used during gradient descent while training the neural network. Must be less than the number of rows in the dataset.
  • epochs (int) – The maximum number of epochs to train the model for. Each epoch the network will see a triplet based on each data-point once.
  • n_epochs_without_progress (int) – After n number of epochs without an improvement to the loss, terminate training early.
  • n_trees (int) – The number of random projection trees built by Annoy to approximate KNN. The more trees, the higher the memory usage, but the better the accuracy of results.
  • ntrees (int) – Deprecated. Use n_trees instead.
  • knn_distance_metric (str) – The distance metric used to retrieve nearest neighbours. Supports “angular” (default), “euclidean”, “manhattan”, “hamming”, or “dot”.
  • search_k (int) – The maximum number of nodes inspected during a nearest neighbour query by Annoy. The higher, the more computation time required, but the higher the accuracy. The default is n_trees * k, where k is the number of neighbours to retrieve. If this is set too low, a variable number of neighbours may be retrieved per data-point.
  • precompute (bool) – Whether to pre-compute the nearest neighbours. Pre-computing is a little faster, but requires more memory. If memory is limited, try setting this to False.
  • model (Union[str,tf.keras.models.Model]) –

    The keras model to train using triplet loss.

    • If a model object is provided, an embedding layer of size ‘embedding_dims’ will be appended to the end of the network.
    • If a string, a pre-defined network by that name will be used. Possible options are: ‘szubert’, ‘hinton’, ‘maaten’. By default the ‘szubert’ network will be created, which is a selu network composed of 3 dense layers of 128 neurons each, followed by an embedding layer of size ‘embedding_dims’.
  • supervision_metric (str) – The supervision metric to optimize when training keras in supervised mode. Supports all of the classification or regression losses included with keras, so long as the labels are provided in the correct format. A list of keras’ loss functions can be found at https://keras.io/losses/ .
  • supervision_weight (float) – Float between 0 and 1 denoting the weighting to give to classification vs triplet loss when training in supervised mode. The higher the weight, the more classification influences training. Ignored if using Ivis in unsupervised mode.
  • annoy_index_path (str) – The filepath of a pre-trained annoy index file saved on disk. If provided, the annoy index file will be loaded and used. Otherwise, a new index will be generated and saved to disk in a temporary directory.
  • callbacks ([keras.callbacks.Callback]) – List of keras Callbacks to pass to the model during training, such as the TensorBoard callback. A set of ivis-specific callbacks is provided in the ivis.nn.callbacks module.
  • build_index_on_disk (bool) – Whether to build the annoy index directly on disk. Building on disk should allow for bigger datasets to be indexed, but may cause issues.
  • neighbour_matrix (Union[np.array,collections.abc.Sequence]) –

    Providing a neighbour matrix will cause Ivis to skip computing the Annoy KNN index and instead use the provided neighbour_matrix.

    • A pre-computed neighbour matrix can be provided as a numpy array. Indexing the array should retrieve a list of neighbours for the data point associated with that index.
    • Alternatively, dynamic computation of neighbours can be done by providing a class that implements the collections.abc.Sequence class, specifically the __getitem__ and __len__ methods.
      • See the ivis.data.neighbour_retrieval.AnnoyKnnMatrix class for an example.
  • verbose (int) – Controls the volume of logging output the model produces when training. When set to 0, output is silenced; values above 0 print output.
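
The neighbour_matrix parameter described above accepts any object that behaves like a sequence of neighbour lists. The following is a minimal sketch of such an object, not the library's implementation: the class name BruteForceNeighbours and its exhaustive Euclidean search are assumptions made for illustration. Only the collections.abc.Sequence interface (__getitem__ and __len__) comes from the parameter description.

```python
import math
from collections.abc import Sequence


class BruteForceNeighbours(Sequence):
    """Hypothetical neighbour matrix that computes neighbours on demand."""

    def __init__(self, data, k):
        self.data = data  # list of equal-length feature vectors
        self.k = k        # number of neighbours returned per row

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Rank every other row by Euclidean distance to row `idx`
        # and return the indices of the k closest ones.
        others = [i for i in range(len(self.data)) if i != idx]
        others.sort(key=lambda i: math.dist(self.data[idx], self.data[i]))
        return others[:self.k]


points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
neighbours = BruteForceNeighbours(points, k=1)
```

An object like this would be supplied through the neighbour_matrix argument. The exhaustive search in this sketch scans every row on each lookup, so it is only practical for small datasets.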
fit(X, Y=None, shuffle_mode=True)

Fit an ivis model.

Parameters:
  • X (np.array, ivis.data.sequence.IndexableDataset, tensorflow.keras.utils.HDF5Matrix) – Data to be embedded. Needs to have a .shape attribute and a __getitem__ method.
  • Y (array, shape (n_samples)) – Optional array for supervised dimensionality reduction. If Y contains -1 labels, and ‘sparse_categorical_crossentropy’ is the loss function, semi-supervised learning will be used.
Returns:

self – Returns estimator instance.

Return type:

ivis.Ivis object

fit_transform(X, Y=None, shuffle_mode=True)

Fit to data then transform

Parameters:
  • X (np.array, ivis.data.sequence.IndexableDataset, tensorflow.keras.utils.HDF5Matrix) – Data to train on and then embedded. Needs to have a .shape attribute and a __getitem__ method.
  • Y (array, shape (n_samples)) – Optional array for supervised dimensionality reduction. If Y contains -1 labels, and ‘sparse_categorical_crossentropy’ is the loss function, semi-supervised learning will be used.
Returns:

X_new – Embedding of the data in low-dimensional space.

Return type:

array, shape (n_samples, embedding_dims)

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict
load_model(folder_path)

Load ivis model

Parameters: folder_path (string) – Path to serialised model files and metadata
Returns: self – Returns estimator instance.
Return type: ivis.Ivis object
save_model(folder_path, save_format='h5', overwrite=False)

Save an ivis model

Parameters:
  • folder_path (string) – Path to serialised model files and metadata
  • save_format (string) – Format to save ivis model as. Either “h5” for a .h5 file or “tf” for TensorFlow SavedModel format.
  • overwrite (bool) – Whether to overwrite the specified folder path.
score_samples(X)

Passes X through classification network to obtain predicted supervised values. Only applicable when trained in supervised mode.

Parameters: X (np.array, ivis.data.sequence.IndexableDataset, tensorflow.keras.utils.HDF5Matrix) – Data to be passed through classification network. Needs to have a .shape attribute and a __getitem__ method.
Returns: X_new – Softmax class probabilities of the data.
Return type: array, shape (n_samples, n_classes)
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance
transform(X)

Transform X into the existing embedded space and return that transformed output.

Parameters: X (np.array, ivis.data.sequence.IndexableDataset, tensorflow.keras.utils.HDF5Matrix) – Data to be transformed. Needs to have a .shape attribute and a __getitem__ method.
Returns: X_new – Embedding of the data in low-dimensional space.
Return type: array, shape (n_samples, embedding_dims)

Neighbour Retrieval

class ivis.data.neighbour_retrieval.NeighbourMatrix

Bases: collections.abc.Sequence

A matrix Aij where i is the row index of the data point and j refers to the index of the neighbouring point.

get_batch(idx_seq)

Gets a batch of neighbours corresponding to the provided index sequence.

Non-optimized version; can be overridden by child classes to be made more efficient.

k

The width of the matrix (number of neighbours retrieved)

class ivis.data.neighbour_retrieval.AnnoyKnnMatrix(index, nrows, index_path='annoy.index', metric='angular', k=150, search_k=-1, precompute=False, include_distances=False, verbose=False, n_jobs=-1)

Bases: ivis.data.neighbour_retrieval.knn.NeighbourMatrix

A matrix Aij where i is the row index of the data point and j refers to the index of the neighbouring point.

Neighbouring points are KNN retrieved using an Annoy Index.

Parameters:
  • index (AnnoyIndex) – AnnoyIndex instance to use when retrieving KNN
  • nrows (tuple) – Number of rows in the data matrix the index was built on
  • index_path (string) – Location of the AnnoyIndex file on disk
  • k (int) – The number of neighbours to retrieve for each point
  • search_k (int) – Controls the number of nodes searched; higher is more expensive but more accurate. The default of -1 resolves to n_trees * k
  • precompute (boolean) – Whether to precompute the KNN index and store the matrix in memory. Much faster when training, but consumes more memory.
  • include_distances (boolean) – Whether to return the distances along with the indexes of the neighbouring points
  • verbose (boolean) – Controls verbosity of output to console when building index. If False, nothing will be printed to the terminal.
__getitem__(idx)

Returns neighbours list for the specified index. Supports both integer and slice indices.

__getstate__()

Returns a serializable dict of the object's variables

__init__(index, nrows, index_path='annoy.index', metric='angular', k=150, search_k=-1, precompute=False, include_distances=False, verbose=False, n_jobs=-1)

Constructs an AnnoyKnnMatrix instance from an AnnoyIndex object with given parameters

__len__()

Number of rows in neighbour matrix

classmethod build(X, path, k=150, metric='angular', search_k=-1, include_distances=False, ntrees=50, build_index_on_disk=True, precompute=False, verbose=1, n_jobs=-1)

Builds a new Annoy Index on input data X, then constructs and returns an AnnoyKnnMatrix object using the newly-built index.

delete_index(parent=False)

Cleans up disk resources used by the index, rendering it unusable. First will unload the index, then recursively removes the files at index path. If parent is True, will recursively remove parent folder.

get_batch(idx_seq)

Returns a batch of neighbours based on the index sequence provided.

get_neighbour_indices(n_jobs=-1)

Retrieves neighbours for every row in parallel

classmethod load(index_path, data_shape, k=150, metric='angular', search_k=-1, include_distances=False, precompute=False, verbose=1, n_jobs=-1)

Constructs and returns an AnnoyKnnMatrix object from an existing Annoy Index on disk.

save(path)

Saves internal Annoy index to disk at given path.

unload()

Unloads the index from disk, allowing other processes to read/write to the index file. After calling this, the index will no longer be usable from this instance.

class ivis.data.neighbour_retrieval.LabeledNeighbourMap(labels)

Bases: collections.abc.Sequence

Retrieves neighbour indices according to class labels provided in constructor. Rows with the same label will be regarded as neighbours.

__getitem__(idx)

Retrieves the neighbours for the row index provided

__init__(labels)

Constructs a LabeledNeighbourMap instance from a list of labels.

Parameters: labels (list) – List of labels for each data-point. One label per data-point.

__len__()

Returns the number of rows in the data
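
The label-based retrieval described for LabeledNeighbourMap can be sketched in plain Python as follows. This is an illustrative stand-in rather than the class's actual code: the name LabelNeighbourSketch is hypothetical, and returning the queried row among its own neighbours is an assumption.

```python
from collections import defaultdict
from collections.abc import Sequence


class LabelNeighbourSketch(Sequence):
    """Hypothetical stand-in: rows sharing a label are neighbours."""

    def __init__(self, labels):
        self.labels = list(labels)
        # Map each label to the row indices that carry it.
        self.rows_by_label = defaultdict(list)
        for row, label in enumerate(self.labels):
            self.rows_by_label[label].append(row)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # All rows with the same label as row `idx` (including idx itself).
        return self.rows_by_label[self.labels[idx]]


neighbour_map = LabelNeighbourSketch(['cat', 'dog', 'cat', 'bird'])
```

Here row 0 and row 2 share the label 'cat', so each is returned as a neighbour of the other, while the single 'bird' row has only itself.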

ivis.data.neighbour_retrieval.knn.build_annoy_index(X, path, metric='angular', ntrees=50, build_index_on_disk=True, verbose=1, n_jobs=-1)

Build a standalone annoy index.

Parameters:
  • X (array) – numpy array with shape (n_samples, n_features)
  • path (str) – The filepath of a trained annoy index file saved on disk.
  • ntrees (int) – The number of random projection trees built by Annoy to approximate KNN. The more trees, the higher the memory usage, but the better the accuracy of results.
  • build_index_on_disk (bool) – Whether to build the annoy index directly on disk. Building on disk should allow for bigger datasets to be indexed, but may cause issues.
  • metric (str) – Which distance metric Annoy should use when building KNN index. Supports “angular”, “euclidean”, “manhattan”, “hamming”, or “dot”.
  • verbose (int) – Controls the volume of logging output produced when building the index. When set to 0, output is silenced; values above 0 print output.

Indexable Datasets

class ivis.data.sequence.IndexableDataset

Bases: collections.abc.Sequence

A sequence that also defines a shape attribute. This indexable data structure can be provided as input to ivis.

get_batch(idx_seq)

Returns a batch of data points based on the index sequence provided.

Non-optimized version; can be overridden by child classes to be made more efficient.

shape()

Returns the shape of the dataset. First dimension corresponds to rows, the other dimensions correspond to features.
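
The interface described for IndexableDataset (a sequence that also reports its shape) can be illustrated with a small in-memory sketch. The class name InMemoryDataset is hypothetical, and exposing shape as a property is an assumption; it does not subclass the real ivis base class so that it stays self-contained.

```python
from collections.abc import Sequence


class InMemoryDataset(Sequence):
    """Hypothetical dataset exposing the interface described above."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    @property
    def shape(self):
        # (number of rows, number of features per row)
        n_features = len(self.rows[0]) if self.rows else 0
        return (len(self.rows), n_features)

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        return self.rows[idx]

    def get_batch(self, idx_seq):
        # Unoptimized default: index one row at a time.
        return [self[i] for i in idx_seq]


dataset = InMemoryDataset([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
```

A subclass backed by a file or database could override get_batch to fetch all requested rows in a single read, which is the kind of optimization the default leaves open.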

class ivis.data.sequence.ImageDataset(filepath_list, img_shape, color_mode='rgb', resize_method='bilinear', preserve_aspect_ratio=False, dtype=<sphinx.ext.autodoc.importer._MockObject object>, preprocessing_function=None, n_jobs=-1)

Bases: ivis.data.sequence.sequence.IndexableDataset

When indexed, loads images from disk, resizes to consistent size, then returns image. Since the returned images will consist of 3 dimensions, the model ivis uses must be capable of dealing with this dimensionality of data (for example, a Convolutional Neural Network). Such a model can be constructed externally and then passed to ivis as the argument for ‘model’.

Parameters:
  • filepath_list (list) – All image filepaths in dataset.
  • img_shape (tuple) – A tuple (height, width) containing desired dimensions to resize the images to.
  • color_mode (str) – Either “rgb”, “rgba” or “grayscale”. Determines how many channels are present in the images that are read in: 3, 4, or 1 respectively.
  • resize_method (str) – Interpolation method to use when resizing image. Must be one of: “area”, “bicubic”, “bilinear”, “gaussian”, “lanczos3”, “lanczos5”, “mitchellcubic”, “nearest”.
  • preserve_aspect_ratio (boolean) – Whether to preserve the aspect ratio when resizing images. If True, will maintain aspect ratio by padding the image.
  • dtype (tf.dtypes.DType) – The dtype to read the image into. One of tf.uint8 or tf.uint16.
  • preprocessing_function (Callable) – A function to apply to every image. Will be called at the end of the pipeline, after image reading and resizing. If None (default), no function will be applied.
__init__(filepath_list, img_shape, color_mode='rgb', resize_method='bilinear', preserve_aspect_ratio=False, dtype=<sphinx.ext.autodoc.importer._MockObject object>, preprocessing_function=None, n_jobs=-1)

Initialize self. See help(type(self)) for accurate signature.

get_batch(idx_seq)

Returns a batch of data points based on the index sequence provided.

read_image(filepath)

Reads an image from disk into a numpy array

resize_image(img)

Resizes a numpy array image to desired dimensions

class ivis.data.sequence.FlattenedImageDataset(filepath_list, img_shape, color_mode='rgb', resize_method='bilinear', preserve_aspect_ratio=False, dtype=<sphinx.ext.autodoc.importer._MockObject object>, preprocessing_function=None, n_jobs=None)

Bases: ivis.data.sequence.image.ImageDataset

Returns flattened versions of images read in from disk. This dataset can be used with the default neighbour retrieval method used by ivis (Annoy KNN index) since it is 2D.

__init__(filepath_list, img_shape, color_mode='rgb', resize_method='bilinear', preserve_aspect_ratio=False, dtype=<sphinx.ext.autodoc.importer._MockObject object>, preprocessing_function=None, n_jobs=None)

Initialize self. See help(type(self)) for accurate signature.

Losses

Triplet loss functions for training a siamese network with three subnetworks. All loss function variants are accessible through the triplet_loss function by specifying the distance as a string.

class ivis.nn.losses.ChebyshevPnLoss(margin=1, name=None)

Calculates the pn loss (a variant of triplet loss) between anchor, positive and negative examples in a triplet based on chebyshev distance.

class ivis.nn.losses.ChebyshevTripletLoss(margin=1, name=None)

Calculates the standard triplet loss between anchor, positive and negative examples in a triplet based on chebyshev distance.

class ivis.nn.losses.CosinePnLoss(margin=1, name=None)

Calculates the pn loss (a variant of triplet loss) between anchor, positive and negative examples in a triplet based on cosine distance.

class ivis.nn.losses.CosineTripletLoss(margin=1, name=None)

Calculates the standard triplet loss between anchor, positive and negative examples in a triplet based on cosine distance.

class ivis.nn.losses.EuclideanPnLoss(margin=1, name=None)

Calculates the pn loss (a variant of triplet loss) between anchor, positive and negative examples in a triplet based on euclidean distance.

class ivis.nn.losses.EuclideanSoftmaxRatioLoss(name=None)

Calculates the standard softmax ratio between anchor, positive and negative examples in a triplet based on euclidean distance.

class ivis.nn.losses.EuclideanSoftmaxRatioPnLoss(name=None)

Calculates a pn variant of the softmax ratio between anchor, positive and negative examples in a triplet based on euclidean distance.

class ivis.nn.losses.EuclideanTripletLoss(margin=1, name=None)

Calculates the standard triplet loss between anchor, positive and negative examples in a triplet based on euclidean distance.

class ivis.nn.losses.ManhattanPnLoss(margin=1, name=None)

Calculates the pn loss (a variant of triplet loss) between anchor, positive and negative examples in a triplet based on manhattan distance.

class ivis.nn.losses.ManhattanTripletLoss(margin=1, name=None)

Calculates the standard triplet loss between anchor, positive and negative examples in a triplet based on manhattan distance.
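
The difference between the standard triplet loss and the pn variant in the classes above can be sketched in plain Python for the Euclidean case. This is a sketch under stated assumptions, not the library's code: the function names are hypothetical, and the pn formulation used here (comparing the anchor-positive distance against the smaller of the anchor-negative and positive-negative distances) is an assumption about how the variant is defined. The input layout (3, batch_size, embedding_dims) follows the shape documented for y_pred.

```python
import math


def euclidean_triplet_loss(y_true, y_pred, margin=1.0):
    """Mean standard triplet loss over a batch; y_true is unused here."""
    anchors, positives, negatives = y_pred
    losses = []
    for a, p, n in zip(anchors, positives, negatives):
        # Penalize triplets where the positive is not closer to the
        # anchor than the negative by at least `margin`.
        losses.append(max(math.dist(a, p) - math.dist(a, n) + margin, 0.0))
    return sum(losses) / len(losses)


def euclidean_pn_loss(y_true, y_pred, margin=1.0):
    """Assumed pn variant: the negative is measured from anchor and positive."""
    anchors, positives, negatives = y_pred
    losses = []
    for a, p, n in zip(anchors, positives, negatives):
        closest_negative = min(math.dist(a, n), math.dist(p, n))
        losses.append(max(math.dist(a, p) - closest_negative + margin, 0.0))
    return sum(losses) / len(losses)


# One triplet, laid out as (3, batch_size, embedding_dims).
batch = [[[0.0, 0.0]], [[4.0, 0.0]], [[4.0, 3.0]]]
triplet_value = euclidean_triplet_loss(None, batch)
pn_value = euclidean_pn_loss(None, batch)
```

In this triplet the negative is far from the anchor (distance 5) but close to the positive (distance 3), so the standard loss is zero while the pn sketch still reports a penalty.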

ivis.nn.losses.register_loss(loss_fn=None, *, name=None)

Registers a class definition or Callable as an ivis loss function. A mapping will be created between the name and the loss function passed. If a class definition is provided, an instance will be created, passing the name as an argument.

If no name is provided to this function, the name of the passed function will be used as a key.

The loss function must have two parameters, (y_true, y_pred) and calculates the loss for a batch of triplet inputs (y_pred). y_pred is expected to be of shape: (3, batch_size, embedding_dims).

Usage:
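from ivis import Ivis
from ivis.nn.losses import register_loss

@register_loss
def custom_loss(y_true, y_pred):
    pass  # compute and return the loss for the batch of triplets here

model = Ivis(distance='custom_loss')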
ivis.nn.losses.semi_supervised_loss(loss_function)

Wraps the provided ivis supervised loss function to deal with partially labeled data. Returns a new ‘semi-supervised’ loss function that masks the loss on examples where label information is missing.

Missing labels are assumed to be marked with -1.
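
The masking behaviour described above can be sketched in plain Python. This is an illustration, not the library's implementation: the names semi_supervised and absolute_error are hypothetical, the wrapped loss is assumed to score one example at a time, and averaging over labeled examples only is an assumption.

```python
def semi_supervised(per_example_loss):
    """Wrap a per-example loss so that rows labeled -1 contribute nothing."""
    def masked_loss(y_true, y_pred):
        labeled = [(t, p) for t, p in zip(y_true, y_pred) if t != -1]
        if not labeled:
            return 0.0  # no labeled rows in this batch
        return sum(per_example_loss(t, p) for t, p in labeled) / len(labeled)
    return masked_loss


def absolute_error(target, prediction):
    # Stand-in supervised loss used only for this illustration.
    return abs(target - prediction)


loss = semi_supervised(absolute_error)
value = loss([1, -1, 0], [0.5, 0.9, 0.5])
```

The middle example carries the label -1, so its prediction is ignored and only the two labeled examples contribute to the result.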

ivis.nn.losses.triplet_loss(distance='pn')

Returns a previously registered triplet loss function associated with the string ‘distance’. If passed a callable, just returns it.

Callbacks

A collection of callbacks that can be passed to ivis to be called during training. These provide utilities such as saving checkpoints during training (allowing for resuming if interrupted), as well as periodic logging of plots and model embeddings. With this information, you may decide to terminate a training session early due to a lack of improvements to the visualizations, for example.

To use a callback during training, pass a list of callback objects to the Ivis object when creating it, using the callbacks keyword argument. The ivis.nn.callbacks module contains a set of callbacks provided for use with ivis models, but any tf.keras.callbacks.Callback object can be passed and will be used during training: for example, tf.keras.callbacks.TensorBoard. This means it’s also possible to write your own callbacks for ivis to use.

class ivis.nn.callbacks.EmbeddingsImage(data, labels=None, log_dir='./logs', filename='{}_embeddings.png', epoch_interval=1)

Periodically generates and plots 2D embeddings of the data provided to data using the latest state of the Ivis model. By default, saves plots of the embeddings every epoch; increasing the epoch_interval will save the plots less frequently.

Parameters:
  • data (list[float]) – Data to embed and plot with the latest Ivis model
  • labels (list[int]) – Labels with which to colour plotted embeddings. If None all points will have the same color.
  • log_dir (str) – Folder to save resulting embeddings.
  • filename (str) – Filename to save each file as. {} in string will be substituted with the epoch number.

Example usage:

from ivis.nn.callbacks import EmbeddingsImage
from ivis import Ivis
from tensorflow.keras.datasets import mnist

(X_train, Y_train), (X_test, Y_test) = mnist.load_data()

# Flatten the 28x28 images into vectors for the default model
X_train = X_train.reshape((len(X_train), -1))
X_test = X_test.reshape((len(X_test), -1))

# Plot embeddings of test set every epoch colored by labels
embeddings_callback = EmbeddingsImage(X_test, Y_test,
                                      log_dir='test-embeddings',
                                      filename='{}_test_embeddings.png',
                                      epoch_interval=1)

model = Ivis(callbacks=[embeddings_callback])

# Train on training set
model.fit(X_train)
class ivis.nn.callbacks.EmbeddingsLogging(data, log_dir='./embeddings_logs', filename='{}_embeddings.npy', epoch_interval=1)

Periodically saves embeddings of the data provided to data using the latest state of the Ivis model. By default, saves embeddings every epoch; increasing the epoch_interval will save the embeddings less frequently.

Parameters:
  • data (list[float]) – Data to embed with the latest Ivis object
  • log_dir (str) – Folder to save resulting embeddings.
  • filename (str) – Filename to save each file as. {} in string will be substituted with the epoch number.

Example usage:

from ivis.nn.callbacks import EmbeddingsLogging
from ivis import Ivis
from tensorflow.keras.datasets import mnist

(X_train, Y_train), (X_test, Y_test) = mnist.load_data()

# Flatten the 28x28 images into vectors for the default model
X_train = X_train.reshape((len(X_train), -1))
X_test = X_test.reshape((len(X_test), -1))

# Save embeddings of test set every epoch
embeddings_callback = EmbeddingsLogging(X_test,
                                        log_dir='test-embeddings',
                                        filename='{}_test_embeddings.npy',
                                        epoch_interval=1)

model = Ivis(callbacks=[embeddings_callback])

# Train on training set
model.fit(X_train)
class ivis.nn.callbacks.ModelCheckpoint(log_dir='./model_checkpoints', filename='model-checkpoint_{}.ivis', epoch_interval=1)

Periodically saves the model during training. By default, it saves the model every epoch; increasing the epoch_interval will make checkpointing less frequent.

If the given filename contains the {} string, the epoch number will be substituted in, resulting in multiple checkpoint folders with different names. If a filename without {}, such as ‘ivis-checkpoint’, is provided, only the latest checkpoint will be kept.

Parameters:
  • log_dir (str) – Folder to save resulting model checkpoints.
  • filename (str) – Filename to save each file as. {} in string will be substituted with the epoch number.

Example usage:

from ivis.nn.callbacks import ModelCheckpoint
from ivis import Ivis

# Save only the latest checkpoint to current directory every 10 epochs
checkpoint_callback = ModelCheckpoint(log_dir='.',
                                    filename='latest-checkpoint.ivis',
                                    epoch_interval=10)

model = Ivis(callbacks=[checkpoint_callback])
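
The interaction between the filename template and epoch_interval can be sketched in plain Python. The helper checkpoint_names is hypothetical, and counting epochs from 0 with a save on every epoch_interval-th epoch is an assumption about the schedule, not a statement of how the callback is implemented.

```python
def checkpoint_names(filename, n_epochs, epoch_interval):
    """List the names a periodic callback would write, one per saved epoch."""
    names = []
    for epoch in range(n_epochs):
        # Assumption: epochs are counted from 0 and every
        # `epoch_interval`-th epoch triggers a save.
        if epoch % epoch_interval == 0:
            names.append(filename.format(epoch))
    return names


templated = checkpoint_names('model-checkpoint_{}.ivis', 30, 10)
fixed = checkpoint_names('latest-checkpoint.ivis', 30, 10)
```

With the {} placeholder each save gets a distinct name, whereas the fixed filename produces the same name every time, so each save replaces the previous one.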
class ivis.nn.callbacks.TensorBoardEmbeddingsImage(data, labels=None, log_dir='./logs', epoch_interval=1)

Periodically generates and plots 2D embeddings of the data provided to data using the latest state of the Ivis model. The plots are designed to be viewed in Tensorboard, which will provide an image that shows the history of embeddings plots through training. By default, saves plots of the embeddings every epoch; increasing the epoch_interval will save the plots less frequently.

Parameters:
  • data (list[float]) – Data to embed and plot with the latest Ivis model
  • labels (list[int]) – Labels with which to colour plotted embeddings. If None all points will have the same color.
  • log_dir (str) – Folder to save resulting embeddings.

Example usage:

from ivis.nn.callbacks import TensorBoardEmbeddingsImage
from ivis import Ivis
from tensorflow.keras.datasets import mnist

(X_train, Y_train), (X_test, Y_test) = mnist.load_data()

# Flatten the 28x28 images into vectors for the default model
X_train = X_train.reshape((len(X_train), -1))
X_test = X_test.reshape((len(X_test), -1))

# Plot embeddings of test set every epoch colored by labels
embeddings_callback = TensorBoardEmbeddingsImage(X_test, Y_test,
                                                 log_dir='test-embeddings',
                                                 epoch_interval=1)

model = Ivis(callbacks=[embeddings_callback])

# Train on training set
model.fit(X_train)