Overview

The framework has a layered, modular architecture. Each instance of a batch generator is actually a stack of four layers of functionality.

Functional diagram

  • Batch generator - samples batches from the full dataset. The sampled batches are pandas dataframes with the same structure as the full dataset. The generator passes each batch to the next layer
  • Batch transformers - apply transformations to a sampled batch. A transformer has access to all variables in a multi-variable scenario and can therefore be used for transformations where variables interact, e.g. feature dropout, where each data point has a number of features and you would like to drop one of them at random. You can specify multiple transformers, which will be applied sequentially
  • Sklearn transformers - these are normally encoders that convert your data into a Keras-friendly format. In the structure definition, you specify which sklearn transformer should be applied to which column of the dataset dataframe
  • Batch shaper - a layer that arranges the numpy arrays returned by the encoders into a structure accepted by Keras

At each level, there is a choice of interchangeable object types that you can combine to build a batch generator with the functionality you need. On top of that, you can create custom layer types, making the framework very flexible and extensible. The sketch below illustrates the data flow through the four layers.
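To make the data flow concrete, here is a minimal, library-free sketch of what each of the four layers does. All names here are illustrative; this is not the framework's actual API:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'colour': ['red', 'blue', 'red', 'green'],
                   'size': [1.0, 2.0, 3.0, 4.0],
                   'label': [0, 1, 0, 1]})

# layer 1 - batch generator: sample a batch with the same structure as the dataset
batch = df.sample(n=2)

# layer 2 - batch transformer: a whole-batch transformation (here a trivial no-op copy)
batch = batch.copy()

# layer 3 - sklearn transformers: encode individual columns into numeric arrays
colour_encoder = LabelEncoder().fit(df['colour'])
colour_encoded = colour_encoder.transform(batch['colour'])

# layer 4 - batch shaper: arrange the numpy arrays into an (x, y) structure for Keras
x = [colour_encoded, batch['size'].to_numpy()]
y = batch['label'].to_numpy()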

Components

The framework comes with a few standard components that you can choose from and combine to build a generator with the required functionality.

The sections below describe these out-of-the-box components.

Batch generators

Batch generators are responsible for sampling batches from a dataset. By choosing different types of batch generators, you select different sampling strategies for the whole stack. At the moment, two generators are included: a general-purpose batch generator and a triplet generator for feeding data into a triplet network.

All batch generators are interchangeable, with some minor differences in parameters.

Generic BatchGenerator

The generic batch generator implements basic sampling without replacement from a dataset: it ensures that each datapoint is sampled only once per epoch. It can also work without shuffling, i.e. datapoints are sampled in the same order as they appear in the dataset.

This batch generator is implemented in the BatchGenerator class. Here is an example of its use:

from keras_batchflow.base.batch_generators import BatchGenerator
from sklearn.preprocessing import LabelEncoder
import pandas as pd

titanic_data = pd.read_csv('../data/titanic/train.csv')

# fit label encoders for the categorical columns
embark_encoder = LabelEncoder().fit(titanic_data['Embarked'])
cabinclass_encoder = LabelEncoder().fit(titanic_data['Pclass'])

train_generator = BatchGenerator(
    titanic_data,
    x_structure=(
        ('Embarked', embark_encoder),
        ('Pclass', cabinclass_encoder),
        ('Age', None)                    # None: the column is passed through unencoded
    ),
    y_structure=('Survived', None),
    batch_size=32,
    shuffle=True
)

# build and compile your Keras model here (omitted for brevity)
model = ...

model.fit_generator(train_generator)
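Since the generator is passed to fit_generator, it presumably follows the Keras Sequence protocol; assuming it does, you can inspect individual batches by index before training, which is useful for checking the batch structure:

# assumes BatchGenerator implements the Keras Sequence protocol (__len__/__getitem__)
x_batch, y_batch = train_generator[0]   # first batch: x follows x_structure, y follows y_structure
print(len(train_generator))             # number of batches per epoch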

You can find all the details in the API documentation.

Triplet PK generator

This class implements a batch generator for a triplet network described in this paper.

A triplet network is an evolution of siamese networks; both are known for their ability to learn from very few samples. For this property, they are sometimes referred to as one-shot learning models.

In summary, the generator randomly samples P classes (labels) and then K random datapoints for each of them. Within each batch, both samplings are done without replacement, but samplings are independent from batch to batch, so the generator does not guarantee that each datapoint will be used exactly once per epoch.
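For illustration only (this is not the generator's actual implementation), P-K sampling of a single batch could be sketched like this; the function name and parameters are hypothetical:

import numpy as np
import pandas as pd

def sample_pk_batch(df: pd.DataFrame, label_col: str, p: int, k: int) -> pd.DataFrame:
    # sample P classes without replacement
    classes = np.random.choice(df[label_col].unique(), size=p, replace=False)
    # sample K datapoints per class, without replacement within this batch
    parts = [df[df[label_col] == c].sample(n=k) for c in classes]
    return pd.concat(parts)

# e.g. a batch of 2 * 4 = 8 rows:
# batch = sample_pk_batch(titanic_data, 'Pclass', p=2, k=4)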

The generator is designed to be used with a triplet loss that mines triplets "online", i.e. inside the neural network during training. This type of loss is already available in tensorflow and tensorflow v2.

You can use this loss with Keras:

# triplet_semihard_loss is provided by the tensorflow_addons package
from tensorflow_addons.losses import triplet_semihard_loss

def keras_triplet_loss(labels, embeddings):
    # Keras passes (y_true, y_pred); here y_pred is the model's embedding output
    return triplet_semihard_loss(labels, embeddings, margin=1.0)
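Assuming a Keras model whose output is an embedding vector, the wrapper can then be used like any other Keras loss; the model and triplet_generator variables below are placeholders:

# the model's output must be an embedding; the labels come from the generator's y_structure
model.compile(optimizer='adam', loss=keras_triplet_loss)
model.fit_generator(triplet_generator)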

Batch transformers

All batch transformers modify batches generated by batch generators. They are used for transformations that involve interaction between variables, for example randomly dropping values so that no more than N variables are dropped in each record.
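To make this concrete, here is a minimal, framework-independent sketch of such a constrained dropout; it is only an illustration of a whole-batch operation with cross-variable constraints, not the FeatureDropout class itself:

import numpy as np
import pandas as pd

def drop_at_most_n(batch: pd.DataFrame, columns, n: int, drop_value=np.nan) -> pd.DataFrame:
    """Randomly blank values so that at most n of the given columns are dropped per row."""
    batch = batch.copy()
    for idx in batch.index:
        how_many = np.random.randint(0, n + 1)   # drop between 0 and n values in this row
        if how_many:
            cols = np.random.choice(columns, size=how_many, replace=False)
            batch.loc[idx, cols] = drop_value
    return batch

# e.g. transformed = drop_at_most_n(batch, ['Embarked', 'Pclass', 'Age'], n=1)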

Feature dropout

TODO. Meanwhile, see the API documentation for this object.

Shuffle noise

TODO. Meanwhile, see the API documentation for this object.