Overview
The framework has a layered, modular architecture. Each batch generator instance is in fact a stack of four layers of functionality:
- Batch generator - samples batches from the full dataset. The sampled batches are pandas DataFrames with the same structure as the full dataset. The generator passes each batch to the next layer
- Batch transformers - apply transformations to a sampled batch. A batch transformer sees all variables in a multi-variable scenario and can therefore be used for transformations where variables interact, e.g. feature dropout, where each data point has a number of features and you would like to drop one of them at random. You can specify multiple transformers, which are applied sequentially
- Sklearn transformers - these are normally encoders that encode your data into a Keras-friendly format. In the structure definition you specify which sklearn transformer should be applied to which column of the dataset DataFrame
- Batch shaper - a layer that arranges the NumPy arrays returned by the encoders into a structure accepted by Keras.
At each layer, there is a choice of interchangeable object types, so you can assemble a batch generator with exactly the functionality you need. On top of that, you can create custom layer types, which makes the framework very flexible and extensible.
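To make the data flow concrete, here is a minimal sketch of what one stack does to a single batch, written with plain pandas and scikit-learn. The toy DataFrame and all variable names are illustrative only and are not part of the framework's API:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'colour': ['red', 'green', 'red', 'blue'],
                   'size': [1.0, 2.0, 3.0, 4.0],
                   'label': [0, 1, 0, 1]})

# 1. Batch generator: sample a batch, itself a DataFrame of the same structure
batch = df.sample(n=2)

# 2. Batch transformers: transform the batch with all columns visible at once
batch = batch.assign(size=batch['size'] / df['size'].max())

# 3. Sklearn transformers: encode individual columns into numeric arrays
colour_encoder = LabelEncoder().fit(df['colour'])
colour_encoded = colour_encoder.transform(batch['colour'])

# 4. Batch shaper: arrange the arrays into the (x, y) structure Keras expects
x = [colour_encoded, batch['size'].to_numpy()]
y = batch['label'].to_numpy()
```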
Components
The framework comes with a few standard components that you can choose from and combine to build a generator with the required functionality.
The sections below describe these out-of-the-box components.
Batch generators
Batch generators are responsible for sampling batches from a dataset. By choosing a different type of batch generator you select the sampling strategy of the whole stack. At the moment, two generators are included: a general-purpose batch generator and a triplet generator for feeding data into a triplet network.
All batch generators are interchangeable, with some minor differences in parameters.
Generic BatchGenerator
The generic batch generator implements basic sampling without replacement from a dataset. It ensures that each data point is sampled only once per epoch. This batch generator can also work without shuffling, i.e. data points are sampled in the same order as they appear in the dataset.
This batch generator is implemented in the BatchGenerator class. Here is an example of use:
```python
from keras_batchflow.base.batch_generators import BatchGenerator
from sklearn.preprocessing import LabelEncoder
import pandas as pd
from keras.models import Model

titanic_data = pd.read_csv('../data/titanic/train.csv')

# fit label encoders for the categorical columns used in the batch structure
embark_encoder = LabelEncoder().fit(titanic_data['Embarked'])
cabinclass_encoder = LabelEncoder().fit(titanic_data['Pclass'])

train_generator = BatchGenerator(
    titanic_data,
    x_structure=(
        ('Embarked', embark_encoder),
        ('Pclass', cabinclass_encoder),
        ('Age', None),          # None passes the column through without encoding
    ),
    y_structure=('Survived', None),
    batch_size=32,
    shuffle=True,
)

model = Model()  # stands in for your actual Keras model definition
model.fit_generator(train_generator)  # in TF 2, model.fit accepts generators directly
```
You can find all the details in the API documentation.
Triplet PK generator
This class implements a batch generator for a triplet network described in this paper.
A triplet network is an evolution of the siamese network; both are known for their ability to learn from very few samples. For this property, they are sometimes referred to as one-shot learning models.
In summary, the generator randomly samples P classes (labels) and K random data points for each of them. Within each batch, both samplings are done without replacement, but the samplings are independent from batch to batch, so the generator does not guarantee that each data point is used exactly once per epoch.
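To clarify the sampling logic, here is a simplified sketch of PK sampling in plain pandas. The function name and parameters are made up for illustration and do not describe the generator's actual implementation:

```python
import numpy as np
import pandas as pd

def sample_pk_batch(df: pd.DataFrame, label_col: str, p: int, k: int) -> pd.DataFrame:
    """Sample P labels, then K rows for each label, each without replacement."""
    rng = np.random.default_rng()
    labels = rng.choice(df[label_col].unique(), size=p, replace=False)
    # assumes each sampled class has at least K rows available
    parts = [df[df[label_col] == label].sample(n=k) for label in labels]
    return pd.concat(parts)
```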
The generator is designed to be used with a triplet loss that mines triplets "online", i.e. inside the neural network during training. This type of loss is already available in tensorflow and tensorflow v2.
You can use this loss with Keras by wrapping it into a function with the standard (y_true, y_pred) loss signature:

```python
from tensorflow_addons.losses import triplet_semihard_loss

def keras_triplet_loss(labels, embeddings):
    # the loss expects class labels as y_true and embeddings as y_pred
    return triplet_semihard_loss(labels, embeddings, margin=1.0)
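```

Assuming your model outputs one embedding vector per sample and the generator yields the class labels as y, this wrapper plugs in like any other Keras loss:

```python
model.compile(optimizer='adam', loss=keras_triplet_loss)
```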
Batch transformers
All batch transformers modify batches produced by batch generators. Batch transformers are used for transformations that involve interaction between variables, for example randomly dropping values so that no more than N variables are dropped in each record.
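To illustrate the idea, here is a simplified, standalone transformer that drops at most one value per row. It is only a sketch: the class name, constructor parameters and the transform interface are assumptions for this example and do not describe the library's own classes:

```python
import numpy as np
import pandas as pd

class SimpleRowDropout:
    """Illustrative batch transformer: drops at most one value per row.

    Not the library's implementation; the name and interface are
    assumptions made for this example.
    """

    def __init__(self, cols, row_prob=0.5, drop_value=np.nan):
        self.cols = list(cols)
        self.row_prob = row_prob
        self.drop_value = drop_value

    def transform(self, batch: pd.DataFrame) -> pd.DataFrame:
        batch = batch.copy()
        rng = np.random.default_rng()
        # select rows for dropout, each independently with probability row_prob
        mask = rng.random(len(batch)) < self.row_prob
        for row in batch.index[mask]:
            # because all columns are visible at once, we can guarantee
            # that no more than one value is dropped per record
            batch.loc[row, rng.choice(self.cols)] = self.drop_value
        return batch
```

In the framework, a transformer like this would sit in the batch transformer layer, running on each sampled DataFrame before the sklearn encoders are applied.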
Feature dropout
TODO. In the meantime, see the API documentation for this object.
Shuffle noise
TODO. In the meantime, see the API documentation for this object.