Overview

Keras batchflow is a batch generator framework for Keras. It generates batches for the keras fit_generator and predict_generator functions in a range of less standard scenarios: multi-input and multi-output models, dynamic data augmentation with dependencies between variables, and so on.

The framework bridges gaps between keras and two other core data science packages: pandas and sklearn. With it, you can use a pandas dataframe directly as a data source for your keras model, and you can use the full breadth of standard sklearn encoders to transform columns of a dataframe into numpy arrays.
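To make the column-to-array idea concrete, here is a minimal sketch that uses only pandas and sklearn directly, without keras batchflow. The dataframe and the column names `colour` and `label` are made up for illustration; the point is only that a fitted sklearn encoder turns a dataframe column into a numpy array.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, LabelBinarizer

# Toy dataframe; the column names are illustrative assumptions
df = pd.DataFrame({
    'colour': ['Green', 'Yellow', 'Red', 'Green'],
    'label': ['Leaf', 'Flower', 'Leaf', 'Branch'],
})

# LabelEncoder maps each category to an integer id (1-D array)
colour_ids = LabelEncoder().fit(df['colour']).transform(df['colour'])

# LabelBinarizer maps each category to a one-hot row (2-D array)
label_onehot = LabelBinarizer().fit(df['label']).transform(df['label'])

print(colour_ids.shape)    # (4,)
print(label_onehot.shape)  # (4, 3)
```

The framework takes pre-fitted encoders of this kind together with the column names, as the quick example in the next section shows.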

Keras batchflow follows keras's modular approach. A batch generator the framework produces is actually a stack of a few interchangeable objects selected from a library, just like keras's layers. Each selected object adds a new feature to the generator. This gives the framework combinatorial strength: it can cover a wide range of scenarios just by mixing library objects.

Quick taster example

Here is a quick example of what the framework can do:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, LabelBinarizer
from keras_batchflow.batch_generators import BatchGenerator

df = pd.DataFrame({
    'var1': ['Class 0', 'Class 1', 'Class 0', 'Class 2', 'Class 0', 'Class 1', 'Class 0', 'Class 2'],
    'var2': ['Green', 'Yellow', 'Red', 'Brown', 'Green', 'Yellow', 'Red', 'Brown'],
    'label': ['Leaf', 'Flower', 'Leaf', 'Branch', 'Leaf', 'Flower', 'Leaf', 'Branch']
})

# pre-fit sklearn encoders
var1_enc = LabelEncoder().fit(df['var1'])
var2_enc = LabelEncoder().fit(df['var2'])
label_enc = LabelBinarizer().fit(df['label'])

# define a batch generator
train_gen = BatchGenerator(
    df,
    x_structure=(('var1', var1_enc), ('var2', var2_enc)),
    y_structure=('label', label_enc),
    batch_size=4,
    train_mode=True
)

The generator returns batches in the format (x_structure, y_structure). The batch shapes are:

>>> train_gen.shape
(((None, ), (None, )), (None, 3))

The first element is the x_structure, a tuple of two inputs. Both are outputs of LabelEncoders, which return integer ids of categorical variables, hence only one dimension. The y_structure is a single output produced by a one-hot encoder (LabelBinarizer), hence two dimensions.
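The nested batch format can be illustrated with a hand-written generator. This is only a sketch of the idea, not the framework's implementation: `iter_batches` is a hypothetical helper, it walks the dataframe in order rather than sampling, and it assumes a label column with three classes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, LabelBinarizer

def iter_batches(df, x_structure, y_structure, batch_size):
    """Hypothetical helper: yield (x, y) batches in the nested format."""
    y_col, y_enc = y_structure
    for start in range(0, len(df), batch_size):
        chunk = df.iloc[start:start + batch_size]
        # one encoded array per (column, encoder) pair in x_structure
        x = tuple(enc.transform(chunk[col]) for col, enc in x_structure)
        yield x, y_enc.transform(chunk[y_col])

# Toy data with three label classes (an assumption of this sketch)
df = pd.DataFrame({
    'var1': ['Class 0', 'Class 1', 'Class 0', 'Class 2'] * 2,
    'var2': ['Green', 'Yellow', 'Red', 'Brown'] * 2,
    'label': ['Leaf', 'Flower', 'Leaf', 'Branch'] * 2,
})
x_structure = tuple((c, LabelEncoder().fit(df[c])) for c in ('var1', 'var2'))
y_structure = ('label', LabelBinarizer().fit(df['label']))

batches = list(iter_batches(df, x_structure, y_structure, batch_size=4))
(x1, x2), y = batches[0]
print(x1.shape, x2.shape, y.shape)  # (4,) (4,) (4, 3)
```

With a concrete batch size of 4, the shapes match the pattern reported by the generator, where None stands for the batch dimension.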

Now you can define a neural network and use the generator above in fit_generator to train it:

from keras.layers import Dense, Concatenate, Embedding, Input
from keras.models import Model

# train_gen is the batch generator defined above
shapes = train_gen.shape
n_classes = train_gen.n_classes

# one input per element of x_structure
var1_input = Input(batch_shape=shapes[0][0])
var2_input = Input(batch_shape=shapes[0][1])
var1_emb = Embedding(n_classes[0][0], 10)(var1_input)
var2_emb = Embedding(n_classes[0][1], 10)(var2_input)
features = Concatenate()([var1_emb, var2_emb])
# shapes[1] is (None, 3), so shapes[1][1] is just 3
classes = Dense(shapes[1][1], activation='softmax')(features)

model = Model([var1_input, var2_input], classes)

model.compile('adam', 'categorical_crossentropy')

model.fit_generator(train_gen)

Installation

pip install git+https://github.com/maxsch3/keras-batchflow.git