Basic use of Keras-batchflow with Titanic data

The example below shows the most basic use of keras-batchflow for predicting the survival outcome in the Titanic disaster. The well-known Titanic dataset from Kaggle is used in this example.

This dataset has a mixture of categorical and numeric variables, which shows the features of keras-batchflow well.

Data pre-processing

In [1]:
import pandas as pd
import numpy as np
In [2]:
data = pd.read_csv('../data/titanic/train.csv')
data.shape
Out[2]:
(891, 12)

There are only 891 data points in the training dataset.

In [3]:
data.head()
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Imagine that after exploratory analysis and model selection, only a few columns were chosen as features: Pclass, Sex, Age, and Embarked.

Let's see if there are any NAs to fill:

In [4]:
data[['Pclass', 'Sex', 'Age', 'Embarked', 'Survived']].isna().apply(sum)
Out[4]:
Pclass        0
Sex           0
Age         177
Embarked      2
Survived      0
dtype: int64

Let's fill those NAs:

In [5]:
data['Age'] = data['Age'].fillna(0)
data['Embarked'] = data['Embarked'].fillna('')

The outcome column Survived is categorical too, but it is presented as 0 and 1 and does not require any conversion for the purpose of binary classification.

To make the example more generic, I will convert this outcome to the text labels Yes and No.

In [6]:
data['Survived'] = data['Survived'].astype(str)
data.loc[data.Survived == '1', 'Survived'] = 'Yes'
data.loc[data.Survived == '0', 'Survived'] = 'No'

Batch generator

I would like to build a simple neural network that uses embeddings for all categorical features to predict whether a passenger would survive.

When building such a model, I will need to provide the number of levels of each categorical feature in the embedding layer declarations. Keras-batchflow provides some automation that helps determine this parameter for each feature, so I will build the generator first.
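
For comparison, this is what counting those levels manually looks like with plain pandas (a quick sketch using the standard nunique call, run after the NA filling above):

# manual alternative: count the levels of each categorical feature
data[['Pclass', 'Sex', 'Embarked']].nunique()
# Pclass      3
# Sex         2
# Embarked    4   <- includes the '' level introduced by fillna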

To build a batchflow generator, you first need to define encoders that map each categorical value to its integer representation. I will use sklearn's LabelEncoder for this purpose.
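
As a quick reminder of how LabelEncoder behaves (standard sklearn, nothing batchflow-specific):

from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder().fit(['No', 'Yes'])  # classes are sorted: 'No' -> 0, 'Yes' -> 1
enc.transform(['Yes', 'No', 'Yes'])      # array([1, 0, 1])
enc.inverse_transform([1, 0])            # array(['Yes', 'No'], dtype='<U3')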

In [7]:
from sklearn.preprocessing import LabelEncoder

class_encoder = LabelEncoder().fit(data['Pclass'])
sex_encoder = LabelEncoder().fit(data['Sex'])
embarked_encoder = LabelEncoder().fit(data['Embarked'].astype(str))
surv_encoder = LabelEncoder().fit(data['Survived'])

Split the data into training and validation sets:

In [8]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(data, train_size=.85, random_state=0)

Now I can define the batch generators. I will be using the basic class BatchGenerator.

In [9]:
from keras_batchflow.keras.batch_generators import BatchGenerator
2023-12-28 11:56:15.307931: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
In [10]:
x_structure = (
    ('Pclass', class_encoder),
    ('Sex', sex_encoder),
    ('Embarked', embarked_encoder),
    # None below means no encoding will be applied and values will be passed unchanged
    ('Age', None)
)
y_structure = ('Survived', surv_encoder)

bg_train = BatchGenerator(train_data,
                          x_structure=x_structure,
                          y_structure=y_structure,
                          shuffle=True,
                          batch_size=8)
bg_test = BatchGenerator(test_data,
                         x_structure=x_structure,
                         y_structure=y_structure,
                         shuffle=True,
                         batch_size=8)

I can now check the first batch it generates:

In [11]:
bg_test[0]
Out[11]:
((array([[1],
         [0],
         [2],
         [2],
         [2],
         [0],
         [1],
         [2]]),
  array([[0],
         [1],
         [1],
         [1],
         [0],
         [1],
         [1],
         [1]]),
  array([[3],
         [3],
         [2],
         [3],
         [2],
         [3],
         [3],
         [2]]),
  array([[26.],
         [52.],
         [32.],
         [20.],
         [21.],
         [ 0.],
         [54.],
         [ 0.]])),
 array([[0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0]]))

This is exactly what Keras expects:

  • the batch is a tuple (X, y)
  • X is a tuple of numpy arrays - this is how Keras expects multiple inputs to be passed
  • y is a single numpy array
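
A quick way to confirm this structure (a sketch using the bg_test generator created above):

X, y = bg_test[0]
print([a.shape for a in X])   # [(8, 1), (8, 1), (8, 1), (8, 1)]
print(y.shape)                # (8, 1)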

Before I jump into building a keras model, I'd like to show the helper attributes keras-batchflow provides for automated model creation.

In [12]:
bg_train.shapes
Out[12]:
(((1,), (1,), (1,), (1,)), (1,))
In [13]:
bg_train.metadata
Out[13]:
(({'name': 'Pclass',
   'encoder': LabelEncoder(),
   'shape': (1,),
   'dtype': dtype('int64'),
   'n_classes': 3},
  {'name': 'Sex',
   'encoder': LabelEncoder(),
   'shape': (1,),
   'dtype': dtype('int64'),
   'n_classes': 2},
  {'name': 'Embarked',
   'encoder': LabelEncoder(),
   'shape': (1,),
   'dtype': dtype('int64'),
   'n_classes': 4},
  {'name': 'Age',
   'encoder': None,
   'shape': (1,),
   'dtype': dtype('float64'),
   'n_classes': None}),
 {'name': 'Survived',
  'encoder': LabelEncoder(),
  'shape': (1,),
  'dtype': dtype('int64'),
  'n_classes': 2})

Keras model

In [14]:
from keras.layers import Input, Embedding, Dense, Concatenate, Lambda, Dropout
from keras.models import Model
import keras.backend as K

metadata_x, metadata_y = bg_train.metadata
# define categorical and numeric inputs from X metadata
inputs = [Input(shape=m['shape'], dtype=m['dtype']) for m in metadata_x]
# define embeddings for categorical features (where n_classes is not None) and connect them to inputs
embs = [Embedding(m['n_classes'], 10)(inp) for m, inp in zip(metadata_x, inputs) if m['n_classes'] is not None]
# collapse the unnecessary dimension after embedding layers: (None, 1, 10) -> (None, 10)
embs = [Lambda(lambda x: K.squeeze(x, axis=1))(emb) for emb in embs]
# separate numeric inputs
num_inps = [inp for m, inp in zip(metadata_x, inputs) if m['n_classes'] is None]
# convert data type to standard keras float datatype
num_x = [Lambda(lambda x: K.cast(x, 'float32'))(ni) for ni in num_inps]
# merge all inputs
x = Concatenate()(embs + num_x)
x = Dropout(.3)(x)
x = Dense(64, activation='relu')(x)
x = Dropout(.3)(x)
x = Dense(32, activation='relu')(x)
x = Dropout(.3)(x)
survived = Dense(2, activation='softmax')(x)

model = Model(inputs, survived)

I have added quite significant dropout to avoid overfitting, as the Titanic dataset is rather small for a neural network.

In [15]:
model.summary()
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
==================================================================================================
 input_1 (InputLayer)        [(None, 1)]                  0         []                            
                                                                                                  
 input_2 (InputLayer)        [(None, 1)]                  0         []                            
                                                                                                  
 input_3 (InputLayer)        [(None, 1)]                  0         []                            
                                                                                                  
 embedding (Embedding)       (None, 1, 10)                30        ['input_1[0][0]']             
                                                                                                  
 embedding_1 (Embedding)     (None, 1, 10)                20        ['input_2[0][0]']             
                                                                                                  
 embedding_2 (Embedding)     (None, 1, 10)                40        ['input_3[0][0]']             
                                                                                                  
 input_4 (InputLayer)        [(None, 1)]                  0         []                            
                                                                                                  
 lambda (Lambda)             (None, 10)                   0         ['embedding[0][0]']           
                                                                                                  
 lambda_1 (Lambda)           (None, 10)                   0         ['embedding_1[0][0]']         
                                                                                                  
 lambda_2 (Lambda)           (None, 10)                   0         ['embedding_2[0][0]']         
                                                                                                  
 lambda_3 (Lambda)           (None, 1)                    0         ['input_4[0][0]']             
                                                                                                  
 concatenate (Concatenate)   (None, 31)                   0         ['lambda[0][0]',              
                                                                     'lambda_1[0][0]',            
                                                                     'lambda_2[0][0]',            
                                                                     'lambda_3[0][0]']            
                                                                                                  
 dropout (Dropout)           (None, 31)                   0         ['concatenate[0][0]']         
                                                                                                  
 dense (Dense)               (None, 64)                   2048      ['dropout[0][0]']             
                                                                                                  
 dropout_1 (Dropout)         (None, 64)                   0         ['dense[0][0]']               
                                                                                                  
 dense_1 (Dense)             (None, 32)                   2080      ['dropout_1[0][0]']           
                                                                                                  
 dropout_2 (Dropout)         (None, 32)                   0         ['dense_1[0][0]']             
                                                                                                  
 dense_2 (Dense)             (None, 2)                    66        ['dropout_2[0][0]']           
                                                                                                  
==================================================================================================
Total params: 4284 (16.73 KB)
Trainable params: 4284 (16.73 KB)
Non-trainable params: 0 (0.00 Byte)
__________________________________________________________________________________________________

I can now compile and train the model. Note that sparse_categorical_crossentropy matches the integer-encoded labels the generator produces, so no one-hot conversion is needed.

In [16]:
model.compile('adam', 'sparse_categorical_crossentropy', metrics=['accuracy'])
In [17]:
model.fit(bg_train, validation_data=bg_test, epochs=10)
Epoch 1/10
95/95 [==============================] - 2s 9ms/step - loss: 1.9813 - accuracy: 0.7213 - val_loss: 0.7945 - val_accuracy: 1.0000
Epoch 2/10
95/95 [==============================] - 1s 8ms/step - loss: 1.0991 - accuracy: 0.7635 - val_loss: 0.6878 - val_accuracy: 0.9776
Epoch 3/10
95/95 [==============================] - 1s 8ms/step - loss: 0.7966 - accuracy: 0.6948 - val_loss: 0.6452 - val_accuracy: 0.9701
Epoch 4/10
95/95 [==============================] - 1s 7ms/step - loss: 0.7250 - accuracy: 0.6856 - val_loss: 0.6561 - val_accuracy: 0.9701
Epoch 5/10
95/95 [==============================] - 1s 7ms/step - loss: 0.6290 - accuracy: 0.7107 - val_loss: 0.6592 - val_accuracy: 0.9701
Epoch 6/10
95/95 [==============================] - 1s 7ms/step - loss: 0.6056 - accuracy: 0.7226 - val_loss: 0.6525 - val_accuracy: 0.8134
Epoch 7/10
95/95 [==============================] - 1s 7ms/step - loss: 0.5784 - accuracy: 0.7239 - val_loss: 0.6482 - val_accuracy: 0.9478
Epoch 8/10
95/95 [==============================] - 1s 7ms/step - loss: 0.5919 - accuracy: 0.7464 - val_loss: 0.6352 - val_accuracy: 0.9552
Epoch 9/10
95/95 [==============================] - 1s 7ms/step - loss: 0.5845 - accuracy: 0.7583 - val_loss: 0.6203 - val_accuracy: 0.9030
Epoch 10/10
95/95 [==============================] - 1s 7ms/step - loss: 0.5652 - accuracy: 0.7239 - val_loss: 0.5805 - val_accuracy: 0.8955
Out[17]:
<keras.src.callbacks.History at 0x7f4c89194a60>

The model is now trained. The next question is how to use it to predict labels for new data.

Predicting using keras-batchflow

Predicting with the same structures is really simple: once the new data is in the same format as the training data, you just need to define a batch generator for prediction using the same x_structure you used above.

I will continue my example to show how it works:

In [18]:
unlabelled_data = pd.read_csv('../data/titanic/test.csv')
unlabelled_data.shape
Out[18]:
(418, 11)

Check if I need to fill any NAs:

In [19]:
unlabelled_data.head()
Out[19]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
In [20]:
unlabelled_data[['Pclass', 'Sex', 'Age', 'Embarked']].isna().apply(sum)
Out[20]:
Pclass       0
Sex          0
Age         86
Embarked     0
dtype: int64
In [21]:
unlabelled_data['Age'] = unlabelled_data['Age'].fillna(0)

I can define a batch generator for prediction using the same structure I used for the training generators. The key is to set shuffle=False, drop the y_structure, and provide the new unlabelled data. Without a y_structure, each batch is a 1-tuple containing only the X part, as the output below shows.

In [22]:
bg_predict = BatchGenerator(unlabelled_data,
                            x_structure=x_structure,
                            shuffle=False,
                            batch_size=8)
In [23]:
bg_predict[0]
Out[23]:
((array([[2],
         [2],
         [1],
         [2],
         [2],
         [2],
         [2],
         [1]]),
  array([[1],
         [0],
         [1],
         [1],
         [0],
         [1],
         [0],
         [1]]),
  array([[2],
         [3],
         [2],
         [3],
         [3],
         [3],
         [2],
         [3]]),
  array([[34.5],
         [47. ],
         [62. ],
         [27. ],
         [22. ],
         [14. ],
         [30. ],
         [26. ]])),)
In [24]:
pred = model.predict(bg_predict, verbose=1)
53/53 [==============================] - 0s 5ms/step
In [25]:
pred.shape
Out[25]:
(418, 2)

The output contains softmax probabilities for the two classes, so I need to convert them to class indices:

In [26]:
pred = pred.argmax(axis=1)
pred
Out[26]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0])

These are predictions in encoded format. In most cases they need to be converted back to labels.

Decoding predictions back to labels

Most sklearn encoders have an inverse_transform method for this purpose. In this example, where the model returns only one prediction variable, I could use this method of surv_encoder directly, but imagine I had more than one prediction, each using its own encoder. Wouldn't it be convenient to have inverse_transform at the batch generator level?
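
The direct route mentioned above is a one-liner (a sketch using the surv_encoder fitted earlier):

surv_encoder.inverse_transform(pred)   # e.g. array(['No', 'No', 'No', ...], dtype=object)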

The batch generator's inverse_transform returns a dataframe with predictions converted back to labels and with the correct column names:

In [27]:
bg_train.inverse_transform(pred)
Out[27]:
Survived
0 No
1 No
2 No
3 No
4 No
... ...
413 No
414 No
415 No
416 No
417 No

418 rows × 1 columns

All I need to do now is concatenate it with the unlabelled data:

In [28]:
pd.concat([unlabelled_data, bg_train.inverse_transform(pred)], axis=1).head()
Out[28]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Survived
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q No
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S No
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q No
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S No
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S No

That's it! This data can now be passed on to applications consuming the predictions.
