Basic use of Keras-batchflow with Titanic data¶
The example below shows the most basic use of keras-batchflow for predicting the survival outcome in the Titanic disaster. The well-known Titanic dataset from Kaggle is used in this example.
This dataset contains a mixture of categorical and numeric variables, which highlights the features of keras-batchflow well.
Data pre-processing¶
import pandas as pd
import numpy as np
data = pd.read_csv('../data/titanic/train.csv')
data.shape
(891, 12)
There are only 891 data points in the training dataset
data.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Imagine that after exploratory analysis and model selection, only a few columns were chosen as features: Pclass, Sex, Age, and Embarked.
Let's see if there are any NAs to fill:
data[['Pclass', 'Sex', 'Age', 'Embarked', 'Survived']].isna().apply(sum)
Pclass        0
Sex           0
Age         177
Embarked      2
Survived      0
dtype: int64
Let's fill those NAs:
data['Age'] = data['Age'].fillna(0)
data['Embarked'] = data['Embarked'].fillna('')
The outcome column Survived is categorical too, but it is presented as 0 and 1 and does not require any conversion for the purpose of binary classification. To make the example more generic, I will convert this outcome to the text labels Yes and No:
data['Survived'] = data['Survived'].astype(str)
data.loc[data.Survived == '1', 'Survived'] = 'Yes'
data.loc[data.Survived == '0', 'Survived'] = 'No'
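A quick check confirms the column now contains only the two text labels:

data['Survived'].value_counts()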
Batch generator¶
I would like to build a simple neural network that uses embeddings for all categorical features and predicts whether a passenger would survive.
When building such a model, I will need to provide the number of levels of each categorical feature in the embedding layer declarations. Keras-batchflow provides some automation that helps determine this parameter for each feature, so I will build a generator first.
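For context, this is the parameter that would otherwise have to be worked out manually for each feature, for example like this (just an illustration, not part of the workflow below):

# counting the levels of each categorical feature by hand
{col: data[col].nunique() for col in ['Pclass', 'Sex', 'Embarked']}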
To build a batchflow generator, you first need to define your encoders, which map each categorical value to its integer representation. I will use sklearn's LabelEncoder for this purpose.
from sklearn.preprocessing import LabelEncoder
class_encoder = LabelEncoder().fit(data['Pclass'])
sex_encoder = LabelEncoder().fit(data['Sex'])
embarked_encoder = LabelEncoder().fit(data['Embarked'].astype(str))
surv_encoder = LabelEncoder().fit(data['Survived'])
Split the data into training and validation sets:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(data, train_size=.85, random_state=0)
Now I can define a batch generator. I will be using the basic BatchGenerator class:
from keras_batchflow.keras.batch_generators import BatchGenerator
x_structure = (
    ('Pclass', class_encoder),
    ('Sex', sex_encoder),
    ('Embarked', embarked_encoder),
    # None below means no encoding will be applied and values will be passed unchanged
    ('Age', None)
)
y_structure = ('Survived', surv_encoder)
bg_train = BatchGenerator(train_data,
                          x_structure=x_structure,
                          y_structure=y_structure,
                          shuffle=True,
                          batch_size=8)
bg_test = BatchGenerator(test_data,
                         x_structure=x_structure,
                         y_structure=y_structure,
                         shuffle=True,
                         batch_size=8)
I can now check the first batch it generates
bg_test[0]
((array([[1], [0], [2], [2], [2], [0], [1], [2]]), array([[0], [1], [1], [1], [0], [1], [1], [1]]), array([[3], [3], [2], [3], [2], [3], [3], [2]]), array([[26.], [52.], [32.], [20.], [21.], [ 0.], [54.], [ 0.]])), array([[0], [0], [0], [0], [0], [0], [0], [0]]))
It is exactly what Keras will expect:
- the batch is a tuple (X, y)
- X is a tuple of numpy arrays - this is how Keras expects multiple inputs to be passed
- y is a single numpy array.
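A quick way to confirm this structure is to unpack a batch and print the shapes (a small check using the same indexing shown above):

X, y = bg_train[0]
# X is a tuple with one array per input feature; y is a single array
print([x.shape for x in X], y.shape)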
Before I jump into building a Keras model, I'd like to show some helper attributes of keras-batchflow that support automated model creation.
bg_train.shapes
(((1,), (1,), (1,), (1,)), (1,))
bg_train.metadata
(({'name': 'Pclass', 'encoder': LabelEncoder(), 'shape': (1,), 'dtype': dtype('int64'), 'n_classes': 3}, {'name': 'Sex', 'encoder': LabelEncoder(), 'shape': (1,), 'dtype': dtype('int64'), 'n_classes': 2}, {'name': 'Embarked', 'encoder': LabelEncoder(), 'shape': (1,), 'dtype': dtype('int64'), 'n_classes': 4}, {'name': 'Age', 'encoder': None, 'shape': (1,), 'dtype': dtype('float64'), 'n_classes': None}), {'name': 'Survived', 'encoder': LabelEncoder(), 'shape': (1,), 'dtype': dtype('int64'), 'n_classes': 2})
Keras model¶
from keras.layers import Input, Embedding, Dense, Concatenate, Lambda, Dropout
from keras.models import Model
import keras.backend as K
metadata_x, metadata_y = bg_train.metadata
# define categorical and numeric inputs from X metadata
inputs = [Input(shape=m['shape'], dtype=m['dtype']) for m in metadata_x]
# Define embeddings for categorical features (where n_classes not None) and connect them to inputs
embs = [Embedding(m['n_classes'], 10)(inp) for m, inp in zip(metadata_x, inputs) if m['n_classes'] is not None]
# Collapse the unnecessary dimension left after the embedding layers: (None, 1, 10) -> (None, 10)
embs = [Lambda(lambda x: K.squeeze(x, axis=1))(emb) for emb in embs]
# separate numeric inputs
num_inps = [inp for m, inp in zip(metadata_x, inputs) if m['n_classes'] is None]
# convert data type to standard keras float datatype
num_x = [Lambda(lambda x: K.cast(x, 'float32'))(ni) for ni in num_inps]
# merge all inputs
x = Concatenate()(embs + num_x)
x = Dropout(.3)(x)
x = Dense(64, activation='relu')(x)
x = Dropout(.3)(x)
x = Dense(32, activation='relu')(x)
x = Dropout(.3)(x)
survived = Dense(2, activation='softmax')(x)
model = Model(inputs, survived)
I have added fairly heavy dropout to reduce overfitting, as the Titanic dataset is quite small for neural networks.
model.summary()
Model: "model" __________________________________________________________________________________________________ Layer (type) Output Shape Param # Connected to ================================================================================================== input_1 (InputLayer) [(None, 1)] 0 [] input_2 (InputLayer) [(None, 1)] 0 [] input_3 (InputLayer) [(None, 1)] 0 [] embedding (Embedding) (None, 1, 10) 30 ['input_1[0][0]'] embedding_1 (Embedding) (None, 1, 10) 20 ['input_2[0][0]'] embedding_2 (Embedding) (None, 1, 10) 40 ['input_3[0][0]'] input_4 (InputLayer) [(None, 1)] 0 [] lambda (Lambda) (None, 10) 0 ['embedding[0][0]'] lambda_1 (Lambda) (None, 10) 0 ['embedding_1[0][0]'] lambda_2 (Lambda) (None, 10) 0 ['embedding_2[0][0]'] lambda_3 (Lambda) (None, 1) 0 ['input_4[0][0]'] concatenate (Concatenate) (None, 31) 0 ['lambda[0][0]', 'lambda_1[0][0]', 'lambda_2[0][0]', 'lambda_3[0][0]'] dropout (Dropout) (None, 31) 0 ['concatenate[0][0]'] dense (Dense) (None, 64) 2048 ['dropout[0][0]'] dropout_1 (Dropout) (None, 64) 0 ['dense[0][0]'] dense_1 (Dense) (None, 32) 2080 ['dropout_1[0][0]'] dropout_2 (Dropout) (None, 32) 0 ['dense_1[0][0]'] dense_2 (Dense) (None, 2) 66 ['dropout_2[0][0]'] ================================================================================================== Total params: 4284 (16.73 KB) Trainable params: 4284 (16.73 KB) Non-trainable params: 0 (0.00 Byte) __________________________________________________________________________________________________
I can now compile and train the model
model.compile('adam', 'sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(bg_train, validation_data=bg_test, epochs=10)
Epoch 1/10
95/95 [==============================] - 2s 9ms/step - loss: 1.9813 - accuracy: 0.7213 - val_loss: 0.7945 - val_accuracy: 1.0000
Epoch 2/10
95/95 [==============================] - 1s 8ms/step - loss: 1.0991 - accuracy: 0.7635 - val_loss: 0.6878 - val_accuracy: 0.9776
Epoch 3/10
95/95 [==============================] - 1s 8ms/step - loss: 0.7966 - accuracy: 0.6948 - val_loss: 0.6452 - val_accuracy: 0.9701
Epoch 4/10
95/95 [==============================] - 1s 7ms/step - loss: 0.7250 - accuracy: 0.6856 - val_loss: 0.6561 - val_accuracy: 0.9701
Epoch 5/10
95/95 [==============================] - 1s 7ms/step - loss: 0.6290 - accuracy: 0.7107 - val_loss: 0.6592 - val_accuracy: 0.9701
Epoch 6/10
95/95 [==============================] - 1s 7ms/step - loss: 0.6056 - accuracy: 0.7226 - val_loss: 0.6525 - val_accuracy: 0.8134
Epoch 7/10
95/95 [==============================] - 1s 7ms/step - loss: 0.5784 - accuracy: 0.7239 - val_loss: 0.6482 - val_accuracy: 0.9478
Epoch 8/10
95/95 [==============================] - 1s 7ms/step - loss: 0.5919 - accuracy: 0.7464 - val_loss: 0.6352 - val_accuracy: 0.9552
Epoch 9/10
95/95 [==============================] - 1s 7ms/step - loss: 0.5845 - accuracy: 0.7583 - val_loss: 0.6203 - val_accuracy: 0.9030
Epoch 10/10
95/95 [==============================] - 1s 7ms/step - loss: 0.5652 - accuracy: 0.7239 - val_loss: 0.5805 - val_accuracy: 0.8955
<keras.src.callbacks.History at 0x7f4c89194a60>
The model is now trained. The next question is how to use it to predict labels for new data.
Predicting using the keras-batchflow¶
Predicting using the same structures is really simple: once the new data is in the same format as your training data, you just need to define a batch generator for predictions using the same x_structure that you used above.
I will continue my example to show how it works:
unlabelled_data = pd.read_csv('../data/titanic/test.csv')
unlabelled_data.shape
(418, 11)
Check if there are any NAs to fill:
unlabelled_data.head()
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
unlabelled_data[['Pclass', 'Sex', 'Age', 'Embarked']].isna().apply(sum)
Pclass       0
Sex          0
Age         86
Embarked     0
dtype: int64
unlabelled_data['Age'] = unlabelled_data['Age'].fillna(0)
I can define a batch generator for prediction using the same structure I used when defining the batch generators for training. The key is to set shuffle=False, drop y_structure, and provide the new unlabelled data:
bg_predict = BatchGenerator(unlabelled_data,
                            x_structure=x_structure,
                            shuffle=False,
                            batch_size=8)
bg_predict[0]
((array([[2], [2], [1], [2], [2], [2], [2], [1]]), array([[1], [0], [1], [1], [0], [1], [0], [1]]), array([[2], [3], [2], [3], [3], [3], [2], [3]]), array([[34.5], [47. ], [62. ], [27. ], [22. ], [14. ], [30. ], [26. ]])),)
pred = model.predict(bg_predict, verbose=1)
53/53 [==============================] - 0s 5ms/step
pred.shape
(418, 2)
The outputs are softmax probabilities for the two classes, so I need to convert them to class indices:
pred = pred.argmax(axis=1)
pred
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0])
These are predictions in encoded format. In most cases they need to be converted back to labels.
Decoding predictions back to labels¶
Most sklearn encoders have an inverse_transform method for this purpose. In this example, where the model returns only one prediction variable, I could use this method of the surv_encoder encoder directly, but imagine I had more than one prediction, each of which used its own encoder. Wouldn't it be convenient to have inverse_transform at the batch generator level?
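For comparison, the manual route with the single encoder fitted earlier would look like this (a short sketch using sklearn's standard API):

# decode the predicted class indices back to labels manually
surv_encoder.inverse_transform(pred)[:5]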
The batch generator's inverse_transform returns a dataframe with predictions converted to labels and with correct column names:
bg_train.inverse_transform(pred)
Survived | |
---|---|
0 | No |
1 | No |
2 | No |
3 | No |
4 | No |
... | ... |
413 | No |
414 | No |
415 | No |
416 | No |
417 | No |
418 rows × 1 columns
All I need to do is concatenate it with the unlabelled data
pd.concat([unlabelled_data, bg_train.inverse_transform(pred)], axis=1).head()
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | No |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S | No |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q | No |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S | No |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S | No |
That's it! This data can now be passed to applications consuming these predictions.
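For example, a minimal sketch of such a hand-over (the file name below is just an illustration) could keep only the passenger id and the decoded prediction:

predictions = pd.concat([unlabelled_data, bg_train.inverse_transform(pred)], axis=1)
# keep only the columns a downstream consumer needs, e.g. for a Kaggle-style submission file
predictions[['PassengerId', 'Survived']].to_csv('survival_predictions.csv', index=False)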