Basic use of Keras-batchflow with Titanic data¶
The example below shows the most basic use of keras-batchflow for predicting the survival outcome in the Titanic disaster. The well-known Titanic dataset from Kaggle is used in this example.
This dataset contains a mixture of categorical and numeric variables, which highlights the features of keras-batchflow well.
Data pre-processing¶
import pandas as pd
import numpy as np
data = pd.read_csv('../data/titanic/train.csv')
data.shape
(891, 12)
There are only 891 data points in the training dataset
data.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Imagine that after exploratory analysis and model selection, only a few columns were chosen as features: Pclass, Sex, Age, and Embarked.
Let's see if there are any NAs to fill:
data[['Pclass', 'Sex', 'Age', 'Embarked', 'Survived']].isna().apply(sum)
Pclass        0
Sex           0
Age         177
Embarked      2
Survived      0
dtype: int64
Let's fill those NAs:
data['Age'] = data['Age'].fillna(0)
data['Embarked'] = data['Embarked'].fillna('')
The outcome column Survived is categorical too, but it is presented as 0 and 1 and does not require any conversion for the purpose of binary classification. To make the example more generic, I will convert this outcome to the text labels Yes and No:
data['Survived'] = data['Survived'].astype(str)
data.loc[data.Survived == '1', 'Survived'] = 'Yes'
data.loc[data.Survived == '0', 'Survived'] = 'No'
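A quick check confirms the column now contains only the two text labels:

data['Survived'].value_counts()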
Batch generator¶
I would like to build a simple neural network that uses embeddings for all categorical features and predicts whether a passenger would survive.
When building such a model, I will need to provide the number of levels of each categorical feature in the embedding layer declarations. Keras-batchflow provides some automation that helps determine this parameter for each feature, so I will build a generator first.
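For context, this is the parameter that would otherwise have to be worked out manually for each feature, for example like this (just an illustration, not part of the workflow below):

# counting the levels of each categorical feature by hand
{col: data[col].nunique() for col in ['Pclass', 'Sex', 'Embarked']}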
To build a batchflow generator, you first need to define your encoders, which map each categorical value to its integer representation. I will use sklearn's LabelEncoder for this purpose.
from sklearn.preprocessing import LabelEncoder
class_encoder = LabelEncoder().fit(data['Pclass'])
sex_encoder = LabelEncoder().fit(data['Sex'])
embarked_encoder = LabelEncoder().fit(data['Embarked'].astype(str))
surv_encoder = LabelEncoder().fit(data['Survived'])
Split the data into training and validation sets:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(data, train_size=.85, random_state=0)
Now I can define a batch generator. I will be using the basic BatchGenerator class:
from keras_batchflow.keras.batch_generators import BatchGenerator
x_structure = (
    ('Pclass', class_encoder),
    ('Sex', sex_encoder),
    ('Embarked', embarked_encoder),
    # None below means no encoding will be applied and values will be passed unchanged
    ('Age', None)
)
y_structure = ('Survived', surv_encoder)
bg_train = BatchGenerator(train_data,
                          x_structure=x_structure,
                          y_structure=y_structure,
                          shuffle=True,
                          batch_size=8)
bg_test = BatchGenerator(test_data,
                         x_structure=x_structure,
                         y_structure=y_structure,
                         shuffle=True,
                         batch_size=8)
I can now check the first batch it generates
bg_test[0]
((array([[1], [0], [2], [2], [2], [0], [1], [2]]), array([[0], [1], [1], [1], [0], [1], [1], [1]]), array([[3], [3], [2], [3], [2], [3], [3], [2]]), array([[26.], [52.], [32.], [20.], [21.], [ 0.], [54.], [ 0.]])), array([[0], [0], [0], [0], [0], [0], [0], [0]]))
It is exactly what Keras will expect:
- the batch is a tuple (X, y)
- X is a tuple of numpy arrays - this is how Keras expects multiple inputs to be passed
- y is a single numpy array.
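A quick way to confirm this structure is to unpack a batch and print the shapes (a small check using the same indexing shown above):

X, y = bg_train[0]
# X is a tuple with one array per input feature; y is a single array
print([x.shape for x in X], y.shape)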
Before I jump into building a Keras model, I'd like to show some helper attributes of keras-batchflow that support automated model creation.
bg_train.shapes
(((1,), (1,), (1,), (1,)), (1,))
bg_train.metadata
(({'name': 'Pclass', 'encoder': LabelEncoder(), 'shape': (1,), 'dtype': dtype('int64'), 'n_classes': 3}, {'name': 'Sex', 'encoder': LabelEncoder(), 'shape': (1,), 'dtype': dtype('int64'), 'n_classes': 2}, {'name': 'Embarked', 'encoder': LabelEncoder(), 'shape': (1,), 'dtype': dtype('int64'), 'n_classes': 4}, {'name': 'Age', 'encoder': None, 'shape': (1,), 'dtype': dtype('float64'), 'n_classes': None}), {'name': 'Survived', 'encoder': LabelEncoder(), 'shape': (1,), 'dtype': dtype('int64'), 'n_classes': 2})
Keras model¶
from keras.layers import Input, Embedding, Dense, Concatenate, Lambda, Dropout
from keras.models import Model
import keras.backend as K
metadata_x, metadata_y = bg_train.metadata
# define categorical and numeric inputs from X metadata
inputs = [Input(shape=m['shape'], dtype=m['dtype']) for m in metadata_x]
# Define embeddings for categorical features (where n_classes not None) and connect them to inputs
embs = [Embedding(m['n_classes'], 10)(inp) for m, inp in zip(metadata_x, inputs) if m['n_classes'] is not None]
# Collapse the unnecessary dimension left after the embedding layers: (None, 1, 10) -> (None, 10)
embs = [Lambda(lambda x: K.squeeze(x, axis=1))(emb) for emb in embs]
# separate numeric inputs
num_inps = [inp for m, inp in zip(metadata_x, inputs) if m['n_classes'] is None]
# convert data type to standard keras float datatype
num_x = [Lambda(lambda x: K.cast(x, 'float32'))(ni) for ni in num_inps]
# merge all inputs
x = Concatenate()(embs + num_x)
x = Dropout(.3)(x)
x = Dense(64, activation='relu')(x)
x = Dropout(.3)(x)
x = Dense(32, activation='relu')(x)
x = Dropout(.3)(x)
survived = Dense(2, activation='softmax')(x)
model = Model(inputs, survived)
I have added fairly heavy dropout to reduce overfitting, as the Titanic dataset is quite small for neural networks.
model.summary()
Model: "model" __________________________________________________________________________________________________ Layer (type) Output Shape Param # Connected to ================================================================================================== input_1 (InputLayer) [(None, 1)] 0 [] input_2 (InputLayer) [(None, 1)] 0 [] input_3 (InputLayer) [(None, 1)] 0 [] embedding (Embedding) (None, 1, 10) 30 ['input_1[0][0]'] embedding_1 (Embedding) (None, 1, 10) 20 ['input_2[0][0]'] embedding_2 (Embedding) (None, 1, 10) 40 ['input_3[0][0]'] input_4 (InputLayer) [(None, 1)] 0 [] lambda (Lambda) (None, 10) 0 ['embedding[0][0]'] lambda_1 (Lambda) (None, 10) 0 ['embedding_1[0][0]'] lambda_2 (Lambda) (None, 10) 0 ['embedding_2[0][0]'] lambda_3 (Lambda) (None, 1) 0 ['input_4[0][0]'] concatenate (Concatenate) (None, 31) 0 ['lambda[0][0]', 'lambda_1[0][0]', 'lambda_2[0][0]', 'lambda_3[0][0]'] dropout (Dropout) (None, 31) 0 ['concatenate[0][0]'] dense (Dense) (None, 64) 2048 ['dropout[0][0]'] dropout_1 (Dropout) (None, 64) 0 ['dense[0][0]'] dense_1 (Dense) (None, 32) 2080 ['dropout_1[0][0]'] dropout_2 (Dropout) (None, 32) 0 ['dense_1[0][0]'] dense_2 (Dense) (None, 2) 66 ['dropout_2[0][0]'] ================================================================================================== Total params: 4284 (16.73 KB) Trainable params: 4284 (16.73 KB) Non-trainable params: 0 (0.00 Byte) __________________________________________________________________________________________________
I can now compile and train the model
model.compile('adam', 'sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(bg_train, validation_data=bg_test, epochs=10)
Epoch 1/10
95/95 [==============================] - 2s 9ms/step - loss: 1.9813 - accuracy: 0.7213 - val_loss: 0.7945 - val_accuracy: 1.0000
Epoch 2/10
95/95 [==============================] - 1s 8ms/step - loss: 1.0991 - accuracy: 0.7635 - val_loss: 0.6878 - val_accuracy: 0.9776
Epoch 3/10
95/95 [==============================] - 1s 8ms/step - loss: 0.7966 - accuracy: 0.6948 - val_loss: 0.6452 - val_accuracy: 0.9701
Epoch 4/10
95/95 [==============================] - 1s 7ms/step - loss: 0.7250 - accuracy: 0.6856 - val_loss: 0.6561 - val_accuracy: 0.9701
Epoch 5/10
95/95 [==============================] - 1s 7ms/step - loss: 0.6290 - accuracy: 0.7107 - val_loss: 0.6592 - val_accuracy: 0.9701
Epoch 6/10
95/95 [==============================] - 1s 7ms/step - loss: 0.6056 - accuracy: 0.7226 - val_loss: 0.6525 - val_accuracy: 0.8134
Epoch 7/10
95/95 [==============================] - 1s 7ms/step - loss: 0.5784 - accuracy: 0.7239 - val_loss: 0.6482 - val_accuracy: 0.9478
Epoch 8/10
95/95 [==============================] - 1s 7ms/step - loss: 0.5919 - accuracy: 0.7464 - val_loss: 0.6352 - val_accuracy: 0.9552
Epoch 9/10
95/95 [==============================] - 1s 7ms/step - loss: 0.5845 - accuracy: 0.7583 - val_loss: 0.6203 - val_accuracy: 0.9030
Epoch 10/10
95/95 [==============================] - 1s 7ms/step - loss: 0.5652 - accuracy: 0.7239 - val_loss: 0.5805 - val_accuracy: 0.8955
<keras.src.callbacks.History at 0x7f4c89194a60>
The model is now trained. The next question is how to use it to predict labels for new data.
Predicting using the keras-batchflow¶
Predicting using the same structures is really simple: once the new data is in the same format as your training data, you just need to define a batch generator for predictions using the same x_structure that you used above.
I will continue my example to show how it works:
unlabelled_data = pd.read_csv('../data/titanic/test.csv')
unlabelled_data.shape
(418, 11)
Check if there are any NAs to fill:
unlabelled_data.head()
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
unlabelled_data[['Pclass', 'Sex', 'Age', 'Embarked']].isna().apply(sum)
Pclass       0
Sex          0
Age         86
Embarked     0
dtype: int64
unlabelled_data['Age'] = unlabelled_data['Age'].fillna(0)
I can define a batch generator for prediction using the same structure I used when defining the batch generators for training. The key is to set shuffle=False, drop y_structure, and provide the new unlabelled data:
bg_predict = BatchGenerator(unlabelled_data,
                            x_structure=x_structure,
                            shuffle=False,
                            batch_size=8)
bg_predict[0]
((array([[2], [2], [1], [2], [2], [2], [2], [1]]), array([[1], [0], [1], [1], [0], [1], [0], [1]]), array([[2], [3], [2], [3], [3], [3], [2], [3]]), array([[34.5], [47. ], [62. ], [27. ], [22. ], [14. ], [30. ], [26. ]])),)
pred = model.predict(bg_predict, verbose=1)
53/53 [==============================] - 0s 5ms/step
pred.shape
(418, 2)
The outputs are softmax probabilities for the two classes, so I need to convert them to class indices:
pred = pred.argmax(axis=1)
pred
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0])
These are predictions in encoded format. In most cases they need to be converted back to labels.
Decoding predictions back to labels¶
Most sklearn encoders have an inverse_transform method for this purpose. In this example, where the model returns only one prediction variable, I could use this method of the surv_encoder encoder directly, but imagine I had more than one prediction, each of which used its own encoder. Wouldn't it be convenient to have inverse_transform at the batch generator level?
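For comparison, the manual route with the single encoder fitted earlier would look like this (a short sketch using sklearn's standard API):

# decode the predicted class indices back to labels manually
surv_encoder.inverse_transform(pred)[:5]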
The batch generator's inverse_transform returns a dataframe with predictions converted to labels and with correct column names:
bg_train.inverse_transform(pred)
Survived | |
---|---|
0 | No |
1 | No |
2 | No |
3 | No |
4 | No |
... | ... |
413 | No |
414 | No |
415 | No |
416 | No |
417 | No |
418 rows × 1 columns
All I need to do is concatenate it with the unlabelled data
pd.concat([unlabelled_data, bg_train.inverse_transform(pred)], axis=1).head()
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | No |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S | No |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q | No |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S | No |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S | No |
That's it! This data can now be passed to applications consuming these predictions.
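For example, a minimal sketch of such a hand-over (the file name below is just an illustration) could keep only the passenger id and the decoded prediction:

predictions = pd.concat([unlabelled_data, bg_train.inverse_transform(pred)], axis=1)
# keep only the columns a downstream consumer needs, e.g. for a Kaggle-style submission file
predictions[['PassengerId', 'Survived']].to_csv('survival_predictions.csv', index=False)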