Batch transformers
Batch transformers process an incoming batch of data as a whole. They have access to all variables in the data, which makes them suitable for implementing all sorts of transforms that involve interactions between variables
FeatureDropout
keras_batchflow.base.batch_transformers.FeatureDropout
(n_probs, cols, col_probs=None, drop_values=None) A batch transformation that randomly drops values in the data, replacing them with pre-defined drop values
Parameters
- n_probs - a list, tuple or one-dimensional numpy array of probabilities p_0, p_1, p_2, ... p_n. p_0 is the probability of a row having 0 augmented elements (no augmentation), p_1 of one random cell, p_2 of two random cells, etc. The parameter must have at least 2 values; scalars are not accepted
- cols - a list, tuple or one-dimensional numpy array of strings with the names of the columns to be transformed. The number of columns must be greater than or equal to the length of the n_probs parameter, simply because there must be enough columns to choose from when augmenting n elements in a row
- col_probs - (optional) a list, tuple or one-dimensional numpy array of floats p_{c_0}, p_{c_1}, ... p_{c_k}, where k is the number of columns specified in the cols parameter. p_{c_0} is the probability of column 0 from cols being selected when only one column is picked for augmentation, p_{c_1} is the same for column 1, etc. It is important to understand that when two or more columns are picked for a row, the actual column frequencies drift towards an equal distribution with every extra item picked. When the number of columns picked for augmentation reaches its maximum allowed value (the number of columns available in cols), there is no choice left and the actual counts of the columns are equal, i.e. the actual distribution becomes a discrete uniform distribution. Default: None
- drop_values - (optional) a list, tuple or one-dimensional numpy array of values. If not set, None will be used for all columns. If a single value is set, it will be used for all columns. If a list of values is set, it must have the same length as the cols parameter; in this case, the values specified in drop_values will be used to fill dropped cells in the corresponding columns
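The effect of this masked substitution can be sketched with plain pandas. This is an illustration of the behaviour described above, not the library's internal code; it emulates the case n_probs=[0.0, 1.0] (exactly one cell dropped per row) with a hypothetical drop value '<UNK>':

```python
import numpy as np
import pandas as pd

# Illustration only: emulate the effect of a FeatureDropout configured so
# that every row gets exactly one cell replaced with the drop value.
rng = np.random.default_rng(0)
batch = pd.DataFrame({"a": ["x1", "x2", "x3"], "b": ["y1", "y2", "y3"]})

# pick one random column per row to drop
picked = rng.integers(0, batch.shape[1], size=len(batch))
mask = np.zeros(batch.shape, dtype=bool)
mask[np.arange(len(batch)), picked] = True

# masked substitution: keep the original where mask is False,
# put the drop value where it is True
dropped = batch.mask(mask, "<UNK>")
```

Every row of `dropped` keeps its original values except for one cell, which now holds the drop value.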
ShuffleNoise
keras_batchflow.base.batch_transformers.ShuffleNoise
(n_probs, cols, col_probs=None, data_fork=None) A batch transformation that adds noise to data by randomly shuffling columns. The noise is added by mixing the incoming batch with its shuffled version using a mask:
batch = batch.mask(mask, shuffled_batch)
Parameters
- n_probs - a list, tuple or one-dimensional numpy array of probabilities p_0, p_1, p_2, ... p_n. p_0 is the probability of a row having 0 augmented elements (no augmentation), p_1 of one random cell, p_2 of two random cells, etc. The parameter must have at least 2 values; scalars are not accepted
- cols - a list, tuple or one-dimensional numpy array of strings with the names of the columns to be transformed. The number of columns must be greater than or equal to the length of the n_probs parameter, simply because there must be enough columns to choose from when augmenting n elements in a row
- col_probs - (optional) a list, tuple or one-dimensional numpy array of floats p_{c_0}, p_{c_1}, ... p_{c_k}, where k is the number of columns specified in the cols parameter. p_{c_0} is the probability of column 0 from cols being selected when only one column is picked for augmentation, p_{c_1} is the same for column 1, etc. It is important to understand that when two or more columns are picked for a row, the actual column frequencies drift towards an equal distribution with every extra item picked. When the number of columns picked for augmentation reaches its maximum allowed value (the number of columns available in cols), there is no choice left and the actual counts of the columns are equal, i.e. the actual distribution becomes a discrete uniform distribution. Default: None
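The shuffling mechanics can be sketched with pandas directly. This is a simplified illustration of the batch = batch.mask(mask, shuffled_batch) formula above: the real class derives the mask from n_probs, cols and col_probs, while here a plain per-cell Bernoulli mask is used:

```python
import numpy as np
import pandas as pd

# Simplified illustration of batch.mask(mask, shuffled_batch):
# each column of the shuffled version is an independent permutation
# of the original column.
rng = np.random.default_rng(1)
batch = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

shuffled = batch.apply(lambda col: rng.permutation(col.to_numpy()))

# True marks cells replaced with their shuffled counterpart;
# the real transformer builds this mask from n_probs/cols/col_probs
mask = rng.random(batch.shape) < 0.5
noisy = batch.mask(mask, shuffled)
```

Every value in `noisy` still comes from the same column of the original batch, which is what makes this a realistic form of noise for tabular data.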
Base abstract classes
BatchTransformer class
keras_batchflow.base.batch_transformers.BatchTransformer
() An abstract class that defines the basic functionality and interfaces of all batch transformers
Random cell base class
This class is the parent class of all transformers that transform random cells of the incoming data using masked substitution with an augmented version of the same batch. They all use the following formula at their core:
batch = batch.mask(mask, augmented_version)
The child classes generally differ only in the way the augmented version is defined. For example:
- FeatureDropout builds the augmented version by creating a dataframe of the same structure as the batch, filled with the drop values specified at initialization
- ShuffleNoise builds the augmented version by shuffling the columns of the batch
keras_batchflow.base.batch_transformers.BaseRandomCellTransform
(n_probs, cols, col_probs=None, data_fork=None) A base class used for all sorts of random-cell transforms: feature dropout, noise, etc.
The transform works by masked replacement of cells in a batch with cells from an augmented version of the same batch:
batch.loc[mask] = augmented_batch[mask]
This class provides the infrastructure of the transformation, while derived classes define their own versions of the augmented batch
Parameters:
- n_probs - a list, tuple or one-dimensional numpy array of probabilities p_0, p_1, p_2, ... p_n. p_0 is the probability of a row having 0 augmented elements (no augmentation), p_1 of one random cell, p_2 of two random cells, etc. The parameter must have at least 2 values; scalars are not accepted
- cols - a list, tuple or one-dimensional numpy array of strings with the names of the columns to be transformed. The number of columns must be greater than or equal to the length of the n_probs parameter, simply because there must be enough columns to choose from when augmenting n elements in a row
- col_probs - (optional) a list, tuple or one-dimensional numpy array of floats p_{c_0}, p_{c_1}, ... p_{c_k}, where k is the number of columns specified in the cols parameter. p_{c_0} is the probability of column 0 from cols being selected when only one column is picked for augmentation, p_{c_1} is the same for column 1, etc. It is important to understand that when two or more columns are picked for a row, the actual column frequencies drift towards an equal distribution with every extra item picked. When the number of columns picked for augmentation reaches its maximum allowed value (the number of columns available in cols), there is no choice left and the actual counts of the columns are equal, i.e. the actual distribution becomes a discrete uniform distribution. Default: None
- data_fork - (optional) a single string that sets the transformer to process only this level-0 index when the data has multiindex columns. The typical use scenario for this parameter is de-noising autoencoders, where the same data is fed to both the inputs (x) and the outputs (y) of the model, but only the data pushed to the inputs (x) is augmented. In this case, the data is forked by copying it to different multiindex values (x and y). Using this parameter, you can set the transformer to process only the x 'fork' of the data
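The forking idea can be illustrated with a pandas MultiIndex. The column names and the -1 fill below are made up for the example; the point is only that a transformer constructed with data_fork='x' would touch the 'x' half and leave the 'y' targets clean:

```python
import pandas as pd

# The same data copied under two level-0 keys of a MultiIndex,
# as in a de-noising autoencoder setup.
base = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
forked = pd.concat({"x": base, "y": base}, axis=1)

# crude stand-in for augmentation applied to the 'x' fork only
forked.loc[:, ("x", "a")] = -1
```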
Methods of random cell base class
keras_batchflow.base.batch_transformers.BaseRandomCellTransform._make_mask
(self, batch) Creates a binary mask that marks the items of an incoming batch that have to be augmented
The elements are selected taking the following parameters into account:
- n_probs - a list of probabilities of picking 0, 1, ... n items in one row, respectively
- cols - a list of columns subjected to augmentation, colname_0, colname_1, ... colname_k, with n \le k so that there is enough choice when picking columns
- col_probs - the expected frequencies p_0, p_1, ... p_k of the columns being augmented on a one-per-row basis
Parameters:
- batch - an incoming batch to build the mask for
Returns: a pandas dataframe of booleans with the same dimensions and indices as the batch. The returned dataframe has True for the elements that have to be augmented
A naive way to pick columns for augmentation would be to call a random choice function for each row separately; col_probs could then be used directly in the choice function. This method, however, is quite slow, as it requires one call of the random choice function per row.
This class uses a vectorized approach instead:
- generate a matrix of random standard-uniform floats of shape (batch_size, K), where K is the number of columns in cols
- argsort each row and take the leftmost n entries; they contain the indices of the picked columns
- because in general not all rows will have n items picked, these indices are multiplied by an n-picking mask, which nullifies some of the indices, effectively de-selecting them
- one-hot encode the remaining indices to make the mask
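The steps above can be sketched in numpy. This is an illustration of the approach, not the class's actual code; the shapes, probabilities and variable names are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_cols = 6, 4        # 6 rows, 4 candidate columns
n_probs = [0.2, 0.5, 0.3]        # P(0), P(1) and P(2) augmented cells per row
n_max = len(n_probs) - 1

# 1. matrix of standard-uniform floats, one row per batch row
u = rng.random((batch_size, n_cols))

# 2. argsort each row and take the leftmost n_max entries:
#    indices of the candidate columns
candidates = np.argsort(u, axis=1)[:, :n_max]

# n-picking mask: rows of a lower-triangular matrix sampled with n_probs
tri = np.tril(np.ones((n_max + 1, n_max), dtype=int), k=-1)
n_pick = tri[rng.choice(n_max + 1, size=batch_size, p=n_probs)]

# 3-4. one-hot encode the candidates, then de-select the surplus ones
#      by multiplying with the n-picking mask
onehot = np.eye(n_cols, dtype=int)[candidates]        # (batch_size, n_max, n_cols)
mask = (onehot * n_pick[:, :, None]).sum(axis=1).astype(bool)
```

Each row of `mask` ends up with exactly as many True cells as its sampled row of the triangular matrix has ones.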
Making the n-picking mask
This mask implements the picking of a random number of cells per row, which is set by the n_probs parameter.
It is created by sampling rows, with replacement, from a lower-triangular matrix:
0, 0, 0
1, 0, 0
1, 1, 0
1, 1, 1
using n_probs as weights. In this example, the matrix can be used for generating a mask where 0 to 3 cells are augmented in each row.
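A quick numerical check that sampling rows of this matrix reproduces n_probs (the probabilities below are made-up example values):

```python
import numpy as np

rng = np.random.default_rng(0)
n_probs = [0.1, 0.4, 0.3, 0.2]   # example probabilities of 0, 1, 2 or 3 picks

# the lower-triangular matrix from above: row i contains exactly i ones
tri = np.tril(np.ones((4, 3), dtype=int), k=-1)

# sample 100,000 rows with replacement, weighted by n_probs,
# and count how many cells each sampled row would augment
n_per_row = tri[rng.choice(4, size=100_000, p=n_probs)].sum(axis=1)
freqs = np.bincount(n_per_row, minlength=4) / len(n_per_row)
# freqs closely follows n_probs
```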
Performance
Tests show a 3x performance increase of the vectorised version compared with the naive implementation
_calculate_col_weights
(self, col_probs) Calculates the power factors of the transformation that produce the desired frequencies
The weighted column sampler uses a vectorized argsort for selecting unique indices in each row. The downside of this approach is that it does not support weighting, i.e. one column cannot be made more preferable when there is a choice of columns in a row. With the standard uniform distribution used as-is, all columns become equally likely, which means each column is selected with equal probability when only one column is chosen
To illustrate why this happens, consider the CDF of a uniform random variable X. For a standard uniform distribution on the unit interval [0, 1], the CDF is
F(x) = P(X \le x) = x
The CDF gives the probability of the random variable evaluating to less than x.
The probability of one variable being less than another, p(X_1 \le X_2), can be calculated by integrating the CDF:
p(X_1 \le X_2) = \int_0^1 F(x) dx = \int_0^1 x dx = 1/2
When 3 variables are used, the joint probability that, say, X_3 is the largest is
p(X_1 \le X_3, X_2 \le X_3) = \int_0^1 F(x)^2 dx = \int_0^1 x^2 dx = 1/3
Adding weighting
Now, how can the outcomes be skewed so that the expected frequencies of the variables being chosen are not equal, but follow some other ratios? The distribution of each X has to be modified so that the integrals above change in the desired way, while the variables stay within the unit interval [0, 1]. For this reason simple scaling will not do; instead, a power transformation of the standard uniform distribution is used: X_i is replaced with X_i^{1/z_i}, whose CDF is F_i(x) = x^{z_i}.
The power factors z_1, z_2, z_3 are not yet known. Let's see if they can be found from the desired weights [p_1, p_2, p_3] of these variables. The probability of the first transformed variable being the largest becomes
p_1 = \int_0^1 z_1 x^{z_1 - 1} x^{z_2} x^{z_3} dx = z_1 / (z_1 + z_2 + z_3)
Using the same logic, formulas for p_2 and p_3 can be written. All together, they make a system of equations:
(p_1 - 1) z_1 + p_1 z_2 + p_1 z_3 = 0
p_2 z_1 + (p_2 - 1) z_2 + p_2 z_3 = 0
p_3 z_1 + p_3 z_2 + (p_3 - 1) z_3 = 0
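Under the power transformation, the probability of variable i coming out largest is z_i / (z_1 + z_2 + z_3). This is easy to verify with a short Monte Carlo sketch using arbitrary example factors:

```python
import numpy as np

# Verify p_i = z_i / (z_1 + z_2 + z_3): raise standard uniforms to the
# powers 1/z_i (giving CDF x**z_i) and count how often each transformed
# variable comes out largest.
rng = np.random.default_rng(0)
z = np.array([1.0, 2.0, 3.0])                 # example power factors
y = rng.random((200_000, 3)) ** (1.0 / z)     # column i has CDF x**z[i]
freqs = np.bincount(np.argmax(y, axis=1), minlength=3) / len(y)
# freqs approximates z / z.sum() = [1/6, 1/3, 1/2]
```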
This is a curious system which may have an infinite number of solutions: they can be obtained by simply scaling any one solution vector (z_1, z_2, z_3), if it exists. Indeed, if a vector satisfying any of the equations is scaled, both the numerator and the denominator of z_i / (z_1 + z_2 + z_3) get scaled by the same number. This basically means that all possible solutions lie on a line.
This is also a homogeneous system of equations, which always has the trivial solution Z = 0; the line where all possible solutions lie therefore passes through zero. Because the solution is not unique, the matrix of the system is singular.
SVD is used for finding one of the solutions.
The matrix A of the system can be decomposed as
A = U D V^T
The solution corresponds to the zero diagonal element of matrix D. Since singular values are conventionally sorted in descending order, this element is in the last position, and the solution is located in the last row of matrix V^T.
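A sketch of this procedure in numpy (illustrative, not necessarily the class's exact code; the target frequencies are example values, and numpy's `svd` returns V^T as its third output, so the solution is its last row):

```python
import numpy as np

# Solve the homogeneous system A z = 0 for the power factors via SVD.
p = np.array([1 / 6, 1 / 3, 1 / 2])      # example desired column frequencies
A = np.outer(p, np.ones(3)) - np.eye(3)  # row i: p_i everywhere, minus 1 on the diagonal

u, d, vT = np.linalg.svd(A)
z = vT[-1]                # right singular vector of the zero singular value
z = z / z.sum()           # fix the scale (and sign) of the solution

# z now satisfies p_i = z_i / sum(z)
```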