ML-ENSEMBLE

author:Sebastian Flennerhag
copyright:2017
license:MIT

Classes for implementing various cross-validation strategies. By default, ML-Ensemble indexers generate lists of (start, stop) tuples, as opposed to index arrays, to avoid serialization during multiprocessing.

mlens.index

Indexers

BaseIndex

class mlens.index.BaseIndex[source]

Bases: mlens.externals.sklearn.base.BaseEstimator

Base Index class.

Specification of indexer-wide methods and attributes that we can always expect to find in any indexer. Helps to provide a uniform interface during parallel estimation.

fit(X, y=None, job=None)[source]

Method for storing array data.

Parameters:
  • X (array-like of shape [n_samples, optional]) – array to collect dimension data from.
  • y (array-like, optional) – label data
  • job (str, optional) – optional job type data
Returns:

indexer with stored sample size data.

Return type:

instance

Notes

Fitting an indexer stores nothing that points to the array or memmap X. Only the shape attribute of X is accessed.
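The note above can be illustrated with a minimal sketch. ShapeOnlyIndexer is a hypothetical class written for this illustration, not part of ML-Ensemble; it assumes only that X exposes a shape attribute.

```python
import numpy as np


class ShapeOnlyIndexer:
    """Hypothetical indexer that records only the number of samples."""

    def __init__(self):
        self.n_samples = None

    def fit(self, X, y=None, job=None):
        # Read the shape and keep a plain integer; X itself is not stored.
        self.n_samples = int(X.shape[0])
        return self


X = np.zeros((100, 3))
indexer = ShapeOnlyIndexer().fit(X)
print(indexer.n_samples)  # 100
```

Because the fitted instance holds only an integer, it stays cheap to pass between processes regardless of how large X is.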

generate(X=None, as_array=False)[source]

Front-end generator method.

Generator for training and test set indices based on the generator specification in _gen_indicies.

Parameters:
  • X (array-like, optional) – If instance has not been fitted, the training set X must be passed to the generate method, which will call fit before proceeding. If already fitted, X can be omitted.
  • as_array (bool (default = False)) –

    whether to return train and test indices as a pair of tuple(s) or numpy arrays. If the returned tuples are singular they can be used on an array X with standard slicing syntax (X[start:stop]), but if a list of tuples is returned slicing X properly requires first building a list or array of index numbers from the list of tuples. This can be achieved either by setting as_array to True, or running

    for train_tup, test_tup in indexer.generate():
        train_idx = \
            np.hstack([np.arange(t0, t1) for t0, t1 in train_tup])
    

    when slicing is required.
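A self-contained sketch of the conversion described above, using only numpy. The helper name to_index_array and the sample tuples are assumptions made for illustration; they are not part of the ML-Ensemble API.

```python
import numpy as np


def to_index_array(tup):
    """Expand (start, stop) tuples into a flat array of index numbers.

    Accepts either a single (start, stop) tuple or a list of such tuples.
    """
    # A singular tuple holds two integers; wrap it so both cases share a path.
    if len(tup) == 2 and all(isinstance(t, (int, np.integer)) for t in tup):
        tup = [tup]
    return np.hstack([np.arange(t0, t1) for t0, t1 in tup])


X = np.arange(10) * 10
train_tup = [(0, 3), (6, 10)]  # training rows on both sides of the test fold
test_tup = (3, 6)              # a singular tuple can be sliced directly

train_idx = to_index_array(train_tup)
print(X[train_idx])                # [ 0 10 20 60 70 80 90]
print(X[test_tup[0]:test_tup[1]])  # [30 40 50]
```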

partition(X=None, as_array=False)[source]

Partition generator method.

Default behavior is to yield None for fitting on full data. Overridden in SubsetIndex and ClusteredSubsetIndex to produce partition indexes.

FoldIndex

class mlens.index.FoldIndex(folds=2, X=None, raise_on_exception=True)[source]

Bases: mlens.index.base.BaseIndex

Indexer that generates the full size of X.

K-Fold iterator that generates fold index tuples.

FoldIndex creates a generator that returns tuples of start and stop positions to be used for numpy array slicing ([start:stop]). Note that slicing works well for the test set, but for the training set it is recommended to concatenate the index for training data that comes before the current test set with the index for the training data that comes after. This can easily be achieved with:

for train_tup, test_tup in self.generate():
    train_slice = numpy.hstack([numpy.arange(t0, t1) for t0, t1 in
                              train_tup])

    xtrain, xtest = X[train_slice], X[test_tup[0]:test_tup[1]]

Warning

Simple slicing (i.e. X[start:stop]) generally does not work for the train set, which often requires concatenating the train index range below the current test set with the train index range above it. To build a training index, use

np.hstack([np.arange(t0, t1) for t0, t1 in train_index_tuples])
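To make the shape of the generated tuples concrete, here is a rough pure-Python sketch of how K-fold (start, stop) tuples could be produced. The function fold_tuples is hypothetical, and spreading the remainder over the first folds is an assumption, not a statement about the FoldIndex source.

```python
def fold_tuples(n_samples, folds):
    """Yield (train_tuples, test_tuple) pairs of (start, stop) positions."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, folds)
    sizes = [base + 1 if i < extra else base for i in range(folds)]

    start = 0
    for size in sizes:
        stop = start + size
        # Keep only the non-empty ranges below and above the test fold.
        train = [(a, b) for a, b in ((0, start), (stop, n_samples)) if a < b]
        yield train, (start, stop)
        start = stop


for train_tup, test_tup in fold_tuples(10, 4):
    print(train_tup, test_tup)
```

Only the first and last folds have a single contiguous training range; every fold in between needs two ranges, which is why plain slicing is not enough for the train set.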

Examples

Creating arrays of folds and checking overlap

>>> import numpy as np
>>> from mlens.index import FoldIndex
>>> X = np.arange(10)
>>> print("Data set: %r" % X)
>>> print()
>>>
>>> idx = FoldIndex(4, X)
>>>
>>> for train, test in idx.generate(as_array=True):
...     print('TRAIN IDX: %32r | TEST IDX: %16r' % (train, test))
>>>
>>> print()
>>>
>>> for train, test in idx.generate(as_array=True):
...     print('TRAIN SET: %32r | TEST SET: %16r' % (X[train], X[test]))
>>>
>>> for train_idx, test_idx in idx.generate(as_array=True):
...     assert not any([i in X[test_idx] for i in X[train_idx]])
>>>
>>> print()
>>>
>>> print("No overlap between train set and test set.")
Data set: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
TRAIN IDX:     array([3, 4, 5, 6, 7, 8, 9]) | TEST IDX: array([0, 1, 2])
TRAIN IDX:     array([0, 1, 2, 6, 7, 8, 9]) | TEST IDX: array([3, 4, 5])
TRAIN IDX:  array([0, 1, 2, 3, 4, 5, 8, 9]) | TEST IDX:    array([6, 7])
TRAIN IDX:  array([0, 1, 2, 3, 4, 5, 6, 7]) | TEST IDX:    array([8, 9])
TRAIN SET:     array([3, 4, 5, 6, 7, 8, 9]) | TEST SET: array([0, 1, 2])
TRAIN SET:     array([0, 1, 2, 6, 7, 8, 9]) | TEST SET: array([3, 4, 5])
TRAIN SET:  array([0, 1, 2, 3, 4, 5, 8, 9]) | TEST SET:    array([6, 7])
TRAIN SET:  array([0, 1, 2, 3, 4, 5, 6, 7]) | TEST SET:    array([8, 9])
No overlap between train set and test set.

Passing folds = 1 without raising exception:

>>> import numpy as np
>>> from mlens.index import FoldIndex
>>> X = np.arange(3)
>>> print("Data set: %r" % X)
>>> print()
>>>
>>> idx = FoldIndex(1, X, raise_on_exception=False)
>>>
>>> for train, test in idx.generate(as_array=True):
...     print('TRAIN IDX: %10r | TEST IDX: %10r' % (train, test))
/../mlens/base/indexer.py:167: UserWarning: 'folds' is 1, will return
full index as both training set and test set.
warnings.warn("'folds' is 1, will return full index as "
Data set: array([0, 1, 2])
TRAIN IDX: array([0, 1, 2]) | TEST IDX: array([0, 1, 2])

fit(X, y=None, job=None)[source]

Method for storing array data.

Parameters:
  • X (array-like of shape [n_samples, optional]) – array to collect dimension data from.
  • y (None) – for compatibility
  • job (None) – for compatibility
Returns:

indexer with stored sample size data.

Return type:

instance

BlendIndex

class mlens.index.BlendIndex(test_size=0.5, train_size=None, X=None, raise_on_exception=True)[source]

Bases: mlens.index.base.BaseIndex

Indexer that generates two non-overlapping subsets of X.

Iterator that generates one training fold and one test fold that are non-overlapping and that may or may not partition all of X depending on the user’s specification.

BlendIndex creates a singleton generator (it has one iteration) that yields two tuples of (start, stop) integers that can be used for numpy array slicing (i.e. X[start:stop]). If a full array index is desired this can easily be achieved with:

for train_tup, test_tup in self.generate():
    train_slice = numpy.hstack([numpy.arange(t0, t1) for t0, t1 in
                              train_tup])

    test_slice = numpy.hstack([numpy.arange(t0, t1) for t0, t1 in
                              test_tup])

Parameters:
  • test_size (int or float (default = 0.5)) – Size of the test set. If float, assumed to be a proportion of the full data set.
  • train_size (int or float, optional) – Size of the train set. If not specified (i.e. train_size = None), train_size is equal to n_samples - test_size. If float, assumed to be a proportion of the full data set. If train_size + test_size amount to less than the observations in the full data set, a subset of the specified size will be used.
  • X (array-like of shape [n_samples,] , optional) – the training set to partition. The training label array is also accepted, as only the first dimension is used. If X is not passed at instantiation, the fit method must be called before generate, or X must be passed as an argument of generate.
  • raise_on_exception (bool (default = True)) – whether to warn on suspicious slices or raise an error.
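The way test_size and train_size resolve to index ranges can be sketched as follows. The function blend_ranges is hypothetical; placing the training block first and rounding float sizes down are assumptions inferred from the examples for this class, not taken from the BlendIndex source.

```python
def blend_ranges(n_samples, test_size=0.5, train_size=None):
    """Return ((train_start, train_stop), (test_start, test_stop))."""
    def as_count(size):
        # Floats are read as a proportion of the data; rounding down is an
        # assumption inferred from the 45% * 8 -> 3 case in the examples.
        return int(size * n_samples) if isinstance(size, float) else size

    n_test = as_count(test_size)
    n_train = n_samples - n_test if train_size is None else as_count(train_size)
    if n_train + n_test > n_samples:
        raise ValueError("train_size + test_size exceeds the number of samples")

    # The training block comes first, followed directly by the test block.
    return (0, n_train), (n_train, n_train + n_test)


print(blend_ranges(8, 3))           # ((0, 5), (5, 8))
print(blend_ranges(8, 3, 4))        # ((0, 4), (4, 7))
print(blend_ranges(8, 0.25, 0.45))  # ((0, 3), (3, 5))
```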

Examples

Selecting an absolute test size, with train size as the remainder

>>> import numpy as np
>>> from mlens.index import BlendIndex
>>> X = np.arange(8)
>>> idx = BlendIndex(3)
>>> print('Test size: 3')
>>> for tri, tei in idx.generate(X):
...     print('TEST (idx | array): (%i, %i) | %r ' % (tei[0], tei[1],
...                                                   X[tei[0]:tei[1]]))
...     print('TRAIN (idx | array): (%i, %i) | %r ' % (tri[0], tri[1],
...                                                    X[tri[0]:tri[1]]))
Test size: 3
TEST (idx | array): (5, 8) | array([5, 6, 7])
TRAIN (idx | array): (0, 5) | array([0, 1, 2, 3, 4])

Selecting a test and train size less than the total

>>> import numpy as np
>>> from mlens.index import BlendIndex
>>> X = np.arange(8)
>>> idx = BlendIndex(3, 4, X)
>>> print('Test size: 3')
>>> print('Train size: 4')
>>> for tri, tei in idx.generate(X):
...     print('TEST (idx | array): (%i, %i) | %r ' % (tei[0], tei[1],
...                                                   X[tei[0]:tei[1]]))
...     print('TRAIN (idx | array): (%i, %i) | %r ' % (tri[0], tri[1],
...                                                    X[tri[0]:tri[1]]))
Test size: 3
Train size: 4
TEST (idx | array): (4, 7) | array([4, 5, 6])
TRAIN (idx | array): (0, 4) | array([0, 1, 2, 3])

Selecting a percentage of observations as test and train set

>>> import numpy as np
>>> from mlens.index import BlendIndex
>>> X = np.arange(8)
>>> idx = BlendIndex(0.25, 0.45, X)
>>> print('Test size: 25% * 8 = 2')
>>> print('Train size: 45% * 8 < 4 -> 3')
>>> for tri, tei in idx.generate(X):
...     print('TEST (idx | array): (%i, %i) | %r ' % (tei[0], tei[1],
...                                                   X[tei[0]:tei[1]]))
...     print('TRAIN (idx | array): (%i, %i) | %r ' % (tri[0], tri[1],
...                                                    X[tri[0]:tri[1]]))
Test size: 25% * 8 = 2
Train size: 45% * 8 < 4 -> 3
TEST (idx | array): (3, 5) | array([3, 4])
TRAIN (idx | array): (0, 3) | array([0, 1, 2])

Rebasing the test set to be 0-indexed

>>> import numpy as np
>>> from mlens.index import BlendIndex
>>> X = np.arange(8)
>>> idx = BlendIndex(3, rebase=True)
>>> print('Test size: 3')
>>> for tri, tei in idx.generate(X):
...     print('TEST tuple: (%i, %i) | array: %r' % (tei[0], tei[1],
...                                                 np.arange(tei[0],
...                                                           tei[1])))
Test size: 3
TEST tuple: (0, 3) | array: array([0, 1, 2])

fit(X, y=None, job=None)[source]

Method for storing array data.

Parameters:
  • X (array-like of shape [n_samples, optional]) – array to collect dimension data from.
  • y (None) – for compatibility
  • job (None) – for compatibility
Returns:

indexer with stored sample size data.

Return type:

instance

SubsetIndex

class mlens.index.SubsetIndex(partitions=2, folds=2, X=None, raise_on_exception=True)[source]

Bases: mlens.index.base.BaseIndex

Subsample index generator.

Generates cross-validation folds by creating J partitions of the data and v folds on each partition, as per [1]:

  1. Split X into J partitions

  2. For each partition:

    1. For each fold v, create train index of all idx not in v
    2. Concatenate all the fold v indices into a test index for fold v that spans all partitions

Setting J = 1 is equivalent to FoldIndex, which returns standard K-Fold train and test set indices.
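The steps above can be sketched in pure Python. The helpers sizes and subset_folds are hypothetical names, and the contiguous, near-equal splitting is an assumption chosen to agree with the example output in this section; this is not the SubsetIndex implementation.

```python
def sizes(n, k):
    """Sizes of k near-equal chunks of n items; first chunks get the remainder."""
    base, extra = divmod(n, k)
    return [base + 1 if i < extra else base for i in range(k)]


def subset_folds(n_samples, partitions, folds):
    """Return a list of (partition, fold, train_idx, test_idx) tuples."""
    # Step 1: split the sample range into J contiguous partitions.
    parts, start = [], 0
    for size in sizes(n_samples, partitions):
        parts.append(list(range(start, start + size)))
        start += size

    # Step 2: split every partition into v folds.
    chunks = []
    for part in parts:
        pos, part_chunks = 0, []
        for size in sizes(len(part), folds):
            part_chunks.append(part[pos:pos + size])
            pos += size
        chunks.append(part_chunks)

    out = []
    for j, part_chunks in enumerate(chunks):
        for v in range(folds):
            # Train on the partition's indices that are not in fold v.
            train = [i for f, c in enumerate(part_chunks) if f != v for i in c]
            # Test on fold v gathered from all partitions.
            test = [i for pc in chunks for i in pc[v]]
            out.append((j + 1, v + 1, train, test))
    return out


for j, v, train, test in subset_folds(10, 3, 2):
    print("J = %i | f = %i | train: %r | test: %r" % (j, v, train, test))
```

Note that the test index for a given fold is identical across partitions, while each train index stays inside its own partition.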

See also

FoldIndex, BlendIndex, Subsemble

References

[1] Sapp, S., van der Laan, M. J., & Canny, J. (2014). Subsemble: an ensemble method for combining subset-specific algorithm fits. Journal of Applied Statistics, 41(6), 1247-1259. http://doi.org/10.1080/02664763.2013.864263

Parameters:
  • partitions (int, list (default = 2)) – Number of partitions to split data in. If partitions=1, SubsetIndex reduces to standard K-Fold.
  • folds (int (default = 2)) – Number of splits to create in each partition. folds cannot be 1 if partitions > 1. Note that if folds = 1, both the train and test set will index the full data.
  • X (array-like of shape [n_samples,] , optional) – the training set to partition. The training label array is also accepted, as only the first dimension is used. If X is not passed at instantiation, the fit method must be called before generate, or X must be passed as an argument of generate.
  • raise_on_exception (bool (default = True)) – whether to warn on suspicious slices or raise an error.

Examples

>>> import numpy as np
>>> from mlens.index import SubsetIndex
>>> X = np.arange(10)
>>> idx = SubsetIndex(3, X=X)
>>>
>>> print('Expected partitions of X:')
>>> print('J = 1: {!r}'.format(X[0:4]))
>>> print('J = 2: {!r}'.format(X[4:7]))
>>> print('J = 3: {!r}'.format(X[7:10]))
>>> print('SubsetIndexer partitions:')
>>> for i, part in enumerate(idx.partition(as_array=True)):
...     print('J = {}: {!r}'.format(i + 1, part))
>>> print('SubsetIndexer folds on partitions:')
>>> for i, (tri, tei) in enumerate(idx.generate()):
...     fold = i % 2 + 1
...     part = i // 2 + 1
...     train = np.hstack([np.arange(t0, t1) for t0, t1 in tri])
...     test = np.hstack([np.arange(t0, t1) for t0, t1 in tei])
...     print("J = %i | f = %i | "
...           "train: %15r | test: %r" % (part, fold, train, test))
Expected partitions of X:
J = 1: array([0, 1, 2, 3])
J = 2: array([4, 5, 6])
J = 3: array([7, 8, 9])
SubsetIndexer partitions:
J = 1: array([0, 1, 2, 3])
J = 2: array([4, 5, 6])
J = 3: array([7, 8, 9])
SubsetIndexer folds on partitions:
J = 1 | f = 1 | train:   array([2, 3]) | test: array([0, 1, 4, 5, 7, 8])
J = 1 | f = 2 | train:   array([0, 1]) | test: array([2, 3, 6, 9])
J = 2 | f = 1 | train:      array([6]) | test: array([0, 1, 4, 5, 7, 8])
J = 2 | f = 2 | train:   array([4, 5]) | test: array([2, 3, 6, 9])
J = 3 | f = 1 | train:      array([9]) | test: array([0, 1, 4, 5, 7, 8])
J = 3 | f = 2 | train:   array([7, 8]) | test: array([2, 3, 6, 9])

fit(X, y=None, job=None)[source]

Method for storing array data.

Parameters:
  • X (array-like of shape [n_samples, optional]) – array to collect dimension data from.
  • y (None) – for compatibility
  • job (None) – for compatibility
Returns:

indexer with stored sample size data.

Return type:

instance

partition(X=None, as_array=False)[source]

Get partition indices for training full subset estimators.

Returns the index range for each partition of X.

Parameters:
  • X (array-like of shape [n_samples,] , optional) – the training set to partition. The training label array is also accepted, as only the first dimension is used. If X is not passed at instantiation, the fit method must be called before generate, or X must be passed as an argument of generate.
  • as_array (bool (default = False)) – whether to return partition as an index array. Otherwise tuples of (start, stop) indices are returned.

ClusteredSubsetIndex

class mlens.index.ClusteredSubsetIndex(partition_estimator, partitions=2, folds=2, X=None, y=None, fit_estimator=True, attr='predict', partition_on='X', raise_on_exception=True)[source]

Bases: mlens.index.base.BaseIndex

Clustered Subsample index generator.

Generates cross-validation folds by creating J partitions of the data and v folds on each partition, as per [2]:

  1. Split X into J partitions

  2. For each partition:

    1. For each fold v, create train index of all idx not in v
    2. Concatenate all the fold v indices into a test index for fold v that spans all partitions

Setting J = 1 is equivalent to FoldIndex, which returns standard K-Fold train and test set indices.

ClusteredSubsetIndex uses a user-provided estimator to partition the data, in contrast to the SubsetIndex generator, which partitions the data randomly into equal sizes.
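As a rough illustration of estimator-driven partitioning, the sketch below groups row indices by cluster label using numpy only. The helper partitions_from_labels is hypothetical, and the label vector is copied from the example output in this section; in practice the labels would come from the partition estimator (for instance via its predict method).

```python
import numpy as np


def partitions_from_labels(labels):
    """Map each cluster label to the sorted row indices carrying that label."""
    labels = np.asarray(labels)
    return {int(k): np.flatnonzero(labels == k) for k in np.unique(labels)}


labels = np.array([0, 2, 2, 1, 2, 0, 0, 1, 1, 2, 0, 1])
parts = partitions_from_labels(labels)
for k, idx in parts.items():
    print("partition ({}) index: {}".format(k + 1, idx))
```

Unlike the fixed-size split used by SubsetIndex, partition sizes here depend entirely on how the estimator assigns labels, so they can be unbalanced.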

References

[2] Sapp, S., van der Laan, M. J., & Canny, J. (2014). Subsemble: an ensemble method for combining subset-specific algorithm fits. Journal of Applied Statistics, 41(6), 1247-1259. http://doi.org/10.1080/02664763.2013.864263

Parameters:
  • partition_estimator (instance) – Estimator to use for clustering.
  • partitions (int) – Number of partitions the estimator will create.
  • folds (int (default = 2)) – Number of folds to create in each partition. folds cannot be 1 if partitions > 1. Note that if folds = 1, both the train and test set will index the full data.
  • fit_estimator (bool (default = True)) – whether to fit the estimator separately before generating labels.
  • attr (str (default = 'predict')) – the attribute to use for generating cluster membership labels.
  • X (array-like of shape [n_samples,] , optional) – the training set to partition. The training label array is also accepted, as only the first dimension is used. If X is not passed at instantiation, the fit method must be called before generate, or X must be passed as an argument of generate.
  • raise_on_exception (bool (default = True)) – whether to warn on suspicious slices or raise an error.

Examples

>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> from mlens.index import ClusteredSubsetIndex
>>>
>>> km = KMeans(3, random_state=0)
>>> X = np.arange(12).reshape(-1, 1); np.random.shuffle(X)
>>> print("Data: {}".format(X.ravel()))
>>>
>>> s = ClusteredSubsetIndex(km)
>>> s.fit(X)
>>>
>>> P = s.partition_estimator.predict(X)
>>> print("cluster labels: {}".format(P))
>>>
>>> for j, i in enumerate(s.partition(as_array=True)):
...     print("partition ({}) index: {}, cluster labels: {}".format(j + 1, i, P[i]))
>>>
>>> for i in s.generate(as_array=True):
...     print("train fold index: {}, cluster labels: {}".format(i[0], P[i[0]]))
Data: [ 8  7  5  2  4 10 11  1  3  6  9  0]
cluster labels: [0 2 2 1 2 0 0 1 1 2 0 1]
partition (1) index: [ 0  5  6 10], cluster labels: [0 0 0 0]
partition (2) index: [ 3  7  8 11], cluster labels: [1 1 1 1]
partition (3) index: [1 2 4 9], cluster labels: [2 2 2 2]
train fold index: [0 3 5], cluster labels: [0 0 0]
train fold index: [ 6 10], cluster labels: [0 0]
train fold index: [2 7], cluster labels: [1 1]
train fold index: [ 9 11], cluster labels: [1 1]
train fold index: [1 4], cluster labels: [2 2]
train fold index: [8], cluster labels: [2]

fit(X, y=None, job='fit')[source]

Method for storing array data.

Parameters:
  • X (array-like of shape [n_samples, n_features]) – input array.
  • y (array-like of shape [n_samples, ]) – labels.
  • job (str, ['fit', 'predict'] (default='fit')) – type of estimation job. If ‘fit’, the indexer will be fitted, which involves fitting the estimator. Otherwise, the indexer will not be fitted (since it is not used for prediction).
Returns:

indexer with stored sample size data.

Return type:

instance

partition(X=None, y=None, as_array=False)[source]

Get partition indices for training full subset estimators.

Returns the index range for each partition of X.

Parameters:
  • X (array-like of shape [n_samples, n_features] , optional) – the set to partition. The training label array is also accepted, as only the first dimension is used. If X is not passed at instantiation, the fit method must be called before generate, or X must be passed as an argument of generate.
  • y (array-like of shape [n_samples,], optional) – the labels of the set to partition.
  • as_array (bool (default = False)) – whether to return partition as an index array. Otherwise tuples of (start, stop) indices are returned.

FullIndex

class mlens.index.FullIndex(X=None)[source]

Bases: mlens.index.base.BaseIndex

Vacuous indexer to be used with final layers.

FullIndex is a compatibility class to be used with meta layers. It stores the sample size to be predicted for use with the ParallelProcessing job manager, and yields a None, None index when generate is called.

fit(X, y=None, job=None)[source]

Store dimensionality data about X.

Utilities

prune_train

mlens.index.prune_train(start_below, stop_below, start_above, stop_above)[source]

Checks if the indices above or below are empty and removes them.

A utility function for checking if the train indices below a given test set range are (0, 0), or if the indices above the test set range are (n, n). Either range leads to an empty array and can therefore safely be removed to create a single training set index range.

Parameters:
  • start_below (int) – index number starting below the test set. Should always be the same for all test sets.
  • stop_below (int) – the index number at which the test set starts.
  • start_above (int) – the index number at which the test set ends.
  • stop_above (int) – The end of the data set (n). Should always be the same for all test sets.
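The pruning logic described above might look roughly like the following. The function prune_train_sketch is a hypothetical stand-in, and returning a tuple of the surviving (start, stop) ranges is an assumption; the return value of prune_train is not documented here.

```python
def prune_train_sketch(start_below, stop_below, start_above, stop_above):
    """Return the non-empty (start, stop) training ranges around a test set."""
    ranges = ((start_below, stop_below), (start_above, stop_above))
    # A range whose start equals its stop, such as (0, 0) or (n, n), is empty.
    return tuple(r for r in ranges if r[0] != r[1])


print(prune_train_sketch(0, 0, 3, 10))   # ((3, 10),)
print(prune_train_sketch(0, 3, 6, 10))   # ((0, 3), (6, 10))
print(prune_train_sketch(0, 8, 10, 10))  # ((0, 8),)
```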

partition

mlens.index.partition(n, p)[source]

Get partition sizes for a given number of samples and partitions.

This method will give an array containing the sizes of p partitions given a total sample size of n. If there is a remainder from the split, the r first folds will be incremented by 1.

Parameters:
  • n (int) – number of samples.
  • p (int) – number of partitions.

Examples

Return sample sizes of 2 partitions given a total of 4 samples

>>> from mlens.index.base import partition
>>> partition(4, 2)
array([2, 2])

Return sample sizes of 3 partitions given a total of 8 samples

>>> from mlens.index.base import partition
>>> partition(8, 3)
array([3, 3, 2])
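The size computation can be sketched with numpy as below. The function partition_sizes is a hypothetical reimplementation written to match the description and examples above, not the library source.

```python
import numpy as np


def partition_sizes(n, p):
    """Sizes of p partitions of n samples; the first r sizes absorb the remainder."""
    sizes = np.full(p, n // p, dtype=int)
    # r = n % p partitions receive one extra sample each.
    sizes[: n % p] += 1
    return sizes


print(partition_sizes(4, 2))   # [2 2]
print(partition_sizes(8, 3))   # [3 3 2]
print(partition_sizes(10, 4))  # [3 3 2 2]
```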

make_tuple

mlens.index.make_tuple(arr)[source]

Make a list of index tuples from array

Parameters:
  • arr (array) –
Returns:

out

Return type:

list

Examples

>>> import numpy as np
>>> from mlens.index.base import make_tuple
>>> make_tuple(np.array([0, 1, 2, 5, 6, 8, 9, 10]))
[(0, 3), (5, 7), (8, 11)]
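For intuition, the conversion from an index array to (start, stop) tuples can be sketched as a single pass that closes a range whenever consecutive indices are not adjacent. The function make_tuple_sketch is hypothetical and assumes the input is sorted in ascending order.

```python
def make_tuple_sketch(arr):
    """Collapse a sorted index sequence into a list of (start, stop) tuples."""
    out = []
    start = prev = None
    for i in arr:
        i = int(i)
        if start is None:
            # First element opens the first range.
            start = prev = i
        elif i == prev + 1:
            # Adjacent index: extend the current range.
            prev = i
        else:
            # Gap found: close the current range (stop is exclusive).
            out.append((start, prev + 1))
            start = prev = i
    if start is not None:
        out.append((start, prev + 1))
    return out


print(make_tuple_sketch([0, 1, 2, 5, 6, 8, 9, 10]))  # [(0, 3), (5, 7), (8, 11)]
```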