ML-ENSEMBLE
author: Sebastian Flennerhag
copyright: 2017-2018
license: MIT
Classes for implementing various cross-validation strategies. By default, ML-Ensemble indexers generate lists of tuples, as opposed to index arrays, to avoid serialization during multiprocessing.
mlens.index
Indexers
BaseIndex
class mlens.index.BaseIndex
Bases: mlens.externals.sklearn.base.BaseEstimator
Base Index class.
Specification of indexer-wide methods and attributes that we can always expect to find in any indexer. Helps to provide a uniform interface during parallel estimation.
fit(X, y=None, job=None)
Method for storing array data.
Parameters:
- X (array-like of shape [n_samples, optional]) – array to collect dimension data from.
- y (array-like, optional) – label data
- job (str, optional) – optional job type data
Returns: indexer with stored sample size data.
Return type: instance
Notes
Fitting an indexer stores nothing that points to the array or memmap X. Only the shape attribute of X is accessed.
generate(X=None, as_array=False)
Front-end generator method.
Generator for training and test set indices based on the generator specification in _gen_indicies.
Parameters:
- X (array-like, optional) – If the instance has not been fitted, the training set X must be passed to the generate method, which will call fit before proceeding. If already fitted, X can be omitted.
- as_array (bool (default = False)) – whether to return train and test indices as a pair of tuple(s) or numpy arrays. If the returned tuples are singular they can be used on an array X with standard slicing syntax (X[start:stop]), but if a list of tuples is returned, slicing X properly requires first building a list or array of index numbers from the list of tuples. This can be achieved either by setting as_array to True, or by running
    for train_tup, test_tup in indexer.generate():
        train_idx = np.hstack([np.arange(t0, t1) for t0, t1 in train_tup])
  when slicing is required.
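The conversion from a list of (start, stop) tuples to an index array can be sketched with plain numpy. This is a minimal illustration that does not import mlens; the helper name tuples_to_index and the sample tuples are assumptions made for the example.

```python
import numpy as np


def tuples_to_index(tuples):
    """Expand a list of (start, stop) tuples into one integer index array.

    Hypothetical helper for illustration; not part of mlens.
    """
    return np.hstack([np.arange(start, stop) for start, stop in tuples])


X = np.arange(10) * 10

# Assumed example: a training set made of two ranges around a test fold.
train_tup = [(0, 3), (6, 10)]
# A single (start, stop) tuple can be used with plain slicing.
test_tup = (3, 6)

train_idx = tuples_to_index(train_tup)
X_train = X[train_idx]
X_test = X[test_tup[0]:test_tup[1]]
```

A single tuple needs no conversion, which is why only the training ranges are expanded here.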
partition(X=None, as_array=False)
Partition generator method.
Default behavior is to yield None for fitting on full data. Overridden in SubsetIndex and ClusteredSubsetIndex to produce partition indexes.
set_params(**params)
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
Return type: self
FoldIndex
class mlens.index.FoldIndex(folds=2, X=None, raise_on_exception=True)
Bases: mlens.index.base.BaseIndex
Indexer that generates the full size of X.
K-Fold iterator that generates fold index tuples.
FoldIndex creates a generator that returns tuples of start and stop positions to be used for numpy array slicing [start:stop]. Note that slicing works well for the test set, but for the training set it is recommended to concatenate the index for training data that comes before the current test set with the index for the training data that comes after. This can easily be achieved with:
    for train_tup, test_tup in self.generate():
        train_slice = numpy.hstack([numpy.arange(t0, t1) for t0, t1 in train_tup])
        xtrain, xtest = X[train_slice], X[test_tup[0]:test_tup[1]]
Warning
Simple slicing (i.e. X[start:stop]) generally does not work for the train set, which often requires concatenating the train index range below the current test set and the train index range above the current test set. To build a training index, use hstack([np.arange(t0, t1) for t0, t1 in train_index_tuples]).
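The shape of the fold tuples can be illustrated without mlens. The sketch below is an assumption about how such tuples may be laid out, not the library's implementation: kfold_tuples is a hypothetical helper that gives the first folds one extra sample when the sample size does not divide evenly, and drops empty training ranges.

```python
import numpy as np


def kfold_tuples(n_samples, folds):
    """Yield (train_tuples, test_tuple) pairs of (start, stop) ranges.

    Hypothetical helper for illustration; not part of mlens.
    """
    sizes = np.full(folds, n_samples // folds, dtype=int)
    sizes[:n_samples % folds] += 1          # spread the remainder
    stops = np.cumsum(sizes)
    starts = stops - sizes
    for start, stop in zip(starts, stops):
        start, stop = int(start), int(stop)
        # Training data lies below and above the test range;
        # empty ranges such as (0, 0) are dropped.
        candidates = ((0, start), (stop, n_samples))
        train = [(a, b) for a, b in candidates if a != b]
        yield train, (start, stop)


X = np.arange(10)
splits = list(kfold_tuples(len(X), 4))

for train_tup, test_tup in splits:
    # The training set needs hstack; the test set is a plain slice.
    train_idx = np.hstack([np.arange(a, b) for a, b in train_tup])
    x_train, x_test = X[train_idx], X[test_tup[0]:test_tup[1]]
    assert not set(x_train.tolist()) & set(x_test.tolist())
```

For the middle folds the training set consists of two ranges, which is the case where a single X[start:stop] slice is not enough.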
Parameters:
- folds (int (default = 2)) – Number of splits to create in each partition. folds can not be 1 if n_partition > 1. Note that if folds = 1, both the train and test set will index the full data.
- X (array-like of shape [n_samples,], optional) – the training set to partition. The training label array is also accepted, as only the first dimension is used. If X is not passed at instantiation, the fit method must be called before generate, or X must be passed as an argument of generate.
- raise_on_exception (bool (default = True)) – whether to raise an error on suspicious slices (True) or only issue a warning (False).
Examples
Creating arrays of folds and checking overlap
>>> import numpy as np
>>> from mlens.index import FoldIndex
>>> X = np.arange(10)
>>> print("Data set: %r" % X)
>>> print()
>>>
>>> idx = FoldIndex(4, X)
>>>
>>> for train, test in idx.generate(as_array=True):
...     print('TRAIN IDX: %32r | TEST IDX: %16r' % (train, test))
>>>
>>> print()
>>>
>>> for train, test in idx.generate(as_array=True):
...     print('TRAIN SET: %32r | TEST SET: %16r' % (X[train], X[test]))
>>>
>>> for train_idx, test_idx in idx.generate(as_array=True):
...     assert not any([i in X[test_idx] for i in X[train_idx]])
>>>
>>> print()
>>>
>>> print("No overlap between train set and test set.")
Data set: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
TRAIN IDX: array([3, 4, 5, 6, 7, 8, 9]) | TEST IDX: array([0, 1, 2])
TRAIN IDX: array([0, 1, 2, 6, 7, 8, 9]) | TEST IDX: array([3, 4, 5])
TRAIN IDX: array([0, 1, 2, 3, 4, 5, 8, 9]) | TEST IDX: array([6, 7])
TRAIN IDX: array([0, 1, 2, 3, 4, 5, 6, 7]) | TEST IDX: array([8, 9])
TRAIN SET: array([3, 4, 5, 6, 7, 8, 9]) | TEST SET: array([0, 1, 2])
TRAIN SET: array([0, 1, 2, 6, 7, 8, 9]) | TEST SET: array([3, 4, 5])
TRAIN SET: array([0, 1, 2, 3, 4, 5, 8, 9]) | TEST SET: array([6, 7])
TRAIN SET: array([0, 1, 2, 3, 4, 5, 6, 7]) | TEST SET: array([8, 9])
No overlap between train set and test set.
Passing folds = 1 without raising exception:
>>> import numpy as np
>>> from mlens.index import FoldIndex
>>> X = np.arange(3)
>>> print("Data set: %r" % X)
>>> print()
>>>
>>> idx = FoldIndex(1, X, raise_on_exception=False)
>>>
>>> for train, test in idx.generate(as_array=True):
...     print('TRAIN IDX: %10r | TEST IDX: %10r' % (train, test))
/../mlens/base/indexer.py:167: UserWarning: 'folds' is 1, will return full index as both training set and test set.
  warnings.warn("'folds' is 1, will return full index as "
Data set: array([0, 1, 2])
TRAIN IDX: array([0, 1, 2]) | TEST IDX: array([0, 1, 2])
BlendIndex
class mlens.index.BlendIndex(test_size=0.5, train_size=None, X=None, raise_on_exception=True)
Bases: mlens.index.base.BaseIndex
Indexer that generates two non-overlapping subsets of X.
Iterator that generates one training fold and one test fold that are non-overlapping and that may or may not partition all of X depending on the user’s specification.
BlendIndex creates a singleton generator (has one iteration) that yields two tuples of (start, stop) integers that can be used for numpy array slicing (i.e. X[start:stop]). If a full array index is desired this can easily be achieved with:
    for train_tup, test_tup in self.generate():
        train_slice = numpy.arange(train_tup[0], train_tup[1])
        test_slice = numpy.arange(test_tup[0], test_tup[1])
Parameters:
- test_size (int or float (default = 0.5)) – Size of the test set. If float, assumed to be a proportion of the full data set.
- train_size (int or float, optional) – Size of the train set. If not specified (i.e. train_size = None), train_size is equal to n_samples - test_size. If float, assumed to be a proportion of the full data set. If train_size + test_size amount to less than the observations in the full data set, a subset of the specified size will be used.
- X (array-like of shape [n_samples,], optional) – the training set to partition. The training label array is also accepted, as only the first dimension is used. If X is not passed at instantiation, the fit method must be called before generate, or X must be passed as an argument of generate.
- raise_on_exception (bool (default = True)) – whether to raise an error on suspicious slices (True) or only issue a warning (False).
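How test_size and train_size map to the two (start, stop) tuples can be sketched as follows. This is an illustration under assumptions, not the mlens implementation: blend_tuples is a hypothetical helper, and rounding float proportions down is inferred from the percentage example in this section.

```python
def blend_tuples(n_samples, test_size=0.5, train_size=None):
    """Return ((train_start, train_stop), (test_start, test_stop)).

    Hypothetical helper for illustration; not part of mlens.
    Floats are read as proportions of n_samples and rounded down
    (an assumption based on the documented examples).
    """
    def to_count(size):
        if isinstance(size, float):
            return int(size * n_samples)
        return int(size)

    n_test = to_count(test_size)
    if train_size is None:
        # Train set is the remainder of the data.
        n_train = n_samples - n_test
    else:
        n_train = to_count(train_size)

    if n_train + n_test > n_samples:
        raise ValueError("train_size + test_size exceeds n_samples")

    # The train range comes first, the test range directly after it.
    return (0, n_train), (n_train, n_train + n_test)


remainder_split = blend_tuples(8, test_size=3)
subset_split = blend_tuples(8, test_size=3, train_size=4)
fraction_split = blend_tuples(8, test_size=0.25, train_size=0.45)
```

When the two sizes add up to fewer samples than the data holds, the trailing observations are simply left out of both ranges.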
Examples
Selecting an absolute test size, with train size as the remainder
>>> import numpy as np
>>> from mlens.index import BlendIndex
>>> X = np.arange(8)
>>> idx = BlendIndex(3)
>>> print('Test size: 3')
>>> for tri, tei in idx.generate(X):
...     print('TEST (idx | array): (%i, %i) | %r ' % (tei[0], tei[1],
...                                                   X[tei[0]:tei[1]]))
...     print('TRAIN (idx | array): (%i, %i) | %r ' % (tri[0], tri[1],
...                                                    X[tri[0]:tri[1]]))
Test size: 3
TEST (idx | array): (5, 8) | array([5, 6, 7])
TRAIN (idx | array): (0, 5) | array([0, 1, 2, 3, 4])
Selecting a test and train size less than the total
>>> import numpy as np
>>> from mlens.index import BlendIndex
>>> X = np.arange(8)
>>> idx = BlendIndex(3, 4, X)
>>> print('Test size: 3')
>>> print('Train size: 4')
>>> for tri, tei in idx.generate(X):
...     print('TEST (idx | array): (%i, %i) | %r ' % (tei[0], tei[1],
...                                                   X[tei[0]:tei[1]]))
...     print('TRAIN (idx | array): (%i, %i) | %r ' % (tri[0], tri[1],
...                                                    X[tri[0]:tri[1]]))
Test size: 3
Train size: 4
TEST (idx | array): (4, 7) | array([4, 5, 6])
TRAIN (idx | array): (0, 4) | array([0, 1, 2, 3])
Selecting a percentage of observations as test and train set
>>> import numpy as np
>>> from mlens.index import BlendIndex
>>> X = np.arange(8)
>>> idx = BlendIndex(0.25, 0.45, X)
>>> print('Test size: 25% * 8 = 2')
>>> print('Train size: 45% * 8 < 4 -> 3')
>>> for tri, tei in idx.generate(X):
...     print('TEST (idx | array): (%i, %i) | %r ' % (tei[0], tei[1],
...                                                   X[tei[0]:tei[1]]))
...     print('TRAIN (idx | array): (%i, %i) | %r ' % (tri[0], tri[1],
...                                                    X[tri[0]:tri[1]]))
Test size: 25% * 8 = 2
Train size: 45% * 8 < 4 -> 3
TEST (idx | array): (3, 5) | array([3, 4])
TRAIN (idx | array): (0, 3) | array([0, 1, 2])
Rebasing the test set to be 0-indexed
>>> import numpy as np
>>> from mlens.index import BlendIndex
>>> X = np.arange(8)
>>> idx = BlendIndex(3, rebase=True)
>>> print('Test size: 3')
>>> for tri, tei in idx.generate(X):
...     print('TEST tuple: (%i, %i) | array: %r' % (tei[0], tei[1],
...                                                 np.arange(tei[0],
...                                                           tei[1])))
Test size: 3
TEST tuple: (0, 3) | array: array([0, 1, 2])
SubsetIndex
class mlens.index.SubsetIndex(partitions=2, folds=2, X=None, raise_on_exception=True)
Bases: mlens.index.base.BaseIndex
Subsample index generator.
Generates cross-validation folds by creating J partitions of the data and v folds on each partition, as per [1]:
Split X into J partitions
For each partition:
- For each fold v, create train index of all idx not in v
- Concatenate all the fold v indices into a test index for fold v that spans all partitions
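The steps above can be sketched in plain numpy. This is an illustration under assumptions rather than the mlens implementation: subset_folds is a hypothetical helper, and np.array_split is assumed as the splitting rule because it reproduces the contiguous partitions shown in the example for this class.

```python
import numpy as np


def subset_folds(n_samples, partitions=2, folds=2):
    """Yield (partition, fold, train_idx, test_idx) for each combination.

    Hypothetical helper for illustration; not part of mlens.
    """
    idx = np.arange(n_samples)
    # Step 1: split the index into J contiguous partitions.
    parts = np.array_split(idx, partitions)
    # Split every partition into v folds.
    part_folds = [np.array_split(part, folds) for part in parts]

    for j, part in enumerate(parts):
        for v in range(folds):
            # Train index: everything in this partition not in fold v.
            train = np.setdiff1d(part, part_folds[j][v])
            # Test index: fold v of every partition, concatenated.
            test = np.hstack([pf[v] for pf in part_folds])
            yield j + 1, v + 1, train, test


results = list(subset_folds(10, partitions=3, folds=2))
for part, fold, train, test in results:
    print("J = %i | f = %i | train: %r | test: %r" % (part, fold, train, test))
```

Each training set is drawn from a single partition, while each test set spans all partitions, so the test index is the same for a given fold regardless of the partition.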
Setting J = 1 is equivalent to FoldIndex, which returns standard K-Fold train and test set indices.
See also
FoldIndex, BlendIndex, Subsemble
References
[1] Sapp, S., van der Laan, M. J., & Canny, J. (2014). Subsemble: an ensemble method for combining subset-specific algorithm fits. Journal of Applied Statistics, 41(6), 1247-1259. http://doi.org/10.1080/02664763.2013.864263
Parameters:
- partitions (int, list (default = 2)) – Number of partitions to split the data into. If partitions=1, SubsetIndex reduces to standard K-Fold.
- folds (int (default = 2)) – Number of splits to create in each partition. folds can not be 1 if n_partition > 1. Note that if folds = 1, both the train and test set will index the full data.
- X (array-like of shape [n_samples,], optional) – the training set to partition. The training label array is also accepted, as only the first dimension is used. If X is not passed at instantiation, the fit method must be called before generate, or X must be passed as an argument of generate.
- raise_on_exception (bool (default = True)) – whether to raise an error on suspicious slices (True) or only issue a warning (False).
Examples
>>> import numpy as np
>>> from mlens.index import SubsetIndex
>>> X = np.arange(10)
>>> idx = SubsetIndex(3, X=X)
>>>
>>> print('Expected partitions of X:')
>>> print('J = 1: {!r}'.format(X[0:4]))
>>> print('J = 2: {!r}'.format(X[4:7]))
>>> print('J = 3: {!r}'.format(X[7:10]))
>>> print('SubsetIndexer partitions:')
>>> for i, part in enumerate(idx.partition(as_array=True)):
...     print('J = {}: {!r}'.format(i + 1, part))
>>> print('SubsetIndexer folds on partitions:')
>>> for i, (tri, tei) in enumerate(idx.generate()):
...     fold = i % 2 + 1
...     part = i // 2 + 1
...     train = np.hstack([np.arange(t0, t1) for t0, t1 in tri])
...     test = np.hstack([np.arange(t0, t1) for t0, t1 in tei])
...     print("J = %i | f = %i | "
...           "train: %15r | test: %r" % (part, fold, train, test))
Expected partitions of X:
J = 1: array([0, 1, 2, 3])
J = 2: array([4, 5, 6])
J = 3: array([7, 8, 9])
SubsetIndexer partitions:
J = 1: array([0, 1, 2, 3])
J = 2: array([4, 5, 6])
J = 3: array([7, 8, 9])
SubsetIndexer folds on partitions:
J = 1 | f = 1 | train: array([2, 3]) | test: array([0, 1, 4, 5, 7, 8])
J = 1 | f = 2 | train: array([0, 1]) | test: array([2, 3, 6, 9])
J = 2 | f = 1 | train: array([6]) | test: array([0, 1, 4, 5, 7, 8])
J = 2 | f = 2 | train: array([4, 5]) | test: array([2, 3, 6, 9])
J = 3 | f = 1 | train: array([9]) | test: array([0, 1, 4, 5, 7, 8])
J = 3 | f = 2 | train: array([7, 8]) | test: array([2, 3, 6, 9])
fit(X, y=None, job=None)
Method for storing array data.
Returns: indexer with stored sample size data.
Return type: instance
partition(X=None, as_array=False)
Get partition indices for training full subset estimators.
Returns the index range for each partition of X.
Parameters:
- X (array-like of shape [n_samples,], optional) – the training set to partition. The training label array is also accepted, as only the first dimension is used. If X is not passed at instantiation, the fit method must be called before generate, or X must be passed as an argument of generate.
- as_array (bool (default = False)) – whether to return partition as an index array. Otherwise tuples of (start, stop) indices are returned.
ClusteredSubsetIndex
class mlens.index.ClusteredSubsetIndex(partition_estimator, partitions=2, folds=2, X=None, y=None, fit_estimator=True, attr='predict', partition_on='X', raise_on_exception=True)
Bases: mlens.index.base.BaseIndex
Clustered Subsample index generator.
Generates cross-validation folds by creating J partitions of the data and v folds on each partition, as per [2]:
Split X into J partitions
For each partition:
- For each fold v, create train index of all idx not in v
- Concatenate all the fold v indices into a test index for fold v that spans all partitions
Setting J = 1 is equivalent to FoldIndex, which returns standard K-Fold train and test set indices.
ClusteredSubsetIndex uses a user-provided estimator to partition the data, in contrast to the SubsetIndex generator, which partitions data randomly into equal sizes.
References
[2] Sapp, S., van der Laan, M. J., & Canny, J. (2014). Subsemble: an ensemble method for combining subset-specific algorithm fits. Journal of Applied Statistics, 41(6), 1247-1259. http://doi.org/10.1080/02664763.2013.864263
Parameters:
- partition_estimator (instance) – Estimator to use for clustering.
- partitions (int) – Number of partitions the estimator will create.
- folds (int (default = 2)) – Number of folds to create in each partition. folds can not be 1 if n_partition > 1. Note that if folds = 1, both the train and test set will index the full data.
- fit_estimator (bool (default = True)) – whether to fit the estimator separately before generating labels.
- attr (str (default = 'predict')) – the attribute to use for generating cluster membership labels.
- X (array-like of shape [n_samples,], optional) – the training set to partition. The training label array is also accepted, as only the first dimension is used. If X is not passed at instantiation, the fit method must be called before generate, or X must be passed as an argument of generate.
- raise_on_exception (bool (default = True)) – whether to raise an error on suspicious slices (True) or only issue a warning (False).
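How cluster membership labels translate into partition indices can be sketched with numpy alone. This is an illustration, not the mlens implementation: partitions_from_labels is a hypothetical helper, and the label array is assumed to come from an estimator's predict output.

```python
import numpy as np


def partitions_from_labels(labels):
    """Return one index array per distinct cluster label.

    Hypothetical helper for illustration; not part of mlens.
    """
    labels = np.asarray(labels)
    # np.unique returns the sorted distinct labels; flatnonzero gives
    # the positions of the samples that belong to each label.
    return [np.flatnonzero(labels == k) for k in np.unique(labels)]


# Assumed labels, e.g. the output of a fitted clustering estimator.
labels = np.array([0, 2, 2, 1, 2, 0, 0, 1, 1, 2, 0, 1])
parts = partitions_from_labels(labels)

for j, part in enumerate(parts):
    print("partition ({}) index: {}, cluster labels: {}".format(
        j + 1, part, labels[part]))
```

Unlike the contiguous ranges produced by sequential partitioning, these partitions are scattered index arrays, so they cannot be expressed as a single (start, stop) tuple.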
Examples
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> from mlens.index import ClusteredSubsetIndex
>>>
>>> km = KMeans(3, random_state=0)
>>> X = np.arange(12).reshape(-1, 1); np.random.shuffle(X)
>>> print("Data: {}".format(X.ravel()))
>>>
>>> s = ClusteredSubsetIndex(km)
>>> s.fit(X)
>>>
>>> P = s.partition_estimator.predict(X)
>>> print("cluster labels: {}".format(P))
>>>
>>> for j, i in enumerate(s.partition(as_array=True)):
...     print("partition ({}) index: {}, cluster labels: {}".format(j + 1, i, P[i]))
>>>
>>> for i in s.generate(as_array=True):
...     print("train fold index: {}, cluster labels: {}".format(i[0], P[i[0]]))
Data: [ 8 7 5 2 4 10 11 1 3 6 9 0]
cluster labels: [0 2 2 1 2 0 0 1 1 2 0 1]
partition (1) index: [ 0 5 6 10], cluster labels: [0 0 0 0]
partition (2) index: [ 3 7 8 11], cluster labels: [1 1 1 1]
partition (3) index: [1 2 4 9], cluster labels: [2 2 2 2]
train fold index: [0 3 5], cluster labels: [0 0 0]
train fold index: [ 6 10], cluster labels: [0 0]
train fold index: [2 7], cluster labels: [1 1]
train fold index: [ 9 11], cluster labels: [1 1]
train fold index: [1 4], cluster labels: [2 2]
train fold index: [8], cluster labels: [2]
fit(X, y=None, job='fit')
Method for storing array data.
Parameters:
- X (array-like of shape [n_samples, n_features]) – input array.
- y (array-like of shape [n_samples, ]) – labels.
- job (str, ['fit', 'predict'] (default='fit')) – type of estimation job. If ‘fit’, the indexer will be fitted, which involves fitting the estimator. Otherwise, the indexer will not be fitted (since it is not used for prediction).
Returns: indexer with stored sample size data.
Return type: instance
partition(X=None, y=None, as_array=False)
Get partition indices for training full subset estimators.
Returns the index range for each partition of X.
Parameters:
- X (array-like of shape [n_samples, n_features], optional) – the set to partition. The training label array is also accepted, as only the first dimension is used. If X is not passed at instantiation, the fit method must be called before generate, or X must be passed as an argument of generate.
- y (array-like of shape [n_samples,], optional) – the labels of the set to partition.
- as_array (bool (default = False)) – whether to return partition as an index array. Otherwise tuples of (start, stop) indices are returned.
FullIndex
class mlens.index.FullIndex(X=None)
Bases: mlens.index.base.BaseIndex
Vacuous indexer to be used with final layers.
FullIndex is a compatibility class to be used with meta layers. It stores the sample size to be predicted for use with the ParallelProcessing job manager, and yields a None, None index when generate is called.
Utilities
prune_train
mlens.index.prune_train(start_below, stop_below, start_above, stop_above)
Checks if indices above or below are empty and removes them.
A utility function for checking if the train indices below a given test set range are (0, 0), or if the indices above the test set range are (n, n). In either case, the range leads to an empty array and can therefore safely be removed to create a single training set index range.
Parameters:
- start_below (int) – index number starting below the test set. Should always be the same for all test sets.
- stop_below (int) – the index number at which the test set is starting on.
- start_above (int) – the index number at which the test set ends.
- stop_above (int) – The end of the data set (n). Should always be the same for all test sets.
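The pruning idea can be sketched as follows. The return format of prune_train is not specified here, so this is a hypothetical helper (prune_ranges) that returns a list of the non-empty (start, stop) ranges; it is an assumption for illustration, not the library function.

```python
def prune_ranges(start_below, stop_below, start_above, stop_above):
    """Drop empty (start, stop) training ranges around a test set.

    Hypothetical helper for illustration; not part of mlens.
    A range is empty when its start equals its stop, e.g. (0, 0)
    for the first test fold or (n, n) for the last one.
    """
    ranges = [(start_below, stop_below), (start_above, stop_above)]
    return [(start, stop) for start, stop in ranges if start != stop]


n = 10
first_fold = prune_ranges(0, 0, 3, n)    # test set is X[0:3]
middle_fold = prune_ranges(0, 3, 6, n)   # test set is X[3:6]
last_fold = prune_ranges(0, 8, n, n)     # test set is X[8:10]
```

Only the middle fold keeps both ranges; the first and last folds collapse to a single range that can be used with plain slicing.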
partition
mlens.index.partition(n, p)
Get partition sizes for a given number of samples and partitions.
This method will give an array containing the sizes of p partitions given a total sample size of n. If there is a remainder r from the split, the first r partitions will be incremented by 1.
Examples
Return sample sizes of 2 partitions given a total of 4 samples
>>> from mlens.index.base import partition
>>> partition(4, 2)
array([2, 2])
Return sample sizes of 3 partitions given a total of 8 samples
>>> from mlens.index.base import partition
>>> partition(8, 3)
array([3, 3, 2])
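The size rule described above (equal shares, with the remainder spread over the first partitions) can be written out in a few lines of numpy. This is a sketch of the rule as documented, using a hypothetical name partition_sizes; it is not the library source.

```python
import numpy as np


def partition_sizes(n, p):
    """Return the sizes of p partitions of n samples.

    Hypothetical helper for illustration; not part of mlens.
    """
    sizes = np.full(p, n // p, dtype=int)   # equal base size
    sizes[:n % p] += 1                      # first r partitions get one extra
    return sizes


even = partition_sizes(4, 2)
uneven = partition_sizes(8, 3)
```

The sizes always sum to n, so the partitions cover every sample exactly once.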