ML-ENSEMBLE
| author: | Sebastian Flennerhag |
|---|---|
| copyright: | 2017-2018 |
| license: | MIT |
Classes for implementing various cross-validation strategies. By default, ML-Ensemble indexers generate lists of tuples, as opposed to index arrays, to avoid serialization during multiprocessing.
mlens.index
Indexers
BaseIndex
class mlens.index.BaseIndex
Bases: mlens.externals.sklearn.base.BaseEstimator
Base Index class.
Specification of indexer-wide methods and attributes that we can always expect to find in any indexer. Helps to provide a uniform interface during parallel estimation.
fit(X, y=None, job=None)
Method for storing array data.
Parameters:
- X (array-like of shape [n_samples, optional]) – array to collect dimension data from.
- y (array-like, optional) – label data
- job (str, optional) – optional job type data
Returns: indexer with stored sample size data.
Return type: instance
Notes
Fitting an indexer stores nothing that points to the array or memmap X. Only the shape attribute of X is accessed.
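The note above can be illustrated with a minimal sketch. Both `ShapeOnlyIndexer` and `FakeArray` are hypothetical names introduced for illustration and are not part of ML-Ensemble; the point is only that `fit` reads `X.shape[0]` and keeps no reference to `X`.

```python
class ShapeOnlyIndexer:
    """Hypothetical indexer that records the sample size and nothing else."""

    def __init__(self):
        self.n_samples = None

    def fit(self, X, y=None, job=None):
        # Only the shape attribute is read; X itself is never stored.
        self.n_samples = X.shape[0]
        return self


class FakeArray:
    """Stand-in for a numpy array or memmap that exposes only `shape`."""

    def __init__(self, n_samples, n_features):
        self.shape = (n_samples, n_features)


indexer = ShapeOnlyIndexer().fit(FakeArray(10, 3))
print(indexer.n_samples)  # 10
print(vars(indexer))      # {'n_samples': 10}
```

Because the fitted object holds a single integer, it stays cheap to copy between worker processes.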
generate(X=None, as_array=False)
Front-end generator method.
Generator for training and test set indices based on the generator specification in _gen_indicies.
Parameters:
- X (array-like, optional) – If the instance has not been fitted, the training set X must be passed to the generate method, which will call fit before proceeding. If already fitted, X can be omitted.
- as_array (bool (default = False)) – whether to return train and test indices as a pair of tuple(s) or numpy arrays. If the returned tuples are singular, they can be used on an array X with standard slicing syntax (X[start:stop]). If a list of tuples is returned, slicing X properly requires first building a list or array of index numbers from the list of tuples. This can be achieved either by setting as_array to True, or by running

    for train_tup, test_tup in indexer.generate():
        train_idx = np.hstack([np.arange(t0, t1) for t0, t1 in train_tup])

  when slicing is required.
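As a library-independent sketch of what expanding a list of (start, stop) tuples means, the following uses plain Python lists instead of numpy. The helper `expand` and the index ranges are hypothetical and chosen only for illustration.

```python
def expand(index_tuples):
    """Expand a list of (start, stop) tuples into a flat list of indices."""
    return [i for start, stop in index_tuples for i in range(start, stop)]


X = list(range(100, 110))
train_tup = [(0, 3), (6, 10)]  # hypothetical train ranges around the test fold
test_tup = (3, 6)              # a single tuple can be used for slicing directly

train_idx = expand(train_tup)
x_train = [X[i] for i in train_idx]
x_test = X[test_tup[0]:test_tup[1]]

print(train_idx)  # [0, 1, 2, 6, 7, 8, 9]
print(x_test)     # [103, 104, 105]
```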
partition(X=None, as_array=False)
Partition generator method.
Default behavior is to yield None for fitting on full data. Overridden in SubsetIndex and ClusteredSubsetIndex to produce partition indexes.
set_params(**params)
Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
Returns: self
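The nested parameter naming convention used by set_params can be illustrated with a small sketch. The helper `split_param` is hypothetical and only mimics the naming rule; it is not the library's implementation, and the parameter names passed to it are examples.

```python
def split_param(name):
    """Split '<component>__<parameter>' at the first double underscore."""
    component, delim, parameter = name.partition("__")
    if not delim:
        # No double underscore: the name refers to the estimator itself.
        return None, component
    return component, parameter


print(split_param("folds"))
# (None, 'folds')
print(split_param("partition_estimator__n_clusters"))
# ('partition_estimator', 'n_clusters')
```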
FoldIndex
class mlens.index.FoldIndex(folds=2, X=None, raise_on_exception=True)
Bases: mlens.index.base.BaseIndex
Indexer that generates the full size of X.
K-Fold iterator that generates fold index tuples.
FoldIndex creates a generator that returns tuples of start and stop positions to be used for numpy array slicing (X[start:stop]). Note that slicing works well for the test set, but for the training set it is recommended to concatenate the index for training data that comes before the current test set with the index for the training data that comes after. This can easily be achieved with:

    for train_tup, test_tup in self.generate():
        train_slice = numpy.hstack([numpy.arange(t0, t1) for t0, t1 in train_tup])
        xtrain, xtest = X[train_slice], X[test_tup[0]:test_tup[1]]
Warning
Simple slicing (i.e. X[start:stop]) generally does not work for the train set, which often requires concatenating the train index range below the current test set with the train index range above it. To build a training index, use hstack([np.arange(t0, t1) for t0, t1 in train_index_tuples]).
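The layout of train and test ranges described above can be sketched without the library. The functions `fold_sizes` and `kfold_tuples` are hypothetical; the assumption that the first folds absorb the remainder is taken from the documented output for 10 samples and 4 folds, not from the ML-Ensemble source.

```python
def fold_sizes(n, folds):
    """Sizes of contiguous folds; the first n % folds folds get one extra."""
    base, extra = divmod(n, folds)
    return [base + 1 if i < extra else base for i in range(folds)]


def kfold_tuples(n, folds):
    """Yield (train_tuples, test_tuple) made of (start, stop) ranges."""
    start = 0
    for size in fold_sizes(n, folds):
        stop = start + size
        # Keep only the non-empty train ranges below and above the test fold.
        train = [(a, b) for a, b in ((0, start), (stop, n)) if a != b]
        yield train, (start, stop)
        start = stop


for train_tup, test_tup in kfold_tuples(10, 4):
    train_idx = [i for t0, t1 in train_tup for i in range(t0, t1)]
    test_idx = list(range(*test_tup))
    print(train_idx, test_idx)
```

Only the first and last folds produce a single train range; every fold in between needs two ranges, which is why plain slicing is not enough for the train set.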
See also
Parameters:
- folds (int (default = 2)) – Number of splits to create in each partition. folds cannot be 1 if n_partition > 1. Note that if folds = 1, both the train and test set will index the full data.
- X (array-like of shape [n_samples,], optional) – the training set to partition. The training label array is also accepted, as only the first dimension is used. If X is not passed at instantiation, the fit method must be called before generate, or X must be passed as an argument of generate.
- raise_on_exception (bool (default = True)) – whether to raise an error on suspicious slices. If False, a warning is issued instead.
Examples
Creating arrays of folds and checking overlap
>>> import numpy as np
>>> from mlens.index import FoldIndex
>>> X = np.arange(10)
>>> print("Data set: %r" % X)
>>> print()
>>>
>>> idx = FoldIndex(4, X)
>>>
>>> for train, test in idx.generate(as_array=True):
...     print('TRAIN IDX: %32r | TEST IDX: %16r' % (train, test))
>>>
>>> print()
>>>
>>> for train, test in idx.generate(as_array=True):
...     print('TRAIN SET: %32r | TEST SET: %16r' % (X[train], X[test]))
>>>
>>> for train_idx, test_idx in idx.generate(as_array=True):
...     assert not any([i in X[test_idx] for i in X[train_idx]])
>>>
>>> print()
>>>
>>> print("No overlap between train set and test set.")
Data set: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
TRAIN IDX: array([3, 4, 5, 6, 7, 8, 9]) | TEST IDX: array([0, 1, 2])
TRAIN IDX: array([0, 1, 2, 6, 7, 8, 9]) | TEST IDX: array([3, 4, 5])
TRAIN IDX: array([0, 1, 2, 3, 4, 5, 8, 9]) | TEST IDX: array([6, 7])
TRAIN IDX: array([0, 1, 2, 3, 4, 5, 6, 7]) | TEST IDX: array([8, 9])
TRAIN SET: array([3, 4, 5, 6, 7, 8, 9]) | TEST SET: array([0, 1, 2])
TRAIN SET: array([0, 1, 2, 6, 7, 8, 9]) | TEST SET: array([3, 4, 5])
TRAIN SET: array([0, 1, 2, 3, 4, 5, 8, 9]) | TEST SET: array([6, 7])
TRAIN SET: array([0, 1, 2, 3, 4, 5, 6, 7]) | TEST SET: array([8, 9])
No overlap between train set and test set.
Passing folds = 1 without raising an exception:
>>> import numpy as np
>>> from mlens.index import FoldIndex
>>> X = np.arange(3)
>>> print("Data set: %r" % X)
>>> print()
>>>
>>> idx = FoldIndex(1, X, raise_on_exception=False)
>>>
>>> for train, test in idx.generate(as_array=True):
...     print('TRAIN IDX: %10r | TEST IDX: %10r' % (train, test))
/../mlens/base/indexer.py:167: UserWarning: 'folds' is 1, will return full index as both training set and test set.
  warnings.warn("'folds' is 1, will return full index as "
Data set: array([0, 1, 2])
TRAIN IDX: array([0, 1, 2]) | TEST IDX: array([0, 1, 2])
BlendIndex
class mlens.index.BlendIndex(test_size=0.5, train_size=None, X=None, raise_on_exception=True)
Bases: mlens.index.base.BaseIndex
Indexer that generates two non-overlapping subsets of X.
Iterator that generates one training fold and one test fold that are non-overlapping and that may or may not partition all of X depending on the user’s specification.
BlendIndex creates a singleton generator (it has one iteration) that yields two tuples of (start, stop) integers that can be used for numpy array slicing (i.e. X[start:stop]). If a full array index is desired, this can easily be achieved with:

    for train_tup, test_tup in self.generate():
        train_slice = numpy.hstack([numpy.arange(t0, t1) for t0, t1 in train_tup])
        test_slice = numpy.hstack([numpy.arange(t0, t1) for t0, t1 in test_tup])
Parameters:
- test_size (int or float (default = 0.5)) – Size of the test set. If float, assumed to be a proportion of the full data set.
- train_size (int or float, optional) – Size of the train set. If not specified (i.e. train_size = None), train_size is equal to n_samples - test_size. If float, assumed to be a proportion of the full data set. If train_size + test_size amount to less than the observations in the full data set, a subset of the specified size will be used.
- X (array-like of shape [n_samples,], optional) – the training set to partition. The training label array is also accepted, as only the first dimension is used. If X is not passed at instantiation, the fit method must be called before generate, or X must be passed as an argument of generate.
- raise_on_exception (bool (default = True)) – whether to raise an error on suspicious slices. If False, a warning is issued instead.
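How test_size and train_size translate into index ranges can be sketched as follows. The function `blend_tuples` is hypothetical; placing the train range first and rounding float proportions down are assumptions inferred from the documented examples, not taken from the library source.

```python
def blend_tuples(n, test_size=0.5, train_size=None):
    """Return ((train_start, train_stop), (test_start, test_stop))."""

    def as_count(size):
        # Floats are read as a proportion of n and rounded down (assumption).
        return int(size * n) if isinstance(size, float) else size

    n_test = as_count(test_size)
    n_train = n - n_test if train_size is None else as_count(train_size)
    if n_train + n_test > n:
        raise ValueError("train_size + test_size exceeds the sample size")
    # The train range comes first and the test range follows it directly.
    return (0, n_train), (n_train, n_train + n_test)


print(blend_tuples(8, 3))           # ((0, 5), (5, 8))
print(blend_tuples(8, 3, 4))        # ((0, 4), (4, 7))
print(blend_tuples(8, 0.25, 0.45))  # ((0, 3), (3, 5))
```

In the second and third calls the two ranges cover fewer than 8 samples, so the trailing observations are left unused.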
See also
Examples
Selecting an absolute test size, with train size as the remainder
>>> import numpy as np
>>> from mlens.index import BlendIndex
>>> X = np.arange(8)
>>> idx = BlendIndex(3)
>>> print('Test size: 3')
>>> for tri, tei in idx.generate(X):
...     print('TEST (idx | array): (%i, %i) | %r ' % (tei[0], tei[1],
...                                                   X[tei[0]:tei[1]]))
...     print('TRAIN (idx | array): (%i, %i) | %r ' % (tri[0], tri[1],
...                                                    X[tri[0]:tri[1]]))
Test size: 3
TEST (idx | array): (5, 8) | array([5, 6, 7])
TRAIN (idx | array): (0, 5) | array([0, 1, 2, 3, 4])
Selecting a test and train size less than the total
>>> import numpy as np
>>> from mlens.index import BlendIndex
>>> X = np.arange(8)
>>> idx = BlendIndex(3, 4, X)
>>> print('Test size: 3')
>>> print('Train size: 4')
>>> for tri, tei in idx.generate(X):
...     print('TEST (idx | array): (%i, %i) | %r ' % (tei[0], tei[1],
...                                                   X[tei[0]:tei[1]]))
...     print('TRAIN (idx | array): (%i, %i) | %r ' % (tri[0], tri[1],
...                                                    X[tri[0]:tri[1]]))
Test size: 3
Train size: 4
TEST (idx | array): (4, 7) | array([4, 5, 6])
TRAIN (idx | array): (0, 4) | array([0, 1, 2, 3])
Selecting a percentage of observations as test and train set
>>> import numpy as np
>>> from mlens.index import BlendIndex
>>> X = np.arange(8)
>>> idx = BlendIndex(0.25, 0.45, X)
>>> print('Test size: 25% * 8 = 2')
>>> print('Train size: 45% * 8 < 4 -> 3')
>>> for tri, tei in idx.generate(X):
...     print('TEST (idx | array): (%i, %i) | %r ' % (tei[0], tei[1],
...                                                   X[tei[0]:tei[1]]))
...     print('TRAIN (idx | array): (%i, %i) | %r ' % (tri[0], tri[1],
...                                                    X[tri[0]:tri[1]]))
Test size: 25% * 8 = 2
Train size: 45% * 8 < 4 -> 3
TEST (idx | array): (3, 5) | array([3, 4])
TRAIN (idx | array): (0, 3) | array([0, 1, 2])
Rebasing the test set to be 0-indexed
>>> import numpy as np
>>> from mlens.index import BlendIndex
>>> X = np.arange(8)
>>> idx = BlendIndex(3, rebase=True)
>>> print('Test size: 3')
>>> for tri, tei in idx.generate(X):
...     print('TEST tuple: (%i, %i) | array: %r' % (tei[0], tei[1],
...                                                 np.arange(tei[0],
...                                                           tei[1])))
Test size: 3
TEST tuple: (0, 3) | array: array([0, 1, 2])
SubsetIndex
class mlens.index.SubsetIndex(partitions=2, folds=2, X=None, raise_on_exception=True)
Bases: mlens.index.base.BaseIndex
Subsample index generator.
Generates the cross-validation folds used to create J partitions of the data and v folds on each partition, as per [1]:
Split X into J partitions.
For each partition:
- For each fold v, create train index of all idx not in v
- Concatenate all the fold v indices into a test index for fold v that spans all partitions
Setting J = 1 is equivalent to FoldIndex, which returns standard K-Fold train and test set indices.
See also
FoldIndex, BlendIndex, Subsemble
References
[1] Sapp, S., van der Laan, M. J., & Canny, J. (2014). Subsemble: an ensemble method for combining subset-specific algorithm fits. Journal of Applied Statistics, 41(6), 1247-1259. http://doi.org/10.1080/02664763.2013.864263
Parameters:
- partitions (int, list (default = 2)) – Number of partitions to split data in. If partitions=1, SubsetIndex reduces to standard K-Fold.
- folds (int (default = 2)) – Number of splits to create in each partition. folds cannot be 1 if n_partition > 1. Note that if folds = 1, both the train and test set will index the full data.
- X (array-like of shape [n_samples,], optional) – the training set to partition. The training label array is also accepted, as only the first dimension is used. If X is not passed at instantiation, the fit method must be called before generate, or X must be passed as an argument of generate.
- raise_on_exception (bool (default = True)) – whether to raise an error on suspicious slices. If False, a warning is issued instead.
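The partition-then-fold scheme described above can be sketched in plain Python. The functions below are hypothetical and written only to mirror the description; the contiguous splitting rule (earlier blocks absorb the remainder) is an assumption inferred from the documented output for 10 samples, 3 partitions and 2 folds.

```python
def split_sizes(n, parts):
    """Block sizes for n items; the first n % parts blocks get one extra."""
    base, extra = divmod(n, parts)
    return [base + 1 if i < extra else base for i in range(parts)]


def contiguous_blocks(indices, parts):
    """Split a list of indices into `parts` contiguous blocks."""
    blocks, start = [], 0
    for size in split_sizes(len(indices), parts):
        blocks.append(indices[start:start + size])
        start += size
    return blocks


def subset_folds(n, partitions, folds):
    """Yield (partition, fold, train, test), both numbered from 1."""
    parts = contiguous_blocks(list(range(n)), partitions)
    # fold_blocks[j][f] holds the indices of fold f inside partition j.
    fold_blocks = [contiguous_blocks(p, folds) for p in parts]
    for j, part in enumerate(parts):
        for f in range(folds):
            held_out = set(fold_blocks[j][f])
            train = [i for i in part if i not in held_out]
            # The test index for fold f spans all partitions.
            test = [i for blocks in fold_blocks for i in blocks[f]]
            yield j + 1, f + 1, train, test


for j, f, train, test in subset_folds(10, partitions=3, folds=2):
    print("J = %i | f = %i | train: %r | test: %r" % (j, f, train, test))
```

Note that the train index is local to one partition, while the test index for a given fold is shared by every partition.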
Examples
>>> import numpy as np
>>> from mlens.index import SubsetIndex
>>> X = np.arange(10)
>>> idx = SubsetIndex(3, X=X)
>>>
>>> print('Expected partitions of X:')
>>> print('J = 1: {!r}'.format(X[0:4]))
>>> print('J = 2: {!r}'.format(X[4:7]))
>>> print('J = 3: {!r}'.format(X[7:10]))
>>> print('SubsetIndexer partitions:')
>>> for i, part in enumerate(idx.partition(as_array=True)):
...     print('J = {}: {!r}'.format(i + 1, part))
>>> print('SubsetIndexer folds on partitions:')
>>> for i, (tri, tei) in enumerate(idx.generate()):
...     fold = i % 2 + 1
...     part = i // 2 + 1
...     train = np.hstack([np.arange(t0, t1) for t0, t1 in tri])
...     test = np.hstack([np.arange(t0, t1) for t0, t1 in tei])
...     print("J = %i | f = %i | "
...           "train: %15r | test: %r" % (part, fold, train, test))
Expected partitions of X:
J = 1: array([0, 1, 2, 3])
J = 2: array([4, 5, 6])
J = 3: array([7, 8, 9])
SubsetIndexer partitions:
J = 1: array([0, 1, 2, 3])
J = 2: array([4, 5, 6])
J = 3: array([7, 8, 9])
SubsetIndexer folds on partitions:
J = 1 | f = 1 | train: array([2, 3]) | test: array([0, 1, 4, 5, 7, 8])
J = 1 | f = 2 | train: array([0, 1]) | test: array([2, 3, 6, 9])
J = 2 | f = 1 | train: array([6]) | test: array([0, 1, 4, 5, 7, 8])
J = 2 | f = 2 | train: array([4, 5]) | test: array([2, 3, 6, 9])
J = 3 | f = 1 | train: array([9]) | test: array([0, 1, 4, 5, 7, 8])
J = 3 | f = 2 | train: array([7, 8]) | test: array([2, 3, 6, 9])
fit(X, y=None, job=None)
Method for storing array data.
Returns: indexer with stored sample size data.
Return type: instance
partition(X=None, as_array=False)
Get partition indices for training full subset estimators.
Returns the index range for each partition of X.
Parameters:
- X (array-like of shape [n_samples,], optional) – the training set to partition. The training label array is also accepted, as only the first dimension is used. If X is not passed at instantiation, the fit method must be called before generate, or X must be passed as an argument of generate.
- as_array (bool (default = False)) – whether to return partition as an index array. Otherwise tuples of (start, stop) indices are returned.
ClusteredSubsetIndex
class mlens.index.ClusteredSubsetIndex(partition_estimator, partitions=2, folds=2, X=None, y=None, fit_estimator=True, attr='predict', partition_on='X', raise_on_exception=True)
Bases: mlens.index.base.BaseIndex
Clustered Subsample index generator.
Generates the cross-validation folds used to create J partitions of the data and v folds on each partition, as per [2]:
Split X into J partitions.
For each partition:
- For each fold v, create train index of all idx not in v
- Concatenate all the fold v indices into a test index for fold v that spans all partitions
Setting J = 1 is equivalent to FoldIndex, which returns standard K-Fold train and test set indices.
ClusteredSubsetIndex uses a user-provided estimator to partition the data, in contrast to the SubsetIndex generator, which partitions data randomly into equal sizes.
See also
References
[2] Sapp, S., van der Laan, M. J., & Canny, J. (2014). Subsemble: an ensemble method for combining subset-specific algorithm fits. Journal of Applied Statistics, 41(6), 1247-1259. http://doi.org/10.1080/02664763.2013.864263
Parameters:
- partition_estimator (instance) – Estimator to use for clustering.
- partitions (int) – Number of partitions the estimator will create.
- folds (int (default = 2)) – Number of folds to create in each partition. folds cannot be 1 if n_partition > 1. Note that if folds = 1, both the train and test set will index the full data.
- fit_estimator (bool (default = True)) – whether to fit the estimator separately before generating labels.
- attr (str (default = 'predict')) – the attribute to use for generating cluster membership labels.
- X (array-like of shape [n_samples,], optional) – the training set to partition. The training label array is also accepted, as only the first dimension is used. If X is not passed at instantiation, the fit method must be called before generate, or X must be passed as an argument of generate.
- raise_on_exception (bool (default = True)) – whether to raise an error on suspicious slices. If False, a warning is issued instead.
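Partitioning by cluster membership can be sketched as grouping sample indices by label. The helper `partitions_from_labels` is hypothetical and library-independent; the label list is illustrative and stands in for whatever the partition estimator would predict.

```python
def partitions_from_labels(labels):
    """Group sample indices by cluster label, ordered by label value."""
    groups = {}
    for idx, label in enumerate(labels):
        groups.setdefault(label, []).append(idx)
    return [groups[label] for label in sorted(groups)]


# Illustrative cluster labels, one per sample.
labels = [0, 2, 2, 1, 2, 0, 0, 1, 1, 2, 0, 1]

for j, part in enumerate(partitions_from_labels(labels), start=1):
    print("partition ({}) index: {}".format(j, part))
```

Unlike contiguous partitions, these index sets are generally not expressible as a single (start, stop) range.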
Examples
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> from mlens.index import ClusteredSubsetIndex
>>>
>>> km = KMeans(3, random_state=0)
>>> X = np.arange(12).reshape(-1, 1); np.random.shuffle(X)
>>> print("Data: {}".format(X.ravel()))
>>>
>>> s = ClusteredSubsetIndex(km)
>>> s.fit(X)
>>>
>>> P = s.partition_estimator.predict(X)
>>> print("cluster labels: {}".format(P))
>>>
>>> for j, i in enumerate(s.partition(as_array=True)):
...     print("partition ({}) index: {}, cluster labels: {}".format(j + 1, i, P[i]))
>>>
>>> for i in s.generate(as_array=True):
...     print("train fold index: {}, cluster labels: {}".format(i[0], P[i[0]]))
Data: [ 8 7 5 2 4 10 11 1 3 6 9 0]
cluster labels: [0 2 2 1 2 0 0 1 1 2 0 1]
partition (1) index: [ 0 5 6 10], cluster labels: [0 0 0 0]
partition (2) index: [ 3 7 8 11], cluster labels: [1 1 1 1]
partition (3) index: [1 2 4 9], cluster labels: [2 2 2 2]
train fold index: [0 3 5], cluster labels: [0 0 0]
train fold index: [ 6 10], cluster labels: [0 0]
train fold index: [2 7], cluster labels: [1 1]
train fold index: [ 9 11], cluster labels: [1 1]
train fold index: [1 4], cluster labels: [2 2]
train fold index: [8], cluster labels: [2]
fit(X, y=None, job='fit')
Method for storing array data.
Parameters:
- X (array-like of shape [n_samples, n_features]) – input array.
- y (array-like of shape [n_samples, ]) – labels.
- job (str, ['fit', 'predict'] (default='fit')) – type of estimation job. If ‘fit’, the indexer will be fitted, which involves fitting the estimator. Otherwise, the indexer will not be fitted (since it is not used for prediction).
Returns: indexer with stored sample size data.
Return type: instance
partition(X=None, y=None, as_array=False)
Get partition indices for training full subset estimators.
Returns the index range for each partition of X.
Parameters:
- X (array-like of shape [n_samples, n_features], optional) – the set to partition. The training label array is also accepted, as only the first dimension is used. If X is not passed at instantiation, the fit method must be called before generate, or X must be passed as an argument of generate.
- y (array-like of shape [n_samples,], optional) – the labels of the set to partition.
- as_array (bool (default = False)) – whether to return partition as an index array. Otherwise tuples of (start, stop) indices are returned.
FullIndex
class mlens.index.FullIndex(X=None)
Bases: mlens.index.base.BaseIndex
Vacuous indexer to be used with final layers.
FullIndex is a compatibility class to be used with meta layers. It stores the sample size to be predicted for use with the ParallelProcessing job manager, and yields a None, None index when generate is called.
Utilities
prune_train
mlens.index.prune_train(start_below, stop_below, start_above, stop_above)
Checks if indices above or below are empty and removes them.
A utility function for checking if the train indices below a given test set range are (0, 0), or if the indices above the test set range are (n, n). In either case, the range yields an empty array and can therefore safely be removed to create a single training set index range.
Parameters:
- start_below (int) – index number starting below the test set. Should always be the same for all test sets.
- stop_below (int) – the index number at which the test set starts.
- start_above (int) – the index number at which the test set ends.
- stop_above (int) – the end of the data set (n). Should always be the same for all test sets.
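The pruning rule described above can be sketched as follows. `prune_train_sketch` is a hypothetical stand-in and not the library function; in particular, returning a tuple of (start, stop) pairs is an assumption made for illustration.

```python
def prune_train_sketch(start_below, stop_below, start_above, stop_above):
    """Drop empty (start, stop) ranges on either side of a test fold."""
    ranges = ((start_below, stop_below), (start_above, stop_above))
    # A range whose start equals its stop selects nothing and is removed.
    return tuple(r for r in ranges if r[0] != r[1])


print(prune_train_sketch(0, 0, 3, 10))   # ((3, 10),)
print(prune_train_sketch(0, 3, 6, 10))   # ((0, 3), (6, 10))
print(prune_train_sketch(0, 8, 10, 10))  # ((0, 8),)
```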
partition
mlens.index.partition(n, p)
Get partition sizes for a given number of samples and partitions.
This method will give an array containing the sizes of p partitions given a total sample size of n. If there is a remainder r from the split, the first r partitions will be incremented by 1.
Parameters:
- n (int) – number of samples.
- p (int) – number of partitions.
Examples
Return sample sizes of 2 partitions given a total of 4 samples
>>> from mlens.index.base import partition
>>> partition(4, 2)
array([2, 2])
Return sample sizes of 3 partitions given a total of 8 samples
>>> from mlens.index.base import partition
>>> partition(8, 3)
array([3, 3, 2])
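The size rule can be reproduced with integer division. The function `partition_sizes` below is a hypothetical, library-independent sketch that returns a plain list rather than a numpy array.

```python
def partition_sizes(n, p):
    """Sizes of p partitions of n samples; the first n % p get one extra."""
    base, remainder = divmod(n, p)
    return [base + 1 if i < remainder else base for i in range(p)]


print(partition_sizes(4, 2))  # [2, 2]
print(partition_sizes(8, 3))  # [3, 3, 2]
```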