ML-ENSEMBLE

author:Sebastian Flennerhag
copyright:2017
license:MIT

Computational graph module for memory-neutral parallel processing of deep general-purpose ensembles.

Implements backend graph managers, base classes for interacting with graph managers, and job managers for preprocessing pipelines and estimators, as well as handles for multiple instances and wrappers for standard parallel job calls.

mlens.parallel

Graph Nodes

Layer

class mlens.parallel.layer.Layer(name=None, propagate_features=None, shuffle=False, random_state=None, verbose=False, stack=None, **kwargs)[source]

Bases: mlens.parallel.base.OutputMixin, mlens.parallel.base.IndexMixin, mlens.parallel.base.BaseStacker

Layer of preprocessing pipes and estimators.

Layer is an internal class that holds a layer and its associated data, including an estimation procedure. It behaves as an estimator from a Scikit-learn API point of view.

Parameters:
  • propagate_features (list, range, optional) – Features to propagate from the input array to the output array. Carries input features to the output of the layer, useful for propagating original data through several stacked layers. Propagated features are stored in the left-most columns.
  • verbose (int or bool (default = False)) –

    level of verbosity.

    • verbose = 0 silent (same as verbose = False)
    • verbose = 1 messages at start and finish (same as verbose = True)
    • verbose = 2 messages for preprocessing and estimators
    • verbose = 3 messages for completed job

    If verbose >= 10 prints to sys.stderr, else sys.stdout.

  • shuffle (bool (default = False)) – Whether to shuffle data before fitting layer.
  • random_state (obj, int, optional) – Random seed number to use for shuffling inputs
  • **kwargs (optional) – optional arguments to BaseParallel.
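
The column layout described for propagate_features can be illustrated without mlens. The following is a minimal sketch using plain Python lists; the helper name propagate is hypothetical and not part of the library, and the sketch only shows where the carried columns end up.

```python
# Hypothetical helper, not part of mlens: illustrates the column layout only.
def propagate(X, P, propagate_features):
    """Place the selected input columns to the left of the predictions."""
    out = []
    for x_row, p_row in zip(X, P):
        carried = [x_row[i] for i in propagate_features]
        out.append(carried + list(p_row))
    return out


X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # input array with three features
P = [[0.1, 0.9], [0.8, 0.2]]            # predictions from two learners
out = propagate(X, P, propagate_features=[0, 2])
print(out)
```

Each output row starts with the propagated input features (columns 0 and 2 here), followed by the layer's predictions.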
collect(path=None)[source]

Collect cached estimators

data

Cross validated scores

indexers

Check indexer

learners

Generator for learners in layer

raw_data

Cross validated scores

set_output_columns(X, y, job, n_left_concats=0)[source]

Compatibility method for setting learner output columns

transformers

Generator for transformers in layer

Learner

class mlens.parallel.learner.Learner(estimator, indexer=None, name=None, preprocess=None, attr=None, scorer=None, proba=False, **kwargs)[source]

Bases: mlens.parallel.base.ProbaMixin, mlens.parallel.learner.BaseNode

Wrapper for base learners.

Parameters:
  • estimator (obj) – estimator to construct learner from
  • preprocess (str, obj) – preprocess transformer. Pass either the string cache reference or the transformer instance. If the latter, the preprocess will refer to the transformer name.
  • name (str) – name of learner. If preprocess is not None, the name will be prepended to preprocess__name.
  • attr (str (default='predict')) – predict attribute, typically one of ‘predict’ and ‘predict_proba’
  • scorer (func) – function to use for scoring predictions during cross-validated fitting.
  • output_columns (dict, optional) – mapping of prediction feature columns from learner to columns in output array. Normally, this map is {0: x}, but if the indexer creates partitions, each partition needs to be mapped: {0: x, 1: x + 1}. Note that if output_columns are not given at initialization, the set_output_columns method must be called before running estimations.
  • verbose (bool, int (default = False)) – whether to report completed fits.
  • **kwargs (bool (default=True)) – Optional ParallelProcessing arguments. See BaseParallel.
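
To make the output_columns mapping concrete, here is a small sketch of how such maps could be built for several learners. The function assign_output_columns is hypothetical, and it assumes each partition produces exactly one prediction column (which would not hold for predict_proba output).

```python
# Hypothetical helper, not part of mlens.
# Assumption: one prediction column per partition.
def assign_output_columns(learner_names, n_partitions, start=0):
    """Map each learner's partitions to consecutive output columns."""
    mapping = {}
    col = start
    for name in learner_names:
        mapping[name] = {part: col + part for part in range(n_partitions)}
        col += n_partitions
    return mapping


columns = assign_output_columns(["ols", "svm"], n_partitions=2)
print(columns)
```

With two partitions, the first learner gets a map of the form {0: x, 1: x + 1} and the next learner starts at x + 2.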
scorer

Copy of scorer

Transformer

class mlens.parallel.learner.Transformer(estimator, indexer=None, name=None, **kwargs)[source]

Bases: mlens.parallel.learner.BaseNode

Preprocessing handler.

Wrapper for transformation pipeline.

Parameters:
  • indexer (obj, None) – indexer to use for generating fits. Set to None to fit only on all data.
  • estimator (obj) – transformation pipeline to construct learner from
  • name (str) – name of learner. If preprocess is not None, the name will be prepended to preprocess__name.
  • output_columns (dict, optional) – If transformer is to be used to output data, need to set output_columns. Normally, this map is {0: x}, but if the indexer creates partitions, each partition needs to be mapped: {0: x, 1: x + 1}.
  • verbose (bool, int (default = False)) – whether to report completed fits.
  • raise_on_exception (bool (default=True)) – whether to warn on non-fatal exceptions or raise an error.

EvalLearner

class mlens.parallel.learner.EvalLearner(estimator, preprocess, name, attr, scorer, error_score=None, verbose=False, **kwargs)[source]

Bases: mlens.parallel.learner.Learner

EvalLearner is a derived class from Learner used for cross-validated scoring of an estimator.

Parameters:
  • estimator (obj) – estimator to construct learner from
  • preprocess (str) – preprocess cache reference
  • indexer (obj, None) – indexer to use for generating fits. Set to None to fit only on all data.
  • name (str) – name of learner. If preprocess is not None, the name will be prepended to preprocess__name.
  • attr (str (default='predict')) – predict attribute, typically one of ‘predict’ and ‘predict_proba’
  • scorer (func) – function to use for scoring predictions during cross-validated fitting.
  • error_score (int, float, None (default = None)) – score to set if cross-validation fails. Set to None to raise error.
  • verbose (bool, int (default = False)) – whether to report completed fits.
  • raise_on_exception (bool (default=True)) – whether to warn on non-fatal exceptions or raise an error.
gen_fit(X, y, P=None, refit=True)[source]

Generator for fitting learner on given data

EvalTransformer

class mlens.parallel.learner.EvalTransformer(estimator, indexer=None, name=None, **kwargs)[source]

Bases: mlens.parallel.learner.Transformer

Evaluator version of the Transformer.

Derived class from Transformer adapted to cross-validated grid-search. See Transformer for more details.

BaseNode

class mlens.parallel.learner.BaseNode(name, estimator, indexer=None, verbose=False, **kwargs)[source]

Bases: mlens.parallel.base.OutputMixin, mlens.parallel.base.IndexMixin, mlens.parallel.base.BaseEstimator

Base computational node inherited by job generators.

Common API for job generators. A class that inherits the base needs to set a __subtype__ in the constructor. The sub-type should be the class that runs estimations and must implement __call__, fit, transform and predict methods.

clear()[source]

Clear load

cloned_estimator

Copy of estimator

collect(path=None)[source]

Load fitted estimator from cache

Parameters:path (str, list, optional) – path to cache.
data

Dictionary with aggregated data from fitting sub-learners.

gen_fit(X, y, P=None)[source]

Routine for generating fit jobs conditional on refit

Parameters:
  • X (array-like of shape [n_samples, n_features]) – input array
  • y (array-like of shape [n_samples,]) – targets
  • P (array-like of shape [n_samples, n_prediction_features], optional) – output array to populate. Must be writeable. Only pass if predictions are desired.
gen_predict(X, P=None)[source]

Generate predicting jobs

Parameters:
  • X (array-like of shape [n_samples, n_features]) – input array
  • y (array-like of shape [n_samples,]) – targets
  • P (array-like of shape [n_samples, n_prediction_features], optional) – output array to populate. Must be writeable. Only pass if predictions are desired.
gen_transform(X, P=None)[source]

Generate cross-validated predict jobs

Parameters:
  • X (array-like of shape [n_samples, n_features]) – input array
  • y (array-like of shape [n_samples,]) – targets
  • P (array-like of shape [n_samples, n_prediction_features], optional) – output array to populate. Must be writeable. Only pass if predictions are desired.
learner

Generator for learner fitted on full data

raw_data

List of data collected from each sub-learner during fitting.

set_indexer(indexer)[source]

Set indexer and auxiliary attributes

Parameters:indexer (obj) – indexer to build instance with.
set_output_columns(X=None, y=None, job=None, n_left_concats=0)[source]

Set the output_columns attribute

sublearners

Generator for learner fitted on folds

times

Fit and predict times for the final learners

SubLearner

class mlens.parallel.learner.SubLearner(job, parent, estimator, in_index, out_index, in_array, targets, out_array, index)[source]

Bases: object

Estimation task

Wrapper around a sub_learner job.

data

fit data

fit(path=None)[source]

Fit sub-learner

predict(path=None)[source]

Predict with sublearner

transform(path=None)[source]

Predict with sublearner

SubTransformer

class mlens.parallel.learner.SubTransformer(job, parent, estimator, in_index, in_array, targets, index, out_index=None, out_array=None)[source]

Bases: object

Sub-routine for fitting a pipeline

data

fit data

fit(path=None)[source]

Fit transformers

predict()[source]

Dump transformers for prediction

transform()[source]

Dump transformers for prediction

EvalSubLearner

class mlens.parallel.learner.EvalSubLearner(job, parent, estimator, in_index, out_index, in_array, targets, index)[source]

Bases: mlens.parallel.learner.SubLearner

sub-routine for cross-validated evaluation.

data

Score data

fit(path=None)[source]

Evaluate sub-learner

IndexedEstimator

class mlens.parallel.learner.IndexedEstimator(estimator, name, index, in_index, out_index, data)[source]

Bases: object

Indexed Estimator

Lightweight wrapper around estimator dumps during fitting.

estimator

Deep copy of estimator

Cache

class mlens.parallel.learner.Cache(obj, path, verbose)[source]

Bases: object

Cache wrapper for IndexedEstimator

Handles

Group

class mlens.parallel.handles.Group(indexer=None, learners=None, transformers=None, name=None, **kwargs)[source]

Bases: mlens.parallel.base.BaseEstimator

A handle for learners and transformers that share a common indexer.

Lightweight class for pairing a set of independent learners with a set of transformers that all share the same cross-validation strategy. A Group instance is an acceptable caller to ParallelProcessing.

New in version 0.2.0.

Note

All instances will share the same indexer. If instances have a different indexer, that indexer will be replaced.

See also

To run a Group instance, see run(). To handle several groups, use the Layer class.

Parameters:
  • indexer (inst, optional) – An indexer to build learners and transformers on. If not passed, the first indexer of the learners will be enforced on all instances.
  • learners (list, inst, optional) – Learner instance(s) attached to indexer. Note that Group overrides previous indexer parameter settings.
  • transformers (list, inst, optional) – Transformer instance(s) attached to indexer. Note that Group overrides previous indexer parameter settings.
  • name (str, optional) – name of group
  • **kwargs (optional) – Optional keyword arguments to the BaseParallel backend.

Pipeline

class mlens.parallel.handles.Pipeline(pipeline, name=None, return_y=False)[source]

Bases: mlens.externals.sklearn.base.BaseEstimator

Transformer pipeline

Pipeline class for wrapping a preprocessing pipeline of transformers.

Parameters:
  • pipeline (list, instance) –

    A Transformer instance or a list of Transformer instances. Accepted input formats:

    option_1 = transformer_1
    option_2 = [transformer_1, transformer_2]
    option_3 = [("tr-1", transformer_1), ("tr-2", transformer_2)]
    option_4 = [transformer_1, ("tr-2", transformer_2)]
    
  • name (str, optional) – name of pipeline.
  • return_y (bool, default = False) – If True, both X and y will be returned in a transform() call.
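
The accepted input formats can all be reduced to a list of (name, transformer) pairs. The sketch below shows one way to do that; name_steps and the auto-generated naming scheme are assumptions for illustration and may differ from what Pipeline does internally.

```python
# Hypothetical normalizer, not part of mlens.
# Assumption: unnamed steps get a name built from class name and position.
def name_steps(pipeline):
    """Turn any accepted pipeline format into a list of (name, step) pairs."""
    if not isinstance(pipeline, list):
        pipeline = [pipeline]
    named = []
    for i, step in enumerate(pipeline):
        if isinstance(step, tuple):
            name, transformer = step
        else:
            name = "%s-%d" % (type(step).__name__.lower(), i + 1)
            transformer = step
        named.append((name, transformer))
    return named


class Scale(object):
    """Placeholder transformer."""


class Center(object):
    """Placeholder transformer."""


scale, center = Scale(), Center()
single = name_steps(scale)                    # option_1
mixed = name_steps([scale, ("tr-2", center)])  # option_4
print([name for name, _ in mixed])
```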
fit(X, y=None)[source]

Fit pipeline.

Note that the Pipeline accepts both X and y arguments, and can return both X and y, depending on the transformers. The pipeline itself does no checks on the input.

Parameters:
  • X (array-like of shape [n_samples, n_features]) – Input data
  • y (array-like of shape [n_samples, ]) – Targets
Returns:

self – Fitted pipeline

Return type:

instance

fit_transform(X, y=None)[source]

Fit and transform pipeline.

Note that the Pipeline accepts both X and y arguments, and can return both X and y, depending on the transformers. The pipeline itself does no checks on the input.

Parameters:
  • X (array-like of shape [n_samples, n_features]) – Input data
  • y (array-like of shape [n_samples, ]) – Targets
Returns:

  • X_processed (array-like of shape [n_samples, n_preprocessed_features]) – Preprocessed input data
  • y (array-like of shape [n_samples, ], optional) – Preprocessed targets

transform(X, y=None)[source]

Transform pipeline.

Note that the Pipeline accepts both X and y arguments, and can return both X and y, depending on the transformers. The pipeline itself does no checks on the input.

Parameters:
  • X (array-like of shape [n_samples, n_features]) – Input data
  • y (array-like of shape [n_samples, ]) – Targets
Returns:

  • X_processed (array-like of shape [n_samples, n_preprocessed_features]) – Preprocessed input data
  • y (array-like of shape [n_samples, ], optional) – Original or preprocessed targets, depending on the transformers.
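
The interplay between transformers that alter y and the return_y flag can be sketched as follows. This is a simplified stand-in, not the Pipeline implementation: transform_all and both toy transformers are hypothetical, and the rule "a tuple return means (X, y)" is an assumption made for the example.

```python
class DropFirstRow(object):
    """Toy transformer (hypothetical) that changes both X and y."""

    def transform(self, X, y=None):
        return X[1:], (y[1:] if y is not None else None)


class Double(object):
    """Toy transformer (hypothetical) that only changes X."""

    def transform(self, X, y=None):
        return [[2 * v for v in row] for row in X]


def transform_all(steps, X, y=None, return_y=False):
    """Apply each step in turn, tracking y when a step returns (X, y)."""
    for step in steps:
        out = step.transform(X, y)
        if isinstance(out, tuple):
            X, y = out
        else:
            X = out
    return (X, y) if return_y else X


X = [[1, 2], [3, 4], [5, 6]]
y = [0, 1, 1]
Xt, yt = transform_all([DropFirstRow(), Double()], X, y, return_y=True)
X_only = transform_all([Double()], X, y)
print(Xt, yt)
```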

make_group

mlens.parallel.handles.make_group(indexer, estimators, preprocessing, learner_kwargs=None, transformer_kwargs=None, name=None)[source]

Create a Group from a set of learners and transformers

Utility function for mapping a set of estimators and preprocessing pipelines to a Group of Learner and Transformer instances.

Parameters:
  • indexer (instance or None, default = None) – Indexer instance to use. See index for details.
  • estimators (dict of lists or list of estimators.) –

    If preprocessing is None or list, estimators should be a list. The list can either contain estimator instances, named tuples of estimator instances, or a combination of both.

    option_1 = [estimator_1, estimator_2]
    option_2 = [("est-1", estimator_1), ("est-2", estimator_2)]
    option_3 = [estimator_1, ("est-2", estimator_2)]
    

    If different preprocessing pipelines are desired, a dictionary that maps estimators to preprocessing pipelines must be passed. The names of the estimator dictionary must correspond to the names of the preprocessing dictionary.

    preprocessing_cases = {"case-1": [trans_1, trans_2],
                           "case-2": [alt_trans_1, alt_trans_2]}
    
    estimators = {"case-1": [est_a, est_b],
                  "case-2": [est_c, est_d]}
    

    The lists for each dictionary entry can be any of option_1, option_2 and option_3.

  • preprocessing (dict of lists or list, optional, default = None) –

    preprocessing pipelines for given layer. If the same preprocessing applies to all estimators, preprocessing should be a list of transformer instances. The list can contain the instances directly, named tuples of transformers, or a combination of both.

    option_1 = [transformer_1, transformer_2]
    option_2 = [("trans-1", transformer_1),
                ("trans-2", transformer_2)]
    option_3 = [transformer_1, ("trans-2", transformer_2)]
    

    If different preprocessing pipelines are desired, a dictionary that maps preprocessing pipelines must be passed. The names of the preprocessing dictionary must correspond to the names of the estimator dictionary.

    preprocessing_cases = {"case-1": [trans_1, trans_2],
                           "case-2": [alt_trans_1, alt_trans_2]}
    
    estimators = {"case-1": [est_a, est_b],
                  "case-2": [est_c, est_d]}
    

    The lists for each dictionary entry can be any of option_1, option_2 and option_3.

  • transformer_kwargs (dict, optional) – Keyword arguments to pass to the Transformer instances.
  • learner_kwargs (dict, optional) – Keyword arguments to pass to the Learner instances.
  • name (str, optional) – Name of group. Should be unique.
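
The requirement that estimator and preprocessing dictionaries share keys can be checked with a few lines of plain Python. The sketch below is an illustration only: pair_cases is a hypothetical helper, and strings stand in for estimator and transformer instances.

```python
# Hypothetical helper, not part of mlens; strings stand in for instances.
def pair_cases(estimators, preprocessing):
    """Pair every estimator with the pipeline of its preprocessing case."""
    missing = set(estimators) - set(preprocessing)
    if missing:
        raise ValueError("No preprocessing case for: %s" % sorted(missing))
    return [(case, preprocessing[case], est)
            for case in sorted(estimators)
            for est in estimators[case]]


preprocessing_cases = {"case-1": ["trans_1", "trans_2"],
                       "case-2": ["alt_trans_1"]}
estimators = {"case-1": ["est_a", "est_b"],
              "case-2": ["est_c"]}
pairs = pair_cases(estimators, preprocessing_cases)
for case, pipeline, est in pairs:
    print(case, pipeline, est)
```

An estimator case without a matching preprocessing case raises a ValueError in this sketch, which mirrors the rule that the names of the two dictionaries must correspond.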

Wrappers

EstimatorMixin

class mlens.parallel.wrapper.EstimatorMixin[source]

Bases: object

Estimator mixin

Mixin class to build an estimator from a mlens.parallel backend class. The backend class should be set as the _backend attribute of the estimator during a fit call via a _build method. E.g:

class Foo(EstimatorMixin, Learner):

    def __init__(self, ...):

        self._backend = None

    def _build(self):
        self._backend = Learner(...)

It is recommended to combine EstimatorMixin with parallel.base.ParamMixin.
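
The _backend and _build pattern can be tried out in isolation. The following self-contained sketch uses stand-in classes (BackendMixin and MeanBackend are hypothetical and far simpler than EstimatorMixin and Learner); it only demonstrates the delegation idea.

```python
class MeanBackend(object):
    """Stand-in backend (hypothetical): predicts the mean of the targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / float(len(y))
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class BackendMixin(object):
    """Simplified mixin: build the backend during fit, then delegate to it."""

    def fit(self, X, y):
        self._build()
        self._backend.fit(X, y)
        return self

    def predict(self, X):
        if self._backend is None:
            raise RuntimeError("Call fit before predict.")
        return self._backend.predict(X)


class Foo(BackendMixin):

    def __init__(self):
        self._backend = None

    def _build(self):
        self._backend = MeanBackend()


foo = Foo().fit([[0], [1], [2]], [1.0, 2.0, 3.0])
preds = foo.predict([[5], [6]])
print(preds)
```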

fit(X, y, proba=False, refit=True)[source]

Fit

Fit estimator.

Parameters:
  • X (array of size [n_samples, n_features]) – input data
  • y (array of size [n_samples,]) – targets
  • proba (bool, optional) – whether to fit for later predict_proba calls. Will register number of classes to expect in later predict and transform calls.
  • refit (bool (default = True)) – Whether to refit already fitted sub-learners.
Returns:

self – fitted estimator.

Return type:

instance

fit_transform(X, y, proba=False, refit=True)[source]

Fit

Fit estimator and return cross-validated predictions.

Parameters:
  • X (array of size [n_samples, n_features]) – input data
  • y (array of size [n_samples,]) – targets
  • proba (bool, optional) – whether to fit for later predict_proba calls. Will register number of classes to expect in later predict and transform calls.
  • refit (bool (default = True)) – Whether to refit already fitted sub-learners.
Returns:

P – prediction generated by cross-validation.

Return type:

array of size [n_samples, n_prediction_features]

predict(X, proba=False)[source]

Predict

Predict using full-fold estimator (fitted on all data).

Parameters:
  • X (array of size [n_samples, n_features]) – input data
  • proba (bool, optional) – whether to predict class probabilities
Returns:

P – prediction with full-fold estimator.

Return type:

array of size [n_samples, n_prediction_features]

transform(X, proba=False)[source]

Transform

Use cross-validated estimators to generate predictions.

Parameters:
  • X (array of size [n_samples, n_features]) – input data
  • proba (bool, optional) – whether to predict class probabilities
Returns:

P – prediction generated by cross-validation.

Return type:

array of size [n_samples, n_prediction_features]

run

mlens.parallel.wrapper.run(caller, job, X, y=None, map=True, **kwargs)[source]

Utility for running a ParallelProcessing job on a set of callers.

Run is a utility mapping for setting up a ParallelProcessing job and executing across a set of callers. By default run executes:

out = mgr.map(caller, job, X, y, **kwargs)

run() handles temporary parameter changes, for instance running a learner with proba=True that has proba=False as default. Similarly, instances destined to not produce output can be forced to yield predictions by passing return_preds=True as a keyword argument.

Note

To run a learner with a preprocessing dependency, the instances need to be wrapped in a Group:

run(Group(learner, transformer), 'predict', X, y)
Parameters:
  • caller (instance, list) – A runnable instance, or a list of instances.
  • job (str) – type of job to run. One of 'fit', 'transform', 'predict'.
  • X (array-like) – input
  • y (array-like, optional) – targets
  • map (bool (default=True)) – whether to run a ParallelProcessing.map() job. If False, will instead run a ParallelProcessing.stack() job.
  • **kwargs (optional) – Keyword arguments. run() searches for proba and return_preds to temporarily update callers to run desired job and return desired output. Other kwargs are passed to either map or stack.
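
The idea of a temporary parameter change, such as running with proba=True on a learner whose default is proba=False, can be sketched with a context manager. This is not how run() is implemented; temporary_attrs and ToyLearner are hypothetical names used for illustration.

```python
from contextlib import contextmanager


@contextmanager
def temporary_attrs(obj, **overrides):
    """Set attributes on obj for the duration of the block, then restore."""
    previous = {name: getattr(obj, name) for name in overrides}
    for name, value in overrides.items():
        setattr(obj, name, value)
    try:
        yield obj
    finally:
        for name, value in previous.items():
            setattr(obj, name, value)


class ToyLearner(object):
    """Stand-in for a learner with a proba flag (hypothetical)."""

    def __init__(self):
        self.proba = False


learner = ToyLearner()
with temporary_attrs(learner, proba=True):
    inside = learner.proba   # the job would run here
after = learner.proba
print(inside, after)
```

Restoring in a finally clause ensures the original setting comes back even if the job raises.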

Backend

BaseProcessor

class mlens.parallel.backend.BaseProcessor(backend=None, n_jobs=None, verbose=None)[source]

Bases: object

Parallel processing base class.

Base class for parallel processing engines.

Parameters:
  • backend (str, optional) – Type of backend. One of 'threading', 'multiprocessing', 'sequential'.
  • n_jobs (int, optional) – Degree of concurrency.
  • verbose (bool, int, optional) – Level of verbosity of the Parallel instance.
clear()[source]

Destroy cache and reset instance job parameters.

initialize(job, X, y, path, warm_start=False, return_preds=False, **kwargs)[source]

Initialize processing engine.

Set up the job parameters before an estimation call. Calling clear() undoes initialization.

Parameters:
  • job (str) – type of job to complete with each task. One of 'fit', 'predict' and 'transform'.
  • X (array-like of shape [n_samples, n_features]) – Input data
  • y (array-like of shape [n_samples,], optional.) – targets. Required for fit, should not be passed to predict or transform jobs.
  • path (str or dict, optional) – Custom estimation cache. Pass a string to force use of persistent cache on disk. Pass a dict for in-memory cache (requires backend != 'multiprocessing').
  • return_preds (bool or list, optional) – whether to return prediction output. If True, final prediction is returned. Alternatively, pass a list of task names for which output should be returned.
  • warm_start (bool, optional) – whether to re-use previous input data initialization. Useful if repeated jobs are made on the same input arrays.
  • **kwargs (optional) – optional keyword arguments to pass onto the task’s call method.
Returns:

out – An output parameter dictionary to pass to an estimation method. Either None (no output), or {'final': True} for only final prediction, or {'final': False, 'return_names': return_preds} if a list of task-specific output was passed.

Return type:

dict
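
The three possible return values described above can be summarized in a short function. This is a sketch of the documented mapping only; output_params is a hypothetical name and the real initialize() does considerably more work.

```python
# Hypothetical function mirroring the documented return values.
def output_params(return_preds):
    """Map the return_preds argument to an output parameter dictionary."""
    if not return_preds:
        return None                      # no output requested
    if return_preds is True:
        return {'final': True}           # only the final prediction
    return {'final': False, 'return_names': list(return_preds)}


print(output_params(False))
print(output_params(True))
print(output_params(['layer-1', 'layer-2']))
```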

ParallelProcessing

class mlens.parallel.backend.ParallelProcessing(*args, **kwargs)[source]

Bases: mlens.parallel.backend.BaseProcessor

Parallel processing engine.

Engine for running computational graph.

ParallelProcessing is a manager for executing a sequence of tasks in a given caller, where each task is run sequentially, but assumed to be parallelized internally. The main responsibility of ParallelProcessing is to handle memory-mapping, estimation cache updates, input and output array updates and output collection.

Parameters:
  • caller (obj) – the caller of the job. Either a Layer or a meta layer class such as Sequential.
  • *args (optional) – Optional arguments to BaseProcessor
  • **kwargs (optional) – Optional keyword arguments to BaseProcessor.
get_preds(dtype=None, order='C')[source]

Return prediction matrix.

Parameters:
  • dtype (numpy dtype object, optional) – data type to return
  • order (str (default = 'C')) – data order. See numpy.asarray for details.
Returns:

P – Prediction array

Return type:

array-like

map(caller, job, X, y=None, path=None, return_preds=False, wart_start=False, split=False, **kwargs)[source]

Parallel task mapping.

Run independent tasks in caller in parallel.

Warning

By default, the map() method runs on a shallow cache, where all tasks share the same cache. As such, the user must ensure that each task has a unique name, or cache retrieval will be corrupted. To commit a separate sub-cache to each task, set split=True.

Parameters:
  • caller (iterable) – Iterable that generates accepted task instances. Caller should be a child of the BaseBackend class, and tasks need to implement an appropriate call method.
  • job (str) – type of job to complete with each task. One of 'fit', 'predict' and 'transform'.
  • X (array-like of shape [n_samples, n_features]) – Input data
  • y (array-like of shape [n_samples,], optional.) – targets. Required for fit, should not be passed to predict or transform jobs.
  • path (str or dict, optional) – Custom estimation cache. Pass a string to force use of persistent cache on disk. Pass a dict for in-memory cache (requires backend != 'multiprocessing').
  • return_preds (bool or list, optional) – whether to return prediction output. If True, final prediction is returned. Alternatively, pass a list of task names for which output should be returned.
  • warm_start (bool, optional) – whether to re-use previous input data initialization. Useful if repeated jobs are made on the same input arrays.
  • split (bool, default = False) – whether to commit a separate sub-cache to each task.
  • **kwargs (optional) – optional keyword arguments to pass onto each task.
Returns:

out – Prediction array(s).

Return type:

array-like, list, optional

process(caller, out, **kwargs)[source]

Process job.

Main method for processing a caller. Requires the instance to be setup by a prior call to initialize().

See also

map(), stack()

Parameters:
  • caller (iterable) – Iterable that generates accepted task instances. Caller should be a child of the BaseBackend class, and tasks need to implement an appropriate call method.
  • out (dict) – A dictionary with output parameters. Pass an empty dict for no output. See initialize() for more details.
Returns:

out – Prediction array(s).

Return type:

array-like, list, optional

stack(caller, job, X, y=None, path=None, return_preds=False, warm_start=False, split=True, **kwargs)[source]

Stacked parallel task mapping.

Run stacked tasks in caller in parallel.

This method runs a stack of tasks as a stack, where the output of each task is the input to the next.

Warning

By default, the stack() method runs on a deep cache, where each task has a separate cache. As such, the user must ensure that tasks don’t depend on data cached by previous tasks. To run all tasks in a single sub-cache, set split=False.

Parameters:
  • caller (iterable) – Iterable that generates accepted task instances. Caller should be a child of the BaseBackend class, and tasks need to implement an appropriate call method.
  • job (str) – type of job to complete with each task. One of 'fit', 'predict' and 'transform'.
  • X (array-like of shape [n_samples, n_features]) – Input data
  • y (array-like of shape [n_samples,], optional.) – targets. Required for fit, should not be passed to predict or transform jobs.
  • path (str or dict, optional) – Custom estimation cache. Pass a string to force use of persistent cache on disk. Pass a dict for in-memory cache (requires backend != 'multiprocessing').
  • return_preds (bool or list, optional) – whether to return prediction output. If True, final prediction is returned. Alternatively, pass a list of task names for which output should be returned.
  • warm_start (bool, optional) – whether to re-use previous input data initialization. Useful if repeated jobs are made on the same input arrays.
  • split (bool, default = True) – whether to commit a separate sub-cache to each task.
  • **kwargs (optional) – optional keyword arguments to pass onto each task.
Returns:

out – Prediction array(s).

Return type:

array-like, list, optional
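
The difference between map() and stack() comes down to what each task receives as input. The sketch below illustrates this with plain functions as tasks; run_map and run_stack are hypothetical, run sequentially, and ignore caching and memory-mapping entirely.

```python
# Hypothetical stand-ins for map() and stack(); tasks are plain functions.
def run_map(tasks, X):
    """Independent tasks: every task sees the same input."""
    return [task(X) for task in tasks]


def run_stack(tasks, X):
    """Stacked tasks: the output of each task is the input to the next."""
    for task in tasks:
        X = task(X)
    return X


def add_one(X):
    return [x + 1 for x in X]


def double(X):
    return [2 * x for x in X]


mapped = run_map([add_one, double], [1, 2])
stacked = run_stack([add_one, double], [1, 2])
print(mapped, stacked)
```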

ParallelEvaluation

class mlens.parallel.backend.ParallelEvaluation(*args, **kwargs)[source]

Bases: mlens.parallel.backend.BaseProcessor

Parallel cross-validation engine.

Minimal parallel processing engine. Similar to ParallelProcessing, but offers fewer features, only fits the caller's indexer, and expects no task output.

process(caller, case, X, y, path=None, **kwargs)[source]

Process caller.

Parameters:
  • caller (iterable) – Iterable for evaluation jobs. Expected caller is an Evaluator instance.
  • case (str) – evaluation case to run on the evaluator. One of 'preprocess' and 'evaluate'.
  • X (array-like of shape [n_samples, n_features]) – Input data
  • y (array-like of shape [n_samples,], optional.) – targets. Required for fit, should not be passed to predict or transform jobs.
  • path (str or dict, optional) – Custom estimation cache. Pass a string to force use of persistent cache on disk. Pass a dict for in-memory cache (requires backend != 'multiprocessing').

Job

class mlens.parallel.backend.Job(job, stack, split, dir=None, tmp=None, predict_in=None, targets=None, predict_out=None)[source]

Bases: object

Container class for holding and managing job data.

Job is intended as an on-the-fly job handler that keeps track of input data, predictions, and manages estimation caches.

Changed in version 0.2.0.

Parameters:
  • job (str) – Type of job to run. One of 'fit', 'transform', 'predict'.
  • stack (bool) – Whether to stack outputs when calls to update() are made. This will make the predict_out array become predict_in.
  • split (bool) – Whether to create a new sub-cache when the args property is called.
  • dir (str, dict, optional) – estimation cache. Pass dictionary for use with multiprocessing or a string pointing to the disk directory to create the cache in
  • tmp (obj, optional) – a Tempfile object for temporary directories
  • targets (array-like of shape [n_in_samples,], optional) – input targets
  • predict_in (array-like of shape [n_in_samples, n_in_features], optional) – input data
  • predict_out (array_like of shape [n_out_samples, n_out_features], optional) – prediction output array
args(**kwargs)[source]

Produce args dict

New in version 0.2.0.

Returns the arguments dictionary passed to a task of a parallel processing manager. Output dictionary has the following form:

out = {'auxiliary':
           {'X': self.predict_in, 'P': self.predict_out},
       'main':
           {'X': self.predict_in, 'P': self.predict_out},
       'dir':
           self.subdir(),
       'job':
            self.job
        }
Parameters:**kwargs (optional) – Optional keyword arguments to pass to the task.
Returns:args – Arguments dictionary
Return type:dict
clear()[source]

Clear output data for new task

rebase()[source]

Rebase output labels to input indexing.

Some indexers that only generate predictions for subsets of the training data require the targets to be rebased. Since indexers operate in a strictly sequential manner, rebase simply drops the first n observations in the target vector until the number of remaining observations coincides.

See also

BlendIndex
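
Rebasing as described amounts to trimming the front of the target vector. A minimal sketch, assuming the number of predictions is known and using a hypothetical free function instead of the Job method:

```python
# Hypothetical free-function version of the rebase idea.
def rebase(targets, n_predictions):
    """Drop leading targets so that len(targets) equals n_predictions."""
    n_drop = len(targets) - n_predictions
    if n_drop < 0:
        raise ValueError("More predictions than targets.")
    return targets[n_drop:]


y = [0, 1, 0, 1, 1]
y_rebased = rebase(y, n_predictions=3)
print(y_rebased)
```

This works because a sequential indexer, such as a blend-style split, predicts only for the last block of observations.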

shuffle(random_state)[source]

Shuffle inputs.

Permutes the indexing of predict_in and y arrays.

Parameters:random_state (int, obj) – Random seed number or generator to use.
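
The key property of the shuffle is that inputs and targets are permuted with the same index order, so rows stay aligned. A standard-library sketch follows; shuffle_together is a hypothetical helper that accepts only an integer seed, whereas the method above also accepts a generator object.

```python
import random


# Hypothetical helper; accepts an integer seed only.
def shuffle_together(X, y, random_state):
    """Permute X and y with one shared index permutation."""
    rng = random.Random(random_state)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]


X = [[0], [1], [2], [3]]
y = [0, 10, 20, 30]
Xs, ys = shuffle_together(X, y, random_state=7)
print(Xs, ys)
```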
subdir()[source]

Return a cache subdirectory

If split is in force, a new sub-cache will be created in the main cache. Otherwise the same sub-cache as used in the previous call will be returned.

New in version 0.2.0.

Returns:cache – Either a string pointing to a cache persisted to disk, or an in-memory cache in the form of a list.
Return type:str, list
update()[source]

Update output array and shift to input if stacked.

If stacking is in force, the output array will replace the input array and be used as input for subsequent jobs. Sparse matrices are force-converted to csr format.

dump_array

mlens.parallel.backend.dump_array(array, name, path)[source]

Dump array for memmapping.

Parameters:
  • array (array-like) – Array to be persisted
  • name (str) – Name of file
  • path (str) – Path to cache.
Returns:

f – memory-mapped array.

Return type:

array-like

Base classes

Schedulers for global setups:

Order  Setup type            Description                       Function call
0      Base setups           Independent of other features     IndexMixin._setup_0_index
1      Global setups         Reserved for aggregating classes  BaseStacker._setup_1_global
2      General local setups  Dependent on 0                    ProbaMixin.__setup_2_multiplier
3      Conditional setups    Dependent on 0, 2                 OutputMixin.__setup_3__output_columns

Note that base classes and setup schedulers are experimental and may change without a deprecation cycle.
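
One way to picture an ordered setup schedule is to discover methods by name and call them by their numeric index. The sketch below is an assumption about the general idea, not the mlens scheduler: run_setups and the Demo class are hypothetical, and single-underscore method names are used to sidestep Python name mangling.

```python
class Demo(object):
    """Toy class (hypothetical) with numbered setup methods."""

    def __init__(self):
        self.calls = []

    def _setup_3_output_columns(self):
        self.calls.append('output_columns')

    def _setup_0_index(self):
        self.calls.append('index')

    def _setup_2_multiplier(self):
        self.calls.append('multiplier')


def run_setups(obj):
    """Call every _setup_<n>_<name> method in ascending order of n."""
    names = [n for n in dir(obj) if n.startswith('_setup_')]
    for name in sorted(names, key=lambda n: int(n.split('_')[2])):
        getattr(obj, name)()


demo = Demo()
run_setups(demo)
print(demo.calls)
```

Sorting on the number embedded in the method name ensures that setups depending on earlier steps run after them.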

BaseBackend

class mlens.parallel.base.BaseBackend(backend=None, n_jobs=-1, dtype=None, raise_on_exception=True)[source]

Bases: object

Base class for parallel backend

Implements default backend settings.

__weakref__

list of weak references to the object (if defined)

BaseParallel

class mlens.parallel.base.BaseParallel(name, *args, **kwargs)[source]

Bases: mlens.parallel.base.BaseBackend

Base class for parallel objects

Parameters:
  • name (str) – name of instance. Should be unique.
  • backend (str or object (default = 'threading')) – backend infrastructure to use during call to mlens.externals.joblib.Parallel. See Joblib for further documentation. To set global backend, see set_backend().
  • raise_on_exception (bool (default = True)) – whether to issue warnings on soft exceptions or raise error. Examples include lack of layers, bad inputs, and failed fit of an estimator in a layer. If set to False, warnings are issued instead but estimation continues unless exception is fatal. Note that this can result in unexpected behavior unless the exception is anticipated.
  • verbose (int or bool (default = False)) – level of verbosity.
  • n_jobs (int (default = -1)) – Degree of concurrency in estimation. Set to -1 to maximize, 1 runs on a single process (or thread).
  • dtype (obj (default = np.float32)) – data type to use, must be compatible with a numpy array dtype.
__iter__()[source]

Iterator for process manager

setup(X, y, job, skip=None, **kwargs)[source]

Setup instance for estimation

BaseEstimator

class mlens.parallel.base.BaseEstimator(*args, **kwargs)[source]

Bases: mlens.parallel.base.ParamMixin, mlens.externals.sklearn.base.BaseEstimator, mlens.parallel.base.BaseParallel

Base Parallel Estimator class

Modified Scikit-learn class to handle backend params that we want to protect from changes.

__fitted__

Fit status

BaseStacker

class mlens.parallel.base.BaseStacker(stack=None, verbose=False, *args, **kwargs)[source]

Bases: mlens.parallel.base.BaseEstimator

Base class for instances that stack job estimators

__fitted__

Fitted status

__stack__

Check stack

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (boolean, optional) – whether to return nested parameters.
pop(idx)[source]

Pop a previous push with index idx

push(*stack)[source]

Push onto stack

replace(idx, item)[source]

Replace a current member of the stack with a new instance

verbose

Verbosity

Mixins

ParamMixin

class mlens.parallel.base.ParamMixin[source]

Bases: mlens.externals.sklearn.base.BaseEstimator, object

Parameter Mixin

Mixin for protecting static parameters from changes after fitting.

Note

To use this mixin the instance inheriting it must set __static__=list() and _static_fit_params_=dict() in __init__.

_check_static_params()[source]

Check if current static params are identical to previous params

_store_static_params()[source]

Record current static params for future comparison.
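
The store-then-compare idea behind the two methods can be shown with a small class. This is a simplified sketch: the class StaticParams is hypothetical, and returning a boolean from _check_static_params is an assumption made for the example.

```python
class StaticParams(object):
    """Toy class (hypothetical) protecting selected parameters."""

    def __init__(self, folds=2, verbose=False):
        self.folds = folds
        self.verbose = verbose
        self.__static__ = ['folds']       # parameters to protect
        self._static_fit_params_ = dict()

    def _store_static_params(self):
        """Record current static params for future comparison."""
        self._static_fit_params_ = {
            name: getattr(self, name) for name in self.__static__}

    def _check_static_params(self):
        """Return True if static params match the stored values."""
        return all(getattr(self, name) == value
                   for name, value in self._static_fit_params_.items())


obj = StaticParams()
obj._store_static_params()       # e.g. at the end of a fit call
obj.verbose = True               # not static, so no effect on the check
unchanged = obj._check_static_params()
obj.folds = 5                    # static parameter changed after fitting
changed = not obj._check_static_params()
print(unchanged, changed)
```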

IndexMixin

class mlens.parallel.base.IndexMixin[source]

Bases: object

Indexer mixin

Mixin for handling indexers.

Note

To use this mixin the instance inheriting it must set the indexer or indexers attribute in __init__ (not both).

__indexer__

Flag for existence of indexer

__weakref__

list of weak references to the object (if defined)

_check_indexer(indexer)[source]

Check consistent indexer classes

_get_indexers()[source]

Return list of indexers

OutputMixin

class mlens.parallel.base.OutputMixin[source]

Bases: mlens.parallel.base.IndexMixin

Output Mixin

Mixin class for interfacing with ParallelProcessing when outputs are desired.

Note

To use this mixin the instance inheriting it must set the feature_span attribute and __no_output__ flag in __init__.

_setup_3_output_columns(X, y, job, n_left_concats=0)[source]

Set output columns for prediction array. Used during setup

set_output_columns(X, y, job, n_left_concats=0)[source]

Set output columns for prediction array

shape(job)[source]

Prediction array shape

size(attr)[source]

Get size of dim 0

ProbaMixin

class mlens.parallel.base.ProbaMixin[source]

Bases: object

Probability Mixin

Mixin for probability features on objects interfacing with ParallelProcessing

Note

To use this mixin the instance inheriting it must set the proba and the _classes (=None) attributes in __init__.

__weakref__

list of weak references to the object (if defined)

classes_

Prediction classes during proba