ML-Ensemble
| author: | Sebastian Flennerhag |
|---|---|
| copyright: | 2017-2018 |
| license: | MIT |
Model selection suite for ensemble learning.
Implements a computational graph framework for maximally parallelized cross-validation of an arbitrary set of estimators over an arbitrary set of preprocessing pipelines. The model selection suite features batch randomized grid search and batch benchmarking. Ensembles can be treated as preprocessing pipelines for next-layer model selection.
mlens.model_selection

BaseEval

class mlens.model_selection.BaseEval(verbose=False, array_check=None, **kwargs)

Bases: mlens.parallel.base.IndexMixin, mlens.parallel.base.BaseBackend

Base evaluation class.
raw_data

Cross-validated scores.
Evaluator

class mlens.model_selection.Evaluator(scorer, cv=2, shuffle=True, random_state=None, error_score=None, metrics=None, verbose=False, **kwargs)

Bases: mlens.model_selection.model_selection.BaseEval

Model selection across several estimators and preprocessing pipelines.
The Evaluator allows users to evaluate several models in one call across a set of preprocessing pipelines. The class is useful for comparing a set of estimators, especially when several preprocessing pipelines are to be evaluated. By pre-making all folds and iteratively fitting estimators with different parameter settings, array slicing and preprocessing are kept to a minimum. This can greatly reduce fit time compared to creating a pipeline class for each estimator and preprocessing case and fitting them one at a time in a Scikit-learn sklearn.model_selection.GridSearchCV class.

Preprocessing can be done before making any evaluation, and several evaluations can be made on the pre-made folds. The current implementation relies on a randomized grid search, so parameter grids must be specified as SciPy distributions (or any class that implements an rvs method).

Changed in version 0.2.0.
Parameters:

- scorer (function) –

  a scoring function that follows the Scikit-learn API:

      score = scorer(estimator, y_true, y_pred)

  A user-defined scoring function, score = f(y_true, y_pred), can be made into a scorer by calling the ML-Ensemble implementation of Scikit-learn's make_scorer. NOTE: do not use Scikit-learn's make_scorer if the Evaluator is to be pickled.

      from mlens.metrics import make_scorer
      scorer = make_scorer(scoring_function, **kwargs)

  See the scorer sketch after this parameter list for a concrete example.
- error_score (int, optional) – score to assign when fitting an estimator fails. If None, the evaluator will raise an error.
- cv (int or obj, default = 2) – cross-validation folds to use. Either pass the number of folds, or a KFold class that obeys the Scikit-learn API.
- metrics (list, optional) – list of aggregation metrics to calculate on scores. Default is mean and standard deviation.
- shuffle (bool, default = True) – whether to shuffle input data before creating cv folds.
- random_state (int, optional) – seed for creating folds (if shuffled) and parameter draws.
- n_jobs (int, default = -1) – number of CPU cores to use.
- verbose (bool or int, default = False) –

  level of printed messages. Levels:

  - verbose=1: message at start and end with total time
  - verbose=2: additional messages for each sub-job (preprocess and evaluation)
  - verbose in [3, 14]: additional messages with job completion status at increasing frequency
  - verbose >= 15: prints each completed job as [case].[est].[draw].[fold]
  If verbose >= 20, prints to sys.stderr, else sys.stdout.
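For concreteness, a pickling-safe scorer might be built as follows. This is a minimal sketch, assuming mlens.metrics.make_scorer mirrors the Scikit-learn make_scorer signature; the rmse metric is a hypothetical user-defined function:

    import numpy as np
    from mlens.metrics import make_scorer

    def rmse(y_true, y_pred):
        # Hypothetical user metric with the f(y_true, y_pred) signature;
        # root mean squared error, where lower is better.
        return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

    # greater_is_better=False flags that lower scores are better (assumed
    # to be supported, as in Scikit-learn's make_scorer).
    scorer = make_scorer(rmse, greater_is_better=False)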
fit(X, y, estimators=None, param_dicts=None, n_iter=2, preprocessing=None)

Fit preprocessing if applicable and evaluate estimators if applicable. The method automatically determines whether to only run preprocessing, only evaluation (possibly on previously fitted preprocessing), or both. Calling fit will overwrite previously stored data where applicable.

Parameters:

- X (array-like, shape=[n_samples, n_features]) – input data to preprocess and create folds from.
- y (array-like, shape=[n_samples, ]) – training labels.
- estimators (list or dict, optional) –

  set of estimators to use. If no preprocessing is desired, or if a single preprocessing pipeline should apply to all estimators, pass a list of estimators. The list can contain named tuples (i.e. ('my_name', my_est)).

  If different estimators should be mapped to preprocessing cases, pass a dictionary that maps each case to a list of estimators: {'case_a': list_of_est, ...}.

- param_dicts (dict, optional) –
  parameter distribution mapping for estimators. The current implementation only supports randomized grid search, so each distribution object passed must have an rvs method; see scipy.stats for details.

  There is considerable flexibility in specifying param_dicts. If there is no preprocessing, or if all estimators are fitted on all preprocessing cases, param_dicts should have keys matching the names of the estimators:

      estimators = [('name', est), est]
      param_dicts = {'name': {'param-1': some_distribution},
                     'est': {'param-1': some_distribution}}

  It is possible to specify different distributions for some or all preprocessing cases:

      preprocessing = {'case-1': transformer_list,
                       'case-2': transformer_list}
      estimators = [('name', est), est]
      param_dicts = {'name': {'param-1': some_distribution},
                     ('case-1', 'est'): {'param-1': some_distribution},
                     ('case-2', 'est'): {'param-1': some_distribution,
                                         'param-2': some_distribution}}

  If estimators are mapped on a per-preprocessing-case basis as a dictionary, param_dicts must have key entries of the form (case_name, est_name).

- n_iter (int) – number of parameter draws to evaluate.
- preprocessing (dict, optional) –

  preprocessing cases to consider. Pass a dictionary mapping a case name to a preprocessing pipeline:

      preprocessing = {'case_name': transformer_list}

Returns: self – class instance with stored estimator evaluation results in the results attribute.

Return type: instance
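The pieces above combine as in the following end-to-end sketch. This is illustrative rather than canonical usage: the data, estimators, preprocessing cases and distributions are assumptions, and only the Evaluator and make_scorer calls come from this page.

    from scipy.stats import randint, uniform
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    from mlens.metrics import make_scorer
    from mlens.model_selection import Evaluator

    X, y = load_iris(return_X_y=True)
    scorer = make_scorer(accuracy_score)

    # Two preprocessing cases; each estimator is fitted on both.
    preprocessing = {'mm': [MinMaxScaler()], 'sc': [StandardScaler()]}
    estimators = [('knn', KNeighborsClassifier()), ('lr', LogisticRegression())]

    # Distributions must implement rvs(), as scipy.stats objects do.
    param_dicts = {'knn': {'n_neighbors': randint(2, 20)},
                   'lr': {'C': uniform(0.01, 10)}}

    evl = Evaluator(scorer, cv=4, random_state=1, verbose=1)
    evl.fit(X, y, estimators, param_dicts, n_iter=10,
            preprocessing=preprocessing)
    print(evl.results)  # aggregated cross-validated scores per case/estimator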
Benchmark

class mlens.model_selection.Benchmark(verbose=False, **kwargs)

Bases: mlens.model_selection.model_selection.BaseEval

Benchmark engine without hyper-parameter grid search.
A simplified version of the Evaluator that performs a single pass over a set of estimators and preprocessing pipelines for benchmarking purposes.

New in version 0.2.0.
Parameters:

- verbose (bool, int, optional) – verbosity during estimation.
- **kwargs (optional) – optional keyword arguments to BaseBackend.
fit(X, y, scorer, cv, estimators, preprocessing=None, error_score=None)

Run a benchmarking job on the given data with the given estimators.

Fit preprocessing if applicable and evaluate estimators if applicable. The method automatically determines whether to only run preprocessing, only evaluation (possibly on previously fitted preprocessing), or both. Calling fit will overwrite previously stored data where applicable.

Parameters:

- X (array-like, shape=[n_samples, n_features]) – input data to preprocess and create folds from.
- y (array-like, shape=[n_samples, ]) – training labels.
- scorer (function) –

  a scoring function that follows the Scikit-learn API:

      score = scorer(estimator, y_true, y_pred)

  A user-defined scoring function, score = f(y_true, y_pred), can be made into a scorer by calling the ML-Ensemble implementation of Scikit-learn's make_scorer. NOTE: do not use Scikit-learn's make_scorer if the Evaluator is to be pickled.

      from mlens.metrics import make_scorer
      scorer = make_scorer(scoring_function, **kwargs)
- error_score (int, optional) – score to assign when fitting an estimator fails. If None, the evaluator will raise an error.
- cv (int or obj, default = 2) – cross-validation folds to use. Either pass the number of folds, or a KFold class that obeys the Scikit-learn API.
- estimators (list or dict, optional) –

  set of estimators to use. If no preprocessing is desired, or if a single preprocessing pipeline should apply to all estimators, pass a list of estimators. The list can contain named tuples (i.e. ('my_name', my_est)).

  If different estimators should be mapped to preprocessing cases, pass a dictionary that maps each case to a list of estimators: {'case_a': list_of_est, ...}.

- preprocessing (dict, optional) –
  preprocessing cases to consider. Pass a dictionary mapping a case name to a preprocessing pipeline:

      preprocessing = {'case_name': transformer_list}

Returns: self – fitted Benchmark instance. Results available in the results attribute.

Return type: instance
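As an illustration, a single benchmarking pass might look as follows. This is a sketch under the same illustrative assumptions as the Evaluator example above; the per-case estimator mapping is invented for demonstration.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    from mlens.metrics import make_scorer
    from mlens.model_selection import Benchmark

    X, y = load_iris(return_X_y=True)
    scorer = make_scorer(accuracy_score)

    # Different estimators per preprocessing case, keyed by case name.
    estimators = {'mm': [GaussianNB()], 'sc': [LogisticRegression()]}
    preprocessing = {'mm': [MinMaxScaler()], 'sc': [StandardScaler()]}

    bench = Benchmark(verbose=1)
    bench.fit(X, y, scorer, 4, estimators, preprocessing)
    print(bench.results)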
benchmark

mlens.model_selection.benchmark(X, y, scorer, cv, estimators, preprocessing, error_score=None, **kwargs)

Benchmark estimators across preprocessing pipelines.

benchmark() runs cross-validation scoring of a set of estimators, possibly against a set of preprocessing pipelines. Equivalent to:

    evl = Benchmark(**kwargs)
    evl.fit(X, y, scorer, ...)
New in version 0.2.0.
Parameters:

- X (array-like, shape=[n_samples, n_features]) – input data to preprocess and create folds from.
- y (array-like, shape=[n_samples, ]) – training labels.
- scorer (function) –

  a scoring function that follows the Scikit-learn API:

      score = scorer(estimator, y_true, y_pred)

  A user-defined scoring function, score = f(y_true, y_pred), can be made into a scorer by calling the ML-Ensemble implementation of Scikit-learn's make_scorer. NOTE: do not use Scikit-learn's make_scorer if the Evaluator is to be pickled.

      from mlens.metrics import make_scorer
      scorer = make_scorer(scoring_function, **kwargs)
- error_score (int, optional) – score to assign when fitting an estimator fails. If None, the evaluator will raise an error.
- cv (int or obj, default = 2) – cross-validation folds to use. Either pass the number of folds, or a KFold class that obeys the Scikit-learn API.
- estimators (list or dict, optional) –

  set of estimators to use. If no preprocessing is desired, or if a single preprocessing pipeline should apply to all estimators, pass a list of estimators. The list can contain named tuples (i.e. ('my_name', my_est)).

  If different estimators should be mapped to preprocessing cases, pass a dictionary that maps each case to a list of estimators: {'case_a': list_of_est, ...}.

- preprocessing (dict, optional) –
  preprocessing cases to consider. Pass a dictionary mapping a case name to a preprocessing pipeline:

      preprocessing = {'case_name': transformer_list}

- **kwargs (optional) – optional keyword arguments to BaseBackend.
Returns: results – Summary output that shows data for best mean test scores, such as test and train scores, std, fit times, and params.
Return type:
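For example, the one-call form might be used as follows (a sketch under the same illustrative assumptions as the examples above):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import StandardScaler

    from mlens.metrics import make_scorer
    from mlens.model_selection import benchmark

    X, y = load_iris(return_X_y=True)
    scorer = make_scorer(accuracy_score)

    # Single call: equivalent to Benchmark().fit(...) followed by
    # inspecting the results attribute.
    results = benchmark(X, y, scorer, 4,
                        estimators=[('lr', LogisticRegression()),
                                    ('knn', KNeighborsClassifier())],
                        preprocessing={'std': [StandardScaler()]})
    print(results)  # best mean test scores with std, fit times and params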