Surrogate Module

class corrai.surrogate.ModelTrainer(model, test_size=0.2, random_state=42)[source]

Bases: object

__init__(model, test_size=0.2, random_state=42)[source]

Initialize a ModelTrainer instance for training a machine learning model.

Parameters:

model – A scikit-learn compatible model pipeline for training and prediction. It is stored as the model_pipe attribute.
test_size (float) – The proportion of the dataset to set aside as the test set (default: 0.2).
random_state (int) – Seed for random number generation to ensure reproducibility (default: 42).

The ModelTrainer prepares data for training and evaluation of the specified model.

Attributes:

- test_size: The proportion of data to be used as the test set.
- model_pipe: The machine learning model pipeline to be trained.
- random_state: Seed for random number generation.
- x_train: Training data features.
- x_test: Test data features.
- y_train: Training data labels.
- y_test: Test data labels.
- _is_trained: A boolean indicating if the model has been trained.
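
The roles of test_size and random_state can be illustrated with a minimal, standard-library sketch of a seeded hold-out split. This is not corrai's implementation, which this page does not show; the helper name holdout_split is hypothetical, and only the parameter and attribute names are taken from the documentation above.

```python
import random


def holdout_split(X, y, test_size=0.2, random_state=42):
    """Hypothetical helper: shuffle row indices with a fixed seed, then hold out a test share."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    indices = list(range(len(X)))
    # A seeded generator makes the shuffle, and therefore the split, reproducible.
    random.Random(random_state).shuffle(indices)
    n_test = max(1, round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [X[i] for i in train_idx]
    x_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Toy data: ten rows with one feature each, and y = 2 * x.
X = [[float(i)] for i in range(10)]
y = [2.0 * i for i in range(10)]
x_train, x_test, y_train, y_test = holdout_split(X, y, test_size=0.2, random_state=42)
```

Calling the helper twice with the same random_state returns the same partition, which is the reproducibility the seed is meant to provide.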

train(X, y)[source]

property test_nmbe_score

property test_cvrmse_score
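
The two properties above are not defined on this page. As a rough guide, NMBE (normalized mean bias error) and CV(RMSE) (coefficient of variation of the root mean squared error) are commonly computed as in the sketch below. Sign and normalization conventions differ between references, and the exact convention used by corrai is an assumption here; the function names are illustrative only.

```python
import math


def nmbe(y_true, y_pred):
    """Normalized mean bias error in percent: mean(y_pred - y_true) / mean(y_true) * 100."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    bias = sum(p - t for t, p in zip(y_true, y_pred)) / n
    return 100.0 * bias / mean_true


def cvrmse(y_true, y_pred):
    """CV(RMSE) in percent: RMSE of the predictions divided by mean(y_true), times 100."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    rmse = math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)
    return 100.0 * rmse / mean_true


# Toy values standing in for y_test and the model's predictions on x_test.
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 33.0]
bias_percent = nmbe(y_true, y_pred)
spread_percent = cvrmse(y_true, y_pred)
```

With this sign convention a positive NMBE means the model over-predicts on average, while CV(RMSE) measures the overall scatter regardless of sign.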

class corrai.surrogate.MultiModelSO(models=None, cv=3, scoring='neg_mean_squared_error', fine_tuning=True, tuning_n_iter=None, use_continuous_distributions=False, n_jobs=-1, random_state=None)[source]

Bases: BaseEstimator, RegressorMixin

Multi-model selection and optimization wrapper for scikit-learn regressors.

This class automates model training, cross-validation scoring, model selection, and optional fine-tuning via grid search. It compares multiple candidate models and selects the one with the best cross-validation performance according to a specified scoring metric.

Parameters:

models (list[str]) – List of model keys to evaluate. Must be a subset of the MODEL_MAP keys: "TREE_REGRESSOR", "RANDOM_FOREST", "LINEAR_REGRESSION", "LINEAR_SECOND_ORDER", "LINEAR_THIRD_ORDER", "SUPPORT_VECTOR", "MULTI_LAYER_PERCEPTRON". If None (default), all models in MODEL_MAP are evaluated.
cv (int) – Number of cross-validation folds to use for model comparison.
fine_tuning (bool) – If True, perform a grid search on the best model to fine-tune its hyperparameters.
scoring (str) – Scoring function to evaluate models. Should be a valid scikit-learn scorer string (e.g. "r2", "neg_mean_absolute_error").
n_jobs (int) – Number of parallel jobs for cross-validation and grid search. -1 means using all available cores.
random_state (int) – Random seed for reproducibility.
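
The handling of the models argument can be sketched as follows. The key names come from the parameter description above; the helper resolve_model_keys is hypothetical, and raising ValueError for unknown keys is an assumption, since the error behavior is not documented on this page.

```python
# Keys listed in the documentation; in corrai, MODEL_MAP maps them to estimators.
DOCUMENTED_MODEL_KEYS = (
    "TREE_REGRESSOR",
    "RANDOM_FOREST",
    "LINEAR_REGRESSION",
    "LINEAR_SECOND_ORDER",
    "LINEAR_THIRD_ORDER",
    "SUPPORT_VECTOR",
    "MULTI_LAYER_PERCEPTRON",
)


def resolve_model_keys(models=None):
    """Hypothetical helper: return the keys to evaluate, defaulting to all of them."""
    if models is None:
        return list(DOCUMENTED_MODEL_KEYS)
    unknown = [key for key in models if key not in DOCUMENTED_MODEL_KEYS]
    if unknown:
        # Assumption: the exception type used by corrai is not stated in the docs.
        raise ValueError(f"Unknown model keys: {unknown}")
    return list(models)


all_keys = resolve_model_keys()
subset = resolve_model_keys(["LINEAR_REGRESSION", "RANDOM_FOREST"])
```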

Variables:

model_map (dict) – Dictionary mapping model keys to fitted estimator instances.
best_model_key (str) – The key of the best-performing model after training.
_is_fitted (bool) – Whether the estimator has been fitted.

fit(X, y, verbose=True)[source]: Train and evaluate all models, selecting the best one.

predict(X, model=None)[source]: Predict using the best model (or a specified model).

get_model(model=None)[source]: Retrieve the fitted estimator by key.

fine_tune(X, y, model=None, verbose=3)[source]: Perform grid search hyperparameter tuning on a given model.

Examples

>>> import pandas as pd
>>> from sklearn.datasets import load_diabetes
>>> from sklearn.model_selection import train_test_split
>>> from corrai.surrogate import MultiModelSO
>>>
>>> data = load_diabetes(as_frame=True)
>>> X = data.data
>>> y = data.target
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.2, random_state=42
... )
>>>
>>> model = MultiModelSO(
...     models=["LINEAR_REGRESSION", "RANDOM_FOREST"], cv=5, fine_tuning=False
... )
>>> model.fit(X_train, y_train, verbose=True)
=== Training results ===
Cross validation neg_mean_squared_error scores of 5 folds
                              mean(neg_mean_squared_error) std(neg_mean_squared_error)
RANDOM_FOREST                                -3143.015307                          355.466814
LINEAR_REGRESSION                            -3425.368758                          525.460964
>>> y_pred = model.predict(X_test)
>>> y_pred.head()
      0
287  139.547558
211  179.517208
72   134.038756
321  291.417029
73   123.789659

>>> # Fast configuration and training (development)
>>> model = MultiModelSO(
...     models=["LINEAR_REGRESSION", "RANDOM_FOREST", "MULTI_LAYER_PERCEPTRON"],
...     cv=3,
...     fine_tuning=True,
...     tuning_n_iter=TUNING_N_ITER_BY_MODEL,
...     use_continuous_distributions=False,
...     n_jobs=-1,
... )

>>> # Optimal configuration (production)
>>> model = MultiModelSO(
...     models=None,
...     cv=5,
...     fine_tuning=True,
...     tuning_n_iter=TUNING_N_ITER_BY_MODEL,
...     use_continuous_distributions=True,
...     n_jobs=-1,
...     random_state=42,
... )
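
The selection step, which compares candidates by their mean cross-validation score and keeps the best, can be sketched without corrai or scikit-learn. Everything below is illustrative: the two toy models, the contiguous-fold splitter, and the helper names are assumptions, not the library's implementation. The point is that with a "neg_*" scorer such as neg_mean_squared_error, higher (closer to zero) is better, so the best model is the one with the maximum mean score.

```python
from statistics import mean


class MeanModel:
    """Baseline that always predicts the mean of the training targets."""

    def fit(self, x, y):
        self.mean_ = mean(y)
        return self

    def predict(self, x):
        return [self.mean_ for _ in x]


class LineModel:
    """Ordinary least squares on a single feature."""

    def fit(self, x, y):
        mx, my = mean(x), mean(y)
        sxx = sum((xi - mx) ** 2 for xi in x)
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        self.slope_ = sxy / sxx
        self.intercept_ = my - self.slope_ * mx
        return self

    def predict(self, x):
        return [self.intercept_ + self.slope_ * xi for xi in x]


def neg_mse(y_true, y_pred):
    """Negative mean squared error, so that larger values mean better fits."""
    return -mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])


def cross_val_scores(model_factory, x, y, cv=3):
    """Return one neg-MSE score per fold, using contiguous folds without shuffling."""
    n = len(x)
    scores = []
    for k in range(cv):
        start, stop = k * n // cv, (k + 1) * n // cv
        train = [i for i in range(n) if not (start <= i < stop)]
        test = list(range(start, stop))
        model = model_factory().fit([x[i] for i in train], [y[i] for i in train])
        scores.append(neg_mse([y[i] for i in test], model.predict([x[i] for i in test])))
    return scores


# Toy data: a noisy straight line.
x = [float(i) for i in range(12)]
y = [3.0 * xi + 1.0 + (0.5 if i % 2 else -0.5) for i, xi in enumerate(x)]

candidates = {"MEAN": MeanModel, "LINE": LineModel}
mean_scores = {
    key: mean(cross_val_scores(factory, x, y, cv=3))
    for key, factory in candidates.items()
}
# Scores are negated errors, so the best candidate has the largest mean score.
best_model_key = max(mean_scores, key=mean_scores.get)
```

This mirrors the ordering in the training results table above, where the model with the least negative mean score is listed first.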

__init__(models=None, cv=3, scoring='neg_mean_squared_error', fine_tuning=True, tuning_n_iter=None, use_continuous_distributions=False, n_jobs=-1, random_state=None)[source]

property feature_names_in_

fit(X, y, verbose=True)[source]

predict(X, model=None)[source]

Return type: DataFrame

get_model(model=None)[source]

fine_tune(X, y, model=None, verbose=True)[source]

get_feature_importance(model=None, top_n=10)[source]

set_fit_request(*, verbose: bool | None | str = '$UNCHANGED$') → MultiModelSO

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters: verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.
Returns: self – The updated object.
Return type: object

set_predict_request(*, model: bool | None | str = '$UNCHANGED$') → MultiModelSO

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters: model (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for model parameter in predict.
Returns: self – The updated object.
Return type: object

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → MultiModelSO

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters: sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
Returns: self – The updated object.
Return type: object

class corrai.surrogate.StaticScikitModel(scikit_model, target_name=None)[source]

Bases: Model

Wrapper class for the static surrogate class MultiModelSO and for scikit-learn regressors within the Corrai framework.

This class adapts corrai’s MultiModelSO and scikit-learn models to the Model interface, enabling parameter-to-property mapping and simulation execution. It is intended for non-dynamic (static) models where outputs are single values or vectors rather than time-dependent series.

Parameters:

scikit_model (MultiModelSO | RegressorMixin) – The underlying scikit-learn model or a Corrai MultiModelSO meta-estimator.
target_name (str) – Name of the output variable. Required when scikit_model is not an instance of MultiModelSO.

Variables:

is_dynamic (bool) – Always False for this wrapper, since it represents static models.
scikit_model (MultiModelSO or RegressorMixin) – The wrapped scikit-learn model used for predictions.
target_name (str) – Output variable name.

Raises:

ValueError – If target_name cannot be inferred and is not provided.

__init__(scikit_model, target_name=None)[source]

simulate(property_dict=None, simulation_options=None, **simulation_kwargs)[source]

Run the scikit-learn model prediction.

Combines provided parameter values and simulation options into a feature vector, validates compatibility with the underlying model, and returns predictions as a pandas Series.

Parameters:

property_dict (dict[str, str | int | float]) – Mapping from feature names to values to use for prediction.
simulation_options (dict) – Additional feature overrides or configuration parameters to include in the feature vector. These values override those in property_dict if keys overlap.
**simulation_kwargs – Extra keyword arguments for future extensions (currently unused).

Returns:

Prediction results with index [self.target_name].

Return type:

Series

Raises:

ValueError – If unknown feature names are provided.
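
The merge-and-validate step described for simulate can be sketched in plain Python. The helper build_feature_row, the toy predictor, and the feature and target names are all hypothetical; only the override rule (simulation_options win over property_dict) and the ValueError for unknown feature names come from the documentation above. How the real method handles missing features is not documented here, so raising an error for them is an assumption.

```python
def build_feature_row(feature_names, property_dict=None, simulation_options=None):
    """Hypothetical helper: merge inputs, validate names, return values in model order."""
    values = dict(property_dict or {})
    # Documented behavior: simulation_options override property_dict on overlapping keys.
    values.update(simulation_options or {})
    unknown = sorted(set(values) - set(feature_names))
    if unknown:
        raise ValueError(f"Unknown feature names: {unknown}")
    missing = [name for name in feature_names if name not in values]
    if missing:
        # Assumption: the handling of missing features is not documented on this page.
        raise ValueError(f"Missing feature values: {missing}")
    return [values[name] for name in feature_names]


def toy_predict(row):
    """Stand-in for the wrapped model's predict; not a real surrogate."""
    return 2.0 * row[0] + 0.5 * row[1]


feature_names = ["wall_u_value", "window_area"]  # hypothetical feature names
row = build_feature_row(
    feature_names,
    property_dict={"wall_u_value": 0.3, "window_area": 12.0},
    simulation_options={"window_area": 15.0},
)
# The real method returns a pandas Series indexed by target_name; a dict stands in here.
result = {"heating_demand": toy_predict(row)}
```

Here window_area ends up as 15.0 rather than 12.0, because the value from simulation_options takes precedence.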