Sampling Module

class corrai.sampling.Sample(parameters, is_dynamic=True, results=<factory>)[source]

Bases: object

Container for simulation samples and results.

Each Sample instance stores parameter values and the corresponding simulation results. It supports indexing, aggregation, plotting, and integration with sampling strategies.

Handles both dynamic and static models.

Parameters:

parameters (list of Parameter) – List of model parameters used to generate the samples.

Variables:

parameters (list of Parameter) – Parameters associated with this sample.
is_dynamic (bool, default True) – Whether stored results are time series in a DataFrame (dynamic models) or a Series of floats (static models).
values (ndarray of shape (n_samples, n_parameters)) – Numerical values of the sampled parameters.
results (Series of DataFrames) – Simulation results for each sample. Each element is typically a pandas DataFrame indexed by time, containing model outputs.

parameters: list[Parameter]

is_dynamic: bool = True

values: DataFrame

results: Series

get_pending_index()[source]

Identify which samples have not yet been simulated.

Returns:

Boolean mask of length len(self), where True indicates a sample without results.

Return type:

ndarray of bool
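
As a rough illustration of what such a mask looks like, the sketch below rebuilds it by hand with pandas only. It assumes that a pending sample is stored as an empty DataFrame (as described for add_samples when results is None); the `results` Series here is a hypothetical stand-in, not the library's internal attribute handling.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Sample.results: one DataFrame per sample.
t = pd.date_range("2009-01-01", freq="h", periods=2)
results = pd.Series(
    [
        pd.DataFrame({"a": [1.0, 2.0]}, index=t),  # already simulated
        pd.DataFrame(),  # pending: nothing stored yet
        pd.DataFrame(),  # pending
    ],
    dtype=object,
)

# True where the stored result is empty, i.e. the sample still needs a run.
pending = np.array([df.empty for df in results])
print(pending)
```

A mask of this kind can be used to select only the rows of values that still need to be simulated.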

get_parameters_intervals()[source]

Return parameter intervals.

Returns:

Lower and upper bounds for each parameter.

Return type:

ndarray of shape (n_parameters, 2)

Raises:

NotImplementedError – If any parameter has type ‘Integer’.
ValueError – If parameters are not of type ‘Real’.

get_list_parameter_value_pairs(idx=None)[source]

Map parameter objects to their sampled values.

Parameters:

idx (int, list of int, ndarray, or slice, optional) – Indices of samples to retrieve. Defaults to all.

Returns:

Nested list where each inner list corresponds to a sample.

Return type:

list of list of (Parameter, value)
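
The shape of the returned structure can be pictured with the plain-Python sketch below. The strings "a" and "b" are hypothetical stand-ins for Parameter objects, and the pairing logic is an assumption about the layout described above, not the library's code.

```python
# Hypothetical stand-ins: plain strings replace Parameter objects here.
parameters = ["a", "b"]
values = [[1.0, 2.0], [3.0, 4.0]]  # one row per sample

# One inner list per sample, each holding (parameter, value) pairs.
pairs = [list(zip(parameters, row)) for row in values]
print(pairs)
```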

get_dimension_less_values(idx=slice(None, None, None))[source]

Normalize parameter values to [0, 1].

Parameters:

idx (int, list, ndarray, or slice, optional) – Indices of samples to normalize. Defaults to all.

Returns:

Dimensionless parameter values, scaled using their defined intervals.

Return type:

ndarray of shape (n_selected, n_parameters)
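
A minimal sketch of this kind of scaling, assuming plain min-max normalization against each parameter's (lower, upper) interval. The `intervals` and `values` arrays are made up for illustration, and the library may handle edge cases differently.

```python
import numpy as np

# Hypothetical intervals for two parameters: each row is (lower, upper).
intervals = np.array([[1.0, 10.0], [0.0, 5.0]])
# Hypothetical sampled values, shape (n_samples, n_parameters).
values = np.array([[1.0, 5.0], [5.5, 2.5], [10.0, 0.0]])

lower, upper = intervals[:, 0], intervals[:, 1]
# Min-max scaling: the lower bound maps to 0 and the upper bound to 1.
normalized = (values - lower) / (upper - lower)
print(normalized)
```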

add_samples(values, results=None)[source]

Add new samples and optionally their results.

Parameters:

values (ndarray of shape (n_samples, n_parameters)) – Sampled parameter values to add.
results (list of DataFrame, optional) – Simulation results corresponding to values. If None, empty DataFrames are stored.

Raises:

AssertionError – If results length does not match values length.

get_aggregated_time_series(indicator, method='mean', agg_method_kwarg=None, reference_time_series=None, freq=None, prefix='aggregated')[source]

Aggregate sample results using a specified statistical or error metric.

This method extracts the specified indicator column, and aggregates the time series across simulations using the given method. If a reference time series is provided, metrics that require ground truth (e.g., mean_absolute_error) are supported.

If freq is provided, the aggregation is done over time bins, producing a table of simulation runs versus time periods.

Only works for dynamic models.

Parameters:

indicator (str) – The column name in each DataFrame to extract and aggregate.
method (str, default="mean") –
The aggregation method to use. Supported methods include:
- “mean”
- “sum”
- “nmbe”
- “cv_rmse”
- “mean_squared_error”
- “mean_absolute_error”
agg_method_kwarg (dict, optional) – Additional keyword arguments to pass to the aggregation function.
reference_time_series (pandas.Series, optional) – Reference series (y_true) to compare each simulation against. Required for error-based methods such as “mean_absolute_error”. Must have the same datetime index and length as the individual simulation results.
freq (str or pandas.Timedelta or datetime.timedelta, optional) – If provided, aggregate the time series within bins of this frequency (e.g., “d” for daily, “h” for hourly). The result will be a DataFrame where each row corresponds to a simulation and each column to a time bin.
prefix (str, default="aggregated") – Prefix to use for naming the output column when freq is not specified.

Returns:

If freq is not provided, returns a one-column DataFrame containing the aggregated metric per simulation, indexed by the same index as results.

If freq is provided, returns a DataFrame indexed by simulation IDs (same as results.index), with columns representing each aggregated time bin.

Return type:

pandas.DataFrame

Raises:

ValueError – If the shapes of results and reference_time_series are incompatible, or if the datetime index is invalid or missing.
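
To make the two output layouts concrete, the following sketch reproduces them with pandas alone. It is an approximation under stated assumptions: the `results` list is invented, "mean" is the only method shown, and the actual method may name or order its outputs differently.

```python
import numpy as np
import pandas as pd

# Two hypothetical simulation results covering two days of hourly data.
t = pd.date_range("2009-01-01", freq="h", periods=48)
results = [
    pd.DataFrame({"a": np.ones(48)}, index=t),
    pd.DataFrame({"a": np.arange(48.0)}, index=t),
]

# Without freq: one aggregated value per simulation.
per_run = pd.DataFrame({"aggregated_a": [df["a"].mean() for df in results]})

# With freq: aggregate inside daily bins, then put simulations on the rows.
binned = pd.DataFrame(
    {run: df["a"].resample("D").mean() for run, df in enumerate(results)}
).T

print(per_run)
print(binned)
```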

Examples

>>> import pandas as pd
>>> import numpy as np

>>> from corrai.base.parameter import Parameter
>>> from corrai.sampling import Sample

>>> sample = Sample(
...     parameters=[
...         Parameter("a", interval=(1, 10)),
...         Parameter("b", interval=(1, 10)),
...     ]
... )

>>> t = pd.date_range("2009-01-01", freq="h", periods=2)
>>> res_1 = pd.DataFrame({"a": [1, 2]}, index=t)
>>> res_2 = pd.DataFrame({"a": [3, 4]}, index=t)

>>> sample.add_samples(np.array([[1, 2], [3, 4]]), [res_1, res_2])

>>> # No frequency aggregation: one aggregated value per simulation
>>> sample.get_aggregated_time_series("a")
   aggregated_a
0           1.5
1           3.5

>>> # With frequency aggregation: one value per time bin per simulation
>>> ref = pd.Series(
...     [1, 1], index=pd.date_range("2009-01-01", freq="h", periods=2)
... )

>>> sample.get_aggregated_time_series(
...     indicator="a",
...     method="mean_absolute_error",
...     reference_time_series=ref,
...     freq="h",
... )
   2009-01-01 00:00:00  2009-01-01 01:00:00
0                  0.0                  1.0
1                  2.0                  3.0

get_static_results_as_df()[source]

get_score_df(indicator, reference_time_series, scoring_methods=None, resample_rule=None, resample_agg_method='mean')[source]

Compute scoring metrics for a given indicator across all sample results.

This method evaluates the performance of dynamic model predictions by comparing them against a reference time series. It supports multiple scoring metrics (R², NMBE, CV(RMSE), MAE, RMSE, max error) and optional resampling of data.

Parameters:

indicator (str) – Name of the indicator/variable to evaluate from the simulation results. Must be a valid column in the sample results DataFrames.
reference_time_series (pd.Series) – Ground truth or measured time series data to compare against.
scoring_methods (list of str or callable, optional) –
List of scoring methods to apply. Can be:
- String values from SCORE_MAP: "r2", "nmbe", "cv_rmse", "mae", "rmse", "max"
- Custom callable functions with signature func(y_true, y_pred) -> float
If None, all methods are used. Default is None.
resample_rule (str, pd.Timedelta or dt.timedelta, optional) – Resampling frequency for aggregating the time series data before scoring. Examples: "D" (daily), "h" (hourly), "ME" (month end). If None, no resampling is performed. Default is None.
resample_agg_method (str, optional) – Aggregation method to use when resampling. Common values include: "mean", "sum", "min", "max", "median". Default is "mean".
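
A custom scoring callable only needs to follow the func(y_true, y_pred) -> float signature given above. The sketch below defines a hypothetical `mean_bias_error` metric; the name and the metric itself are illustrative and are not part of SCORE_MAP.

```python
import numpy as np


def mean_bias_error(y_true, y_pred):
    """Hypothetical custom metric: average signed difference (prediction minus reference)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(y_pred - y_true))


# Standalone check on toy data.
score = mean_bias_error([1.0, 1.0], [3.0, 4.0])
print(score)
```

Such a function could presumably be mixed with string names, e.g. scoring_methods=["r2", mean_bias_error]. How the resulting column is labelled is not stated here, so inspect the returned DataFrame rather than relying on a particular name.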

Returns:

DataFrame containing scoring metrics for each sample.

Index: sample identifiers from self.results
Columns: metric names (e.g., "r2_score", "nmbe", "cv_rmse")
Values: computed metric values (float)

The DataFrame’s index name is set to the resampling rule or the inferred frequency of the reference time series.

Return type:

pd.DataFrame

Raises:

NotImplementedError – If the model is not dynamic (self.is_dynamic == False).

Notes

The scoring metrics available in SCORE_MAP are:

r2: R² score (coefficient of determination)
nmbe: Normalized Mean Bias Error
cv_rmse: Coefficient of Variation of Root Mean Squared Error
mae: Mean Absolute Error
rmse: Root Mean Squared Error
max: Maximum absolute error

When resampling is applied, both the predicted and reference time series are resampled using the same rule and aggregation method to ensure alignment.
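
The alignment described in this note can be sketched with pandas directly. The series below are invented, and the MAE is computed by hand for illustration rather than through the library's scoring code.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly reference and prediction covering two days.
t = pd.date_range("2023-01-01", periods=48, freq="h")
reference = pd.Series(np.ones(48), index=t)
predicted = pd.Series(np.full(48, 1.5), index=t)

# Same rule and same aggregation on both sides keeps the bins aligned.
ref_daily = reference.resample("D").sum()
pred_daily = predicted.resample("D").sum()

# Mean absolute error on the resampled (daily) values.
mae = float((pred_daily - ref_daily).abs().mean())
print(mae)
```

Note that a "sum" aggregation scales the error with the bin length, so scores obtained with different resampling rules are not directly comparable.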

Examples

Basic usage with default metrics:

>>> import pandas as pd
>>> import numpy as np
>>> # Assuming 'sample' is an instance of Sample class with results
>>> reference = pd.Series(
...     np.random.randn(100),
...     index=pd.date_range("2023-01-01", periods=100, freq="h"),
... )
>>> scores = sample.get_score_df(
...     indicator="temperature", reference_time_series=reference
... )
>>> print(scores)
            r2_score      nmbe   cv_rmse       mae      rmse       max
0    0.85234  0.012345  0.234567  1.234567  1.567890  3.456789
1    0.82156  0.023456  0.345678  1.345678  1.678901  3.567890
...

Using specific metrics and daily resampling:

>>> scores = sample.get_score_df(
...     indicator="Energy",
...     reference_time_series=reference,
...     scoring_methods=["r2", "rmse", "mae"],
...     resample_rule="D",
...     resample_agg_method="sum",
... )
>>> print(scores)
          r2_score      rmse       mae
D
0  0.91234  12.34567  10.12345
1  0.89123  13.45678  11.23456
...

See also

sklearn.metrics.r2_score: R² metric implementation
sklearn.metrics.mean_absolute_error: MAE metric implementation
sklearn.metrics.root_mean_squared_error: RMSE metric implementation

plot_hist(indicator, method='mean', unit='', agg_method_kwarg=None, reference_time_series=None, bins=30, colors='orange', reference_value=None, reference_label='Reference', show_rug=False, title=None)[source]

Plot histogram of aggregated results.

Parameters:

indicator (str) – Name of the indicator column to plot.
method (str, default="mean") – Aggregation method.
unit (str, optional) – Unit of the indicator.
agg_method_kwarg (dict, optional) – Additional kwargs for aggregation.
reference_time_series (Series, optional) – Reference time series.
bins (int, default=30) – Number of histogram bins.
colors (str, default="orange") – Color of the histogram.
reference_value (int or float, optional) – Adds a vertical dashed red line at this value, e.g. for comparison with an expected value.
reference_label (str, default="Reference") – Label of the reference value line displayed in the legend.
show_rug (bool, default=False) – If True, display rug plot below histogram.
title (str, optional) – Custom title.

Returns:

Plotly histogram figure.

Return type:

go.Figure

plot_sample(indicator, reference_timeseries=None, title=None, y_label=None, x_label=None, alpha=0.5, show_legends=False, round_ndigits=2, quantile_band=0.75, type_graph='area')[source]

Plot simulation results with different visualization modes.

This function allows visualization of multiple simulation samples, either as a scatter plot of all samples or as an aggregated area with min–max envelope, median, and quantile bands.

Only works for dynamic models.

Parameters:

indicator (str, optional) – Column name to extract if inner elements are DataFrames with multiple columns.
reference_timeseries (pandas.Series, optional) – A reference time series to plot alongside simulations (e.g., measured data).
title (str, optional) – Plot title.
y_label (str, optional) – Label for the y-axis.
x_label (str, optional) – Label for the x-axis.
alpha (float, default=0.5) – Opacity for scatter markers when type_graph="scatter".
show_legends (bool, default=False) – Whether to display legends for each individual sample trace when type_graph="scatter".
round_ndigits (int, default=2) – Number of digits for rounding parameter values in legend strings.
quantile_band (float, default=0.75) – Upper quantile to display when type_graph="area". Both (1 - quantile_band) and quantile_band are drawn as dotted lines, e.g. 0.75 → 25% and 75%.
type_graph ({"area", "scatter"}, default="area") –
Visualization mode:
- "scatter" : plot all samples individually as scatter markers.
- "area" : plot aggregated area with min–max envelope, median line, and quantile bands.
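
The statistics behind the "area" mode can be approximated as follows. This is a sketch assuming the envelope is computed per time step across samples; the `runs` table is made up, and the figure-building code itself is not reproduced.

```python
import pandas as pd

# Hypothetical table: one column per simulation run, one row per time step.
t = pd.date_range("2009-01-01", freq="h", periods=3)
runs = pd.DataFrame(
    {
        0: [1.0, 2.0, 3.0],
        1: [2.0, 3.0, 4.0],
        2: [3.0, 4.0, 5.0],
        3: [4.0, 5.0, 6.0],
        4: [5.0, 6.0, 7.0],
    },
    index=t,
)

quantile_band = 0.75
# Row-wise statistics across runs: envelope, median, and the two quantile lines.
envelope = pd.DataFrame(
    {
        "min": runs.min(axis=1),
        "lower": runs.quantile(1 - quantile_band, axis=1),
        "median": runs.median(axis=1),
        "upper": runs.quantile(quantile_band, axis=1),
        "max": runs.max(axis=1),
    }
)
print(envelope)
```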

Return type:

Figure

Examples

>>> # Assuming 'sample' is a Sample instance whose results contain a column "a"
>>> fig = sample.plot_sample("a", reference_timeseries=ref)
>>> fig.show()

>>> fig = sample.plot_sample("a", reference_timeseries=ref, type_graph="scatter")
>>> fig.show()

plot_pcp(indicators_configs, color_by=None, title='Parallel Coordinates — Samples', html_file_path=None)[source]

Produce an interactive parallel coordinates plot (PCP) that allows comparison of model parameters against aggregated indicators from simulation results. Supports both dynamic and static models.

For dynamic models, the specified indicators are aggregated across time using the provided functions (e.g., “mean”, “sum”, error metrics). For static models, the indicators are taken directly from the stored results.

Parameters:

indicators_configs (list of str or list of tuple) –
Configuration of indicators to include in the plot.
- For dynamic models, each element must be a tuple of the form: (indicator_name, method) or (indicator_name, method, reference_series).
  Here:
  - indicator_name : str Column name in the simulation results to aggregate.
  - method : str or Callable Aggregation function or metric to apply.
  - reference_series : pandas.Series, optional Reference time series required for error-based methods (e.g., mean absolute error).
- For static models, a simple list of indicator names (str) is sufficient.
color_by (str, optional) – Name of a parameter or result column to use for coloring the PCP lines. If None, all lines are plotted in the same color.
title (str, default="Parallel Coordinates — Samples") – Title of the plot.
html_file_path (str, optional) – If provided, saves the interactive plot as an HTML file at the specified path.
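
The two accepted shapes of indicators_configs might look like the sketch below. The indicator names "a" and "b" and the `ref` series are hypothetical, and the snippet only builds the configuration lists without calling plot_pcp.

```python
import pandas as pd

# Hypothetical reference series for an error-based method.
ref = pd.Series(
    [1.0, 1.0], index=pd.date_range("2009-01-01", freq="h", periods=2)
)

# Dynamic model: tuples of (indicator_name, method) or
# (indicator_name, method, reference_series).
dynamic_configs = [
    ("a", "mean"),
    ("a", "mean_absolute_error", ref),
]

# Static model: a plain list of indicator names.
static_configs = ["a", "b"]

print(len(dynamic_configs), len(static_configs))
```

With a populated dynamic Sample, the first list would then be passed as sample.plot_pcp(indicators_configs=dynamic_configs).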

Returns:

The generated parallel coordinates figure. The figure can be displayed interactively in a Jupyter notebook, web browser, or exported to HTML.

Return type:

plotly.graph_objects.Figure

Raises:

ValueError – If the indicators_configs are incompatible with the model type (dynamic vs static).