
Benchmarker

Bases: BaseBenchmark

Benchmarker for evaluating machine learning models with tuning strategies.

This class benchmarks combinations of machine learning models, tuning methods, and HPO techniques across multiple encodings, sampling strategies, and evaluation criteria. It supports training and evaluation workflows for different tasks and handles configurations for holdout or cross-validation tuning with threshold optimization.

Inherits
  • BaseBenchmark: Provides common benchmarking attributes.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `task` | `str` | Task for evaluation (`'pocketclosure'`, `'pocketclosureinf'`, `'improvement'`, or `'pdgrouprevaluation'`). | required |
| `learners` | `List[str]` | List of learners to benchmark (`'xgb'`, `'rf'`, `'lr'`, or `'mlp'`). | required |
| `tuning_methods` | `List[str]` | Tuning methods for each learner (`'holdout'`, `'cv'`). | required |
| `hpo_methods` | `List[str]` | HPO methods (`'hebo'` or `'rs'`). | required |
| `criteria` | `List[str]` | List of evaluation criteria (`'f1'`, `'macro_f1'`, `'brier_score'`). | required |
| `encodings` | `List[str]` | List of encodings (`'one_hot'` or `'target'`). | required |
| `sampling` | `Optional[List[str]]` | Sampling strategies for class imbalance: `None`, `'upsampling'`, `'downsampling'`, or `'smote'`. | `None` |
| `factor` | `Optional[float]` | Factor to apply during resampling. | `None` |
| `n_configs` | `int` | Number of configurations for hyperparameter tuning. | `10` |
| `n_jobs` | `int` | Number of parallel jobs for processing. | `1` |
| `cv_folds` | `Optional[int]` | Number of folds for cross-validation. | `10` |
| `racing_folds` | `Optional[int]` | Number of racing folds for random search (`'rs'`). | `None` |
| `test_seed` | `int` | Random seed for test splitting. | `0` |
| `test_size` | `float` | Proportion of data used for testing. | `0.2` |
| `val_size` | `Optional[float]` | Size of the validation set in holdout tuning. | `0.2` |
| `cv_seed` | `Optional[int]` | Random seed for cross-validation. | `0` |
| `mlp_flag` | `Optional[bool]` | Enables MLP training with early stopping. | `None` |
| `threshold_tuning` | `Optional[bool]` | Enables threshold tuning for binary classification. | `None` |
| `verbose` | `bool` | If `True`, enables detailed logging during benchmarking. | `True` |
| `path` | `Path` | Path to the directory containing processed data files. | `Path('data/processed/processed_data.csv')` |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `task` | `str` | The specified task for evaluation. |
| `learners` | `List[str]` | List of learners to evaluate. |
| `tuning_methods` | `List[str]` | Tuning methods for model evaluation. |
| `hpo_methods` | `List[str]` | HPO methods for hyperparameter tuning. |
| `criteria` | `List[str]` | List of evaluation metrics. |
| `encodings` | `List[str]` | Encoding types for categorical features. |
| `sampling` | `List[str]` | Resampling strategies for class balancing. |
| `factor` | `float` | Resampling factor for balancing. |
| `n_configs` | `int` | Number of configurations for hyperparameter tuning. |
| `n_jobs` | `int` | Number of parallel jobs for model training. |
| `cv_folds` | `int` | Number of cross-validation folds. |
| `racing_folds` | `int` | Number of racing folds for random search. |
| `test_seed` | `int` | Seed for reproducible train-test splits. |
| `test_size` | `float` | Size of the test split. |
| `val_size` | `float` | Size of the validation split in holdout tuning. |
| `cv_seed` | `int` | Seed for cross-validation splits. |
| `mlp_flag` | `bool` | Indicates if MLP training with early stopping is used. |
| `threshold_tuning` | `bool` | Enables threshold tuning for binary classification. |
| `verbose` | `bool` | Enables detailed logging during benchmarking. |
| `path` | `Path` | Directory path for processed data. |
| `name` | `str` | File name for processed data. |
| `data_cache` | `dict` | Cached data for different task and encoding combinations. |

Methods:

| Name | Description |
| --- | --- |
| `run_benchmarks` | Executes benchmarks for all combinations of learners, tuning methods, HPO, criteria, encodings, and sampling strategies, and returns a DataFrame summary and a dictionary of top models. |

Example

```
benchmarker = Benchmarker(
    task="pocketclosure",
    learners=["xgb", "rf"],
    tuning_methods=["holdout", "cv"],
    hpo_methods=["hebo", "rs"],
    criteria=["f1", "brier_score"],
    encodings=["one_hot", "target"],
    sampling=["upsampling", "downsampling"],
    factor=1.5,
    n_configs=20,
    n_jobs=4,
    cv_folds=5,
    test_seed=42,
    test_size=0.2,
    verbose=True,
    path="/data/processed/processed_data.csv",
)

# Running all benchmarks
results_df, top_models = benchmarker.run_benchmarks()
print(results_df)
print(top_models)
```
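Building on the example above, here is a brief sketch of post-processing the two return values. The metric column name and the key filter follow the `metric_map` and key format used in `run_benchmarks`; adjust them to the criteria you actually benchmarked.

```
# Sketch: inspect the summary DataFrame and retrieve a stored top model.
# Assumes `results_df` and `top_models` from the example above.
top_by_f1 = results_df.sort_values("F1 Score", ascending=False).head()
print(top_by_f1[["Learner", "Tuning", "HPO", "Sampling", "F1 Score"]])

# Keys encode task, learner, tuning, HPO, criterion, encoding, sampling,
# factor, rank, and score; pick the rank-1 model for the 'f1' criterion.
rank1_key = next(k for k in top_models if "_f1_" in k and "rank1" in k)
best_model = top_models[rank1_key]
```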
Source code in periomod/benchmarking/_benchmark.py
class Benchmarker(BaseBenchmark):
    """Benchmarker for evaluating machine learning models with tuning strategies.

    This class benchmarks combinations of machine learning models, tuning
    methods, and HPO techniques across multiple encodings, sampling
    strategies, and evaluation criteria. It supports training and evaluation
    workflows for different tasks and handles configurations for holdout or
    cross-validation tuning with threshold optimization.

    Inherits:
        - `BaseBenchmark`: Provides common benchmarking attributes.

    Args:
        task (str): Task for evaluation ('pocketclosure', 'pocketclosureinf',
            'improvement', or 'pdgrouprevaluation').
        learners (List[str]): List of learners to benchmark ('xgb', 'rf', 'lr' or
            'mlp').
        tuning_methods (List[str]): Tuning methods for each learner ('holdout',
            'cv').
        hpo_methods (List[str]): HPO methods ('hebo' or 'rs').
        criteria (List[str]): List of evaluation criteria ('f1', 'macro_f1',
            'brier_score').
        encodings (List[str]): List of encodings ('one_hot' or 'target').
        sampling (Optional[List[str]]): Sampling strategies for class imbalance.
            Includes None, 'upsampling', 'downsampling', and 'smote'.
        factor (Optional[float]): Factor to apply during resampling.
        n_configs (int): Number of configurations for hyperparameter tuning.
            Defaults to 10.
        n_jobs (int): Number of parallel jobs for processing. Defaults to 1.
        cv_folds (Optional[int]): Number of folds for cross-validation.
            Defaults to 10.
        racing_folds (Optional[int]): Number of racing folds for Random Search (RS).
            Defaults to None.
        test_seed (int): Random seed for test splitting. Defaults to 0.
        test_size (float): Proportion of data used for testing. Defaults to
            0.2.
        val_size (Optional[float]): Size of validation set in holdout tuning.
            Defaults to 0.2.
        cv_seed (Optional[int]): Random seed for cross-validation. Defaults to 0.
        mlp_flag (Optional[bool]): Enables MLP training with early stopping.
            Defaults to None.
        threshold_tuning (Optional[bool]): Enables threshold tuning for binary
            classification. Defaults to None.
        verbose (bool): If True, enables detailed logging during benchmarking.
            Defaults to True.
        path (Path): Path to the directory containing processed data files.
            Defaults to Path("data/processed/processed_data.csv").

    Attributes:
        task (str): The specified task for evaluation.
        learners (List[str]): List of learners to evaluate.
        tuning_methods (List[str]): Tuning methods for model evaluation.
        hpo_methods (List[str]): HPO methods for hyperparameter tuning.
        criteria (List[str]): List of evaluation metrics.
        encodings (List[str]): Encoding types for categorical features.
        sampling (List[str]): Resampling strategies for class balancing.
        factor (float): Resampling factor for balancing.
        n_configs (int): Number of configurations for hyperparameter tuning.
        n_jobs (int): Number of parallel jobs for model training.
        cv_folds (int): Number of cross-validation folds.
        racing_folds (int): Number of racing folds for random search.
        test_seed (int): Seed for reproducible train-test splits.
        test_size (float): Size of the test split.
        val_size (float): Size of the validation split in holdout tuning.
        cv_seed (int): Seed for cross-validation splits.
        mlp_flag (bool): Indicates if MLP training with early stopping is used.
        threshold_tuning (bool): Enables threshold tuning for binary classification.
        verbose (bool): Enables detailed logging during benchmarking.
        path (Path): Directory path for processed data.
        name (str): File name for processed data.
        data_cache (dict): Cached data for different task and encoding combinations.

    Methods:
        run_benchmarks: Executes benchmarks for all combinations of learners,
            tuning methods, HPO, criteria, encodings, and sampling strategies,
            and returns a DataFrame summary and a dictionary of top models.

    Example:
        ```
        benchmarker = Benchmarker(
            task="pocketclosure",
            learners=["xgb", "rf"],
            tuning_methods=["holdout", "cv"],
            hpo_methods=["hebo", "rs"],
            criteria=["f1", "brier_score"],
            encodings=["one_hot", "target"],
            sampling=["upsampling", "downsampling"],
            factor=1.5,
            n_configs=20,
            n_jobs=4,
            cv_folds=5,
            test_seed=42,
            test_size=0.2,
            verbose=True,
            path="/data/processed/processed_data.csv",
        )

        # Running all benchmarks
        results_df, top_models = benchmarker.run_benchmarks()
        print(results_df)
        print(top_models)
        ```
    """

    def __init__(
        self,
        task: str,
        learners: List[str],
        tuning_methods: List[str],
        hpo_methods: List[str],
        criteria: List[str],
        encodings: List[str],
        sampling: Optional[List[Union[str, None]]] = None,
        factor: Optional[float] = None,
        n_configs: int = 10,
        n_jobs: int = 1,
        cv_folds: Optional[int] = 10,
        racing_folds: Optional[int] = None,
        test_seed: int = 0,
        test_size: float = 0.2,
        val_size: Optional[float] = 0.2,
        cv_seed: Optional[int] = 0,
        mlp_flag: Optional[bool] = None,
        threshold_tuning: Optional[bool] = None,
        verbose: bool = True,
        path: Path = Path("data/processed/processed_data.csv"),
    ) -> None:
        """Initialize the Experiment with different tasks, learners, etc.

        Args:
            task (str): Task for evaluation ('pocketclosure', 'pocketclosureinf',
                'improvement', or 'pdgrouprevaluation').
            learners (List[str]): List of learners to benchmark ('xgb', 'rf', 'lr' or
                'mlp').
            tuning_methods (List[str]): Tuning methods for each learner ('holdout',
                'cv').
            hpo_methods (List[str]): HPO methods ('hebo' or 'rs').
            criteria (List[str]): List of evaluation criteria ('f1', 'macro_f1',
                'brier_score').
            encodings (List[str]): List of encodings ('one_hot' or 'target').
            sampling (Optional[List[str]]): Sampling strategies for class imbalance.
                Includes None, 'upsampling', 'downsampling', and 'smote'.
            factor (Optional[float]): Factor to apply during resampling.
            n_configs (int): Number of configurations for hyperparameter tuning.
                Defaults to 10.
            n_jobs (int): Number of parallel jobs for processing. Defaults to 1.
            cv_folds (Optional[int]): Number of folds for cross-validation.
                Defaults to 10.
            racing_folds (Optional[int]): Number of racing folds for Random Search (RS).
                Defaults to None.
            test_seed (int): Random seed for test splitting. Defaults to 0.
            test_size (float): Proportion of data used for testing. Defaults to
                0.2.
            val_size (Optional[float]): Size of validation set in holdout tuning.
                Defaults to 0.2.
            cv_seed (Optional[int]): Random seed for cross-validation. Defaults to 0.
            mlp_flag (Optional[bool]): Enables MLP training with early stopping.
                Defaults to None.
            threshold_tuning (Optional[bool]): Enables threshold tuning for binary
                classification. Defaults to None.
            verbose (bool): If True, enables detailed logging during benchmarking.
                Defaults to True.
            path (Path): Path to the directory containing processed data files.
                Defaults to Path("data/processed/processed_data.csv").
        """
        super().__init__(
            task=task,
            learners=learners,
            tuning_methods=tuning_methods,
            hpo_methods=hpo_methods,
            criteria=criteria,
            encodings=encodings,
            sampling=sampling,
            factor=factor,
            n_configs=n_configs,
            n_jobs=n_jobs,
            cv_folds=cv_folds,
            racing_folds=racing_folds,
            test_seed=test_seed,
            test_size=test_size,
            val_size=val_size,
            cv_seed=cv_seed,
            mlp_flag=mlp_flag,
            threshold_tuning=threshold_tuning,
            verbose=verbose,
            path=path,
        )
        self.data_cache = self._load_data_for_tasks()

    def _load_data_for_tasks(self) -> dict:
        """Load and transform data for each task and encoding combination once.

        Returns:
            dict: A dictionary containing transformed data for each task-encoding pair.
        """
        data_cache = {}
        for encoding in self.encodings:
            cache_key = encoding

            if cache_key not in data_cache:
                dataloader = ProcessedDataLoader(task=self.task, encoding=encoding)
                df = dataloader.load_data(path=self.path)
                transformed_df = dataloader.transform_data(df)
                data_cache[cache_key] = transformed_df

        return data_cache

    def run_benchmarks(self) -> Tuple[pd.DataFrame, dict]:
        """Benchmark all combinations of inputs.

        Returns:
            tuple: DataFrame summarizing the benchmark results with metrics for each
                configuration and dictionary mapping model keys to models for top
                configurations per criterion.

        Raises:
            KeyError: If an unknown criterion is encountered in `metric_map`.
        """
        results = []
        learners_dict = {}
        top_models_per_criterion: Dict[
            str, List[Tuple[float, object, str, str, str, str]]
        ] = {criterion: [] for criterion in self.criteria}

        metric_map = {
            "f1": "F1 Score",
            "brier_score": (
                "Multiclass Brier Score"
                if self.task == "pdgrouprevaluation"
                else "Brier Score"
            ),
            "macro_f1": "Macro F1",
        }

        for learner, tuning, hpo, criterion, encoding, sampling in itertools.product(
            self.learners,
            self.tuning_methods,
            self.hpo_methods,
            self.criteria,
            self.encodings,
            self.sampling or ["no_sampling"],
        ):
            if sampling is None:
                self.factor = None

            if (criterion == "macro_f1" and self.task != "pdgrouprevaluation") or (
                criterion == "f1" and self.task == "pdgrouprevaluation"
            ):
                print(f"Criterion '{criterion}' and task '{self.task}' not valid.")
                continue
            if self.verbose:
                print(
                    f"\nRunning benchmark for Task: {self.task}, Learner: {learner}, "
                    f"Tuning: {tuning}, HPO: {hpo}, Criterion: {criterion}, "
                    f"Sampling: {sampling}, Factor: {self.factor}."
                )
            df = self.data_cache[encoding]

            exp = Experiment(
                df=df,
                task=self.task,
                learner=learner,
                criterion=criterion,
                encoding=encoding,
                tuning=tuning,
                hpo=hpo,
                sampling=sampling,
                factor=self.factor,
                n_configs=self.n_configs,
                racing_folds=self.racing_folds,
                n_jobs=self.n_jobs,
                cv_folds=self.cv_folds,
                test_seed=self.test_seed,
                test_size=self.test_size,
                val_size=self.val_size,
                cv_seed=self.cv_seed,
                mlp_flag=self.mlp_flag,
                threshold_tuning=self.threshold_tuning,
                verbose=self.verbose,
            )

            try:
                result = exp.perform_evaluation()
                metrics = result["metrics"]
                trained_model = result["model"]

                unpacked_metrics = {
                    k: round(v, 4) if isinstance(v, float) else v
                    for k, v in metrics.items()
                }
                results.append(
                    {
                        "Task": self.task,
                        "Learner": learner,
                        "Tuning": tuning,
                        "HPO": hpo,
                        "Criterion": criterion,
                        "Sampling": sampling,
                        "Factor": self.factor,
                        **unpacked_metrics,
                    }
                )

                metric_key = metric_map.get(criterion)
                if metric_key is None:
                    raise KeyError(f"Unknown criterion '{criterion}'")

                criterion_value = metrics[metric_key]

                current_model_data = (
                    criterion_value,
                    trained_model,
                    learner,
                    tuning,
                    hpo,
                    encoding,
                )

                if len(top_models_per_criterion[criterion]) < 4:
                    top_models_per_criterion[criterion].append(current_model_data)
                else:
                    worst_model_idx = min(
                        range(len(top_models_per_criterion[criterion])),
                        key=lambda idx: (
                            top_models_per_criterion[criterion][idx][0]
                            if criterion != "brier_score"
                            else -top_models_per_criterion[criterion][idx][0]
                        ),
                    )
                    worst_model_score = top_models_per_criterion[criterion][
                        worst_model_idx
                    ][0]
                    if (
                        criterion != "brier_score"
                        and criterion_value > worst_model_score
                    ) or (
                        criterion == "brier_score"
                        and criterion_value < worst_model_score
                    ):
                        top_models_per_criterion[criterion][
                            worst_model_idx
                        ] = current_model_data

            except Exception as e:
                error_message = str(e)
                if (
                    "Matrix not positive definite after repeatedly adding jitter"
                    in error_message
                    or "elements of the" in error_message
                    and "are NaN" in error_message
                    or "cholesky_cpu" in error_message
                ):
                    print(
                        f"Suppressed NotPSDError for {self.task}, {learner} due to"
                        f"convergence issue \n"
                    )
                else:
                    print(
                        f"Error running benchmark for {self.task}, {learner}: "
                        f"{error_message}\n"
                    )
                    traceback.print_exc()

        for criterion, models in top_models_per_criterion.items():
            sorted_models = sorted(
                models, key=lambda x: -x[0] if criterion != "brier_score" else x[0]
            )
            for idx, (score, model, learner, tuning, hpo, encoding) in enumerate(
                sorted_models
            ):
                learners_dict_key = (
                    f"{self.task}_{learner}_{tuning}_{hpo}_{criterion}_{encoding}_"
                    f"{sampling or 'no_sampling'}_factor{self.factor}_rank{idx+1}_"
                    f"score{round(score, 4)}"
                )
                learners_dict[learners_dict_key] = model

        df_results = pd.DataFrame(results)
        pd.set_option("display.max_columns", None, "display.width", 1000)

        if self.verbose:
            print(f"\nBenchmark Results Summary:\n{df_results}")

        return df_results, learners_dict

__init__(task, learners, tuning_methods, hpo_methods, criteria, encodings, sampling=None, factor=None, n_configs=10, n_jobs=1, cv_folds=10, racing_folds=None, test_seed=0, test_size=0.2, val_size=0.2, cv_seed=0, mlp_flag=None, threshold_tuning=None, verbose=True, path=Path('data/processed/processed_data.csv'))

Initialize the Benchmarker with different tasks, learners, etc.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `task` | `str` | Task for evaluation (`'pocketclosure'`, `'pocketclosureinf'`, `'improvement'`, or `'pdgrouprevaluation'`). | required |
| `learners` | `List[str]` | List of learners to benchmark (`'xgb'`, `'rf'`, `'lr'`, or `'mlp'`). | required |
| `tuning_methods` | `List[str]` | Tuning methods for each learner (`'holdout'`, `'cv'`). | required |
| `hpo_methods` | `List[str]` | HPO methods (`'hebo'` or `'rs'`). | required |
| `criteria` | `List[str]` | List of evaluation criteria (`'f1'`, `'macro_f1'`, `'brier_score'`). | required |
| `encodings` | `List[str]` | List of encodings (`'one_hot'` or `'target'`). | required |
| `sampling` | `Optional[List[str]]` | Sampling strategies for class imbalance: `None`, `'upsampling'`, `'downsampling'`, or `'smote'`. | `None` |
| `factor` | `Optional[float]` | Factor to apply during resampling. | `None` |
| `n_configs` | `int` | Number of configurations for hyperparameter tuning. | `10` |
| `n_jobs` | `int` | Number of parallel jobs for processing. | `1` |
| `cv_folds` | `Optional[int]` | Number of folds for cross-validation. | `10` |
| `racing_folds` | `Optional[int]` | Number of racing folds for random search (`'rs'`). | `None` |
| `test_seed` | `int` | Random seed for test splitting. | `0` |
| `test_size` | `float` | Proportion of data used for testing. | `0.2` |
| `val_size` | `Optional[float]` | Size of the validation set in holdout tuning. | `0.2` |
| `cv_seed` | `Optional[int]` | Random seed for cross-validation. | `0` |
| `mlp_flag` | `Optional[bool]` | Enables MLP training with early stopping. | `None` |
| `threshold_tuning` | `Optional[bool]` | Enables threshold tuning for binary classification. | `None` |
| `verbose` | `bool` | If `True`, enables detailed logging during benchmarking. | `True` |
| `path` | `Path` | Path to the directory containing processed data files. | `Path('data/processed/processed_data.csv')` |
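For orientation, a minimal construction that relies on the documented defaults; the import path is assumed from the module location shown on this page (`periomod/benchmarking/_benchmark.py`) and may differ in the installed package.

```
from pathlib import Path

from periomod.benchmarking import Benchmarker  # import path assumed

# Minimal setup; defaults cover n_configs=10, cv_folds=10, test_size=0.2, etc.
benchmarker = Benchmarker(
    task="pocketclosure",
    learners=["lr"],
    tuning_methods=["holdout"],
    hpo_methods=["rs"],
    criteria=["f1"],
    encodings=["one_hot"],
    path=Path("data/processed/processed_data.csv"),
)
```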
Source code in periomod/benchmarking/_benchmark.py
def __init__(
    self,
    task: str,
    learners: List[str],
    tuning_methods: List[str],
    hpo_methods: List[str],
    criteria: List[str],
    encodings: List[str],
    sampling: Optional[List[Union[str, None]]] = None,
    factor: Optional[float] = None,
    n_configs: int = 10,
    n_jobs: int = 1,
    cv_folds: Optional[int] = 10,
    racing_folds: Optional[int] = None,
    test_seed: int = 0,
    test_size: float = 0.2,
    val_size: Optional[float] = 0.2,
    cv_seed: Optional[int] = 0,
    mlp_flag: Optional[bool] = None,
    threshold_tuning: Optional[bool] = None,
    verbose: bool = True,
    path: Path = Path("data/processed/processed_data.csv"),
) -> None:
    """Initialize the Experiment with different tasks, learners, etc.

    Args:
        task (str): Task for evaluation ('pocketclosure', 'pocketclosureinf',
            'improvement', or 'pdgrouprevaluation').
        learners (List[str]): List of learners to benchmark ('xgb', 'rf', 'lr' or
            'mlp').
        tuning_methods (List[str]): Tuning methods for each learner ('holdout',
            'cv').
        hpo_methods (List[str]): HPO methods ('hebo' or 'rs').
        criteria (List[str]): List of evaluation criteria ('f1', 'macro_f1',
            'brier_score').
        encodings (List[str]): List of encodings ('one_hot' or 'target').
        sampling (Optional[List[str]]): Sampling strategies for class imbalance.
            Includes None, 'upsampling', 'downsampling', and 'smote'.
        factor (Optional[float]): Factor to apply during resampling.
        n_configs (int): Number of configurations for hyperparameter tuning.
            Defaults to 10.
        n_jobs (int): Number of parallel jobs for processing. Defaults to 1.
        cv_folds (Optional[int]): Number of folds for cross-validation.
            Defaults to 10.
        racing_folds (Optional[int]): Number of racing folds for Random Search (RS).
            Defaults to None.
        test_seed (int): Random seed for test splitting. Defaults to 0.
        test_size (float): Proportion of data used for testing. Defaults to
            0.2.
        val_size (Optional[float]): Size of validation set in holdout tuning.
            Defaults to 0.2.
        cv_seed (Optional[int]): Random seed for cross-validation. Defaults to 0.
        mlp_flag (Optional[bool]): Enables MLP training with early stopping.
            Defaults to None.
        threshold_tuning (Optional[bool]): Enables threshold tuning for binary
            classification. Defaults to None.
        verbose (bool): If True, enables detailed logging during benchmarking.
            Defaults to True.
        path (Path): Path to the directory containing processed data files.
            Defaults to Path("data/processed/processed_data.csv").
    """
    super().__init__(
        task=task,
        learners=learners,
        tuning_methods=tuning_methods,
        hpo_methods=hpo_methods,
        criteria=criteria,
        encodings=encodings,
        sampling=sampling,
        factor=factor,
        n_configs=n_configs,
        n_jobs=n_jobs,
        cv_folds=cv_folds,
        racing_folds=racing_folds,
        test_seed=test_seed,
        test_size=test_size,
        val_size=val_size,
        cv_seed=cv_seed,
        mlp_flag=mlp_flag,
        threshold_tuning=threshold_tuning,
        verbose=verbose,
        path=path,
    )
    self.data_cache = self._load_data_for_tasks()
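Because `__init__` ends by populating `data_cache` via `_load_data_for_tasks`, the transformed data can be inspected immediately after construction. A small sketch, assuming the `benchmarker` instance from the example above:

```
# data_cache holds one transformed DataFrame per requested encoding.
for encoding, frame in benchmarker.data_cache.items():
    print(encoding, frame.shape)
```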

run_benchmarks()

Benchmark all combinations of inputs.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `tuple` | `Tuple[DataFrame, dict]` | DataFrame summarizing the benchmark results with metrics for each configuration, and a dictionary mapping model keys to models for the top configurations per criterion. |

Raises:

| Type | Description |
| --- | --- |
| `KeyError` | If an unknown criterion is encountered in `metric_map`. |
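For reference, the method keeps at most four models per criterion and replaces the current worst entry whenever a new score beats it (higher is better, except for `brier_score`, where lower is better). A standalone sketch of that bookkeeping rule, not part of the library API:

```
from typing import List, Tuple


def update_top_models(
    top: List[Tuple[float, object]],
    score: float,
    model: object,
    criterion: str,
    limit: int = 4,
) -> None:
    """Keep the `limit` best (score, model) pairs for a criterion, in place."""
    higher_is_better = criterion != "brier_score"
    if len(top) < limit:
        top.append((score, model))
        return
    # Index of the currently worst entry under this criterion's ordering.
    worst_idx = min(
        range(len(top)),
        key=lambda i: top[i][0] if higher_is_better else -top[i][0],
    )
    if (higher_is_better and score > top[worst_idx][0]) or (
        not higher_is_better and score < top[worst_idx][0]
    ):
        top[worst_idx] = (score, model)
```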

Source code in periomod/benchmarking/_benchmark.py
def run_benchmarks(self) -> Tuple[pd.DataFrame, dict]:
    """Benchmark all combinations of inputs.

    Returns:
        tuple: DataFrame summarizing the benchmark results with metrics for each
            configuration and dictionary mapping model keys to models for top
            configurations per criterion.

    Raises:
        KeyError: If an unknown criterion is encountered in `metric_map`.
    """
    results = []
    learners_dict = {}
    top_models_per_criterion: Dict[
        str, List[Tuple[float, object, str, str, str, str]]
    ] = {criterion: [] for criterion in self.criteria}

    metric_map = {
        "f1": "F1 Score",
        "brier_score": (
            "Multiclass Brier Score"
            if self.task == "pdgrouprevaluation"
            else "Brier Score"
        ),
        "macro_f1": "Macro F1",
    }

    for learner, tuning, hpo, criterion, encoding, sampling in itertools.product(
        self.learners,
        self.tuning_methods,
        self.hpo_methods,
        self.criteria,
        self.encodings,
        self.sampling or ["no_sampling"],
    ):
        if sampling is None:
            self.factor = None

        if (criterion == "macro_f1" and self.task != "pdgrouprevaluation") or (
            criterion == "f1" and self.task == "pdgrouprevaluation"
        ):
            print(f"Criterion '{criterion}' and task '{self.task}' not valid.")
            continue
        if self.verbose:
            print(
                f"\nRunning benchmark for Task: {self.task}, Learner: {learner}, "
                f"Tuning: {tuning}, HPO: {hpo}, Criterion: {criterion}, "
                f"Sampling: {sampling}, Factor: {self.factor}."
            )
        df = self.data_cache[encoding]

        exp = Experiment(
            df=df,
            task=self.task,
            learner=learner,
            criterion=criterion,
            encoding=encoding,
            tuning=tuning,
            hpo=hpo,
            sampling=sampling,
            factor=self.factor,
            n_configs=self.n_configs,
            racing_folds=self.racing_folds,
            n_jobs=self.n_jobs,
            cv_folds=self.cv_folds,
            test_seed=self.test_seed,
            test_size=self.test_size,
            val_size=self.val_size,
            cv_seed=self.cv_seed,
            mlp_flag=self.mlp_flag,
            threshold_tuning=self.threshold_tuning,
            verbose=self.verbose,
        )

        try:
            result = exp.perform_evaluation()
            metrics = result["metrics"]
            trained_model = result["model"]

            unpacked_metrics = {
                k: round(v, 4) if isinstance(v, float) else v
                for k, v in metrics.items()
            }
            results.append(
                {
                    "Task": self.task,
                    "Learner": learner,
                    "Tuning": tuning,
                    "HPO": hpo,
                    "Criterion": criterion,
                    "Sampling": sampling,
                    "Factor": self.factor,
                    **unpacked_metrics,
                }
            )

            metric_key = metric_map.get(criterion)
            if metric_key is None:
                raise KeyError(f"Unknown criterion '{criterion}'")

            criterion_value = metrics[metric_key]

            current_model_data = (
                criterion_value,
                trained_model,
                learner,
                tuning,
                hpo,
                encoding,
            )

            if len(top_models_per_criterion[criterion]) < 4:
                top_models_per_criterion[criterion].append(current_model_data)
            else:
                worst_model_idx = min(
                    range(len(top_models_per_criterion[criterion])),
                    key=lambda idx: (
                        top_models_per_criterion[criterion][idx][0]
                        if criterion != "brier_score"
                        else -top_models_per_criterion[criterion][idx][0]
                    ),
                )
                worst_model_score = top_models_per_criterion[criterion][
                    worst_model_idx
                ][0]
                if (
                    criterion != "brier_score"
                    and criterion_value > worst_model_score
                ) or (
                    criterion == "brier_score"
                    and criterion_value < worst_model_score
                ):
                    top_models_per_criterion[criterion][
                        worst_model_idx
                    ] = current_model_data

        except Exception as e:
            error_message = str(e)
            if (
                "Matrix not positive definite after repeatedly adding jitter"
                in error_message
                or "elements of the" in error_message
                and "are NaN" in error_message
                or "cholesky_cpu" in error_message
            ):
                print(
                    f"Suppressed NotPSDError for {self.task}, {learner} due to"
                    f"convergence issue \n"
                )
            else:
                print(
                    f"Error running benchmark for {self.task}, {learner}: "
                    f"{error_message}\n"
                )
                traceback.print_exc()

    for criterion, models in top_models_per_criterion.items():
        sorted_models = sorted(
            models, key=lambda x: -x[0] if criterion != "brier_score" else x[0]
        )
        for idx, (score, model, learner, tuning, hpo, encoding) in enumerate(
            sorted_models
        ):
            learners_dict_key = (
                f"{self.task}_{learner}_{tuning}_{hpo}_{criterion}_{encoding}_"
                f"{sampling or 'no_sampling'}_factor{self.factor}_rank{idx+1}_"
                f"score{round(score, 4)}"
            )
            learners_dict[learners_dict_key] = model

    df_results = pd.DataFrame(results)
    pd.set_option("display.max_columns", None, "display.width", 1000)

    if self.verbose:
        print(f"\nBenchmark Results Summary:\n{df_results}")

    return df_results, learners_dict