
Benchmarker

Bases: BaseBenchmark

Benchmarker for evaluating machine learning models with tuning strategies.

This class benchmarks combinations of machine learning models, tuning methods, and HPO techniques across multiple encodings, sampling strategies, and evaluation criteria. It supports training and evaluation workflows for different tasks and handles configurations for holdout or cross-validation tuning with threshold optimization.

Inherits
  • BaseBenchmark: Provides common benchmarking attributes.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `task` | `str` | Task for evaluation (`'pocketclosure'`, `'pocketclosureinf'`, `'improvement'`, or `'pdgrouprevaluation'`). | required |
| `learners` | `List[str]` | List of learners to benchmark (`'xgb'`, `'rf'`, `'lr'`, or `'mlp'`). | required |
| `tuning_methods` | `List[str]` | Tuning methods for each learner (`'holdout'`, `'cv'`). | required |
| `hpo_methods` | `List[str]` | HPO methods (`'hebo'` or `'rs'`). | required |
| `criteria` | `List[str]` | List of evaluation criteria (`'f1'`, `'macro_f1'`, `'brier_score'`). | required |
| `encodings` | `List[str]` | List of encodings (`'one_hot'` or `'target'`). | required |
| `sampling` | `Optional[List[str]]` | Sampling strategies for class imbalance: `None`, `'upsampling'`, `'downsampling'`, or `'smote'`. | `None` |
| `factor` | `Optional[float]` | Factor to apply during resampling. | `None` |
| `n_configs` | `int` | Number of configurations for hyperparameter tuning. | `10` |
| `n_jobs` | `int` | Number of parallel jobs for processing. | `1` |
| `cv_folds` | `Optional[int]` | Number of folds for cross-validation. | `10` |
| `racing_folds` | `Optional[int]` | Number of racing folds for random search (`'rs'`). | `None` |
| `test_seed` | `int` | Random seed for test splitting. | `0` |
| `test_size` | `float` | Proportion of data used for testing. | `0.2` |
| `val_size` | `Optional[float]` | Size of the validation set in holdout tuning. | `0.2` |
| `cv_seed` | `Optional[int]` | Random seed for cross-validation. | `0` |
| `mlp_flag` | `Optional[bool]` | Enables MLP training with early stopping. | `None` |
| `threshold_tuning` | `Optional[bool]` | Enables threshold tuning for binary classification. | `None` |
| `verbose` | `bool` | If `True`, enables detailed logging during benchmarking. | `True` |
| `path` | `Path` | Path to the directory containing processed data files. | `Path('data/processed/processed_data.csv')` |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `task` | `str` | The specified task for evaluation. |
| `learners` | `List[str]` | List of learners to evaluate. |
| `tuning_methods` | `List[str]` | Tuning methods for model evaluation. |
| `hpo_methods` | `List[str]` | HPO methods for hyperparameter tuning. |
| `criteria` | `List[str]` | List of evaluation metrics. |
| `encodings` | `List[str]` | Encoding types for categorical features. |
| `sampling` | `List[str]` | Resampling strategies for class balancing. |
| `factor` | `float` | Resampling factor for balancing. |
| `n_configs` | `int` | Number of configurations for hyperparameter tuning. |
| `n_jobs` | `int` | Number of parallel jobs for model training. |
| `cv_folds` | `int` | Number of cross-validation folds. |
| `racing_folds` | `int` | Number of racing folds for random search. |
| `test_seed` | `int` | Seed for reproducible train-test splits. |
| `test_size` | `float` | Size of the test split. |
| `val_size` | `float` | Size of the validation split in holdout tuning. |
| `cv_seed` | `int` | Seed for cross-validation splits. |
| `mlp_flag` | `bool` | Indicates if MLP training with early stopping is used. |
| `threshold_tuning` | `bool` | Enables threshold tuning for binary classification. |
| `verbose` | `bool` | Enables detailed logging during benchmarking. |
| `path` | `Path` | Directory path for processed data. |
| `name` | `str` | File name for processed data. |
| `data_cache` | `dict` | Cached data for different task and encoding combinations. |

Methods:

| Name | Description |
| --- | --- |
| `run_benchmarks` | Executes benchmarks for all combinations of learners, tuning methods, HPO, criteria, encodings, and sampling strategies, and returns a DataFrame summary and a dictionary of top models. |

Example

```
benchmarker = Benchmarker(
    task="pocketclosure",
    learners=["xgb", "rf"],
    tuning_methods=["holdout", "cv"],
    hpo_methods=["hebo", "rs"],
    criteria=["f1", "brier_score"],
    encodings=["one_hot", "target"],
    sampling=["upsampling", "downsampling"],
    factor=1.5,
    n_configs=20,
    n_jobs=4,
    cv_folds=5,
    test_seed=42,
    test_size=0.2,
    verbose=True,
    path="/data/processed/processed_data.csv",
)

# Running all benchmarks
results_df, top_models = benchmarker.run_benchmarks()
print(results_df)
print(top_models)
```
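Building on the example above, here is a brief sketch of post-processing the two return values. The metric column name and the key filter follow the `metric_map` and key format used in `run_benchmarks`; adjust them to the criteria you actually benchmarked.

```
# Sketch: inspect the summary DataFrame and retrieve a stored top model.
# Assumes `results_df` and `top_models` from the example above.
top_by_f1 = results_df.sort_values("F1 Score", ascending=False).head()
print(top_by_f1[["Learner", "Tuning", "HPO", "Sampling", "F1 Score"]])

# Keys encode task, learner, tuning, HPO, criterion, encoding, sampling,
# factor, rank, and score; pick the rank-1 model for the 'f1' criterion.
rank1_key = next(k for k in top_models if "_f1_" in k and "rank1" in k)
best_model = top_models[rank1_key]
```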
Source code in periomod/benchmarking/_benchmark.py
class Benchmarker(BaseBenchmark):
    """Benchmarker for evaluating machine learning models with tuning strategies.

    This class benchmarks combinations of machine learning models, tuning
    methods, and HPO techniques across multiple encodings, sampling
    strategies, and evaluation criteria. It supports training and evaluation
    workflows for different tasks and handles configurations for holdout or
    cross-validation tuning with threshold optimization.

    Inherits:
        - `BaseBenchmark`: Provides common benchmarking attributes.

    Args:
        task (str): Task for evaluation ('pocketclosure', 'pocketclosureinf',
            'improvement', or 'pdgrouprevaluation').
        learners (List[str]): List of learners to benchmark ('xgb', 'rf', 'lr' or
            'mlp').
        tuning_methods (List[str]): Tuning methods for each learner ('holdout',
            'cv').
        hpo_methods (List[str]): HPO methods ('hebo' or 'rs').
        criteria (List[str]): List of evaluation criteria ('f1', 'macro_f1',
            'brier_score').
        encodings (List[str]): List of encodings ('one_hot' or 'target').
        sampling (Optional[List[str]]): Sampling strategies for class imbalance.
            Includes None, 'upsampling', 'downsampling', and 'smote'.
        factor (Optional[float]): Factor to apply during resampling.
        n_configs (int): Number of configurations for hyperparameter tuning.
            Defaults to 10.
        n_jobs (int): Number of parallel jobs for processing. Defaults to 1.
        cv_folds (Optional[int]): Number of folds for cross-validation.
            Defaults to 10.
        racing_folds (Optional[int]): Number of racing folds for Random Search (RS).
            Defaults to None.
        test_seed (int): Random seed for test splitting. Defaults to 0.
        test_size (float): Proportion of data used for testing. Defaults to
            0.2.
        val_size (Optional[float]): Size of validation set in holdout tuning.
            Defaults to 0.2.
        cv_seed (Optional[int]): Random seed for cross-validation. Defaults to 0.
        mlp_flag (Optional[bool]): Enables MLP training with early stopping.
            Defaults to None.
        threshold_tuning (Optional[bool]): Enables threshold tuning for binary
            classification. Defaults to None.
        verbose (bool): If True, enables detailed logging during benchmarking.
            Defaults to True.
        path (Path): Path to the directory containing processed data files.
            Defaults to Path("data/processed/processed_data.csv").

    Attributes:
        task (str): The specified task for evaluation.
        learners (List[str]): List of learners to evaluate.
        tuning_methods (List[str]): Tuning methods for model evaluation.
        hpo_methods (List[str]): HPO methods for hyperparameter tuning.
        criteria (List[str]): List of evaluation metrics.
        encodings (List[str]): Encoding types for categorical features.
        sampling (List[str]): Resampling strategies for class balancing.
        factor (float): Resampling factor for balancing.
        n_configs (int): Number of configurations for hyperparameter tuning.
        n_jobs (int): Number of parallel jobs for model training.
        cv_folds (int): Number of cross-validation folds.
        racing_folds (int): Number of racing folds for random search.
        test_seed (int): Seed for reproducible train-test splits.
        test_size (float): Size of the test split.
        val_size (float): Size of the validation split in holdout tuning.
        cv_seed (int): Seed for cross-validation splits.
        mlp_flag (bool): Indicates if MLP training with early stopping is used.
        threshold_tuning (bool): Enables threshold tuning for binary classification.
        verbose (bool): Enables detailed logging during benchmarking.
        path (Path): Directory path for processed data.
        name (str): File name for processed data.
        data_cache (dict): Cached data for different task and encoding combinations.

    Methods:
        run_benchmarks: Executes benchmarks for all combinations of learners,
            tuning methods, HPO, criteria, encodings, and sampling strategies,
            and returns a DataFrame summary and a dictionary of top models.

    Example:
        ```
        benchmarker = Benchmarker(
            task="pocketclosure",
            learners=["xgb", "rf"],
            tuning_methods=["holdout", "cv"],
            hpo_methods=["hebo", "rs"],
            criteria=["f1", "brier_score"],
            encodings=["one_hot", "target"],
            sampling=["upsampling", "downsampling"],
            factor=1.5,
            n_configs=20,
            n_jobs=4,
            cv_folds=5,
            test_seed=42,
            test_size=0.2,
            verbose=True,
            path="/data/processed/processed_data.csv",
        )

        # Running all benchmarks
        results_df, top_models = benchmarker.run_benchmarks()
        print(results_df)
        print(top_models)
        ```
    """

    def __init__(
        self,
        task: str,
        learners: List[str],
        tuning_methods: List[str],
        hpo_methods: List[str],
        criteria: List[str],
        encodings: List[str],
        sampling: Optional[List[Union[str, None]]] = None,
        factor: Optional[float] = None,
        n_configs: int = 10,
        n_jobs: int = 1,
        cv_folds: Optional[int] = 10,
        racing_folds: Optional[int] = None,
        test_seed: int = 0,
        test_size: float = 0.2,
        val_size: Optional[float] = 0.2,
        cv_seed: Optional[int] = 0,
        mlp_flag: Optional[bool] = None,
        threshold_tuning: Optional[bool] = None,
        verbose: bool = True,
        path: Path = Path("data/processed/processed_data.csv"),
    ) -> None:
        """Initialize the Experiment with different tasks, learners, etc.

        Args:
            task (str): Task for evaluation ('pocketclosure', 'pocketclosureinf',
                'improvement', or 'pdgrouprevaluation').
            learners (List[str]): List of learners to benchmark ('xgb', 'rf', 'lr' or
                'mlp').
            tuning_methods (List[str]): Tuning methods for each learner ('holdout',
                'cv').
            hpo_methods (List[str]): HPO methods ('hebo' or 'rs').
            criteria (List[str]): List of evaluation criteria ('f1', 'macro_f1',
                'brier_score').
            encodings (List[str]): List of encodings ('one_hot' or 'target').
            sampling (Optional[List[str]]): Sampling strategies for class imbalance.
                Includes None, 'upsampling', 'downsampling', and 'smote'.
            factor (Optional[float]): Factor to apply during resampling.
            n_configs (int): Number of configurations for hyperparameter tuning.
                Defaults to 10.
            n_jobs (int): Number of parallel jobs for processing. Defaults to 1.
            cv_folds (Optional[int]): Number of folds for cross-validation.
                Defaults to 10.
            racing_folds (Optional[int]): Number of racing folds for Random Search (RS).
                Defaults to None.
            test_seed (int): Random seed for test splitting. Defaults to 0.
            test_size (float): Proportion of data used for testing. Defaults to
                0.2.
            val_size (Optional[float]): Size of validation set in holdout tuning.
                Defaults to 0.2.
            cv_seed (Optional[int]): Random seed for cross-validation. Defaults to 0.
            mlp_flag (Optional[bool]): Enables MLP training with early stopping.
                Defaults to None.
            threshold_tuning (Optional[bool]): Enables threshold tuning for binary
                classification. Defaults to None.
            verbose (bool): If True, enables detailed logging during benchmarking.
                Defaults to True.
            path (Path): Path to the directory containing processed data files.
                Defaults to Path("data/processed/processed_data.csv").
        """
        super().__init__(
            task=task,
            learners=learners,
            tuning_methods=tuning_methods,
            hpo_methods=hpo_methods,
            criteria=criteria,
            encodings=encodings,
            sampling=sampling,
            factor=factor,
            n_configs=n_configs,
            n_jobs=n_jobs,
            cv_folds=cv_folds,
            racing_folds=racing_folds,
            test_seed=test_seed,
            test_size=test_size,
            val_size=val_size,
            cv_seed=cv_seed,
            mlp_flag=mlp_flag,
            threshold_tuning=threshold_tuning,
            verbose=verbose,
            path=path,
        )
        self.data_cache = self._load_data_for_tasks()

    def _load_data_for_tasks(self) -> dict:
        """Load and transform data for each task and encoding combination once.

        Returns:
            dict: A dictionary containing transformed data for each task-encoding pair.
        """
        data_cache = {}
        for encoding in self.encodings:
            cache_key = encoding

            if cache_key not in data_cache:
                dataloader = ProcessedDataLoader(task=self.task, encoding=encoding)
                df = dataloader.load_data(path=self.path)
                transformed_df = dataloader.transform_data(df)
                data_cache[cache_key] = transformed_df

        return data_cache

    def run_benchmarks(self) -> Tuple[pd.DataFrame, dict]:
        """Benchmark all combinations of inputs.

        Returns:
            tuple: DataFrame summarizing the benchmark results with metrics for each
                configuration and dictionary mapping model keys to models for top
                configurations per criterion.

        Raises:
            KeyError: If an unknown criterion is encountered in `metric_map`.
        """
        results = []
        learners_dict = {}
        top_models_per_criterion: Dict[
            str, List[Tuple[float, object, str, str, str, str]]
        ] = {criterion: [] for criterion in self.criteria}

        metric_map = {
            "f1": "F1 Score",
            "brier_score": (
                "Multiclass Brier Score"
                if self.task == "pdgrouprevaluation"
                else "Brier Score"
            ),
            "macro_f1": "Macro F1",
        }

        for learner, tuning, hpo, criterion, encoding, sampling in itertools.product(
            self.learners,
            self.tuning_methods,
            self.hpo_methods,
            self.criteria,
            self.encodings,
            self.sampling or ["no_sampling"],
        ):
            if sampling is None:
                self.factor = None

            if (criterion == "macro_f1" and self.task != "pdgrouprevaluation") or (
                criterion == "f1" and self.task == "pdgrouprevaluation"
            ):
                print(f"Criterion '{criterion}' and task '{self.task}' not valid.")
                continue
            if self.verbose:
                print(
                    f"\nRunning benchmark for Task: {self.task}, Learner: {learner}, "
                    f"Tuning: {tuning}, HPO: {hpo}, Criterion: {criterion}, "
                    f"Sampling: {sampling}, Factor: {self.factor}."
                )
            df = self.data_cache[encoding]

            exp = Experiment(
                df=df,
                task=self.task,
                learner=learner,
                criterion=criterion,
                encoding=encoding,
                tuning=tuning,
                hpo=hpo,
                sampling=sampling,
                factor=self.factor,
                n_configs=self.n_configs,
                racing_folds=self.racing_folds,
                n_jobs=self.n_jobs,
                cv_folds=self.cv_folds,
                test_seed=self.test_seed,
                test_size=self.test_size,
                val_size=self.val_size,
                cv_seed=self.cv_seed,
                mlp_flag=self.mlp_flag,
                threshold_tuning=self.threshold_tuning,
                verbose=self.verbose,
            )

            try:
                result = exp.perform_evaluation()
                metrics = result["metrics"]
                trained_model = result["model"]

                unpacked_metrics = {
                    k: round(v, 4) if isinstance(v, float) else v
                    for k, v in metrics.items()
                }
                results.append(
                    {
                        "Task": self.task,
                        "Learner": learner,
                        "Tuning": tuning,
                        "HPO": hpo,
                        "Criterion": criterion,
                        "Sampling": sampling,
                        "Factor": self.factor,
                        **unpacked_metrics,
                    }
                )

                metric_key = metric_map.get(criterion)
                if metric_key is None:
                    raise KeyError(f"Unknown criterion '{criterion}'")

                criterion_value = metrics[metric_key]

                current_model_data = (
                    criterion_value,
                    trained_model,
                    learner,
                    tuning,
                    hpo,
                    encoding,
                )

                if len(top_models_per_criterion[criterion]) < 4:
                    top_models_per_criterion[criterion].append(current_model_data)
                else:
                    worst_model_idx = min(
                        range(len(top_models_per_criterion[criterion])),
                        key=lambda idx: (
                            top_models_per_criterion[criterion][idx][0]
                            if criterion != "brier_score"
                            else -top_models_per_criterion[criterion][idx][0]
                        ),
                    )
                    worst_model_score = top_models_per_criterion[criterion][
                        worst_model_idx
                    ][0]
                    if (
                        criterion != "brier_score"
                        and criterion_value > worst_model_score
                    ) or (
                        criterion == "brier_score"
                        and criterion_value < worst_model_score
                    ):
                        top_models_per_criterion[criterion][
                            worst_model_idx
                        ] = current_model_data

            except Exception as e:
                error_message = str(e)
                if (
                    "Matrix not positive definite after repeatedly adding jitter"
                    in error_message
                    or "elements of the" in error_message
                    and "are NaN" in error_message
                    or "cholesky_cpu" in error_message
                ):
                    print(
                        f"Suppressed NotPSDError for {self.task}, {learner} due to"
                        f"convergence issue \n"
                    )
                else:
                    print(
                        f"Error running benchmark for {self.task}, {learner}: "
                        f"{error_message}\n"
                    )
                    traceback.print_exc()

        for criterion, models in top_models_per_criterion.items():
            sorted_models = sorted(
                models, key=lambda x: -x[0] if criterion != "brier_score" else x[0]
            )
            for idx, (score, model, learner, tuning, hpo, encoding) in enumerate(
                sorted_models
            ):
                learners_dict_key = (
                    f"{self.task}_{learner}_{tuning}_{hpo}_{criterion}_{encoding}_"
                    f"{sampling or 'no_sampling'}_factor{self.factor}_rank{idx+1}_"
                    f"score{round(score, 4)}"
                )
                learners_dict[learners_dict_key] = model

        df_results = pd.DataFrame(results)
        pd.set_option("display.max_columns", None, "display.width", 1000)

        if self.verbose:
            print(f"\nBenchmark Results Summary:\n{df_results}")

        return df_results, learners_dict

__init__(task, learners, tuning_methods, hpo_methods, criteria, encodings, sampling=None, factor=None, n_configs=10, n_jobs=1, cv_folds=10, racing_folds=None, test_seed=0, test_size=0.2, val_size=0.2, cv_seed=0, mlp_flag=None, threshold_tuning=None, verbose=True, path=Path('data/processed/processed_data.csv'))

Initialize the Benchmarker with different tasks, learners, etc.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `task` | `str` | Task for evaluation (`'pocketclosure'`, `'pocketclosureinf'`, `'improvement'`, or `'pdgrouprevaluation'`). | required |
| `learners` | `List[str]` | List of learners to benchmark (`'xgb'`, `'rf'`, `'lr'`, or `'mlp'`). | required |
| `tuning_methods` | `List[str]` | Tuning methods for each learner (`'holdout'`, `'cv'`). | required |
| `hpo_methods` | `List[str]` | HPO methods (`'hebo'` or `'rs'`). | required |
| `criteria` | `List[str]` | List of evaluation criteria (`'f1'`, `'macro_f1'`, `'brier_score'`). | required |
| `encodings` | `List[str]` | List of encodings (`'one_hot'` or `'target'`). | required |
| `sampling` | `Optional[List[str]]` | Sampling strategies for class imbalance: `None`, `'upsampling'`, `'downsampling'`, or `'smote'`. | `None` |
| `factor` | `Optional[float]` | Factor to apply during resampling. | `None` |
| `n_configs` | `int` | Number of configurations for hyperparameter tuning. | `10` |
| `n_jobs` | `int` | Number of parallel jobs for processing. | `1` |
| `cv_folds` | `Optional[int]` | Number of folds for cross-validation. | `10` |
| `racing_folds` | `Optional[int]` | Number of racing folds for random search (`'rs'`). | `None` |
| `test_seed` | `int` | Random seed for test splitting. | `0` |
| `test_size` | `float` | Proportion of data used for testing. | `0.2` |
| `val_size` | `Optional[float]` | Size of the validation set in holdout tuning. | `0.2` |
| `cv_seed` | `Optional[int]` | Random seed for cross-validation. | `0` |
| `mlp_flag` | `Optional[bool]` | Enables MLP training with early stopping. | `None` |
| `threshold_tuning` | `Optional[bool]` | Enables threshold tuning for binary classification. | `None` |
| `verbose` | `bool` | If `True`, enables detailed logging during benchmarking. | `True` |
| `path` | `Path` | Path to the directory containing processed data files. | `Path('data/processed/processed_data.csv')` |
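For orientation, a minimal construction that relies on the documented defaults; the import path is assumed from the module location shown on this page (`periomod/benchmarking/_benchmark.py`) and may differ in the installed package.

```
from pathlib import Path

from periomod.benchmarking import Benchmarker  # import path assumed

# Minimal setup; defaults cover n_configs=10, cv_folds=10, test_size=0.2, etc.
benchmarker = Benchmarker(
    task="pocketclosure",
    learners=["lr"],
    tuning_methods=["holdout"],
    hpo_methods=["rs"],
    criteria=["f1"],
    encodings=["one_hot"],
    path=Path("data/processed/processed_data.csv"),
)
```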
Source code in periomod/benchmarking/_benchmark.py
def __init__(
    self,
    task: str,
    learners: List[str],
    tuning_methods: List[str],
    hpo_methods: List[str],
    criteria: List[str],
    encodings: List[str],
    sampling: Optional[List[Union[str, None]]] = None,
    factor: Optional[float] = None,
    n_configs: int = 10,
    n_jobs: int = 1,
    cv_folds: Optional[int] = 10,
    racing_folds: Optional[int] = None,
    test_seed: int = 0,
    test_size: float = 0.2,
    val_size: Optional[float] = 0.2,
    cv_seed: Optional[int] = 0,
    mlp_flag: Optional[bool] = None,
    threshold_tuning: Optional[bool] = None,
    verbose: bool = True,
    path: Path = Path("data/processed/processed_data.csv"),
) -> None:
    """Initialize the Experiment with different tasks, learners, etc.

    Args:
        task (str): Task for evaluation ('pocketclosure', 'pocketclosureinf',
            'improvement', or 'pdgrouprevaluation').
        learners (List[str]): List of learners to benchmark ('xgb', 'rf', 'lr' or
            'mlp').
        tuning_methods (List[str]): Tuning methods for each learner ('holdout',
            'cv').
        hpo_methods (List[str]): HPO methods ('hebo' or 'rs').
        criteria (List[str]): List of evaluation criteria ('f1', 'macro_f1',
            'brier_score').
        encodings (List[str]): List of encodings ('one_hot' or 'target').
        sampling (Optional[List[str]]): Sampling strategies for class imbalance.
            Includes None, 'upsampling', 'downsampling', and 'smote'.
        factor (Optional[float]): Factor to apply during resampling.
        n_configs (int): Number of configurations for hyperparameter tuning.
            Defaults to 10.
        n_jobs (int): Number of parallel jobs for processing. Defaults to 1.
        cv_folds (Optional[int]): Number of folds for cross-validation.
            Defaults to 10.
        racing_folds (Optional[int]): Number of racing folds for Random Search (RS).
            Defaults to None.
        test_seed (int): Random seed for test splitting. Defaults to 0.
        test_size (float): Proportion of data used for testing. Defaults to
            0.2.
        val_size (Optional[float]): Size of validation set in holdout tuning.
            Defaults to 0.2.
        cv_seed (Optional[int]): Random seed for cross-validation. Defaults to 0.
        mlp_flag (Optional[bool]): Enables MLP training with early stopping.
            Defaults to None.
        threshold_tuning (Optional[bool]): Enables threshold tuning for binary
            classification. Defaults to None.
        verbose (bool): If True, enables detailed logging during benchmarking.
            Defaults to True.
        path (Path): Path to the directory containing processed data files.
            Defaults to Path("data/processed/processed_data.csv").
    """
    super().__init__(
        task=task,
        learners=learners,
        tuning_methods=tuning_methods,
        hpo_methods=hpo_methods,
        criteria=criteria,
        encodings=encodings,
        sampling=sampling,
        factor=factor,
        n_configs=n_configs,
        n_jobs=n_jobs,
        cv_folds=cv_folds,
        racing_folds=racing_folds,
        test_seed=test_seed,
        test_size=test_size,
        val_size=val_size,
        cv_seed=cv_seed,
        mlp_flag=mlp_flag,
        threshold_tuning=threshold_tuning,
        verbose=verbose,
        path=path,
    )
    self.data_cache = self._load_data_for_tasks()
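Because `__init__` ends by populating `data_cache` via `_load_data_for_tasks`, the transformed data can be inspected immediately after construction. A small sketch, assuming the `benchmarker` instance from the example above:

```
# data_cache holds one transformed DataFrame per requested encoding.
for encoding, frame in benchmarker.data_cache.items():
    print(encoding, frame.shape)
```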

run_benchmarks()

Benchmark all combinations of inputs.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `tuple` | `Tuple[DataFrame, dict]` | DataFrame summarizing the benchmark results with metrics for each configuration, and a dictionary mapping model keys to models for the top configurations per criterion. |

Raises:

| Type | Description |
| --- | --- |
| `KeyError` | If an unknown criterion is encountered in `metric_map`. |
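For reference, the method keeps at most four models per criterion and replaces the current worst entry whenever a new score beats it (higher is better, except for `brier_score`, where lower is better). A standalone sketch of that bookkeeping rule, not part of the library API:

```
from typing import List, Tuple


def update_top_models(
    top: List[Tuple[float, object]],
    score: float,
    model: object,
    criterion: str,
    limit: int = 4,
) -> None:
    """Keep the `limit` best (score, model) pairs for a criterion, in place."""
    higher_is_better = criterion != "brier_score"
    if len(top) < limit:
        top.append((score, model))
        return
    # Index of the currently worst entry under this criterion's ordering.
    worst_idx = min(
        range(len(top)),
        key=lambda i: top[i][0] if higher_is_better else -top[i][0],
    )
    if (higher_is_better and score > top[worst_idx][0]) or (
        not higher_is_better and score < top[worst_idx][0]
    ):
        top[worst_idx] = (score, model)
```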

Source code in periomod/benchmarking/_benchmark.py
def run_benchmarks(self) -> Tuple[pd.DataFrame, dict]:
    """Benchmark all combinations of inputs.

    Returns:
        tuple: DataFrame summarizing the benchmark results with metrics for each
            configuration and dictionary mapping model keys to models for top
            configurations per criterion.

    Raises:
        KeyError: If an unknown criterion is encountered in `metric_map`.
    """
    results = []
    learners_dict = {}
    top_models_per_criterion: Dict[
        str, List[Tuple[float, object, str, str, str, str]]
    ] = {criterion: [] for criterion in self.criteria}

    metric_map = {
        "f1": "F1 Score",
        "brier_score": (
            "Multiclass Brier Score"
            if self.task == "pdgrouprevaluation"
            else "Brier Score"
        ),
        "macro_f1": "Macro F1",
    }

    for learner, tuning, hpo, criterion, encoding, sampling in itertools.product(
        self.learners,
        self.tuning_methods,
        self.hpo_methods,
        self.criteria,
        self.encodings,
        self.sampling or ["no_sampling"],
    ):
        if sampling is None:
            self.factor = None

        if (criterion == "macro_f1" and self.task != "pdgrouprevaluation") or (
            criterion == "f1" and self.task == "pdgrouprevaluation"
        ):
            print(f"Criterion '{criterion}' and task '{self.task}' not valid.")
            continue
        if self.verbose:
            print(
                f"\nRunning benchmark for Task: {self.task}, Learner: {learner}, "
                f"Tuning: {tuning}, HPO: {hpo}, Criterion: {criterion}, "
                f"Sampling: {sampling}, Factor: {self.factor}."
            )
        df = self.data_cache[encoding]

        exp = Experiment(
            df=df,
            task=self.task,
            learner=learner,
            criterion=criterion,
            encoding=encoding,
            tuning=tuning,
            hpo=hpo,
            sampling=sampling,
            factor=self.factor,
            n_configs=self.n_configs,
            racing_folds=self.racing_folds,
            n_jobs=self.n_jobs,
            cv_folds=self.cv_folds,
            test_seed=self.test_seed,
            test_size=self.test_size,
            val_size=self.val_size,
            cv_seed=self.cv_seed,
            mlp_flag=self.mlp_flag,
            threshold_tuning=self.threshold_tuning,
            verbose=self.verbose,
        )

        try:
            result = exp.perform_evaluation()
            metrics = result["metrics"]
            trained_model = result["model"]

            unpacked_metrics = {
                k: round(v, 4) if isinstance(v, float) else v
                for k, v in metrics.items()
            }
            results.append(
                {
                    "Task": self.task,
                    "Learner": learner,
                    "Tuning": tuning,
                    "HPO": hpo,
                    "Criterion": criterion,
                    "Sampling": sampling,
                    "Factor": self.factor,
                    **unpacked_metrics,
                }
            )

            metric_key = metric_map.get(criterion)
            if metric_key is None:
                raise KeyError(f"Unknown criterion '{criterion}'")

            criterion_value = metrics[metric_key]

            current_model_data = (
                criterion_value,
                trained_model,
                learner,
                tuning,
                hpo,
                encoding,
            )

            if len(top_models_per_criterion[criterion]) < 4:
                top_models_per_criterion[criterion].append(current_model_data)
            else:
                worst_model_idx = min(
                    range(len(top_models_per_criterion[criterion])),
                    key=lambda idx: (
                        top_models_per_criterion[criterion][idx][0]
                        if criterion != "brier_score"
                        else -top_models_per_criterion[criterion][idx][0]
                    ),
                )
                worst_model_score = top_models_per_criterion[criterion][
                    worst_model_idx
                ][0]
                if (
                    criterion != "brier_score"
                    and criterion_value > worst_model_score
                ) or (
                    criterion == "brier_score"
                    and criterion_value < worst_model_score
                ):
                    top_models_per_criterion[criterion][
                        worst_model_idx
                    ] = current_model_data

        except Exception as e:
            error_message = str(e)
            if (
                "Matrix not positive definite after repeatedly adding jitter"
                in error_message
                or "elements of the" in error_message
                and "are NaN" in error_message
                or "cholesky_cpu" in error_message
            ):
                print(
                    f"Suppressed NotPSDError for {self.task}, {learner} due to"
                    f"convergence issue \n"
                )
            else:
                print(
                    f"Error running benchmark for {self.task}, {learner}: "
                    f"{error_message}\n"
                )
                traceback.print_exc()

    for criterion, models in top_models_per_criterion.items():
        sorted_models = sorted(
            models, key=lambda x: -x[0] if criterion != "brier_score" else x[0]
        )
        for idx, (score, model, learner, tuning, hpo, encoding) in enumerate(
            sorted_models
        ):
            learners_dict_key = (
                f"{self.task}_{learner}_{tuning}_{hpo}_{criterion}_{encoding}_"
                f"{sampling or 'no_sampling'}_factor{self.factor}_rank{idx+1}_"
                f"score{round(score, 4)}"
            )
            learners_dict[learners_dict_key] = model

    df_results = pd.DataFrame(results)
    pd.set_option("display.max_columns", None, "display.width", 1000)

    if self.verbose:
        print(f"\nBenchmark Results Summary:\n{df_results}")

    return df_results, learners_dict