
EvaluatorWrapper

Bases: BaseEvaluatorWrapper

Wrapper class for model evaluation, feature importance, and inference.

Extends the base evaluation functionality to enable comprehensive model evaluation, feature importance analysis, patient inference, and jackknife resampling for confidence interval estimation.

Inherits
  • BaseEvaluatorWrapper: Provides foundational methods and attributes for model evaluation, data preparation, and inference.

Parameters:
  • learners_dict (Dict): Dictionary containing trained models and their metadata. Required.
  • criterion (str): The criterion used to select the best model ('f1', 'macro_f1', 'brier_score'). Required.
  • aggregate (bool): Whether to aggregate one-hot encoding. Defaults to True.
  • verbose (bool): If True, enables verbose logging during evaluation and inference. Defaults to False.
  • random_state (int): Random state for resampling. Defaults to 0.
  • path (Path): Path to the processed data file. Defaults to Path("data/processed/processed_data.csv").

Attributes:
  • learners_dict (Dict): Contains metadata about trained models.
  • criterion (str): Criterion used for model selection.
  • aggregate (bool): Flag for aggregating one-hot encoded metrics.
  • verbose (bool): Controls verbosity in evaluation processes.
  • model (object): Best-ranked model based on the criterion.
  • encoding (str): Encoding method ('one_hot' or 'target').
  • learner (str): Type of model (learner) used in training.
  • task (str): Task associated with the extracted model.
  • factor (Optional[float]): Resampling factor if applicable.
  • sampling (Optional[str]): Resampling strategy ('upsampling', 'smote', etc.).
  • classification (str): Classification type ('binary' or 'multiclass').
  • dataloader (ProcessedDataLoader): Data loader and transformer.
  • resampler (Resampler): Resampling strategy for training and testing.
  • df (DataFrame): Loaded dataset.
  • df_processed (DataFrame): Processed dataset.
  • train_df (DataFrame): Training data after splitting.
  • test_df (DataFrame): Test data after splitting.
  • X_train (DataFrame): Training features.
  • y_train (Series): Training labels.
  • X_test (DataFrame): Test features.
  • y_test (Series): Test labels.
  • base_target (Optional[ndarray]): Baseline target for evaluations.
  • baseline (Baseline): Baseline class for model analysis.
  • evaluator (ModelEvaluator): Evaluator for model metrics and feature importance.
  • inference_engine (ModelInference): Model inference manager.
  • trainer (Trainer): Trainer for model evaluation and optimization.

Methods:
  • wrapped_evaluation: Runs comprehensive evaluation with optional plots for metrics such as confusion matrix and Brier scores.
  • compare_bss: Compares the Brier Skill Score of the model with baselines on the test set. Allows subsetting of the test set.
  • evaluate_cluster: Performs clustering and calculates Brier scores. Allows subsetting of the test set.
  • evaluate_feature_importance: Computes feature importance using specified methods (e.g., SHAP, permutation importance). Allows subsetting of the test set.
  • average_over_splits: Aggregates metrics across multiple data splits for robust evaluation.
  • wrapped_patient_inference: Conducts inference on individual patient data.
  • wrapped_jackknife: Executes jackknife resampling on patient data to estimate confidence intervals.

Inherited Properties
  • criterion (str): Retrieves or sets current evaluation criterion for model selection. Supports 'f1', 'brier_score', and 'macro_f1'.
  • model (object): Retrieves best-ranked model dynamically based on the current criterion. Recalculates when criterion is updated.
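
Because `model` is re-derived from `criterion`, switching the criterion on an existing wrapper re-selects the best learner without re-instantiation. A minimal sketch, assuming an `evaluator` instance constructed as in the Examples below:

```
# Sketch: changing the criterion re-ranks the stored learners.
evaluator.criterion = "brier_score"  # switch selection criterion
best_for_brier = evaluator.model     # best-ranked model under 'brier_score'

evaluator.criterion = "f1"
best_for_f1 = evaluator.model        # recomputed for 'f1'
```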

Examples:

from periomod.base import Patient, patient_to_df
from periomod.wrapper import EvaluatorWrapper, load_benchmark, load_learners

benchmark = load_benchmark(path="reports/experiment/benchmark.csv")
learners = load_learners(path="models/experiments")

# Initialize evaluator with learners from BenchmarkWrapper and f1 criterion
evaluator = EvaluatorWrapper(
    learners_dict=learners,
    criterion="f1",
    path="data/processed/processed_data.csv"
)

# Evaluate the model and generate plots
evaluator.wrapped_evaluation()

# Cluster analysis on predictions with Brier score smaller than threshold
evaluator.evaluate_cluster(brier_threshold=0.15)

# Calculate feature importance
evaluator.evaluate_feature_importance(fi_types=["shap", "permutation"])

# Train and average over multiple random splits
avg_metrics_df = evaluator.average_over_splits(num_splits=5, n_jobs=-1)

# Define a patient instance
patient = Patient()
patient_df = patient_to_df(patient=patient)

# Run inference on a specific patient's data
predict_data, output, results = evaluator.wrapped_patient_inference(
    patient=patient
)

# Execute jackknife resampling for robust inference
jackknife_results, ci_plots = evaluator.wrapped_jackknife(
    patient=patient, results=results, sample_fraction=0.8, n_jobs=-1
)
Source code in periomod/wrapper/_wrapper.py
class EvaluatorWrapper(BaseEvaluatorWrapper):
    """Wrapper class for model evaluation, feature importance, and inference.

    Extends the base evaluation functionality to enable comprehensive model
    evaluation, feature importance analysis, patient inference, and jackknife
    resampling for confidence interval estimation.

    Inherits:
        - `BaseEvaluatorWrapper`: Provides foundational methods and attributes for
          model evaluation, data preparation, and inference.

    Args:
        learners_dict (Dict): Dictionary containing trained models and their metadata.
        criterion (str): The criterion used to select the best model ('f1', 'macro_f1',
            'brier_score').
        aggregate (bool): Whether to aggregate one-hot encoding. Defaults
            to True.
        verbose (bool): If True, enables verbose logging during evaluation
            and inference. Defaults to False.
        random_state (int): Random state for resampling. Defaults to 0.
        path (Path): Path to the processed data file. Defaults to
            Path("data/processed/processed_data.csv").

    Attributes:
        learners_dict (Dict): Contains metadata about trained models.
        criterion (str): Criterion used for model selection.
        aggregate (bool): Flag for aggregating one-hot encoded metrics.
        verbose (bool): Controls verbosity in evaluation processes.
        model (object): Best-ranked model based on the criterion.
        encoding (str): Encoding method ('one_hot' or 'target').
        learner (str): Type of model (learner) used in training.
        task (str): Task associated with the extracted model.
        factor (Optional[float]): Resampling factor if applicable.
        sampling (Optional[str]): Resampling strategy ('upsampling', 'smote', etc.).
        classification (str): Classification type ('binary' or 'multiclass').
        dataloader (ProcessedDataLoader): Data loader and transformer.
        resampler (Resampler): Resampling strategy for training and testing.
        df (pd.DataFrame): Loaded dataset.
        df_processed (pd.DataFrame): Processed dataset.
        train_df (pd.DataFrame): Training data after splitting.
        test_df (pd.DataFrame): Test data after splitting.
        X_train (pd.DataFrame): Training features.
        y_train (pd.Series): Training labels.
        X_test (pd.DataFrame): Test features.
        y_test (pd.Series): Test labels.
        base_target (Optional[np.ndarray]): Baseline target for evaluations.
        baseline (Baseline): Baseline class for model analysis.
        evaluator (ModelEvaluator): Evaluator for model metrics and feature importance.
        inference_engine (ModelInference): Model inference manager.
        trainer (Trainer): Trainer for model evaluation and optimization.

    Methods:
        wrapped_evaluation: Runs comprehensive evaluation with optional
            plots for metrics such as confusion matrix and Brier scores.
        compare_bss: Compares Brier Skill Score of the model with baselines
            on the test set. Allows subsetting of test set.
        evaluate_cluster: Performs clustering and calculates Brier scores.
            Allows subsetting of test set.
        evaluate_feature_importance: Computes feature importance using
            specified methods (e.g., SHAP, permutation importance). Allows subsetting
            of test set.
        average_over_splits: Aggregates metrics across multiple data
            splits for robust evaluation.
        wrapped_patient_inference: Conducts inference on individual patient data.
        wrapped_jackknife: Executes jackknife resampling on patient data to
            estimate confidence intervals.

    Inherited Properties:
        - `criterion (str):` Retrieves or sets current evaluation criterion for model
            selection. Supports 'f1', 'brier_score', and 'macro_f1'.
        - `model (object):` Retrieves best-ranked model dynamically based on the current
            criterion. Recalculates when criterion is updated.

    Examples:
        ```
        from periomod.base import Patient, patient_to_df
        from periomod.wrapper import EvaluatorWrapper, load_benchmark, load_learners

        benchmark = load_benchmark(path="reports/experiment/benchmark.csv")
        learners = load_learners(path="models/experiments")

        # Initialize evaluator with learners from BenchmarkWrapper and f1 criterion
        evaluator = EvaluatorWrapper(
            learners_dict=learners,
            criterion="f1",
            path="data/processed/processed_data.csv"
        )

        # Evaluate the model and generate plots
        evaluator.wrapped_evaluation()

        # Cluster analysis on predictions with brier score smaller than threshold
        evaluator.evaluate_cluster(brier_threshold=0.15)

        # Calculate feature importance
        evaluator.evaluate_feature_importance(fi_types=["shap", "permutation"])

        # Train and average over multiple random splits
        avg_metrics_df = evaluator.average_over_splits(num_splits=5, n_jobs=-1)

        # Define a patient instance
        patient = Patient()
        patient_df = patient_to_df(patient=patient)

        # Run inference on a specific patient's data
        predict_data, output, results = evaluator.wrapped_patient_inference(
            patient=patient
        )

        # Execute jackknife resampling for robust inference
        jackknife_results, ci_plots = evaluator.wrapped_jackknife(
            patient=patient, results=results, sample_fraction=0.8, n_jobs=-1
        )
        ```
    """

    def __init__(
        self,
        learners_dict: Dict,
        criterion: str,
        aggregate: bool = True,
        verbose: bool = False,
        random_state: int = 0,
        path: Path = Path("data/processed/processed_data.csv"),
    ) -> None:
        """Initializes EvaluatorWrapper with model, evaluation, and inference setup.

        Args:
            learners_dict (Dict): Dictionary containing trained models.
            criterion (str): The criterion used to select the best model ('f1',
                'macro_f1', 'brier_score').
            aggregate (bool): Whether to aggregate one-hot encoding. Defaults
                to True.
            verbose (bool): If True, enables verbose logging during evaluation
                and inference. Defaults to False.
            random_state (int): Random state for resampling. Defaults to 0.
            path (Path): Path to the processed data file. Defaults to
                Path("data/processed/processed_data.csv").

        """
        super().__init__(
            learners_dict=learners_dict,
            criterion=criterion,
            aggregate=aggregate,
            verbose=verbose,
            random_state=random_state,
            path=path,
        )

    def wrapped_evaluation(
        self,
        cm: bool = True,
        cm_base: bool = True,
        brier_groups: bool = True,
        calibration: bool = True,
        tight_layout: bool = False,
    ) -> None:
        """Runs evaluation on the best-ranked model.

        Args:
            cm (bool): Plot the confusion matrix. Defaults to True.
            cm_base (bool): Plot confusion matrix vs value before treatment.
                Defaults to True.
            brier_groups (bool): Calculate Brier score groups. Defaults to True.
            calibration (bool): Plots model calibration. Defaults to True.
            tight_layout (bool): If True, applies tight layout to the plot.
                Defaults to False.
        """
        if cm:
            self.evaluator.plot_confusion_matrix(
                tight_layout=tight_layout, task=self.task
            )
        if cm_base:
            if self.task in [
                "pocketclosure",
                "pocketclosureinf",
                "pdgrouprevaluation",
            ]:
                self.evaluator.plot_confusion_matrix(
                    col=self.base_target,
                    y_label="Pocket Closure",
                    tight_layout=tight_layout,
                    task=self.task,
                )
        if brier_groups:
            self.evaluator.brier_score_groups(tight_layout=tight_layout, task=self.task)
        if calibration:
            self.evaluator.calibration_plot(task=self.task, tight_layout=tight_layout)

    def compare_bss(
        self,
        base: Optional[str] = None,
        revaluation: Optional[str] = None,
        true_preds: bool = False,
        brier_threshold: Optional[float] = None,
        tight_layout: bool = False,
    ) -> None:
        """Compares Brier Skill Score of model with baseline on test set.

        Args:
            base (Optional[str]): Baseline variable for comparison. Defaults to None.
            revaluation (Optional[str]): Revaluation variable. Defaults to None.
            true_preds (bool): Subset by correct predictions. Defaults to False.
            brier_threshold (Optional[float]): Filters observations by Brier score
                threshold. Defaults to None.
            tight_layout (bool): If True, applies tight layout to the plot.
                Defaults to False.
        """
        baseline_models, _, _ = self.baseline.train_baselines()
        self.evaluator.X, self.evaluator.y, patients = self._test_filters(
            X=self.evaluator.X,
            y=self.evaluator.y,
            base=base,
            revaluation=revaluation,
            true_preds=true_preds,
            brier_threshold=brier_threshold,
        )
        self.evaluator.bss_comparison(
            baseline_models=baseline_models,
            classification=self.classification,
            num_patients=patients,
            tight_layout=tight_layout,
        )
        self.evaluator.X, self.evaluator.y = self.X_test, self.y_test

    def evaluate_cluster(
        self,
        n_cluster: int = 3,
        base: Optional[str] = None,
        revaluation: Optional[str] = None,
        true_preds: bool = False,
        brier_threshold: Optional[float] = None,
        tight_layout: bool = False,
    ) -> None:
        """Performs cluster analysis with Brier scores, optionally applying subsetting.

        This method allows detailed feature analysis by offering multiple subsetting
        options for the test set. The base and revaluation columns allow filtering of
        observations that have not changed after treatment. With true_preds, only
        observations that were correctly predicted are considered. The brier_threshold
        enables filtering of observations that achieved a smaller Brier score at
        prediction time than the threshold.

        Args:
            n_cluster (int): Number of clusters for Brier score clustering analysis.
                Defaults to 3.
            base (Optional[str]): Baseline variable for comparison. Defaults to None.
            revaluation (Optional[str]): Revaluation variable. Defaults to None.
            true_preds (bool): Subset by correct predictions. Defaults to False.
            brier_threshold (Optional[float]): Filters observations by Brier score
                threshold. Defaults to None.
            tight_layout (bool): If True, applies tight layout to the plot.
                Defaults to False.
        """
        self.evaluator.X, self.evaluator.y, patients = self._test_filters(
            X=self.evaluator.X,
            y=self.evaluator.y,
            base=base,
            revaluation=revaluation,
            true_preds=true_preds,
            brier_threshold=brier_threshold,
        )
        print(f"Number of patients in test set: {patients}")
        print(f"Number of tooth sites: {len(self.evaluator.y)}")
        self.evaluator.analyze_brier_within_clusters(
            n_clusters=n_cluster, tight_layout=tight_layout
        )
        self.evaluator.X, self.evaluator.y = self.X_test, self.y_test

    def evaluate_feature_importance(
        self,
        fi_types: List[str],
        base: Optional[str] = None,
        revaluation: Optional[str] = None,
        true_preds: bool = False,
        brier_threshold: Optional[float] = None,
    ) -> None:
        """Evaluates feature importance using the evaluator, with optional subsetting.

        This method allows detailed feature analysis by offering multiple subsetting
        options for the test set. The base and revaluation columns allow filtering of
        observations that have not changed after treatment. With true_preds, only
        observations that were correctly predicted are considered. The brier_threshold
        enables filtering of observations that achieved a smaller Brier score at
        prediction time than the threshold.

        Args:
            fi_types (List[str]): List of feature importance types to evaluate.
            base (Optional[str]): Baseline variable for comparison. Defaults to None.
            revaluation (Optional[str]): Revaluation variable. Defaults to None.
            true_preds (bool): Subset by correct predictions. Defaults to False.
            brier_threshold (Optional[float]): Filters observations by Brier score
                threshold. Defaults to None.
        """
        self.evaluator.X, self.evaluator.y, patients = self._test_filters(
            X=self.evaluator.X,
            y=self.evaluator.y,
            base=base,
            revaluation=revaluation,
            true_preds=true_preds,
            brier_threshold=brier_threshold,
        )
        print(f"Number of patients in test set: {patients}")
        print(f"Number of tooth sites: {len(self.evaluator.y)}")
        self.evaluator.evaluate_feature_importance(fi_types=fi_types)
        self.evaluator.X, self.evaluator.y = self.X_test, self.y_test

    def average_over_splits(
        self, num_splits: int = 5, n_jobs: int = -1
    ) -> pd.DataFrame:
        """Trains the final model over multiple splits with different seeds.

        Args:
            num_splits (int): Number of random seeds/splits to train the model on.
                Defaults to 5.
            n_jobs (int): Number of parallel jobs. Defaults to -1 (use all processors).

        Returns:
            DataFrame: DataFrame containing average performance metrics.
        """
        seeds = range(num_splits)
        metrics_list = Parallel(n_jobs=n_jobs)(
            delayed(self._train_and_get_metrics)(seed, self.learner) for seed in seeds
        )
        avg_metrics = {}
        for metric in metrics_list[0]:
            if metric == "Confusion Matrix":
                continue
            values = [d[metric] for d in metrics_list if d[metric] is not None]
            avg_metrics[metric] = sum(values) / len(values) if values else None

        avg_confusion_matrix = None
        if self.classification == "binary" and "Confusion Matrix" in metrics_list[0]:
            avg_confusion_matrix = (
                np.mean([d["Confusion Matrix"] for d in metrics_list], axis=0)
                .astype(int)
                .tolist()
            )

        results = {
            "Task": self.task,
            "Learner": self.learner,
            "Criterion": self.criterion,
            "Sampling": self.sampling,
            "Factor": self.factor,
            **{
                metric: round(value, 4) if isinstance(value, (int, float)) else value
                for metric, value in avg_metrics.items()
            },
        }

        if avg_confusion_matrix is not None:
            results["Confusion Matrix"] = avg_confusion_matrix

        return pd.DataFrame([results])

    def wrapped_patient_inference(
        self,
        patient: Patient,
    ) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
        """Runs inference on the patient's data using the best-ranked model.

        Args:
            patient (Patient): A `Patient` dataclass instance containing patient-level,
                tooth-level, and side-level information.

        Returns:
            Tuple[DataFrame, DataFrame, DataFrame]: Prediction input data, model
                output, and results with predictions and probabilities for each
                side of the patient's teeth.
        """
        patient_data = patient_to_df(patient=patient)
        predict_data, patient_data = self.inference_engine.prepare_inference(
            task=self.task,
            patient_data=patient_data,
            encoding=self.encoding,
            X_train=self.X_train,
            y_train=self.y_train,
        )

        return self.inference_engine.patient_inference(
            predict_data=predict_data, patient_data=patient_data
        )

    def wrapped_jackknife(
        self,
        patient: Patient,
        results: pd.DataFrame,
        sample_fraction: float = 1.0,
        n_jobs: int = -1,
        max_plots: int = 192,
    ) -> pd.DataFrame:
        """Runs jackknife resampling for inference on a given patient's data.

        Args:
            patient (Patient): `Patient` dataclass instance containing patient-level,
                tooth-level, and side-level information.
            results (pd.DataFrame): DataFrame to store results from jackknife inference.
            sample_fraction (float, optional): The fraction of patient data to use for
                jackknife resampling. Defaults to 1.0.
            n_jobs (int, optional): The number of parallel jobs to run. Defaults to -1.
            max_plots (int): Maximum number of plots for jackknife intervals.

        Returns:
            DataFrame: The results of jackknife inference.
        """
        patient_data = patient_to_df(patient=patient)
        patient_data, _ = self.inference_engine.prepare_inference(
            task=self.task,
            patient_data=patient_data,
            encoding=self.encoding,
            X_train=self.X_train,
            y_train=self.y_train,
        )
        return self.inference_engine.jackknife_inference(
            model=self.model,
            train_df=self.train_df,
            patient_data=patient_data,
            encoding=self.encoding,
            inference_results=results,
            sample_fraction=sample_fraction,
            n_jobs=n_jobs,
            max_plots=max_plots,
        )

__init__(learners_dict, criterion, aggregate=True, verbose=False, random_state=0, path=Path('data/processed/processed_data.csv'))

Initializes EvaluatorWrapper with model, evaluation, and inference setup.

Parameters:
  • learners_dict (Dict): Dictionary containing trained models. Required.
  • criterion (str): The criterion used to select the best model ('f1', 'macro_f1', 'brier_score'). Required.
  • aggregate (bool): Whether to aggregate one-hot encoding. Defaults to True.
  • verbose (bool): If True, enables verbose logging during evaluation and inference. Defaults to False.
  • random_state (int): Random state for resampling. Defaults to 0.
  • path (Path): Path to the processed data file. Defaults to Path("data/processed/processed_data.csv").
Source code in periomod/wrapper/_wrapper.py
def __init__(
    self,
    learners_dict: Dict,
    criterion: str,
    aggregate: bool = True,
    verbose: bool = False,
    random_state: int = 0,
    path: Path = Path("data/processed/processed_data.csv"),
) -> None:
    """Initializes EvaluatorWrapper with model, evaluation, and inference setup.

    Args:
        learners_dict (Dict): Dictionary containing trained models.
        criterion (str): The criterion used to select the best model ('f1',
            'macro_f1', 'brier_score').
        aggregate (bool): Whether to aggregate one-hot encoding. Defaults
            to True.
        verbose (bool): If True, enables verbose logging during evaluation
            and inference. Defaults to False.
        random_state (int): Random state for resampling. Defaults to 0.
        path (Path): Path to the processed data file. Defaults to
            Path("data/processed/processed_data.csv").

    """
    super().__init__(
        learners_dict=learners_dict,
        criterion=criterion,
        aggregate=aggregate,
        verbose=verbose,
        random_state=random_state,
        path=path,
    )

average_over_splits(num_splits=5, n_jobs=-1)

Trains the final model over multiple splits with different seeds.

Parameters:
  • num_splits (int): Number of random seeds/splits to train the model on. Defaults to 5.
  • n_jobs (int): Number of parallel jobs. Defaults to -1 (use all processors).

Returns:
  • DataFrame: DataFrame containing average performance metrics.

Source code in periomod/wrapper/_wrapper.py
def average_over_splits(
    self, num_splits: int = 5, n_jobs: int = -1
) -> pd.DataFrame:
    """Trains the final model over multiple splits with different seeds.

    Args:
        num_splits (int): Number of random seeds/splits to train the model on.
            Defaults to 5.
        n_jobs (int): Number of parallel jobs. Defaults to -1 (use all processors).

    Returns:
        DataFrame: DataFrame containing average performance metrics.
    """
    seeds = range(num_splits)
    metrics_list = Parallel(n_jobs=n_jobs)(
        delayed(self._train_and_get_metrics)(seed, self.learner) for seed in seeds
    )
    avg_metrics = {}
    for metric in metrics_list[0]:
        if metric == "Confusion Matrix":
            continue
        values = [d[metric] for d in metrics_list if d[metric] is not None]
        avg_metrics[metric] = sum(values) / len(values) if values else None

    avg_confusion_matrix = None
    if self.classification == "binary" and "Confusion Matrix" in metrics_list[0]:
        avg_confusion_matrix = (
            np.mean([d["Confusion Matrix"] for d in metrics_list], axis=0)
            .astype(int)
            .tolist()
        )

    results = {
        "Task": self.task,
        "Learner": self.learner,
        "Criterion": self.criterion,
        "Sampling": self.sampling,
        "Factor": self.factor,
        **{
            metric: round(value, 4) if isinstance(value, (int, float)) else value
            for metric, value in avg_metrics.items()
        },
    }

    if avg_confusion_matrix is not None:
        results["Confusion Matrix"] = avg_confusion_matrix

    return pd.DataFrame([results])
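
Each seed produces an independent split, and all non-matrix metrics are arithmetic means over the splits. A usage sketch, assuming `evaluator` is the instance from the class-level example:

```
# Sketch: average metrics over five seeded splits, using all cores.
avg_metrics_df = evaluator.average_over_splits(num_splits=5, n_jobs=-1)
print(avg_metrics_df)  # one row: Task, Learner, Criterion, Sampling, Factor, metrics
```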

compare_bss(base=None, revaluation=None, true_preds=False, brier_threshold=None, tight_layout=False)

Compares Brier Skill Score of model with baseline on test set.

Parameters:
  • base (Optional[str]): Baseline variable for comparison. Defaults to None.
  • revaluation (Optional[str]): Revaluation variable. Defaults to None.
  • true_preds (bool): Subset by correct predictions. Defaults to False.
  • brier_threshold (Optional[float]): Filters observations by Brier score threshold. Defaults to None.
  • tight_layout (bool): If True, applies tight layout to the plot. Defaults to False.
Source code in periomod/wrapper/_wrapper.py
def compare_bss(
    self,
    base: Optional[str] = None,
    revaluation: Optional[str] = None,
    true_preds: bool = False,
    brier_threshold: Optional[float] = None,
    tight_layout: bool = False,
) -> None:
    """Compares Brier Skill Score of model with baseline on test set.

    Args:
        base (Optional[str]): Baseline variable for comparison. Defaults to None.
        revaluation (Optional[str]): Revaluation variable. Defaults to None.
        true_preds (bool): Subset by correct predictions. Defaults to False.
        brier_threshold (Optional[float]): Filters observations by Brier score
            threshold. Defaults to None.
        tight_layout (bool): If True, applies tight layout to the plot.
            Defaults to False.
    """
    baseline_models, _, _ = self.baseline.train_baselines()
    self.evaluator.X, self.evaluator.y, patients = self._test_filters(
        X=self.evaluator.X,
        y=self.evaluator.y,
        base=base,
        revaluation=revaluation,
        true_preds=true_preds,
        brier_threshold=brier_threshold,
    )
    self.evaluator.bss_comparison(
        baseline_models=baseline_models,
        classification=self.classification,
        num_patients=patients,
        tight_layout=tight_layout,
    )
    self.evaluator.X, self.evaluator.y = self.X_test, self.y_test
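
A usage sketch, assuming `evaluator` from the class-level example; the filter values here are illustrative:

```
# Sketch: BSS comparison restricted to correctly predicted observations
# with a Brier score below 0.2.
evaluator.compare_bss(true_preds=True, brier_threshold=0.2)
```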

evaluate_cluster(n_cluster=3, base=None, revaluation=None, true_preds=False, brier_threshold=None, tight_layout=False)

Performs cluster analysis with Brier scores, optionally applying subsetting.

This method allows detailed feature analysis by offering multiple subsetting options for the test set. The base and revaluation columns allow filtering of observations that have not changed after treatment. With true_preds, only observations that were correctly predicted are considered. The brier_threshold enables filtering of observations that achieved a smaller Brier score at prediction time than the threshold.

Parameters:
  • n_cluster (int): Number of clusters for Brier score clustering analysis. Defaults to 3.
  • base (Optional[str]): Baseline variable for comparison. Defaults to None.
  • revaluation (Optional[str]): Revaluation variable. Defaults to None.
  • true_preds (bool): Subset by correct predictions. Defaults to False.
  • brier_threshold (Optional[float]): Filters observations by Brier score threshold. Defaults to None.
  • tight_layout (bool): If True, applies tight layout to the plot. Defaults to False.
Source code in periomod/wrapper/_wrapper.py
def evaluate_cluster(
    self,
    n_cluster: int = 3,
    base: Optional[str] = None,
    revaluation: Optional[str] = None,
    true_preds: bool = False,
    brier_threshold: Optional[float] = None,
    tight_layout: bool = False,
) -> None:
    """Performs cluster analysis with Brier scores, optionally applying subsetting.

    This method allows detailed feature analysis by offering multiple subsetting
    options for the test set. The base and revaluation columns allow filtering of
    observations that have not changed after treatment. With true_preds, only
    observations that were correctly predicted are considered. The brier_threshold
    enables filtering of observations that achieved a smaller Brier score at
    prediction time than the threshold.

    Args:
        n_cluster (int): Number of clusters for Brier score clustering analysis.
            Defaults to 3.
        base (Optional[str]): Baseline variable for comparison. Defaults to None.
        revaluation (Optional[str]): Revaluation variable. Defaults to None.
        true_preds (bool): Subset by correct predictions. Defaults to False.
        brier_threshold (Optional[float]): Filters observations by Brier score
            threshold. Defaults to None.
        tight_layout (bool): If True, applies tight layout to the plot.
            Defaults to False.
    """
    self.evaluator.X, self.evaluator.y, patients = self._test_filters(
        X=self.evaluator.X,
        y=self.evaluator.y,
        base=base,
        revaluation=revaluation,
        true_preds=true_preds,
        brier_threshold=brier_threshold,
    )
    print(f"Number of patients in test set: {patients}")
    print(f"Number of tooth sites: {len(self.evaluator.y)}")
    self.evaluator.analyze_brier_within_clusters(
        n_clusters=n_cluster, tight_layout=tight_layout
    )
    self.evaluator.X, self.evaluator.y = self.X_test, self.y_test
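
A usage sketch with illustrative filter values, assuming `evaluator` from the class-level example:

```
# Sketch: cluster only the well-calibrated sites (Brier score < 0.1)
# into four groups instead of the default three.
evaluator.evaluate_cluster(n_cluster=4, brier_threshold=0.1)
```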

evaluate_feature_importance(fi_types, base=None, revaluation=None, true_preds=False, brier_threshold=None)

Evaluates feature importance using the evaluator, with optional subsetting.

This method allows detailed feature analysis by offering multiple subsetting options for the test set. The base and revaluation columns allow filtering of observations that have not changed after treatment. With true_preds, only observations that were correctly predicted are considered. The brier_threshold enables filtering of observations that achieved a smaller Brier score at prediction time than the threshold.

Parameters:
  • fi_types (List[str]): List of feature importance types to evaluate. Required.
  • base (Optional[str]): Baseline variable for comparison. Defaults to None.
  • revaluation (Optional[str]): Revaluation variable. Defaults to None.
  • true_preds (bool): Subset by correct predictions. Defaults to False.
  • brier_threshold (Optional[float]): Filters observations by Brier score threshold. Defaults to None.
Source code in periomod/wrapper/_wrapper.py
def evaluate_feature_importance(
    self,
    fi_types: List[str],
    base: Optional[str] = None,
    revaluation: Optional[str] = None,
    true_preds: bool = False,
    brier_threshold: Optional[float] = None,
) -> None:
    """Evaluates feature importance using the evaluator, with optional subsetting.

    This method allows detailed feature analysis by offering multiple subsetting
    options for the test set. The base and revaluation columns allow filtering of
    observations that have not changed after treatment. With true_preds, only
    observations that were correctly predicted are considered. The brier_threshold
    enables filtering of observations that achieved a smaller Brier score at
    prediction time than the threshold.

    Args:
        fi_types (List[str]): List of feature importance types to evaluate.
        base (Optional[str]): Baseline variable for comparison. Defaults to None.
        revaluation (Optional[str]): Revaluation variable. Defaults to None.
        true_preds (bool): Subset by correct predictions. Defaults to False.
        brier_threshold (Optional[float]): Filters observations by Brier score
            threshold. Defaults to None.
    """
    self.evaluator.X, self.evaluator.y, patients = self._test_filters(
        X=self.evaluator.X,
        y=self.evaluator.y,
        base=base,
        revaluation=revaluation,
        true_preds=true_preds,
        brier_threshold=brier_threshold,
    )
    print(f"Number of patients in test set: {patients}")
    print(f"Number of tooth sites: {len(self.evaluator.y)}")
    self.evaluator.evaluate_feature_importance(fi_types=fi_types)
    self.evaluator.X, self.evaluator.y = self.X_test, self.y_test
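
A usage sketch, assuming `evaluator` from the class-level example; the subsetting flag is illustrative:

```
# Sketch: permutation importance computed only on correctly
# predicted test observations.
evaluator.evaluate_feature_importance(fi_types=["permutation"], true_preds=True)
```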

wrapped_evaluation(cm=True, cm_base=True, brier_groups=True, calibration=True, tight_layout=False)

Runs evaluation on the best-ranked model.

Parameters:
  • cm (bool): Plot the confusion matrix. Defaults to True.
  • cm_base (bool): Plot confusion matrix vs value before treatment. Defaults to True.
  • brier_groups (bool): Calculate Brier score groups. Defaults to True.
  • calibration (bool): Plot model calibration. Defaults to True.
  • tight_layout (bool): If True, applies tight layout to the plot. Defaults to False.
Source code in periomod/wrapper/_wrapper.py
def wrapped_evaluation(
    self,
    cm: bool = True,
    cm_base: bool = True,
    brier_groups: bool = True,
    calibration: bool = True,
    tight_layout: bool = False,
) -> None:
    """Runs evaluation on the best-ranked model.

    Args:
        cm (bool): Plot the confusion matrix. Defaults to True.
        cm_base (bool): Plot confusion matrix vs value before treatment.
            Defaults to True.
        brier_groups (bool): Calculate Brier score groups. Defaults to True.
        calibration (bool): Plots model calibration. Defaults to True.
        tight_layout (bool): If True, applies tight layout to the plot.
            Defaults to False.
    """
    if cm:
        self.evaluator.plot_confusion_matrix(
            tight_layout=tight_layout, task=self.task
        )
    if cm_base:
        if self.task in [
            "pocketclosure",
            "pocketclosureinf",
            "pdgrouprevaluation",
        ]:
            self.evaluator.plot_confusion_matrix(
                col=self.base_target,
                y_label="Pocket Closure",
                tight_layout=tight_layout,
                task=self.task,
            )
    if brier_groups:
        self.evaluator.brier_score_groups(tight_layout=tight_layout, task=self.task)
    if calibration:
        self.evaluator.calibration_plot(task=self.task, tight_layout=tight_layout)
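
A usage sketch, assuming `evaluator` from the class-level example:

```
# Sketch: plot only the confusion matrix and the calibration curve.
evaluator.wrapped_evaluation(
    cm=True, cm_base=False, brier_groups=False, calibration=True
)
```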

wrapped_jackknife(patient, results, sample_fraction=1.0, n_jobs=-1, max_plots=192)

Runs jackknife resampling for inference on a given patient's data.

Parameters:
  • patient (Patient): Patient dataclass instance containing patient-level, tooth-level, and side-level information. Required.
  • results (DataFrame): DataFrame to store results from jackknife inference. Required.
  • sample_fraction (float): The fraction of patient data to use for jackknife resampling. Defaults to 1.0.
  • n_jobs (int): The number of parallel jobs to run. Defaults to -1.
  • max_plots (int): Maximum number of plots for jackknife intervals. Defaults to 192.

Returns:
  • DataFrame: The results of jackknife inference.
Source code in periomod/wrapper/_wrapper.py
def wrapped_jackknife(
    self,
    patient: Patient,
    results: pd.DataFrame,
    sample_fraction: float = 1.0,
    n_jobs: int = -1,
    max_plots: int = 192,
) -> pd.DataFrame:
    """Runs jackknife resampling for inference on a given patient's data.

    Args:
        patient (Patient): `Patient` dataclass instance containing patient-level,
            tooth-level, and side-level information.
        results (pd.DataFrame): DataFrame to store results from jackknife inference.
        sample_fraction (float, optional): The fraction of patient data to use for
            jackknife resampling. Defaults to 1.0.
        n_jobs (int, optional): The number of parallel jobs to run. Defaults to -1.
        max_plots (int): Maximum number of plots for jackknife intervals.

    Returns:
        DataFrame: The results of jackknife inference.
    """
    patient_data = patient_to_df(patient=patient)
    patient_data, _ = self.inference_engine.prepare_inference(
        task=self.task,
        patient_data=patient_data,
        encoding=self.encoding,
        X_train=self.X_train,
        y_train=self.y_train,
    )
    return self.inference_engine.jackknife_inference(
        model=self.model,
        train_df=self.train_df,
        patient_data=patient_data,
        encoding=self.encoding,
        inference_results=results,
        sample_fraction=sample_fraction,
        n_jobs=n_jobs,
        max_plots=max_plots,
    )
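
A usage sketch, assuming `evaluator`, `patient`, and the `results` DataFrame produced by `wrapped_patient_inference`, as in the class-level example:

```
# Sketch: jackknife resampling on 80% of the patient data.
jackknife_results, ci_plots = evaluator.wrapped_jackknife(
    patient=patient, results=results, sample_fraction=0.8, n_jobs=-1
)
```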

wrapped_patient_inference(patient)

Runs inference on the patient's data using the best-ranked model.

Parameters:
  • patient (Patient): A Patient dataclass instance containing patient-level, tooth-level, and side-level information. Required.

Returns:
  • Tuple[DataFrame, DataFrame, DataFrame]: Prediction input data, model output, and results with predictions and probabilities for each side of the patient's teeth.
Source code in periomod/wrapper/_wrapper.py
def wrapped_patient_inference(
    self,
    patient: Patient,
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Runs inference on the patient's data using the best-ranked model.

    Args:
        patient (Patient): A `Patient` dataclass instance containing patient-level,
            tooth-level, and side-level information.

    Returns:
        Tuple[DataFrame, DataFrame, DataFrame]: Prediction input data, model
            output, and results with predictions and probabilities for each side
            of the patient's teeth.
    """
    patient_data = patient_to_df(patient=patient)
    predict_data, patient_data = self.inference_engine.prepare_inference(
        task=self.task,
        patient_data=patient_data,
        encoding=self.encoding,
        X_train=self.X_train,
        y_train=self.y_train,
    )

    return self.inference_engine.patient_inference(
        predict_data=predict_data, patient_data=patient_data
    )
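
A usage sketch, assuming `evaluator` from the class-level example and a default-constructed `Patient` (real use would populate the patient-, tooth-, and side-level fields):

```
from periomod.base import Patient, patient_to_df

# Sketch: inference for a single patient with the best-ranked model.
patient = Patient()
predict_data, output, results = evaluator.wrapped_patient_inference(patient=patient)
```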