
EvaluatorWrapper

Bases: BaseEvaluatorWrapper

Wrapper class for model evaluation, feature importance, and inference.

Extends the base evaluation functionality to enable comprehensive model evaluation, feature importance analysis, patient inference, and jackknife resampling for confidence interval estimation.

Inherits
  • BaseEvaluatorWrapper: Provides foundational methods and attributes for model evaluation, data preparation, and inference.

Parameters:
  • learners_dict (Dict): Dictionary containing trained models and their metadata. Required.
  • criterion (str): The criterion used to select the best model ('f1', 'macro_f1', 'brier_score'). Required.
  • aggregate (bool): Whether to aggregate one-hot encoding. Defaults to True.
  • verbose (bool): If True, enables verbose logging during evaluation and inference. Defaults to False.
  • random_state (int): Random state for resampling. Defaults to 0.
  • path (Path): Path to the processed data file. Defaults to Path("data/processed/processed_data.csv").

Attributes:
  • learners_dict (Dict): Contains metadata about trained models.
  • criterion (str): Criterion used for model selection.
  • aggregate (bool): Flag for aggregating one-hot encoded metrics.
  • verbose (bool): Controls verbosity in evaluation processes.
  • model (object): Best-ranked model based on the criterion.
  • encoding (str): Encoding method ('one_hot' or 'target').
  • learner (str): Type of model (learner) used in training.
  • task (str): Task associated with the extracted model.
  • factor (Optional[float]): Resampling factor if applicable.
  • sampling (Optional[str]): Resampling strategy ('upsampling', 'smote', etc.).
  • classification (str): Classification type ('binary' or 'multiclass').
  • dataloader (ProcessedDataLoader): Data loader and transformer.
  • resampler (Resampler): Resampling strategy for training and testing.
  • df (DataFrame): Loaded dataset.
  • df_processed (DataFrame): Processed dataset.
  • train_df (DataFrame): Training data after splitting.
  • test_df (DataFrame): Test data after splitting.
  • X_train (DataFrame): Training features.
  • y_train (Series): Training labels.
  • X_test (DataFrame): Test features.
  • y_test (Series): Test labels.
  • base_target (Optional[ndarray]): Baseline target for evaluations.
  • baseline (Baseline): Baseline class for model analysis.
  • evaluator (ModelEvaluator): Evaluator for model metrics and feature importance.
  • inference_engine (ModelInference): Model inference manager.
  • trainer (Trainer): Trainer for model evaluation and optimization.

Methods:
  • wrapped_evaluation: Runs comprehensive evaluation with optional plots for metrics such as confusion matrix and Brier scores.
  • compare_bss: Compares the Brier Skill Score of the model with baselines on the test set. Allows subsetting of the test set.
  • evaluate_cluster: Performs clustering and calculates Brier scores. Allows subsetting of the test set.
  • evaluate_feature_importance: Computes feature importance using specified methods (e.g., SHAP, permutation importance). Allows subsetting of the test set.
  • average_over_splits: Aggregates metrics across multiple data splits for robust evaluation.
  • wrapped_patient_inference: Conducts inference on individual patient data.
  • wrapped_jackknife: Executes jackknife resampling on patient data to estimate confidence intervals.

Inherited Properties
  • criterion (str): Retrieves or sets current evaluation criterion for model selection. Supports 'f1', 'brier_score', and 'macro_f1'.
  • model (object): Retrieves best-ranked model dynamically based on the current criterion. Recalculates when criterion is updated.
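
Because `model` is re-derived from `criterion`, switching the criterion on an existing wrapper re-selects the best learner without re-instantiation. A minimal sketch, assuming an `evaluator` instance constructed as in the Examples below:

```
# Sketch: changing the criterion re-ranks the stored learners.
evaluator.criterion = "brier_score"  # switch selection criterion
best_for_brier = evaluator.model     # best-ranked model under 'brier_score'

evaluator.criterion = "f1"
best_for_f1 = evaluator.model        # recomputed for 'f1'
```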

Examples:

from periomod.base import Patient, patient_to_df
from periomod.wrapper import EvaluatorWrapper, load_benchmark, load_learners

benchmark = load_benchmark(path="reports/experiment/benchmark.csv")
learners = load_learners(path="models/experiments")

# Initialize evaluator with learners from BenchmarkWrapper and f1 criterion
evaluator = EvaluatorWrapper(
    learners_dict=learners,
    criterion="f1",
    path="data/processed/processed_data.csv"
)

# Evaluate the model and generate plots
evaluator.wrapped_evaluation()

# Cluster analysis on predictions with Brier score smaller than threshold
evaluator.evaluate_cluster(brier_threshold=0.15)

# Calculate feature importance
evaluator.evaluate_feature_importance(fi_types=["shap", "permutation"])

# Train and average over multiple random splits
avg_metrics_df = evaluator.average_over_splits(num_splits=5, n_jobs=-1)

# Define a patient instance
patient = Patient()
patient_df = patient_to_df(patient=patient)

# Run inference on a specific patient's data
predict_data, output, results = evaluator.wrapped_patient_inference(
    patient=patient
)

# Execute jackknife resampling for robust inference
jackknife_results, ci_plots = evaluator.wrapped_jackknife(
    patient=patient, results=results, sample_fraction=0.8, n_jobs=-1
)
Source code in periomod/wrapper/_wrapper.py
class EvaluatorWrapper(BaseEvaluatorWrapper):
    """Wrapper class for model evaluation, feature importance, and inference.

    Extends the base evaluation functionality to enable comprehensive model
    evaluation, feature importance analysis, patient inference, and jackknife
    resampling for confidence interval estimation.

    Inherits:
        - `BaseEvaluatorWrapper`: Provides foundational methods and attributes for
          model evaluation, data preparation, and inference.

    Args:
        learners_dict (Dict): Dictionary containing trained models and their metadata.
        criterion (str): The criterion used to select the best model ('f1', 'macro_f1',
            'brier_score').
        aggregate (bool): Whether to aggregate one-hot encoding. Defaults
            to True.
        verbose (bool): If True, enables verbose logging during evaluation
            and inference. Defaults to False.
        random_state (int): Random state for resampling. Defaults to 0.
        path (Path): Path to the processed data file. Defaults to
            Path("data/processed/processed_data.csv").

    Attributes:
        learners_dict (Dict): Contains metadata about trained models.
        criterion (str): Criterion used for model selection.
        aggregate (bool): Flag for aggregating one-hot encoded metrics.
        verbose (bool): Controls verbosity in evaluation processes.
        model (object): Best-ranked model based on the criterion.
        encoding (str): Encoding method ('one_hot' or 'target').
        learner (str): Type of model (learner) used in training.
        task (str): Task associated with the extracted model.
        factor (Optional[float]): Resampling factor if applicable.
        sampling (Optional[str]): Resampling strategy ('upsampling', 'smote', etc.).
        classification (str): Classification type ('binary' or 'multiclass').
        dataloader (ProcessedDataLoader): Data loader and transformer.
        resampler (Resampler): Resampling strategy for training and testing.
        df (pd.DataFrame): Loaded dataset.
        df_processed (pd.DataFrame): Processed dataset.
        train_df (pd.DataFrame): Training data after splitting.
        test_df (pd.DataFrame): Test data after splitting.
        X_train (pd.DataFrame): Training features.
        y_train (pd.Series): Training labels.
        X_test (pd.DataFrame): Test features.
        y_test (pd.Series): Test labels.
        base_target (Optional[np.ndarray]): Baseline target for evaluations.
        baseline (Baseline): Baseline class for model analysis.
        evaluator (ModelEvaluator): Evaluator for model metrics and feature importance.
        inference_engine (ModelInference): Model inference manager.
        trainer (Trainer): Trainer for model evaluation and optimization.

    Methods:
        wrapped_evaluation: Runs comprehensive evaluation with optional
            plots for metrics such as confusion matrix and Brier scores.
        compare_bss: Compares Brier Skill Score of the model with baselines
            on the test set. Allows subsetting of test set.
        evaluate_cluster: Performs clustering and calculates Brier scores.
            Allows subsetting of test set.
        evaluate_feature_importance: Computes feature importance using
            specified methods (e.g., SHAP, permutation importance). Allows subsetting
            of test set.
        average_over_splits: Aggregates metrics across multiple data
            splits for robust evaluation.
        wrapped_patient_inference: Conducts inference on individual patient data.
        wrapped_jackknife: Executes jackknife resampling on patient data to
            estimate confidence intervals.

    Inherited Properties:
        - `criterion (str):` Retrieves or sets current evaluation criterion for model
            selection. Supports 'f1', 'brier_score', and 'macro_f1'.
        - `model (object):` Retrieves best-ranked model dynamically based on the current
            criterion. Recalculates when criterion is updated.

    Examples:
        ```
        from periomod.base import Patient, patient_to_df
        from periomod.wrapper import EvaluatorWrapper, load_benchmark, load_learners

        benchmark = load_benchmark(path="reports/experiment/benchmark.csv")
        learners = load_learners(path="models/experiments")

        # Initialize evaluator with learners from BenchmarkWrapper and f1 criterion
        evaluator = EvaluatorWrapper(
            learners_dict=learners,
            criterion="f1",
            path="data/processed/processed_data.csv"
        )

        # Evaluate the model and generate plots
        evaluator.wrapped_evaluation()

        # Cluster analysis on predictions with brier score smaller than threshold
        evaluator.evaluate_cluster(brier_threshold=0.15)

        # Calculate feature importance
        evaluator.evaluate_feature_importance(fi_types=["shap", "permutation"])

        # Train and average over multiple random splits
        avg_metrics_df = evaluator.average_over_splits(num_splits=5, n_jobs=-1)

        # Define a patient instance
        patient = Patient()
        patient_df = patient_to_df(patient=patient)

        # Run inference on a specific patient's data
        predict_data, output, results = evaluator.wrapped_patient_inference(
            patient=patient
        )

        # Execute jackknife resampling for robust inference
        jackknife_results, ci_plots = evaluator.wrapped_jackknife(
            patient=patient, results=results, sample_fraction=0.8, n_jobs=-1
        )
        ```
    """

    def __init__(
        self,
        learners_dict: Dict,
        criterion: str,
        aggregate: bool = True,
        verbose: bool = False,
        random_state: int = 0,
        path: Path = Path("data/processed/processed_data.csv"),
    ) -> None:
        """Initializes EvaluatorWrapper with model, evaluation, and inference setup.

        Args:
            learners_dict (Dict): Dictionary containing trained models.
            criterion (str): The criterion used to select the best model ('f1',
                'macro_f1', 'brier_score').
            aggregate (bool): Whether to aggregate one-hot encoding. Defaults
                to True.
            verbose (bool): If True, enables verbose logging during evaluation
                and inference. Defaults to False.
            random_state (int): Random state for resampling. Defaults to 0.
            path (Path): Path to the processed data file. Defaults to
                Path("data/processed/processed_data.csv").

        """
        super().__init__(
            learners_dict=learners_dict,
            criterion=criterion,
            aggregate=aggregate,
            verbose=verbose,
            random_state=random_state,
            path=path,
        )

    def wrapped_evaluation(
        self,
        cm: bool = True,
        cm_base: bool = True,
        brier_groups: bool = True,
        calibration: bool = True,
        tight_layout: bool = False,
    ) -> None:
        """Runs evaluation on the best-ranked model.

        Args:
            cm (bool): Plot the confusion matrix. Defaults to True.
            cm_base (bool): Plot confusion matrix vs value before treatment.
                Defaults to True.
            brier_groups (bool): Calculate Brier score groups. Defaults to True.
            calibration (bool): Plots model calibration. Defaults to True.
            tight_layout (bool): If True, applies tight layout to the plot.
                Defaults to False.
        """
        if cm:
            self.evaluator.plot_confusion_matrix(
                tight_layout=tight_layout, task=self.task
            )
        if cm_base:
            if self.task in [
                "pocketclosure",
                "pocketclosureinf",
                "pdgrouprevaluation",
            ]:
                self.evaluator.plot_confusion_matrix(
                    col=self.base_target,
                    y_label="Pocket Closure",
                    tight_layout=tight_layout,
                    task=self.task,
                )
        if brier_groups:
            self.evaluator.brier_score_groups(tight_layout=tight_layout, task=self.task)
        if calibration:
            self.evaluator.calibration_plot(task=self.task, tight_layout=tight_layout)

    def compare_bss(
        self,
        base: Optional[str] = None,
        revaluation: Optional[str] = None,
        true_preds: bool = False,
        brier_threshold: Optional[float] = None,
        tight_layout: bool = False,
    ) -> None:
        """Compares Brier Skill Score of model with baseline on test set.

        Args:
            base (Optional[str]): Baseline variable for comparison. Defaults to None.
            revaluation (Optional[str]): Revaluation variable. Defaults to None.
            true_preds (bool): Subset by correct predictions. Defaults to False.
            brier_threshold (Optional[float]): Filters observations by Brier score
                threshold. Defaults to None.
            tight_layout (bool): If True, applies tight layout to the plot.
                Defaults to False.
        """
        baseline_models, _, _ = self.baseline.train_baselines()
        self.evaluator.X, self.evaluator.y, patients = self._test_filters(
            X=self.evaluator.X,
            y=self.evaluator.y,
            base=base,
            revaluation=revaluation,
            true_preds=true_preds,
            brier_threshold=brier_threshold,
        )
        self.evaluator.bss_comparison(
            baseline_models=baseline_models,
            classification=self.classification,
            num_patients=patients,
            tight_layout=tight_layout,
        )
        self.evaluator.X, self.evaluator.y = self.X_test, self.y_test

    def evaluate_cluster(
        self,
        n_cluster: int = 3,
        base: Optional[str] = None,
        revaluation: Optional[str] = None,
        true_preds: bool = False,
        brier_threshold: Optional[float] = None,
        tight_layout: bool = False,
    ) -> None:
        """Performs cluster analysis with Brier scores, optionally applying subsetting.

        This method allows detailed feature analysis by offering multiple subsetting
        options for the test set. The base and revaluation columns allow filtering of
        observations that have not changed after treatment. With true_preds, only
        observations that were correctly predicted are considered. The brier_threshold
        enables filtering of observations that achieved a smaller Brier score at
        prediction time than the threshold.

        Args:
            n_cluster (int): Number of clusters for Brier score clustering analysis.
                Defaults to 3.
            base (Optional[str]): Baseline variable for comparison. Defaults to None.
            revaluation (Optional[str]): Revaluation variable. Defaults to None.
            true_preds (bool): Subset by correct predictions. Defaults to False.
            brier_threshold (Optional[float]): Filters observations by Brier score
                threshold. Defaults to None.
            tight_layout (bool): If True, applies tight layout to the plot.
                Defaults to False.
        """
        self.evaluator.X, self.evaluator.y, patients = self._test_filters(
            X=self.evaluator.X,
            y=self.evaluator.y,
            base=base,
            revaluation=revaluation,
            true_preds=true_preds,
            brier_threshold=brier_threshold,
        )
        print(f"Number of patients in test set: {patients}")
        print(f"Number of tooth sites: {len(self.evaluator.y)}")
        self.evaluator.analyze_brier_within_clusters(
            n_clusters=n_cluster, tight_layout=tight_layout
        )
        self.evaluator.X, self.evaluator.y = self.X_test, self.y_test

    def evaluate_feature_importance(
        self,
        fi_types: List[str],
        base: Optional[str] = None,
        revaluation: Optional[str] = None,
        true_preds: bool = False,
        brier_threshold: Optional[float] = None,
    ) -> None:
        """Evaluates feature importance using the evaluator, with optional subsetting.

        This method allows detailed feature analysis by offering multiple subsetting
        options for the test set. The base and revaluation columns allow filtering of
        observations that have not changed after treatment. With true_preds, only
        observations that were correctly predicted are considered. The brier_threshold
        enables filtering of observations that achieved a smaller Brier score at
        prediction time than the threshold.

        Args:
            fi_types (List[str]): List of feature importance types to evaluate.
            base (Optional[str]): Baseline variable for comparison. Defaults to None.
            revaluation (Optional[str]): Revaluation variable. Defaults to None.
            true_preds (bool): Subset by correct predictions. Defaults to False.
            brier_threshold (Optional[float]): Filters observations by Brier score
                threshold. Defaults to None.
        """
        self.evaluator.X, self.evaluator.y, patients = self._test_filters(
            X=self.evaluator.X,
            y=self.evaluator.y,
            base=base,
            revaluation=revaluation,
            true_preds=true_preds,
            brier_threshold=brier_threshold,
        )
        print(f"Number of patients in test set: {patients}")
        print(f"Number of tooth sites: {len(self.evaluator.y)}")
        self.evaluator.evaluate_feature_importance(fi_types=fi_types)
        self.evaluator.X, self.evaluator.y = self.X_test, self.y_test

    def average_over_splits(
        self, num_splits: int = 5, n_jobs: int = -1
    ) -> pd.DataFrame:
        """Trains the final model over multiple splits with different seeds.

        Args:
            num_splits (int): Number of random seeds/splits to train the model on.
                Defaults to 5.
            n_jobs (int): Number of parallel jobs. Defaults to -1 (use all processors).

        Returns:
            DataFrame: DataFrame containing average performance metrics.
        """
        seeds = range(num_splits)
        metrics_list = Parallel(n_jobs=n_jobs)(
            delayed(self._train_and_get_metrics)(seed, self.learner) for seed in seeds
        )
        avg_metrics = {}
        for metric in metrics_list[0]:
            if metric == "Confusion Matrix":
                continue
            values = [d[metric] for d in metrics_list if d[metric] is not None]
            avg_metrics[metric] = sum(values) / len(values) if values else None

        avg_confusion_matrix = None
        if self.classification == "binary" and "Confusion Matrix" in metrics_list[0]:
            avg_confusion_matrix = (
                np.mean([d["Confusion Matrix"] for d in metrics_list], axis=0)
                .astype(int)
                .tolist()
            )

        results = {
            "Task": self.task,
            "Learner": self.learner,
            "Criterion": self.criterion,
            "Sampling": self.sampling,
            "Factor": self.factor,
            **{
                metric: round(value, 4) if isinstance(value, (int, float)) else value
                for metric, value in avg_metrics.items()
            },
        }

        if avg_confusion_matrix is not None:
            results["Confusion Matrix"] = avg_confusion_matrix

        return pd.DataFrame([results])

    def wrapped_patient_inference(
        self,
        patient: Patient,
    ) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
        """Runs inference on the patient's data using the best-ranked model.

        Args:
            patient (Patient): A `Patient` dataclass instance containing patient-level,
                tooth-level, and side-level information.

        Returns:
            Tuple[DataFrame, DataFrame, DataFrame]: Prediction input data, model
                output, and results with predictions and probabilities for each
                side of the patient's teeth.
        """
        patient_data = patient_to_df(patient=patient)
        predict_data, patient_data = self.inference_engine.prepare_inference(
            task=self.task,
            patient_data=patient_data,
            encoding=self.encoding,
            X_train=self.X_train,
            y_train=self.y_train,
        )

        return self.inference_engine.patient_inference(
            predict_data=predict_data, patient_data=patient_data
        )

    def wrapped_jackknife(
        self,
        patient: Patient,
        results: pd.DataFrame,
        sample_fraction: float = 1.0,
        n_jobs: int = -1,
        max_plots: int = 192,
    ) -> pd.DataFrame:
        """Runs jackknife resampling for inference on a given patient's data.

        Args:
            patient (Patient): `Patient` dataclass instance containing patient-level,
                tooth-level, and side-level information.
            results (pd.DataFrame): DataFrame to store results from jackknife inference.
            sample_fraction (float, optional): The fraction of patient data to use for
                jackknife resampling. Defaults to 1.0.
            n_jobs (int, optional): The number of parallel jobs to run. Defaults to -1.
            max_plots (int): Maximum number of plots for jackknife intervals.

        Returns:
            DataFrame: The results of jackknife inference.
        """
        patient_data = patient_to_df(patient=patient)
        patient_data, _ = self.inference_engine.prepare_inference(
            task=self.task,
            patient_data=patient_data,
            encoding=self.encoding,
            X_train=self.X_train,
            y_train=self.y_train,
        )
        return self.inference_engine.jackknife_inference(
            model=self.model,
            train_df=self.train_df,
            patient_data=patient_data,
            encoding=self.encoding,
            inference_results=results,
            sample_fraction=sample_fraction,
            n_jobs=n_jobs,
            max_plots=max_plots,
        )

__init__(learners_dict, criterion, aggregate=True, verbose=False, random_state=0, path=Path('data/processed/processed_data.csv'))

Initializes EvaluatorWrapper with model, evaluation, and inference setup.

Parameters:
  • learners_dict (Dict): Dictionary containing trained models. Required.
  • criterion (str): The criterion used to select the best model ('f1', 'macro_f1', 'brier_score'). Required.
  • aggregate (bool): Whether to aggregate one-hot encoding. Defaults to True.
  • verbose (bool): If True, enables verbose logging during evaluation and inference. Defaults to False.
  • random_state (int): Random state for resampling. Defaults to 0.
  • path (Path): Path to the processed data file. Defaults to Path("data/processed/processed_data.csv").
Source code in periomod/wrapper/_wrapper.py
def __init__(
    self,
    learners_dict: Dict,
    criterion: str,
    aggregate: bool = True,
    verbose: bool = False,
    random_state: int = 0,
    path: Path = Path("data/processed/processed_data.csv"),
) -> None:
    """Initializes EvaluatorWrapper with model, evaluation, and inference setup.

    Args:
        learners_dict (Dict): Dictionary containing trained models.
        criterion (str): The criterion used to select the best model ('f1',
            'macro_f1', 'brier_score').
        aggregate (bool): Whether to aggregate one-hot encoding. Defaults
            to True.
        verbose (bool): If True, enables verbose logging during evaluation
            and inference. Defaults to False.
        random_state (int): Random state for resampling. Defaults to 0.
        path (Path): Path to the processed data file. Defaults to
            Path("data/processed/processed_data.csv").

    """
    super().__init__(
        learners_dict=learners_dict,
        criterion=criterion,
        aggregate=aggregate,
        verbose=verbose,
        random_state=random_state,
        path=path,
    )

average_over_splits(num_splits=5, n_jobs=-1)

Trains the final model over multiple splits with different seeds.

Parameters:
  • num_splits (int): Number of random seeds/splits to train the model on. Defaults to 5.
  • n_jobs (int): Number of parallel jobs. Defaults to -1 (use all processors).

Returns:
  • DataFrame: DataFrame containing average performance metrics.

Source code in periomod/wrapper/_wrapper.py
def average_over_splits(
    self, num_splits: int = 5, n_jobs: int = -1
) -> pd.DataFrame:
    """Trains the final model over multiple splits with different seeds.

    Args:
        num_splits (int): Number of random seeds/splits to train the model on.
            Defaults to 5.
        n_jobs (int): Number of parallel jobs. Defaults to -1 (use all processors).

    Returns:
        DataFrame: DataFrame containing average performance metrics.
    """
    seeds = range(num_splits)
    metrics_list = Parallel(n_jobs=n_jobs)(
        delayed(self._train_and_get_metrics)(seed, self.learner) for seed in seeds
    )
    avg_metrics = {}
    for metric in metrics_list[0]:
        if metric == "Confusion Matrix":
            continue
        values = [d[metric] for d in metrics_list if d[metric] is not None]
        avg_metrics[metric] = sum(values) / len(values) if values else None

    avg_confusion_matrix = None
    if self.classification == "binary" and "Confusion Matrix" in metrics_list[0]:
        avg_confusion_matrix = (
            np.mean([d["Confusion Matrix"] for d in metrics_list], axis=0)
            .astype(int)
            .tolist()
        )

    results = {
        "Task": self.task,
        "Learner": self.learner,
        "Criterion": self.criterion,
        "Sampling": self.sampling,
        "Factor": self.factor,
        **{
            metric: round(value, 4) if isinstance(value, (int, float)) else value
            for metric, value in avg_metrics.items()
        },
    }

    if avg_confusion_matrix is not None:
        results["Confusion Matrix"] = avg_confusion_matrix

    return pd.DataFrame([results])
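
Each seed produces an independent split, and all non-matrix metrics are arithmetic means over the splits. A usage sketch, assuming `evaluator` is the instance from the class-level example:

```
# Sketch: average metrics over five seeded splits, using all cores.
avg_metrics_df = evaluator.average_over_splits(num_splits=5, n_jobs=-1)
print(avg_metrics_df)  # one row: Task, Learner, Criterion, Sampling, Factor, metrics
```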

compare_bss(base=None, revaluation=None, true_preds=False, brier_threshold=None, tight_layout=False)

Compares Brier Skill Score of model with baseline on test set.

Parameters:
  • base (Optional[str]): Baseline variable for comparison. Defaults to None.
  • revaluation (Optional[str]): Revaluation variable. Defaults to None.
  • true_preds (bool): Subset by correct predictions. Defaults to False.
  • brier_threshold (Optional[float]): Filters observations by Brier score threshold. Defaults to None.
  • tight_layout (bool): If True, applies tight layout to the plot. Defaults to False.
Source code in periomod/wrapper/_wrapper.py
def compare_bss(
    self,
    base: Optional[str] = None,
    revaluation: Optional[str] = None,
    true_preds: bool = False,
    brier_threshold: Optional[float] = None,
    tight_layout: bool = False,
) -> None:
    """Compares Brier Skill Score of model with baseline on test set.

    Args:
        base (Optional[str]): Baseline variable for comparison. Defaults to None.
        revaluation (Optional[str]): Revaluation variable. Defaults to None.
        true_preds (bool): Subset by correct predictions. Defaults to False.
        brier_threshold (Optional[float]): Filters observations by Brier score
            threshold. Defaults to None.
        tight_layout (bool): If True, applies tight layout to the plot.
            Defaults to False.
    """
    baseline_models, _, _ = self.baseline.train_baselines()
    self.evaluator.X, self.evaluator.y, patients = self._test_filters(
        X=self.evaluator.X,
        y=self.evaluator.y,
        base=base,
        revaluation=revaluation,
        true_preds=true_preds,
        brier_threshold=brier_threshold,
    )
    self.evaluator.bss_comparison(
        baseline_models=baseline_models,
        classification=self.classification,
        num_patients=patients,
        tight_layout=tight_layout,
    )
    self.evaluator.X, self.evaluator.y = self.X_test, self.y_test
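
A usage sketch, assuming `evaluator` from the class-level example; the filter values here are illustrative:

```
# Sketch: BSS comparison restricted to correctly predicted observations
# with a Brier score below 0.2.
evaluator.compare_bss(true_preds=True, brier_threshold=0.2)
```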

evaluate_cluster(n_cluster=3, base=None, revaluation=None, true_preds=False, brier_threshold=None, tight_layout=False)

Performs cluster analysis with Brier scores, optionally applying subsetting.

This method allows detailed feature analysis by offering multiple subsetting options for the test set. The base and revaluation columns allow filtering of observations that have not changed after treatment. With true_preds, only observations that were correctly predicted are considered. The brier_threshold enables filtering of observations that achieved a smaller Brier score at prediction time than the threshold.

Parameters:
  • n_cluster (int): Number of clusters for Brier score clustering analysis. Defaults to 3.
  • base (Optional[str]): Baseline variable for comparison. Defaults to None.
  • revaluation (Optional[str]): Revaluation variable. Defaults to None.
  • true_preds (bool): Subset by correct predictions. Defaults to False.
  • brier_threshold (Optional[float]): Filters observations by Brier score threshold. Defaults to None.
  • tight_layout (bool): If True, applies tight layout to the plot. Defaults to False.
Source code in periomod/wrapper/_wrapper.py
def evaluate_cluster(
    self,
    n_cluster: int = 3,
    base: Optional[str] = None,
    revaluation: Optional[str] = None,
    true_preds: bool = False,
    brier_threshold: Optional[float] = None,
    tight_layout: bool = False,
) -> None:
    """Performs cluster analysis with Brier scores, optionally applying subsetting.

    This method allows detailed feature analysis by offering multiple subsetting
    options for the test set. The base and revaluation columns allow filtering of
    observations that have not changed after treatment. With true_preds, only
    observations that were correctly predicted are considered. The brier_threshold
    enables filtering of observations that achieved a smaller Brier score at
    prediction time than the threshold.

    Args:
        n_cluster (int): Number of clusters for Brier score clustering analysis.
            Defaults to 3.
        base (Optional[str]): Baseline variable for comparison. Defaults to None.
        revaluation (Optional[str]): Revaluation variable. Defaults to None.
        true_preds (bool): Subset by correct predictions. Defaults to False.
        brier_threshold (Optional[float]): Filters observations by Brier score
            threshold. Defaults to None.
        tight_layout (bool): If True, applies tight layout to the plot.
            Defaults to False.
    """
    self.evaluator.X, self.evaluator.y, patients = self._test_filters(
        X=self.evaluator.X,
        y=self.evaluator.y,
        base=base,
        revaluation=revaluation,
        true_preds=true_preds,
        brier_threshold=brier_threshold,
    )
    print(f"Number of patients in test set: {patients}")
    print(f"Number of tooth sites: {len(self.evaluator.y)}")
    self.evaluator.analyze_brier_within_clusters(
        n_clusters=n_cluster, tight_layout=tight_layout
    )
    self.evaluator.X, self.evaluator.y = self.X_test, self.y_test
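
A usage sketch with illustrative filter values, assuming `evaluator` from the class-level example:

```
# Sketch: cluster only the well-calibrated sites (Brier score < 0.1)
# into four groups instead of the default three.
evaluator.evaluate_cluster(n_cluster=4, brier_threshold=0.1)
```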

evaluate_feature_importance(fi_types, base=None, revaluation=None, true_preds=False, brier_threshold=None)

Evaluates feature importance using the evaluator, with optional subsetting.

This method allows detailed feature analysis by offering multiple subsetting options for the test set. The base and revaluation columns allow filtering of observations that have not changed after treatment. With true_preds, only observations that were correctly predicted are considered. The brier_threshold enables filtering of observations that achieved a smaller Brier score at prediction time than the threshold.

Parameters:
  • fi_types (List[str]): List of feature importance types to evaluate. Required.
  • base (Optional[str]): Baseline variable for comparison. Defaults to None.
  • revaluation (Optional[str]): Revaluation variable. Defaults to None.
  • true_preds (bool): Subset by correct predictions. Defaults to False.
  • brier_threshold (Optional[float]): Filters observations by Brier score threshold. Defaults to None.
Source code in periomod/wrapper/_wrapper.py
def evaluate_feature_importance(
    self,
    fi_types: List[str],
    base: Optional[str] = None,
    revaluation: Optional[str] = None,
    true_preds: bool = False,
    brier_threshold: Optional[float] = None,
) -> None:
    """Evaluates feature importance using the evaluator, with optional subsetting.

    This method allows detailed feature analysis by offering multiple subsetting
    options for the test set. The base and revaluation columns allow filtering of
    observations that have not changed after treatment. With true_preds, only
    observations that were correctly predicted are considered. The brier_threshold
    enables filtering of observations that achieved a smaller Brier score at
    prediction time than the threshold.

    Args:
        fi_types (List[str]): List of feature importance types to evaluate.
        base (Optional[str]): Baseline variable for comparison. Defaults to None.
        revaluation (Optional[str]): Revaluation variable. Defaults to None.
        true_preds (bool): Subset by correct predictions. Defaults to False.
        brier_threshold (Optional[float]): Filters observations by Brier score
            threshold. Defaults to None.
    """
    self.evaluator.X, self.evaluator.y, patients = self._test_filters(
        X=self.evaluator.X,
        y=self.evaluator.y,
        base=base,
        revaluation=revaluation,
        true_preds=true_preds,
        brier_threshold=brier_threshold,
    )
    print(f"Number of patients in test set: {patients}")
    print(f"Number of tooth sites: {len(self.evaluator.y)}")
    self.evaluator.evaluate_feature_importance(fi_types=fi_types)
    self.evaluator.X, self.evaluator.y = self.X_test, self.y_test
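
A usage sketch, assuming `evaluator` from the class-level example; the subsetting flag is illustrative:

```
# Sketch: permutation importance computed only on correctly
# predicted test observations.
evaluator.evaluate_feature_importance(fi_types=["permutation"], true_preds=True)
```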

wrapped_evaluation(cm=True, cm_base=True, brier_groups=True, calibration=True, tight_layout=False)

Runs evaluation on the best-ranked model.

Parameters:
  • cm (bool): Plot the confusion matrix. Defaults to True.
  • cm_base (bool): Plot confusion matrix vs value before treatment. Defaults to True.
  • brier_groups (bool): Calculate Brier score groups. Defaults to True.
  • calibration (bool): Plot model calibration. Defaults to True.
  • tight_layout (bool): If True, applies tight layout to the plot. Defaults to False.
Source code in periomod/wrapper/_wrapper.py
def wrapped_evaluation(
    self,
    cm: bool = True,
    cm_base: bool = True,
    brier_groups: bool = True,
    calibration: bool = True,
    tight_layout: bool = False,
) -> None:
    """Runs evaluation on the best-ranked model.

    Args:
        cm (bool): Plot the confusion matrix. Defaults to True.
        cm_base (bool): Plot confusion matrix vs value before treatment.
            Defaults to True.
        brier_groups (bool): Calculate Brier score groups. Defaults to True.
        calibration (bool): Plots model calibration. Defaults to True.
        tight_layout (bool): If True, applies tight layout to the plot.
            Defaults to False.
    """
    if cm:
        self.evaluator.plot_confusion_matrix(
            tight_layout=tight_layout, task=self.task
        )
    if cm_base:
        if self.task in [
            "pocketclosure",
            "pocketclosureinf",
            "pdgrouprevaluation",
        ]:
            self.evaluator.plot_confusion_matrix(
                col=self.base_target,
                y_label="Pocket Closure",
                tight_layout=tight_layout,
                task=self.task,
            )
    if brier_groups:
        self.evaluator.brier_score_groups(tight_layout=tight_layout, task=self.task)
    if calibration:
        self.evaluator.calibration_plot(task=self.task, tight_layout=tight_layout)
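
A usage sketch, assuming `evaluator` from the class-level example:

```
# Sketch: plot only the confusion matrix and the calibration curve.
evaluator.wrapped_evaluation(
    cm=True, cm_base=False, brier_groups=False, calibration=True
)
```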

wrapped_jackknife(patient, results, sample_fraction=1.0, n_jobs=-1, max_plots=192)

Runs jackknife resampling for inference on a given patient's data.

Parameters:
  • patient (Patient): Patient dataclass instance containing patient-level, tooth-level, and side-level information. Required.
  • results (DataFrame): DataFrame to store results from jackknife inference. Required.
  • sample_fraction (float): The fraction of patient data to use for jackknife resampling. Defaults to 1.0.
  • n_jobs (int): The number of parallel jobs to run. Defaults to -1.
  • max_plots (int): Maximum number of plots for jackknife intervals. Defaults to 192.

Returns:
  • DataFrame: The results of jackknife inference.
Source code in periomod/wrapper/_wrapper.py
def wrapped_jackknife(
    self,
    patient: Patient,
    results: pd.DataFrame,
    sample_fraction: float = 1.0,
    n_jobs: int = -1,
    max_plots: int = 192,
) -> pd.DataFrame:
    """Runs jackknife resampling for inference on a given patient's data.

    Args:
        patient (Patient): `Patient` dataclass instance containing patient-level,
            tooth-level, and side-level information.
        results (pd.DataFrame): DataFrame to store results from jackknife inference.
        sample_fraction (float, optional): The fraction of patient data to use for
            jackknife resampling. Defaults to 1.0.
        n_jobs (int, optional): The number of parallel jobs to run. Defaults to -1.
        max_plots (int): Maximum number of plots for jackknife intervals.

    Returns:
        DataFrame: The results of jackknife inference.
    """
    patient_data = patient_to_df(patient=patient)
    patient_data, _ = self.inference_engine.prepare_inference(
        task=self.task,
        patient_data=patient_data,
        encoding=self.encoding,
        X_train=self.X_train,
        y_train=self.y_train,
    )
    return self.inference_engine.jackknife_inference(
        model=self.model,
        train_df=self.train_df,
        patient_data=patient_data,
        encoding=self.encoding,
        inference_results=results,
        sample_fraction=sample_fraction,
        n_jobs=n_jobs,
        max_plots=max_plots,
    )
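
A usage sketch, assuming `evaluator`, `patient`, and the `results` DataFrame produced by `wrapped_patient_inference`, as in the class-level example:

```
# Sketch: jackknife resampling on 80% of the patient data.
jackknife_results, ci_plots = evaluator.wrapped_jackknife(
    patient=patient, results=results, sample_fraction=0.8, n_jobs=-1
)
```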

wrapped_patient_inference(patient)

Runs inference on the patient's data using the best-ranked model.

Parameters:
  • patient (Patient): A Patient dataclass instance containing patient-level, tooth-level, and side-level information. Required.

Returns:
  • Tuple[DataFrame, DataFrame, DataFrame]: Prediction input data, model output, and results with predictions and probabilities for each side of the patient's teeth.
Source code in periomod/wrapper/_wrapper.py
def wrapped_patient_inference(
    self,
    patient: Patient,
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Runs inference on the patient's data using the best-ranked model.

    Args:
        patient (Patient): A `Patient` dataclass instance containing patient-level,
            tooth-level, and side-level information.

    Returns:
        Tuple[DataFrame, DataFrame, DataFrame]: Prediction input data, model
            output, and results with predictions and probabilities for each side
            of the patient's teeth.
    """
    patient_data = patient_to_df(patient=patient)
    predict_data, patient_data = self.inference_engine.prepare_inference(
        task=self.task,
        patient_data=patient_data,
        encoding=self.encoding,
        X_train=self.X_train,
        y_train=self.y_train,
    )

    return self.inference_engine.patient_inference(
        predict_data=predict_data, patient_data=patient_data
    )
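
A usage sketch, assuming `evaluator` from the class-level example and a default-constructed `Patient` (real use would populate the patient-, tooth-, and side-level fields):

```
from periomod.base import Patient, patient_to_df

# Sketch: inference for a single patient with the best-ranked model.
patient = Patient()
predict_data, output, results = evaluator.wrapped_patient_inference(patient=patient)
```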