Resampler

Bases: BaseResampler

Resampler class for handling data resampling and train-test splitting.

This class extends BaseResampler to provide additional functionality for resampling datasets using various strategies (e.g., SMOTE, upsampling, downsampling) and for handling train-test splitting and cross-validation with group constraints.

Inherits
  • BaseResampler: Base class for resampling and validation methods.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| classification | str | Specifies the type of classification ('binary' or 'multiclass'). | required |
| encoding | str | Specifies the encoding type ('one_hot' or 'target'). | required |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| classification | str | Type of classification task ('binary' or 'multiclass'). |
| encoding | str | Encoding strategy for categorical features ('one_hot' or 'target'). |
| all_cat_vars | list | List of categorical variables in the dataset, used in target encoding when applicable. |

Methods:

| Name | Description |
| --- | --- |
| split_train_test_df | Splits the dataset into train and test sets based on group constraints, ensuring reproducibility. |
| split_x_y | Separates features and target labels in both train and test sets, applying optional sampling and encoding. |
| cv_folds | Performs group-based cross-validation, applying resampling strategies to balance training data where specified. |

Inherited Methods
  • apply_sampling: Applies specified sampling strategy to balance the dataset, supporting SMOTE, upsampling, and downsampling.
  • apply_target_encoding: Applies target encoding to categorical variables in the dataset.
  • validate_dataframe: Validates that input data meets requirements, such as having specified columns.
  • validate_n_folds: Ensures the number of cross-validation folds is a positive integer.
  • validate_sampling_strategy: Verifies the sampling strategy is one of the allowed options.
Example
from periomod.data import ProcessedDataLoader
from periomod.resampling import Resampler

# `dataloader` is assumed to be a ProcessedDataLoader instance created beforehand.
df = dataloader.load_data(path="data/processed/training_data.csv")

resampler = Resampler(classification="binary", encoding="one_hot")
train_df, test_df = resampler.split_train_test_df(df=df, seed=42, test_size=0.3)

# upsample minority class by a factor of 2.
X_train, y_train, X_test, y_test = resampler.split_x_y(
    train_df, test_df, sampling="upsampling", factor=2
)
# performs grouped cross-validation with "smote" sampling on the training folds
outer_splits, cv_folds_indices = resampler.cv_folds(
    df, sampling="smote", factor=2.0, seed=42, n_folds=5
)
Source code in periomod/resampling/_resampler.py
class Resampler(BaseResampler):
    """Resampler class for handling data resampling and train-test splitting.

    This class extends `BaseResampler` to provide additional functionality
    for resampling datasets using various strategies (e.g., SMOTE, upsampling,
    downsampling) and for handling train-test splitting and cross-validation
    with group constraints.

    Inherits:
        - `BaseResampler`: Base class for resampling and validation methods.

    Args:
        classification (str): Specifies the type of classification ('binary'
            or 'multiclass').
        encoding (str): Specifies the encoding type ('one_hot' or 'target').

    Attributes:
        classification (str): Type of classification task ('binary' or 'multiclass').
        encoding (str): Encoding strategy for categorical features
            ('one_hot' or 'target').
        all_cat_vars (list): List of categorical variables in the dataset, used in
            target encoding when applicable.

    Methods:
        split_train_test_df: Splits the dataset into train and test sets based
            on group constraints, ensuring reproducibility.
        split_x_y: Separates features and target labels in both train and test sets,
            applying optional sampling and encoding.
        cv_folds: Performs group-based cross-validation, applying resampling
            strategies to balance training data where specified.

    Inherited Methods:
        - `apply_sampling`: Applies specified sampling strategy to balance
          the dataset, supporting SMOTE, upsampling, and downsampling.
        - `apply_target_encoding`: Applies target encoding to categorical
          variables in the dataset.
        - `validate_dataframe`: Validates that input data meets requirements,
          such as having specified columns.
        - `validate_n_folds`: Ensures the number of cross-validation folds
          is a positive integer.
        - `validate_sampling_strategy`: Verifies the sampling strategy is
          one of the allowed options.

    Example:
        ```
        from periomod.data import ProcessedDataLoader
        from periomod.resampling import Resampler

        # `dataloader` is assumed to be a ProcessedDataLoader instance created beforehand.
        df = dataloader.load_data(path="data/processed/training_data.csv")

        resampler = Resampler(classification="binary", encoding="one_hot")
        train_df, test_df = resampler.split_train_test_df(df=df, seed=42, test_size=0.3)

        # upsample minority class by a factor of 2.
        X_train, y_train, X_test, y_test = resampler.split_x_y(
            train_df, test_df, sampling="upsampling", factor=2
        )
        # performs grouped cross-validation with "smote" sampling on the training folds
        outer_splits, cv_folds_indices = resampler.cv_folds(
            df, sampling="smote", factor=2.0, seed=42, n_folds=5
        )
        ```
    """

    def __init__(self, classification: str, encoding: str) -> None:
        """Initializes the Resampler class."""
        super().__init__(classification=classification, encoding=encoding)

    def split_train_test_df(
        self,
        df: pd.DataFrame,
        seed: int = 0,
        test_size: Optional[float] = 0.2,
    ) -> Tuple[pd.DataFrame, pd.DataFrame]:
        """Splits the dataset into train_df and test_df based on group identifiers.

        Args:
            df (pd.DataFrame): Input DataFrame.
            seed (int): Random seed for splitting. Defaults to 0.
            test_size (Optional[float]): Proportion of groups assigned to the
                test set. Defaults to 0.2.

        Returns:
            Tuple: Tuple containing the training and test DataFrames
                (train_df, test_df).

        Raises:
            ValueError: If required columns are missing from the input DataFrame.
            TypeError: If the input DataFrame is not a pandas DataFrame.
        """
        self.validate_dataframe(df=df, required_columns=[self.y, self.group_col])

        gss = GroupShuffleSplit(
            n_splits=1,
            test_size=test_size,
            random_state=seed,
        )
        train_idx, test_idx = next(gss.split(df, groups=df[self.group_col]))

        train_df = df.iloc[train_idx].reset_index(drop=True)
        test_df = df.iloc[test_idx].reset_index(drop=True)

        train_patient_ids = set(train_df[self.group_col])
        test_patient_ids = set(test_df[self.group_col])
        if not train_patient_ids.isdisjoint(test_patient_ids):
            raise ValueError(
                "Overlapping group values between the train and test sets."
            )

        return train_df, test_df

    def split_x_y(
        self,
        train_df: pd.DataFrame,
        test_df: pd.DataFrame,
        sampling: Union[str, None] = None,
        factor: Union[float, None] = None,
    ) -> Tuple[pd.DataFrame, pd.Series, pd.DataFrame, pd.Series]:
        """Splits the train and test DataFrames into feature and label sets.

        Splits into (X_train, y_train, X_test, y_test).

        Args:
            train_df (pd.DataFrame): The training DataFrame.
            test_df (pd.DataFrame): The testing DataFrame.
            sampling (str, optional): Resampling method to apply (e.g.,
                'upsampling', 'downsampling', 'smote'), defaults to None.
            factor (float, optional): Factor for sampling, defaults to None.

        Returns:
            Tuple: Tuple containing feature and label sets
                (X_train, y_train, X_test, y_test).

        Raises:
            ValueError: If required columns are missing or sampling method is invalid.
        """
        X_train = train_df.drop([self.y], axis=1)
        y_train = train_df[self.y]
        X_test = test_df.drop([self.y], axis=1)
        y_test = test_df[self.y]

        if self.encoding == "target":
            X_train, X_test = self.apply_target_encoding(
                X=X_train, X_val=X_test, y=y_train
            )

        if sampling is not None:
            X_train, y_train = self.apply_sampling(
                X=X_train, y=y_train, sampling=sampling, sampling_factor=factor
            )

        return (
            X_train.drop([self.group_col], axis=1),
            y_train,
            X_test.drop([self.group_col], axis=1),
            y_test,
        )

    def cv_folds(
        self,
        df: pd.DataFrame,
        seed: Optional[int] = 0,
        n_folds: Optional[int] = 10,
        sampling: Union[str, None] = None,
        factor: Union[float, None] = None,
    ) -> Tuple[list, list]:
        """Performs cross-validation with group constraints.

        Applies optional resampling strategies.

        Args:
            df (pd.DataFrame): Input DataFrame.
            seed (Optional[int]): Random seed for reproducibility. Defaults to 0.
            n_folds (Optional[int]): Number of folds for cross-validation.
                Defaults to 10.
            sampling (str, optional): Sampling method to apply (e.g.,
                'upsampling', 'downsampling', 'smote').
            factor (float, optional): Factor for resampling, applied to upsample,
                downsample, or SMOTE.


        Returns:
            Tuple: Tuple containing outer splits and cross-validation fold indices.

        Raises:
            ValueError: If required columns are missing or folds are inconsistent.
            TypeError: If the input DataFrame is not a pandas DataFrame.
        """
        np.random.default_rng(seed=seed)

        self.validate_dataframe(df=df, required_columns=[self.y, self.group_col])
        self.validate_n_folds(n_folds=n_folds)
        train_df, _ = self.split_train_test_df(df=df)
        gkf = GroupKFold(n_splits=n_folds)

        cv_folds_indices = []
        outer_splits = []
        original_validation_data = []

        for train_idx, test_idx in gkf.split(train_df, groups=train_df[self.group_col]):
            X_train_fold = train_df.iloc[train_idx].drop([self.y], axis=1)
            y_train_fold = train_df.iloc[train_idx][self.y]
            X_test_fold = train_df.iloc[test_idx].drop([self.y], axis=1)
            y_test_fold = train_df.iloc[test_idx][self.y]

            original_validation_data.append(
                train_df.iloc[test_idx].drop([self.y], axis=1).reset_index(drop=True)
            )

            if sampling is not None:
                X_train_fold, y_train_fold = self.apply_sampling(
                    X=X_train_fold,
                    y=y_train_fold,
                    sampling=sampling,
                    sampling_factor=factor,
                    random_state=seed,
                )

            cv_folds_indices.append((train_idx, test_idx))
            outer_splits.append(
                ((X_train_fold, y_train_fold), (X_test_fold, y_test_fold))
            )

        for original_test_data, (_, (X_test_fold, _)) in zip(
            original_validation_data, outer_splits, strict=False
        ):
            if not original_test_data.equals(X_test_fold.reset_index(drop=True)):
                raise ValueError(
                    "Validation folds' data not consistent after applying sampling "
                    "strategies."
                )
        if self.encoding == "target":
            outer_splits_t = []

            for (X_t, y_t), (X_val, y_val) in outer_splits:
                X_t, X_val = self.apply_target_encoding(X=X_t, X_val=X_val, y=y_t)
                if sampling == "smote":
                    X_t, y_t = self.apply_sampling(
                        X=X_t,
                        y=y_t,
                        sampling=sampling,
                        sampling_factor=factor,
                        random_state=seed,
                    )

                outer_splits_t.append(((X_t, y_t), (X_val, y_val)))
            outer_splits = outer_splits_t

        return outer_splits, cv_folds_indices

__init__(classification, encoding)

Initializes the Resampler class.

Source code in periomod/resampling/_resampler.py
def __init__(self, classification: str, encoding: str) -> None:
    """Initializes the Resampler class."""
    super().__init__(classification=classification, encoding=encoding)

cv_folds(df, seed=0, n_folds=10, sampling=None, factor=None)

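Performs cross-validation with group constraints.

The input is first reduced to its training portion via split_train_test_df (called with its default seed and test size), and the folds are then built on that portion with GroupKFold. Resampling, when requested, is applied to the training part of each fold only; the validation part is left unchanged.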

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | Input DataFrame. | required |
| seed | Optional[int] | Random seed for reproducibility. Defaults to 0. | 0 |
| n_folds | Optional[int] | Number of folds for cross-validation. Defaults to 10. | 10 |
| sampling | str | Sampling method to apply (e.g., 'upsampling', 'downsampling', 'smote'). | None |
| factor | float | Factor for resampling, applied to upsample, downsample, or SMOTE. | None |

Returns:

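| Name | Type | Description |
| --- | --- | --- |
| Tuple | Tuple[list, list] | Tuple containing the outer splits (one ((X_train, y_train), (X_val, y_val)) pair per fold) and the cross-validation fold indices (one (train_idx, test_idx) pair per fold). |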

Raises:

| Type | Description |
| --- | --- |
| ValueError | If required columns are missing or folds are inconsistent. |
| TypeError | If the input DataFrame is not a pandas DataFrame. |
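
The idea behind grouped cross-validation with training-only resampling can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `group_k_fold` assigns groups to folds round-robin (scikit-learn's `GroupKFold` balances fold sizes instead), and `upsample` is a hypothetical helper whose reading of `factor` (the minority class grows to `factor` times its size) is an assumption.

```python
import random
from collections import Counter


def group_k_fold(groups, n_folds):
    """Yield (train_idx, val_idx) pairs with each group confined to one fold."""
    unique = sorted(set(groups))
    fold_of = {g: i % n_folds for i, g in enumerate(unique)}  # round-robin assignment
    for fold in range(n_folds):
        train_idx = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        val_idx = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        yield train_idx, val_idx


def upsample(train_idx, labels, factor, seed):
    """Return train_idx plus random duplicates of minority-class rows."""
    rng = random.Random(seed)
    counts = Counter(labels[i] for i in train_idx)
    minority = min(counts, key=counts.get)
    pool = [i for i in train_idx if labels[i] == minority]
    n_extra = int(len(pool) * (factor - 1))  # assumed meaning of `factor`
    return train_idx + [rng.choice(pool) for _ in range(n_extra)]


groups = [p for p in range(6) for _ in range(4)]      # 6 groups, 4 rows each
labels = [1 if i % 4 == 0 else 0 for i in range(24)]  # one positive row per group

folds = []
for train_idx, val_idx in group_k_fold(groups, n_folds=3):
    # Only the training indices are resampled; validation rows stay untouched.
    resampled = upsample(train_idx, labels, factor=2, seed=0)
    folds.append((resampled, val_idx))
```

Because duplicates are drawn only from the training indices, no validation row and no validation group can leak into the resampled training fold.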

Source code in periomod/resampling/_resampler.py
def cv_folds(
    self,
    df: pd.DataFrame,
    seed: Optional[int] = 0,
    n_folds: Optional[int] = 10,
    sampling: Union[str, None] = None,
    factor: Union[float, None] = None,
) -> Tuple[list, list]:
    """Performs cross-validation with group constraints.

    Applies optional resampling strategies.

    Args:
        df (pd.DataFrame): Input DataFrame.
        seed (Optional[int]): Random seed for reproducibility. Defaults to 0.
        n_folds (Optional[int]): Number of folds for cross-validation.
            Defaults to 10.
        sampling (str, optional): Sampling method to apply (e.g.,
            'upsampling', 'downsampling', 'smote').
        factor (float, optional): Factor for resampling, applied to upsample,
            downsample, or SMOTE.


    Returns:
        Tuple: Tuple containing outer splits and cross-validation fold indices.

    Raises:
        ValueError: If required columns are missing or folds are inconsistent.
        TypeError: If the input DataFrame is not a pandas DataFrame.
    """
    np.random.default_rng(seed=seed)

    self.validate_dataframe(df=df, required_columns=[self.y, self.group_col])
    self.validate_n_folds(n_folds=n_folds)
    train_df, _ = self.split_train_test_df(df=df)
    gkf = GroupKFold(n_splits=n_folds)

    cv_folds_indices = []
    outer_splits = []
    original_validation_data = []

    for train_idx, test_idx in gkf.split(train_df, groups=train_df[self.group_col]):
        X_train_fold = train_df.iloc[train_idx].drop([self.y], axis=1)
        y_train_fold = train_df.iloc[train_idx][self.y]
        X_test_fold = train_df.iloc[test_idx].drop([self.y], axis=1)
        y_test_fold = train_df.iloc[test_idx][self.y]

        original_validation_data.append(
            train_df.iloc[test_idx].drop([self.y], axis=1).reset_index(drop=True)
        )

        if sampling is not None:
            X_train_fold, y_train_fold = self.apply_sampling(
                X=X_train_fold,
                y=y_train_fold,
                sampling=sampling,
                sampling_factor=factor,
                random_state=seed,
            )

        cv_folds_indices.append((train_idx, test_idx))
        outer_splits.append(
            ((X_train_fold, y_train_fold), (X_test_fold, y_test_fold))
        )

    for original_test_data, (_, (X_test_fold, _)) in zip(
        original_validation_data, outer_splits, strict=False
    ):
        if not original_test_data.equals(X_test_fold.reset_index(drop=True)):
            raise ValueError(
                "Validation folds' data not consistent after applying sampling "
                "strategies."
            )
    if self.encoding == "target":
        outer_splits_t = []

        for (X_t, y_t), (X_val, y_val) in outer_splits:
            X_t, X_val = self.apply_target_encoding(X=X_t, X_val=X_val, y=y_t)
            if sampling == "smote":
                X_t, y_t = self.apply_sampling(
                    X=X_t,
                    y=y_t,
                    sampling=sampling,
                    sampling_factor=factor,
                    random_state=seed,
                )

            outer_splits_t.append(((X_t, y_t), (X_val, y_val)))
        outer_splits = outer_splits_t

    return outer_splits, cv_folds_indices
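
As a rough guide to consuming the first return value, the sketch below walks a toy `outer_splits` list with the same nesting, one `((X_train, y_train), (X_val, y_val))` pair per fold, and scores a majority-class baseline. The data are invented for illustration and plain lists stand in for the DataFrames and Series that the method returns; `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter

# Toy stand-in for `outer_splits`: ((X_train, y_train), (X_val, y_val)) per fold.
outer_splits = [
    (([[0.1], [0.4], [0.9]], [0, 0, 1]), ([[0.2], [0.8]], [0, 1])),
    (([[0.2], [0.8], [0.9]], [0, 1, 1]), ([[0.1], [0.4]], [0, 0])),
]


def majority_baseline_accuracy(splits):
    """Predict each fold's most common training label on its validation part."""
    scores = []
    for (X_train, y_train), (X_val, y_val) in splits:
        majority = Counter(y_train).most_common(1)[0][0]
        hits = sum(1 for y in y_val if y == majority)
        scores.append(hits / len(y_val))
    return scores


scores = majority_baseline_accuracy(outer_splits)
```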

split_train_test_df(df, seed=0, test_size=0.2)

Splits the dataset into train_df and test_df based on group identifiers.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | Input DataFrame. | required |
| seed | int | Random seed for splitting. Defaults to 0. | 0 |
| test_size | Optional[float] | Proportion of groups assigned to the test set. Defaults to 0.2. | 0.2 |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| Tuple | Tuple[DataFrame, DataFrame] | Tuple containing the training and test DataFrames (train_df, test_df). |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If required columns are missing from the input DataFrame. |
| TypeError | If the input DataFrame is not a pandas DataFrame. |
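
A group-constrained split can be illustrated without any dependencies. The sketch below is an assumption-laden stand-in for the `GroupShuffleSplit` call used by the method: the column name `id_patient` and the helper `group_shuffle_split` are hypothetical, and the rounding of the test share may differ from scikit-learn's.

```python
import random


def group_shuffle_split(rows, group_key, test_size=0.2, seed=0):
    """Split rows so that every group lands entirely in train or in test."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)  # a seeded generator makes the split reproducible
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_size))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    # Same kind of guard as in the method: no group may appear on both sides.
    if {r[group_key] for r in train} & {r[group_key] for r in test}:
        raise ValueError("Overlapping group values between the train and test sets.")
    return train, test


# 10 hypothetical patients with 3 rows each.
rows = [{"id_patient": p, "tooth": t, "y": (p + t) % 2} for p in range(10) for t in range(3)]
train, test = group_shuffle_split(rows, group_key="id_patient", test_size=0.2, seed=42)
train_ids = {r["id_patient"] for r in train}
test_ids = {r["id_patient"] for r in test}
```

Splitting on group identifiers rather than on rows keeps all observations from one group on the same side, which is what the overlap check in the method verifies.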

Source code in periomod/resampling/_resampler.py
def split_train_test_df(
    self,
    df: pd.DataFrame,
    seed: int = 0,
    test_size: Optional[float] = 0.2,
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Splits the dataset into train_df and test_df based on group identifiers.

    Args:
        df (pd.DataFrame): Input DataFrame.
        seed (int): Random seed for splitting. Defaults to 0.
        test_size (Optional[float]): Proportion of groups assigned to the
            test set. Defaults to 0.2.

    Returns:
        Tuple: Tuple containing the training and test DataFrames
            (train_df, test_df).

    Raises:
        ValueError: If required columns are missing from the input DataFrame.
        TypeError: If the input DataFrame is not a pandas DataFrame.
    """
    self.validate_dataframe(df=df, required_columns=[self.y, self.group_col])

    gss = GroupShuffleSplit(
        n_splits=1,
        test_size=test_size,
        random_state=seed,
    )
    train_idx, test_idx = next(gss.split(df, groups=df[self.group_col]))

    train_df = df.iloc[train_idx].reset_index(drop=True)
    test_df = df.iloc[test_idx].reset_index(drop=True)

    train_patient_ids = set(train_df[self.group_col])
    test_patient_ids = set(test_df[self.group_col])
    if not train_patient_ids.isdisjoint(test_patient_ids):
        raise ValueError(
            "Overlapping group values between the train and test sets."
        )

    return train_df, test_df

split_x_y(train_df, test_df, sampling=None, factor=None)

Splits the train and test DataFrames into feature and label sets.

Splits into (X_train, y_train, X_test, y_test).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| train_df | DataFrame | The training DataFrame. | required |
| test_df | DataFrame | The testing DataFrame. | required |
| sampling | str | Resampling method to apply (e.g., 'upsampling', 'downsampling', 'smote'), defaults to None. | None |
| factor | float | Factor for sampling, defaults to None. | None |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| Tuple | Tuple[DataFrame, Series, DataFrame, Series] | Tuple containing feature and label sets (X_train, y_train, X_test, y_test). |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If required columns are missing or sampling method is invalid. |
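
The overall shape of this step (optionally resample the training rows, separate labels from features, drop the group column, leave the test set alone) might look as follows in plain Python. The column names `y` and `id_patient`, the helper names, and the reading of `factor` for downsampling (keep 1/factor of the majority class) are all assumptions made for the sketch, not the behaviour of `apply_sampling`.

```python
import random
from collections import Counter

TARGET, GROUP = "y", "id_patient"  # hypothetical column names


def separate(rows):
    """Return (X, y): features without target and group columns, and the labels."""
    X = [{k: v for k, v in r.items() if k not in (TARGET, GROUP)} for r in rows]
    y = [r[TARGET] for r in rows]
    return X, y


def split_x_y_sketch(train_rows, test_rows, factor=None, seed=0):
    if factor is not None:
        rng = random.Random(seed)
        majority = Counter(r[TARGET] for r in train_rows).most_common(1)[0][0]
        major = [r for r in train_rows if r[TARGET] == majority]
        minor = [r for r in train_rows if r[TARGET] != majority]
        # Assumed meaning of `factor`: keep 1/factor of the majority class.
        train_rows = minor + rng.sample(major, k=int(len(major) / factor))
    X_train, y_train = separate(train_rows)
    X_test, y_test = separate(test_rows)  # the test set is never resampled
    return X_train, y_train, X_test, y_test


# 8 training rows (6 negatives, 2 positives) and 3 test rows.
train_rows = [{"id_patient": i // 2, "depth": i, "y": int(i >= 6)} for i in range(8)]
test_rows = [{"id_patient": 9, "depth": d, "y": d % 2} for d in range(3)]
X_train, y_train, X_test, y_test = split_x_y_sketch(train_rows, test_rows, factor=2)
```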

Source code in periomod/resampling/_resampler.py
def split_x_y(
    self,
    train_df: pd.DataFrame,
    test_df: pd.DataFrame,
    sampling: Union[str, None] = None,
    factor: Union[float, None] = None,
) -> Tuple[pd.DataFrame, pd.Series, pd.DataFrame, pd.Series]:
    """Splits the train and test DataFrames into feature and label sets.

    Splits into (X_train, y_train, X_test, y_test).

    Args:
        train_df (pd.DataFrame): The training DataFrame.
        test_df (pd.DataFrame): The testing DataFrame.
        sampling (str, optional): Resampling method to apply (e.g.,
            'upsampling', 'downsampling', 'smote'), defaults to None.
        factor (float, optional): Factor for sampling, defaults to None.

    Returns:
        Tuple: Tuple containing feature and label sets
            (X_train, y_train, X_test, y_test).

    Raises:
        ValueError: If required columns are missing or sampling method is invalid.
    """
    X_train = train_df.drop([self.y], axis=1)
    y_train = train_df[self.y]
    X_test = test_df.drop([self.y], axis=1)
    y_test = test_df[self.y]

    if self.encoding == "target":
        X_train, X_test = self.apply_target_encoding(
            X=X_train, X_val=X_test, y=y_train
        )

    if sampling is not None:
        X_train, y_train = self.apply_sampling(
            X=X_train, y=y_train, sampling=sampling, sampling_factor=factor
        )

    return (
        X_train.drop([self.group_col], axis=1),
        y_train,
        X_test.drop([self.group_col], axis=1),
        y_test,
    )
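
In the code above, target encoding receives the training labels only (`y=y_train`) and is then applied to both feature sets. A minimal sketch of that fit-on-train, apply-to-both pattern is shown below. It uses a plain per-category mean with a global-mean fallback, which is an assumption: the encoder behind `apply_target_encoding` may smooth or regularize differently. The helper names and category values are hypothetical.

```python
from collections import defaultdict


def fit_mean_encoding(categories, targets):
    """Learn the mean target per category from training data only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1
    mapping = {cat: sums[cat] / counts[cat] for cat in sums}
    global_mean = sum(targets) / len(targets)  # fallback for unseen categories
    return mapping, global_mean


def encode(categories, mapping, default):
    """Replace each category with its learned mean, or the fallback."""
    return [mapping.get(cat, default) for cat in categories]


train_cat = ["molar", "molar", "incisor", "incisor", "canine"]
y_train = [1, 0, 0, 0, 1]
test_cat = ["molar", "premolar"]  # "premolar" never occurs in training

mapping, global_mean = fit_mean_encoding(train_cat, y_train)
train_encoded = encode(train_cat, mapping, global_mean)
test_encoded = encode(test_cat, mapping, global_mean)
```

Fitting on the training labels alone keeps test labels out of the features, so the test set remains a fair estimate of performance on unseen data.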