brainsig.dataset#
A module that creates the dataset object.
This module provides the Dataset class for preprocessing and preparing data for machine learning tasks, including handling missing data, feature preprocessing, and train/test splitting.
Attributes#
Classes#
Dataset | A class for preprocessing and preparing datasets for machine learning.
Module Contents#
- brainsig.dataset.logger#
- class brainsig.dataset.Dataset(df: pandas.DataFrame, target: str | list[str], missing_threshold: float = 0.5, preprocessor: sklearn.compose.ColumnTransformer | None = None, test_size: float = 0.2, random_state: int | None = None, *, verbose: bool = True)#
A class for preprocessing and preparing datasets for machine learning.
This class handles missing data, feature preprocessing (scaling and encoding), and train/test splitting. It’s designed to work with pandas DataFrames and integrates with scikit-learn pipelines.
- Parameters:
df (pd.DataFrame) – Input DataFrame containing features and target variables.
target (str or list of str) – Name(s) of the target variable column(s).
missing_threshold (float, default=0.5) – Threshold for dropping columns with missing data. Columns with a fraction of missing values greater than this threshold will be dropped.
preprocessor (sklearn.compose.ColumnTransformer or None, default=None) – Custom preprocessor for features. If None, a default preprocessor is created that standardizes numeric features and one-hot encodes categorical features.
test_size (float, default=0.2) – Proportion of the dataset to include in the test split. If 0, no train/test split is performed and all data is stored in the X and y attributes instead.
random_state (int or None, default=None) – Random seed for reproducibility of train/test split.
verbose (bool, default=True) – If True, print information about dropped columns and rows.
- original_df#
Copy of the original input DataFrame.
- Type:
pd.DataFrame
- dropped_summary#
Summary of dropped data with keys ‘all_missing_cols’, ‘high_missing_cols’, and ‘rows_dropped’.
- Type:
dict
- preprocessor#
The fitted preprocessor used for feature transformation.
- Type:
sklearn.compose.ColumnTransformer
- feature_names#
Names of features after preprocessing.
- Type:
np.ndarray
- X_train#
Training features (only if test_size > 0).
- Type:
np.ndarray
- X_test#
Test features (only if test_size > 0).
- Type:
np.ndarray
- y_train#
Training targets (only if test_size > 0).
- Type:
np.ndarray
- y_test#
Test targets (only if test_size > 0).
- Type:
np.ndarray
- X#
All features (only if test_size = 0).
- Type:
np.ndarray
- y#
All targets (only if test_size = 0).
- Type:
np.ndarray
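The attributes above follow from two steps: dropping sparse columns and incomplete rows, then optionally splitting. The sketch below shows one plausible way those steps could fit together using pandas and scikit-learn. It is an illustration under stated assumptions rather than the class's real code: the helpers clean_missing and split_arrays are hypothetical, a single string target is assumed, rows with remaining missing values are assumed to be dropped, and rows_dropped is assumed to be a count.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def clean_missing(df: pd.DataFrame, target: str, missing_threshold: float = 0.5):
    """Drop sparse feature columns, then incomplete rows, and report what was removed."""
    missing_fraction = df.drop(columns=[target]).isna().mean()
    all_missing = missing_fraction[missing_fraction == 1.0].index.tolist()
    # Columns above the threshold but not entirely empty.
    high_missing = missing_fraction[
        (missing_fraction > missing_threshold) & (missing_fraction < 1.0)
    ].index.tolist()
    kept = df.drop(columns=all_missing + high_missing)
    cleaned = kept.dropna()  # assumption: remaining incomplete rows are dropped
    summary = {
        "all_missing_cols": all_missing,
        "high_missing_cols": high_missing,
        "rows_dropped": len(kept) - len(cleaned),  # assumption: stored as a count
    }
    return cleaned, summary


def split_arrays(X, y, test_size: float = 0.2, random_state=None) -> dict:
    """Return X/y when test_size is 0, otherwise the four train/test arrays."""
    if test_size == 0:
        return {"X": X, "y": y}
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    return {"X_train": X_train, "X_test": X_test, "y_train": y_train, "y_test": y_test}


df = pd.DataFrame(
    {
        "age": [25, 30, np.nan, 40, 45],
        "empty": [np.nan] * 5,
        "sparse": [1.0, np.nan, np.nan, np.nan, np.nan],
        "outcome": ["A", "B", "A", "B", "A"],
    }
)

cleaned, summary = clean_missing(df, target="outcome", missing_threshold=0.5)
X_all = cleaned[["age"]].to_numpy()
y_all = cleaned["outcome"].to_numpy()

parts = split_arrays(X_all, y_all, test_size=0.25, random_state=42)
whole = split_arrays(X_all, y_all, test_size=0)
```

The explicit test_size == 0 branch matters in a sketch like this because scikit-learn's train_test_split rejects a float test size of 0, so the unsplit case has to bypass it.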
Examples
>>> import pandas as pd
>>> from brainsig.dataset import Dataset
>>> df = pd.DataFrame({
...     'age': [25, 30, 35, 40],
...     'income': [50000, 60000, 70000, 80000],
...     'outcome': ['A', 'B', 'A', 'B']
... })
>>> dataset = Dataset(df, target='outcome', test_size=0.25, random_state=42)
>>> print(dataset.X_train.shape)
- original_df#
- target#