brainsig.dataset#

A module that creates the dataset object.

This module provides the Dataset class for preprocessing and preparing data for machine learning tasks, including handling missing data, feature preprocessing, and train/test splitting.

Attributes#

logger

Classes#

Dataset

A class for preprocessing and preparing datasets for machine learning.

Module Contents#

brainsig.dataset.logger#

class brainsig.dataset.Dataset(df: pandas.DataFrame, target: str | list[str], missing_threshold: float = 0.5, preprocessor: sklearn.compose.ColumnTransformer | None = None, test_size: float = 0.2, random_state: int | None = None, *, verbose: bool = True)#

A class for preprocessing and preparing datasets for machine learning.

This class handles missing data, feature preprocessing (scaling and encoding), and train/test splitting. It’s designed to work with pandas DataFrames and integrates with scikit-learn pipelines.

Parameters:

df (pd.DataFrame) – Input DataFrame containing features and target variables.
target (str or list of str) – Name(s) of the target variable column(s).
missing_threshold (float, default=0.5) – Threshold for dropping columns with missing data. Columns with a fraction of missing values greater than this threshold will be dropped.
preprocessor (sklearn.compose.ColumnTransformer or None, default=None) – Custom preprocessor for features. If None, a default preprocessor is created that standardizes numeric features and one-hot encodes categorical features.
test_size (float, default=0.2) – Proportion of the dataset to include in the test split. If 0, no train/test split is performed and all data is stored in the X and y attributes instead of X_train, X_test, y_train, and y_test.
random_state (int or None, default=None) – Random seed for reproducibility of train/test split.
verbose (bool, default=True) – If True, print information about dropped columns and rows.
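The missing_threshold rule described above can be sketched with plain pandas. This is a minimal illustration, not the actual Dataset implementation: the helper name drop_sparse_columns is hypothetical, and only the "fraction of missing values greater than the threshold" rule is taken from the parameter description.

```python
import pandas as pd


def drop_sparse_columns(
    df: pd.DataFrame, missing_threshold: float = 0.5
) -> tuple[pd.DataFrame, list[str]]:
    """Hypothetical helper: drop columns whose missing fraction exceeds the threshold."""
    # Per-column fraction of missing values, between 0.0 and 1.0.
    missing_fraction = df.isna().mean()
    # Strictly greater than the threshold, as the parameter description states.
    to_drop = missing_fraction[missing_fraction > missing_threshold].index.tolist()
    return df.drop(columns=to_drop), to_drop


df = pd.DataFrame({
    "age": [25, None, 35, 40],             # 25% missing: kept
    "score": [1.0, None, 3.0, None],       # 50% missing: kept (not greater than 0.5)
    "income": [None, None, None, 80000],   # 75% missing: dropped
    "outcome": ["A", "B", "A", "B"],
})
cleaned, dropped = drop_sparse_columns(df, missing_threshold=0.5)
```

Note that a column sitting exactly at the threshold is kept under this reading, because the comparison is strict.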

original_df#

Copy of the original input DataFrame.

Type: pd.DataFrame

target#

Name(s) of target variable(s).

Type: str or list of str

dropped_summary#

Summary of dropped data with keys ‘all_missing_cols’, ‘high_missing_cols’, and ‘rows_dropped’.

Type: dict

preprocessor#

The fitted preprocessor used for feature transformation.

Type: sklearn.compose.ColumnTransformer

feature_names#

Names of features after preprocessing.

Type: np.ndarray

target_labels#

Dictionary mapping target column names to their unique class labels.

Type: dict
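A mapping of this shape can be illustrated with a dictionary comprehension. The sketch below assumes the labels are stored as sorted Python lists and that a single target name is normalized to a list first; the real attribute may use arrays or a different ordering.

```python
import pandas as pd

df = pd.DataFrame({
    "outcome": ["A", "B", "A", "B"],
    "sex": ["F", "M", "M", "F"],
})

target = ["outcome", "sex"]  # a single string such as "outcome" is also accepted
# Normalize str | list[str] to a list so both forms are handled the same way.
targets = [target] if isinstance(target, str) else list(target)

# One entry per target column: column name -> unique class labels.
target_labels = {
    col: sorted(df[col].dropna().unique().tolist()) for col in targets
}
```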

X_train#

Training features (only if test_size > 0).

Type: np.ndarray

X_test#

Test features (only if test_size > 0).

Type: np.ndarray

y_train#

Training targets (only if test_size > 0).

Type: np.ndarray

y_test#

Test targets (only if test_size > 0).

Type: np.ndarray

X#

All features (only if test_size = 0).

Type: np.ndarray

y#

All targets (only if test_size = 0).

Type: np.ndarray
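The relationship between test_size and the attributes above (X_train, X_test, y_train, y_test versus X, y) can be sketched with scikit-learn's train_test_split. The helper split_or_keep is hypothetical and returns a dictionary purely for illustration; Dataset stores these values as attributes, and its internal splitting logic may differ.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_or_keep(X, y, test_size=0.2, random_state=None):
    """Hypothetical helper mirroring the documented attribute layout."""
    if test_size == 0:
        # No split: everything is kept under X and y.
        return {"X": X, "y": y}
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    return {
        "X_train": X_train,
        "X_test": X_test,
        "y_train": y_train,
        "y_test": y_test,
    }


X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)

parts = split_or_keep(X, y, test_size=0.2, random_state=42)
whole = split_or_keep(X, y, test_size=0)
```

With 10 samples and test_size=0.2, the split holds 8 training rows and 2 test rows; fixing random_state makes the partition reproducible across runs.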

Examples

>>> import pandas as pd
>>> from brainsig.dataset import Dataset
>>> df = pd.DataFrame({
...     'age': [25, 30, 35, 40],
...     'income': [50000, 60000, 70000, 80000],
...     'outcome': ['A', 'B', 'A', 'B']
... })
>>> dataset = Dataset(df, target='outcome', test_size=0.25, random_state=42)
>>> print(dataset.X_train.shape)

original_df#

target#