brainsig.dataset#

Module that defines the Dataset class.

This module provides the Dataset class for preprocessing and preparing data for machine learning tasks, including handling missing data, feature preprocessing, and train/test splitting.

Attributes#

logger

Classes#

Dataset

A class for preprocessing and preparing datasets for machine learning.

Module Contents#

brainsig.dataset.logger#
class brainsig.dataset.Dataset(df: pandas.DataFrame, target: str | list[str], missing_threshold: float = 0.5, preprocessor: sklearn.compose.ColumnTransformer | None = None, test_size: float = 0.2, random_state: int | None = None, *, verbose: bool = True)#

A class for preprocessing and preparing datasets for machine learning.

This class handles missing data, feature preprocessing (scaling and encoding), and train/test splitting. It’s designed to work with pandas DataFrames and integrates with scikit-learn pipelines.

Parameters:
  • df (pd.DataFrame) – Input DataFrame containing features and target variables.

  • target (str or list of str) – Name(s) of the target variable column(s).

  • missing_threshold (float, default=0.5) – Threshold for dropping columns with missing data. Columns with a fraction of missing values greater than this threshold will be dropped.

  • preprocessor (sklearn.compose.ColumnTransformer or None, default=None) – Custom preprocessor for features. If None, a default preprocessor is created that standardizes numeric features and one-hot encodes categorical features.

  • test_size (float, default=0.2) – Proportion of the dataset to include in the test split. If 0, no train/test split is performed and all data is stored in the X and y attributes instead of X_train, X_test, y_train, and y_test.

  • random_state (int or None, default=None) – Random seed for reproducibility of train/test split.

  • verbose (bool, default=True) – If True, print information about dropped columns and rows.

original_df#

Copy of the original input DataFrame.

Type:

pd.DataFrame

target#

Name(s) of target variable(s).

Type:

str or list of str

dropped_summary#

Summary of dropped data with keys ‘all_missing_cols’, ‘high_missing_cols’, and ‘rows_dropped’.

Type:

dict
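
The missing-data handling described above can be pictured with a small pandas sketch. This is not the library's implementation: `drop_missing` is a hypothetical helper, and both the row-dropping step and the use of an integer count for 'rows_dropped' are assumptions made for illustration. Only the three dictionary keys and the "greater than missing_threshold" rule come from the documentation.

```python
import pandas as pd


def drop_missing(df: pd.DataFrame, target: str, missing_threshold: float = 0.5):
    """Hypothetical helper: drop sparse columns, then incomplete rows."""
    missing_frac = df.isna().mean()  # fraction of missing values per column

    # Columns with no observed values at all.
    all_missing = missing_frac[missing_frac == 1.0].index.tolist()
    # Columns above the threshold but not entirely missing (target is kept).
    high_missing = [
        col for col in df.columns
        if missing_threshold < missing_frac[col] < 1.0 and col != target
    ]

    kept = df.drop(columns=all_missing + high_missing)
    cleaned = kept.dropna()  # assumption: remaining incomplete rows are dropped

    summary = {
        "all_missing_cols": all_missing,
        "high_missing_cols": high_missing,
        "rows_dropped": len(kept) - len(cleaned),  # assumption: stored as a count
    }
    return cleaned, summary


df = pd.DataFrame({
    "age": [25, None, 35, 40],           # 25% missing: column kept
    "notes": [None, None, None, None],   # 100% missing: dropped
    "score": [None, None, None, 1.0],    # 75% missing: dropped at threshold 0.5
    "outcome": ["A", "B", "A", "B"],
})
cleaned, summary = drop_missing(df, target="outcome")
print(summary)
```

Here 'notes' falls under 'all_missing_cols', 'score' under 'high_missing_cols', and one row is removed because 'age' is missing in it.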

preprocessor#

The fitted preprocessor used for feature transformation.

Type:

sklearn.compose.ColumnTransformer

feature_names#

Names of features after preprocessing.

Type:

np.ndarray

target_labels#

Dictionary mapping target column names to their unique class labels.

Type:

dict
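
A mapping of this shape can be sketched with plain pandas. The sorting of labels and the handling of a single string target are assumptions here, not documented behaviour, and the column names are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "outcome": ["A", "B", "A", "B"],
    "severity": ["low", "high", "mid", "low"],
})

target = ["outcome", "severity"]
# Accept either a single column name or a list of names.
target_cols = [target] if isinstance(target, str) else list(target)

# Assumption: labels are the sorted unique non-missing values of each target column.
target_labels = {col: sorted(df[col].dropna().unique()) for col in target_cols}
print(target_labels)
```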

X_train#

Training features (only if test_size > 0).

Type:

np.ndarray

X_test#

Test features (only if test_size > 0).

Type:

np.ndarray

y_train#

Training targets (only if test_size > 0).

Type:

np.ndarray

y_test#

Test targets (only if test_size > 0).

Type:

np.ndarray

X#

All features (only if test_size = 0).

Type:

np.ndarray

y#

All targets (only if test_size = 0).

Type:

np.ndarray
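
The relationship between test_size and the attributes above can be sketched with scikit-learn's train_test_split. `split_or_keep` is a hypothetical helper that returns a dictionary keyed by the documented attribute names; whether Dataset shuffles or stratifies in the same way is not stated in this reference, so the plain call below is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_or_keep(X, y, test_size=0.2, random_state=None):
    """Hypothetical helper mirroring the documented attribute layout."""
    if test_size == 0:
        # No split: everything is exposed as X and y.
        return {"X": X, "y": y}
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    return {"X_train": X_train, "X_test": X_test, "y_train": y_train, "y_test": y_test}


X = np.arange(8, dtype=float).reshape(4, 2)  # 4 samples, 2 features
y = np.array(["A", "B", "A", "B"])

parts = split_or_keep(X, y, test_size=0.25, random_state=42)
whole = split_or_keep(X, y, test_size=0)

print({name: arr.shape for name, arr in parts.items()})
print(sorted(whole))
```

With four samples and test_size=0.25, one sample goes to the test arrays and three to the training arrays; with test_size=0, only X and y are present.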

Examples

>>> import pandas as pd
>>> from brainsig.dataset import Dataset
>>> df = pd.DataFrame({
...     'age': [25, 30, 35, 40],
...     'income': [50000, 60000, 70000, 80000],
...     'outcome': ['A', 'B', 'A', 'B']
... })
>>> dataset = Dataset(df, target='outcome', test_size=0.25, random_state=42)
>>> print(dataset.X_train.shape)
original_df#
target#