brainsig.dataset
================

.. py:module:: brainsig.dataset

.. autoapi-nested-parse::

   A module that creates the dataset object.

   This module provides the Dataset class for preprocessing and preparing data
   for machine learning tasks, including handling missing data, feature
   preprocessing, and train/test splitting.


Attributes
----------

.. autoapisummary::

   brainsig.dataset.logger


Classes
-------

.. autoapisummary::

   brainsig.dataset.Dataset


Module Contents
---------------

.. py:data:: logger

.. py:class:: Dataset(df: pandas.DataFrame, target: str | list[str], missing_threshold: float = 0.5, preprocessor: sklearn.compose.ColumnTransformer | None = None, test_size: float = 0.2, random_state: int | None = None, *, verbose: bool = True)

   A class for preprocessing and preparing datasets for machine learning.

   This class handles missing data, feature preprocessing (scaling and
   encoding), and train/test splitting. It's designed to work with pandas
   DataFrames and integrates with scikit-learn pipelines.

   :param df: Input DataFrame containing features and target variables.
   :type df: pd.DataFrame
   :param target: Name(s) of the target variable column(s).
   :type target: str or list of str
   :param missing_threshold: Threshold for dropping columns with missing data.
                             Columns with a fraction of missing values greater
                             than this threshold will be dropped.
   :type missing_threshold: float, default=0.5
   :param preprocessor: Custom preprocessor for features. If None, a default
                        preprocessor is created that standardizes numeric
                        features and one-hot encodes categorical features.
   :type preprocessor: sklearn.compose.ColumnTransformer or None, default=None
   :param test_size: Proportion of the dataset to include in the test split.
                     If 0, no train/test split is performed and all data is
                     stored in `X` and `y` attributes.
   :type test_size: float, default=0.2
   :param random_state: Random seed for reproducibility of train/test split.
   :type random_state: int or None, default=None
   :param verbose: If True, print information about dropped columns and rows.
   :type verbose: bool, default=True

   .. attribute:: original_df

      Copy of the original input DataFrame.

      :type: pd.DataFrame

   .. attribute:: target

      Name(s) of target variable(s).

      :type: str or list of str

   .. attribute:: dropped_summary

      Summary of dropped data with keys 'all_missing_cols',
      'high_missing_cols', and 'rows_dropped'.

      :type: dict

   .. attribute:: preprocessor

      The fitted preprocessor used for feature transformation.

      :type: sklearn.compose.ColumnTransformer

   .. attribute:: feature_names

      Names of features after preprocessing.

      :type: np.ndarray

   .. attribute:: target_labels

      Dictionary mapping target column names to their unique class labels.

      :type: dict

   .. attribute:: X_train

      Training features (only if test_size > 0).

      :type: np.ndarray

   .. attribute:: X_test

      Test features (only if test_size > 0).

      :type: np.ndarray

   .. attribute:: y_train

      Training targets (only if test_size > 0).

      :type: np.ndarray

   .. attribute:: y_test

      Test targets (only if test_size > 0).

      :type: np.ndarray

   .. attribute:: X

      All features (only if test_size = 0).

      :type: np.ndarray

   .. attribute:: y

      All targets (only if test_size = 0).

      :type: np.ndarray

   .. rubric:: Examples

   >>> import pandas as pd
   >>> from brainsig.dataset import Dataset
   >>> df = pd.DataFrame({
   ...     'age': [25, 30, 35, 40],
   ...     'income': [50000, 60000, 70000, 80000],
   ...     'outcome': ['A', 'B', 'A', 'B']
   ... })
   >>> dataset = Dataset(df, target='outcome', test_size=0.25, random_state=42)
   >>> print(dataset.X_train.shape)


   .. py:attribute:: original_df


   .. py:attribute:: target
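
The steps documented for ``Dataset`` (dropping high-missing columns, scaling
and encoding, and train/test splitting) can be approximated with plain pandas
and scikit-learn. The sketch below is for illustration only and is not
brainsig's implementation. The helper name ``sketch_prepare``, the step that
drops rows with remaining missing values, and the choice to fit the
preprocessor on the training split only are all assumptions.

.. code-block:: python

   import numpy as np
   import pandas as pd
   from sklearn.compose import ColumnTransformer
   from sklearn.model_selection import train_test_split
   from sklearn.preprocessing import OneHotEncoder, StandardScaler


   def sketch_prepare(df, target, missing_threshold=0.5, test_size=0.2,
                      random_state=None):
       """Hypothetical helper mirroring the documented steps (not brainsig code)."""
       # Fraction of missing values in each column.
       missing_frac = df.isna().mean()
       # Drop feature columns whose missing fraction exceeds the threshold.
       high_missing = [c for c in df.columns
                       if c != target and missing_frac[c] > missing_threshold]
       kept = df.drop(columns=high_missing)
       # Assumption: rows that still contain missing values are dropped.
       kept = kept.dropna()

       X = kept.drop(columns=[target])
       y = kept[target].to_numpy()

       numeric_cols = X.select_dtypes(include="number").columns.tolist()
       categorical_cols = [c for c in X.columns if c not in numeric_cols]

       # Standardize numeric features and one-hot encode categorical ones.
       # sparse_threshold=0.0 forces a dense NumPy array as output.
       preprocessor = ColumnTransformer(
           [
               ("num", StandardScaler(), numeric_cols),
               ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
           ],
           sparse_threshold=0.0,
       )

       X_tr, X_te, y_tr, y_te = train_test_split(
           X, y, test_size=test_size, random_state=random_state
       )
       # Assumption: fit on the training split only, then transform both splits.
       X_tr = preprocessor.fit_transform(X_tr)
       X_te = preprocessor.transform(X_te)
       return X_tr, X_te, y_tr, y_te, high_missing


   df = pd.DataFrame({
       "age": [25, 30, 35, 40, 45, 50, 55, 60],
       "income": [50, 60, 70, 80, 90, 100, 110, 120],
       "mostly_missing": [1.0, np.nan, np.nan, np.nan, np.nan, np.nan, 2.0, np.nan],
       "site": ["a", "b", "a", "b", "a", "b", "a", "b"],
       "outcome": ["A", "B", "A", "B", "A", "B", "A", "B"],
   })
   X_train, X_test, y_train, y_test, dropped = sketch_prepare(
       df, target="outcome", test_size=0.25, random_state=42
   )
   print(dropped, X_train.shape, X_test.shape)

Here ``mostly_missing`` has 75% of its values missing, which is greater than
the default threshold of 0.5, so it is removed before the remaining features
are scaled and encoded.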