brainsig.dataset
================

.. py:module:: brainsig.dataset

.. autoapi-nested-parse::

   Defines the dataset object used to prepare data for modeling.

   This module provides the Dataset class for preprocessing and preparing data
   for machine learning tasks, including handling missing data, feature preprocessing,
   and train/test splitting.


Attributes
----------

.. autoapisummary::

   brainsig.dataset.logger


Classes
-------

.. autoapisummary::

   brainsig.dataset.Dataset


Module Contents
---------------

.. py:data:: logger

.. py:class:: Dataset(df: pandas.DataFrame, target: str | list[str], missing_threshold: float = 0.5, preprocessor: sklearn.compose.ColumnTransformer | None = None, test_size: float = 0.2, random_state: int | None = None, *, verbose: bool = True)

   A class for preprocessing and preparing datasets for machine learning.

   This class handles missing data, feature preprocessing (scaling and encoding),
   and train/test splitting. It's designed to work with pandas DataFrames and
   integrates with scikit-learn pipelines.

   :param df: Input DataFrame containing features and target variables.
   :type df: pd.DataFrame
   :param target: Name(s) of the target variable column(s).
   :type target: str or list of str
   :param missing_threshold: Threshold for dropping columns with missing data. Columns with a fraction
                             of missing values greater than this threshold will be dropped.
   :type missing_threshold: float, default=0.5
   :param preprocessor: Custom preprocessor for features. If None, a default preprocessor is created
                        that standardizes numeric features and one-hot encodes categorical features.
   :type preprocessor: sklearn.compose.ColumnTransformer or None, default=None
   :param test_size: Proportion of the dataset to include in the test split. If 0, no train/test
                     split is performed and all data is stored in the ``X`` and ``y`` attributes.
   :type test_size: float, default=0.2
   :param random_state: Random seed for reproducibility of train/test split.
   :type random_state: int or None, default=None
   :param verbose: If True, print information about dropped columns and rows.
   :type verbose: bool, default=True
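
   The ``missing_threshold`` rule described above can be illustrated with plain
   pandas. This is a minimal sketch of the documented behavior, not the class's
   actual implementation; the toy DataFrame and the variable names are
   assumptions made for illustration.

   .. code-block:: python

      import pandas as pd

      # Hypothetical toy data; column names are illustrative only.
      df = pd.DataFrame({
          "age": [25, None, 35, 40],            # 1 of 4 missing (0.25)
          "income": [None, None, None, 80000],  # 3 of 4 missing (0.75)
          "outcome": ["A", "B", "A", "B"],      # nothing missing
      })
      missing_threshold = 0.5

      # Fraction of missing values in each column.
      missing_fraction = df.isna().mean()

      # Columns whose missing fraction is strictly greater than the threshold.
      high_missing_cols = missing_fraction[
          missing_fraction > missing_threshold
      ].index.tolist()

      filtered = df.drop(columns=high_missing_cols)
      print(high_missing_cols)  # ['income']

   With the default threshold of 0.5, only ``income`` is dropped here, because
   the comparison is "greater than" rather than "greater than or equal to".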

   .. attribute:: original_df

      Copy of the original input DataFrame.

      :type: pd.DataFrame

   .. attribute:: target

      Name(s) of target variable(s).

      :type: str or list of str

   .. attribute:: dropped_summary

      Summary of dropped data with keys 'all_missing_cols', 'high_missing_cols',
      and 'rows_dropped'.

      :type: dict

   .. attribute:: preprocessor

      The fitted preprocessor used for feature transformation.

      :type: sklearn.compose.ColumnTransformer

   .. attribute:: feature_names

      Names of features after preprocessing.

      :type: np.ndarray

   .. attribute:: target_labels

      Dictionary mapping target column names to their unique class labels.

      :type: dict

   .. attribute:: X_train

      Training features (only if test_size > 0).

      :type: np.ndarray

   .. attribute:: X_test

      Test features (only if test_size > 0).

      :type: np.ndarray

   .. attribute:: y_train

      Training targets (only if test_size > 0).

      :type: np.ndarray

   .. attribute:: y_test

      Test targets (only if test_size > 0).

      :type: np.ndarray

   .. attribute:: X

      All features (only if test_size = 0).

      :type: np.ndarray

   .. attribute:: y

      All targets (only if test_size = 0).

      :type: np.ndarray
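
   Which of the array attributes above exist depends on ``test_size``. The
   sketch below mirrors that rule with scikit-learn's ``train_test_split``.
   It assumes a plain, non-stratified split; ``split_or_keep`` is a
   hypothetical helper, and the class may split its data differently.

   .. code-block:: python

      import numpy as np
      from sklearn.model_selection import train_test_split


      def split_or_keep(X, y, test_size=0.2, random_state=None):
          """Hypothetical helper: return arrays keyed by attribute name."""
          if test_size == 0:
              # No split: everything is kept under ``X`` and ``y``.
              return {"X": X, "y": y}
          X_train, X_test, y_train, y_test = train_test_split(
              X, y, test_size=test_size, random_state=random_state
          )
          return {
              "X_train": X_train,
              "X_test": X_test,
              "y_train": y_train,
              "y_test": y_test,
          }


      X = np.arange(16).reshape(8, 2)
      y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

      split = split_or_keep(X, y, test_size=0.25, random_state=42)
      print(split["X_train"].shape, split["X_test"].shape)  # (6, 2) (2, 2)

      whole = split_or_keep(X, y, test_size=0)
      print(sorted(whole))  # ['X', 'y']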

   .. rubric:: Examples

   >>> import pandas as pd
   >>> from brainsig.dataset import Dataset
   >>> df = pd.DataFrame({
   ...     'age': [25, 30, 35, 40],
   ...     'income': [50000, 60000, 70000, 80000],
   ...     'outcome': ['A', 'B', 'A', 'B']
   ... })
   >>> dataset = Dataset(df, target='outcome', test_size=0.25, random_state=42)
   >>> print(dataset.X_train.shape)
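
   A custom ``preprocessor`` is an ordinary scikit-learn ``ColumnTransformer``.
   The sketch below builds one that standardizes numeric columns and one-hot
   encodes the rest, which is how the default is described. The column names,
   the step names ``num`` and ``cat``, and the ``handle_unknown`` setting are
   assumptions; the default preprocessor may be configured differently.

   .. code-block:: python

      import pandas as pd
      from sklearn.compose import ColumnTransformer, make_column_selector
      from sklearn.preprocessing import OneHotEncoder, StandardScaler

      # Hypothetical feature table (target column already removed).
      features = pd.DataFrame({
          "age": [25, 30, 35, 40],
          "site": ["x", "y", "x", "y"],
      })

      preprocessor = ColumnTransformer([
          # Standardize every numeric column.
          ("num", StandardScaler(),
           make_column_selector(dtype_include="number")),
          # One-hot encode every non-numeric column.
          ("cat", OneHotEncoder(handle_unknown="ignore"),
           make_column_selector(dtype_exclude="number")),
      ])

      transformed = preprocessor.fit_transform(features)
      print(transformed.shape)  # (4, 3): one scaled column, two one-hot columns

   An object like this can be supplied through the ``preprocessor`` argument
   in place of the default.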


   .. py:attribute:: original_df


   .. py:attribute:: target