leaspy.io.data

Submodules

Attributes

DataframeDataReaderFactoryInput

Classes

`AbstractDataframeDataReader`	Methods to convert `pandas.DataFrame` to Leaspy-compliant data containers.
`Data`	Main data container for a collection of individuals
`Dataset`	Data container based on `torch.Tensor`, used to run algorithms.
`EventDataframeDataReader`	Methods to convert `pandas.DataFrame` to Leaspy-compliant data containers for event data only.
`DataframeDataReaderNames`	Enumeration defining the possible names for dataframe data readers.
`IndividualData`	Container for an individual's data
`JointDataframeDataReader`	Methods to convert `pandas.DataFrame` to Leaspy-compliant data containers for event data and longitudinal data.
`VisitDataframeDataReader`	Methods to convert `pandas.DataFrame` to Leaspy-compliant data containers for longitudinal data only.

Functions

dataframe_data_reader_factory(reader, **kwargs)

Factory for dataframe data readers.

Package Contents

class AbstractDataframeDataReader

Methods to convert pandas.DataFrame to Leaspy-compliant data containers.

Raises:

LeaspyDataInputError

time_rounding_digits = 6

individuals: dict[leaspy.utils.typing.IDType, IndividualData]

iter_to_idx: dict[int, leaspy.utils.typing.IDType]

n_individuals: int = 0

read(df, *, drop_full_nan=True, sort_index=False, warn_empty_column=True)

The method that actually reads the input dataframe (automatically called in __init__).

Parameters:

df (pandas.DataFrame): The dataframe to read.
drop_full_nan (bool): Whether to drop rows that are full of NaNs (the index is not considered).
sort_index (bool): Whether to lexsort the index. (Kept False by default so as not to break many of the downstream tests that check order…)
warn_empty_column (bool): Whether to warn when there are empty columns.

Parameters:

df (DataFrame)
drop_full_nan (bool)
sort_index (bool)
warn_empty_column (bool)

Return type:

None
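As a rough illustration of what the drop_full_nan and sort_index options mean, the sketch below applies the same two ideas to a plain list of rows. It uses neither Leaspy nor pandas: the helper name preprocess_rows and the row layout are hypothetical, and the real reader may behave differently in its details.

```python
import math


def preprocess_rows(rows, *, drop_full_nan=True, sort_index=False):
    """Hypothetical helper. rows is a list of ((id, time), [feature values]) pairs."""
    out = list(rows)
    if drop_full_nan:
        # Drop rows whose feature values are all NaN (the index is ignored).
        out = [(idx, vals) for idx, vals in out
               if not all(math.isnan(v) for v in vals)]
    if sort_index:
        # Lexicographic sort on the (id, time) index.
        out = sorted(out, key=lambda row: row[0])
    return out


rows = [
    (("s2", 71.0), [0.4, math.nan]),
    (("s1", 65.5), [math.nan, math.nan]),  # full-NaN row
    (("s1", 64.0), [0.1, 0.2]),
]
kept = preprocess_rows(rows)                         # input order preserved
sorted_rows = preprocess_rows(rows, sort_index=True)  # ordered by (id, time)
```

With the defaults, the input order is preserved, which is the reason given above for keeping sort_index at False.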

class Data

Bases: collections.abc.Iterable

Main data container for a collection of individuals

It can be iterated over and sliced, both of these operations being applied to the underlying individuals attribute.

Attributes:

individuals (Dict[IDType, IndividualData]): Included individuals and their associated data
iter_to_idx (Dict[int, IDType]): Maps an integer index to the associated individual ID
headers (List[FeatureType]): Feature names
dimension (int): Number of features
n_individuals (int): Number of individuals
n_visits (int): Total number of visits
cofactors (List[FeatureType]): Feature names corresponding to cofactors
event_time_name (str): Name of the header that stores the time at event in the original dataframe
event_bool_name (str): Name of the header that stores the boolean at event (censored or observed) in the original dataframe
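The relationship between individuals, iter_to_idx, iteration and slicing can be pictured with a small stand-in class. This is a sketch under stated assumptions, not the Leaspy implementation: MiniData, its add method and the dictionary records are hypothetical, and slicing returns a plain list here for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class MiniData:
    """Stand-in for the container described above (not the Leaspy class)."""
    individuals: dict = field(default_factory=dict)  # ID -> per-individual record
    iter_to_idx: dict = field(default_factory=dict)  # position -> ID

    def add(self, idx, record):
        # Register the position of the new individual, then store its record.
        self.iter_to_idx[len(self.individuals)] = idx
        self.individuals[idx] = record

    @property
    def n_individuals(self):
        return len(self.individuals)

    @property
    def n_visits(self):
        return sum(len(rec["timepoints"]) for rec in self.individuals.values())

    def __iter__(self):
        return iter(self.individuals.values())

    def __getitem__(self, key):
        # Integer -> one record; slice -> list of records (a simplification).
        if isinstance(key, slice):
            positions = range(self.n_individuals)[key]
            return [self.individuals[self.iter_to_idx[p]] for p in positions]
        return self.individuals[self.iter_to_idx[key]]


data = MiniData()
data.add("s1", {"timepoints": [64.0, 65.5]})
data.add("s2", {"timepoints": [71.0]})
second = data[1]       # record of "s2", looked up through iter_to_idx
first_only = data[0:1]  # list holding the record of "s1"
```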

individuals: dict[leaspy.utils.typing.IDType, IndividualData]

iter_to_idx: dict[int, leaspy.utils.typing.IDType]

headers: list[leaspy.utils.typing.FeatureType] | None = None

event_time_name: str | None = None

event_bool_name: str | None = None

property dimension: int | None

Number of features

Returns:

int or None: Number of features in the dataset. If no features are present, returns None.

Return type:

Optional[int]

property n_individuals: int

Number of individuals

Returns:

int: Number of individuals in the dataset.

Return type:

int

property n_visits: int

Total number of visits

Returns:

int: Total number of visits in the dataset.

Return type:

int

property cofactors: list[leaspy.utils.typing.FeatureType]

Feature names corresponding to cofactors

Returns:

List[FeatureType]: List of feature names corresponding to cofactors.

Return type:

list[leaspy.utils.typing.FeatureType]

load_cofactors(df, *, cofactors=None)

Load cofactors from a pandas.DataFrame to the Data object

Parameters:

df (pandas.DataFrame): The dataframe where the cofactors are stored. Its index should be ID, the identifier of subjects, and it should uniquely index the dataframe (i.e. one row per individual).
cofactors (List[FeatureType], optional): Names of the column(s) of the dataframe which shall be loaded as cofactors. If None, all the columns from the input dataframe will be loaded as cofactors. Default: None

Parameters:

df (DataFrame)
cofactors (Optional[list[leaspy.utils.typing.FeatureType]])

Return type:

None

static from_csv_file(path, data_type='visit', *, pd_read_csv_kws={}, facto_kws={}, **df_reader_kws)

Create a Data object from a CSV file.

Parameters:

path (str): Path to the CSV file to load (with extension)
data_type (str): Type of data to read. Can be ‘visit’ or ‘event’.
pd_read_csv_kws (dict): Keyword arguments that are sent to pandas.read_csv()
facto_kws (dict): Keyword arguments that are sent to dataframe_data_reader_factory()
**df_reader_kws: Keyword arguments that are sent to the AbstractDataframeDataReader created by dataframe_data_reader_factory()

Returns:

Data: A Data object containing the data from the CSV file.

Parameters:

path (str)
data_type (str)
pd_read_csv_kws (dict)
facto_kws (dict)

Return type:

Data

to_dataframe(*, cofactors=None, reset_index=True)

Convert the Data object to a pandas.DataFrame

Parameters:

cofactors (List[FeatureType] or str, optional): Cofactors to include in the DataFrame. If None (default), no cofactors are included. If “all”, all the available cofactors are included. Default: None
reset_index (bool, optional): Whether to reset index levels in output. Default: True

Returns:

pandas.DataFrame: A DataFrame containing the individuals’ ID, timepoints and associated observations (and, optionally, cofactors).

Raises:

LeaspyDataInputError: If the Data object does not contain any cofactors.
LeaspyTypeError: If the cofactors argument is not of a valid type.

Parameters:

cofactors (Optional[Union[list[leaspy.utils.typing.FeatureType], str]])
reset_index (bool)

Return type:

DataFrame
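The three accepted forms of the cofactors argument (None, “all”, or a list of names) can be sketched as a small resolution step. This is only an illustration: resolve_cofactors is a hypothetical helper, and the built-in ValueError and TypeError stand in for LeaspyDataInputError and LeaspyTypeError.

```python
def resolve_cofactors(requested, available):
    """Hypothetical helper mirroring the documented `cofactors` argument."""
    if requested is None:
        # Default: no cofactors are included.
        return []
    if not (requested == "all" or isinstance(requested, list)):
        # Stand-in for LeaspyTypeError.
        raise TypeError("cofactors must be None, 'all' or a list of names")
    if not available:
        # Stand-in for LeaspyDataInputError: nothing has been loaded.
        raise ValueError("the data does not contain any cofactors")
    if requested == "all":
        return list(available)
    return list(requested)


available = ["sex", "apoe"]
none_selected = resolve_cofactors(None, available)
all_selected = resolve_cofactors("all", available)
some_selected = resolve_cofactors(["apoe"], available)
```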

static from_dataframe(df, data_type='visit', factory_kws={}, **kws)

Create a Data object from a DataFrame.

Parameters:

df (pandas.DataFrame): Dataframe containing ID, TIME and features.
data_type (str): Type of data to read. Can be ‘visit’, ‘event’, ‘joint’
factory_kws (Dict): Keyword arguments that are sent to dataframe_data_reader_factory()
**kws: Keyword arguments that are sent to DataframeDataReader

Returns:

Data

Parameters:

df (DataFrame)
data_type (str)
factory_kws (dict)

Return type:

Data

static from_individual_values(indices, timepoints=None, values=None, headers=None, event_time_name=None, event_bool_name=None, event_time=None, event_bool=None)

Construct Data from a collection of individual data points

Parameters:

indices (List[IDType]): List of the individuals’ unique ID
timepoints (List[List[float]]): For each individual i, list of timepoints associated with the observations. The number of such timepoints is noted n_timepoints_i
values (List[array-like[float, 2D]]): For each individual i, two-dimensional array-like object containing observed data points. Its expected shape is (n_timepoints_i, n_features)
headers (List[FeatureType]): Feature names. The number of features is noted n_features

Returns:

Data: A Data object containing the individuals and their data.

Parameters:

indices (list[leaspy.utils.typing.IDType])
timepoints (Optional[list[list[float]]])
values (Optional[list[list[list[float]]]])
headers (Optional[list[leaspy.utils.typing.FeatureType]])
event_time_name (Optional[str])
event_bool_name (Optional[str])
event_time (Optional[list[list[float]]])
event_bool (Optional[list[list[int]]])

Return type:

Data
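The layout expected for indices, timepoints, values and headers can be checked with a few lines of plain Python. The validator below is a hypothetical helper written for this illustration; it only encodes the shape rules stated above and is not part of Leaspy.

```python
def check_individual_values(indices, timepoints, values, headers):
    """Hypothetical validator for the layout described above."""
    n_features = len(headers)
    if not (len(indices) == len(timepoints) == len(values)):
        raise ValueError("one entry per individual is expected in every list")
    for idx, times, obs in zip(indices, timepoints, values):
        # Each individual needs one row of observations per timepoint ...
        if len(obs) != len(times):
            raise ValueError(f"{idx}: expected {len(times)} rows, got {len(obs)}")
        # ... and each row needs one value per feature.
        for row in obs:
            if len(row) != n_features:
                raise ValueError(f"{idx}: expected {n_features} values per row")
    # Return (ID, n_timepoints_i, n_features) for each individual.
    return [(idx, len(times), n_features) for idx, times in zip(indices, timepoints)]


shapes = check_individual_values(
    indices=["s1", "s2"],
    timepoints=[[64.0, 65.5], [71.0]],
    values=[[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6]]],
    headers=["memory", "attention"],
)
```

Here the first individual has shape (2, 2) and the second (1, 2), so both satisfy (n_timepoints_i, n_features).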

static from_individuals(individuals, headers=None, event_time_name=None, event_bool_name=None)

Construct Data from a list of individuals

Parameters:

individuals (List[IndividualData]): List of individuals
headers (List[FeatureType]): List of feature names

Returns:

Data: A Data object containing the individuals and their data.

Parameters:

individuals (list[IndividualData])
headers (Optional[list[leaspy.utils.typing.FeatureType]])
event_time_name (Optional[str])
event_bool_name (Optional[str])

Return type:

Data

extract_longitudinal_only()

Extract longitudinal data from the Data object

Returns:

Data: A Data object containing only longitudinal data.

Raises:

LeaspyDataInputError: If the Data object does not contain any longitudinal data.

Return type:

Data

class Dataset(data, *, no_warning=False)

Data container based on torch.Tensor, used to run algorithms.

Parameters:

data (Data): The Data object from which the Dataset is created
no_warning (bool, default False): Whether to deactivate warnings that are emitted by methods of this dataset instance. We may want to deactivate them because we rebuild a dataset per individual in scipy minimize. Indeed, all relevant warnings certainly occurred for the overall dataset.

Attributes:

headers (list[str]): Feature names
dimension (int): Number of features
n_individuals (int): Number of individuals
indices (list[ID]): Order of patients
event_time (torch.FloatTensor): Time of an event. If the event is censored, the time corresponds to the last patient observation
event_bool (torch.BoolTensor): Boolean to indicate if an event is censored or not: 1 observed, 0 censored
n_visits_per_individual (list[int]): Number of visits per individual
n_visits_max (int): Maximum number of visits for one individual
n_visits (int): Total number of visits
n_observations_per_ind_per_ft (torch.LongTensor, shape (n_individuals, dimension)): Number of observations (not taking into account missing values) per individual per feature
n_observations_per_ft (torch.LongTensor, shape (dimension,)): Total number of observations per feature
n_observations (int): Total number of observations
timepoints (torch.FloatTensor, shape (n_individuals, n_visits_max)): Ages of patients at their different visits
values (torch.FloatTensor, shape (n_individuals, n_visits_max, dimension)): Values of patients for each visit for each feature
mask (torch.FloatTensor, shape (n_individuals, n_visits_max, dimension)): Binary mask associated to values. If 1, the value is meaningful. If 0, the value is meaningless (it either was nan or does not correspond to a real visit and is only here for padding)
L2_norm_per_ft (torch.FloatTensor, shape (dimension,)): Sum of all non-nan squared values, feature per feature
L2_norm (scalar torch.FloatTensor): Sum of all non-nan squared values
no_warning (bool, default False): Whether to deactivate warnings that are emitted by methods of this dataset instance. We may want to deactivate them because we rebuild a dataset per individual in scipy minimize. Indeed, all relevant warnings certainly occurred for the overall dataset.
_one_hot_encoding (Dict[sf: bool, torch.LongTensor]): Values of patients for each visit for each feature, but tensorized into a one-hot encoding (pdf or sf). Shapes of tensors are (n_individuals, n_visits_max, dimension, max_ordinal_level [-1 when sf=True])
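How values, mask, n_observations_per_ft and L2_norm_per_ft relate to each other can be illustrated with nested lists instead of tensors. This is a minimal sketch, not the Leaspy code: pad_and_mask is a hypothetical helper, and storing 0.0 in place of NaN and padding is an assumption of the sketch.

```python
import math


def pad_and_mask(values_per_individual, dimension):
    """Hypothetical helper. Input: one (n_visits_i x dimension) nested list per individual."""
    n_visits_max = max(len(visits) for visits in values_per_individual)
    values, mask = [], []
    for visits in values_per_individual:
        ind_values, ind_mask = [], []
        for k in range(n_visits_max):
            # Visits beyond the individual's last real visit are padding.
            row = visits[k] if k < len(visits) else [math.nan] * dimension
            # 1.0 where a real, non-NaN observation exists; 0.0 otherwise.
            ind_mask.append([0.0 if math.isnan(x) else 1.0 for x in row])
            # Assumption of this sketch: NaN and padding are stored as 0.0.
            ind_values.append([0.0 if math.isnan(x) else x for x in row])
        values.append(ind_values)
        mask.append(ind_mask)
    return values, mask


raw = [
    [[0.5, math.nan], [1.0, 2.0]],  # individual 0: two visits, one missing value
    [[2.0, 1.0]],                   # individual 1: one visit, padded to two
]
values, mask = pad_and_mask(raw, dimension=2)

# Number of meaningful observations per feature (sum of the mask).
n_observations_per_ft = [
    int(sum(mask[i][k][f] for i in range(len(mask)) for k in range(len(mask[i]))))
    for f in range(2)
]
n_observations = sum(n_observations_per_ft)

# Sum of squared meaningful values per feature.
L2_norm_per_ft = [
    sum(values[i][k][f] ** 2 for i in range(len(values)) for k in range(len(values[i])))
    for f in range(2)
]
```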

Raises:

LeaspyInputError: If data, model or algo are not compatible with each other.

Parameters:

data (Data)
no_warning (bool)

n_individuals

indices

headers: list[leaspy.utils.typing.FeatureType]

dimension: int

n_visits: int

timepoints: torch.FloatTensor | None = None

values: torch.FloatTensor | None = None

mask: torch.FloatTensor | None = None

n_observations: int | None = None

n_observations_per_ft: torch.LongTensor | None = None

n_observations_per_ind_per_ft: torch.LongTensor | None = None

n_visits_per_individual: list[int] | None = None

n_visits_max: int | None = None

event_time_name: str | None

event_bool_name: str | None

event_time: torch.FloatTensor | None = None

event_bool: torch.IntTensor | None = None

L2_norm_per_ft: torch.FloatTensor | None = None

L2_norm: torch.FloatTensor | None = None

no_warning = False

get_times_patient(i)

Get ages for patient number i

Parameters:

i (int): The index of the patient (warning: not its identifier)

Returns:

torch.Tensor, shape (n_obs_of_patient,): Contains float

Parameters:

i (int)

Return type:

torch.FloatTensor
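The difference between a patient's index and its identifier can be shown on a toy, list-based version of the attributes above. Everything here is a stand-in (plain lists instead of tensors, a module-level function instead of a method), and the 0.0 used for padding is an assumption of the sketch.

```python
# Toy stand-ins for the Dataset attributes (lists instead of tensors).
indices = ["s1", "s2"]                    # order of patients (identifiers)
n_visits_per_individual = [2, 1]
timepoints = [[64.0, 65.5], [71.0, 0.0]]  # padded to n_visits_max = 2


def get_times_patient(i):
    """Return the real (unpadded) timepoints of the patient at position i."""
    return timepoints[i][: n_visits_per_individual[i]]


# The argument is a position, so an identifier must be converted first.
position = indices.index("s2")
times_s2 = get_times_patient(position)
```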

get_event_patient(idx_patient)

Get ages at event for patient number idx_patient

Parameters:

idx_patient (int): The index of the patient (warning: not its identifier)

Returns:

torch.Tensor, shape (n_obs_of_patient,): Contains float

Parameters:

idx_patient (int)

Return type:

tuple[Tensor, Tensor]

get_values_patient(i, *, adapt_for_model=None)

Get values for patient number i, with nans.

Parameters:

i (int): The index of the patient (warning: not its identifier)

adapt_for_model (None (default) or AbstractModel): The values returned are suited for this model. In particular:

For a model with noise_model='ordinal', one-hot-encoded values are returned [P(X = l), l=0..ordinal_max_level]

For a model with noise_model='ordinal_ranking', survival function values are returned [P(X > l), l=0..ordinal_max_level-1]

If None, the raw values are returned, whatever the model is.

Returns:

torch.Tensor, shape (n_obs_of_patient, dimension [, extra_dimension_for_ordinal_models]): Contains float or nans

Parameters:

i (int)

Return type:

torch.FloatTensor

to_pandas(apply_headers=False)

Convert dataset to a DataFrame with [‘ID’, ‘TIME’] index, with all covariates, events and repeated measures if apply_headers is False, and only the repeated measures otherwise.

Parameters:

apply_headers (bool): Whether to select only the columns that are needed for leaspy fit (headers attribute)

Returns:

pandas.DataFrame

Parameters:

apply_headers (bool)

Return type:

DataFrame

move_to_device(device)

Moves the dataset to the specified device.

Parameters:

device (torch.device): The device to move the dataset to

Parameters:

device (device)

Return type:

None

get_one_hot_encoding(*, sf, ordinal_infos)

Builds the one-hot encoding of ordinal data once and for all and returns it.

Parameters:

sf (bool): Whether the vector should be the survival function [1(X > l), l=0..max_level-1] instead of the probability density function [1(X=l), l=0..max_level]
ordinal_infos (dict[str, Any]): All the hyperparameters concerning ordinal modelling (in particular maximum level per feature)

Returns:

One-hot encoding of data values.

Parameters:

sf (bool)
ordinal_infos (leaspy.utils.typing.KwargsType)
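The two encodings named by the sf parameter can be written out for a single ordinal observation. The function below is a hypothetical, list-based sketch of the formulas given above, not the tensorized Leaspy implementation.

```python
def one_hot(level, max_level, *, sf):
    """Encode one ordinal observation `level` taken in 0..max_level."""
    if sf:
        # Survival function: [1(X > l) for l = 0..max_level-1]
        return [1 if level > l else 0 for l in range(max_level)]
    # Probability density function: [1(X = l) for l = 0..max_level]
    return [1 if level == l else 0 for l in range(max_level + 1)]


pdf_vec = one_hot(2, 3, sf=False)  # [0, 0, 1, 0]
sf_vec = one_hot(2, 3, sf=True)    # [1, 1, 0]
```

The survival-function vector has one entry fewer than the density vector, which matches the "-1 when sf=True" remark on the last dimension of the encoded tensors.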

class EventDataframeDataReader(*, event_time_name='EVENT_TIME', event_bool_name='EVENT_BOOL', nb_events=None)

Bases: leaspy.io.data.abstract_dataframe_data_reader.AbstractDataframeDataReader

Methods to convert pandas.DataFrame to Leaspy-compliant data containers for event data only.

Parameters:

event_time_name (str): Name of the column in the dataframe that contains the time of event
event_bool_name (str): Name of the column in the dataframe that indicates whether the event is censored or not

Raises:

LeaspyDataInputError

Parameters:

event_time_name (str)
event_bool_name (str)
nb_events (Optional[int])

event_time_name = 'EVENT_TIME'

event_bool_name = 'EVENT_BOOL'

nb_events = None

DataframeDataReaderFactoryInput

class DataframeDataReaderNames(*args, **kwds)

Bases: enum.Enum

Enumeration defining the possible names for dataframe data readers.

EVENT = 'event'

VISIT = 'visit'

JOINT = 'joint'

classmethod from_string(reader_name)

Parameters: reader_name (str)
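A string-to-member lookup of this kind can be sketched with the standard enum module. ReaderNames is a stand-in that copies the three documented members; the case-insensitive matching and the ValueError raised for unknown names are assumptions of this sketch, since the behaviour of from_string is not described here.

```python
from enum import Enum


class ReaderNames(Enum):
    """Stand-in mirroring the three documented members."""
    EVENT = "event"
    VISIT = "visit"
    JOINT = "joint"

    @classmethod
    def from_string(cls, reader_name: str):
        # Assumption: lookup by value, ignoring case and surrounding spaces.
        try:
            return cls(reader_name.strip().lower())
        except ValueError:
            valid = [member.value for member in cls]
            raise ValueError(
                f"unknown reader {reader_name!r}; expected one of {valid}"
            ) from None


visit = ReaderNames.from_string("visit")
joint = ReaderNames.from_string(" Joint ")
```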

dataframe_data_reader_factory(reader, **kwargs)

Factory for dataframe data readers.

Parameters:

reader (str or AbstractDataframeDataReader or dict [ str, …])

If an instance of a subclass of AbstractDataframeDataReader, returns the instance.
If a string, then returns a new instance of the appropriate class (with optional parameters kwargs).
If a dictionary, it must contain the ‘name’ key and other initialization parameters.

**kwargs

Optional parameters for initializing the requested reader when a string.

Returns:

AbstractDataframeDataReader: The desired dataframe data reader.

Raises:

LeaspyModelInputError: If the reader is not supported.

Parameters:

reader (DataframeDataReaderFactoryInput)

Return type:

AbstractDataframeDataReader
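The dispatch performed by a factory of this kind (return an existing instance unchanged, or build a new one from a name) can be sketched with stand-in classes. None of the names below belong to Leaspy: BaseReader, VisitReader, EventReader, JointReader and reader_factory are hypothetical, and the built-in exceptions replace the library's own error types.

```python
class BaseReader:
    """Stand-in for an abstract reader; stores its keyword options."""
    def __init__(self, **kwargs):
        self.options = kwargs


class VisitReader(BaseReader):
    pass


class EventReader(BaseReader):
    pass


class JointReader(BaseReader):
    pass


_READERS = {"visit": VisitReader, "event": EventReader, "joint": JointReader}


def reader_factory(reader, **kwargs):
    """Hypothetical factory: accept an instance or a reader name."""
    if isinstance(reader, BaseReader):
        # Already a reader: hand it back untouched.
        return reader
    if isinstance(reader, str):
        try:
            reader_cls = _READERS[reader.lower()]
        except KeyError:
            raise ValueError(f"unsupported reader: {reader!r}") from None
        # kwargs are only used when a new instance is created from a name.
        return reader_cls(**kwargs)
    raise TypeError("reader must be a reader instance or a string")


event_reader = reader_factory("event", event_time_name="T")
same_reader = reader_factory(event_reader)
```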

class IndividualData(idx)

Container for an individual’s data

Parameters:

idx (IDType): Unique ID

Attributes:

idx (IDType): Unique ID
timepoints (np.ndarray[float, 1D]): Timepoints associated with the observations
observations (np.ndarray[float, 2D]): Observed data points. Shape is (n_timepoints, n_features)
cofactors (Dict[FeatureType, Any]): Cofactors in the form {cofactor_name: cofactor_value}
event_time (float): Time of an event. If the event is censored, the time corresponds to the last patient observation
event_bool (bool): Boolean to indicate if an event is censored or not: 1 observed, 0 censored

Parameters:

idx (leaspy.utils.typing.IDType)

idx: leaspy.utils.typing.IDType

timepoints: ndarray = None

observations: ndarray = None

event_time: ndarray | None = None

event_bool: ndarray | None = None

cofactors: dict[leaspy.utils.typing.FeatureType, Any]

add_observations(timepoints, observations)

Include new observations and associated timepoints

Parameters:

timepoints (array-like[float, 1D]): Timepoints associated with the observations to include
observations (array-like[float, 2D]): Observations to include

Raises:

LeaspyDataInputError

Parameters:

timepoints (list[float])
observations (list[list[float]])

Return type:

None
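One plausible way to include new observations while keeping timepoints ordered is sketched below with plain lists. This is an assumption-laden illustration, not the Leaspy implementation: MiniIndividual is a stand-in, the sorted insertion and the refusal of duplicate timepoints are assumptions, and ValueError replaces LeaspyDataInputError.

```python
import bisect


class MiniIndividual:
    """Stand-in for a per-individual container (lists instead of arrays)."""
    def __init__(self, idx):
        self.idx = idx
        self.timepoints = []
        self.observations = []

    def add_observations(self, timepoints, observations):
        if len(timepoints) != len(observations):
            raise ValueError("one observation row per timepoint is expected")
        for t, obs in zip(timepoints, observations):
            # Assumption: an existing timepoint is never overwritten.
            if t in self.timepoints:
                raise ValueError(f"timepoint {t} already exists for {self.idx}")
            # Insert at the position that keeps timepoints sorted.
            pos = bisect.bisect_left(self.timepoints, t)
            self.timepoints.insert(pos, t)
            self.observations.insert(pos, list(obs))


ind = MiniIndividual("s1")
ind.add_observations([65.5, 64.0], [[0.3, 0.4], [0.1, 0.2]])
ind.add_observations([65.0], [[0.2, 0.3]])
```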

add_event(event_time, event_bool)

Include event time and associated censoring bool

Parameters:

event_time (float): Time of the event
event_bool (float): 0 if censored (not observed) and 1 if observed

Parameters:

event_time (list[float])
event_bool (list[bool])

Return type:

None

add_cofactors(cofactors)

Include new cofactors

Parameters:

cofactors (Dict[FeatureType, Any]): Cofactors to include, in the form {name: value}

Raises:

LeaspyDataInputError
LeaspyTypeError

Parameters:

cofactors (dict[leaspy.utils.typing.FeatureType, Any])

Return type:

None

to_frame(headers, event_time_name, event_bool_name)

Parameters:

headers (list)
event_time_name (str)
event_bool_name (str)

Return type:

DataFrame

class JointDataframeDataReader(*, event_time_name='EVENT_TIME', event_bool_name='EVENT_BOOL', nb_events=None)

Bases: leaspy.io.data.abstract_dataframe_data_reader.AbstractDataframeDataReader

Methods to convert pandas.DataFrame to Leaspy-compliant data containers for event data and longitudinal data.

Parameters:

event_time_name (str): Name of the column in the dataframe that contains the time of event
event_bool_name (str): Name of the column in the dataframe that indicates whether the event is censored or not

Raises:

LeaspyDataInputError

Parameters:

event_time_name (str)
event_bool_name (str)
nb_events (Optional[int])

tol_diff = 0.001

visit_reader

event_reader

property event_time_name: str

Name of the event time column in dataset

Return type: str

property event_bool_name: str

Name of the event bool column in dataset

Return type: str

property dimension: int | None

Number of longitudinal outcomes in dataset.

Return type: Optional[int]

property long_outcome_names: list[leaspy.utils.typing.FeatureType]

Names of the longitudinal outcomes in dataset

Return type: list[leaspy.utils.typing.FeatureType]

property n_visits: int

Number of visits in the dataset

Return type: int

class VisitDataframeDataReader

Bases: leaspy.io.data.abstract_dataframe_data_reader.AbstractDataframeDataReader

Methods to convert pandas.DataFrame to Leaspy-compliant data containers for longitudinal data only.

Raises:

LeaspyDataInputError

property dimension: int | None

Number of longitudinal outcomes in dataset.

Returns:

int: Number of longitudinal outcomes in dataset

Return type:

Optional[int]