Detailed API

The human lymph system (or rather parts of it) is modelled as a directed graph here. Hence, a System consists of multiple Node and Edge instances, each of which is represented by a Python class.

Recently, we added the convenience class BilateralSystem that automatically creates a symmetric graph for the ipsilateral and contralateral network. It also allows the spread parameters to be constrained to be symmetric between the two sides.

Lymph system

class lymph.System(graph={})[source]

Class that models metastatic progression in a lymphatic system by representing it as a directed graph. The progression itself can be modelled via hidden Markov models (HMM) or Bayesian networks (BN).

Parameters

graph (dict) – Every key in this dictionary is a 2-tuple containing the type of the Node (“tumor” or “lnl”) and its name (arbitrary string). The corresponding value is a list of names of the nodes this node should be connected to via an Edge.
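As an illustration of the graph format described above, the following sketch builds such a dictionary in plain Python and derives the edges it implies. The node names ("primary", "I", "II", "III") and the helper edges_from_graph are made up for this example and are not part of the lymph package.

```python
# Hypothetical example graph: one tumor draining into three LNLs.
graph = {
    ("tumor", "primary"): ["I", "II", "III"],
    ("lnl", "I"): ["II"],
    ("lnl", "II"): ["III"],
    ("lnl", "III"): [],
}

def edges_from_graph(graph):
    """Return (start, end) name pairs for every edge implied by the dict."""
    return [
        (name, child)
        for (_node_type, name), children in graph.items()
        for child in children
    ]

edges = edges_from_graph(graph)
print(edges)
```

A dictionary of this shape is what the graph argument expects; each listed connection corresponds to one Edge.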

_evolve(t_first=0, t_last=None)[source]

Evolve hidden Markov model based system over time steps. Compute \(p(S \mid t)\) where \(S\) is a distinct state and \(t\) is the time.

Parameters
  • t_first (int) – First time-step that should be in the list of returned involvement probabilities.

  • t_last (Optional[int]) – Last time step to consider. This function computes involvement probabilities for all \(t\) between t_first and t_last. If t_first == t_last, \(p(S \mid t)\) is computed only at that time.

Return type

ndarray

Returns

A matrix with the values \(p(S \mid t)\) for each time-step.
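The time evolution described above can be sketched with plain NumPy. This is a minimal illustration and not the library's implementation: the function evolve, the starting distribution and the transition matrix A are assumptions chosen for the example.

```python
import numpy as np

def evolve(start, A, t_first, t_last):
    """Return p(S | t) for t_first <= t <= t_last, one row per time step.

    start : initial distribution over hidden states, shape (n_states,)
    A     : row-stochastic transition matrix, shape (n_states, n_states)
    """
    state = np.asarray(start, dtype=float)
    # advance to the first requested time step
    for _ in range(t_first):
        state = state @ A
    probs = [state]
    # collect one distribution per remaining time step
    for _ in range(t_last - t_first):
        state = state @ A
        probs.append(state)
    return np.array(probs)

# Toy chain with two states (healthy, involved); involvement is absorbing.
A = np.array([[0.7, 0.3],
              [0.0, 1.0]])
p = evolve([1.0, 0.0], A, t_first=0, t_last=2)
print(p)
```

Each row of the result is a probability distribution over the hidden states, so every row sums to one.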

_gen_C(table, delete_ones=True, aggregate_duplicates=True)[source]

Generate matrix \(\mathbf{C}\) that marginalizes over complete observations when a patient’s diagnosis is incomplete.

Parameters
  • table (ndarray) – 2D array where rows represent patients (of the same T-stage) and columns are LNL involvements.

  • delete_ones (bool) – If True, columns in the \(\mathbf{C}\) matrix that contain only ones (meaning the respective diagnosis is completely unknown) are removed, since they only add zeros to the log-likelihood.

  • aggregate_duplicates (bool) – If True, the number of occurrences of each diagnosis in the \(\mathbf{C}\) matrix is counted and collected in a vector \(\mathbf{f}\). The duplicate columns are then deleted.

Return type

ndarray

Returns

Matrix of ones and zeros that can be used to marginalize over possible diagnoses.
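To make the idea of the marginalization matrix concrete, here is a small sketch for a single modality. It is an assumption-laden illustration rather than the package's code: the function gen_C, the use of np.nan for missing entries and the ordering of complete observations are all choices made for this example.

```python
import itertools
import numpy as np

def gen_C(table):
    """Build a (2**n, n_patients) matrix of ones and zeros.

    table : 2D array, rows are patients, columns are LNLs; entries are
            0 (healthy), 1 (involved) or np.nan (not diagnosed).
    """
    table = np.asarray(table, dtype=float)
    n_lnls = table.shape[1]
    # all complete observations, e.g. (0, 0), (0, 1), (1, 0), (1, 1)
    complete = np.array(list(itertools.product([0, 1], repeat=n_lnls)))
    C = np.zeros((len(complete), len(table)), dtype=int)
    for j, diagnosis in enumerate(table):
        known = ~np.isnan(diagnosis)
        # a complete observation matches if it agrees on every known LNL
        matches = np.all(complete[:, known] == diagnosis[known], axis=1)
        C[matches, j] = 1
    return C

table = [[1, np.nan],        # LNL 1 involved, LNL 2 unknown
         [0, 0],             # fully observed
         [np.nan, np.nan]]   # nothing known
C = gen_C(table)
print(C)
```

The third patient yields a column of ones only, which is the kind of column that delete_ones would drop.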

find_edge(startname, endname)[source]

Finds and returns the edge instance which has a parent node named startname and ends with node endname.

Return type

Optional[Edge]

find_node(name)[source]

Finds and returns a node with name name.

Return type

Optional[Node]

get_graph()[source]

Lists the graph as it was provided when the system was created.

Return type

dict

list_edges()[source]

Lists all edges of the system with its corresponding start and end nodes.

Return type

List[Edge]

load_data(data, t_stages=None, modality_spsn=None, mode='HMM', gen_C_kwargs={'aggregate_duplicates': True, 'delete_ones': True})[source]

Transform tabular patient data (pd.DataFrame) into internal representation, consisting of one or several matrices \(\mathbf{C}_{T}\) that can marginalize over possible diagnoses.

Parameters
  • data (DataFrame) – Table with rows of patients. Must have a two-level MultiIndex where the top-level has categories ‘Info’ and the name of the available diagnostic modalities. Under ‘Info’, the second level is only ‘T-stage’, while under the modality, the names of the diagnosed lymph node levels are given as the columns.

  • t_stages (Optional[List[int]]) – List of T-stages that should be included in the learning process. If omitted, the list of T-stages is extracted from the DataFrame.

  • modality_spsn (Optional[Dict[str, List[float]]]) – Dictionary of specificity \(s_P\) and sensitivity \(s_N\) (in that order) for each observational/diagnostic modality. Can be omitted if the modalities were already defined.

  • mode (str) – “HMM” for hidden Markov model and “BN” for Bayesian net.

  • gen_C_kwargs (dict) – Keyword arguments for _gen_C(). For efficiency, both delete_ones and aggregate_duplicates should be set to True, resulting in a smaller \(\mathbf{C}\) matrix and an additional count vector \(\mathbf{f}\).
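The column layout expected for data can be sketched with pandas as follows. The modality name "CT", the LNL names and the patient rows are invented for this example, and the columns are the part of the frame that carries the two-level MultiIndex.

```python
import numpy as np
import pandas as pd

# Hypothetical modality "CT" and LNL names "I", "II", "III".
columns = pd.MultiIndex.from_tuples([
    ("Info", "T-stage"),
    ("CT", "I"),
    ("CT", "II"),
    ("CT", "III"),
])
data = pd.DataFrame(
    [
        [1, 1, 0, 0],
        [1, 0, 0, np.nan],   # LNL III not assessed
        [3, 1, 1, 0],
    ],
    columns=columns,
)

# T-stages found in the table and the involvement table of one modality
t_stages = sorted(int(t) for t in data[("Info", "T-stage")].unique())
ct_table = data["CT"].to_numpy(dtype=float)
print(t_stages)
print(ct_table)
```

Selecting the top-level key of a modality returns the sub-table whose columns are the LNL names.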

log_likelihood(spread_probs, t_stages=None, diag_times=None, max_t=10, time_dists=None, mode='HMM')[source]

Compute log-likelihood of (already stored) data, given the spread probabilities and either a discrete diagnose time or a distribution to use for marginalization over diagnose times.

Parameters
  • spread_probs (ndarray) – Spread probabilities from the tumor to the LNLs, as well as from (already involved) LNLs to downstream LNLs.

  • t_stages (Optional[List[Any]]) – List of T-stages that are also used in the data to denote how advanced the primary tumor of the patient is. This does not need to correspond to the clinical T-stages ‘T1’, ‘T2’ and so on, but can also be more abstract like ‘early’, ‘late’ etc. If not given, this will be inferred from the loaded data.

  • diag_times (Optional[Dict[Any, int]]) – For each T-stage, one can specify with what time step the likelihood should be computed. If this is set to None, and a distribution over diagnose times time_dists is provided, the function marginalizes over diagnose times.

  • max_t (Optional[int]) – Latest possible diagnose time. This is only used to return -np.inf in case one of the diag_times exceeds this value.

  • time_dists (Optional[Dict[Any, ndarray]]) – Distribution over diagnose times that can be used to compute the likelihood of the data, given the spread probabilities, but marginalized over the time of diagnosis. If set to None, a diagnose time must be explicitly set for each T-stage.

  • mode (str) – Compute the likelihood using the Bayesian network (“BN”) or the hidden Markov model (“HMM”). When using the Bayesian net, the inputs t_stages, diag_times, max_t and time_dists are ignored.

Return type

float

Returns

The log-likelihood \(\log{p(D \mid \theta)}\) where \(D\) is the data and \(\theta\) is the tuple of spread probabilities and diagnose times or distributions over diagnose times.
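The marginalization over diagnose times can be illustrated for a single T-stage as shown below. This is a sketch under stated assumptions and not the package's code: the function name, the argument names and the toy numbers are hypothetical, and \(\mathbf{B}\) stands for a matrix of observation probabilities given the hidden state.

```python
import numpy as np

def log_likelihood(state_probs, time_dist, B, C):
    """Sum of log p(diagnosis) over patients, marginalized over time.

    state_probs : p(S | t), shape (n_times, n_states)
    time_dist   : p(t), shape (n_times,)
    B           : p(complete observation | state), shape (n_states, n_obs)
    C           : ones/zeros selecting compatible observations,
                  shape (n_obs, n_patients)
    """
    p_state = time_dist @ state_probs    # marginalize over diagnose time
    p_patient = p_state @ B @ C          # one probability per patient
    if np.any(p_patient <= 0.0):
        return -np.inf
    return float(np.sum(np.log(p_patient)))

state_probs = np.array([[1.0, 0.0],
                        [0.7, 0.3],
                        [0.49, 0.51]])
time_dist = np.array([0.25, 0.5, 0.25])
B = np.array([[0.9, 0.1],    # healthy: specificity 0.9
              [0.2, 0.8]])   # involved: sensitivity 0.8
C = np.array([[1, 0, 1],
              [0, 1, 1]])
llh = log_likelihood(state_probs, time_dist, B, C)
print(llh)
```

The third column of C contains only ones, so that patient contributes log(1) = 0 to the sum.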

marginal_log_likelihood(theta, t_stages=None, time_dists={})[source]

Compute the log-likelihood of the (already stored) data, given the spread parameters, marginalized over time of diagnosis via time distributions.

Parameters
  • theta (ndarray) – Set of parameters, consisting of the base probabilities \(b\) (as many as the system has nodes) and the transition probabilities \(t\) (as many as the system has edges).

  • t_stages (Optional[List[Any]]) – List of T-stages that should be included in the learning process.

  • time_dists (Dict[Any, ndarray]) – Distribution over the probability of diagnosis at different times \(t\) given T-stage.

Return type

float

Returns

The log-likelihood of the data, given the spread parameters.

property modalities

Return specificity & sensitivity stored in this System.

obs_prob(diagnoses_dict, log=False)[source]

Computes the probability to see certain diagnoses, given the system’s current state.

Parameters
  • diagnoses_dict (Dict[str, List[int]]) – Dictionary of diagnoses (one for each diagnostic modality). Each diagnosis must be an array of integers with one entry per LNL in the system.

  • log (bool) – If True, the log probability is computed. (default: False)

Return type

float

Returns

The probability to see the given diagnoses.
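A possible way to compute such an observation probability is sketched below. It assumes that modalities and LNLs are conditionally independent given the hidden state, so the individual probabilities multiply. The function signature and the modality names are hypothetical and do not mirror the method above.

```python
import math

def obs_prob(state, diagnoses, modalities, log=False):
    """Probability of the diagnoses given the hidden state of each LNL.

    state      : list of 0/1, one entry per LNL
    diagnoses  : {modality: list of 0/1, one entry per LNL}
    modalities : {modality: [specificity, sensitivity]}
    """
    log_p = 0.0
    for name, diagnosis in diagnoses.items():
        spec, sens = modalities[name]
        for x, z in zip(state, diagnosis):
            if x == 1:
                p = sens if z == 1 else 1.0 - sens
            else:
                p = spec if z == 0 else 1.0 - spec
            log_p += math.log(p) if p > 0 else -math.inf
    return log_p if log else math.exp(log_p)

state = [1, 0]
diagnoses = {"CT": [1, 0], "MRI": [1, 1]}
modalities = {"CT": [0.9, 0.8], "MRI": [0.95, 0.7]}
prob = obs_prob(state, diagnoses, modalities)
print(prob)
```

Here the result is the product 0.8 · 0.9 · 0.7 · 0.05 of two true positives, one true negative and one false positive.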

risk(spread_probs=None, inv=None, diagnoses={}, diag_time=None, time_dist=None, mode='HMM')[source]

Compute risk(s) of involvement given a specific (but potentially incomplete) diagnosis.

Parameters
  • spread_probs (Optional[ndarray]) – Set of new spread parameters. If not given (None), the currently set parameters will be used.

  • inv (Optional[ndarray]) – Specific hidden involvement one is interested in. If only parts of the state are of interest, the remainder can be masked with values None. If specified, the function returns a single risk.

  • diagnoses (Dict[str, ndarray]) – Dictionary that can hold a potentially incomplete (mask with None) diagnose for every available modality. Leaving out available modalities will assume a completely missing diagnosis.

  • diag_time (Optional[int]) – Time of diagnosis. Either this or the time_dist to marginalize over diagnose times must be given.

  • time_dist (Optional[ndarray]) – Distribution to marginalize over diagnose times. Either this, or the diag_time must be given.

  • mode (str) – Set to "HMM" for the hidden Markov model risk (requires the time_dist) or to "BN" for the Bayesian network version.

Return type

Union[float, ndarray]

Returns

A single probability value if inv is specified and an array with probabilities for all possible hidden states otherwise.
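Conceptually, the risk is a posterior obtained via Bayes' rule from a prior over hidden states and the probability of the diagnosis given each state. The following sketch illustrates this with made-up numbers. The function risk here is a stand-in for illustration, not the method documented above, and the state ordering is an assumption.

```python
import itertools
import numpy as np

def risk(prior, likelihood, inv=None):
    """Posterior over hidden states, or the risk of one involvement pattern.

    prior      : p(S) over all 2**n hidden states, ordered like
                 itertools.product([0, 1], repeat=n)
    likelihood : p(diagnosis | S) for the same states
    inv        : optional list of 0/1/None per LNL; None means "not of interest"
    """
    posterior = np.asarray(prior) * np.asarray(likelihood)
    posterior = posterior / posterior.sum()
    if inv is None:
        return posterior
    states = itertools.product([0, 1], repeat=len(inv))
    # a state matches if it agrees with inv wherever inv is not None
    matches = [
        all(want is None or want == have for want, have in zip(inv, state))
        for state in states
    ]
    return float(posterior[np.array(matches)].sum())

prior = np.array([0.4, 0.1, 0.3, 0.2])        # states 00, 01, 10, 11
likelihood = np.array([0.1, 0.1, 0.8, 0.8])   # diagnosis points to LNL 1
r = risk(prior, likelihood, inv=[None, 1])    # risk that LNL 2 is involved
print(r)
```

Without inv, the function returns the full posterior over all hidden states, matching the two return variants described above.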

property spread_probs: numpy.ndarray

Return the spread probabilities of the Edge instances in the network in the order they appear in the graph.

Return type

ndarray

property state

Return the currently set state of the system.

time_log_likelihood(theta, t_stages, max_t=10)[source]

Compute likelihood given the spread parameters and the time of diagnosis for each T-stage.

Parameters
  • theta (ndarray) – Set of parameters, consisting of the spread probabilities (as many as the system has Edge instances) and the time of diagnosis for all T-stages.

  • t_stages (List[Any]) – keywords of T-stages that are present in the dictionary of C matrices and the previously loaded dataset.

  • max_t (int) – Largest accepted time-point.

Return type

float

Returns

The likelihood of the data, given the spread parameters as well as the diagnose time for each T-stage.

trans_prob(newstate, log=False, acquire=False)[source]

Computes the probability to transition to newstate, given its current state.

Parameters
  • newstate (List[int]) – List of new states for each LNL in the lymphatic system. The transition probability \(t\) will be computed from the current states to these states.

  • log (bool) – if True, the log-probability is computed. (default: False)

  • acquire (bool) – if True, after computing and returning the probability, the system updates its own state to be newstate. (default: False)

Return type

float

Returns

Transition probability \(t\).
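One common way to define such a transition probability is sketched below. It rests on assumptions that the text above does not confirm: an involved LNL is taken to stay involved, and the incoming spread events are treated as independent. All names in the block are hypothetical.

```python
def trans_prob(state, newstate, base_probs, trans_probs):
    """Probability to go from `state` to `newstate` in one time step.

    state, newstate : lists of 0/1, one entry per LNL
    base_probs      : spread probability from the tumor to each LNL
    trans_probs     : {(parent_index, child_index): probability}
    """
    prob = 1.0
    for i, (old, new) in enumerate(zip(state, newstate)):
        if old == 1:
            # assumption: an involved LNL stays involved
            prob *= 1.0 if new == 1 else 0.0
            continue
        # probability that no incoming edge spreads to LNL i
        stay_healthy = 1.0 - base_probs[i]
        for (parent, child), t in trans_probs.items():
            if child == i and state[parent] == 1:
                stay_healthy *= 1.0 - t
        prob *= (1.0 - stay_healthy) if new == 1 else stay_healthy
    return prob

base_probs = [0.3, 0.1]
trans_probs = {(0, 1): 0.5}
p_spread = trans_prob([1, 0], [1, 1], base_probs, trans_probs)
p_heal = trans_prob([1, 0], [0, 0], base_probs, trans_probs)
print(p_spread, p_heal)
```

In this toy case the second LNL stays healthy with probability 0.9 · 0.5, so it becomes involved with probability 0.55.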

Bilateral lymph system

class lymph.BilateralSystem(graph={}, base_symmetric=False, trans_symmetric=True)[source]

Class that models metastatic progression in a lymphatic system bilaterally by creating two System instances that are symmetric in their connections. The parameters describing the spread probabilities however need not be symmetric.

Parameters
  • graph (dict) – Dictionary of the same kind as for initialization of System. This graph will be passed to the constructors of two System attributes of this class.

  • base_symmetric (bool) – If True, the spread probabilities of the two sides from the tumor(s) to the LNLs will be set symmetrically.

  • trans_symmetric (bool) – If True, the spread probabilities among the LNLs will be set symmetrically.

See also

System: Two instances of this class are created as attributes.

binom_marg_log_likelihood(theta, t_stages, max_t=10)[source]

Compute marginal log-likelihood using binomial distributions to sum over the diagnose times.

Parameters
  • theta (ndarray) – Set of parameters, consisting of the spread probabilities (as many as the system has Edge instances) and the binomial distribution’s \(p\) parameters.

  • t_stages (List[Any]) – keywords of T-stages that are present in the dictionary of C matrices and the previously loaded dataset.

  • max_t (int) – Latest accepted time-point.

Return type

float

Returns

The log-likelihood of the (already stored) data, given the spread probabilities as well as the parameters for binomial distributions used to marginalize over diagnose times.
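A binomial distribution over diagnose times can be written down with the standard library alone. In this sketch, the function name and the assumption that the support is 0, 1, ..., max_t are mine; the values of p are arbitrary.

```python
from math import comb

def binom_time_dist(p, max_t=10):
    """Binomial pmf over the time steps 0, 1, ..., max_t."""
    return [
        comb(max_t, t) * p**t * (1.0 - p) ** (max_t - t)
        for t in range(max_t + 1)
    ]

early = binom_time_dist(0.3)   # diagnosis tends to happen early
late = binom_time_dist(0.7)    # diagnosis tends to happen late
mean_early = sum(t * prob for t, prob in enumerate(early))
print(mean_early)
```

One such distribution per T-stage then plays the role of the time distribution used for marginalization.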

load_data(data, t_stages=None, modality_spsn=None, mode='HMM')[source]
Parameters

data (DataFrame) – Table with rows of patients. Columns must have three levels. The first column is (‘Info’, ‘tumor’, ‘T-stage’). The rest of the columns are separated by modality names on the top level, then subdivided into ‘ipsi’ & ‘contra’ by the second level and finally, in the third level, the names of the lymph node level are given.

See also

System.load_data(): Data loading method of unilateral system.

System._gen_C(): Generate marginalization matrix.

log_likelihood(spread_probs, t_stages=None, diag_times=None, max_t=10, time_dists=None, model='HMM')[source]

Compute log-likelihood of (already stored) data, given the spread probabilities and either a discrete diagnose time or a distribution to use for marginalization over diagnose times.

Parameters
  • spread_probs (ndarray) – Spread probabilities from the tumor to the LNLs, as well as from (already involved) LNLs to downstream LNLs.

  • t_stages (Optional[List[Any]]) – List of T-stages that are also used in the data to denote how advanced the primary tumor of the patient is. This does not need to correspond to the clinical T-stages ‘T1’, ‘T2’ and so on, but can also be more abstract like ‘early’, ‘late’ etc.

  • diag_times (Optional[Dict[Any, int]]) – For each T-stage, one can specify with what time step the likelihood should be computed. If this is set to None, and a distribution over diagnose times time_dists is provided, the function marginalizes over diagnose times.

  • max_t (Optional[int]) – Latest possible diagnose time. This is only used to return -np.inf in case one of the diag_times exceeds this value.

  • time_dists (Optional[Dict[Any, ndarray]]) – Distribution over diagnose times that can be used to compute the likelihood of the data, given the spread probabilities, but marginalized over the time of diagnosis. If set to None, a diagnose time must be explicitly set for each T-stage.

  • mode – Compute the likelihood using the Bayesian network (“BN”) or the hidden Markov model (“HMM”). When using the Bayesian net, the inputs t_stages, diag_times, max_t and time_dists are ignored.

Returns

The log-likelihood \(\log{p(D \mid \theta)}\) where \(D\) is the data and \(\theta\) is the tuple of spread probabilities and diagnose times or distributions over diagnose times.

See also

System.log_likelihood(): The log-likelihood function of the unilateral system.

marginal_log_likelihood(theta, t_stages=None, time_dists={})[source]

Compute the likelihood of the (already stored) data, given the spread parameters, marginalized over time of diagnosis via time distributions. Wraps the log_likelihood() method.

Parameters
  • theta (ndarray) – Set of parameters, consisting of the base probabilities \(b\) (as many as the system has nodes) and the transition probabilities \(t\) (as many as the system has edges).

  • t_stages (Optional[List[Any]]) – List of T-stages that should be included in the learning process.

  • time_dists (dict) – Distribution over the probability of diagnosis at different times \(t\) given T-stage.

Return type

float

Returns

The log-likelihood of a parameter sample.

See also

log_likelihood(): The actual likelihood function, which is called with diag_times set to None.

property modalities

Compute the two system’s observation matrices \(\mathbf{B}^{\text{i}}\) and \(\mathbf{B}^{\text{c}}\).

See also

System.set_modalities(): Setting modalities in unilateral System.

risk(spread_probs=None, inv={'contra': None, 'ipsi': None}, diagnoses={'contra': {}, 'ipsi': {}}, diag_time=None, time_dist=None, mode='HMM')[source]

Compute risk of ipsi- & contralateral involvement given specific (but potentially incomplete) diagnoses for each side of the neck.

Parameters
  • spread_probs (Optional[ndarray]) – Set of new spread parameters. If not given (None), the currently set parameters will be used.

  • inv (Dict[str, Optional[ndarray]]) – Dictionary that can have the keys "ipsi" and "contra" with the respective values being the involvements of interest. If (for one side or both) no involvement of interest is given, it’ll be marginalized. The array themselves may contain True, False or None for each LNL corresponding to the risk for involvement, no involvement and “not interested”.

  • diagnoses (Dict[str, Dict]) – Dictionary that itself may contain two dictionaries. One with key “ipsi” and one with key “contra”. The respective value is then a dictionary that can hold a potentially incomplete (mask with None) diagnose for every available modality. Leaving out available modalities will assume a completely missing diagnosis.

  • diag_time (Optional[int]) – Time of diagnosis. Either this or the time_dist to marginalize over diagnose times must be given.

  • time_dist (Optional[ndarray]) – Distribution to marginalize over diagnose times. Either this, or the diag_time must be given.

  • mode (str) – Set to "HMM" for the hidden Markov model risk (requires the time_dist) or to "BN" for the Bayesian network version.

Return type

float

property spread_probs: numpy.ndarray

Return the spread probabilities of the Edge instances in the network. Length and structure of theta depend on the set symmetries of the network.

See also

theta(): Setting the spread probabilities and symmetries.

Return type

ndarray

property state: numpy.ndarray

Return the current state (healthy or involved) of all LNLs in the system.

Return type

ndarray

time_log_likelihood(theta, t_stages, max_t=10)[source]

Compute likelihood given the spread parameters and the time of diagnosis for each T-stage. Wraps the log_likelihood() method.

Parameters
  • theta (ndarray) – Set of parameters, consisting of the spread probabilities (as many as the system has Edge instances) and the time of diagnosis for all T-stages.

  • t_stages (List[Any]) – keywords of T-stages that are present in the dictionary of C matrices and the previously loaded dataset.

  • max_t (int) – Latest accepted time-point.

Return type

float

Returns

The likelihood of the data, given the spread parameters as well as the diagnose time for each T-stage.

See also

log_likelihood(): The theta argument of this function is split into spread_probs and diag_times, which are then passed to the actual likelihood function.
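The splitting of theta could look roughly like the sketch below. The ordering (spread probabilities first, then one diagnose time per T-stage), the rounding to integer time steps and the helper name split_theta are assumptions made for illustration.

```python
import numpy as np

def split_theta(theta, n_edges, t_stages, max_t=10):
    """Split theta into spread probabilities and one diagnose time per T-stage."""
    theta = np.asarray(theta, dtype=float)
    spread_probs = theta[:n_edges]
    times = theta[n_edges:]
    if len(times) != len(t_stages):
        raise ValueError("need exactly one diagnose time per T-stage")
    # assumption: diagnose times are rounded to whole time steps
    diag_times = {stage: int(round(t)) for stage, t in zip(t_stages, times)}
    # times outside [0, max_t] would lead to a log-likelihood of -inf
    valid = all(0 <= t <= max_t for t in diag_times.values())
    return spread_probs, diag_times, valid

theta = [0.1, 0.2, 0.3, 2, 7]
spread_probs, diag_times, valid = split_theta(theta, 3, ["early", "late"])
print(spread_probs, diag_times, valid)
```

When valid is False, a wrapper of this kind would return -np.inf instead of calling the likelihood function.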

Edge

Represents a lymphatic drainage pathway and therefore a spread probability.

class lymph.Edge(start, end, t=0.0)[source]

Class for the connections between lymph node levels (LNLs) represented by the Node class.

Parameters
  • start (Node) – Parent node

  • end (Node) – Child node

  • t (float) – Transition probability in case start-Node has state 1 (microscopic involvement).

Node

Represents a lymph node level (LNL) or rather a random variable associated with it. It encodes the microscopic involvement of the LNL and - if involved - might spread along outgoing edges.

class lymph.Node(name, state=0, typ=None)[source]

Class for lymph node levels (LNLs) in a lymphatic system.

Parameters
  • name (str) – Name of the node

  • state (int) – Current state this LNL is in. Can be in {0, 1}

  • typ (Optional[str]) – Can be either “lnl”, “tumor” or None. If it is the latter, the type will be inferred from the name of the node: if the name starts with a t (case-insensitive), it will be a tumor node, and a lymph node level (lnl) otherwise. (default: None)
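The inference rule for typ can be expressed in a few lines. The helper infer_node_type below is a hypothetical stand-in written for this illustration, not a function of the package.

```python
def infer_node_type(name, typ=None):
    """Return "tumor" or "lnl", inferring the type from the name if needed."""
    if typ is not None:
        if typ not in ("tumor", "lnl"):
            raise ValueError('typ must be "tumor", "lnl" or None')
        return typ
    # names starting with "t" or "T" are treated as tumor nodes
    return "tumor" if name.lower().startswith("t") else "lnl"

print(infer_node_type("T"))              # inferred from the name
print(infer_node_type("II"))             # inferred from the name
print(infer_node_type("T1", typ="lnl"))  # explicit type wins
```

Under this rule, an LNL whose name happens to start with a t needs an explicit typ to avoid being treated as a tumor.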

bn_prob(log=False)[source]

Computes the conditional probability of a node being in the state it is in, given its parents are in the states they are in.

Parameters

log (bool) – If True, returns the log-probability. (default: False)

Return type

float

Returns

The conditional (log-)probability.

obs_prob(obs, obstable=array([[1.0, 0.0], [0.0, 1.0]]), log=False)[source]

Compute the probability of observing a certain diagnose, given its current state.

Parameters
  • obs (int) – Diagnose/observation for the node.

  • obstable (ndarray) – 2x2 matrix containing info about sensitivity and specificity of the observational/diagnostic modality from which obs was obtained.

  • log (bool) – If True, method returns the log-prob.

Return type

float

Returns

The probability of observing the given diagnose.
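How a 2x2 observation table relates to specificity and sensitivity is sketched below. The index convention (rows index the observation, columns the true state) is an assumption for this example, as are the helper names.

```python
import numpy as np

def make_obstable(specificity, sensitivity):
    """2x2 table with entry [obs, state] = p(obs | state)."""
    return np.array([
        [specificity,       1.0 - sensitivity],   # obs = 0 (negative)
        [1.0 - specificity, sensitivity],         # obs = 1 (positive)
    ])

def node_obs_prob(state, obs, obstable, log=False):
    """Look up p(obs | state), optionally as a log-probability."""
    p = obstable[obs, state]
    return float(np.log(p)) if log else float(p)

table = make_obstable(specificity=0.9, sensitivity=0.8)
true_positive = node_obs_prob(state=1, obs=1, obstable=table)
false_positive = node_obs_prob(state=0, obs=1, obstable=table)
print(true_positive, false_positive)
```

With this convention each column sums to one, and the identity matrix used as the default obstable corresponds to a perfect modality.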

trans_prob(log=False)[source]

Compute the transition probabilities from the current state to each of the two possible states.

Parameters

log (bool) – If True, the method returns the log-probability. (default: False)

Return type

float

Returns

The transition probabilities from the current state to each of the two possible states.