Detailed API
The human lymph system (or rather parts of it) is modelled as a directed graph here. Hence, a System consists of multiple Node and Edge instances, each of which is represented by a Python class.
Recently, we added the convenience class BilateralSystem, which automatically creates a symmetric graph for the ipsilateral and contralateral network. It also allows the spread parameters to be set symmetrically.
Lymph system
- class lymph.System(graph={})[source]
Class that models metastatic progression in a lymphatic system by representing it as a directed graph. The progression itself can be modelled via hidden Markov models (HMM) or Bayesian networks (BN).
- Parameters
graph (dict) – Every key in this dictionary is a 2-tuple containing the type of the Node (“tumor” or “lnl”) and its name (arbitrary string). The corresponding value is a list of names this node should be connected to via an Edge.
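A minimal sketch of such a graph dictionary follows. The node names and the drainage pattern are purely illustrative assumptions, not a recommended anatomy. The snippet only uses plain Python so it runs without the package installed.

```python
# Hypothetical example graph: one tumor draining into four lymph node levels,
# with the levels chained I -> II -> III -> IV. All names are illustrative.
graph = {
    ("tumor", "primary"): ["I", "II", "III", "IV"],
    ("lnl", "I"): ["II"],
    ("lnl", "II"): ["III"],
    ("lnl", "III"): ["IV"],
    ("lnl", "IV"): [],
}

# Every (start, end) pair corresponds to one directed edge of the graph.
edges = [(name, target) for (_, name), targets in graph.items() for target in targets]

# Names of all nodes that were declared as lymph node levels.
lnl_names = [name for (kind, name) in graph if kind == "lnl"]

print(len(edges), "edges between", len(graph), "nodes")
```

With the package available, this dictionary would be handed over as `lymph.System(graph=graph)`, following the signature documented above.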
- _evolve(t_first=0, t_last=None)[source]
Evolve hidden Markov model based system over time steps. Compute \(p(S \mid t)\) where \(S\) is a distinct state and \(t\) is the time.
- Parameters
t_first (int) – First time-step that should be in the list of returned involvement probabilities.
t_last (Optional[int]) – Last time-step to consider. This function computes involvement probabilities for all \(t\) between t_first and t_last. If t_first == t_last, \(p(S \mid t)\) is computed only at that time.
- Return type
ndarray
- Returns
A matrix with the values \(p(S \mid t)\) for each time-step.
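The idea behind the time evolution can be sketched for the smallest possible case: a single LNL with two states (healthy, involved). The function `evolve`, the transition matrix and the spread probability `b` are hypothetical stand-ins chosen for illustration. They are not the library's internals, and involved nodes are assumed never to recover.

```python
def evolve(start, trans, t_first, t_last):
    """Return the state distributions p(S | t) for t = t_first, ..., t_last."""
    n = len(start)
    dist = list(start)
    history = []
    for t in range(t_last + 1):
        if t >= t_first:
            history.append(list(dist))
        # One time-step: new_dist[j] = sum_i dist[i] * trans[i][j]
        dist = [sum(dist[i] * trans[i][j] for i in range(n)) for j in range(n)]
    return history

b = 0.3  # assumed probability that the tumor spreads to the LNL per time-step
trans = [
    [1 - b, b],   # healthy  -> healthy / involved
    [0.0, 1.0],   # involved -> stays involved (assumption: no recovery)
]

# Start in the healthy state and collect one row per time-step.
probs = evolve([1.0, 0.0], trans, t_first=0, t_last=3)
for t, row in enumerate(probs):
    print(t, row)
```

Each row is one distribution over the states, so the result has the shape of the matrix described under "Returns", with one row per time-step.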
- _gen_C(table, delete_ones=True, aggregate_duplicates=True)[source]
Generate the matrix \(\mathbf{C}\) that marginalizes over complete observations when a patient’s diagnosis is incomplete.
- Parameters
table (ndarray) – 2D array where rows represent patients (of the same T-stage) and columns are LNL involvements.
delete_ones (bool) – If True, columns in the \(\mathbf{C}\) matrix that contain only ones (meaning the respective diagnosis is completely unknown) are removed, since they only add zeros to the log-likelihood.
aggregate_duplicates (bool) – If True, the number of occurrences of diagnoses in the \(\mathbf{C}\) matrix is counted and collected in a vector \(\mathbf{f}\). The duplicate columns are then deleted.
- Return type
ndarray
- Returns
Matrix of ones and zeros that can be used to marginalize over possible diagnoses.
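To illustrate what such a marginalization matrix looks like, here is a small sketch. It assumes that missing entries are encoded as `None` and that complete observations are enumerated in binary counting order. Both are assumptions made for this example, and the helper `marginalization_matrix` is hypothetical rather than the library's implementation.

```python
from itertools import product

def marginalization_matrix(table):
    """Build a 0/1 matrix with one row per complete observation and one
    column per patient. An entry is 1 if the complete observation is
    compatible with the patient's (possibly incomplete) diagnosis."""
    n_lnls = len(table[0])
    complete = list(product([0, 1], repeat=n_lnls))
    columns = []
    for diagnosis in table:
        column = [
            int(all(d is None or d == o for d, o in zip(diagnosis, obs)))
            for obs in complete
        ]
        columns.append(column)
    # Transpose so that patients end up as columns.
    return [list(row) for row in zip(*columns)]

table = [
    [1, None],     # first LNL positive, second unknown
    [None, None],  # completely unknown diagnosis
    [0, 1],        # complete diagnosis
]
C = marginalization_matrix(table)
for row in C:
    print(row)
```

The second column consists only of ones. This is the kind of column that `delete_ones` would drop, because it carries no information about the parameters.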
- find_edge(startname, endname)[source]
Finds and returns the edge instance which has a parent node named startname and ends with node endname.
- Return type
Optional[Edge]
- get_graph()[source]
Lists the graph as it was provided when the system was created.
- Return type
dict
- list_edges()[source]
Lists all edges of the system with their corresponding start and end nodes.
- Return type
List[Edge]
- load_data(data, t_stages=None, modality_spsn=None, mode='HMM', gen_C_kwargs={'aggregate_duplicates': True, 'delete_ones': True})[source]
Transform tabular patient data (pd.DataFrame) into the internal representation, consisting of one or several matrices \(\mathbf{C}_{T}\) that can marginalize over possible diagnoses.
- Parameters
data (DataFrame) – Table with rows of patients. Must have a two-level MultiIndex where the top level has the categories ‘Info’ and the names of the available diagnostic modalities. Under ‘Info’, the second level is only ‘T-stage’, while under each modality, the names of the diagnosed lymph node levels are given as the columns.
t_stages (Optional[List[int]]) – List of T-stages that should be included in the learning process. If omitted, the list of T-stages is extracted from the DataFrame.
modality_spsn (Optional[Dict[str, List[float]]]) – Dictionary of specificity \(s_P\) and sensitivity \(s_N\) (in that order) for each observational/diagnostic modality. Can be omitted if the modalities were already defined.
mode (str) – “HMM” for hidden Markov model and “BN” for Bayesian net.
gen_C_kwargs (dict) – Keyword arguments for _gen_C(). For efficiency, both delete_ones and aggregate_duplicates should be set to True, resulting in a smaller \(\mathbf{C}\) matrix and an additional count vector \(\mathbf{f}\).
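The expected column layout can be sketched in plain Python as a list of 2-tuples. The modality names, LNL names and the specificity/sensitivity numbers below are hypothetical placeholders, not measured values.

```python
# Hypothetical modalities and lymph node levels, for illustration only.
modalities = ["CT", "pathology"]
lnls = ["I", "II", "III"]

# Top level: 'Info' or a modality name. Second level: 'T-stage' or an LNL name.
columns = [("Info", "T-stage")] + [(mod, lnl) for mod in modalities for lnl in lnls]

# One patient row: T-stage plus one (possibly missing) finding per modality and LNL.
patient = {col: None for col in columns}
patient[("Info", "T-stage")] = 2
patient[("CT", "II")] = 1
patient[("pathology", "II")] = 1
patient[("pathology", "III")] = 0

# Specificity first, then sensitivity, for every modality (made-up numbers).
modality_spsn = {"CT": [0.8, 0.8], "pathology": [1.0, 1.0]}

for col in columns:
    print(col, patient[col])
```

In pandas, `pandas.MultiIndex.from_tuples(columns)` turns such a list into a two-level column index for the DataFrame.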
- log_likelihood(spread_probs, t_stages=None, diag_times=None, max_t=10, time_dists=None, mode='HMM')[source]
Compute the log-likelihood of (already stored) data, given the spread probabilities and either a discrete diagnosis time or a distribution to use for marginalization over diagnosis times.
- Parameters
spread_probs (ndarray) – Spread probabilities from the tumor to the LNLs, as well as from (already involved) LNLs to downstream LNLs.
t_stages (Optional[List[Any]]) – List of T-stages that are also used in the data to denote how advanced the primary tumor of the patient is. These do not need to correspond to the clinical T-stages ‘T1’, ‘T2’ and so on, but can also be more abstract, like ‘early’, ‘late’ etc. If not given, this will be inferred from the loaded data.
diag_times (Optional[Dict[Any, int]]) – For each T-stage, one can specify at which time step the likelihood should be computed. If this is set to None and a distribution over diagnosis times time_dists is provided, the function marginalizes over diagnosis times.
max_t (Optional[int]) – Latest possible diagnosis time. This is only used to return -np.inf in case one of the diag_times exceeds this value.
time_dists (Optional[Dict[Any, ndarray]]) – Distribution over diagnosis times that can be used to compute the likelihood of the data, given the spread probabilities, but marginalized over the time of diagnosis. If set to None, a diagnosis time must be explicitly set for each T-stage.
mode (str) – Compute the likelihood using the Bayesian network (“BN”) or the hidden Markov model (“HMM”). When using the Bayesian net, the inputs t_stages, diag_times, max_t and time_dists are ignored.
- Return type
float
- Returns
The log-likelihood \(\log{p(D \mid \theta)}\) where \(D\) is the data and \(\theta\) is the tuple of spread probabilities and diagnosis times or distributions over diagnosis times.
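The difference between a fixed diagnosis time and marginalization over a time distribution can be sketched for a single patient. The helper names and all numbers are hypothetical. The list `likelihood_per_t` stands in for the probability of the patient's diagnosis at each time-step, which the model would normally provide.

```python
import math

def log_likelihood_fixed_time(likelihood_per_t, diag_time):
    """Log-likelihood when the diagnosis time is known exactly."""
    p = likelihood_per_t[diag_time]
    return math.log(p) if p > 0 else -math.inf

def log_likelihood_marginal(likelihood_per_t, time_dist):
    """Log-likelihood after summing over all diagnosis times,
    weighted by the prior probability of each time."""
    total = sum(w * p for w, p in zip(time_dist, likelihood_per_t))
    return math.log(total) if total > 0 else -math.inf

likelihood_per_t = [0.0, 0.2, 0.4]  # made-up p(diagnosis | t) for t = 0, 1, 2
time_dist = [0.25, 0.5, 0.25]       # made-up prior over diagnosis times

fixed = log_likelihood_fixed_time(likelihood_per_t, diag_time=2)
marginal = log_likelihood_marginal(likelihood_per_t, time_dist)
print(fixed, marginal)
```

For a whole dataset, the per-patient log-likelihoods would be summed. Returning negative infinity for impossible cases mirrors the `-np.inf` behaviour mentioned for `max_t`.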
- marginal_log_likelihood(theta, t_stages=None, time_dists={})[source]
Compute the log-likelihood of the (already stored) data, given the spread parameters, marginalized over time of diagnosis via time distributions.
- Parameters
theta (ndarray) – Set of parameters, consisting of the base probabilities \(b\) (as many as the system has nodes) and the transition probabilities \(t\) (as many as the system has edges).
t_stages (Optional[List[Any]]) – List of T-stages that should be included in the learning process.
time_dists (Dict[Any, ndarray]) – Distribution over the probability of diagnosis at different times \(t\) given T-stage.
- Return type
float
- Returns
The log-likelihood of the data, given the spread parameters.
- obs_prob(diagnoses_dict, log=False)[source]
Computes the probability of seeing certain diagnoses, given the system’s current state.
- Parameters
diagnoses_dict (Dict[str, List[int]]) – Dictionary of diagnoses (one for each diagnostic modality). A diagnosis must be an array of integers that is as long as the system has LNLs.
log (bool) – If True, the log-probability is computed. (default: False)
- Return type
float
- Returns
The probability to see the given diagnoses.
- risk(spread_probs=None, inv=None, diagnoses={}, diag_time=None, time_dist=None, mode='HMM')[source]
Compute risk(s) of involvement given a specific (but potentially incomplete) diagnosis.
- Parameters
spread_probs (Optional[ndarray]) – Set of new spread parameters. If not given (None), the currently set parameters will be used.
inv (Optional[ndarray]) – Specific hidden involvement one is interested in. If only parts of the state are of interest, the remainder can be masked with None values. If specified, the function returns a single risk.
diagnoses (Dict[str, ndarray]) – Dictionary that can hold a potentially incomplete (mask with None) diagnosis for every available modality. Leaving out an available modality is treated as a completely missing diagnosis for that modality.
diag_time (Optional[int]) – Time of diagnosis. Either this or the time_dist to marginalize over diagnosis times must be given.
time_dist (Optional[ndarray]) – Distribution to marginalize over diagnosis times. Either this or the diag_time must be given.
mode (str) – Set to "HMM" for the hidden Markov model risk (requires the time_dist) or to "BN" for the Bayesian network version.
- Return type
Union[float, ndarray]
- Returns
A single probability value if inv is specified, and an array with probabilities for all possible hidden states otherwise.
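Conceptually, a risk of this kind is a posterior probability of hidden involvement given a diagnosis. The following sketch applies Bayes' rule to a single LNL and a single modality. The function `posterior_risk` and all numbers are hypothetical and serve only to illustrate the idea, not to reproduce the library's computation.

```python
def posterior_risk(prior, sensitivity, specificity, observed_positive):
    """Probability that the LNL is truly involved, given one observation."""
    if observed_positive:
        p_obs_given_involved = sensitivity
        p_obs_given_healthy = 1 - specificity
    else:
        p_obs_given_involved = 1 - sensitivity
        p_obs_given_healthy = specificity
    # Total probability of the observation under both hidden states.
    evidence = p_obs_given_involved * prior + p_obs_given_healthy * (1 - prior)
    return p_obs_given_involved * prior / evidence

# Made-up numbers: 20 % prior involvement, sensitivity 0.8, specificity 0.9.
risk_if_positive = posterior_risk(0.2, 0.8, 0.9, observed_positive=True)
risk_if_negative = posterior_risk(0.2, 0.8, 0.9, observed_positive=False)
print(risk_if_positive, risk_if_negative)
```

A negative finding lowers the risk below the prior but does not remove it, which is why incomplete or imperfect diagnoses still leave a residual risk of hidden involvement.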
- property spread_probs: numpy.ndarray
Return the spread probabilities of the Edge instances in the network in the order they appear in the graph.
- Return type
ndarray
- property state
Return the currently set state of the system.
- time_log_likelihood(theta, t_stages, max_t=10)[source]
Compute the log-likelihood given the spread parameters and the time of diagnosis for each T-stage.
- Parameters
theta (ndarray) – Set of parameters, consisting of the spread probabilities (as many as the system has Edge instances) and the time of diagnosis for all T-stages.
t_stages (List[Any]) – Keywords of T-stages that are present in the dictionary of C matrices and the previously loaded dataset.
max_t (int) – Latest accepted time-point.
- Return type
float
- Returns
The log-likelihood of the data, given the spread parameters as well as the diagnosis time for each T-stage.
- trans_prob(newstate, log=False, acquire=False)[source]
Computes the probability of transitioning to newstate, given the system’s current state.
- Parameters
newstate (List[int]) – List of new states for each LNL in the lymphatic system. The transition probability \(t\) will be computed from the current states to these states.
log (bool) – If True, the log-probability is computed. (default: False)
acquire (bool) – If True, after computing and returning the probability, the system updates its own state to be newstate. (default: False)
- Return type
float
- Returns
Transition probability \(t\).
Bilateral lymph system
- class lymph.BilateralSystem(graph={}, base_symmetric=False, trans_symmetric=True)[source]
Class that models metastatic progression in a lymphatic system bilaterally by creating two System instances that are symmetric in their connections. The parameters describing the spread probabilities, however, need not be symmetric.
- Parameters
graph (dict) – Dictionary of the same kind as for the initialization of System. This graph will be passed to the constructors of the two System attributes of this class.
base_symmetric (bool) – If True, the spread probabilities of the two sides from the tumor(s) to the LNLs will be set symmetrically.
trans_symmetric (bool) – If True, the spread probabilities among the LNLs will be set symmetrically.
See also
System: Two instances of this class are created as attributes.
- binom_marg_log_likelihood(theta, t_stages, max_t=10)[source]
Compute the marginal log-likelihood using binomial distributions to sum over the diagnosis times.
- Parameters
theta (ndarray) – Set of parameters, consisting of the spread probabilities (as many as the system has Edge instances) and the binomial distribution’s \(p\) parameters.
t_stages (List[Any]) – Keywords of T-stages that are present in the dictionary of C matrices and the previously loaded dataset.
max_t (int) – Latest accepted time-point.
- Return type
float
- Returns
The log-likelihood of the (already stored) data, given the spread probabilities as well as the parameters of the binomial distributions used to marginalize over diagnosis times.
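A binomial distribution over diagnosis times can be sketched as follows. That the support runs over the time-steps 0 to max_t is an assumption made for this example. The helper `binomial_time_dist` is hypothetical, and the two values of \(p\) are arbitrary.

```python
import math

def binomial_time_dist(p, max_t=10):
    """Binomial probability mass over the time-steps 0, 1, ..., max_t."""
    return [
        math.comb(max_t, t) * p**t * (1 - p) ** (max_t - t)
        for t in range(max_t + 1)
    ]

def mean_time(dist):
    """Expected diagnosis time under the given distribution."""
    return sum(t * w for t, w in enumerate(dist))

# One p parameter per T-stage: a smaller p shifts the mass to earlier times.
early = binomial_time_dist(0.3)
late = binomial_time_dist(0.7)
print(mean_time(early), mean_time(late))
```

A single parameter \(p\) per T-stage thus fixes an entire distribution over diagnosis times, which is what allows these parameters to be appended to the spread probabilities in theta.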
- load_data(data, t_stages=None, modality_spsn=None, mode='HMM')[source]
- Parameters
data (DataFrame) – Table with rows of patients. Columns must have three levels. The first column is (‘Info’, ‘tumor’, ‘T-stage’). The remaining columns are separated by modality name on the top level, then subdivided into ‘ipsi’ and ‘contra’ on the second level and finally, on the third level, the names of the lymph node levels are given.
See also
System.load_data(): Data loading method of the unilateral system.
System._gen_C(): Generate marginalization matrix.
- log_likelihood(spread_probs, t_stages=None, diag_times=None, max_t=10, time_dists=None, model='HMM')[source]
Compute the log-likelihood of (already stored) data, given the spread probabilities and either a discrete diagnosis time or a distribution to use for marginalization over diagnosis times.
- Parameters
spread_probs (ndarray) – Spread probabilities from the tumor to the LNLs, as well as from (already involved) LNLs to downstream LNLs.
t_stages (Optional[List[Any]]) – List of T-stages that are also used in the data to denote how advanced the primary tumor of the patient is. These do not need to correspond to the clinical T-stages ‘T1’, ‘T2’ and so on, but can also be more abstract, like ‘early’, ‘late’ etc.
diag_times (Optional[Dict[Any, int]]) – For each T-stage, one can specify at which time step the likelihood should be computed. If this is set to None and a distribution over diagnosis times time_dists is provided, the function marginalizes over diagnosis times.
max_t (Optional[int]) – Latest possible diagnosis time. This is only used to return -np.inf in case one of the diag_times exceeds this value.
time_dists (Optional[Dict[Any, ndarray]]) – Distribution over diagnosis times that can be used to compute the likelihood of the data, given the spread probabilities, but marginalized over the time of diagnosis. If set to None, a diagnosis time must be explicitly set for each T-stage.
mode – Compute the likelihood using the Bayesian network (“BN”) or the hidden Markov model (“HMM”). When using the Bayesian net, the inputs t_stages, diag_times, max_t and time_dists are ignored.
- Returns
The log-likelihood \(\log{p(D \mid \theta)}\) where \(D\) is the data and \(\theta\) is the tuple of spread probabilities and diagnosis times or distributions over diagnosis times.
See also
System.log_likelihood(): The log-likelihood function of the unilateral system.
- marginal_log_likelihood(theta, t_stages=None, time_dists={})[source]
Compute the log-likelihood of the (already stored) data, given the spread parameters, marginalized over time of diagnosis via time distributions. Wraps the log_likelihood() method.
- Parameters
theta (ndarray) – Set of parameters, consisting of the base probabilities \(b\) (as many as the system has nodes) and the transition probabilities \(t\) (as many as the system has edges).
t_stages (Optional[List[Any]]) – List of T-stages that should be included in the learning process.
time_dists (dict) – Distribution over the probability of diagnosis at different times \(t\) given T-stage.
- Return type
float
- Returns
The log-likelihood of a parameter sample.
See also
log_likelihood(): Simply calls the actual likelihood function, where it sets the diag_times to None.
- property modalities
Compute the two systems’ observation matrices \(\mathbf{B}^{\text{i}}\) and \(\mathbf{B}^{\text{c}}\).
See also
System.set_modalities(): Setting modalities in the unilateral System.
- risk(spread_probs=None, inv={'contra': None, 'ipsi': None}, diagnoses={'contra': {}, 'ipsi': {}}, diag_time=None, time_dist=None, mode='HMM')[source]
Compute risk of ipsi- & contralateral involvement given specific (but potentially incomplete) diagnoses for each side of the neck.
- Parameters
spread_probs (Optional[ndarray]) – Set of new spread parameters. If not given (None), the currently set parameters will be used.
inv (Dict[str, Optional[ndarray]]) – Dictionary that can have the keys "ipsi" and "contra", with the respective values being the involvements of interest. If (for one side or both) no involvement of interest is given, it will be marginalized. The arrays themselves may contain True, False or None for each LNL, corresponding to the risk for involvement, no involvement and “not interested”.
diagnoses (Dict[str, Dict]) – Dictionary that itself may contain two dictionaries, one with key “ipsi” and one with key “contra”. The respective value is then a dictionary that can hold a potentially incomplete (mask with None) diagnosis for every available modality. Leaving out an available modality is treated as a completely missing diagnosis for that modality.
diag_time (Optional[int]) – Time of diagnosis. Either this or the time_dist to marginalize over diagnosis times must be given.
time_dist (Optional[ndarray]) – Distribution to marginalize over diagnosis times. Either this or the diag_time must be given.
mode (str) – Set to "HMM" for the hidden Markov model risk (requires the time_dist) or to "BN" for the Bayesian network version.
- Return type
float
- property spread_probs: numpy.ndarray
Return the spread probabilities of the Edge instances in the network. Length and structure of theta depend on the set symmetries of the network.
See also
theta(): Setting the spread probabilities and symmetries.
- Return type
ndarray
- property state: numpy.ndarray
Return the current state (healthy or involved) of all LNLs in the system.
- Return type
ndarray
- time_log_likelihood(theta, t_stages, max_t=10)[source]
Compute the log-likelihood given the spread parameters and the time of diagnosis for each T-stage. Wraps the log_likelihood() method.
- Parameters
theta (ndarray) – Set of parameters, consisting of the spread probabilities (as many as the system has Edge instances) and the time of diagnosis for all T-stages.
t_stages (List[Any]) – Keywords of T-stages that are present in the dictionary of C matrices and the previously loaded dataset.
max_t (int) – Latest accepted time-point.
- Return type
float
- Returns
The log-likelihood of the data, given the spread parameters as well as the diagnosis time for each T-stage.
See also
log_likelihood(): The theta argument of this function is split into spread_probs and diag_times, which are then passed to the actual likelihood function.
Edge
Represents a lymphatic drainage pathway and therefore a spread probability.
Node
Represents a lymph node level (LNL) or rather a random variable associated with it. It encodes the microscopic involvement of the LNL and - if involved - might spread along outgoing edges.
- class lymph.Node(name, state=0, typ=None)[source]
Class for lymph node levels (LNLs) in a lymphatic system.
- Parameters
name (str) – Name of the node.
state (int) – Current state this LNL is in. Can be in {0, 1}.
typ (Optional[str]) – Can be either “lnl”, “tumor” or None. If it is the latter, the type will be inferred from the name of the node: if the name starts with a t (case-insensitive), it will be a tumor node, and a lymph node level (lnl) otherwise. (default: None)
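The naming rule for typ can be written down in a few lines. The helper `infer_type` is a hypothetical stand-in that mirrors the description above. It is not taken from the library's source.

```python
def infer_type(name, typ=None):
    """Return the node type, inferring it from the name if typ is None."""
    if typ is not None:
        return typ
    # Names starting with 't' or 'T' are treated as tumors, all others as LNLs.
    return "tumor" if name.lower().startswith("t") else "lnl"

for name in ["T", "tumor", "II", "IV"]:
    print(name, "->", infer_type(name))
```

One consequence of this rule is that an LNL whose name happens to start with a t needs an explicit typ="lnl" to avoid being treated as a tumor.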
- bn_prob(log=False)[source]
Computes the conditional probability of a node being in the state it is in, given its parents are in the states they are in.
- Parameters
log (bool) – If True, returns the log-probability. (default: False)
- Return type
float
- Returns
The conditional (log-)probability.
- obs_prob(obs, obstable=array([[1.0, 0.0], [0.0, 1.0]]), log=False)[source]
Compute the probability of observing a certain diagnosis, given the node’s current state.
- Parameters
obs (int) – Diagnosis/observation for the node.
obstable (ndarray) – 2x2 matrix containing info about sensitivity and specificity of the observational/diagnostic modality from which obs was obtained.
log (bool) – If True, the method returns the log-probability.
- Return type
float
- Returns
The probability of observing the given diagnosis.
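How a 2x2 observation table could be built from specificity and sensitivity is sketched below. The layout (rows index the observation, columns index the true state) is an assumption for this example. It is consistent with the identity matrix used as the default, but the default alone does not confirm it. Both helper functions are hypothetical.

```python
import math

def observation_table(specificity, sensitivity):
    """Assumed layout: table[obs][state] = p(obs | state)."""
    return [
        [specificity, 1.0 - sensitivity],  # negative observation
        [1.0 - specificity, sensitivity],  # positive observation
    ]

def obs_prob(obs, state, obstable, log=False):
    """Probability of observation `obs` for a node in the given state."""
    p = obstable[obs][state]
    if log:
        return math.log(p) if p > 0 else -math.inf
    return p

table = observation_table(specificity=0.9, sensitivity=0.8)
print(obs_prob(1, 1, table))  # true positive
print(obs_prob(1, 0, table))  # false positive
```

A perfect modality (specificity and sensitivity both 1) reproduces the identity matrix, in which the observation always equals the true state.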
- trans_prob(log=False)[source]
Compute the transition probabilities from the current state to both possible states.
- Parameters
log (bool) – If True, the method returns the log-probability. (default: False)
- Return type
float
- Returns
The transition probabilities from the current state to both possible states.
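One plausible way to obtain such transition probabilities is sketched here. It assumes that every involved parent spreads independently with the probability of its edge and that an involved node never recovers. Both assumptions, as well as the helper `node_trans_probs`, are illustrative and not confirmed by this page.

```python
def node_trans_probs(state, parents):
    """Return [p(healthy next), p(involved next)] for one node.

    `parents` is a list of (parent_state, spread_prob) pairs, one per
    incoming edge."""
    if state == 1:
        # Assumption: an involved node stays involved.
        return [0.0, 1.0]
    stay_healthy = 1.0
    for parent_state, spread_prob in parents:
        if parent_state == 1:
            # Each involved parent independently fails to spread.
            stay_healthy *= 1.0 - spread_prob
    return [stay_healthy, 1.0 - stay_healthy]

# A healthy node with two involved parents and one healthy parent.
probs = node_trans_probs(0, [(1, 0.3), (0, 0.5), (1, 0.2)])
print(probs)
```

Healthy parents do not contribute, so a node without any involved parent remains healthy with probability one in this sketch.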