Getting started#

A lot of people get diagnosed with squamous cell carcinoma in the head & neck region (HNSCC), which frequently metastasizes via the lymphatic system. We set out to develop a methodology to predict the risk of a new patient having metastases in so-called lymph node levels (LNLs), based on their personal diagnose (e.g. findings from a CT scan) and information of previously diagnosed and treated patients. And that’s exactly what this code enables you to do as well.

As mentioned, this package is meant to be a relatively simple-to-use frontend. The math is done under the hood and one does not need to worry about it a lot. But let’s have a quick look at what we’re doing here.

We will assume that you have already read the section on how to install the module and followed its instructions.

Importing#

First, let’s import the package.

import lymph

Graph#

The model is based on the assumption that one can represent the lymphatic system as a directed graph. The arcs in that graph represent the direction of the lymphatic flow and therefore also the direction of metastatic spread. Hence, the first thing to do is to define a graph that represents the drainage pathways of the lymphatic system aptly.

Here, this is done via a dictionary:

graph_dict = {
    ('tumor', 'T')  : ['I', 'II', 'III', 'IV'], 
    ('lnl'  , 'I')  : ['II'], 
    ('lnl'  , 'II') : ['III'], 
    ('lnl'  , 'III'): ['IV'], 
    ('lnl'  , 'IV') : []
}

Every key in this dictionary is a tuple of the form ({type}, {name}) and represents either a tumor - in which case the {type} must be 'tumor' - or a lymph node level ({type} must be 'lnl'). The value of each of those nodes is then a list of names for nodes it connects to. So, for example the primary tumor ('tumor', 'primary') in the graph above has directed arcs to a and b, while the LNL c does not have any outgoing connections.

We can simply create an instance of System using only this graph and let it report itself:

model = lymph.models.Unilateral.binary(graph_dict=graph_dict)
model
<lymph.models.unilateral.Unilateral at 0x7fb4c8ef8460>

Parameterization#

The parameters are probabilities of spread between the tumor and LNL(s). They represent the probability rate that metastatic spread occurs these edges. In the case of the tumor spreading to LNL II we call this probability base probability rate \(\tilde{b}_2\). For the spread between lymph node levels, we call it transition probability rate, e.g. \(\tilde{t}_{rv}\) from LNL \(r\) to LNL \(v\). The difference to the base probability rate is that it only plays a role if the parent LNL is already ivolved with metastases, while the tumor always spreads, of course.

We can change these spread probability rates either via the method set_parms(), which can also handle parameters not related to the spread probability rates of the arcs between LNLs. The spread parameters can be set either positionally (in which the order of the parameters can be learned from the returned dictionary of the get_params() method) or via keyword arguments.

For example, in our graph above, there is an edge from the tumor T to the LNL II. The corresponding edge is named TtoII and its parameter is called spread. For example:

model.graph.edges["TtoII"].set_params(spread=0.5)
model.graph.edges["TtoII"].get_params()
{'spread': 0.5}

To set this parameter directly from the model, you can do this:

model.set_params(TtoII_spread=0.25)
model.get_params()
{'TtoI_spread': 0.0,
 'TtoII_spread': 0.25,
 'TtoIII_spread': 0.0,
 'TtoIV_spread': 0.0,
 'ItoII_spread': 0.0,
 'IItoIII_spread': 0.0,
 'IIItoIV_spread': 0.0}

You may also want to change all spread parameters at once. To do so, simply call model.set_params(spread=0.42). This will cause the method to distribute the spread parameter to all edges:

model.set_params(spread=0.42)
model.get_params()
{'TtoI_spread': 0.42,
 'TtoII_spread': 0.42,
 'TtoIII_spread': 0.42,
 'TtoIV_spread': 0.42,
 'ItoII_spread': 0.42,
 'IItoIII_spread': 0.42,
 'IIItoIV_spread': 0.42}

Now, you might want to modify only the parameters relating to the tumor or lnl spread. There are dedicated methods for that too:

model.set_tumor_spread_params(spread=0.12)
model.set_lnl_spread_params(spread=0.89)
model.get_params()
{'TtoI_spread': 0.12,
 'TtoII_spread': 0.12,
 'TtoIII_spread': 0.12,
 'TtoIV_spread': 0.12,
 'ItoII_spread': 0.89,
 'IItoIII_spread': 0.89,
 'IIItoIV_spread': 0.89}

Diagnostic Modalities#

To ultimately compute the likelihoods of observations, we need to fix the sensitivities and specificities of the obtained diagnoses. And since we might have multiple diagnostic modalities available, we need to tell the system which of them comes with which specificity and sensitivity. We do this by adding specificity/sensitivity pairs to our model:

model.set_modality("MRI", spec=0.63, sens=0.81)
model.set_modality("PET", spec=0.86, sens=0.79, kind="clinical")
#                                               ^^^^^^^^^^^^^^^
#                                               No effect in binary model,
#                                               but important for trinary.
model.get_all_modalities()
{'MRI': Clinical(spec=0.63, sens=0.81, is_trinary=False),
 'PET': Clinical(spec=0.86, sens=0.79, is_trinary=False)}

One can also specify if the modality is a Clincal or Pathological one. In the binary model case that has no advantage and makes no difference, aside from being maybe a bit more explicit. But when we get to trinary models, it becomes very important.

Now it is also possible to access the confusion matrix of the specified diagnostic modalities:

print(model.get_modality("PET").confusion_matrix)
[[0.86 0.14]
 [0.21 0.79]]

Data / Observations#

To compute the likelihood of a set of probability (rates) given a patient cohort we need such a patient cohort, of course. We can provide it to the system in the form of a pandas DataFrame. Here is an example:

import pandas as pd

dataset_url = "https://raw.githubusercontent.com/rmnldwg/lydata/main/2021-usz-oropharynx/data.csv"
example_cols = [
    ("patient", "#", "age"),
    ("patient", "#", "hpv_status"),
    ("tumor", "1", "t_stage"),
    ("PET", "ipsi", "I"),
    ("PET", "ipsi", "II"),
    ("PET", "ipsi", "III"),
    ("PET", "ipsi", "IV"),
    ("MRI", "ipsi", "I"),
    ("MRI", "ipsi", "II"),
    ("MRI", "ipsi", "III"),
    ("MRI", "ipsi", "IV"),
]

usz_oropharynx = pd.read_csv(dataset_url, header=[0,1,2])
usz_oropharynx[example_cols]
patient tumor PET MRI
# 1 ipsi ipsi
age hpv_status t_stage I II III IV I II III IV
0 59 True 3 False True False False False True False False
1 75 True 2 NaN NaN NaN NaN False False True False
2 87 True 4 True True True True True True True True
3 87 False 4 False True True True False True True True
4 70 True 2 False True False False NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ...
282 49 False 4 True True True False NaN NaN NaN NaN
283 67 True 4 False True True False False True True False
284 44 False 4 False True True False NaN NaN NaN NaN
285 67 False 4 False True True False NaN NaN NaN NaN
286 76 True 2 False True True False NaN NaN NaN NaN

287 rows × 11 columns

Note

This data has three header-rows. It follows the standard of the data presented in the interactive dashboard LyProX. It can be obtained from the lyDATA repository on GitHub and is described in more detail here.

Let’s keep only the "CT" modality for now.

from lymph.modalities import Clinical


model.replace_all_modalities({"PET": Clinical(spec=0.86, sens=0.79),})
model.get_all_modalities()
{'PET': Clinical(spec=0.86, sens=0.79, is_trinary=False)}

To feed the dataset into the system, we assign the dataset to the attribute patient_data. What the system then does here is creating a diagnose matrix for every T-stage in the data.

model.load_patient_data(usz_oropharynx, side="ipsi")
model.patient_data["_model"]
PET MRI CT ... pathology FNA pCT #
I II III IV I II III IV I II ... IV I II III IV I II III IV t_stage
0 False True False False False True False False NaN NaN ... NaN NaN True True NaN False True False False late
1 NaN NaN NaN NaN False False True False NaN NaN ... NaN NaN True NaN NaN False True False False early
2 True True True True True True True True NaN NaN ... NaN NaN NaN NaN NaN True True True False late
3 False True True True False True True True NaN NaN ... NaN NaN True True True False True True True late
4 False True False False NaN NaN NaN NaN NaN NaN ... NaN NaN True NaN NaN False True False False early
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
282 True True True False NaN NaN NaN NaN True True ... NaN NaN True NaN NaN True True True False late
283 False True True False False True True False NaN NaN ... NaN NaN NaN NaN NaN False True True False late
284 False True True False NaN NaN NaN NaN NaN NaN ... NaN NaN True True NaN False True True False late
285 False True True False NaN NaN NaN NaN NaN NaN ... NaN NaN NaN False NaN False True True False late
286 False True True False NaN NaN NaN NaN NaN NaN ... NaN NaN True NaN NaN False True True False early

287 rows × 25 columns

Note

The data now has an additional top-level header "_model" which stores only the information the model actually needs. In this case, it only stores the ipsilateral CT diagnoses of the LNLs I, II, III, and IV, as well as the mapped T-stage of the patients. Note that from the original T-stages 1, 2, 3, and 4, only “early” and “late” are left. This is the default transformation, but it can be changed by providing a function to the mapping keyword argument in the load_patient_data() method.

Distribution over Diagnose Times#

The last ingredient to set up (at least when using the hidden Markov model) would now be the distribution over diagnose times. Our dataset contains two different T-stages “early” and “late”. One of the underlying assumptions with our model is that earlier T-stage patients have been - on average - diagnosed at an earlier time-point, compared to late T-stage patients. We can reflect that using distributions over the diagnosis time:

import numpy as np
import scipy as sp
import matplotlib.pyplot as plt

max_time = model.max_time
time_steps = np.arange(max_time+1)
p = 0.4

early_prior = sp.stats.binom.pmf(time_steps, max_time, p)

plt.plot(time_steps, early_prior, "o-");
plt.xlabel("time step $t$");
plt.ylabel("probability $p$");
_images/641a9d91bdc8c8935b7058981570218f0b72dba0b0150bdb8600b02f4ccbb1e1.png

We can now set a fixed prior for the distribution over diagnose times of early T-stage patients (i.e., patients with T1 and T2 tumors).

model.set_distribution("early", early_prior)
model.get_all_distributions()
{'early': Distribution([0.006046617599999923, 0.040310784, 0.12093235199999999, 0.2149908479999999, 0.25082265600000003, 0.20065812480000006, 0.11147673600000013, 0.042467327999999985, 0.010616832000000008, 0.001572864, 0.00010485760000000013])}

Let’s define a parametrized PMF over diagnose times for patients with late T-stage tumors (T3 and T4) to show this functionality. For that, we first define a parametrized function with the signature

def distribution(support: list[float] | np.ndarray, a=1, b=2, c=3, ...) -> np.ndarray:
    """PMF over diagnose times (``support``) with parameters ``a``, ``b``, and ``c``."""
    ...
    return result

Here, it’s important that the first argument is the support of the probability mass function, i.e., the discrete time-steps from 0 to max_time. Also, all parameters must have default values. Otherwise, there would be cases when such a stored distribution cannot be accessed.

Lastly, if some parameters have bounds, like e.g. the binomial distribution, they should raise a ValueError. This exception is propagated upwards but causes the likelihood method to simply return -np.inf. That way it will be seamlessly rejected during an MCMC sampling round.

Let’s look at a concrete, binomial example:

from scipy.special import factorial

def binom_pmf(k: np.ndarray, n: int, p: float):
    """Binomial PMF"""
    if p > 1. or p < 0.:
        # This value error is important to enable seamless sampling!
        raise ValueError("Binomial prob must be btw. 0 and 1")
    q = (1. - p)
    binom_coeff = factorial(n) / (factorial(k) * factorial(n - k))
    return binom_coeff * p**k * q**(n - k)

def late_binomial(support: np.ndarray, p: float = 0.5) -> np.ndarray:
    """Parametrized binomial distribution."""
    return binom_pmf(k=support, n=support[-1], p=p)

Note

Surprisingly, this is much faster than just using the implementation from scipy using numpy array functions. So, if you want to do sampling with a model, maybe don’t use scipy.stats.

And now we assign this parametric distribution to the model:

model.set_distribution("late", late_binomial)
model.get_all_distributions()
{'early': Distribution([0.006046617599999923, 0.040310784, 0.12093235199999999, 0.2149908479999999, 0.25082265600000003, 0.20065812480000006, 0.11147673600000013, 0.042467327999999985, 0.010616832000000008, 0.001572864, 0.00010485760000000013]),
 'late': Distribution([0.0009765625, 0.009765625, 0.0439453125, 0.1171875, 0.205078125, 0.24609375, 0.205078125, 0.1171875, 0.0439453125, 0.009765625, 0.0009765625])}
model.get_distribution_params(as_flat=False)
{'late': {'p': 0.5}}

Note how the set of adjustable parameters now also contains the p parameter for the late T-stage’s distribution over diagnose times. For the early T-stage, it is not present, because that one was provided as a fixed array.

Likelihood#

With everything set up like this, we can compute the likelihood of seeing the above dataset given a set of base and transition probability (rates).

test_probabilities = [0.02, 0.24, 0.03, 0.2, 0.23, 0.18, 0.18, 0.5]

llh = model.likelihood(test_probabilities, log=True, mode="HMM")

print(f"log-likelihood is {llh}")
log-likelihood is -586.8723971388224

Sampling using emcee#

Now we’ll show how one could do inference using MCMC sampling. Note that this is by far not the only way to learn the model parameters from the data. But it is a quick and useful one.

First we define a couple of parameters for the sampling. Have a look at the documentation of emcee to understand them in more detail.

import emcee

nwalkers, ndim = 100, len(model.get_params())
nsteps = 200
initial = np.random.uniform(size=(nwalkers, ndim))
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[19], line 1
----> 1 import emcee
      3 nwalkers, ndim = 100, len(model.get_params())
      4 nsteps = 200

ModuleNotFoundError: No module named 'emcee'

The we create a sampler with these parameters and finally start sampling.

sampler = emcee.EnsembleSampler(
    nwalkers=nwalkers,
    ndim=ndim,
    log_prob_fn=model.likelihood,
    moves=[(emcee.moves.DEMove(), 0.8), (emcee.moves.DESnookerMove(), 0.2)],
    parameter_names=list(model.get_params().keys()),
)
sampler.run_mcmc(initial, nsteps, progress=False);
sampler.get_chain(discard=int(0.9*nsteps), flat=True).mean(axis=0)