Getting started

This package is meant to be a relatively simple-to-use frontend. The math is done under the hood and one does not need to worry about it a lot. Below are the things that are actually necessary.

Importing

This one is obvious.

[2]:

import lymph

Graph

The model is based on the assumption that one can represent the lymphatic system as a directed graph. Hence, the first thing to do is to define a graph that represents the drainage pathways of the lymphatic system aptly.

Here, this is done via a dictionary:

[3]:

graph = {'T': ['a', 'b'],
         'a': ['b', 'c'],
         'b': ['c'],
         'c': []}

Every key in this dictionary represents either a tumor - in which case its name must start with a t (upper- or lowercase) - or a lymph node level. The value of each of those nodes is a list of nodes it connects to. So, for example the primary tumor T in the graph above has directed arcs to a and b, while the LNL c does not have any outgoing connections.

We can simply create an instance of System using only this graph and let it report itself:

[4]:

simple_lymph_system = lymph.System(graph=graph)
simple_lymph_system.print_graph()

Tumor(s):
        T ---  0.0 % --> a
          +--  0.0 % --> b

LNL(s):
        a ---  0.0 % --> b
          +--  0.0 % --> c
        b ---  0.0 % --> c

The percentages between two nodes represents the probability (rate) that metastatic spread occurs along it. In the case of the tumor spreading to LNL a we call this probability base probability (rate) \(\tilde{b}_a\). For the spread between lymph node levels, we call it transition probability (rate), e.g. \(\tilde{t}_{ab}\). The difference to the base probability (rate) is that it only plays a role if the parent LNL is already ivolved with metastases, while the tumor always spread, of course.

We can change these probability (rates) with a function called set_theta(). The only argument to this function is a list or array of these values in the order in which they appear when we call print_graph():

[5]:

simple_lymph_system.set_theta([0.1, 0.2, 0.4, 0.25, 0.05])
simple_lymph_system.print_graph()

Tumor(s):
        T --- 10.0 % --> a
          +-- 20.0 % --> b

LNL(s):
        a --- 40.0 % --> b
          +-- 25.0 % --> c
        b ---  5.0 % --> c

Reversely, we can also read them out:

[6]:

probabilities = simple_lymph_system.get_theta()
print(probabilities)

[0.1  0.2  0.4  0.25 0.05]

Diagnostic Modalities

To ultimately compute the likelihoods of observations, we need to fix the sensitivities and specificities of the obtained diagnoses. And since we might have multiple diagnostic modalities available, we need to tell the system which of them comes with which specificity and sensitivity. We do this by creating a dictionary of specificity/sensitivity pairs:

[7]:

spsn_dict = {"MRI": [0.63, 0.81],
             "PET": [0.86, 0.79]}

Now we can pass this to the system using the set_modalities function.

[8]:

simple_lymph_system.set_modalities(spsn_dict=spsn_dict)

However, mostly this is going to be done automatically when loading the data and providing the dictionary along with the dataset.

Data / Observations

To compute the likelihood of a set of probability (rates) given a patient cohort we need such a patient cohort, of course. We can provide it to the system in the form of a pandas DataFrame. Here is an example:

[9]:

import pandas as pd

dataset = pd.read_csv("_data/example.csv", header=[0,1])
dataset

[9]:

	Info	pathology
	T-stage	I	II	III	IV
0	early	1	0	0	0
1	early	0	1	0	0
2	early	0	1	0	0
3	early	0	1	0	0
4	early	0	1	0	0
...	...	...	...	...	...
142	early	0	0	0	0
143	early	0	0	0	0
144	early	0	0	0	0
145	early	0	0	0	0
146	early	0	0	0	0

147 rows × 5 columns

Note that this data has two header-rows, defining not only the individual column’s content, but also to which over-arching category they belong. The “Info” category plays a special role here along with its sub-category “T-stage”. It will later tell the system which time prior to use according to a dictionary of these distributions.

The “pathology” section denotes that this dataset contains observations from a pathologic diagnostic modality (neck dissections in this case). How this is termed is irrelevant, as we will be telling the system what to look for. Import is, however, that - if we had multiple diagnostic modalities - they all contain a column for each lymph node level in the system we have set up. Obvioulsy, this dataset here does not match the system set up earlier, so let’s fix that.

[10]:

graph = {'T'  : ['I', 'II', 'III', 'IV'],
         'I'  : ['II'],
         'II' : ['III'],
         'III': ['IV'],
         'IV' : []}

example_system = lymph.System(graph=graph)
example_system.print_graph()

Tumor(s):
        T ---  0.0 % --> I
          +--  0.0 % --> II
          +--  0.0 % --> III
          +--  0.0 % --> IV

LNL(s):
        I ---  0.0 % --> II
        II ---  0.0 % --> III
        III ---  0.0 % --> IV

To feed the dataset into the system, we use the appropriate function. It also takes as an argument all the different T-stages that we want to consider. What the system does here is creating a \(\mathbf{C}\) matrix for every T-stage we provide here. Keep in mind that the T-stages in this list must actually occur in the database.

Similarly, we can provide our specificity/sensitivity pairs for each diagnsotic modality directly here. In this case too the keys of the dictionary must be columns in the dataset.

[12]:

only_pathology = {"pathology": [1., 1.]}

example_system.load_data(dataset,
                         t_stage=["early"],
                         spsn_dict=only_pathology,
                         mode="HMM")

Lastly, the mode parameter determines which model should be used. We have implemented both the Bayesian network (mode = "BN") from (Pouymayou et al., 2019) and the hidden Markov model from our work (mode = "HMM). In case the Bayesian network is chosen, the parameter t_stage has no effect.

Time prior

The last ingerdient to set up (at least when using the hidden Markov model) would now be the time prior. Since this dataset contains only early T-stage patients the exact shape does not matter too much, as long as it is “reasonable”. If we also had late T-stage patients in the cohort, we would need to think about how the two time priors relate to each other.

For now we are going to use binomial distributions for this. Their shape makes intuitive sense: Since the time prior \(p_T(t)\) is a distribution over the probability of diagnosing a patient after \(t\) time steps, given his T-stage \(T\) we would expect that a very early detection of the cancer is similarly unlikely as a very late one.

As mentioned, we need to put a distribution for each distinct T-stage in our cohort into a dictionary:

[14]:

import numpy as np
import scipy as sp
import scipy.stats
import matplotlib.pyplot as plt

t_max = 10
time_steps = np.arange(t_max+1)
p = 0.4

early_prior = sp.stats.binom.pmf(time_steps, t_max, p)

plt.plot(time_steps, early_prior, "o-");
plt.xlabel("time step $t$");
plt.ylabel("probability $p$");

[15]:

time_prior_dict = {}
time_prior_dict["early"] = early_prior

Likelihood

With everything set up like this, we can compute the likelihood of seeing the above dataset given a set of base and transition probability (rates).

[16]:

test_probabilities = [0.02, 0.24, 0.03, 0.2, 0.23, 0.18, 0.18]

llh = example_system.likelihood(test_probabilities,
                                t_stage=["early"],
                                time_prior_dict=time_prior_dict,
                                mode="HMM")

print(f"log-likelihood is {llh:.2f}")

log-likelihood is -331.09

[17]:

example_system.print_graph()

Tumor(s):
        T ---  2.0 % --> I
          +-- 24.0 % --> II
          +--  3.0 % --> III
          +-- 20.0 % --> IV

LNL(s):
        I --- 23.0 % --> II
        II --- 18.0 % --> III
        III --- 18.0 % --> IV

From here it is up to the user what to do with this quantity. Most likely though, one would want to perform MCMC sampling with this.