Code
import numpy as np
import pandas as pd
import plotly.express as px
from itertools import product
from scipy import integrate
import scipy
import matplotlib.pyplot as plt
In the third lecture block for “Finanz und Versicherungstechnik” (Finance & Insurance), we learnt about survival analysis. To solidify our understanding about it, the Professor provided us a lifetable consisting 2 columns of age and qx (probability of mortality for each specific age). The dataset can be found here on page 38.
The goal is to answer the following questions:
import numpy as np
import pandas as pd
import plotly.express as px
from itertools import product
from scipy import integrate
import scipy
import matplotlib.pyplot as plt
The dataset has the following shape:
= pd.read_csv("DAV2008T.csv", sep=";")
raw_data raw_data.shape
(122, 16)
Get all column names:
raw_data.columns
Index(['Alter', 'A', 'E', 'C', 'B', 'F', 'D', 'G', 'M', 'J', 'H', 'N', 'K',
'I', 'O', 'L'],
dtype='object')
Based on this document on page 38, we need only the following data from columns:
= raw_data.loc[:,['Alter', 'A']]
data data.head()
Alter | A | |
---|---|---|
0 | 0 | 0.006113 |
1 | 1 | 0.000423 |
2 | 2 | 0.000343 |
3 | 3 | 0.000275 |
4 | 4 | 0.000220 |
Let’s give the columns more appropriate names:
={'Alter': "age", 'A': "qx"}, inplace=True)
data.rename(columns data.head()
age | qx | |
---|---|---|
0 | 0 | 0.006113 |
1 | 1 | 0.000423 |
2 | 2 | 0.000343 |
3 | 3 | 0.000275 |
4 | 4 | 0.000220 |
Visualize f curve (density function for mortality):
= px.line(data, y="qx", x="age", hover_data=['age', 'qx'], labels=dict(age="Age", qx="f"))
fig fig.show()
Get survival rate (S):
= 1
S "S"] = 0
data.loc[:,
for i in range(len(data)):
= S*(1-data.loc[i, 'qx'])
S 'S'] = S data.loc[i,
Visualize survival rate:
= px.line(data, y="S", x="age", hover_data=['age'], labels=dict(age="Age", S="Survival Rate"))
fig =0.5, line_dash="solid", row=1, col=1, line_color="#000000", line_width=1)
fig.add_hline(y fig.show()
Based on the graph above, the life expectancy is:
'S']-0.5).abs().argsort()[:1]] data.iloc[(data[
age | qx | S | |
---|---|---|---|
76 | 76 | 0.067433 | 0.504283 |
We define life expectancy as the age after which the survival rate < 0.5. More explanation can be found here.
Get F or failure rate:
"F"] = 1 - data.loc[:, "S"] data.loc[:,
Visualize F:
= px.line(data, y="F", x="age", hover_data=['age'], labels=dict(age="Age", F="Failure Rate"))
fig =0.5, line_dash="solid", row=1, col=1, line_color="#000000", line_width=1)
fig.add_hline(y fig.show()
Get f or failure density:
"f"] = data.loc[:, "F"].diff() data.loc[:,
Visualize f:
= px.line(data, y="f", x="age", hover_data=['age'], labels=dict(age="Age", f="Failure Density"))
fig fig.show()
I am born in 1988 and by the time of writting, it is 2023. My age is 35 years old. For further calculation, we interested in the failure density starting from 35 years old.
Let’s see the failure density graph after 35 years old:
= px.line(data.loc[data['age'] >= 35], y="f", x="age", hover_data=['age'], labels=dict(age="Age", f="Failure Density"))
fig fig.show()
Find my life expectancy by integrating the area below the graph until it approaches 0.5:
= list(data.loc[data['age'] >= 35].age)
x = list(data.loc[data['age'] >= 35].f)
y
for i in range(len(x)):
= x[: i+1]
current_x = y[: i+1]
current_y = integrate.simpson(current_y, current_x)
area if area >= 0.5:
break
print("My life expectancy is {}".format(current_x[-1]))
My life expectancy is 78
Let’s visualize my survival rate:
= list(data.loc[data['age'] >= 35].age)
x = list(data.loc[data['age'] >= 35].f)
y = []
my_failure_rate = 0
my_life_expectancy
= False
executed
for i in range(len(x)):
= x[: i+1]
current_x = y[: i+1]
current_y = integrate.simpson(current_y, current_x)
area
my_failure_rate.append(area)if area >= 0.5 and not executed:
= current_x[-1]
my_life_expectancy = True
executed
= pd.DataFrame({"age": x, "failure_rate": my_failure_rate})
my_survival "survival_rate"] = 1 - my_survival.loc[:, "failure_rate"]
my_survival.loc[:,
= px.line(my_survival, y="survival_rate", x="age", hover_data=['age'], labels=dict(age="Age", survival_rate="My Survival Rate"))
fig =0.5, line_dash="solid", row=1, col=1, line_color="#000000", line_width=1)
fig.add_hline(y fig.show()
In the following section, we want to find the survival function for people born in year 2000 and still alive in 2023.
To do that we need to fit the relevant data for this group to the following distributions:
weibull_min
function is used instead of weibull_max
, because the x value (age
) are positive.View the data:
= list(data.loc[data['age'] >= 23].age)
x = list(data.loc[data['age'] >= 23].f)
y = []
failure_rate = 0
life_expectancy
= False
executed
for i in range(len(x)):
= x[: i+1]
current_x = y[: i+1]
current_y = integrate.simpson(current_y, current_x)
area
failure_rate.append(area)if area >= 0.5 and not executed:
= current_x[-1]
my_life_expectancy = True
executed
= pd.DataFrame({"age": x, "failure_rate": failure_rate})
survival_2023 "survival_rate"] = 1 - survival_2023.loc[:, "failure_rate"]
survival_2023.loc[:,
= px.line(survival_2023, y="survival_rate", x="age", hover_data=['age'], labels=dict(age="Age", survival_rate="Survival Rate"), title="Survival rate for 23 years old man in year 2023")
fig =0.5, line_dash="solid", row=1, col=1, line_color="#000000", line_width=1)
fig.add_hline(y fig.show()
Prepare survival function for exponential
, weibull
and lognorm
distributions:
# sf: survival function
def sf_exponential(x, loc=0, scale=1):
return scipy.stats.expon.sf(x, loc, scale)
def sf_weibull(x, c, loc=0, scale=1):
return scipy.stats.weibull_min.sf(x, c, loc, scale)
def sf_lognorm(x, s, loc=0, scale=1):
return scipy.stats.lognorm.sf(x, s, loc, scale)
exponential
distribution= scipy.optimize.curve_fit(sf_exponential, survival_2023.age, survival_2023.survival_rate)
popt, pcov
='Real Data', c='r')
plt.scatter(survival_2023.age, survival_2023.survival_rate, label
= sf_exponential(survival_2023.age, *popt)
pred
'b-', label="Prediction")
plt.plot(survival_2023.age, pred,
'Age')
plt.xlabel('Survival Rate')
plt.ylabel(
plt.legend() plt.show()
weibull
distribution= scipy.optimize.curve_fit(sf_weibull, survival_2023.age, survival_2023.survival_rate)
popt_weibull, pcov
='Real Data', c='r')
plt.scatter(survival_2023.age, survival_2023.survival_rate, label
= sf_weibull(survival_2023.age, *popt_weibull)
pred
'b-', label="Prediction")
plt.plot(survival_2023.age, pred,
'Age')
plt.xlabel('Survival Rate')
plt.ylabel(
plt.legend() plt.show()
lognorm
distribution= scipy.optimize.curve_fit(sf_lognorm, survival_2023.age, survival_2023.survival_rate)
popt, pcov
='Real Data', c='r')
plt.scatter(survival_2023.age, survival_2023.survival_rate, label
= sf_lognorm(survival_2023.age, *popt)
pred
'b-', label="Prediction")
plt.plot(survival_2023.age, pred,
'Age')
plt.xlabel('Survival Rate')
plt.ylabel(
plt.legend() plt.show()
weibbull
distribution is obviously the best distribution to fit our dataset.\[\Huge S(age) = e^{-(\frac{age-loc}{scale})^c} \]
where:
print("c, loc and scale = {:.2f}, {:.2f}, {:.2f}".format(popt_weibull[0], popt_weibull[1], popt_weibull[2]))
c, loc and scale = 37.84, -275.27, 355.85
Proof: The following graph shows the output of preceeding function, which is identical to the Section 7.3.
= [*popt_weibull]
c, loc, scale = survival_2023.age
x
= np.exp(-((x-loc)/scale)**c)
pred 'b-', label="Prediction")
plt.plot(x, pred,
plt.legend() plt.show()