This content, written by Scott Hoover, was initially posted in Looker Blog on Jan 4, 2016. The content is subject to __limited support.__

## Survival analysis - introduction

Many businesses view customer lifetime value (LTV) as the Holy Grail of metrics, and with good reason. As an absolute measure, it's an indication of how much money a business can reasonably expect to make from a typical customer. As a relative measure, it's a good gauge of business health—for example, if expected profit from the typical customer is decreasing over time, perhaps there are levers the business can identify and pull in order to adjust its course. There are, however, a multitude of approaches when calculating LTV, each ranging in complexity. Moreover, an LTV formula for one business may not hold for another. What *is* consistent are the fundamental inputs: revenue, costs, and estimates of customer lifetime. In this article, I'll focus on a popular method for estimating the typical customer's lifetime called Survival Analysis, which is borrowed from biostatistics. I will present this approach relying on a simulated subscription e-commerce data set (*see* attached script to reproduce).

## Objective

The objective in a Survival Analysis is to derive a lifetime estimate for the typical customer. In the case of our subscription business we might conclude that "...our typical customer remains a subscriber for 14 months." The result of this analysis will be a survival function or curve for our customers. Each point on the curve can be interpreted as the probability of surviving to a future point in time. Here's an example of a typical survival curve:

Survival Analysis can be done in discrete or continuous time. For simplicity—and to be consistent with our subscription business model—I will stick to the former. In this, and future articles, I will refer to the Survival function using the following notation:

## Setup

Let's suppose we have two tables of interest: `users`

and `payments`

. Our `users`

table contains basic information about users that sign up for our service—namely, a unique identifier, the date the user signed up, and the channel through which the user heard of our service. Our `payments`

table captures monthly subscription payments and includes a unique identifier, the foreign key that tells us which user the payment is associated with, the month the payment applies to, and, lastly, the amount of the payment. Let's take a look at these tables.

*users*

1 | 2014-06-01 | |

2 | 2014-06-01 | |

3 | 2014-06-01 | |

4 | 2014-06-02 | |

... | ... | ... |

*payments*

1 | 1 | 2014-06-01 | 25 |

2 | 1 | 2014-07-01 | 25 |

3 | 2 | 2014-06-01 | 25 |

4 | 2 | 2014-07-01 | 25 |

5 | 2 | 2014-08-01 | 25 |

... | ... | ... | ... |

## Exploratory analysis

To get a feel of our data, let's review some basic queries:

- The number of users signing up for our service is increasing, month-over-month.

```
select to_char(signup_date, 'YYYY-MM') as signup_month
, count(*) as number_of_users
from users
```

- Payments follow a similar trend:

```
select to_char(created_month, 'YYYY-MM') as created_month
, count(*) AS number_of_payments
from payments
group by 1
```

- To get a better feel of payments over time, we should marry these two queries and group our payments by user-signup month.

```
select to_char(users.signup_date, 'YYYY-MM') as "users.signup_month"
, to_char(payments.created_month, 'YYYY-MM') as "payments.created_month"
, count(distinct payments.user_id) as "users.number_of_users"
from payments
join users
on payments.user_id = users.user_id
group by 1,2
```

We can see from the above matrix that each row represents the cohort of users who signed up in a given month, and each column captures the number of users, from that cohort, with subsequent payments over time ª.

- It's important to compare apples to apples. Ideally, we should view the relative 1st, 2nd, ...
*N_th-month drop offs _across cohorts*. This requires a trivial addition to our SQL:

```
select to_char(users.signup_date, 'YYYY-MM') as "users.signup_month"
, datediff(month, users.signup_date, payments.created_month) as "payments.months_since_signup"
, count(distinct payments.user_id) as "users.number_of_users"
from payments
join users
on payments.user_id = users.user_id
group by 1,2
```

This effectively left justifies the previous query, allowing us to easly eyeball the relative retention rates across cohorts.

- The following two transformations will be useful moving forward:

```
select user_id
, max(created_month) as most_recent_payment
from payments as user_facts
group by 1
```

and

```
select
datediff(month, users.signup_date, user_facts.most_recent_payment) as "users.months_a_customer"
from users
left join [user_facts]
on users.id = user_facts.user_id
```

## The hazard

One interesting property of the discrete-time case is that our Survival function can be expressed in terms of the Hazard function, which itself is straightforward to compute. The Hazard is interpreted as the probability of failing to survive past time *t* given that the subject has survived to that point. Formally,

The Survival function, therefore, can be expressed as

These equations can be easily expressed within Looker making use of table calculations. I'll build a handful of easy-to-follow constituent table calculations; then I'll add them into a single expression to arrive at our final calculation.

First, include the above transformation for `users.months_a_customer`

in order to determine the number of people that canceled in any given period.

Next, let's calculate a running total of users (note that this is not meaningful in and of itself):

```
Cumulative Users = running_total(${users.number_of_users})
```

Next, let's calculate the total number of users:

```
Total Users = sum(${users.number_of_users})
```

Next, let's determine the number of active users after a given period:

```
Active Users After Period = if(row() = 1, ${total_users}, ${total_users} - ${cumulative_users})
```

Lastly, we can calculate our Hazard probability:

```
Hazard Probability = round(${users.number_of_users} / offset(${active_users_after_period}, -1), 2)
```

At this point, we may have a result set that looks something like this:

## Survival calculation

To calculate the Survival function, we'll rely on the cumulative-product formula above. However, neither SQL nor Looker have support for a cumulative product function. So, instead of writing some difficult-to-follow SQL, we'll make use of Looker's `running_total`

function and the following rule of logs:

Leading us to our final calculation:

```
Survival Probability = round(if(row() = 1, 1, exp(running_total(ln(round(1 - ${hazard_probability}, 2))))), 2)
```

Visualizing our survival curve, in aggregate, we might get something like this:

For all cohorts, the visualization is a bit busier, but can demonstrate the time-dependent fluctuations in cohorts' retention rates:

In the second article in this series, I'll demonstrate some useful calculations that piggyback off of the subject matter in this article. Stay tuned.

## Attachments

ª It's important to note that numbers in our diagonals of the matrix do not necessarily capture the full cohort size. Any users who sign up in a given month but who never have payment are implicitly omitted from this analysis. This can be handled, but is beyond the scope of this article.