 ## Introduction

Deciding which loans to invest in, or gauging the ongoing performance of a portfolio, requires investors to predict how much a given loan will return before it reaches maturity. This paper describes a model that predicts such returns.

The mantra ‘Past performance is no guarantee of future results’ should always be kept in mind, in finance as in other subjects. That being said, it’s better to infer from past data than to rely on out-of-touch assumptions. Hence our idea is to analyze the fate of previous loans and build a model that anticipates how future loans will behave.

Since the loan amount is known from inception, we ‘only’ need to predict the total amount paid back to investors to calculate the financial return. Furthermore, as the amount paid monthly (also called the installment) is constant over time, estimating the total amount paid back to investors only requires estimating the number of payments made over a loan’s life. A loan will generate payments as long as it ‘survives’, and a loan’s ‘death’ means it stops paying.

## Available Data

Lending Club provides historical data allowing us to analyze when loans stop paying. Unfortunately, most of the loans are still ongoing, since Lending Club has grown spectacularly in recent years.

Analyzing only mature loans is the simplest option. However, that causes two problems: first, the amount of data is significantly smaller. Out of 270,119 loans issued by Lending Club as of March 2014, only 20,233 are old enough to have reached maturity. Second, restricting a model to loans issued before 2011 can introduce a bias since borrowers’ characteristics may have changed in the past 3 years.

Therefore we need to choose a model that is able to factor in current loans as well.

## Normalization

Loans may have different terms, and therefore different time horizons. We ensure consistency by normalizing their age to the interval [0,1]. For instance, the age of a 60-month loan in its 39th month is 39/60 = 0.65. An age of 0 corresponds to issuance, an age of 1 to maturity. In some rare cases, a loan can go beyond its maturity, for instance when it was late in payment. For the sake of simplicity, we’ll set the upper bound at 1 in all cases.

Likewise, the number of payments made by a loan can be normalized as the ratio of the number of payments made to the term of the loan. A Payment Ratio of 0 means the loan hasn’t generated any payments yet. A Payment Ratio of 1 means all the installments were paid. For the sake of simplicity, we will also consider that loans that were paid back early have been fully paid at maturity only.
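The two normalizations can be sketched in a few lines of Python. The function names are illustrative rather than taken from the original model, and the clamping to [0, 1] reflects the simplification stated above.

```python
def normalized_age(months_elapsed: int, term_months: int) -> float:
    """Age of a loan on [0, 1]: 0 at issuance, 1 at maturity (capped)."""
    if term_months <= 0:
        raise ValueError("term_months must be positive")
    return min(max(months_elapsed / term_months, 0.0), 1.0)


def payment_ratio(payments_made: int, term_months: int) -> float:
    """Share of the scheduled installments actually paid, capped at 1."""
    if term_months <= 0:
        raise ValueError("term_months must be positive")
    return min(max(payments_made / term_months, 0.0), 1.0)


print(normalized_age(39, 60))  # 0.65, the example from the text
print(normalized_age(40, 36))  # 1.0, a late loan is capped at maturity
print(payment_ratio(18, 36))   # 0.5, half of the installments were paid
```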

## Survival Function

Let S(t) be the survival rate of a loan at age t. If a loan is fully paid, then we have S(1) = 1. If it stopped paying at mid-life then S(0.5) = S(1) = 0.5. The complement of the survival function, 1 - S(t), is the probability that a loan is dead by age t, i.e. the cumulative probability of having died by age t.

Estimating survival is quite a common task, and a whole branch of statistics called Survival Analysis is devoted to it. Two common uses are estimating how a treatment influences the life expectancy of patients in medicine, and predicting product reliability in engineering.

In both cases, statisticians also have to rely upon incomplete data, as we do with Peer Lending loans. For instance, medical researchers normally have to analyze the data while some subjects are still alive. Some subjects may also have moved away, and be lost for follow-ups (so whether they died or not is unknown). In both cases, one doesn’t know how much longer they might survive. Such survival times are termed ‘censored’ to indicate that the period of observation was cut off before the monitored event occurred.

Let us first compare the probability of death over time for defaulting loans with different terms. We graph the proportion of defaulting loans that stop paying over time, for 6,097 36-months loans and for 2,783 60-months loans:

In the graph above, the 36-month loans are crosses, the 60-month loans are dots. The horizontal axis corresponds to the maturity of the loans (0 = just issued, 1 = reached maturity), while the vertical axis corresponds to the proportion of loans that have stopped paying. In statistical analysis, the function giving the probability of failure over time is called the hazard function. We observe that the shapes are similar, and will therefore consider that this hazard function is the same whatever the term of a loan (once again, the age of a loan is expressed on the interval [0,1]).

Since the hazard function does not depend on the term of the loan, we can mix loans of different terms to plot the Lifetime Distribution Function. This curve shows the probability of default over time for loans that will default. The risk of default increases sharply over time, especially after a few months, then decreases once a loan has passed half-maturity.
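As a rough illustration, an empirical Lifetime Distribution Function can be built by pooling the normalized ages at which defaulted loans stopped paying. The sample data below is made up, and `lifetime_distribution` is a hypothetical helper, not the code behind the original curve.

```python
from bisect import bisect_right


def lifetime_distribution(default_ages):
    """Return F, where F(t) is the share of defaulted loans dead by age t."""
    ages = sorted(default_ages)
    if not ages:
        raise ValueError("need at least one defaulted loan")

    def F(t: float) -> float:
        # Count the defaults that happened at or before normalized age t.
        return bisect_right(ages, t) / len(ages)

    return F


# Hypothetical (month of last payment, term in months) pairs for defaulted loans.
defaults = [(6, 36), (12, 36), (18, 36), (15, 60), (30, 60), (48, 60)]

# Mixing both terms is only legitimate once ages are normalized to [0, 1].
F = lifetime_distribution(month / term for month, term in defaults)
print(F(0.25), F(0.5), F(1.0))
```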

## Default Rate

The previous Lifetime Distribution Function applies only to loans that are certain to default; it doesn’t tell us the impact of different default rates on the survival curve. In other words: is a loan with a 0.25 probability of default likely to default at a different time than a loan with a probability of 1.0?

To understand such an impact, we need to analyze the number of missed payments as a function of the default rate. To do so, we run a series of Monte Carlo simulations. A Monte Carlo simulation is a method to obtain numerical estimates through repeated random sampling. In the present case, we pick a pre-defined average default rate d at random in the interval [0,1]. Then we select 1,000 loans such that the proportion of defaulting loans matches our pre-defined average d. Then we measure how many of their payments they missed. We repeat the process thousands of times to obtain reliable results:

Note: there are still missing payments when default=0 because of loans that haven’t reached maturity yet.

This shows that the probability of default has a linear effect on missed payments. In other words, the default rate ‘flexes’ the survival curve down, but does not change its shape. Therefore we can apply a Cox Proportional Hazards Model to create a survival estimator.
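The sampling procedure can be sketched as follows. This is a toy version under stated assumptions: the pool of defaulted loans is invented, non-defaulting loans are treated as fully paid (so the intercept is close to zero, unlike in the real data), and fewer runs are used than in the original study.

```python
import random


def simulate_missed_payments(defaulted_payment_ratios, n_loans=1000,
                             n_runs=500, seed=0):
    """Return (default_rate, average_missed_ratio) pairs, one per run."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_runs):
        d = rng.random()                # target default rate in [0, 1)
        n_default = round(d * n_loans)  # number of defaulting loans drawn
        sample = rng.choices(defaulted_payment_ratios, k=n_default)
        # Assumption: non-defaulting loans are fully paid and miss nothing.
        missed = sum(1.0 - ratio for ratio in sample)
        points.append((n_default / n_loans, missed / n_loans))
    return points


def least_squares_line(points):
    """Slope and intercept of the ordinary least-squares line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical payment ratios of loans that ended up defaulting.
pool = [0.1, 0.3, 0.5, 0.7, 0.9]
slope, intercept = least_squares_line(simulate_missed_payments(pool))
print(f"slope={slope:.3f} intercept={intercept:.3f}")
```

In this toy setup the relationship is linear by construction, because the same pool of defaulted loans is used at every default rate. On real data, the point of the exercise is to check whether a straight line still describes the scatter well.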

## Cox Proportional Hazards

The Cox model is a well-recognized statistical technique for modeling survival data that simultaneously explores the effects of several variables. It is often used to analyze the survival of patients in a clinical trial, where it allows statisticians to isolate the effects of treatment from the effects of other variables. A strength of the Cox model is that it can include ‘censored’ data, i.e. loans that are not mature yet.

The method of proportional hazards splits the survival function into two components: an underlying baseline hazard function F’(t), and an effect parameter g(z) describing the effect of a vector z of explanatory variables of the loan.

The baseline hazard function is the cumulative probability of death of a hypothetical ‘completely average’ loan. Since we know the default rate does not modify the shape of the curve F(t), we can obtain it by multiplying the previous lifetime distribution function F(t) by a constant B. We end up with a parametric proportional hazards model where the probability of survival at time t for a loan with the vector of covariates z is:

$$S(t | z) = 1 - g(z) \cdot B \cdot F(t)$$

The effect parameter g(z) is a proportionally constant function of the vector of covariates z. These covariates z are characteristics of the loan that have an impact on the probability of default. Some hypothetical examples are the loan grade, purpose or how many times the borrowers defaulted before.

It is typically assumed that the logarithm of the hazard responds linearly to continuous covariates. Categorical covariates are split into multiple dummy variables. For instance, LoanPurpose = ‘car’ is changed into purpose_car = 1, purpose_other = 0. The hazard at time t for a loan with characteristics z can be written as:

$$\lambda (t |z) = \lambda_0(t) \cdot e^{\beta z}$$

Where $\lambda_0(t)$ is the baseline hazard, and $\beta$ is a vector of parameters corresponding to the effect of each covariate.
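To make the dummy-variable encoding and the multiplier $e^{\beta z}$ concrete, here is a small sketch. The two coefficients are borrowed from the Effect Parameters table in this article, but the helper names, the sample loan and the use of raw (uncentered) covariates are assumptions made for illustration only.

```python
import math


def encode(loan, categories):
    """One-hot encode categorical fields; pass numeric fields through."""
    z = {}
    for name, value in loan.items():
        if name in categories:
            for level in categories[name]:
                z[f"{name}_{level}"] = 1.0 if value == level else 0.0
        else:
            z[name] = float(value)
    return z


def relative_hazard(z, beta):
    """exp(beta . z), the factor that scales the baseline hazard."""
    return math.exp(sum(beta.get(key, 0.0) * value for key, value in z.items()))


# Hypothetical loan: a car loan whose borrower has 2 recent inquiries.
categories = {"purpose": ["car", "other"]}
loan = {"purpose": "car", "inquiries": 2}
beta = {"purpose_car": -0.3688268, "inquiries": 0.11607386}

z = encode(loan, categories)
print(z)
print(relative_hazard(z, beta))  # below 1: less risky than the reference loan
```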

The regression method introduced by Cox allows us to estimate $\beta$ by maximizing the partial likelihood, which does not depend on the baseline hazard; $\lambda_0(t)$ is then estimated from the data given $\beta$. In practice, a Newton-Raphson optimization algorithm is used with the Hessian of the partial log-likelihood to converge to the correct parameters.
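For readers who want to see the mechanics, the sketch below runs Newton-Raphson on the partial log-likelihood for a single covariate, using synthetic data. It is a bare-bones illustration under simplifying assumptions (one covariate, no special handling of tied event times, simulated loans), not the production fitting code; real analyses would normally rely on a statistical package.

```python
import math
import random


def score_and_hessian(beta, times, events, z):
    """First and second derivatives of the Cox partial log-likelihood."""
    grad, hess = 0.0, 0.0
    for i, (t_i, died) in enumerate(zip(times, events)):
        if not died:
            continue  # censored loans only contribute through risk sets
        s0 = s1 = s2 = 0.0
        for t_j, z_j in zip(times, z):
            if t_j >= t_i:  # loan j was still at risk when loan i died
                w = math.exp(beta * z_j)
                s0 += w
                s1 += w * z_j
                s2 += w * z_j * z_j
        mean = s1 / s0
        grad += z[i] - mean
        hess -= s2 / s0 - mean * mean
    return grad, hess


def fit_cox(times, events, z, tol=1e-9, max_iter=50):
    """Newton-Raphson iterations starting from beta = 0."""
    beta = 0.0
    for _ in range(max_iter):
        grad, hess = score_and_hessian(beta, times, events, z)
        step = grad / hess
        beta -= step
        if abs(step) < tol:
            break
    return beta


# Synthetic loans: z = 1 marks a riskier group with hazard multiplied by e^0.7.
rng = random.Random(42)
true_beta = 0.7
z = [float(rng.random() < 0.5) for _ in range(400)]
raw = [rng.expovariate(0.5 * math.exp(true_beta * z_i)) for z_i in z]
times = [min(t, 1.0) for t in raw]  # observation stops at maturity (age 1)
events = [t < 1.0 for t in raw]     # False means censored: no default seen
beta_hat = fit_cox(times, events, z)
print(f"estimated beta: {beta_hat:.3f} (true value {true_beta})")
```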

If you’re lost: this is just a fancy way of saying there exists a clever mathematical method to find both an estimate of the average survival curve and the weight of each of a loan’s characteristics, so we can predict when it will default with the best possible accuracy.

## Baseline Survival

When applying the Cox proportional hazards method to Lending Club’s historical data, we obtain the following baseline survival table:

| Time | Survival |
|------|----------|
| 0.00 | 0.999 |
| 0.03 | 0.998 |
| 0.06 | 0.996 |
| 0.08 | 0.994 |
| 0.94 | 0.898 |
| 0.97 | 0.897 |
| 1.00 | 0.895 |

This says, for instance, that a perfectly average 36-month loan has a cumulative probability of default of 0.4% (1 - 0.996) by the time it is 2 months old (0.06 x 36).

The baseline survival at maturity is 0.895. Therefore S(1) = 0.895. Since S(1) = 1 - F’(1) = 1 - B x F(1) and F(1) = 1, we have B = 0.105. In other words, a perfectly average loan has a 10.5% risk of defaulting before maturity.

This means:

$$S(1 | z) = 1 - 0.105 \cdot g(z)$$
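A minimal sketch of that last formula follows. The values of g(z) are hypothetical, and clamping the result to [0, 1] is an added safeguard (the linear form would otherwise go negative for very large g(z)); it is not something stated in the original model.

```python
B = 0.105  # baseline probability of default before maturity (1 - 0.895)


def survival_at_maturity(g_z: float, baseline_default: float = B) -> float:
    """S(1 | z) = 1 - B * g(z), clamped to the [0, 1] interval."""
    survival = 1.0 - baseline_default * g_z
    return min(max(survival, 0.0), 1.0)


# g(z) = 1 is the 'completely average' loan; other values are hypothetical.
for g_z in (0.5, 1.0, 2.0):
    print(g_z, round(survival_at_maturity(g_z), 4))
```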

## Effect Parameters

The Cox proportional hazards analysis gives the following parameter estimates:

| Parameter | Estimate | Standard Error | Lower 95% | Upper 95% |
|-----------|----------|----------------|-----------|-----------|
| FICO score (lower bound) | -0.0010293 | 0.0004616 | -0.001936 | -0.000127 |
| Sub Grade | 0.05280805 | 0.0019866 | 0.0489075 | 0.0566947 |
| Loan Amount | -5.0413e-6 | 3.3655e-6 | -1.167e-5 | 1.5238e-6 |
| Debt-to-Income ratio | 0.0027784 | 0.0015405 | -0.000243 | 0.0057959 |
| Open Credit Lines | -0.0177774 | 0.0031108 | -0.023885 | -0.011691 |
| Total Credit Lines | 0.00447601 | 0.0013291 | 0.0018616 | 0.0070715 |
| Number of Delinquencies | -0.0541081 | 0.0175649 | -0.089104 | -0.020258 |
| Number of Inquiries | 0.11607386 | 0.0044991 | 0.1070607 | 0.1247001 |
| Length of Employment | -0.0077227 | 0.0029357 | -0.013478 | -0.00197 |
| Home Ownership: ‘mortgage’ | -0.2285397 | 0.1079474 | -0.416494 | 0.0169571 |
| Home Ownership: ‘none’ | 0.46302684 | 0.4019667 | -0.475457 | 1.1425968 |
| Home Ownership: ‘other’ | 0.04318891 | 0.1797057 | -0.315094 | 0.3986464 |
| Home Ownership: ‘own’ | -0.1689286 | 0.1106693 | -0.363594 | 0.0805187 |
| Purpose: ‘Car’ | -0.3688268 | 0.0751338 | -0.519288 | -0.22457 |
| Purpose: ‘Credit card’ | -0.5831873 | 0.0376587 | -0.656992 | -0.509326 |
| Purpose: ‘Debt consolidation’ | -0.2201599 | 0.0282871 | -0.275089 | -0.164145 |
| Purpose: ‘Educational’ | 0.23164711 | 0.1040292 | 0.0209673 | 0.4292712 |
| Purpose: ‘Home improvement’ | -0.1633505 | 0.0469559 | -0.255983 | -0.07186 |
| Purpose: ‘Major purchase’ | -0.2128853 | 0.0578053 | -0.327724 | -0.101034 |
| Purpose: ‘Medical’ | 0.1980045 | 0.0730734 | 0.0517766 | 0.3384039 |
| Purpose: ‘Small business’ | 0.59293817 | 0.0431258 | 0.5080413 | 0.677145 |
| Purpose: ‘Vacation’ | 0.07459664 | 0.1035141 | -0.135136 | 0.2711493 |

A negative parameter estimate means the covariate reduces the risk of default by ‘flattening’ the hazard curve. Hence the negative values for the FICO score, length of employment or the annual income. A positive parameter ‘flexes’ the hazard function up, meaning higher risks, hence the positive values for sub-grade or number of inquiries. Surprisingly, the number of delinquencies and the number of open credit lines both have negative parameters, which means they’re correlated with lower default rates. The most likely explanation is that they’re already factored in, excessively, in the Sub Grade.

Several characteristics turned out to be non-significant and will be discarded. For instance, the debt-to-income ratio has an estimate close to zero. The estimate even changes sign within the 95% confidence interval, meaning we don’t even know for sure whether its influence is positive or negative.
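The sign-change criterion is easy to automate. The snippet below applies it to a handful of rows copied from the table above; the function name is illustrative, and this simple rule is only one of several possible significance checks.

```python
# (lower 95%, upper 95%) bounds copied from the parameter table.
confidence_intervals = {
    "FICO score (lower bound)": (-0.001936, -0.000127),
    "Loan Amount": (-1.167e-5, 1.5238e-6),
    "Debt-to-Income ratio": (-0.000243, 0.0057959),
    "Number of Inquiries": (0.1070607, 0.1247001),
    "Purpose: 'Vacation'": (-0.135136, 0.2711493),
}


def is_significant(lower: float, upper: float) -> bool:
    """True when the 95% interval lies entirely on one side of zero."""
    return lower > 0 or upper < 0


for name, (lower, upper) in confidence_intervals.items():
    verdict = "keep" if is_significant(lower, upper) else "sign unknown"
    print(f"{name}: {verdict}")
```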

## Covariates Selection

Loans have more than a hundred different properties. Building a model with so many covariates is both impractical and dangerous, since the risk of over-fitting grows with the number of parameters.

We run the statistical data of mature loans through a stepwise regression to obtain the best predictive variables. Stepwise regression is an iterative process that adds or removes one variable at a time according to how much it improves the fit. Although controversial, this method is suited to the present case due to the relatively low number of covariates in our model.

Amongst the covariates selected are FICO score, Sub Grade, Debt-to-Income, Earliest Credit Line, Length of Employment and Number of Public Records.
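To give a feel for the selection process, here is a simplified forward-selection sketch. It is not the procedure used on the Lending Club data: it only adds variables (a full stepwise procedure can also drop them), it scores candidates with an ordinary least-squares fit rather than a survival model, it does not guard against collinear columns, and the data and the `min_gain` threshold are invented.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def rss(columns, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    cols = [[1.0] * len(y)] + columns
    xtx = [[sum(u * v for u, v in zip(p, q)) for q in cols] for p in cols]
    xty = [sum(u * v for u, v in zip(p, y)) for p in cols]
    coef = solve(xtx, xty)
    return sum((y_i - sum(c * col[i] for c, col in zip(coef, cols))) ** 2
               for i, y_i in enumerate(y))


def forward_stepwise(features, y, min_gain=0.1):
    """Greedily add the variable that most reduces the residual error."""
    selected, remaining = [], dict(features)
    best = rss([], y)
    while remaining:
        scores = {name: rss([features[s] for s in selected] + [col], y)
                  for name, col in remaining.items()}
        name = min(scores, key=scores.get)
        if best - scores[name] < min_gain * best:
            break  # the best candidate no longer improves the fit enough
        selected.append(name)
        best = scores[name]
        del remaining[name]
    return selected


# Synthetic data: the target depends on 'a' and 'b' but not on 'noise'.
rng = random.Random(1)
n = 200
features = {name: [rng.gauss(0, 1) for _ in range(n)]
            for name in ("a", "b", "noise")}
y = [2 * a - b + rng.gauss(0, 0.1)
     for a, b in zip(features["a"], features["b"])]
print(forward_stepwise(features, y))
```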

## Estimating Returns

Being able to forecast the number of payments for a loan based on its characteristics allows us to calculate its expected return. S(1|z), the probability of survival at maturity, can also be viewed as the payment ratio. Multiplying it by the loan term N gives the expected total number of payments:

$$n = S(1|z) \cdot N$$

A series of identical payments over time is called an Annuity. With r being the monthly discount rate, the Net Present Value of a loan of amount A paying n annuities of amount p is:

$$NPV = \frac {-A} {(1+r)^0} + \frac {p} {(1+r)^1} + \frac {p} {(1+r)^2} + \dots + \frac {p} {(1+r)^{n}}$$

Which gives:

$$NPV = p \cdot \frac{1 - (1+r)^{-n}}{r} - A$$
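A quick numerical check of the closed form against the term-by-term sum may help. The loan figures below are arbitrary. Note that the closed form also accepts a fractional number of payments, which matters because n = S(1|z) x N is generally not an integer.

```python
def npv_sum(amount: float, payment: float, rate: float, n: int) -> float:
    """NPV computed term by term: -A plus n discounted payments."""
    return -amount + sum(payment / (1 + rate) ** k for k in range(1, n + 1))


def npv_closed_form(amount: float, payment: float, rate: float, n: float) -> float:
    """NPV from the annuity formula p * (1 - (1 + r)^-n) / r - A."""
    if rate == 0:
        return payment * n - amount  # limit of the formula as r tends to 0
    return payment * (1 - (1 + rate) ** -n) / rate - amount


# Arbitrary example: $10,000 lent, $332.14 received monthly for 36 months.
a, p, r, n = 10_000.0, 332.14, 0.005, 36
print(npv_sum(a, p, r, n))
print(npv_closed_form(a, p, r, n))        # same value up to rounding
print(npv_closed_form(a, p, r, 0.9 * n))  # fewer expected payments, lower NPV
```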

When the NPV is 0, the discount rate r is such that the sum of the discounted payments equals the loan amount. This rate is the Internal Rate of Return (IRR).

Unfortunately, the IRR cannot be calculated directly. A computer program can, however, approximate it through successive iterations until the NPV is close enough to zero. A simple algorithm that speeds up the calculation is the secant method, where the index k counts iterations:

$$r_{k+1} = r_{k} - NPV_{k} \left( \frac{r_{k} - r_{k-1}}{NPV_{k} - NPV_{k-1}} \right)$$

Once the monthly return r has been determined, obtaining the annual rate of return simply requires us to annualize it:

$$R = (1 + r)^{12} - 1$$
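Putting the pieces together, a possible implementation of the secant iteration and of the annualization, computed as (1 + r)^12 - 1, could look like this. The starting guesses, tolerance and loan figures are assumptions chosen for the example.

```python
def npv(rate: float, amount: float, payment: float, n: float) -> float:
    """Net present value of n payments, using the annuity closed form."""
    if rate == 0:
        return payment * n - amount
    return payment * (1 - (1 + rate) ** -n) / rate - amount


def irr_secant(amount, payment, n, r0=0.001, r1=0.02, tol=1e-6, max_iter=100):
    """Monthly rate that brings the NPV to (almost) zero, via the secant method."""
    f0, f1 = npv(r0, amount, payment, n), npv(r1, amount, payment, n)
    for _ in range(max_iter):
        if abs(f1) < tol or f1 == f0:
            break
        r2 = r1 - f1 * (r1 - r0) / (f1 - f0)
        r0, f0 = r1, f1
        r1, f1 = r2, npv(r2, amount, payment, n)
    return r1


def annualize(monthly_rate: float) -> float:
    """Compound a monthly rate over 12 months."""
    return (1 + monthly_rate) ** 12 - 1


# Hypothetical 36-month, $10,000 loan priced at 1% per month.
amount, term, priced_rate = 10_000.0, 36, 0.01
payment = amount * priced_rate / (1 - (1 + priced_rate) ** -term)

irr_full = irr_secant(amount, payment, term)         # every installment paid
irr_short = irr_secant(amount, payment, 0.9 * term)  # S(1|z) = 0.9
print(annualize(irr_full), annualize(irr_short))
```

With all 36 payments the procedure recovers the 1% monthly rate used to price the loan. With an expected payment ratio of 0.9, the return drops accordingly.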

Such a return estimate gives us a directly usable scoring mechanism.

## Validation

To assess the validity of our prediction algorithm, we need to cross-validate it, which means obtaining the model parameters from one set of data, then applying it to a different set and measuring how it performs.

To do so, we divide the historical Lending Club data into two distinct sets. To ensure a random but consistent distribution, the first set is made up of loans with an even ID number, the second of loans with an odd ID number. We take the first set as ‘in-sample’ and fit the Cox proportional hazards model to estimate the effect parameters. Then we select the loans past maturity in the second, ‘out-of-sample’ data set, score them with the Expected Return procedure described above, sort them by decreasing score and compute the average performance for each decile:

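| Number of loans | Expected Return | Observed Return | Default Rate |
|-----------------|-----------------|-----------------|--------------|
| 1,047 | 9.28% | 9.50% | 12.42% |
| 1,047 | 7.81% | 7.82% | 13.28% |
| 1,047 | 7.05% | 8.67% | 10.03% |
| 1,047 | 6.43% | 5.28% | 13.28% |
| 1,047 | 5.83% | 4.41% | 14.14% |
| 1,047 | 5.15% | 4.22% | 12.03% |
| 1,047 | 4.50% | 4.91% | 9.46% |
| 1,047 | 3.86% | 3.36% | 9.93% |
| 1,047 | 3.07% | 3.81% | 8.69% |
| 1,047 | -3.62% | -4.42% | 26.17% |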

This data shows top scored loans are the ones providing the best return on average. Financial performance degrades as the score gets lower, empirically validating the efficiency of our model.

The average return for loans in the top expected quartile is 6.44%, a very significant 2.44 percentage points above the average of 4.00% for the same data set.

It also shows that the default rate does not necessarily increase with lower scores. Since the goal of the method is to increase returns, it sometimes means favoring riskier loans because their high interest rate compensates for the risk.
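The bookkeeping of this validation (parity split, ranking, decile averages) can be sketched as below. The loan records and scores are placeholders, and fitting and scoring the model are deliberately left out.

```python
def split_by_parity(loans):
    """Even IDs go in-sample, odd IDs out-of-sample."""
    in_sample = [loan for loan in loans if loan["id"] % 2 == 0]
    out_of_sample = [loan for loan in loans if loan["id"] % 2 == 1]
    return in_sample, out_of_sample


def decile_report(scored_loans):
    """Average (expected, observed) return per decile, best scores first.

    scored_loans is a list of (expected_return, observed_return) pairs.
    """
    ranked = sorted(scored_loans, key=lambda pair: pair[0], reverse=True)
    size = len(ranked) // 10
    if size == 0:
        raise ValueError("need at least 10 scored loans")
    report = []
    for d in range(10):
        # The last decile absorbs any remainder.
        end = (d + 1) * size if d < 9 else len(ranked)
        chunk = ranked[d * size:end]
        expected = sum(e for e, _ in chunk) / len(chunk)
        observed = sum(o for _, o in chunk) / len(chunk)
        report.append((expected, observed))
    return report


# Placeholder data: 40 loan IDs and 100 made-up (expected, observed) returns.
loans = [{"id": i} for i in range(1, 41)]
in_sample, out_of_sample = split_by_parity(loans)
scored = [(i / 1000, i / 1000 - 0.01) for i in range(100)]
report = decile_report(scored)
for expected, observed in report:
    print(f"{expected:.3%} {observed:.3%}")
```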

## Extension

The LendingRobot Expected Return model relies on an improved and more sophisticated model than the one described above. Some of the improvements are:

• Adding composite variables to the model. For instance, while ‘Loan Amount’ is not by itself a good predictor, the ratio of ‘Loan Amount’ to ‘Monthly Income’, which indicates how much someone is borrowing compared to what they can afford, can be more strongly correlated with the probability of default. The stepwise algorithm can be used again to measure the significance of these composite variables and keep the most relevant ones.
• Siloing the data according to a given variable, such as the loan grade, credit score or income, calculating the Cox parameters for each silo, and then applying a weighted average of the parameters. The underlying idea is that loans may have different behavioral characteristics based on their type. For instance, the loan purpose may be more important for a high-risk loan than for a low-risk one. The optimal weights for each silo are obtained through a greedy optimization algorithm.

Our analysis shows that these improvements alone allow us to increase the returns of high score loans by 1.56%.

## Conclusion

An improved version of the Cox proportional hazards model makes it possible to predict future returns with a significant edge over pure random samples.
Moreover, once the model parameters have been determined, calculating expected returns is straightforward and can be done in real time when new investment opportunities arise.

1. sean says:

I’m a bit dubious about the normalising of time. Do you have any data to back it up? It seems more natural to assume that there is a constant hazard rate each month, than to say that ‘time speeds up’ for a 10 year loan….vs a 3 year loan. I understand the normalisation argument too ( eg if you have paid back 90% of your loan you are not going to default). I guess it is better to split data in maturities?

1. Emmanuel Marot says:

Loan terms are only 3 or 5 years. At the time of our analysis, we could only see limited discrepancies between the two, once the duration has been normalized.
That being said, we now have more data and are refining our analysis to take into account differences between platforms and terms. Stay tuned!

1. sean says:

sorry I hadn’t seen that that’s exactly what you did (comparing maturities) in the next paragraph/graph: “We graph the proportion of defaulting loans that stop paying over time, for 6,097 36-months loans and for 2,783 60-months loans”

2. Ashish says:

For training and test data set, you picked odd and even number loans. Ideally, I would train my algorithm on lets say 2007-2011 and then test it on 2011-2015 loans. This would validate if the economic factors influence the correlation with parameters such as Income, home ownership etc. What do you think?

1. Justin Hsi says:

We have reason to believe that economic factors are influencing the parameter’s correlation with default (as work on our next model suggests), but we think that splitting to train on old loans and test on new loans would have other confounding variables that could mask the economic factor’s influence (e.g. a platform’s constantly changing credit scoring model). Nevertheless, much work remains to be investigated in this area!

3. sean says:

I think it would be good if you validated the model the way it will be used: ie to predict new loan performance from old. so do something like using first x years data predict next year loans …

4. Robert says:

How do you model early full repayments when calculating expected returns?

1. Justin Hsi says:

At the individual note/loan level, early full repayment does not change our expected return (since it is the Internal Rate of Return of the predicted series of cashflows). For more details see our post on calculating returns. Partial prepayment does change our expected return insofar as relocating where we are on the hazard curve.

5. Nicolas says:

Interesting approach but your model includes “sub_grade” which is essentially LC model output. So that is why you get funny regression output. Also LC is updating its model so “sub_grade” changes which makes the model you propose unstable in some sense. Any thought on this?
Also what is the advantage of using Cox over a logistic model? I understand that if you want to be buy and sell on the 2ndary market Cox is beneficial but if it is to buy and hold the notes, logistic models could be easier to built.

1. Emmanuel Marot says:

We do update our model from time to time, in order of keeping it relevant with changes in economic conditions and LC’s updates to their model.
As a matter of fact, Cox was our original approach, but since then we split the survival and default models, so nowadays our model relies more on a multi-variate logistics regressions.