The Multi-Armed Platform Problem

Or why pizzas and loans create the same conundrum


Our life as investors (or investment advisor) would be much, much simpler if we had no choice in the kind of assets we buy (but probably no ‘raison d’être’). In reality, choices are plenty, if only because there are so many people eager to attract money. Some of these options are good, some others not so much. The question is, once we’re tried investing in an origination platform and it generates good return, should we keep pouring money in it, or should we still look for an even better opportunity?

Faced with multiple choices, one has to find a equilibrium between ‘exploit’ and ‘explore’. Let’s say you found a nice little Italian restaurant in your town that makes the best Napolitan-style pizza you’ve ever tasted. Should you keep going there (exploit) or keep trying other places as well (explore)? It’s likely that the next place you’ll try won’t be as good. But on the other hand, there may be an even better place somewhere, and you’ll never know if you keep eating the same pizzas, again and again.

Systematic Exploring

One simple, systematic method is to alternate exploitation and exploration. For instance, one time out of two, you go back to the usual pizza place. The other time, you try something new. If a new place becomes your favorite, you now go there half the time, and put back your oldest favorite in the pool of random place to try in the future.

This algorithm is an example of ‘epsilon-greedy’ strategy, ‘epsilon’ being when you try something new, and ‘greedy’ when you go back to the place providing the highest rewards so far. Which can be formalized as \( p_i(k+1)\) the probability of picking choice i amongst n after k trials that gave an average return for that choice of \(u_j(k) \) being:

p_i(k+1) = \left\{
1 – \epsilon + \frac{\epsilon}{n} & \quad \text{if }i = \text{argmax}_j u_j(k) \\
\frac{\epsilon}{n} & \quad \text{otherwise}

In our example of trying a new restaurant one time out of two, epsilon equals 0.5. This algorithm is both quite simple and shockingly efficient. One drawback, though, is that it is extremely sensitive to the value of epsilon. How often to explore versus exploit has a large impact on results and may depend upon the number of options and the duration of the experiment. At the extreme, imagine there are only two restaurants in your town, and the epsilon is high, you may end up going to your last preferred restaurant quite often! At the other end, if there are 24,000 restaurants, chances are pretty high that there’s a better restaurant somewhere, and that an epsilon too low means that you’re going to your ‘usual’ way too often1. It also makes sense that there should be less and less exploration as time progress. After all, if the restaurants provide a constant experience, there is no point in exploring anymore once you tried them all. But even with a less extreme scenario, one may consider the results to become more trustworthy over time. If the Italian place was the first restaurant you ever tried in town, the fact that you like it doesn’t mean much. Inversely, if after 20 years and hundreds of different places, it’s still you favorite, then maybe it’s not worth exploring other options anymore2. Second, because with less time left until the end of the experiment, it becomes more rationale to prioritize exploitation over exploration. Would choose a restaurant at random for the very last dinner out in a city you lived in for 10 years, or would rather go to your old time favorite one last time3? Reducing the frequency of ‘exploration’ versus ‘exploitation’ over time is called an Epsilon-decreasing strategy.

The Secretary Problem

A similar issue is tackled with the so-called ‘Secretary Problem’. Imagine you’re hiring a secretary. You’re interviewing multiple candidates. Considering that you can hire a candidate right after the interview or must pass on that one forever, when should you hire the best candidate your found so far? This requires to determine a stopping rule, such as ‘After passing on x-1 candidates, I’ll hire the best candidate so far’. For instance, with 3 candidates, if we decide to hire the first one, the probability of having selected the best is 1/3. Similarly, if we decide to wait for candidate 3 to make a hiring decision, we don’t have much choice left, so the probability to hire the best one is simply the probability that the best one was the third, or 1/3 again. But if we decided to pass on the first one, then hire #2 if that candidate is better than #1 or hire #3 if #2 is not better than #1, then our probability to have selected the best candidate is 3/6 (3 permutations out of 6). Another way to look at this is there is 3 cases out of 6 when we do NOT hire the best candidate: if it was the first one (2 permutations, #1 > #2 > #3 and #1 > #3 > #2) so we didn’t hire fast enough or if the candidate #2 is better than #1 but still not has good as #3 (1 permutation, #3 > #2 > #1) so we hire too early.
Interestingly, it shows that when interviewing at least 3 candidates, you should ALWAYS pass on the first one, however amazing is that candidate!
Generalizing this, we find that the probability of choosing the best candidate j amongst n by passing on the first k-1 is:

\[ P(n,k) = \frac{k-1}{k} \sum_{i=k}^n \frac{1}{i-1} \]

As n tends to infinity, we can compute the optimum cut-off. The optimal strategy is to reject out of hand the first 37% of the candidates, then select the first one who’s the best candidate so far. Elegantly, the probability to find the best candidate this way is also equal to 37%.

Incidentally, you may want to remember that method and percentage next time you’re looking for an apartment, a job, or even the love of your life… You shall systematically pass on the first 37% of your options, however great they are!

The Secretary Problem relies on two conditions that are not applicable to origination platforms, or even pizza restaurants. First, that you can immediately assess quality. Maybe the pizza you ordered wasn’t the best choice. The first loan you invested in a platform defaulting does not mean all loans should default. The second assumptions is that quality remains constant. But restaurants change, so do credit risk assessments. Another way to look at it is that the reward is somewhat random. We can observe some rewards, but still don’t know exactly what are the underlying distribution functions.

Which related to another well-know mathematical problem…

The Multi-Armed Bandit Problem

Imagine a row of several slot machines. When played, each machine provides a random reward, based from secret probabilities, specific to that machine. The multi-armed bandit problem aims to determine which machines a gambler should play, and how many times, in order to maximize his rewards.

This problem has been studied since World War II, because its application goes way beyond gambling, such as clinical trials or resource allocations. While it was originally thought to be intractable (it was even suggested to entice the Germans to work on it for wasting their time), an optimal strategy was published in 1979, bearing the name of ‘Gittins index’.

Interestingly, some non-optimal solutions turn out to be both easier to apply, and able to generate higher returns in real-life conditions. Such a solution is called ‘Boltzann Exploration’, or ’Softmax method’. To show off at your next diner party (or scare people away), you could mention that the Softmax function is a “gradient-log-normalizer of a categorical probability distribution”.

Given average returns u observed for each machine, the probability to play the machine i after k plays is:

\[ p_i(k+1) = \frac{e^\frac{u_i(k)}{T}}{\sum_{j=1}^n e^\frac{u_i(k)}{T}}, i = 1…n\]

Where T is a ‘temperature’ parameters, that controls the randomness of the choice. If T=0, the algorithm is purely greedy. As T goes to infinity, the algorithm chooses uniformly at random. The interesting aspect, compared to the epsilon-greedy strategy, is that exploration is also dependent upon returns observed so far amongst non-optimal choices. It makes you much less likely to give a second chance to a restaurant where the experience was abysmal.

Another option is to start by giving the same probability of play to all arms. Then, progressively increase the probability of playing the arm with the highest average gain. Such an algorithm is called a ‘Pursuit’ algorithm. The probability of playing the machine i after k plays is:

p_i(k+1) = \left\{
p_i(k) + \beta \big(1 – p_i(k) \big) & \quad \text{if }i = \text{argmax}_j u_j(k) \\
p_i(k) + \beta \big(0 – p_i(k) \big) & \quad \text{otherwise}

Where \( \beta \) is the learning rate, between 0 and 1.

Both the algorithms described above make sense, intuitively. One of their drawback is that they do not take into account the variance of each machine. As a player, you would probably react very differently when playing an arm that returns either \$0 or \$1, or an arm that gives you 50¢ everytime, although the average gain is the same. The lower the variance, the more confident we can also be that the results we observed are meaningful. An algorithm addressing that is the ‘UCB1-Tuned’. Initially, each arm is played once. At time k, the algorithm picks an arm j(k) such as:

j(k) = \text{argmax}_{i=1…k} \Bigg( u_i + \sqrt{\frac{ln(k)}{n_i} \text{min}\big( \frac{1}{4}, V_i(n_i)\big)}\Bigg)

where u is the average gain of an arm n the number of times it has been played so far. V is a measure of the variance, such that:

V_i(k) = \sigma_i^2(k) + \sqrt{\frac{2\text{ ln }(k)}{n_i(k)}}

It looks fancy, so it must perform better, right?

Well, it doesn’t. As demonstrated by Vermorel an Mohri as well as Kuleshov and Precup, a simple heuristics-based system such as the Softmax algorithm generates at least 50% less regret than UCB1-Tuned.

But this algorithm is still not exactly what we need. The multi-armed bandit problem is ‘all or nothing’. It doesn’t allow to hedge bet by splitting a dollar gambled between multiple machines, as we would do with a real portfolio allocation. Instead of choosing which place to dine tonight, imagine allocating a monthly restaurant budget across multiple places.

Multiple players

A simple approach is to consider multiple players, playing multiple bandits. The sum to ‘gamble’ is divided between players. Then, each of them decide which machine to play, following the Softmax algorithm described above. Imagine deciding in advance the five next restaurants you’ll go to. Because the algorithm is probabilistic, the choices will be nicely allocated following your exploitation / exploration choices. Every time a restaurant has been tried or a machine played, we update the corresponding expected return, which will in turn influence the choice of the next player.

Even more flexible, we can imagine that instead of distributing resources equally amongst multiple players, we divide them by the smallest amount we can commit. For instance, \$50 for a restaurant diner or \$25 on a Lending Club note. Then we distribute those amounts based on the probability computed by the Softmax algorithm.

In practice, for each unit of minimum commitment, we draw an independent and identically distributed random number \( 0 \leq \varepsilon \leq 1\) and allocate that amount to the first choice i amongst n to satisfy:

\[ \varepsilon \leq \frac{\sum_{k=1}^I e^\frac{u(k)}{T}}{\sum_{t=1}^n e^\frac{u_i(k)}{T}}, i = 1…n\]

Expected returns are constantly updated, but the amounts cannot be re-allocated once they’ve been committed. Imagine that for dining out, you would decide at the beginning of the month what will be your budget for each restaurant. You wouldn’t change it, even if you discovered a new gem a few days later. But if you get extra budget after that, you will allocate it with the updated expected returns.

Getting the Temperature

The one parameter that remains to be determined is T, the ‘temperature’ of our algorithm. Intuitively, the more certain we are of the results, the less we should explore.
One can imagine updating T such as:

\[ T_{t+1} = \alpha \cdot | f(t+1) – f(t) | + (1-\alpha) \cdot T_{t}\]

Where \( \alpha \) is a sensitivity constant and \( f() \) is a margin of error or a measure of riskiness. For instance, \( f(t) \) could be the latest difference in expected returns, the standard deviation between returns since inception, or the average of the variances calculated in the UCB1-Tuned algorithm. This will ensure that our allocation algorithm will self-adapt. For instance if a restaurant we tried before was to provide a surprisingly bad or good experience, or if the returns of an origination platforms were becoming increasingly disappointing, our algorithm would naturally start exploring other options more.


Ian Fleming once said ‘never say no to adventure, or you’ll live a very dull life’, but according to Euripides, ‘Chance fights ever on the side of the prudent’. Hopefully, a little math can help in finding the balance between the two.

  1. It also means you’re living in New York City ↩︎
  2. No wonder old people always go to the same restaurants ↩︎
  3. Again, an excellent reason for old people to always go to the same place ↩︎


Leave a Reply