an overview of distress
DISTRIBUTION-ESTIMATING SEARCH
The conceit of the algorithm is returning a probability distribution rather than a value. Supposing that we've an evaluation function that does so, which we will consult at the leaves of the search tree, we are left with two procedures to determine:
- q1: a procedure that yields a distribution given the distributions of the successors of a position.
- q2: a procedure that selects a successor of a position to explore, or alternatively, a procedure that allocates a given amount of time among the successors of a position.
The probability distributions may be over centipawn or wdl scores,* but we will generally assume the latter so that the support is bounded, and the distributions may be either continuous or discrete — with domain of any cardinality, such as distributions over {loss, draw, win} or the interval from −1 to +1 subdivided into any number of parts.

* By wdl score I mean expected outcome.
This first procedure might be riven in twain:
- q1a: modelling (i) the belief that a successor is optimal — which is not a distribution† over the successors — or, alternatively, contriving (ii) a distribution of the belief that a successor of a position would be played were the position the root, or (iii) both, or (iv) something else entirely, which we will broadly refer to as a policy.
- q1b: producing a distribution given not only the distributions of the successors of a position but also a policy.

Note that the policy may also be of use in answering q2.

† For example, if two moves are believed to be optimal, and equally so, we would assign them both a value of 1 rather than ½.
In two thousand twenty one, before having considered the problem as a whole and decomposed it as above, I considered q1 only and bluntly began down the following path.
The maximum of two random variables $X$ and $Y$ with probability density functions $f_X$ and $f_Y$ and cumulative distribution functions $F_X$ and $F_Y$, if $X$ and $Y$ are independent, has the cumulative distribution function

$$F_{\max(X,Y)}(x) = F_X(x)\,F_Y(x).$$
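As a quick illustration (a sketch of my own, assuming numpy; the function names are hypothetical), the identity applies directly to discrete distributions over a shared grid:

```python
import numpy as np

def cdf_of_max(f_x, f_y):
    """CDF of max(X, Y) for independent X and Y given as discrete
    pmfs over the same grid: the elementwise product of their CDFs."""
    return np.cumsum(f_x) * np.cumsum(f_y)

def pmf_of_max(f_x, f_y):
    """Recover the pmf of max(X, Y) by differencing its CDF."""
    return np.diff(cdf_of_max(f_x, f_y), prepend=0.0)
```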
I used distributions over centipawn score, and instead of working with the distributions directly, I represented them with three summary statistics: the peak (the median or mean), the right-hand dispersion (the difference between the peak and the ninety-fifth percentile, if I recall correctly), and the left-hand dispersion (likewise, but the fifth percentile). My hope was to find an acceptable approximation of the effect of the maximum operation on the summary statistics. I did not succeed.
Four years later, I returned to the subject and began to contemplate it more completely, writing the following:
Let N be the number of successors, and then consider four quantities:
- s (−1 to +1): estimated score
- v (0 to ?): uncertainty of the score estimate (from "variance")
- f (0 to 1): probability the position is substantially better than expected (from "fortuitous")
- c (0 to 1): probability the position is substantially worse than expected (from "catastrophic")
Informally, the quantities s and v are meant to describe the bulk of the distribution and the quantities f and c are meant to describe the tails (“the distribution” here being the distribution of belief about the true score of the position).
When switching perspectives, s negates, v remains the same, and f and c swap.
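To make the bookkeeping concrete (my own sketch, not part of the original notes), the four quantities and the perspective switch could be rendered as:

```python
from dataclasses import dataclass

@dataclass
class Quantities:
    s: float  # estimated score, -1 to +1
    v: float  # uncertainty of the score estimate
    f: float  # probability the position is substantially better than expected
    c: float  # probability the position is substantially worse than expected

    def flipped(self) -> "Quantities":
        """Switch perspectives: s negates, v is unchanged, f and c swap."""
        return Quantities(s=-self.s, v=self.v, f=self.c, c=self.f)
```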
The variables s, v, f, and c can also be indexed, e.g.
- f(i): the c of the i-th successor (the position that follows from making the i-th move)
- c(i): the f of the i-th successor
Then also consider:
- b(i): current belief that the i-th move is the best move or is as good as the best move
- p(i): probability of picking the i-th move

The move with the highest p (which is also the move with the highest b) is the one we'd pick if this were the root. Here I'm treating b(i) as in the range 0 to 1 and initialized to ½, to 1/N, or to something depending on s(i) and v(i) — but should it be log odds and initialized to 0, or something else?
When we write "$\sum_i$", this is shorthand for "$\sum_{i=1}^{N}$". The p(i) obey the property that $\sum_i p(i) = 1$; in other words, it's a distribution.
Perhaps the b(i) are also a distribution, and perhaps further b(i) and p(i) should be the same. As a theoretical matter, they are distinct; if all moves are precisely equally good, b(i) should be 1 for all i, but p(i) should be 1 / N. And p(i) should depend on v(i), f (i), and c(i) in addition to s(i), but b(i) may depend only on s(i).
In some sense, the c of a position is the f of the successor you will pick.
In another sense, the c of a position is the probability that all successors are catastrophic, since it only takes one of them being okay for the position to not be catastrophic.
The relevance of that observation depends on our belief that we'd be able to find the right move if this position became the root, so we blend the two with some factor β:

$$c = \beta \prod_i c(i) + (1 - \beta) \sum_i p(i)\, c(i)$$
This can be done asymmetrically — we may pick a higher β for our opponent than ourselves if we think they are less likely to make a mistake.
Similarly, $f = \sum_i p(i)\, f(i)$, and the probability that any successor is fortuitous is $1 - \prod_i (1 - f(i))$. We blend the two:

$$f = \beta \Bigl(1 - \prod_i \bigl(1 - f(i)\bigr)\Bigr) + (1 - \beta) \sum_i p(i)\, f(i)$$
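In code, the two blends look like this (a sketch of mine; the placement of β follows the reconstruction above, with β as the belief that we'd find the right move):

```python
import math

def blend_catastrophic(p, c, beta):
    """c of the position: blend 'all successors are catastrophic' with
    'the c of the move we expect to pick', weighting the former by beta."""
    all_catastrophic = math.prod(c)
    expected_pick = sum(pi * ci for pi, ci in zip(p, c))
    return beta * all_catastrophic + (1 - beta) * expected_pick

def blend_fortuitous(p, f, beta):
    """f of the position: blend 'any successor is fortuitous' with
    'the f of the move we expect to pick'."""
    any_fortuitous = 1 - math.prod(1 - fi for fi in f)
    expected_pick = sum(pi * fi for pi, fi in zip(p, f))
    return beta * any_fortuitous + (1 - beta) * expected_pick
```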
There is a defect: it oughtn't make a difference whether a bad move has a high chance of being unexpectedly worse or not, yet the product $\prod_i c(i)$ is sensitive to the c(i) of moves we would never pick.
Because of this defect, and a dissatisfaction with the arbitrariness and lack of theoretical simplicity, I abandoned this line of thinking as well.
This brings us to my ongoing endeavour. I have resolved to use discrete distributions over the interval from −1 to +1 subdivided into S parts. For q1a we will use a distribution of the belief that the successor of a position would be played were the position the root. Then q1b becomes rather simple: let $p_i$ for $i = 1, \dots, N$ be the policy for the N successors and $r_i(x)$ for $x = 1, \dots, S$ be the probability mass function returned by the search of the i-th successor. The variable x indexes the subintervals, so x = 1 refers to the interval from −1 to −1 + 2/S and x = S refers to the interval from +1 − 2/S to +1; more generally, x refers to the interval from −1 + 2/S × (x − 1) to −1 + 2/S × x. Then the distribution we will return is

$$r(x) = \sum_{i=1}^{N} p_i\, r_i(x).$$
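This step is a one-liner in code (a numpy sketch of mine; the names are hypothetical):

```python
import numpy as np

def q1b_mixture(policy, successor_pmfs):
    """Combine successor distributions into the node's distribution:
    r(x) = sum_i p_i * r_i(x), a mixture weighted by the policy.
    policy: shape (N,), sums to 1.  successor_pmfs: shape (N, S)."""
    return np.asarray(policy) @ np.asarray(successor_pmfs)  # shape (S,)
```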
Now we define q1a. Consider again two random variables $X$ and $Y$ with probability density functions $f_X$ and $f_Y$ and cumulative distribution functions $F_X$ and $F_Y$. If $X$ and $Y$ are independent, then

$$P(X \ge Y) = \int f_X(x)\, F_Y(x)\, dx,$$

or more simply $P(X \ge Y) = \mathbb{E}\bigl[F_Y(X)\bigr]$. We can extend this to three random variables:

$$P(X \ge Y \text{ and } X \ge Z) = \mathbb{E}\bigl[F_Y(X)\, F_Z(X)\bigr].$$

And more generally, in our discrete setting,

$$b_i = \sum_{x=1}^{S} r_i(x) \prod_{j \ne i} R_j(x),$$

where $R_j$ is the cumulative distribution function corresponding to $r_j$. This is our belief that the i-th successor is optimal (at least as good as any other successor). After normalization, these terms become our policy:

$$p_i = \frac{b_i}{\sum_{j=1}^{N} b_j}.$$
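Putting q1a together (again a numpy sketch of my own; this is the naive direct evaluation):

```python
import numpy as np

def q1a_policy(successor_pmfs):
    """Belief that each successor is optimal, assuming independent
    discrete distributions over the S subintervals.
    successor_pmfs: shape (N, S)."""
    r = np.asarray(successor_pmfs)
    R = np.cumsum(r, axis=1)                                # CDFs, shape (N, S)
    b = np.empty(len(r))
    for i in range(len(r)):
        others = np.prod(np.delete(R, i, axis=0), axis=0)   # prod_{j != i} R_j(x)
        b[i] = np.sum(r[i] * others)
    return b / b.sum()                                      # normalize into a policy
```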
There may be numerical difficulties when N is large because of the product $\prod_{j \ne i} R_j(x)$, which might vary wildly in magnitude depending on the value of x, but this completes q1 in concept.
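As for that numerical difficulty, one standard mitigation (my suggestion, not the essay's) is to accumulate the product in log space:

```python
import numpy as np

def stable_q1a_policy(successor_pmfs, eps=1e-300):
    """Compute b_i = sum_x r_i(x) * prod_{j != i} R_j(x), forming the
    product as a sum of log CDFs so it does not underflow for large N."""
    r = np.asarray(successor_pmfs)
    R = np.cumsum(r, axis=1)
    logR = np.log(np.maximum(R, eps))       # log CDFs, shape (N, S)
    total = logR.sum(axis=0)                # sum_j log R_j(x)
    b = np.empty(len(r))
    for i in range(len(r)):
        log_others = total - logR[i]        # exclude successor i
        b[i] = np.sum(r[i] * np.exp(log_others))
    return b / b.sum()
```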
I suspect that q2 admits no similarly simple answer. Much like how α/β search* is effectively made a best-first search through the heuristic application of reductions and extensions,† the most successful procedure is likely to be found empirically, perhaps as a synthesis of various methods. Like in α/β search, online learning is apt to be indispensable.

* Depth-first minimax search with iterative deepening, α/β pruning, and zero-window pruning.

† Whether by reducing/extending some nodes directly according to some criteria or by uniformly applying late move reduction and ordering successors according to some criteria.
At nodes for which we happen to have saved information about the successors' distributions, it may be reasonable to start by dividing time in proportion to $p_i$, or, if visitation counts $n_i$ (with total $n = \sum_i n_i$) are also available, by dividing time in proportion to

$$p_i + c \sqrt{\frac{\ln n}{n_i}}$$

for some constant c, analogous to ucb1, or perhaps

$$p_i + c\, h_i \frac{\sqrt{n}}{1 + n_i},$$

where c is scaled by roughly the square root of the variance and $h_i$ is an adjusting heuristic, analogous to puct. Note the absence of a logarithm in this second formula.
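Here is how the first, ucb1-flavoured allocation might look (my sketch; as stated, the formula itself is only a starting point):

```python
import numpy as np

def allocate_time(policy, visits, budget, c=1.0):
    """Split a time or node budget across successors in proportion to a
    ucb1-like score: policy plus an exploration bonus for rarely
    visited successors."""
    policy = np.asarray(policy, dtype=float)
    visits = np.asarray(visits, dtype=float)
    n = visits.sum()
    scores = policy + c * np.sqrt(np.log(max(n, 2.0)) / np.maximum(visits, 1.0))
    return budget * scores / scores.sum()
```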
Then search can proceed by iterative deepening, giving a node count to approximately divvy out recursively and progressively searching with higher counts.
This cannot work as stated.
Repeatedly taking the mixture of distributions, as in our prescription of $r(x) = \sum_i p_i\, r_i(x)$ for q1b, can only increase the variance, but we would expect the variance to generally decrease as the search deepens. On the other hand, repeatedly taking the maximum of two bounded random variables will reduce variance. This suggests that the idea of blending $r_{\text{opt}}$ and $r_{\text{pick}}$ was a decent one, as it addresses a fundamental problem. By $r_{\text{opt}}$ we mean the distribution whose cumulative distribution function is $\prod_i R_i(x)$, and by $r_{\text{pick}}$ we mean $\sum_i p_i\, r_i(x)$ as it was defined above.
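Concretely (my sketch, keeping the independence assumption across successors):

```python
import numpy as np

def r_opt(successor_pmfs):
    """Distribution of the maximum of the successors' scores: its CDF
    is the product of the successor CDFs (independence assumed)."""
    R = np.cumsum(np.asarray(successor_pmfs), axis=1)
    return np.diff(np.prod(R, axis=0), prepend=0.0)

def r_pick(policy, successor_pmfs):
    """Mixture of successor distributions under the policy."""
    return np.asarray(policy) @ np.asarray(successor_pmfs)
```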
It is a tricky matter, avoiding degeneracy under repeated application into either a flat or a narrow distribution, diffusing or collapsing quickly. What process would push the variance toward a stable value, but could be overcome given provocation?
We might consider two possible principles at play:
- A sharp policy (that is, having only a few good options, with a wide gap between those options and the remaining successors) should be cause for wariness, because the risk of catastrophe is higher,* and the variance should perhaps remain the same or increase mildly. In such cases we would interpolate almost entirely toward $r_{\text{pick}}$.
- A wide policy (that is, having many similar options) should be cause for confidence in the evaluation,† because the consequence of an individual error is lessened, and so the variance should perhaps decrease. In such cases we would interpolate toward $r_{\text{opt}}$.
To put it more succinctly, low dispersion in the policy drives interpolation toward $r_{\text{pick}}$ and high dispersion drives interpolation toward $r_{\text{opt}}$. This establishes some degree of negative feedback.
* We have been blithely neglecting the fact that the scores of successors are almost certainly not independent. If the most promising successor turns out to be unexpectedly bad, it is somewhat likely the other eminent successors will be worse than hitherto believed.
† Separately, it is also justification to begin focusing on a subset of successors, as it is relatively safe to do so; a desire to differentiate the successors would provide an impetus to do so.
At the present time, it remains an open question how this ought to be carried out exactly, or what heuristics ought to modify it.
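One conceivable instantiation, offered purely as a sketch of my own (the essay commits to nothing here), uses the normalized entropy of the policy as the dispersion measure, reusing r_opt and r_pick from the sketch above:

```python
import numpy as np

def blend_by_dispersion(policy, successor_pmfs):
    """Interpolate between r_pick and r_opt using normalized policy
    entropy as a stand-in for dispersion: a sharp policy (low entropy)
    leans toward r_pick, a wide one (high entropy) toward r_opt."""
    p = np.asarray(policy, dtype=float)
    entropy = -np.sum(p * np.log(np.maximum(p, 1e-12)))
    lam = entropy / np.log(len(p))   # normalized dispersion in [0, 1]
    return lam * r_opt(successor_pmfs) + (1 - lam) * r_pick(p, successor_pmfs)
```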
This essay was last updated on the fifteenth day of February in the two thousand twenty sixth year of the common era.