an overview of distress
DISTRIBUTION-ESTIMATING SEARCH
The conceit of the algorithm is returning a probability distribution rather than a value. Supposing that we've an evaluation function that does so, which we will consult at the leaves of the search tree, we are left with two procedures to determine:
- q1: a procedure that yields a distribution given the distributions of the successors of a position.
- q2: a procedure that selects a successor of a position to explore, or alternatively, a procedure that allocates a given amount of time among the successors of a position.
The probability distributions may be over centipawn or wdl scores,* but we will generally assume the latter so that the support is bounded, and the distributions may be either continuous or discrete — with domain of any cardinality, such as distributions over {loss, draw, win} or the interval from −1 to +1 subdivided into any number of parts.

* By wdl score I mean expected outcome.
This first procedure might be riven in twain:
- q1a: modelling (i) the belief that a successor is optimal — which is not a distribution† over the successors — or, alternatively, contriving (ii) a distribution of the belief that a successor of a position would be played were the position the root, or (iii) both, or (iv) something else entirely, which we will broadly refer to as a policy.
- q1b: producing a distribution given not only the distributions of the successors of a position but also a policy.

Note that the policy may also be of use in answering q2.

† For example, if two moves are believed to be optimal, and equally so, we would assign them both a value of 1 rather than ½.
In two thousand twenty one, before having considered the problem as a whole and decomposed it as above, I considered q1 only and bluntly began down the following path.
The maximum of two random variables $X$ and $Y$ with probability density functions $f_X$ and $f_Y$ and cumulative distribution functions $F_X$ and $F_Y$, if $X$ and $Y$ are independent, has the cumulative distribution function

$$F_{\max(X,Y)}(x) = F_X(x)\,F_Y(x).$$
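As a quick illustration (a sketch of my own, assuming numpy; the function names are hypothetical), the identity applies directly to discrete distributions over a shared grid:

```python
import numpy as np

def cdf_of_max(f_x, f_y):
    """CDF of max(X, Y) for independent X and Y given as discrete
    pmfs over the same grid: the elementwise product of their CDFs."""
    return np.cumsum(f_x) * np.cumsum(f_y)

def pmf_of_max(f_x, f_y):
    """Recover the pmf of max(X, Y) by differencing its CDF."""
    return np.diff(cdf_of_max(f_x, f_y), prepend=0.0)
```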
I used distributions over centipawn score, and instead of working with the distributions directly, I represented them with three summary statistics: the peak (the median or mean), the right-hand dispersion (the difference between the peak and the ninety-fifth percentile, if I recall correctly), and the left-hand dispersion (likewise, but the fifth percentile). My hope was to find an acceptable approximation of the effect of the maximum operation on the summary statistics. I did not succeed.
Four years later, I returned to the subject and began to contemplate it more completely, writing the following:
Let N be the number of successors, and then consider four quantities:
- s (−1 to +1): estimated score
- v (0 to ?): uncertainty of the score estimate (from "variance")
- f (0 to 1): probability the position is substantially better than expected (from "fortuitous")
- c (0 to 1): probability the position is substantially worse than expected (from "catastrophic")
Informally, the quantities s and v are meant to describe the bulk of the distribution and the quantities f and c are meant to describe the tails (“the distribution” here being the distribution of belief about the true score of the position).
When switching perspectives, s negates, v remains the same, and f and c swap.
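To make the bookkeeping concrete (my own sketch, not part of the original notes), the four quantities and the perspective switch could be rendered as:

```python
from dataclasses import dataclass

@dataclass
class Quantities:
    s: float  # estimated score, -1 to +1
    v: float  # uncertainty of the score estimate
    f: float  # probability the position is substantially better than expected
    c: float  # probability the position is substantially worse than expected

    def flipped(self) -> "Quantities":
        """Switch perspectives: s negates, v is unchanged, f and c swap."""
        return Quantities(s=-self.s, v=self.v, f=self.c, c=self.f)
```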
The variables s, v, f, and c can also be indexed, e.g.
- f(i): the c of the i-th successor (the position that follows from making the i-th move)
- c(i): the f of the i-th successor
Then also consider:
- b(i): current belief that the i-th move is the best move or is as good as the best move
- p(i): probability of picking the i-th move

The move with the highest p (which is also the move with the highest b) is the one we'd pick if this were the root. Here I'm treating b(i) as in the range 0 to 1 and initialized to ½, to 1/N, or to something depending on s(i) and v(i) — but should it be log odds and initialized to 0, or something else?
When we write "$\sum_i$", this is shorthand for "$\sum_{i=1}^{N}$". The p(i) obey the property that $\sum_i p(i) = 1$; in other words, it's a distribution.
Perhaps the b(i) are also a distribution, and perhaps further b(i) and p(i) should be the same. As a theoretical matter, they are distinct; if all moves are precisely equally good, b(i) should be 1 for all i, but p(i) should be 1 / N. And p(i) should depend on v(i), f (i), and c(i) in addition to s(i), but b(i) may depend only on s(i).
In some sense, the c of a position is the f of the successor you will pick.
In another sense, the c of a position is the probability that all successors are catastrophic, since it only takes one of them being okay for the position to not be catastrophic.
The relevance of that observation depends on our belief that we'd be able to find the right move if this position became the root, so we blend the two with some factor β:

$$c = \beta \prod_i c(i) + (1 - \beta) \sum_i p(i)\, c(i)$$
This can be done asymmetrically — we may pick a higher β for our opponent than ourselves if we think they are less likely to make a mistake.
Similarly, $f = \sum_i p(i)\, f(i)$, and the probability that any successor is fortuitous is $1 - \prod_i (1 - f(i))$. We blend the two:

$$f = \beta \Bigl(1 - \prod_i \bigl(1 - f(i)\bigr)\Bigr) + (1 - \beta) \sum_i p(i)\, f(i)$$
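In code, the two blends look like this (a sketch of mine; the placement of β follows the reconstruction above, with β as the belief that we'd find the right move):

```python
import math

def blend_catastrophic(p, c, beta):
    """c of the position: blend 'all successors are catastrophic' with
    'the c of the move we expect to pick', weighting the former by beta."""
    all_catastrophic = math.prod(c)
    expected_pick = sum(pi * ci for pi, ci in zip(p, c))
    return beta * all_catastrophic + (1 - beta) * expected_pick

def blend_fortuitous(p, f, beta):
    """f of the position: blend 'any successor is fortuitous' with
    'the f of the move we expect to pick'."""
    any_fortuitous = 1 - math.prod(1 - fi for fi in f)
    expected_pick = sum(pi * fi for pi, fi in zip(p, f))
    return beta * any_fortuitous + (1 - beta) * expected_pick
```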
There is a defect: it oughtn't make a difference whether a bad move has a high chance of being unexpectedly worse or not, yet the product $\prod_i c(i)$ is sensitive to the c(i) of moves we would never pick.
Because of this defect, and a dissatisfaction with the arbitrariness and lack of theoretical simplicity, I abandoned this line of thinking as well.
This brings us to my ongoing endeavour. I have resolved to use discrete distributions over the interval from −1 to +1 subdivided into S parts. For q1a we will use a distribution of the belief that the successor of a position would be played were the position the root. Then q1b becomes rather simple: let $p_i$ for $i = 1, \dots, N$ be the policy for the N successors and $r_i(x)$ for $x = 1, \dots, S$ be the probability mass function returned by the search of the i-th successor. The variable x indexes the subintervals, so x = 1 refers to the interval from −1 to −1 + 2/S and x = S refers to the interval from +1 − 2/S to +1; more generally, x refers to the interval from −1 + 2/S × (x − 1) to −1 + 2/S × x. Then the distribution we will return is

$$r(x) = \sum_{i=1}^{N} p_i\, r_i(x).$$
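This step is a one-liner in code (a numpy sketch of mine; the names are hypothetical):

```python
import numpy as np

def q1b_mixture(policy, successor_pmfs):
    """Combine successor distributions into the node's distribution:
    r(x) = sum_i p_i * r_i(x), a mixture weighted by the policy.
    policy: shape (N,), sums to 1.  successor_pmfs: shape (N, S)."""
    return np.asarray(policy) @ np.asarray(successor_pmfs)  # shape (S,)
```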
Now we define q1a. Consider again two random variables $X$ and $Y$ with probability density functions $f_X$ and $f_Y$ and cumulative distribution functions $F_X$ and $F_Y$. If $X$ and $Y$ are independent, then

$$P(X \ge Y) = \int f_X(x)\, F_Y(x)\, dx,$$

or more simply $P(X \ge Y) = \mathbb{E}\bigl[F_Y(X)\bigr]$. We can extend this to three random variables:

$$P(X \ge Y \text{ and } X \ge Z) = \mathbb{E}\bigl[F_Y(X)\, F_Z(X)\bigr].$$

And more generally, in our discrete setting,

$$b_i = \sum_{x=1}^{S} r_i(x) \prod_{j \ne i} R_j(x),$$

where $R_j$ is the cumulative distribution function corresponding to $r_j$. This is our belief that the i-th successor is optimal (at least as good as any other successor). After normalization, these terms become our policy:

$$p_i = \frac{b_i}{\sum_{j=1}^{N} b_j}.$$
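Putting q1a together (again a numpy sketch of my own; this is the naive direct evaluation):

```python
import numpy as np

def q1a_policy(successor_pmfs):
    """Belief that each successor is optimal, assuming independent
    discrete distributions over the S subintervals.
    successor_pmfs: shape (N, S)."""
    r = np.asarray(successor_pmfs)
    R = np.cumsum(r, axis=1)                                # CDFs, shape (N, S)
    b = np.empty(len(r))
    for i in range(len(r)):
        others = np.prod(np.delete(R, i, axis=0), axis=0)   # prod_{j != i} R_j(x)
        b[i] = np.sum(r[i] * others)
    return b / b.sum()                                      # normalize into a policy
```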
There may be numerical difficulties when N is large because of the product $\prod_{j \ne i} R_j(x)$, which might vary wildly in magnitude depending on the value of x, but this completes q1 in concept.
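As for that numerical difficulty, one standard mitigation (my suggestion, not the essay's) is to accumulate the product in log space:

```python
import numpy as np

def stable_q1a_policy(successor_pmfs, eps=1e-300):
    """Compute b_i = sum_x r_i(x) * prod_{j != i} R_j(x), forming the
    product as a sum of log CDFs so it does not underflow for large N."""
    r = np.asarray(successor_pmfs)
    R = np.cumsum(r, axis=1)
    logR = np.log(np.maximum(R, eps))       # log CDFs, shape (N, S)
    total = logR.sum(axis=0)                # sum_j log R_j(x)
    b = np.empty(len(r))
    for i in range(len(r)):
        log_others = total - logR[i]        # exclude successor i
        b[i] = np.sum(r[i] * np.exp(log_others))
    return b / b.sum()
```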
I suspect that q2 admits no similarly simple answer. Much like how α/β search* is effectively made a best-first search through the heuristic application of reductions and extensions,† the most successful procedure is likely to be found empirically, perhaps as a synthesis of various methods. Like in α/β search, online learning is apt to be indispensable.

* Depth-first minimax search with iterative deepening, α/β pruning, and zero-window pruning.

† Whether by reducing/extending some nodes directly according to some criteria or by uniformly applying late move reduction and ordering successors according to some criteria.
At nodes for which we happen to have saved information about the successors' distributions, it may be reasonable to start by dividing time in proportion to $p_i$, or, if visitation counts $n_i$ (with total $n = \sum_i n_i$) are also available, by dividing time in proportion to

$$p_i + c \sqrt{\frac{\ln n}{n_i}}$$

for some constant c, analogous to ucb1, or perhaps

$$p_i + c\, h_i \frac{\sqrt{n}}{1 + n_i},$$

where c is scaled by roughly the square root of the variance and $h_i$ is an adjusting heuristic, analogous to puct. Note the absence of a logarithm in this second formula.
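Here is how the first, ucb1-flavoured allocation might look (my sketch; as stated, the formula itself is only a starting point):

```python
import numpy as np

def allocate_time(policy, visits, budget, c=1.0):
    """Split a time or node budget across successors in proportion to a
    ucb1-like score: policy plus an exploration bonus for rarely
    visited successors."""
    policy = np.asarray(policy, dtype=float)
    visits = np.asarray(visits, dtype=float)
    n = visits.sum()
    scores = policy + c * np.sqrt(np.log(max(n, 2.0)) / np.maximum(visits, 1.0))
    return budget * scores / scores.sum()
```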
Then search can proceed by iterative deepening, giving a node count to approximately divvy out recursively and progressively searching with higher counts.
This cannot work as stated.
Repeatedly taking the mixture of distributions, as in our prescription of $r(x) = \sum_i p_i\, r_i(x)$ for q1b, can only increase the variance, but we would expect the variance to generally decrease as the search deepens. On the other hand, repeatedly taking the maximum of two bounded random variables will reduce variance. This suggests that the idea of blending $r_{\text{opt}}$ and $r_{\text{pick}}$ was a decent one, as it addresses a fundamental problem. By $r_{\text{opt}}$ we mean the distribution whose cumulative distribution function is $\prod_i R_i(x)$, and by $r_{\text{pick}}$ we mean $\sum_i p_i\, r_i(x)$ as it was defined above.
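Concretely (my sketch, keeping the independence assumption across successors):

```python
import numpy as np

def r_opt(successor_pmfs):
    """Distribution of the maximum of the successors' scores: its CDF
    is the product of the successor CDFs (independence assumed)."""
    R = np.cumsum(np.asarray(successor_pmfs), axis=1)
    return np.diff(np.prod(R, axis=0), prepend=0.0)

def r_pick(policy, successor_pmfs):
    """Mixture of successor distributions under the policy."""
    return np.asarray(policy) @ np.asarray(successor_pmfs)
```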
It is a tricky matter, avoiding degeneracy under repeated application into either a flat or a narrow distribution, diffusing or collapsing quickly. What process would push the variance toward a stable value, but could be overcome given provocation?
We might consider two possible principles at play:
- A sharp policy (that is, having only a few good options, with a wide gap between those options and the remaining successors) should be cause for wariness, because the risk of catastrophe is higher,* and the variance should perhaps remain the same or increase mildly. In such cases we would interpolate almost entirely toward $r_{\text{pick}}$.
- A wide policy (that is, having many similar options) should be cause for confidence in the evaluation,† because the consequence of an individual error is lessened, and so the variance should perhaps decrease. In such cases we would interpolate toward $r_{\text{opt}}$.
To put it more succinctly, low dispersion in the policy drives interpolation toward $r_{\text{pick}}$ and high dispersion drives interpolation toward $r_{\text{opt}}$. This establishes some degree of negative feedback.
* We have been blithely neglecting the fact that the scores of successors are almost certainly not independent. If the most promising successor turns out to be unexpectedly bad, it is somewhat likely the other eminent successors will be worse than hitherto believed.
† Separately, it is also justification to begin focusing on a subset of successors, as it is relatively safe to do so; a desire to differentiate the successors would provide an impetus to do so.
At the present time, it remains an open question how this ought to be carried out exactly, or what heuristics ought to modify it.
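One conceivable instantiation, offered purely as a sketch of my own (the essay commits to nothing here), uses the normalized entropy of the policy as the dispersion measure, reusing r_opt and r_pick from the sketch above:

```python
import numpy as np

def blend_by_dispersion(policy, successor_pmfs):
    """Interpolate between r_pick and r_opt using normalized policy
    entropy as a stand-in for dispersion: a sharp policy (low entropy)
    leans toward r_pick, a wide one (high entropy) toward r_opt."""
    p = np.asarray(policy, dtype=float)
    entropy = -np.sum(p * np.log(np.maximum(p, 1e-12)))
    lam = entropy / np.log(len(p))   # normalized dispersion in [0, 1]
    return lam * r_opt(successor_pmfs) + (1 - lam) * r_pick(p, successor_pmfs)
```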
This essay was last updated on the fifteenth day of February in the two thousand twenty sixth year of the common era.