

# Unweighted Layered Graph Traversal: Passing a Crown via Entropy Maximization*

Xingjian Bai† &nbsp; Christian Coester‡ &nbsp; Romain Cosson§

*The arXiv version of the paper can be accessed at https://arxiv.org/abs/2404.16176
†MIT. The work was carried out while at University of Oxford.
‡University of Oxford.
§Inria, Paris.

## Abstract

Introduced by Papadimitriou and Yannakakis in 1989, layered graph traversal is a central problem in online algorithms and mobile computing that has been studied for several decades, and which is now essentially resolved in its original formulation. In this paper, we demonstrate that what appears to be an innocuous modification of the problem actually leads to a drastic (exponential) reduction of the competitive ratio. Specifically, we present an algorithm that is $\mathcal{O}(\log^{2} w)$-competitive for traversing unweighted layered graphs of width $w$. Our algorithm chooses the agent's position simply according to the probability distribution over the current layer that maximizes the sum of entropies of the induced distributions in the preceding layers.

## 1 Introduction

Exploring an unknown environment with a mobile agent is a fundamental task in robotics and artificial intelligence [PY91]. Despite its importance, very few models allow for rigorous worst-case analysis, especially when the environment is a general graph (or metric space). One successful model introduced in the field of online algorithms is called 'layered graph traversal', and it has been studied extensively since the 1990s [PY91, CL93, Ram95, Bur96, FFK+98, BCR22, BCR23, BCAGL23, CM24]. In this paper, we show that a simple and natural assumption on the problem - specifically, that the unknown graph is unweighted - drastically reduces its competitive ratio.

**Problem setting.** In layered graph traversal, a mobile agent starts from some node - called the source - of an unknown weighted graph $G=(V, E)$, and is tasked to reach some other node - called the target. The graph is divided into "layers", where the $i$-th layer refers to the set of nodes at combinatorial depth $i$ (i.e., $i$ hops away from the source). The agent is short-sighted in the sense that it only gets to see layer $i+1$ once it is located at layer $i$. Luckily, the agent is broad-sighted, in the sense that it sees all nodes and edges going from layer $i$ to layer $i+1$ when it reaches layer $i$. The problem is parameterized by the width $w$ of the graph, defined as the maximum number of nodes in a layer. The cost of the agent is defined as the total distance traveled until reaching the target. A deterministic (resp. randomized) algorithm for layered graph traversal is said to be $c_{w}$-competitive if its cost (resp. expected cost) is at most $c_{w}$ times the length of the shortest path from source to target. The deterministic competitive ratio of layered graph traversal is known to lie between $\Omega(2^{w})$ [FFK+98] and $\mathcal{O}(w 2^{w})$ [Bur96], and its randomized counterpart was settled to $\Theta(w^{2})$ in [BCR22, BCR23]. It is known [FFK+98] that layered graph traversal is equivalent to the special case where the graph is a tree and all edge lengths are either 0 or 1.

In the unweighted variant (where all edge lengths are exactly 1), a depth-first search approach (DFS) is clearly $(2w-1)$-competitive (see Appendix A).
Despite the broad interest that layered graph traversal received, no randomized algorithm was shown to improve over this bound until [CM24] recently proposed an $\mathcal{O}(\sqrt{w})$-competitive randomized algorithm in the unweighted setting.

**Our results.** In this paper, we show that the assumption that the graph is unweighted allows for an even more drastic improvement in the competitive ratio, reducing it to $\mathcal{O}(\log^{2} w)$. Our algorithm is remarkably simple: it maintains at all times that the probability distribution of the agent's location maximizes an entropy defined on the revealed part of the graph. The majority of this paper is dedicated to proving this randomized upper bound, based on extending techniques from the online mirror descent framework [BCL+18, BCLL21] with several new ideas. A few simple motivating results, such as tightness of $2w-1$ for deterministic algorithms (achieved by depth-first search), a randomized lower bound of $\Omega(\log w)$, and some natural but unsuccessful randomized algorithms, are surveyed in Appendix A.

Figure 1: Mario is a well-known example of an agent that is short-sighted but broad-sighted. The width $w$ corresponds to the screen height, with the target located somewhere off to the right. A natural question is: faced with the unknown maze ahead, which path should Mario choose?

**Applications: playing Mario, passing a crown and learning-augmented algorithms.** An immediate application of unweighted layered graph traversal is illustrated in Figure 1, where the famous video game character Mario is short-sighted and broad-sighted by design.

More generally, there are many situations where an agent is short-sighted in time (because it cannot predict the future) but broad-sighted in space (because it sees all past and current states of the world). For an amusing example, consider the following application: suppose a king or queen must design a rule for the passing of their crown and wishes to minimize how much it will "travel" between their descendants (supposedly, less movement leads to more political stability). They could decide that the leader of each new generation shall be chosen by an unweighted layered graph traversal algorithm. Note that the traditional method "give the crown to the closest living relative of the last holder" is equivalent to depth-first search, which has weaker competitive guarantees than the algorithm we introduce. Admittedly, the interest of today's monarchies in deviating from traditional methods might be limited.

The principle of passing a crown/baton applies to wider settings, such as combining strategies in the context of learning-augmented algorithms: [ACE+23a] recently proposed to use a layered graph traversal algorithm to choose between $w$ expert strategies, when there is a (metric) cost associated with switching between experts.
In the general setting (when the experts move at arbitrary speeds), a weighted layered graph traversal algorithm induces a competitive ratio in $\mathcal{O}(w^{2})$, but in settings where all expert strategies move at the same speed, an unweighted layered graph traversal algorithm could be employed, resulting in a competitive ratio in $\mathcal{O}(\log^{2} w)$ against the best combination of expert strategies.

### 1.1 Discussion of techniques

**Entropy maximization.** The downside of deterministic online algorithms is that they are completely predictable. This enables strong lower bounds, as an adversary can construct the input in the way that is most detrimental to the choices of the algorithm. Therefore, the key to achieving better competitive ratios via randomization lies in making algorithms less predictable. The motivating idea of our algorithm is to take this strategy to the extreme, by aiming for the most unpredictable configuration at all times. The unpredictability of a random decision is typically measured by its entropy, $\sum_{u} x_{u} \log \frac{1}{x_{u}}$, where $x_{u}$ denotes the probability of decision $u$.

To implement the idea of maximizing unpredictability, a first tempting idea might be to choose the probability distribution over nodes $\ell$ of the current layer that maximizes $\sum_{\ell} x_{\ell} \log \frac{1}{x_{\ell}}$, i.e., the uniform distribution. However, this ignores that the distribution over the current layer also induces distributions over all previous layers; for example, if all but one node of the current layer share the same ancestor in the preceding layer, then the uniform (high-entropy) distribution over the current layer would induce a very non-uniform (low-entropy) distribution over the ancestors in the preceding layer. We wish to maximize not only the entropy of the distribution over the current layer, but also the entropies of the induced distributions over all previous layers. Hence, our algorithm chooses the probability distribution over the current layer that maximizes the sum of entropies over all layers.
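To make this selection rule concrete, here is a minimal numerical sketch (not the authors' implementation) that, on a toy active tree whose leaves form the current layer, searches for the leaf distribution maximizing the sum of entropies of the induced distributions over all layers. The tree, the node names, the `parent` map and the use of `scipy.optimize.minimize` are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Toy active tree at round i = 2 (an assumption for illustration):
# root r has children a, b; the current layer consists of u1, u2 (below a) and u3 (below b).
parent = {"a": "r", "b": "r", "u1": "a", "u2": "a", "u3": "b"}
leaves = ["u1", "u2", "u3"]

def subtree_masses(x_leaves):
    """Propagate leaf probabilities upward: the mass of a node is the total mass below it."""
    mass = {u: 0.0 for u in parent}
    mass["r"] = 0.0
    for leaf, p in zip(leaves, x_leaves):
        mass[leaf] += p
        node = leaf
        while parent.get(node) is not None:
            node = parent[node]
            mass[node] += p
    return mass

def neg_sum_of_entropies(x_leaves):
    """Negative of the sum, over all layers, of the entropy of the induced distribution.
    Every non-root node belongs to exactly one layer, so summing over nodes is the same
    as summing the layer entropies."""
    mass = subtree_masses(x_leaves)
    return sum(m * np.log(m) for u, m in mass.items() if u != "r" and m > 0)

# Maximizing the sum of entropies = minimizing the sum of negative entropies,
# over the simplex of distributions on the current layer.
x0 = np.full(len(leaves), 1.0 / len(leaves))
res = minimize(
    neg_sum_of_entropies,
    x0,
    bounds=[(1e-9, 1.0)] * len(leaves),
    constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1.0}],
)
print(dict(zip(leaves, np.round(res.x, 3))))
# The optimum puts less than 1/3 of the mass on each of u1, u2: shifting mass toward u3
# also raises the entropy of the induced distribution on the previous layer {a, b}.
```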
**Relation to prior mirror descent based online algorithms.** Entropy-driven online algorithms have led to major advances recently [BCN14, BCL+18, BGMN19, BCLL21, CL22, BC22, BCR22]. Unlike our algorithm, these prior algorithms are not expressed as straightforwardly as merely maximizing entropy, but are formulated in the framework of online mirror descent. The entropic functionals (called regularizers) in these prior works involve a perturbation term $\delta>0$, e.g., $\Phi(x)=\sum_{u}(x_{u}+\delta) \log(x_{u}+\delta)$ in its simplest form, and an algorithm (in the continuous-time framework of [BCL+18, BCLL21]) is then described by a differential inclusion

$$
\nabla^{2} \Phi(\boldsymbol{x}) \dot{\boldsymbol{x}}-\gamma \in-N_{K},
$$

where $\dot{\boldsymbol{x}}$ is the time derivative of the configuration $\boldsymbol{x}$, $\gamma$ is the "desired direction of movement", and $N_{K}$ is the normal cone of the polytope $K$ of configurations. The interpretation of $\Phi$ in these cases is that it replaces the Euclidean geometry (which would correspond to the case where $\nabla^{2} \Phi(x)$ is replaced by the identity matrix, as in gradient descent) by a geometry adapted to the uncertainty of the underlying problem. Although our algorithm also satisfies such an inclusion, it can be formulated even more simply as

$$
x(i)=\underset{x \in K}{\arg \min }\, \Phi_{i}(x),
$$

where $\Phi_{i}(x)$ denotes the sum of negative entropies of all layers visible at time $i$. Unlike previous mirror descent based online algorithms, this formulation in fact yields a memoryless algorithm, since the selected $\Phi_{i}$ only depends on the tree distances between the nodes of the current layer. This also allows us to obtain an alternative expression of our algorithm through elementary operations (see Lemma 2.1).

**Achieving $\mathcal{O}(\log^{2} w)$.** The analysis of our algorithm is split into deactivation phases (where nodes that become irrelevant are deactivated) and growth phases (where new nodes are added), which at a high level resemble similar phases previously considered in the "evolving tree game" of [BCR22]. However, our methods of analysing these phases have little in common with [BCR22] and require several novel ideas.

Our regularizer can be equivalently written as

$$
\Phi_{i}(x)=\sum_{u} x_{u} \log x_{u}=\sum_{u} h_{p(u)} x_{u} \log \frac{x_{u}}{x_{p(u)}}, \tag{1.1}
$$

where the sum is over vertices $u$ of the Steiner tree of the current layer, $x_{u}$ is the algorithm's probability of residing in the subtree rooted at $u$, $p(u)$ is the parent of $u$, and $h_{u}$ denotes the round number $i$ minus the combinatorial depth of $u$ in the tree, i.e., the height of $u$ at round $i$. Note that this quantity is well-defined only because the tree is unweighted, and all nodes at the last layer are at the same combinatorial depth. See Appendix B for the identity between the two sums.
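As a sanity check of identity (1.1) (the paper proves it in Appendix B; the toy tree, node names and random masses below are illustrative assumptions), the following script compares both expressions on a small instance; the two sums agree up to floating-point error.

```python
import numpy as np

# Small active tree at round i = 2: root r (depth 0), internal nodes a, b (depth 1),
# leaves u1, u2 (children of a) and u3 (child of b) at depth 2.
i = 2
depth = {"r": 0, "a": 1, "b": 1, "u1": 2, "u2": 2, "u3": 2}
parent = {"a": "r", "b": "r", "u1": "a", "u2": "a", "u3": "b"}

rng = np.random.default_rng(0)
leaf_mass = rng.dirichlet(np.ones(3))  # random distribution over the current layer
x = {"u1": leaf_mass[0], "u2": leaf_mass[1], "u3": leaf_mass[2]}
x["a"] = x["u1"] + x["u2"]             # subtree masses of the internal nodes
x["b"] = x["u3"]
x["r"] = 1.0

# Left-hand side of (1.1): sum of x_u log x_u over all non-root vertices,
# i.e. the sum of negative entropies of all layers.
lhs = sum(x[u] * np.log(x[u]) for u in parent)

# Right-hand side of (1.1): conditional-entropy form, weighted by the height of the parent.
h = {u: i - d for u, d in depth.items()}
rhs = sum(h[parent[u]] * x[u] * np.log(x[u] / x[parent[u]]) for u in parent)

assert np.isclose(lhs, rhs)
print(lhs, rhs)
```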
The right-hand side of equation (1.1) reminds us of the "conditional entropy" regularizer employed in [CL22] for metrical task systems on hierarchically separated trees. This expression allows us to derive an expression for the change of the conditional probabilities $\frac{x_{u}}{x_{p(u)}}$ in each step (which does not involve Lagrange multipliers). However, the analysis in [CL22] requires that edge lengths decrease geometrically along root-to-leaf paths, and more crucially it only achieves a competitive ratio linear in the depth of the tree. In our case, the depth of the tree (even if merging edges as in [BCR22]) can be $\Omega(w)$, so an analysis similar to [CL22] would fail to achieve a sublinear competitive ratio. The high depth of the tree is also the obstacle leading to the $\Theta(w^{2})$ competitive ratio for weighted layered graph traversal in [BCR22]. To achieve the exponential improvement to $\mathcal{O}(\log^{2} w)$, a key idea of our proof is to separate the analysis of movement cost when handling deactivation phases, where the probability mass $x_{\ell}$ of a leaf $\ell$ decreases to 0, into two substeps that are analyzed differently depending on whether $x_{\ell}$ is smaller or greater than $1/\operatorname{poly}(w)$. These two substeps require two different potential functions, which we combine into an overall potential function that is compatible with both substeps of deactivation phases and also with growth phases.

Another interesting aspect - which echoes a technique in [CM24] - is that the value of the regularizer appears in the potential function, instead of the Bregman divergence with an offline configuration. The Bregman divergence would be problematic for us since, due to the lack of $\delta$-perturbations in $\Phi_{i}$, its Lipschitz constant can be $\Omega(w)$, even when evaluated only at the relevant online algorithm configurations.

### 1.2 Related work

The setting of layered graph traversal was introduced in a paper titled 'Shortest path without a map' by Papadimitriou and Yannakakis, motivated by applications in navigation robotics [PY91]. For the case of $w=2$, they showed that the deterministic variant of the problem is essentially equivalent to linear search, a problem which dates back to [Bel63] and for which the competitive ratio is equal to 9 [BYCR93]. The first competitive analysis for arbitrary values of $w \in \mathbb{N}$ was obtained by [FFK+98], which coined the term 'layered graph traversal' to emphasize how the agent collects information (i.e., the feedback). Following a series of improvements [FFK+98, Ram95, Bur96], the competitive ratio of the deterministic variant of the problem was proved to lie between $\Omega(2^{w})$ [FFK+98] and $\mathcal{O}(w 2^{w})$ [Bur96]. More recently, the competitive ratio of the randomized variant of the problem was resolved up to a constant factor, establishing it at $\Theta(w^{2})$ [BCR22, BCR23]. For the unweighted setting, [CM24] recently provided an $\mathcal{O}(\sqrt{w})$-competitive algorithm, leaving a substantial gap compared to the classical $\Omega(\log w)$ lower bound. This paper significantly narrows this gap, presenting an $\mathcal{O}(\log^{2} w)$-competitive algorithm.

The problem was also studied under the name 'metrical service systems' [CL93] to highlight its connection to the setting of metrical task systems [BLS92]. The denomination 'small set chasing' was also recently suggested by [BCR22] to point out its connection with more recent works on set chasing (see, e.g., [BLLS19, AGTG21, Sel20]). The problem is additionally related to several other problems in online computation, such as the $k$-taxi problem [CK19], whose lower bound is based on a reduction from layered graph traversal. Very recently, layered graph traversal has also found new applications in the context of learning-augmented algorithms [ACE+23b, ACE+23a].

**Exploration of an unknown environment by a mobile agent.** In layered graph traversal, the agent's objective can be restated as follows: the agent must build a complete map of the underlying graph using the fewest moves possible. Graph exploration with a mobile agent has attracted attention since the 19th century [Luc82, Tar95], marked by the formal introduction of depth-first search (DFS) as a maze-solving algorithm. DFS is particularly suited for exploring general unknown graphs when the agent is short-sighted and narrow-sighted: the agent traverses all $m$ edges in exactly $2m$ moves, which is optimal in the sense of the competitive ratio.
A natural extension of this problem is when the agent can also see its immediate neighborhood (a little less short-sighted) [KP94]. This setting is known as 'online graph exploration' [MMS12], and the goal is for the agent to visit all nodes while incurring a limited cost with respect to the optimal tour (i.e., the solution of the traveling salesman problem). The competitive ratio for that problem is still unsettled (it lies between $\Omega(1)$ and $\mathcal{O}(\log n)$, where $n$ is the number of nodes in the graph), but many restricted classes of graphs admit $\mathcal{O}(1)$-competitive algorithms, such as planar graphs or some minor-excluded graphs (see [BDHS23] and its references). In contrast, layered graph traversal proposes a different model of feedback where the agent gets to see simultaneously all nodes and edges at some given hop distance from its initial position. This setting is particularly well-motivated when the graph evolves with time (e.g., a phylogenetic tree). Many other graph exploration settings have been introduced, varying the capabilities of the agent: e.g., with limited memory [AKL+79], limited storage [YWB03], or considering multiple agents working as a team [FGKP06, CMV23, CM24]. For a recent review of graph exploration problems, see the introduction of [Cos24]. In contrast to the aforementioned examples, this paper studies the case of a single, fully-competent agent.

### 1.3 Notations and preliminaries

Consider a graph $G=(V, E)$ defined by a set of nodes $V \subset \mathcal{V}$ and edges $E$ over $V$, where $\mathcal{V}$ is an arbitrary countable set. The edges are undirected and unweighted, and the graph is connected. One node is called the root and is denoted by $r \in V$. For $i \in \mathbb{N}$, we denote by $\mathcal{L}(i)$ the set of nodes at combinatorial depth $i$, i.e., $i$ hops away from the root. We will refer to $\mathcal{L}(i)$ as the $i$-th layer of the graph. The width $w$ of the graph is defined as the maximum cardinality of its layers, i.e., $w=\max_{i}|\mathcal{L}(i)|$. As is standard in layered graph traversal, the goal of reaching a certain target can be equivalently simplified to reaching the last layer of the graph.

**Reduction to trees.** For any node $u \in V$, we denote by $i_{u}$ the combinatorial depth of $u$, i.e., the index satisfying $u \in \mathcal{L}(i_{u})$. At any step $i \geq i_{u}$, we call $h_{u}(i)=i-i_{u}$ the height of $u$. For any node $u \in V \backslash\{r\}$, we can arbitrarily pick a node of the preceding layer $p(u) \in \mathcal{L}(i_{u}-1)$ such that $(p(u), u) \in E$.¹ It is clear that traversing the tree formed by $T=\{(p(u), u) \mid u \in V \backslash\{r\}\} \subseteq E$ in an online manner is no less difficult than traversing the original graph. For this reason, we will focus the rest of the paper on the specific case where the layered graph is a tree. For any node $u \in V$, we denote by $c(u) \subset V$ the set of children of $u$, and by $\mathcal{L}_{u}(i)$ the subset of nodes of $\mathcal{L}(i)$ that are descendants of $u$, for some $i \in \mathbb{N}$. We also denote by $u \rightarrow r \subset V$ the set of all nodes on the path from $u$ to $r$, excluding $r$.

¹ Observe that the choice of $p(u)$ can be made online by the agent. The reduction from graphs to trees is standard in the literature on layered graph traversal. Here, we use the assumption that any non-root node is connected to a node in the preceding layer because a layer refers to the set of nodes at the same combinatorial depth. We note that this assumption is not required in the weighted variant of the problem; see [FFK+98]. For more motivation on the setting of this paper, we refer to [CM24].
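As an illustration of this reduction, here is a minimal sketch, assuming the revealed graph is given as an adjacency map `adj` and a list of `layers` (names and input format are not from the paper): for every non-root node we simply keep one arbitrary edge to the preceding layer.

```python
def layered_graph_to_tree(layers, adj):
    """Pick, for each non-root node, an arbitrary parent p(u) in the preceding layer.

    layers: list of lists of nodes, layers[i] = L(i) (layers[0] contains only the root).
    adj:    dict mapping each node to the set of its neighbours.
    Returns a dict parent[u] = p(u), defining the tree T = {(p(u), u)}.
    """
    parent = {}
    for i in range(1, len(layers)):
        previous = set(layers[i - 1])
        for u in layers[i]:
            # by assumption every non-root node has a neighbour in the preceding layer;
            # we take the first one found
            parent[u] = next(v for v in adj[u] if v in previous)
    return parent

# Toy example: root r, layer 1 = {a, b}, layer 2 = {c}, where c is adjacent to both a and b.
layers = [["r"], ["a", "b"], ["c"]]
adj = {"r": {"a", "b"}, "a": {"r", "c"}, "b": {"r", "c"}, "c": {"a", "b"}}
print(layered_graph_to_tree(layers, adj))  # e.g. {'a': 'r', 'b': 'r', 'c': 'a'}
```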
**Layered graph traversal algorithm.** Formally, a randomized unweighted layered graph traversal algorithm is defined by a function $\mathcal{A}: \mathcal{T} \times \mathcal{V} \rightarrow \mathcal{P}(\mathcal{V})$ mapping a tree $T \in \mathcal{T}$ and a node $u \in \mathcal{V}$ in the penultimate layer of $T$ to a distribution $\boldsymbol{x} \in \mathcal{P}(\mathcal{V})$ over the set of nodes in the ultimate layer of $T$. The algorithm is deterministic if the distribution is always supported on a unique node. The cost of the algorithm is defined as the total number of edges traversed by an agent that picks its next position using $\mathcal{A}$ at each layer and uses shortest paths to move between positions elected in consecutive layers. The total cost of an agent until reaching layer $i$ depends on the random choices taken by the agent (if the algorithm is randomized), and we therefore focus on its expectation, which we denote by $\operatorname{Cost}(i)$. The algorithm is $c_{w}$-competitive if, for any graph of width $w$ and any step $i$, it satisfies $\operatorname{Cost}(i) \leq c_{w}\, i$.

**Fractional perspective on unweighted layered graph traversal.** A classical equivalent perspective on layered graph traversal - which is referred to as the fractional view - is that the algorithm $\mathcal{A}: \mathcal{T} \rightarrow \mathcal{P}(\mathcal{V})$ maintains a distribution on the ultimate layer of the tree given as input. The movement cost charged to the algorithm is then the optimal transport cost between the two consecutive distributions. Given a fractional algorithm for layered graph traversal, one gets a randomized algorithm with the same cost by considering at each step an optimal coupling between the two consecutive distributions and sampling the next position of the agent according to the conditional distribution associated with its current position. We refer to [BCR22, Section 4] for more details on the reduction.
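The following sketch illustrates this rounding step on a toy instance. It is not the reduction of [BCR22]; it simply solves the optimal transport linear program between two consecutive leaf distributions with `scipy.optimize.linprog` (an illustrative choice, as are the distributions and distances below) and samples the next position from the conditional row of the coupling given the current position.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

# Toy data (assumptions for illustration): two old leaves, three new leaves,
# and the tree distance between every old leaf and every new leaf.
x_old = np.array([0.5, 0.5])          # distribution on layer i
x_new = np.array([0.25, 0.25, 0.5])   # distribution on layer i+1
dist = np.array([[1.0, 3.0, 3.0],     # dist[k, l] = tree distance from old leaf k to new leaf l
                 [3.0, 3.0, 1.0]])

# Optimal coupling: minimize sum_{k,l} pi[k,l] * dist[k,l] subject to the marginals.
m, n = dist.shape
A_eq = np.zeros((m + n, m * n))
for k in range(m):
    A_eq[k, k * n:(k + 1) * n] = 1.0   # row sums equal x_old
for l in range(n):
    A_eq[m + l, l::n] = 1.0            # column sums equal x_new
res = linprog(dist.ravel(), A_eq=A_eq, b_eq=np.concatenate([x_old, x_new]),
              bounds=(0, None))
pi = res.x.reshape(m, n)

# Rounding: if the agent currently sits at old leaf k, sample its next position
# from the conditional distribution pi[k, :] / x_old[k].
current = 0
p = pi[current] / pi[current].sum()    # equals pi[current] / x_old[current] up to solver tolerance
next_leaf = rng.choice(n, p=p)
print(pi, next_leaf)
```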
**The active tree and the active polytope.** At any round $i$, we say that a node is 'active' if it has a descendant in $\mathcal{L}(i)$, and we denote the set of all active nodes by $V(i)$. A node is 'deactivated' at round $i$ if it ceases to be active at that round, and the set of all nodes deactivated until round $i$ is denoted by $\bar{V}(i)$. We will essentially focus on the 'active tree' that is formed by the nodes in $V(i)$. Observe that this tree is alternatively defined as the Steiner tree of $\mathcal{L}(i) \cup\{r\}$. We will denote by $\boldsymbol{x}(i) \in \mathcal{P}(\mathcal{L}(i))$ the probability distribution associated with the position of the agent on the $i$-th layer. Instead of viewing $\boldsymbol{x}$ as a probability distribution, it will be useful to consider it more generally as a point of the active polytope $K(i)$, which we define as

$$
K(i)=\left\{\boldsymbol{x} \in \mathbb{R}^{V(i) \backslash\{r\}} \mid \forall u \in V(i) \backslash \mathcal{L}(i): x_{u}=\sum_{v \in c(u)} x_{v}\right\}.
$$

We emphasize that $x_{r}$ is defined once and for all to be 1. For that reason, it does not appear as a variable in the definition of the active polytope above (but $x_{r}$ does appear in one constraint, which effectively enforces $\sum_{\ell \in \mathcal{L}(i)} x_{\ell}(i)=1$). We note that a probability distribution must also satisfy non-negativity constraints, which are not explicit in $K(i)$. It will be obvious that the configuration $\boldsymbol{x}(i) \in K(i)$ defined by our algorithm satisfies the positivity constraints (with strict inequality) and can indeed be interpreted as $\boldsymbol{x}(i) \in \mathcal{P}(\mathcal{L}(i))$. We note that $K(i)$ is an affine subspace of $\mathbb{R}^{V(i) \backslash\{r\}}$. Denoting by $\boldsymbol{e}_{u} \in \mathbb{R}^{V(i) \backslash\{r\}}$ the canonical basis vector associated with the element $u \in V(i) \backslash\{r\}$, we define the normal cone $N_{K(i)}$ associated with $K(i)$,

$$
\begin{aligned}
N_{K(i)} & =\operatorname{Vect}\left(\left\{\boldsymbol{e}_{u}-\sum_{v \in c(u)} \boldsymbol{e}_{v}: u \in V(i) \backslash \mathcal{L}(i)\right\}\right) \\
& =\left\{\boldsymbol{\xi} \in \mathbb{R}^{V(i) \backslash\{r\}}: \xi_{u}=\lambda_{p(u)}-\lambda_{u} \text { with } \boldsymbol{\lambda} \in \mathbb{R}^{V(i)} \text { and } \lambda_{\ell}=0 \text { for } \ell \in \mathcal{L}(i)\right\},
\end{aligned}
$$

which is the vector space orthogonal to the affine subspace $K(i)$.

**Optimal transport in a tree.** For two given configurations $\boldsymbol{x}(i) \in K(i)$ and $\boldsymbol{x}(i+1) \in K(i+1)$, we denote the optimal transport cost (also known as the earth-mover distance) between $\boldsymbol{x}(i)$ and $\boldsymbol{x}(i+1)$ by $\operatorname{OT}(\boldsymbol{x}(i), \boldsymbol{x}(i+1))$. Extending the definition of a configuration to take value zero outside of the active polytope, the optimal transport cost on a tree is classically expressed as

$$
\operatorname{OT}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)=\sum_{u \in V}\left|x_{u}^{\prime}-x_{u}\right|.
$$

We note that this definition matches the usual definition of optimal transport on the metric space associated with the underlying tree.
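For intuition, here is a small sketch of this formula (the tree, node names and masses are illustrative assumptions, not taken from the paper): the optimal transport cost is obtained by summing, over all non-root nodes, the change in the mass of their subtrees.

```python
def tree_ot(parent, x, x_new):
    """Optimal transport cost between two configurations on an unweighted tree.

    parent:   dict mapping every non-root node to its parent (the root has no entry).
    x, x_new: dicts mapping leaves to their probability mass (absent nodes count as 0).
    Returns the sum over non-root nodes u of |x'_u - x_u|, where the value at an
    internal node is the total mass of the leaves in its subtree.
    """
    def subtree_mass(leaf_dist):
        m = {u: 0.0 for u in parent}      # one entry per non-root node
        for leaf, p in leaf_dist.items():
            node = leaf
            while node in parent:         # climb until the root is reached
                m[node] += p
                node = parent[node]
        return m

    m_old, m_new = subtree_mass(x), subtree_mass(x_new)
    return sum(abs(m_new[u] - m_old[u]) for u in parent)

# Round i: leaves u1 (below a) and u2 (below b); round i+1: children v1 of u1, v2 of u2.
parent = {"a": "r", "b": "r", "u1": "a", "u2": "b", "v1": "u1", "v2": "u2"}
print(tree_ot(parent, {"u1": 0.7, "u2": 0.3}, {"v1": 0.2, "v2": 0.8}))
# 3.0: moving 0.5 of mass along the 5-edge path u1 -> ... -> v2, plus the two 1-edge descents.
```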
**Derivatives and instantaneous movement costs.** The derivative of a differentiable real function $f(\cdot)$ with respect to a single variable $t \in \mathbb{R}$ will be denoted by $\partial_{t} f(t)$. The gradient of a twice-differentiable multivariate function $\Phi(\boldsymbol{x})$ will be denoted by $\nabla \Phi(\boldsymbol{x})$, and its Hessian will be denoted by $\nabla^{2} \Phi(\boldsymbol{x})$. For convenience, when the function that is differentiated is a configuration $\boldsymbol{x}(t) \in K(i)$, we will adopt the physicist's notation and denote its derivative $\partial_{t} \boldsymbol{x}(t)$ by $\dot{\boldsymbol{x}}$. Thus, at an arbitrary node $u \in V(i)$, we will also denote $\partial_{t} x_{u}(t)$ by $\dot{x}_{u}$. In general, we will often drop the dependence on time when it is clear from the context. We observe that if $\boldsymbol{x}(t) \in K(i)$ for all $t$, then $\dot{\boldsymbol{x}}(t)$ is orthogonal to the normal cone $N_{K(i)}$, i.e., $\dot{\boldsymbol{x}}(t) \cdot \boldsymbol{\xi}=0$ for all $\boldsymbol{\xi} \in N_{K(i)}$. We define the instantaneous movement cost at instant $t$ as

$$
\partial_{t} \operatorname{Cost}(t)=\sum_{u \in V(i)}\left|\dot{x}_{u}\right|.
$$

By the triangle inequality, it is clear that for any two configurations $\boldsymbol{x} \in K(i)$ and $\boldsymbol{x}^{\prime} \in K(i)$, if $\boldsymbol{x}(t)$ is a differentiable function of $t$ interpolating between $\boldsymbol{x}(a)=\boldsymbol{x}$ and $\boldsymbol{x}(b)=\boldsymbol{x}^{\prime}$ for $a \leq t \leq b$, then $\operatorname{OT}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) \leq \int_{a}^{b} \partial_{t} \operatorname{Cost}(t)\, \mathrm{d} t$.
