Sequential Decision Algorithms for Measurement-Based Impromptu Deployment of a Wireless Relay Network along a Line

We are motivated by the need, in some applications, for impromptu or as-you-go deployment of wireless sensor networks. A person walks along a line, starting from a sink node (e.g., a base-station), and proceeds towards a source node (e.g., a sensor) which is at an a priori unknown location. At equally spaced locations, he makes link quality measurements to the previous relay, and deploys relays at some of these locations, with the aim of connecting the source to the sink by a multihop wireless path. In this paper, we consider two approaches for impromptu deployment: (i) the deployment agent can only move forward (which we call a pure as-you-go approach), and (ii) the deployment agent can make measurements over several consecutive steps before selecting a placement location among them (which we call an explore-forward approach). We consider a light traffic regime, and formulate the problem as a Markov decision process, where the trade-off is among the power used by the nodes, the outage probabilities in the links, and the number of relays placed per unit distance. We obtain the structures of the optimal policies for the pure as-you-go approach as well as for the explore-forward approach. We also consider natural heuristic algorithms, for comparison. Numerical examples show that the explore-forward approach significantly outperforms the pure as-you-go approach. Next, we propose two learning algorithms for the explore-forward approach, based on Stochastic Approximation, which asymptotically converge to the set of optimal policies, without using any knowledge of the radio propagation model. We demonstrate numerically that the learning algorithms can converge (as deployment progresses) to the set of optimal policies reasonably fast and, hence, can be practical, model-free algorithms for deployment over large regions.


I. INTRODUCTION
There are situations in which a wireless sensor network (WSN) needs to be deployed in an impromptu or as-you-go fashion. One such situation is in emergencies, e.g., situational awareness networks deployed by first-responders such as firefighters or anti-terrorist squads. As-you-go deployment is also of interest when deploying networks over large terrains, such as forest trails, particularly when the network is temporary and needs to be quickly redeployed in a different part of the forest (e.g., to monitor a moving phenomenon such as groups of wildlife), or when the deployment needs to be stealthy (e.g., to monitor fugitives). Our work in this paper is motivated by the need for as-you-go deployment of wireless relay networks over large terrains, such as forest trails, where planned deployment would be time consuming and difficult. We consider the problem of as-you-go deployment of relay nodes along a line, between a sink node (e.g., the WSN base-station) and a source node (e.g., a sensor) (see Figure 1), where the single deployment agent (the person who is carrying out the deployment) starts from the sink node, places relay nodes along the line, and places the source node where required. In applications, the location at which sensor placement is required might only be discovered as the deployment agent walks (e.g., in an animal monitoring application, by finding a concentration of pugmarks, or a watering hole).
In the perspective of an optimal planned deployment, we would need to place relay nodes at all potential locations (see Figure 1) and measure the qualities of all possible links (between all pairs of potential locations) in order to decide where to place the relays. This approach would provide the global optimal solution, but the time and effort required might not be acceptable in the applications mentioned earlier. With impromptu deployment, the next relay placement locations depend on the radio link qualities to the previously placed nodes; these link qualities and also the source location are discovered as the agent walks along the line. Such an approach requires fewer measurements compared to planned deployment, but, in general, is suboptimal.
In this paper, we mathematically formulate the problems of impromptu deployment of relays along a line as optimal sequential decision problems. The cost of a deployment is evaluated as a linear combination of three components: the sum transmit power along the path, the sum outage probability along the path, and the number of relays deployed; we provide a motivation for this cost structure. We formulate relay placement problems that minimize the expected average cost per step. Our channel model accounts for path-loss, shadowing, and fading. We explore deployment with two approaches: (i) the pure as-you-go approach and (ii) the explore-forward approach. In the pure as-you-go approach, the deployment agent can only move forward; this approach is a necessity if the deployment needs to be quick. Due to shadowing, the path-loss over a link of a given length is random, and a more efficient deployment can be expected if link quality measurements at several locations along the line are compared and an optimal choice is made among them; we call this approach explore-forward. Explore-forward requires the deployment agent to retrace his steps, but it might provide a good compromise between deployment speed and deployment efficiency. We formulate each of these problems as a Markov decision process (MDP), obtain the optimal policy structures, illustrate their performance numerically, and compare them with reasonable heuristics. Next, we propose several learning algorithms and prove that each of them asymptotically converges to the optimal policy if we seek to minimize the average cost per unit distance for deployment over a long line. We also demonstrate the convergence rate of the learning algorithms via numerical exploration.

A. Related Work
In the existing literature, problems of impromptu deployment of wireless networks are addressed by heuristics and by experimentation. Howard et al., in [1], provide heuristic algorithms for incremental deployment of sensors in order to cover the deployment area. Souryal et al., in [2], address the problem of impromptu wireless network deployment with an experimental study of indoor RF link quality variation; a similar approach is also taken in [3]. The authors of [4] describe a breadcrumbs system for aiding firefighting inside buildings. Their work addresses the same class of problems as ours, with the requirement that the deployment agent stay connected to k previously placed nodes during the deployment process. Their work considers the trade-off between link qualities and the deployment rate, but does not provide any optimality guarantee for their deployment schemes. Bao and Lee, in [5], study the scenario where a group of first-responders, starting from a command centre, enter a large area where there is no communication infrastructure, and as they walk they place relays at suitable locations in order to stay connected among themselves as well as with the command centre. However, the approaches described above are based on heuristic algorithms, rather than on deriving algorithms from rigorous formulations; hence, these approaches do not provide any provable performance guarantees.
In our work we formulate impromptu deployment as a sequential decision problem, and derive optimal deployment policies. Recently, Sinha et al. ([6]) have provided an algorithm based on an MDP formulation in order to establish a multi-hop network between a sink and an unknown source location, by placing relay nodes along a random lattice path. Their model uses a deterministic mapping between power and wireless link length, and, hence, does not consider the effect of shadowing, which leads to statistical variability of the transmit power required to maintain a given link quality over links of the same length. The statistical variation of link qualities over space requires measurement-based deployment, in which the deployment agent makes placement decisions at a point based on the measurement of the power required to establish a link (with a given quality) to the previously placed node. Our previous work on this problem: We view this paper as a continuation of our earlier conference paper [7], which provides the first theoretical formulation of measurement-based impromptu deployment. While the current paper is devoted to problem formulations, derivations of deployment policies and their properties, and numerical exploration and comparison of the policies, in another conference paper [8] we provide results of using the algorithms to carry out actual deployments in a forest-like setting.

B. Organization
The rest of the paper is organized as follows. The system model and notation are described in Section II. Impromptu deployment with the pure as-you-go approach is described in Section III. Section IV addresses the problem of impromptu deployment with explore-forward. A numerical comparison between these two approaches is made in Section V. Section VI and Section VII describe the learning algorithms for the explore-forward approach. Numerical results on the rate of convergence of the learning algorithms are provided in Section VIII, followed by the conclusion. All proofs and some discussion are provided in the appendices.
II. SYSTEM MODEL AND NOTATION

Throughout this paper, we assume that the line is discretized into steps of length δ (see Figure 1), starting from the sink node. Each point, located at a distance of an integer multiple of δ from the sink node, is considered to be a potential location where a relay can be placed. As the single deployment agent walks along the line, at each step or at some subset of steps, he measures the link quality from the current location to the previous node; these measurements are used to decide the location and transmit power of the next relay node.

A. Channel Model and Outage Probability
We consider the usual aspects of path-loss, shadowing, and fading to model the wireless channel. The received power of a packet (say the k-th packet, k ≥ 1) in a particular link (i.e., a transmitter-receiver pair) of length r is given by:

$$P_{rcv,k} = P_T \, c \left(\frac{r}{r_0}\right)^{-\eta} W H_k \quad (1)$$

where $P_T$ is the transmit power, c is the path-loss at the reference distance $r_0$, η is the path-loss exponent, $H_k$ denotes the fading random variable seen by the k-th packet (e.g., it could be an exponentially distributed random variable under the Rayleigh fading model), and W denotes the shadowing random variable. $H_k$ captures the variation of the received power over time, and it takes independent values over different coherence times. The path-loss between a transmitter and a receiver at a given distance can have a large spatial variability around the mean path-loss (averaged over fading), as the transmitter is moved over different points at the same distance from the receiver; this is called shadowing. Shadowing is usually modeled as a log-normally distributed, random, multiplicative path-loss factor; in dB, shadowing is thus normally distributed, with standard deviation values as large as 8 to 10 dB reported in practice. Also, shadowing decorrelates over distances that depend on the sizes of the objects in the propagation environment (see [9]); our measurements in a forest-like region of the Indian Institute of Science campus confirmed log-normality of shadowing and gave a shadowing decorrelation distance of 6 meters (see [8]). In this paper, W is assumed to take values from a set W. We will denote by $p_W(w)$ the probability mass function or probability density function of W, depending on whether W is a countable set or an uncountable set, as in the case of log-normal shadowing.
A link is considered to be in outage if the received signal power drops (due to fading) below P rcv−min (e.g., below −88 dBm, a figure that we have obtained via experimentation for the popular TelosB "motes," see [10]). Since practical radios can only be set to transmit at a finite set of power levels, the transmit power of each node can be chosen from a discrete set, S := {P 1 , P 2 , · · · , P M }, where P 1 ≤ P 2 ≤ · · · ≤ P M . For a link of length r, a transmit power γ and any particular realization of shadowing W = w, the outage probability is denoted by Q out (r, γ, w), which is increasing in r and decreasing in γ, w (according to (1)).
Note that $Q_{out}(r, \gamma, w)$ depends on the fading statistics. For a link with shadowing realization w, if the transmit power is γ, the received power of a packet will be $P_{rcv} = \gamma c (r/r_0)^{-\eta} w H$. Outage is defined to be the event $P_{rcv} \leq P_{rcv-min}$. If H is exponentially distributed with mean 1 (i.e., for Rayleigh fading), then we have:

$$Q_{out}(r, \gamma, w) = \mathbb{P}\left(\gamma c \left(\frac{r}{r_0}\right)^{-\eta} w H \leq P_{rcv-min}\right) = 1 - e^{-\frac{P_{rcv-min}\,(r/r_0)^{\eta}}{\gamma c w}}$$
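As a concrete illustration, the Rayleigh-fading outage expression can be computed directly. The following sketch uses illustrative parameter values only (c = 1, r_0 = 1 m, η = 3, and P_rcv−min = −88 dBm ≈ 1.5849 × 10⁻¹² W); none of these values are prescribed by the analysis here.

```python
import math

def q_out(r, gamma, w, p_rcv_min=1.5849e-12, c=1.0, r0=1.0, eta=3.0):
    """Rayleigh-fading outage probability of a link of length r (metres):
    P(gamma * c * (r/r0)^(-eta) * w * H <= p_rcv_min), with H ~ Exp(1).
    Powers are in watts; -88 dBm is about 1.5849e-12 W.  All parameter
    values are illustrative, not taken from the paper."""
    mean_rcv = gamma * c * (r / r0) ** (-eta) * w  # received power averaged over fading
    return 1.0 - math.exp(-p_rcv_min / mean_rcv)
```

As expected from the formula, `q_out` is increasing in r and decreasing in both γ and w.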
The outage probability of a randomly chosen link of given length and given transmit power is a random variable, where the randomness comes from the spatial variation of link quality due to shadowing. The outage probability is measured by sending a sufficiently large number of packets over a link and computing the fraction of packets whose RSSI is below $P_{rcv-min}$.

B. Deployment Process and Related Notation
In this paper, we consider two approaches for deployment.

Pure as-you-go deployment: In this case, after placing a relay, the agent skips the next A steps (A ≥ 0), and sequentially estimates shadowing from the locations (A+1), (A+2), ..., (A+B). As the agent explores the locations (A+1), (A+2), ..., (A+B−1) and estimates the shadowing at those locations, at each step he decides whether to place a relay there, and if the decision is to place a relay, then he also decides at what transmit power the relay will operate. In this process, if he has walked (A+B) steps away from the previous relay, or if he encounters the source location within this distance, then he must place a node.

Explore-forward deployment: After placing a node, the deployment agent skips the next A locations (A ≥ 0) and estimates the shadowing $w := (w_{A+1}, w_{A+2}, \cdots, w_{A+B})$ to the previous node from locations (A+1), (A+2), ..., (A+B). Then he places the relay at one of the locations (A+1), (A+2), ..., (A+B) and repeats the same process for placing the next relay. This procedure is illustrated in Figure 2. If the source location is encountered within (A+B) steps from the previous node, then the source is placed.

(Footnote: Consider (1). If we transmit a sufficiently large number of packets on a link over multiple coherence times and record the received signal strength of all the packets, we can compute $\bar{P}_{rcv}$, the mean received signal power averaged over fading. If the realization of shadowing in that link is w, then $\bar{P}_{rcv} = P_T c (r/r_0)^{-\eta} w \, \mathbb{E}(H)$.)
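One measurement round of explore-forward can be sketched as follows. This is our own illustration, not the paper's optimal policy: the shadowing model (log-normal with 8 dB standard deviation), the radio parameters inside `measure_outage`, and the greedy rule of picking the (location, power) pair with the smallest hop cost γ + ξ_out·Q_out are all stand-in assumptions; the actual optimal selection rule is derived later via the MDP formulation.

```python
import math
import random

def measure_outage(r, gamma, w):
    # Rayleigh-fading outage for a link spanning r steps of delta = 6 m;
    # all parameter values here are illustrative.
    p_rcv_min, c, eta, delta = 1.5849e-12, 1.0, 3.0, 6.0
    mean_rcv = gamma * c * (r * delta) ** (-eta) * w
    return 1.0 - math.exp(-p_rcv_min / mean_rcv)

def explore_forward_step(A, B, powers, xi_out, rng):
    """One placement round: measure links of lengths A+1 .. A+B steps back to
    the previous node (i.i.d. log-normal shadowing, 8 dB std), then greedily
    pick the (location, power) pair with the smallest hop cost."""
    best = None
    for r in range(A + 1, A + B + 1):
        w = 10 ** (rng.gauss(0.0, 8.0) / 10.0)  # log-normal shadowing sample
        for gamma in powers:
            cost = gamma + xi_out * measure_outage(r, gamma, w)
            if best is None or cost < best[0]:
                best = (cost, r, gamma)
    return best  # (hop cost, chosen location in steps, chosen transmit power)
```

A usage example: `explore_forward_step(0, 4, [0.001, 0.01], 100.0, random.Random(1))` measures four candidate locations and returns the cheapest hop found.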
We will see later that, in both approaches, it is sufficient to measure the outage probabilities Q out (r, γ, w r ), A + 1 ≤ r ≤ A + B, γ ∈ S, and there is no need to explicitly measure shadowing in the links.
Choice of A and B: If the propagation environment is very good, or if we need to place a limited number of relays over a long line, it is very unlikely that a relay will be placed within the first few locations after the previous node. In such cases, link quality measurements at those first few locations are wasted, since shadowing is i.i.d. across links; we can then skip measurements at locations 1, 2, ..., A and make measurements only at locations (A+1), (A+2), ..., (A+B). Alternatively, we can simply choose A = 0. In general, the choice of A and B will depend on the constraints and requirements of the deployment. A larger value of A results in faster exploration of the line, since fewer locations need to be measured. For a fixed A, a larger value of B results in more measurements, and hence we can expect better performance on average. However, A and B must be chosen such that the random variable $Q_{out}(A+B, P_M, W)$ is within tolerable limits with high probability; otherwise the deployment agent might measure too many links that have very high outage probability and are not useful.

C. Independence of Shadowing Across Links
As shown in Figure 1, the sink is called Node 0, the relay closest to the sink is called Node 1, and the relays are enumerated as nodes {1, 2, 3, ...} as we walk away from the sink. The link whose transmitter is Node i and receiver is Node j is called link (i, j). A generic link is denoted by e. Let us recall that the length of each link is an integer multiple of the step size δ.
We assume that the shadowing at any two different links in the network is independent, i.e., $W^{(e_1)}$ is independent of $W^{(e_2)}$ for $e_1 \neq e_2$. This independence is a reasonable assumption if δ is chosen to be at least the decorrelation distance (see [9]) of shadowing. For the experimental setting (in the forest inside the Indian Institute of Science campus) described in Section II-A, we can safely assume independent shadowing at different potential locations if δ is greater than 6 m (see [8]).

D. Traffic Model
We consider a model where the traffic is so low that there is only one packet in the network at a time; we call this the "lone packet model." As a consequence of this assumption, there are no simultaneous transmissions to cause interference. This permits us to easily write down the communication cost on a path over the deployed relays. Such a traffic model is realistic for sensor networks that carry low duty cycle measurements, or just carry an occasional alarm packet. A design with the lone packet model can be the starting point for a design with desired positive traffic (see [11]). Also, even though the network is designed for the lone packet traffic, it will be able to carry some amount of positive traffic from the source to the sink (see [8] for experimental evidence of this claim; a five-hop line network deployed (using the methodology derived in this paper) over a 500 m long trail in a forest-like environment, was able to carry 127 byte packets at a rate of 4 packets per second, with end-to-end packet loss probability less than 1%).

E. Network Cost Structure
In this section we develop the cost that we use to evaluate the performance of a given deployment policy. Given the current location of the deployment agent with respect to the previous relay, and given the measurements made to the previous relay, a policy provides the placement decision (in the case of as-you-go deployment, whether or not to place the relay, and if so, at what power; in the case of explore-forward deployment, at which of the B locations to place the relay and at which power). A formal definition of a policy will be given later in this paper.
Let us denote the number of placed relays up to x steps (i.e., xδ meters) from the sink by N x (≤ x); define N 0 = 0. Since deployment decisions are based on measurements to already placed relays, and since the path-loss over a link is a random variable (due to shadowing), we see that {N x } x≥1 is a random process. In this paper we have assumed that each node forwards each packet to the immediately previously placed relay (e.g., with reference to Figure 1, the source forwards all packets to Relay 2, which, in turn, forwards all packets to Relay 1, etc.). See [7] for the considerably more complex possibility of relay skipping while forwarding packets.
When node i is placed, the deployment policy also prescribes the transmit power that this node should use, say, $\Gamma_i$; then the outage probability over the link (i, i−1), so created, is denoted by $Q_{out}^{(i,i-1)}$. We evaluate the cost of the deployed network, up to x steps, as a linear combination of three cost measures: (i) The number of relays placed, i.e., $N_x$. (ii) The sum outage, i.e., $\sum_{i=1}^{N_x} Q_{out}^{(i,i-1)}$. The motivation for this measure is that, for small values of $Q_{out}$, the sum outage is approximately the probability that a packet sent from the point x encounters an outage somewhere along the path from the point x back to the sink. (iii) The sum power over the hops, i.e., $\sum_{i=1}^{N_x} \Gamma_i$. This is a measure of the energy required to operate the network (see the discussion later in this section).
These three costs are combined into one cost measure by combining them linearly and taking the expectation (under a policy π), as follows:

$$\mathbb{E}^{\pi}\left[\sum_{i=1}^{N_x} \Gamma_i + \xi_{out} \sum_{i=1}^{N_x} Q_{out}^{(i,i-1)} + \xi_{relay} N_x\right] \quad (2)$$

The multipliers $\xi_{out} \geq 0$ and $\xi_{relay} \geq 0$ can be viewed as capturing the emphasis we wish to place on the corresponding measure of cost. For example, a large value of $\xi_{out}$ will aim for a network deployment with smaller end-to-end expected outage. We can view $\xi_{relay}$ as the cost of placing a relay. More formally, these cost multipliers also emerge as "Lagrange" multipliers if we formulate the problem of minimizing the energy cost subject to constraints on the other two costs. We will formalize this in Section II-F.
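For one realized deployment (i.e., before taking the expectation), the combined cost is just a linear combination of the three measures. A minimal sketch, with function and argument names of our own choosing:

```python
def network_cost(powers, outages, xi_out, xi_relay):
    """Combined cost of one realized deployment up to some point x:
    sum transmit power + xi_out * sum outage + xi_relay * number of relays.
    powers[i] and outages[i] refer to relay i+1 and its link to the previous
    node; the list length is the number of relays placed, N_x."""
    assert len(powers) == len(outages)
    return sum(powers) + xi_out * sum(outages) + xi_relay * len(powers)
```

For example, two relays at powers 1 and 2, link outages 0.1 and 0.2, with ξ_out = 10 and ξ_relay = 5, give a cost of 3 + 10·0.3 + 5·2 = 16.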
A Motivation for the Sum Power Objective: In case all the nodes have wake-on radios, the nodes normally stay in sleep mode, and each sleeping node draws a very small current from the battery (see [12]). When a node has a packet, it sends a wake-up tone to the intended receiver. The receiver wakes up and the sender transmits the packet. The receiver sends an ACK packet in reply. Clearly, the energy spent in transmission and reception of data packets governs the lifetime of a node, given that the ACK size is negligible compared to the packet size. We assume that a fixed modulation scheme is used, so that the transmission bit rate over all links is the same (e.g., in IEEE 802.15.4 radios, that are commonly used for sensor networking, the standard modulation scheme provides a bit rate of 250 Kbps). We also assume a fixed packet length.
Let $t_p$ be the transmission duration of a packet over a link, and suppose that node i (1 ≤ i ≤ $N_x$) uses power $\Gamma_i$ during transmission. Let $P_r$ denote the packet reception power expended in the electronics at any receiving node. If the packet generation rate ζ at the source is very small, the lifetime of the k-th node (1 ≤ k ≤ $N_x$) is $T_k := \frac{E}{\zeta (\Gamma_k + P_r) t_p}$ seconds (E is the total energy in a fresh battery). Hence, the rate at which we have to replace the batteries in the network from the sink up to distance x steps is given by

$$\sum_{k=1}^{N_x} \frac{1}{T_k} = \frac{\zeta t_p}{E} \left( \sum_{k=1}^{N_x} \Gamma_k + N_x P_r \right),$$

where the term $N_x P_r$ can be absorbed into $\xi_{relay}$. Hence, the battery depletion rate between the sink and the point x is proportional to $\sum_{k=1}^{N_x} \Gamma_k$. Note that $\sum_{k=1}^{N_x} \Gamma_k$ is the total transmit power used to send a packet from node $N_x$ to the sink node, since there are no collisions among packets transmitted from various nodes (due to lone-packet traffic; see Section II-D).
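A quick numeric check of this argument: with the lifetime formula above, the total battery-replacement rate is an affine function of the sum transmit power, and the receive-power term is proportional to the number of nodes (hence absorbable into ξ_relay). All parameter values below are illustrative only.

```python
def battery_replacement_rate(transmit_powers, E=10000.0, zeta=0.01,
                             t_p=0.004, P_r=0.06):
    """Sum over nodes of 1/T_k, with T_k = E / (zeta * (Gamma_k + P_r) * t_p).
    Algebraically this equals (zeta * t_p / E) * (sum_k Gamma_k + N_x * P_r),
    i.e., an affine function of the sum transmit power.  Parameter values
    (battery energy E in joules, rate zeta in packets/s, t_p in s, P_r in W)
    are illustrative."""
    return sum(zeta * (g + P_r) * t_p / E for g in transmit_powers)
```

The design point is that, for fixed ξ_relay accounting, minimizing the sum transmit power is equivalent to minimizing the battery depletion rate.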

F. Deployment Objective
We assume in this paper that the distance L to the source from the sink (at the start of the line) is a priori unknown, and no knowledge of its distribution is available. Hence, we assume that L = ∞ and use deployment policies that seek to minimize the average cost per step. This setting can also be useful when L is large (e.g., a long forest trail), or when we seek to deploy relays along various trails (the trails might be interconnected among themselves). Once deployed, such a chain of nodes can be used to realize and connect several source-sink pairs, or even each node could act as both a sensor and a relay.

1) The Unconstrained Problem: Motivated by the cost structure in (2) and the L = ∞ model, we seek to solve the following problem:

$$\min_{\pi \in \Pi} \; \limsup_{x \to \infty} \frac{1}{x} \, \mathbb{E}^{\pi}\left[\sum_{i=1}^{N_x} \Gamma_i + \xi_{out} \sum_{i=1}^{N_x} Q_{out}^{(i,i-1)} + \xi_{relay} N_x\right] \quad (3)$$

where π is a placement policy (i.e., deployment strategy), and Π is the set of all possible placement policies (to be formalized later). We formulate (3) as a long-term average cost Markov decision process (MDP).
2) Connection to a Constrained Problem: Note that (3) is the relaxed version of the following constrained problem, in which we seek to minimize the mean power per step subject to a constraint on the mean outage per step and a constraint on the mean number of relays per step:

$$\min_{\pi \in \Pi} \limsup_{x \to \infty} \frac{1}{x} \mathbb{E}^{\pi}\left[\sum_{i=1}^{N_x} \Gamma_i\right] \;\; \text{s.t.} \;\; \limsup_{x \to \infty} \frac{1}{x} \mathbb{E}^{\pi}\left[\sum_{i=1}^{N_x} Q_{out}^{(i,i-1)}\right] \leq \bar{q}, \;\; \limsup_{x \to \infty} \frac{1}{x} \mathbb{E}^{\pi}\left[N_x\right] \leq \bar{n} \quad (4)$$

(here $\bar{q}$ and $\bar{n}$ denote the constraint values). The following standard result tells us how to choose the Lagrange multipliers $\xi_{out}$ and $\xi_{relay}$ (see [13], Theorem 4.3):

Theorem 1: Consider the constrained problem (4). If there exists a pair $\xi^*_{out} \geq 0$, $\xi^*_{relay} \geq 0$ and a policy $\pi^*$ such that $\pi^*$ is the optimal policy of the unconstrained problem (3) under $(\xi^*_{out}, \xi^*_{relay})$, and the constraints in (4) are met with equality under $\pi^*$, then $\pi^*$ is an optimal policy for (4) as well.

III. DEPLOYMENT WITH THE PURE AS-YOU-GO APPROACH

A. Markov Decision Process (MDP) Formulation
Here we seek to solve problem (3) for the pure as-you-go approach. When the agent is r steps away from the previous node (A+1 ≤ r ≤ A+B), he measures the shadowing w on the link from the current location to the previous node. He uses the knowledge of (r, w) to decide whether to place a node at his current location, and what transmit power γ ∈ S to use if he places a relay. In this case, we formulate the impromptu deployment problem as a Markov decision process (MDP) with state space {A+1, A+2, ..., A+B} × W. At state (r, w), (A+1) ≤ r ≤ (A+B−1), w ∈ W, the action is either to place a relay and select a transmit power, or not to place. When r = A+B, the only feasible action is to place and select a transmit power γ ∈ S. If, at state (r, w), a relay is placed and it is set to use transmit power γ, a hop-cost of $\gamma + \xi_{out} Q_{out}(r, \gamma, w) + \xi_{relay}$ is incurred. A deterministic Markov policy π is a sequence of mappings $\{\mu_k\}_{k \geq 1}$ from the state space to the action space, and it is called a stationary policy if $\mu_k = \mu$ for all k. Given the state (i.e., the measurements), the placement decision is made according to the policy.

(Footnote: In [7], we considered the scenario where L is unknown, but there is prior information (e.g., the mean of L) on its distribution. This led us to model L as a geometrically distributed number of steps and to minimize the expected total cost of the network. The step length δ and the mean of L can be used to obtain the parameter of the geometric distribution, i.e., the probability θ that the line ends at the next step. In the current paper, we consider the case L ∼ Geometric(θ) (in steps) only for the pure as-you-go case, the reason being to exploit the connection between this model and the L = ∞ model.)

B. Formulation for L ∼ Geometric(θ)
Under the pure as-you-go approach, we will first minimize the expected total cost for L ∼ Geometric(θ), and then take θ → 0; this approach provides the policy structure for the average cost problem (see [14], Chapter 4).
In the L ∼ Geometric(θ) case, the deployment process regenerates (probabilistically) after placing a relay, because of the memoryless property of the geometric distribution, and because the deployment of a new node involves measuring the qualities of new links not measured before, whose shadowing is i.i.d. and independent of that of the previously measured links. The state of the system at such regeneration points is denoted by 0 (in addition, there are states of the form (r, w)). When the source is placed at the end of the line, the process terminates. Suppose N is the (random) number of relays placed, and node N+1 is the source node (as shown in Figure 1). We first seek to solve the following:

$$\min_{\pi \in \Pi} \; \mathbb{E}^{\pi}\left[\sum_{i=1}^{N+1} \left(\Gamma_i + \xi_{out} Q_{out}^{(i,i-1)}\right) + \xi_{relay} N\right] \quad (5)$$

We will first investigate this approach assuming finite W, and later generalize it to the case in which W is a Borel subset of the positive real line.

C. Bellman Equation
Let J(r, w) and J(0) denote the optimal expected cost-to-go at state (r, w) and at state 0, respectively. Note that here we have an infinite-horizon total-cost MDP with a finite state space and finite action space. Assumption P of Chapter 3 of [14] is satisfied, since the single-stage costs are nonnegative. Hence, by the theory developed in [14], we can restrict ourselves to the class of stationary deterministic Markov policies.
By Proposition 3.1.1 of [14], the optimal value function J(·) satisfies the Bellman equation, which is given by (an explanation follows the equations), for all (A+1) ≤ r ≤ (A+B−1):

$$J(r, w) = \min\left\{ \min_{\gamma \in S}\big(\gamma + \xi_{out} Q_{out}(r, \gamma, w)\big) + \xi_{relay} + J(0), \;\; \theta \, \mathbb{E}_W \min_{\gamma \in S}\big(\gamma + \xi_{out} Q_{out}(r+1, \gamma, W)\big) + (1-\theta) \, \mathbb{E}_W J(r+1, W) \right\}$$

$$J(A+B, w) = \min_{\gamma \in S}\big(\gamma + \xi_{out} Q_{out}(A+B, \gamma, w)\big) + \xi_{relay} + J(0)$$

$$J(0) = \sum_{k=1}^{A+1} (1-\theta)^{k-1} \theta \, \mathbb{E}_W \min_{\gamma \in S}\big(\gamma + \xi_{out} Q_{out}(k, \gamma, W)\big) + (1-\theta)^{A+1} \, \mathbb{E}_W J(A+1, W) \quad (6)$$

These equations are understood as follows. If the current state is (r, w), (A+1) ≤ r ≤ (A+B−1), and the line has not ended yet, we can either place a relay and set its transmit power to some γ ∈ S, or not place. If we place, the cost $\min_{\gamma \in S}(\gamma + \xi_{out} Q_{out}(r, \gamma, w)) + \xi_{relay}$ is incurred at the current step, and the cost-to-go from there is J(0). If we do not place a relay, the line will end with probability θ at the next step, in which case a cost $\mathbb{E}_W \min_{\gamma \in S}(\gamma + \xi_{out} Q_{out}(r+1, \gamma, W))$ will be incurred. If the line does not end at the next step, the next state will be a random state (r+1, W), and a mean cost-to-go of $\mathbb{E}_W J(r+1, W)$ will be incurred. At state (A+B, w), the only possible decision is to place a relay. At state 0, the deployment agent starts walking until he encounters the source location or location (A+1); if the line ends at step k, 1 ≤ k ≤ A+1 (with probability $(1-\theta)^{k-1}\theta$), a cost of $\mathbb{E}_W \min_{\gamma \in S}(\gamma + \xi_{out} Q_{out}(k, \gamma, W))$ is incurred. If the line does not end within (A+1) steps (this event has probability $(1-\theta)^{A+1}$), the next state will be (A+1, W).
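When W is finite, the fixed point of these equations can be computed by straightforward value iteration. The sketch below is our own direct (unoptimized) illustration: the outage function, weights, and θ passed in are placeholders, and the paper later notes a cheaper iteration that tracks only the W-averaged values.

```python
def solve_bellman(A, B, powers, W_pmf, q_out, xi_out, xi_relay, theta,
                  n_iter=2000):
    """Value iteration for the pure as-you-go Bellman equation with finite
    shadowing set (L ~ Geometric(theta)).  W_pmf: dict {w: p_W(w)};
    q_out(r, gamma, w): outage probability of a link of length r steps.
    Returns (J0, J), where J is a dict over states (r, w)."""
    def hop_cost(r, w):                     # min over gamma of gamma + xi_out * Q_out
        return min(g + xi_out * q_out(r, g, w) for g in powers)

    rs = list(range(A + 1, A + B + 1))
    # E_W min_gamma(...) for every link length needed below
    Ehop = {r: sum(p * hop_cost(r, w) for w, p in W_pmf.items())
            for r in range(1, A + B + 1)}
    J = {(r, w): 0.0 for r in rs for w in W_pmf}
    J0 = 0.0
    for _ in range(n_iter):
        EJ = {r: sum(p * J[(r, w)] for w, p in W_pmf.items()) for r in rs}
        J0_new = (sum((1 - theta) ** (k - 1) * theta * Ehop[k]
                      for k in range(1, A + 2))
                  + (1 - theta) ** (A + 1) * EJ[A + 1])
        J_new = {}
        for r in rs:
            for w in W_pmf:
                place = hop_cost(r, w) + xi_relay + J0
                if r < A + B:               # may also keep walking
                    cont = theta * Ehop[r + 1] + (1 - theta) * EJ[r + 1]
                    J_new[(r, w)] = min(place, cont)
                else:                       # at r = A + B we must place
                    J_new[(r, w)] = place
        J, J0 = J_new, J0_new
    return J0, J
```

At convergence, J(A+B, w) equals the hop cost plus ξ_relay plus J(0), as the "must place" equation requires.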
E. Policy Structure: OptAsYouGo Algorithm

Lemma 1: J(r, w) is increasing in r, $\xi_{out}$, and $\xi_{relay}$, decreasing in w, and jointly concave in $\xi_{out}$ and $\xi_{relay}$. J(0) is increasing and jointly concave in $\xi_{out}$ and $\xi_{relay}$.
Proof: See Appendix A. Next, we propose an optimal algorithm for impromptu deployment under the pure as-you-go approach.
Algorithm 1: (OptAsYouGo Algorithm) At state (r, w) (where A+1 ≤ r ≤ A+B−1), place a relay if and only if $\min_{\gamma \in S}(\gamma + \xi_{out} Q_{out}(r, \gamma, w)) \leq c_{th}(r)$, where $c_{th}(r)$ is a threshold increasing in r. If the decision is to place a relay, the optimal power to be selected is $\arg\min_{\gamma \in S}\{\gamma + \xi_{out} Q_{out}(r, \gamma, w)\}$. At state (A+B, w), select the transmit power $\arg\min_{\gamma \in S}\{\gamma + \xi_{out} Q_{out}(A+B, \gamma, w)\}$.
Theorem 2: Under the pure as-you-go approach, Algorithm 1 provides the optimal policy for Problem (3).
Proof: See Appendix A.

From now on, we will refer to Algorithm 1 as OptAsYouGo (Optimal algorithm with the pure As-You-Go approach). Remark: Note that, in order to make a placement decision, one need not explicitly measure the shadowing w in a given link; measuring the outage probabilities at each transmit power level γ ∈ S for a given link suffices. In fact, we have taken (r, w) as a typical state for simplicity of presentation; so long as the channel model given by (1) is valid, we can take $(r, \{Q_{out}(r, \gamma, w)\}_{\gamma \in S})$ as a typical state.
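In line with the remark that measured outage probabilities suffice, the decision of Algorithm 1 at one location can be sketched directly in those terms. The threshold function c_th(r) comes from the value computation and is passed in here as a given callable; the function and argument names are ours.

```python
def opt_as_you_go_decision(r, measured_outages, c_th, xi_out, A, B):
    """One placement decision of the OptAsYouGo threshold rule.
    measured_outages: dict mapping each transmit power gamma to the measured
    Q_out(r, gamma, w) at the current location; c_th: the (precomputed,
    increasing-in-r) threshold function.  Returns the chosen transmit power
    if a relay should be placed here, else None (keep walking)."""
    gamma_best, cost_best = min(
        ((g, g + xi_out * q) for g, q in measured_outages.items()),
        key=lambda t: t[1])
    if r >= A + B or cost_best <= c_th(r):
        return gamma_best    # place a relay here, at the best power
    return None              # best hop cost exceeds the threshold: move on
```

At r = A + B the rule always places, matching the forced-placement action of the MDP.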
Remark: The trade-off in the impromptu deployment problem is that if we place relays far apart, the cost due to outage increases, but the cost of placing the relays decreases. The intuition behind the threshold structure of the policy is that if at distance r we get a good link with the combination of power and outage less than a threshold, then we should accept that link because moving forward is unlikely to yield a better link. c th (r) is increasing in r. Since Q out (r, γ, w) is increasing in r for any γ, w, and since shadowing is i.i.d across links, the probability of a link (to the previous node) having desired QoS decreases as we move away from the previous node. Hence, the optimal policy will try to place relays as soon as possible if r is large, and this explains why c th (r) is increasing in r. Note that the threshold c th (r) does not depend on w, due to the fact that shadowing is i.i.d. across links. Multiplying both sides of the value iteration by p W (w) and summing over w ∈ W, we obtain an iteration in terms of V (k) (·) and this iteration does not involve J (k) (·). Since J (k) (r, w) ↑ J(r, w) for each r, w and J (k) (0) ↑ J(0) as k ↑ ∞, we can argue that V (k) (r) ↑ E W J(r, W ) = V (r) for all r (by Monotone Convergence Theorem) and V (k) (0) ↑ J(0) = V (0). Then we can compute c th (r) by knowing V (·) itself (see the expression of c th (r) in the proof of Theorem 2); we need not keep track of the cost-to-go values J (k) (r, w) for each state (r, w), at each stage k. Here we simply need to keep track of V (k) (·).

F. Computation of the Optimal Policy
Similar iterations were proposed in our prior work [7] for a slightly different model; please see [7], Section III-A-5 for a detailed derivation.
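To make the shape of such an iteration concrete, the following is a minimal sketch of a value iteration carried out on V(r) alone, as in the remark above. It is not the exact recursion of [7]: the transition assumptions (the line ends with probability θ at each step, placement is forced at r = A + B, and a placement resets the decision state to r = A + 1), the toy shadowing alphabet, and the power levels are all illustrative assumptions.

```python
import math

A, B, theta = 0, 5, 0.01
S = [10**-1.8, 10**-0.4, 10**0.5]      # assumed power levels in mW
W = [0.5, 1.0, 2.0]                    # toy shadowing alphabet, uniform pmf
xi_out, xi_relay = 100.0, 1.0

def q_out(r, g, w):                    # assumed Rayleigh-fading outage form
    return 1.0 - math.exp(-(10**-9.7) * (20.0 * r) ** 4.7 / (g * 10**0.17 * w))

def place_cost(r, w):                  # best power + outage cost for this link
    return min(g + xi_out * q_out(r, g, w) for g in S) + xi_relay

pc = {(r, w): place_cost(r, w) for r in range(A + 1, A + B + 1) for w in W}

V = {r: 0.0 for r in range(A + 1, A + B + 1)}
for _ in range(3000):                  # iterate V(r) to (near) convergence
    newV = {}
    for r in range(A + 1, A + B + 1):
        total = 0.0
        for w in W:
            place = pc[(r, w)] + (1 - theta) ** (A + 1) * V[A + 1]
            if r < A + B:
                total += min(place, (1 - theta) * V[r + 1])
            else:
                total += place         # forced placement at r = A + B
        newV[r] = total / len(W)
    diff = max(abs(newV[r] - V[r]) for r in V)
    V = newV
```

In this sketch, a threshold can then be read off from V(·) alone: placing at distance r is preferable exactly when the placement cost is at most (1 − θ)V(r + 1) − (1 − θ)^{A+1}V(A + 1), mirroring the role of c_th(r).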

G. Average Cost Problem: Optimality of OptAsYouGo
Note that the problem (5) can be considered as an infinite horizon discounted cost problem with discount factor (1 − θ). Hence, keeping in mind that we have finite state and action spaces, we observe that for the discount factor sufficiently close to 1, i.e., for θ sufficiently close to 0, the optimal policy for problem (5) is optimal for problem (3) (see [14], Proposition 4.1.7). In particular, the optimal average cost per step with the pure as-you-go approach, λ*, is given by λ* = lim_{θ→0} θ J_θ(0) (see [14], Section 4.1.1), where J_θ(0) is the optimal cost for problem (5) under pure as-you-go when the probability of the line ending at the next step is θ. Now suppose that W is a Borel subset of the real line. In this case, we still have a finite action space and a bounded, nonnegative cost per step. We can still write the Bellman Equation (6) for the case L ∼ Geometric(θ). We see that 0 ≤ J(A + B, w) − J(0) ≤ P_M + ξ_out + ξ_relay. Now, using the fact that 0 ≤ θ J(0) ≤ P_M + ξ_out + ξ_relay, we can prove by induction that |J(r, w) − J(0)| is uniformly bounded across θ ∈ (0, 1), r ∈ {A + 1, A + 2, ..., A + B}, w ∈ W, and that it is equicontinuous in w for all θ ∈ (0, 1). Hence, by Theorem 5.5.4 of [15], the optimal average cost per step is again λ* = lim_{θ→0} θ J_θ(0). As θ ↓ 0, we obtain a sequence of optimal policies (i.e., mappings from the state space to the action space), and a limit point of this sequence is an average cost optimal policy.

H. HeuAsYouGo: A Suboptimal Pure As-You-Go Heuristic
This is a modified version of the deployment algorithm proposed in [2]. The algorithm is just a natural heuristic; it has not been derived from any sequential optimization formulation.
Algorithm 2: (HeuAsYouGo) The power used by the relays is set to a fixed value. At each potential location, the deployment agent checks whether the outage to the previous relay meets a certain predetermined target with this fixed transmit power level. After placing a relay, the next relay is placed at the last location where the target outage is met; in the unlikely situation where the target outage is violated at the (A + 1)-st location itself, he places at the (A + 1)-st location. If the agent reaches the (A + B)-th step with the outage target met at all locations so far, he must place the next relay at step (A + B).
This algorithm requires the deployment agent to move back by one step and place a relay there, in case the outage target is violated for the first time at the (A + 2)-nd step or beyond.
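The placement rule of Algorithm 2 can be sketched in a few lines. The function meets_target below is a hypothetical stand-in for the field measurement "does the outage to the previous relay meet the target at step r with the fixed power?"; its name and the sequential scan are illustrative assumptions.

```python
def heu_as_you_go_next(meets_target, A, B):
    """Placement step chosen by the HeuAsYouGo rule (a sketch of Algorithm 2):
    walk forward and place at the last step where the outage target was met;
    place at A+1 if the target fails there already; place at A+B if the
    target is met all the way to the last allowed step."""
    if not meets_target(A + 1):
        return A + 1                       # target violated at the first option
    for r in range(A + 2, A + B + 1):
        if not meets_target(r):
            return r - 1                   # step back one: last good location
    return A + B                           # forced placement at the last step

# First violation at step 4 -> step back and place at step 3.
assert heu_as_you_go_next(lambda r: r <= 3, A=0, B=5) == 3
```

The single backward step in the loop body is exactly the "move back by one step" requirement noted above.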

A. Semi-Markov Decision Process (SMDP) Formulation
Here we seek to solve the unconstrained problem (3). We formulate our problem as a Semi-Markov Decision Process (SMDP) with state space W^B and action space {A+1, A+2, ..., A+B} × S. The vector w := (w_{A+1}, w_{A+2}, ..., w_{A+B}), i.e., the shadowing at the B candidate locations, is the state of our SMDP. In state w, an action (u, γ) ∈ {A+1, A+2, ..., A+B} × S is taken, where u is the distance from the previous relay at which the next relay is placed and γ is the transmit power that this relay will use. In this case, a hop-cost of γ + ξ_out Q_out(u, γ, w_u) + ξ_relay is incurred. After placing a relay, the next state becomes w' := (w'_{A+1}, w'_{A+2}, ..., w'_{A+B}) with probability g(w') := ∏_{r=A+1}^{A+B} p_{W_r}(w'_r) (since shadowing is i.i.d. across links). Let us denote by the vector-valued random variable W(k) the (random) state at the k-th decision instant, and by µ_k(W(k)) the action at the k-th decision instant. For a deterministic Markov policy {µ_k}_{k≥1}, let us define the functions µ_k^{(1)} : W^B → {A+1, A+2, ..., A+B} and µ_k^{(2)} : W^B → S as follows: if µ_k(w) = (u, γ), then µ_k^{(1)}(w) = u and µ_k^{(2)}(w) = γ.

B. Policy Structure: Algorithm OptExploreLim
Note that, under any policy, W(k) is i.i.d. across k, k ≥ 1. The state space is a Borel space and the action space is finite. The hop cost and hop length (in number of steps) are uniformly bounded across all state-action pairs. Hence, we can restrict attention to stationary deterministic policies (see [16] for finite state space, i.e., finite W, and [17] for a general Borel state space, i.e., when W is a Borel set). In our setting, the optimal average cost per step, λ*, exists (in fact, the limit exists) and is the same for all states, i.e., for all w ∈ W^B. For simplicity, we work with finite W in this section, but the policy structure holds for a Borel state space as well.
We next present a deployment algorithm called "OptExploreLim," an optimal algorithm under limited exploration.
Algorithm 3: (OptExploreLim Algorithm) In the state w, which is captured by the measurements {Q_out(u, γ, w_u)} for A+1 ≤ u ≤ A+B, γ ∈ S, place the new relay according to the policy µ* (later we will also use the notation π* or π*(ξ_out, ξ_relay) to denote the same policy):

(u*, γ*) ∈ argmin_{A+1 ≤ u ≤ A+B, γ ∈ S} { γ + ξ_out Q_out(u, γ, w_u) + ξ_relay − λ* u },   (7)

where λ* (or λ*(ξ_out, ξ_relay)) is the optimal average cost per step for the Lagrange multipliers (ξ_out, ξ_relay).
Theorem 3: The policy µ * given by Algorithm 3 is optimal for the problem (3) under the explore-forward approach.
Proof: The optimality equation for the SMDP is given by (see [16])

v*(w) = min_{(u,γ)} { γ + ξ_out Q_out(u, γ, w_u) + ξ_relay − λ* u + Σ_{w'∈W^B} g(w') v*(w') },   (8)

where v*(w) is the optimal differential cost corresponding to state w. The structure of the optimal policy is immediate from (8), since Σ_{w'∈W^B} g(w') v*(w') does not depend on (u, γ) (note that v*(w) in (8) is obtained after taking the minimum over (u, γ)).
Later we will also use the notation π * (ξ out , ξ relay ) to denote the OptExploreLim policy under the pair (ξ out , ξ relay ).
Remark 1: The same optimality equation and optimal policy structure hold when we have a Borel state space (e.g., for log-normal shadowing), by the theory presented in [17].
Remark 2: Note that the optimal policy depends on the state w only through the outage probabilities, which can easily be measured by the agent.
Remark 3: If we take an action (u, γ), a cost (γ + ξ out Q out (u, γ, w u ) + ξ relay ) will be incurred. On the other hand, if we incur a cost of λ * over each one of those u steps, the total cost incurred will be λ * u. The policy selects the placement point that minimizes the difference between these two. Note that, the deployment process regenerates at each placement point (due to i.i.d shadowing across links).
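The reading of the rule in Remark 3 can be made concrete in a few lines. The dictionary qout of measured outage probabilities and all numbers below are synthetic; only the argmin form follows Algorithm 3.

```python
def opt_explore_lim(qout, S, A, B, lam, xi_out, xi_relay):
    """OptExploreLim decision (Algorithm 3): among candidate distances u and
    powers gamma, pick the pair minimizing the hop cost minus lam * u,
    i.e., the excess of the hop cost over a cost of lam per step travelled."""
    cands = [(u, g) for u in range(A + 1, A + B + 1) for g in S]
    return min(cands, key=lambda c: c[1] + xi_out * qout[c] + xi_relay - lam * c[0])

# Synthetic measurements for B = 3 candidate locations, two power levels.
qout = {(1, 1.0): 0.01, (1, 3.0): 0.001, (2, 1.0): 0.10,
        (2, 3.0): 0.02, (3, 1.0): 0.50, (3, 3.0): 0.20}

assert opt_explore_lim(qout, [1.0, 3.0], 0, 3, lam=1.0, xi_out=10, xi_relay=1) == (2, 1.0)
# A larger per-step cost estimate lam favors longer hops (here at higher power):
assert opt_explore_lim(qout, [1.0, 3.0], 0, 3, lam=5.0, xi_out=10, xi_relay=1) == (3, 3.0)
```

The second assertion illustrates the regeneration view of Remark 3: when each step "costs" more (larger λ*), stretching the hop saves more than the extra power and outage cost.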
Remark 4: Also, note that, the policy requires the deployment agent to know λ * . But computation of λ * will require perfect knowledge of propagation environment (e.g., the pathloss exponent η in (1), the distribution of shadowing in a link, etc.); see Section IV-C. Later we will propose two learning algorithms in Section VI and Section VII, which will not require such knowledge of the propagation environment.
Theorem 4: The optimal average cost per step λ * (ξ out , ξ relay ) is jointly concave, increasing and continuous in ξ out and ξ relay .
Proof: See Appendix B.
Let us consider a sub-class of stationary deployment policies (parameterized by λ ≥ 0, ξ_out ≥ 0 and ξ_relay ≥ 0) given by:

(u, γ) ∈ argmin_{A+1 ≤ u ≤ A+B, γ ∈ S} { γ + ξ_out Q_out(u, γ, w_u) + ξ_relay − λ u },   (9)

where λ is not necessarily equal to λ*(ξ_out, ξ_relay).
Under the class of policies given by (9), let {U_k}_{k≥1}, {Γ_k}_{k≥1} and {Q_out,k}_{k≥1} denote the sequences of inter-node distances, transmit powers and link outage probabilities that the policy yields during the deployment process. By the assumption of i.i.d. shadowing across links, each of these sequences is i.i.d. Let Γ(λ, ξ_out, ξ_relay), Q_out(λ, ξ_out, ξ_relay) and U(λ, ξ_out, ξ_relay) denote the mean power per link, mean outage per link and mean placement distance (in steps), respectively, under the policy given by (9), where λ is not necessarily equal to λ*(ξ_out, ξ_relay). Also, let Γ*(ξ_out, ξ_relay), Q*_out(ξ_out, ξ_relay) and U*(ξ_out, ξ_relay) denote the optimal mean power per link, the optimal mean outage per link and the optimal mean placement distance (in steps), respectively, under the OptExploreLim algorithm (i.e., the policy π*(ξ_out, ξ_relay), obtained when λ in (9) is replaced by λ*(ξ_out, ξ_relay)). By the Renewal-Reward theorem, the optimal mean power per step, the optimal mean outage per step, and the optimal mean number of relays per step are given by Γ*(ξ_out, ξ_relay)/U*(ξ_out, ξ_relay), Q*_out(ξ_out, ξ_relay)/U*(ξ_out, ξ_relay) and 1/U*(ξ_out, ξ_relay), respectively.
Theorem 5: For a given ξ_out, the mean number of relays per step under the OptExploreLim algorithm (Algorithm 3), 1/U*(ξ_out, ξ_relay), decreases with ξ_relay. Similarly, for a given ξ_relay, the mean outage probability per step, Q*_out(ξ_out, ξ_relay)/U*(ξ_out, ξ_relay), decreases with ξ_out under the optimal policy.
Proof: See Appendix B.
Remark: The proof of Theorem 5 is quite general; the results also hold for the pure as-you-go approach.
Proof: See Appendix B.
Remark 5: The result provided in Theorem 6 will be used to develop the learning algorithms in Section VI and Section VII.

C. Policy Computation
We adapt a policy-iteration-based algorithm (from [16]) to calculate λ*. The algorithm generates a sequence of stationary policies {µ_k}_{k≥0} (note that the notation µ_k was used for a different purpose in Section IV-A; here each µ_k is a stationary, deterministic, Markov policy), where, for any k ≥ 0, µ_k(·) : W^B → {A+1, ..., A+B} × S maps a state into an action.
Algorithm 4: The policy-iteration-based algorithm is as follows:
Step 0 (Initialization): Start with an initial stationary deterministic policy µ_0.
Step 1 (Policy Evaluation): Calculate the average cost λ_k corresponding to the policy µ_k, for k ≥ 0. By the Renewal-Reward Theorem (see [18], Proposition 7.3), λ_k is the ratio of the mean hop cost to the mean hop length under µ_k:

λ_k = E_W[ µ_k^{(2)}(W) + ξ_out Q_out(µ_k^{(1)}(W), µ_k^{(2)}(W), W_{µ_k^{(1)}(W)}) + ξ_relay ] / E_W[ µ_k^{(1)}(W) ].

Step 2 (Policy Improvement): Find a new policy µ_{k+1} by solving, for each w ∈ W^B,

µ_{k+1}(w) ∈ argmin_{A+1 ≤ u ≤ A+B, γ ∈ S} { γ + ξ_out Q_out(u, γ, w_u) + ξ_relay − λ_k u }.

If µ_k and µ_{k+1} are the same policy, then stop and declare µ* = µ_k, λ* = λ_k. Otherwise, go to Step 1.
Remark: It was shown in [16] that this policy iteration converges in a finite number of iterations, for finite state and action spaces; it therefore provides λ* in a finite number of steps. The convergence requires that, under any stationary policy, the state evolves as an irreducible Markov chain, which is satisfied here. When we have a general Borel state space (e.g., for log-normal shadowing), convergence may not occur in a finite number of iterations, but a sufficiently large number of iterations will yield a value close to λ*.
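Algorithm 4 can be sketched end to end on a toy instance. The shadowing alphabet, the power levels, the synthetic outage function q and all parameter values below are assumptions for illustration; the evaluation step uses the renewal-reward ratio and the improvement step the argmin rule described above.

```python
import itertools

A, B = 0, 2
S = [0.1, 1.0]                           # assumed power levels
Wvals = [0.5, 2.0]                       # toy shadowing alphabet, uniform pmf
xi_out, xi_relay = 5.0, 1.0

def q(u, g, w):                          # synthetic outage model (assumption)
    return min(1.0, 0.02 * u ** 2 / (g * w))

states = list(itertools.product(Wvals, repeat=B))   # w = (w_{A+1},...,w_{A+B})
acts = [(u, g) for u in range(A + 1, A + B + 1) for g in S]

def hop_cost(w, a):
    u, g = a
    return g + xi_out * q(u, g, w[u - A - 1]) + xi_relay

def evaluate(mu):                        # renewal-reward: E[hop cost]/E[hop length]
    return (sum(hop_cost(w, mu[w]) for w in states) /
            sum(mu[w][0] for w in states))

def improve(lam):                        # Bellman improvement at average cost lam
    return {w: min(acts, key=lambda a: hop_cost(w, a) - lam * a[0]) for w in states}

mu, lams = {w: acts[0] for w in states}, []
for _ in range(50):                      # converges in a few rounds here
    lams.append(evaluate(mu))
    nxt = improve(lams[-1])
    if nxt == mu:
        break
    mu = nxt
```

On this instance the iteration stops after two evaluations, with λ* = 1.25 and the stationary policy "always place at u = 2 with γ = 1.0"; the average costs λ_k are nonincreasing along the way.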
Computational Complexity: The finite state space has cardinality |W|^B. Then, O(|W|^B) addition operations are required to compute λ_k in the policy evaluation step. However, careful manipulation leads to a drastic reduction in this computational requirement, as shown by the following theorem.
Theorem 7: In the policy evaluation step in Algorithm 4, the number of computations in each iteration can be reduced substantially.
Proof: See Appendix B.

D. HeuExploreLim: An Intuitive but Suboptimal Heuristic
A natural heuristic for (3) under the explore-forward approach is the following HeuExploreLim algorithm (Heuristic Algorithm for Limited Explore-Forward):
Algorithm 5: (HeuExploreLim Algorithm) Under the explore-forward setting discussed in Section IV, at state w, make the decision according to the rule

(u, γ) ∈ argmin_{A+1 ≤ u ≤ A+B, γ ∈ S} { (γ + ξ_out Q_out(u, γ, w_u) + ξ_relay) / u }.

Remark: This heuristic is not optimal. Under any stationary deterministic policy µ, let us denote the cost of a link by C_µ (a random variable) and the length of a link by U_µ (under any stationary deterministic policy µ, the deployment process regenerates at the placement points). Our optimal policy given in Theorem 3 minimizes E_µ(C_µ)/E_µ(U_µ), whereas HeuExploreLim minimizes the per-link ratio C_µ/U_µ; the two criteria coincide if and only if the variance of U_µ is zero, i.e., if we always place at the same distance from the previous node. This does not happen in practice, due to the variability of shadowing over space. Hence, HeuExploreLim is suboptimal.
Remark: The advantage of HeuExploreLim is that, given ξ out and ξ relay , HeuExploreLim does not require any propagation model parameter such as η or σ. However, a learning algorithm reported in Section VI also has the same advantage, and provides near-optimal performance if the deployment continues for a sufficient number of steps.

V. COMPARISON BETWEEN EXPLORE-FORWARD AND PURE AS-YOU-GO APPROACHES
Let us denote the optimal average cost per step (for a given ξ_out and ξ_relay) under the explore-forward and pure as-you-go approaches by λ*_ef and λ*_ayg, respectively.
The proof that λ*_ef ≤ λ*_ayg follows by arguing that the pure as-you-go approach is a special case of explore-forward.
In Appendix C, we have presented some numerical work which illustrates the structure of the OptAsYouGo algorithm (we have shown the variation of the threshold c th (r) as a function of r, for various values of ξ out and ξ relay ; see Appendix C, Section A), and also numerically compared various deployment algorithms (see Appendix C, Section B). A detailed explanation of the numerical results has also been provided in Appendix C. The purpose of the comparison is to provide insights into the performance of various algorithms, and to select the algorithm which is best suited for practical deployment. In this section, we will just discuss the choice of parameter values in our numerical work in Appendix C.

A. Parameter Values
We consider deployment for a given ξ_out and a given ξ_relay, for the objective in (3). In Appendix C, we provide numerical results for deployment with iWiSe motes [19] (based on the Texas Instruments (TI) CC2520, which implements the IEEE 802.15.4 PHY in the 2.4 GHz ISM band, yielding a bit rate of 250 kbps, with a CSMA/CA medium access control (MAC)); 9 dBi antennas were used in the experiments. The set of transmit power levels S is taken to be {−18, −7, −4, 0, 5} dBm, a subset of the transmit power levels available in the chosen device. For the channel model in (1), our measurements in a forest-like environment inside the Indian Institute of Science campus gave path-loss exponent η = 4.7 and c = 10^0.17 (i.e., 1.7 dB); see [8]. Shadowing W was found to be log-normal: W = 10^{Y/10} with Y ∼ N(0, σ²), where σ = 7.7 dB. The shadowing decorrelation distance was found to be 6 meters. Fading is assumed to be Rayleigh; H ∼ Exponential(1).
We define outage to be the event when the received signal power of a packet falls below P rcv−min = 10 −9.7 mW (i.e., −97 dBm); for a commercial implementation of the PHY/MAC of IEEE 802.15.4, −97 dBm received power corresponds to a 2% packet loss probability for 127 byte packets for iWiSe motes, as per our measurements.
We consider deployment along a line with step size δ = 20 meters, A = 0, B = 5.
Choice of B: There is no specific rule for choosing B; the choice can be arbitrary. A higher value of B results in a lower optimal average cost, since the deployment agent has more choices for larger B. We chose B in the following way. Define a link to be good if its outage probability is less than 3%, and choose B to be the largest integer such that the probability of finding a good link of length Bδ is more than 20%, when the highest transmit power is used. For the measured parameters η = 4.7, c = 10^0.17, σ = 7.7 dB, and 5 dBm transmit power, B turned out to be 5. If B were increased further, the probability of getting a good link would be very small. Hence, exploring beyond 5 steps would be wasteful in terms of measurement effort and deployment time.
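This rule for choosing B can be reproduced numerically. The closed-form outage expression under Rayleigh fading below is an assumption consistent with channel model (1); the remaining parameter values are the measured ones quoted above.

```python
import math

eta, c, sigma_db = 4.7, 10**0.17, 7.7            # measured values from above
p_min, gamma, delta = 10**-9.7, 10**0.5, 20.0    # -97 dBm, 5 dBm (in mW), 20 m

def p_good_link(b, target_outage=0.03):
    """P(outage < target) at distance b*delta under log-normal shadowing,
    with the assumed Rayleigh-fading outage 1 - exp(-p_min d^eta/(gamma c W)):
    the link is good iff the shadowing gain W exceeds a threshold."""
    d = b * delta
    w_min = p_min * d ** eta / (gamma * c * (-math.log(1.0 - target_outage)))
    y_db = 10.0 * math.log10(w_min)              # required shadowing gain in dB
    return 0.5 * math.erfc(y_db / (sigma_db * math.sqrt(2.0)))

# B = 5 is the largest value with P(good link) > 20% under these assumptions.
assert p_good_link(5) > 0.20 > p_good_link(6)
```

Under these assumptions the good-link probability at 5 steps (100 m) is roughly 24%, dropping below 12% at 6 steps, consistent with the choice B = 5.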
The main conclusion from the comparison among the various algorithms in Appendix C is that the algorithms based on the explore-forward approach significantly outperform the algorithms based on the pure as-you-go approach, at the cost of a somewhat larger number of measurements per step (see Appendix C for a detailed discussion). Hence, for applications that do not require rapid deployment, such as deployment along a long forest trail for wildlife monitoring, explore-forward is the better approach. Thus, for the learning algorithms presented later, we consider only the explore-forward approach.
VI. OPTEXPLORELIMLEARNING: LEARNING WITH EXPLORE-FORWARD, FOR GIVEN ξ_out AND ξ_relay
Based on the discussion in Section V, we proceed, in the remainder of the paper, to develop learning algorithms based on the optimal policy of OptExploreLim. Let us recall problem (3). It is clear from Section V that the explore-forward approach is much better suited than the pure as-you-go approach for deployment over large terrains, because it provides a good compromise between the network cost and the number of measurements to be made.
We observe that the optimal policy (given by Algorithm 3) can be completely specified by the optimal average cost per step λ*, for given values of ξ_out and ξ_relay. But the computation of λ* requires policy iteration, which needs the channel model parameters η and σ and is computationally intensive. In practice, these parameters of the channel model might not be available. In this situation, the agent measures {Q_out(u, γ, w_u) : A+1 ≤ u ≤ A+B, γ ∈ S} before deploying each relay, but he has to learn the optimal average cost per step in the process of deployment, and use the correspondingly updated policy each time he places a new relay. In order to address this requirement, we propose an algorithm that maintains a running estimate of λ* and updates it each time a relay is placed. The algorithm is motivated by the theory of Stochastic Approximation (see [20]), and it uses, as input, the measurements made for each placement, in order to improve the estimate of λ*. We prove that, as the number of deployed relays goes to infinity, the running estimate of the average network cost per step converges to λ* almost surely. Note that this algorithm is much more convenient than the traditional Q-learning algorithm: Q-learning maintains a value function for each state of an MDP and updates each of them over time, whereas this algorithm updates only a scalar value, namely an estimate of the optimal average cost per step.
Let us recall that the sink is called node 0, and the subsequent relays are called nodes 1, 2, .... After the deployment is over, let us denote the length, transmit power and outage values of the link between node k and node (k−1) by u_k, γ_k and Q_out^{(k,k−1)}, respectively. After placing the (k−1)-st node, we will place node k, and consequently u_k, γ_k and Q_out^{(k,k−1)} will be decided according to the following algorithm.
Algorithm 6: (OptExploreLimLearning Algorithm) Let λ^{(k)} be the estimate of the optimal average cost per step after placing the k-th relay (the sink is node 0), and let λ^{(0)} be the initial estimate. In the process of placing relay (k+1), if the measured outage probabilities are {Q_out(u, γ, w_u) : A+1 ≤ u ≤ A+B, γ ∈ S}, then place relay (k+1) according to the policy

(u_{k+1}, γ_{k+1}) ∈ argmin_{A+1 ≤ u ≤ A+B, γ ∈ S} { γ + ξ_out Q_out(u, γ, w_u) + ξ_relay − λ^{(k)} u }.

After placing relay (k+1), update λ^{(k)} as follows (using the measurements made in the process of placing relay (k+1)):

λ^{(k+1)} = λ^{(k)} + a_{k+1} ( min_{A+1 ≤ u ≤ A+B, γ ∈ S} { γ + ξ_out Q_out(u, γ, w_u) + ξ_relay − λ^{(k)} u } ),

where {a_k} is a positive step-size sequence; one example is a_k = 1/k.
Theorem 9: Suppose that the channel model is given by (1), and that shadowing is i.i.d. across links. If we employ Algorithm 6 in the deployment process, then λ^{(k)} → λ* almost surely.
Proof: By Theorem 6, under the optimal policy specified by λ*, we have E_W f(W, λ*) = 0, which leads to the stochastic approximation update in Algorithm 6. The detailed proof can be found in Appendix D.
While Algorithm 6 utilizes the general stochastic approximation update, Algorithm 7 ensures that the iterate λ (k) is the actual average network cost per step up to the k-th relay.
Algorithm 7: Start with any λ^{(0)} > 0. For k ≥ 1, let λ^{(k)} be the average cost per step of the portion of the network already deployed between the sink and the k-th relay, i.e.,

λ^{(k)} = [ Σ_{j=1}^{k} ( γ_j + ξ_out Q_out^{(j,j−1)} + ξ_relay ) ] / [ Σ_{j=1}^{k} u_j ].

Place the (k+1)-st relay according to the policy

(u_{k+1}, γ_{k+1}) ∈ argmin_{A+1 ≤ u ≤ A+B, γ ∈ S} { γ + ξ_out Q_out(u, γ, w_u) + ξ_relay − λ^{(k)} u }.

Corollary 1: If we employ Algorithm 7 in the deployment process, then λ^{(k)} → λ* almost surely.
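A compact simulation sketch of Algorithm 7 follows, with a synthetic measurement model standing in for field measurements; the two-point shadowing alphabet, the outage function and all parameter values are assumptions for illustration.

```python
import random

A, B, S = 0, 2, [0.1, 1.0]
xi_out, xi_relay = 5.0, 1.0

def sample_qout(rng):
    """Fresh measurements for one placement: i.i.d. two-point shadowing at
    each candidate step, plugged into a synthetic outage model (assumption)."""
    w = {u: rng.choice([0.5, 2.0]) for u in range(A + 1, A + B + 1)}
    return {(u, g): min(1.0, 0.02 * u ** 2 / (g * w[u]))
            for u in range(A + 1, A + B + 1) for g in S}

rng = random.Random(0)
lam, tot_cost, tot_steps = 1.0, 0.0, 0          # lam(0) is an arbitrary guess
for _ in range(2000):
    qout = sample_qout(rng)
    # Place by the OptExploreLim rule with the current estimate lam ...
    u, g = min(qout, key=lambda a: a[1] + xi_out * qout[a] + xi_relay - lam * a[0])
    tot_cost += g + xi_out * qout[(u, g)] + xi_relay
    tot_steps += u
    lam = tot_cost / tot_steps                  # ... then refresh the estimate.
```

For this synthetic model, the optimal rule always places at u = 2 with γ = 1.0, whose average cost per step is (2 + 5·E[outage])/2 = 1.25, so the running estimate should settle near that value as relays accumulate.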
Proof: See Appendix D.

VII. OPTEXPLORELIMADAPTIVELEARNING WITH CONSTRAINTS ON OUTAGE PROBABILITY AND RELAY PLACEMENT RATE
In Section VI, we provided a stochastic approximation algorithm for relay deployment, with given multipliers ξ out and ξ relay , without knowledge of the propagation parameters. The multipliers have to be chosen appropriately in order to enforce performance targets in a constrained sequential optimization formulation. Let us recall that Theorem 1 tells us how to choose the Lagrange multipliers ξ out and ξ relay (if they exist) in (3) in order to solve the problem given in (4).
However, we need to know the radio propagation parameters (e.g., η and σ) in order to compute an optimal pair (ξ*_out, ξ*_relay) (if it exists) such that both constraints in (4) are met with equality. In real deployment scenarios, these propagation parameters might not be known to the deployment agent. Hence, in this section, we provide a sequential placement and learning algorithm such that, as the relays are placed, the placement policy iteratively converges to the set of optimal policies for the constrained problem in (4). The policy is of the OptExploreLim type, and the cost of the deployed network converges to the optimal cost. We modify the OptExploreLimLearning algorithm so that a running estimate (λ^{(k)}, ξ_out^{(k)}, ξ_relay^{(k)}) is updated each time a new relay is placed. The objective is to ensure that the running estimate (λ^{(k)}, ξ_out^{(k)}, ξ_relay^{(k)}) eventually converges to the set of optimal (λ*(ξ_out, ξ_relay), ξ_out, ξ_relay) tuples as the deployment progresses, and that the constraints are satisfied asymptotically. Our approach is via two-timescale stochastic approximation.

A. OptExploreLim: Effect of Multipliers ξ out and ξ relay
Consider the constrained problem in (4) and its relaxed version in (3). We will seek a policy for the problem in (4) in the class of OptExploreLim policies (see (7)). Clearly, there exists at least one tuple (q, N ) for which there exists a pair ξ * out > 0, ξ * relay > 0 such that, under the optimal policy π * (ξ * out , ξ * relay ), both constraints are met with equality. In order to see this, choose any ξ out > 0, ξ relay > 0 and consider the corresponding optimal policy π * (ξ out , ξ relay ) (provided by OptExploreLim). Suppose that the mean outage per step and mean number of relays per step, under the policy π * (ξ out , ξ relay ), are q 0 and n 0 , respectively. Now, if we set the constraints q = q 0 and N = n 0 in (4), we obtain one instance of such a tuple (q, N ).
On the other hand, there exist (q, N) pairs that are not feasible. One example is the case N = 1/(A+B) (i.e., the inter-node distance is always (A+B) steps), along with q < E_W Q_out(A+B, P_M, W)/(A+B), where P_M is the maximum available transmit power level at each node. In this case, the outage constraint cannot be satisfied while meeting the constraint on the mean number of relays per step, since even the use of the highest transmit power P_M at each node will not satisfy the per-step outage constraint.
Definition 1: Let us denote the optimal mean power per step for problem (4) by γ*, for a given (q, N). The set K(q, N) is defined in terms of γ* and the optimal average cost per step λ*(ξ_out, ξ_relay) of the unconstrained problem (3) under OptExploreLim. K(q, N) can possibly be empty (in case (q, N) is not a feasible pair). Hence, we make the following assumption, which ensures the non-emptiness of K(q, N).
Assumption 1: The constraint parameters q and N in (4) are such that there exists at least one pair ξ*_out ≥ 0, ξ*_relay ≥ 0 for which (λ*(ξ*_out, ξ*_relay), ξ*_out, ξ*_relay) ∈ K(q, N).
Remark: Assumption 1 implies that the constraints are consistent (in terms of achievability). If ξ*_out > 0 and ξ*_relay > 0, both constraints are active. If ξ*_out = 0, we can keep the mean outage per step strictly less than q by using the minimum available power at each node, while meeting the constraint on the mean number of relays per step; the optimal policy in Algorithm 3 under ξ_out = 0 will place relays with an inter-relay distance of (A+B) steps and use the minimum available power level at each node. ξ*_out = ∞ would mean that the outage constraint cannot be met even with the highest power level at each node, under the relay placement rate constraint. Similar arguments apply to ξ*_relay.
We now establish some structural properties of K(q, N).
Theorem 10: If K(q, N) is non-empty, then the following is true: suppose that there exists ξ*_out > 0, ξ*_relay > 0 such that the policy π*(ξ*_out, ξ*_relay) satisfies both constraints in (4) with equality. Then, there does not exist ξ_out ≥ 0, ξ_relay ≥ 0 satisfying (i) (λ*(ξ_out, ξ_relay), ξ_out, ξ_relay) ∈ K(q, N), and (ii)
Proof: See Appendix E, Section A.
Assumption 2: The shadowing random variable W has a continuous probability density function (p.d.f.) over (0, ∞); in particular, for any w ∈ (0, ∞), P(W = w) = 0. One example is log-normal shadowing.
Proof: See Appendix E, Section B.
Remark: Note that, by Theorem 11, we need not perform any randomization among deterministic policies (see [21] for reference) in order to meet the constraints with equality.
Algorithm 8: This algorithm iteratively updates the triple (λ^{(k)}, ξ_out^{(k)}, ξ_relay^{(k)}). Let (λ^{(k)}, ξ_out^{(k)}, ξ_relay^{(k)}) be the iterates after placing the k-th relay (the sink is node 0), and let (λ^{(0)}, ξ_out^{(0)}, ξ_relay^{(0)}) be the initial estimates. In the process of deploying the k-th relay, if the shadowing (which is measured indirectly only via Q_out(u, γ, w_u) for A+1 ≤ u ≤ A+B and γ ∈ S) is w = {w_{A+1}, ..., w_{A+B}}, then place the k-th relay according to the policy

(u_k, γ_k) ∈ argmin_{A+1 ≤ u ≤ A+B, γ ∈ S} { γ + ξ_out^{(k−1)} Q_out(u, γ, w_u) + ξ_relay^{(k−1)} − λ^{(k−1)} u }.

After placing the k-th relay, let us denote the transmit power, distance (in steps) and outage probability from relay k to relay (k−1) by γ_k, u_k and Q_out(u_k, γ_k, w_{u_k}). After placing the k-th relay, make the following updates (using the measurements made in the process of placing the k-th relay):

λ^{(k)} = λ^{(k−1)} + a_k ( min_{A+1 ≤ u ≤ A+B, γ ∈ S} { γ + ξ_out^{(k−1)} Q_out(u, γ, w_u) + ξ_relay^{(k−1)} − λ^{(k−1)} u } ),
ξ_out^{(k)} = Λ_{[0,A_2]} ( ξ_out^{(k−1)} + b_k ( Q_out(u_k, γ_k, w_{u_k}) − q u_k ) ),
ξ_relay^{(k)} = Λ_{[0,A_3]} ( ξ_relay^{(k−1)} + b_k ( 1/N − u_k ) ),   (13)

where Λ_{[0,A_2]}(x) denotes the projection of x onto the interval [0, A_2] (and similarly for Λ_{[0,A_3]}). A_2 and A_3 need to be chosen carefully; the reason is explained in the discussion later in this section (along with a brief discussion of how A_2 and A_3 should be chosen), and a detailed method of choosing A_2 and A_3 is described in Appendix E, Section C5.
Proof: See Appendix E, Section C for a detailed proof.

Discussion of Theorem 12:
(i) Two timescales: In Appendix E, Section C, we rewrite the update scheme (13) as a two-timescale stochastic approximation (see [20], Chapter 6). Note that lim_{k→∞} b_k/a_k = 0, i.e., ξ_out and ξ_relay are adapted on a slower timescale than λ (which is adapted on the faster timescale). The dynamics behave as if ξ_out and ξ_relay were updated simultaneously in a slow outer loop and, between two successive updates of ξ_out and ξ_relay, λ were updated in an inner loop for a long time. Thus, the λ update equation views ξ_out and ξ_relay as quasi-static, while the ξ_out and ξ_relay update equations view the λ update as almost equilibrated. See [20], Chapter 6, for reference.
(ii) Structure of the iteration: Note that (Q_out(u_k, γ_k, w_{u_k}) − q u_k) is the excess outage compared with the allowed outage q u_k for the k-th link. If this quantity is positive (resp., negative), the algorithm increases (resp., decreases) ξ_out in order to reduce (resp., increase) the outage probability in subsequent steps. Similarly, if u_k < 1/N, the algorithm increases ξ_relay in order to reduce the relay placement rate. The objective of the slower timescale is to drive (ξ_out^{(k)}, ξ_relay^{(k)}) to a pair (ξ_out, ξ_relay) for which (λ*(ξ_out, ξ_relay), ξ_out, ξ_relay) ∈ K(q, N); we will see later (in Theorem 13) that this iteration ensures that the constraints in (4) are met. On the faster timescale, our aim is to ensure that lim_{k→∞} E_W min_{u,γ} ( γ + ξ_out^{(k)} Q_out(u, γ, W_u) + ξ_relay^{(k)} − λ^{(k)} u ) = 0.
(iii) Outline of the proof: We present the proof of Theorem 12 in five subsections in Appendix E, Section C. We first prove the almost sure boundedness of the λ^{(k)} iterates in Subsection C1. Next, we prove in Subsection C2 that the difference between the sequences λ^{(k)} and λ*(ξ_out^{(k)}, ξ_relay^{(k)}) converges to 0 almost surely; this is required to prove the desired convergence on the faster timescale. This result is proved using the theory in [20], Chapter 6, and Theorem 9. In order to ensure almost sure boundedness of the slower timescale iterates, we use the projection operation in the slower timescale.
In Subsection C3, we pose the slower timescale iteration in the form of a projected stochastic approximation iteration (see [22], Equation 5.3.1). In order to prove the desired convergence of the projected stochastic approximation, in Subsection C4 we show that our iteration satisfies certain conditions given in [22], and we relate the stationary points of the limiting o.d.e. to K(q, N). Hence, we need to ensure that if (ξ_out, ξ_relay) is a stationary point of the o.d.e., then (λ*(ξ_out, ξ_relay), ξ_out, ξ_relay) ∈ K(q, N). In order to ensure this, we need to choose A_2 and A_3 properly. The choice of A_2 and A_3 is rather technical, and is explained in detail in Appendix E, Section C5; here we provide only a brief description of their choice, without explanation. The number A_2 has to be chosen so large that, under ξ_out = A_2 and for all A+1 ≤ u ≤ A+B, we have P( argmin_{γ∈S} (γ + A_2 Q_out(u, γ, W)) = P_M ) > 1 − κ for some small enough κ > 0; we must also have Q*_out(A_2, 0)/U*(A_2, 0) ≤ q. The number A_3 has to be chosen so large that a corresponding condition (detailed in Appendix E, Section C5) holds for every ξ_out ∈ [0, A_2]. Moreover, A_2 and A_3 have to be chosen so large that there exists at least one (ξ_out, ξ_relay) ∈ [0, A_2] × [0, A_3] such that (λ*(ξ_out, ξ_relay), ξ_out, ξ_relay) ∈ K(q, N).
(iv) Asymptotic behaviour of the iterates: If the pair (q, N) is such that one constraint can be met with strict inequality and the other with equality while using the optimal mean power per step for this pair (q, N), then the corresponding Lagrange multiplier will converge to 0. This happens, for instance, if q > E_W Q_out(A+B, P_1, W)/(A+B); in this case we will have ξ_out^{(k)} → 0 (which is evident from OptExploreLim with ξ_out = 0), and we will place all the relays at the (A+B)-th step and use the smallest power level at each node. On the other hand, if the constraints are not feasible, then ξ_out^{(k)} → A_2, or ξ_relay^{(k)} → A_3 (convergence to ∞ being impossible due to the projection operation), or both. Note that, if K(q, N) is non-empty, then there might possibly exist multiple pairs ξ*_out ≥ 0, ξ*_relay ≥ 0 such that (λ*(ξ*_out, ξ*_relay), ξ*_out, ξ*_relay) ∈ K(q, N). So, in the rest of the paper, we allow for the possibility that K(q, N) contains multiple tuples. However, we strongly believe that if K(q, N) is non-empty, then the iterates will always (for all sample paths) converge to the same limit (probably the only tuple in K(q, N)); we found evidence for this assertion in extensive simulations.
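The update pattern discussed above can be sketched as follows. Since the exact update equations of Algorithm 8 are not reproduced in full here, the forms below (fast step a_k = k^{−0.6}, slow step b_k = 1/k, the drift terms, the projection bounds A_2, A_3, and the synthetic measurement model) are illustrative assumptions consistent with the discussion, not the paper's exact equations.

```python
import random

A, B, S = 0, 2, [0.1, 1.0]
A2, A3 = 50.0, 10.0                    # projection bounds (assumed values)
q_tgt, N_tgt = 0.05, 0.5               # per-step outage and relay-rate targets

def clip(x, lo, hi):                   # the projection Lambda_[lo,hi]
    return max(lo, min(hi, x))

def sample_qout(rng):                  # synthetic measurements (assumption)
    w = {u: rng.choice([0.5, 2.0]) for u in range(A + 1, A + B + 1)}
    return {(u, g): min(1.0, 0.02 * u ** 2 / (g * w[u]))
            for u in range(A + 1, A + B + 1) for g in S}

rng = random.Random(1)
lam, xo, xr = 1.0, 1.0, 1.0            # initial estimates of lam, xi_out, xi_relay
for k in range(1, 3001):
    qout = sample_qout(rng)
    u, g = min(qout, key=lambda a: a[1] + xo * qout[a] + xr - lam * a[0])
    a_k, b_k = k ** -0.6, 1.0 / k      # fast and slow steps, b_k/a_k -> 0
    # Fast timescale: drive the expected minimized excess hop cost to zero.
    lam += a_k * min(c[1] + xo * qout[c] + xr - lam * c[0] for c in qout)
    # Slow timescale: excess outage raises xi_out, too-short hops raise xi_relay.
    xo = clip(xo + b_k * (qout[(u, g)] - q_tgt * u), 0.0, A2)
    xr = clip(xr + b_k * (1.0 / N_tgt - u), 0.0, A3)
```

The step-size ratio b_k/a_k = k^{−0.4} → 0 realizes the two-timescale separation: the multipliers see λ as almost equilibrated, while λ sees the multipliers as quasi-static.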

C. Asymptotic Performance of OptExploreLimAdaptiveLearning
Let us denote by π oelal the (nonstationary) deployment policy induced by the OptExploreLimAdaptiveLearning algorithm (i.e., Algorithm 8). We will now show that π oelal is an optimal policy for the constrained problem (4).
Theorem 13: Suppose that Assumption 1 and Assumption 2 hold. Then, under a proper choice of A_2 and A_3, the policy π_oelal solves problem (4); i.e., we have:
Proof: See Appendix E, Section D.
Theorem 14: Suppose that Assumption 1 and Assumption 2 hold. Then, under a proper choice of A_2 and A_3, we have:
Proof: The proof is similar to the proof of Theorem 13, and is hence skipped.

VIII. CONVERGENCE SPEED OF LEARNING ALGORITHMS: A SIMULATION STUDY
In this section, we provide a simulation study to demonstrate the convergence rate of the OptExploreLimLearning algorithm (Algorithm 7) and the OptExploreLimAdaptiveLearning algorithm (Algorithm 8).
A. OptExploreLimLearning for Given ξ_out and ξ_relay
Let us choose ξ_out = 100 and ξ_relay = 1. The optimal average cost per step, for this choice of parameters and under the propagation environment of Section V-A, is 0.8312 (computed numerically using policy iteration).
Suppose that the actual η = 4.7, σ = 7.7 dB, but at the time of deployment we have an initial estimate that η = 4, σ = 7 dB; thus, we start with λ^(0) = 0.4577. After placing the k-th relay, the actual average cost per step of the relay network connecting the k-th relay to the sink is λ^(k); this quantity is a random variable whose realization depends on the shadowing realizations over the links measured in the process of deployment up to the k-th relay. We ran 10000 simulations of Algorithm 7 in MATLAB, starting with different seeds for the shadowing random process, and estimated E(λ^(k)) as the average of the samples of λ^(k) over these 10000 simulations. We also did the same for λ^(0) = 1.7667, which is the optimal cost for η = 5.5, σ = 9 dB.
The estimates of E(λ^(k)), k ≥ 1, as a function of k, for the two initial values of λ^(0), are shown in Figure 3. Also shown, in Figure 3, is the optimal value λ* = 0.8312 for the true propagation parameters (i.e., η = 4.7, σ = 7.7 dB). From Figure 3, we observe that E(λ^(k)) approaches the optimal cost 0.8312 for the actual propagation parameters as the number of deployed relays increases, and gets to within 10% of the optimal cost by the time that 4 or 5 relays are placed, starting with two widely different initial guesses of the propagation parameters. Thus, OptExploreLimLearning could be useful even in situations where the distance from the sink to the source can be covered with as few as 4 to 5 relays.
Note that each simulation yields one sample path of the deployment process. We ran 10000 simulations in order to obtain the estimates of E(λ^(k)) as a function of k, by averaging over 10000 sample paths; the convergence speed will vary from one sample path to another, even though λ^(k) → 0.8312 almost surely.
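The sample-path averaging described above can be sketched as follows. This is a minimal illustration, not the actual Algorithm 7: `simulate_one_deployment` uses a hypothetical Gaussian noise model in place of real per-link measurements, and the step size a_k = 1/k^0.55 is one admissible choice (the one used later in Section VIII-B); only the initial guesses 0.4577 and 1.7667 and the target 0.8312 are taken from the text.

```python
import random

def simulate_one_deployment(lam0, num_relays, target=0.8312, seed=None):
    """Synthetic stand-in for one run of OptExploreLimLearning: the iterate
    lam_k moves from its initial guess toward the optimal cost `target`,
    perturbed by per-link noise (hypothetical shadowing model)."""
    rng = random.Random(seed)
    lam = lam0
    path = []
    for k in range(1, num_relays + 1):
        a_k = 1.0 / k ** 0.55                # decreasing step size
        noise = rng.gauss(0.0, 0.5)          # shadowing-driven fluctuation
        lam += a_k * (target + noise - lam)  # SA-style correction
        path.append(lam)
    return path

def estimate_mean_path(lam0, num_relays, num_runs):
    """Estimate E(lam_k) by averaging over independent sample paths,
    as done with the 10000 MATLAB simulations in the text."""
    sums = [0.0] * num_relays
    for run in range(num_runs):
        for k, lam in enumerate(simulate_one_deployment(lam0, num_relays, seed=run)):
            sums[k] += lam
    return [s / num_runs for s in sums]

mean_path = estimate_mean_path(lam0=0.4577, num_relays=10, num_runs=1000)
```

Averaging over many runs smooths out the per-path fluctuations, so the plotted curve E(λ^(k)) approaches the optimum even though individual paths oscillate.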

B. OptExploreLimAdaptiveLearning
In this section, we discuss how OptExploreLimAdaptiveLearning (Algorithm 8) performs for deployment over a finite distance under an unknown propagation environment. We assume that the true propagation parameters are as given in Section V-A (i.e., η = 4.7, σ = 7.7 dB). If we know the true propagation environment, then, under the choice ξ_relay = 1 and ξ_out = 100, the optimal average cost per step will be 0.8312, and this can be achieved by OptExploreLim (Algorithm 3). The corresponding mean outage per step will be 0.0045/2.2859 = 0.001969 (i.e., 0.1969%) and the mean number of relays per step will be 1/2.2859 (see Figure 7 in Appendix C). Now, suppose that we wish to solve the constrained problem in (4) with the targets q = 0.001969 (i.e., 0.1969%) and N = 1/2.2859, but we do not know the true propagation environment. Hence, the deployment will use OptExploreLimAdaptiveLearning with some initial choice of ξ_out^(0), ξ_relay^(0) and λ^(0). In order to make a fair comparison, we compare the following three scenarios: (i) η and σ are completely known (we use OptExploreLim with ξ_relay = 1 and ξ_out = 100 in this case), (ii) imperfect estimates of η and σ are available prior to deployment, and OptExploreLimAdaptiveLearning is used to learn the optimal policy, and (iii) imperfect estimates of η and σ are available prior to deployment, but the corresponding suboptimal policy is used throughout the deployment without any update. For brevity, we introduce the abbreviations OELAL and OEL for OptExploreLimAdaptiveLearning and OptExploreLim, respectively. We also use the abbreviation FPWU for "Fixed Policy without Update." Now, we formally introduce the cases that we consider in our simulations: (i) OEL: OEL corresponds to the case where we know η = 4.7, σ = 7.7 dB, and use OptExploreLim (Algorithm 3) with ξ_out = 100, ξ_relay = 1, λ* = 0.8312. OEL will meet both constraints with equality, and at the same time will minimize the mean power per step.
(ii) OELAL Case 1: OELAL Case 1 is the case where the true η and σ (which are unknown to the deployment agent) are specified by Section V-A, but we use OptExploreLimAdaptiveLearning with ξ_out^(0) = 75, ξ_relay^(0) = 1.25 and λ^(0) = 0.5007, in order to meet the constraints specified earlier in this subsection. Note that, under ξ_out = 75 and ξ_relay = 1.25, the optimal mean cost per step is 0.5007 for η = 4, σ = 7 dB. Hence, we start with a wrong choice of Lagrange multipliers, a wrong estimate of η and σ, and an estimate of the optimal average cost per step which corresponds to these wrong choices. The goal is to see how fast the variables λ^(k), ξ_out^(k) and ξ_relay^(k) converge to the desired targets 0.8312, 100 and 1, respectively. We also seek to study how close quantities such as the mean power per step, mean outage per step and mean placement distance for the relay network between the k-th relay and the sink node (for any k ≥ 1) are to the desired target values. (iii) OELAL Case 2: OELAL Case 2 differs from OELAL Case 1 only in that λ^(0) = 1.7679 is used in OELAL Case 2. Note that, under ξ_out = 75 and ξ_relay = 1.25, the optimal mean cost per step is 1.7679 for η = 5.5, σ = 9 dB. (iv) FPWU Case 1: In this case, the true η and σ are unknown to the deployment agent. The deployment agent uses ξ_out = 75, ξ_relay = 1.25 and λ* = 0.5007 throughout the deployment process under the algorithm specified by (7). Clearly, he chooses a wrong set of Lagrange multipliers ξ_out = 75, ξ_relay = 1.25, and he has a wrong estimate η = 4, σ = 7 dB. The optimal average cost per step λ* = 0.5007 is computed for this wrong choice of parameters, and the corresponding policy is used throughout the deployment process without any update. This case is simulated to see the gain in performance from updating the policy under OptExploreLimAdaptiveLearning, w.r.t.
the case where a suboptimal policy driven by the initial imperfect estimate of the parameters is used without any online update. (v) FPWU Case 2: It differs from FPWU Case 1 only in that we use λ* = 1.7679 in FPWU Case 2. Recall that, under ξ_out = 75 and ξ_relay = 1.25, the optimal mean cost per step is 1.7679 for η = 5.5, σ = 9 dB.
For the simulation of OELAL, we chose the step sizes as follows: a_k = 1/k^0.55, b_k = 10000/k^0.8 for the ξ_out update, and b_k = 1/k^0.8 for the ξ_relay update (note that both ξ_out and ξ_relay are updated on the same timescale). We simulated 10000 independent network deployments (i.e., 10000 sample paths of the deployment process) with OptExploreLimAdaptiveLearning in MATLAB, and estimated (by averaging over the 10000 deployments) the expectations of λ^(k), ξ_out^(k) and ξ_relay^(k), and of the mean power per step, mean outage per step and mean placement distance of the network from the sink node to the k-th placed node. In each simulated network deployment, we placed 20000 nodes, i.e., k was allowed to go up to 20000. Asymptotically, the estimates are expected to converge to the values provided by OEL (by Theorem 14).
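The step-size choices above can be encoded directly. The sketch below records the three sequences and the projection used on the multiplier iterates (the projection bounds are placeholders), and checks the two-timescale property b_k/a_k → 0 that keeps the multipliers quasi-static relative to the cost iterate.

```python
def a(k):
    """Faster-timescale step size for the lambda iterate: a_k = 1/k^0.55."""
    return 1.0 / k ** 0.55

def b_out(k):
    """Slower-timescale step size for the xi_out iterate: b_k = 10000/k^0.8."""
    return 10000.0 / k ** 0.8

def b_relay(k):
    """Slower-timescale step size for the xi_relay iterate: b_k = 1/k^0.8."""
    return 1.0 / k ** 0.8

def project(x, upper):
    """Projection Lambda_[0, upper](x): keeps a multiplier iterate in [0, upper]."""
    return max(0.0, min(upper, x))

# b_k / a_k = k^(-0.25) -> 0: the multipliers evolve on a slower timescale
# than the cost iterate, as required for two-timescale stochastic approximation.
ratios = [b_relay(k) / a(k) for k in (10, 10**3, 10**6)]
```

Note that both multiplier sequences decay as k^(-0.8), so ξ_out and ξ_relay indeed share one (slower) timescale, as stated in the text.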
Observations from the Simulations: The results of the simulations are summarized in Figure 4 (see the previous page). From these plots, we make the following important observations.
The estimates of the expectations of λ^(20000), ξ_out^(20000) and ξ_relay^(20000), and of the mean placement distance, under OELAL Case 1 are very close to the desired values 0.8312, 100, 1 and 2.2859, respectively. We found similar results for OELAL Case 2 as well. Hence, the quantities converge very close to the desired values. We have shown convergence only up to the placement of k = 50 relays in most cases, since the convergence rate of the algorithms in the initial phase is most important in practical deployments.
All the quantities except the expectations of ξ_out^(k) and ξ_relay^(k) (which are updated on a slower timescale) converge reasonably close to the desired values by the time the 50-th relay is placed, which covers a distance of roughly 2-3.5 km. FPWU Case 1 and FPWU Case 2 either violate some constraint or use significantly higher per-step power compared to OEL. But, by using the OptExploreLimAdaptiveLearning algorithm, we can achieve per-step power expenditure close to the optimal while (possibly) violating the constraints by a small amount; even in cases where the performance of OELAL is not very close to the optimal performance, it is significantly better than the performance under the FPWU cases (compare OELAL Case 2 and FPWU Case 2 in Figure 4).
The speed of convergence depends on the choice of the step sizes. We have shown numerical results for one particular pair of a_k and b_k sequences; optimizing the rate of convergence by choosing optimal step sizes is left for future work. Also, note that the choice of ξ_out^(0), ξ_relay^(0) and λ^(0) has a significant effect on the performance of the network over a finite length; the more accurate the estimates of η and σ, and the better the initial choices of ξ_out^(0), ξ_relay^(0) and λ^(0), the faster the convergence of OptExploreLimAdaptiveLearning.

IX. CONCLUSION
In this paper, we have developed several approaches for as-you-go deployment of wireless relay networks using online measurements, under a very light traffic assumption. Each problem was formulated as an MDP and its optimal policy structure was studied. We also studied a few learning algorithms that asymptotically converge to the corresponding optimal policies. Numerical results have been provided to illustrate the performance and trade-offs.
This work can be extended or modified in several ways: (i) Networks that are robust to node failures and long term link variations would either require each relay to have multiple neighbours (i.e., the deployment would need to be multiconnected), or each node could use a power level that is higher than the power specified by the deployment algorithm, or the nodes could choose their transmit powers adaptively as the environment changes. (ii) It would be of interest to develop deployment algorithms for two- and three-dimensional regions, where a team of agents cooperates to carry out the deployment. (iii) If network lifetime is not a matter of concern, one could use a fixed high transmit power level in all nodes. Then the problem would be to find deployment strategies so that the mean outage per step is minimized subject to a constraint on the mean relay placement rate. The approach taken in our current paper is able to address this new problem as well. (iv) We have assumed very light traffic conditions in our design (what we call "lone packet" traffic). But we have found that these designs can carry a useful amount of positive traffic; our experimental deployment (reported in [8]) of a network using OptExploreLimLearning with suitable parameters demonstrates that a 500 m long network having only 5 relays can carry 4 packets per second, while having end-to-end packet loss probability less than 1%, in a typical forest-like environment. It will be of interest, however, to develop deployment algorithms that can provide theoretical guarantees on achieving desired traffic rates.

APPENDIX A AVERAGE COST PER STEP: WITH PURE AS-YOU-GO
Proof of Lemma 1: Note that the function J^(0)(·) := 0 satisfies all the assertions. Let us assume, as our induction hypothesis, that J^(k)(·) satisfies all the assertions. Now, Q_out(r, γ, w) is increasing in r and decreasing in w (by our channel modeling assumptions in Section II-A), and the single stage costs are linear (hence concave) and increasing in ξ_relay and ξ_out. Then, from the value iteration, J^(k+1)(r, w) is the pointwise minimum of functions which are increasing in r, ξ_out and ξ_relay, decreasing in w, and jointly concave in ξ_out and ξ_relay. Hence, the assertions hold for J^(k+1)(r, w). Similarly, we can show that the assertions hold for J^(k+1)(0). Since J^(k)(·) ↑ J(·), the results follow.
Proof of Theorem 2: Consider the Bellman equation (6). We will place a relay at state (r, w) iff the cost of placing a relay, i.e., min_{γ∈S}(γ + ξ_out Q_out(r, γ, w)) + ξ_relay + J(0), is less than or equal to the cost of not placing, i.e., θ E_W min_{γ∈S}(γ + ξ_out Q_out(r+1, γ, W)) + (1−θ) E_W J(r+1, W). Hence, it is obvious that we will place a relay at state (r, w) iff min_{γ∈S}(γ + ξ_out Q_out(r, γ, w)) ≤ c_th(r), where the threshold c_th(r) is given by:

c_th(r) = θ E_W min_{γ∈S}(γ + ξ_out Q_out(r+1, γ, W)) + (1−θ) E_W J(r+1, W) − ξ_relay − J(0).

Now, if there exists a stationary policy {µ, µ, ···} such that, for each state, the action chosen by the policy is the action that achieves the minimum in the Bellman equation, then that stationary policy is an optimal policy, i.e., the minimizer in the Bellman equation gives the optimal action. Hence, if the decision is to place a relay at state (r, w), then the power has to be chosen as argmin_{γ∈S}(γ + ξ_out Q_out(r, γ, w)).
Since Q_out(r, γ, w) and J(r, w) are increasing in r for each γ and w, it is easy to see that c_th(r) is increasing in r.
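The threshold structure of Theorem 2 can be sketched as follows. Only the decision rule itself is from the text; the outage model `q_out_demo` and the increasing threshold `c_th_demo` are hypothetical placeholders (in practice c_th(r) would be precomputed from the Bellman equation).

```python
def best_power(r, w, xi_out, powers, q_out):
    """Minimize gamma + xi_out * Qout(r, gamma, w) over the finite power set S."""
    return min((gamma + xi_out * q_out(r, gamma, w), gamma) for gamma in powers)

def place_decision(r, w, xi_out, powers, q_out, c_th):
    """Threshold rule of Theorem 2: place a relay at state (r, w) iff the best
    achievable link cost is <= c_th(r); if placing, use the minimizing power."""
    cost, gamma = best_power(r, w, xi_out, powers, q_out)
    return cost <= c_th(r), gamma

# Hypothetical outage model, for illustration only: outage grows with hop
# length r and shrinks with transmit power gamma and shadowing gain w.
def q_out_demo(r, gamma, w):
    return min(1.0, 0.02 * r / (gamma * w))

powers = [0.5, 1.0, 2.0]
c_th_demo = lambda r: 1.5 + 0.1 * r   # increasing in r, as shown above

weak = place_decision(6, 1.0, 100.0, powers, q_out_demo, c_th_demo)
strong = place_decision(6, 100.0, 100.0, powers, q_out_demo, c_th_demo)
```

With a weak link (small w) the best achievable cost exceeds the threshold and the agent walks on; with a strong link the rule places a relay and uses the cheapest adequate power level.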

APPENDIX B AVERAGE COST PER STEP: WITH EXPLORE-FORWARD
Proof of Theorem 4: Let us recall the definitions of the functions µ^(1) and µ^(2). By the Renewal Reward Theorem (since the placement process regenerates at each placement point), the average cost of a specific stationary deterministic policy µ is

λ_µ := (Γ_µ + ξ_out Q_out,µ + ξ_relay) / U_µ,

where Γ_µ, Q_out,µ and U_µ denote the mean power per link, mean outage per link and mean placement distance under µ. For each policy (µ^(1), µ^(2)), the numerator is linear and increasing in ξ_out and ξ_relay, and the denominator is independent of ξ_out and ξ_relay. Now, λ*(ξ_out, ξ_relay) = inf_µ λ_µ. Hence, the proof follows immediately, since the pointwise infimum of increasing linear functions of ξ_out and ξ_relay is increasing and jointly concave in ξ_out and ξ_relay, and since any increasing, concave function is continuous.
Proof of Theorem 5: We will prove only the second statement of the theorem, since the proof of the first statement is similar.
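The renewal-reward structure of λ_µ can be sketched numerically as follows; only the ratio-of-expectations form is from the text, while the exploration window (A, B), the outage model and the demo policy are hypothetical stand-ins.

```python
import random

A, B = 2, 3                       # hypothetical exploration window
POWERS = [0.5, 1.0, 2.0]          # hypothetical finite power set S

def q_out_demo(u, gamma, w_u):
    """Hypothetical outage model: grows with hop length, shrinks with power."""
    return min(1.0, 0.05 * u / (gamma * w_u))

def sample_shadowing(rng):
    """One measurement cycle: shadowing gains at the B candidate locations."""
    return {u: rng.lognormvariate(0.0, 0.5) for u in range(A + 1, A + B + 1)}

def demo_policy(w):
    """A stationary policy: place at the best-measured location, power 1.0."""
    u = min(w, key=lambda u: q_out_demo(u, 1.0, w[u]))
    return u, 1.0

def renewal_average_cost(policy, xi_out, xi_relay, num_cycles=20000, seed=0):
    """lambda_mu = E[per-cycle cost] / E[cycle length]: valid by the Renewal
    Reward Theorem because the placement process regenerates at each
    placement point."""
    rng = random.Random(seed)
    total_cost, total_len = 0.0, 0
    for _ in range(num_cycles):
        w = sample_shadowing(rng)
        u, gamma = policy(w)
        total_cost += gamma + xi_out * q_out_demo(u, gamma, w[u]) + xi_relay
        total_len += u
    return total_cost / total_len

lam_mu = renewal_average_cost(demo_policy, xi_out=100.0, xi_relay=1.0)
```

Since ξ_out and ξ_relay enter only the numerator linearly, λ_µ is increasing and linear in them for a fixed policy, which is exactly the property the proof exploits.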
Consider any κ > 0. Since the mean cost per step is a linear combination of the mean power per step, the mean outage per step and the mean number of relays per step, we can write a pair of inequalities relating λ*(ξ_out, ξ_relay) and λ*(ξ_out + κ, ξ_relay), where the inequality in (17) follows from the fact that π*(ξ_out, ξ_relay) is an optimal policy for (ξ_out, ξ_relay), and the inequality in (18) follows from the fact that π*(ξ_out + κ, ξ_relay) is an optimal policy for (ξ_out + κ, ξ_relay).
Proof of Theorem 7: Note that in (10), if the minimum is achieved by more than one pair (u, γ), then any one of them can be taken as the optimal action. Let us use the convention that, among all minimizers, the pair (u, γ) with the minimum u is taken as the optimal action, and if there is more than one such minimizing pair with the same value of u, then the pair with the smallest value of γ is taken. We recall that S = {P_1, P_2, ···, P_M}. Let us denote, under policy µ_{k+1}, the probability that the optimal control is (u, γ) and the shadowing is w at the u-th location, by b_k(u, γ, w).
Now, λ_{k+1} can be written in terms of b_k(u, γ, w) and the shadowing density g(w). Note that the policy improvement step is not explicitly required in the policy iteration. This is because, in the policy evaluation step, λ_k is sufficient to compute b_k(u, γ, w) for all u, γ, w, and thereby to compute λ_{k+1}. Hence, we need not store the policy in each iteration.
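The "no stored policy" observation above can be sketched as follows: the greedy action for the current λ_k is recomputed on the fly, and λ_{k+1} is evaluated by the renewal-reward ratio. Expectations over shadowing are replaced by an average over a fixed set of hypothetical samples; the propagation model and parameter values are illustrative assumptions.

```python
import random

def policy_iteration_lambda(xi_out, xi_relay, q_out, powers, A, B,
                            shadow_samples, num_iters=30):
    """Policy iteration carrying only the scalar lambda_k (proof of Theorem 7):
    for each shadowing sample, the greedy action minimizes the relative cost
    gamma + xi_out*Qout + xi_relay - lambda_k*u; tuple comparison breaks ties
    by smallest u, then smallest gamma, matching the stated convention."""
    lam = 0.0
    for _ in range(num_iters):
        total_cost, total_len = 0.0, 0.0
        for w in shadow_samples:
            _, u, gamma = min(
                (g + xi_out * q_out(u, g, w[u]) + xi_relay - lam * u, u, g)
                for u in range(A + 1, A + B + 1) for g in powers)
            total_cost += gamma + xi_out * q_out(u, gamma, w[u]) + xi_relay
            total_len += u
        lam = total_cost / total_len   # policy evaluation via renewal reward
    return lam

# Hypothetical propagation model and samples, for illustration only.
rng = random.Random(1)
A, B = 2, 3
samples = [{u: rng.lognormvariate(0.0, 0.5) for u in range(A + 1, A + B + 1)}
           for _ in range(2000)]
q_out_demo = lambda u, g, w: min(1.0, 0.05 * u / (g * w))
lam_star = policy_iteration_lambda(100.0, 1.0, q_out_demo, [0.5, 1.0, 2.0],
                                   A, B, samples)
```

Because the greedy policy is a deterministic function of λ_k, iterating this map until λ stabilizes is equivalent to policy iteration without ever materializing a policy table.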
Proof of Lemma 2: Let us denote the HeuExploreLim policy by µ_h and any other stationary, deterministic policy by µ. Let us denote the sequence of link costs incurred in the deployment process (for a semi-infinite line with given shadowing over all possible links) under policy µ_h by c_{µ_h,1}, c_{µ_h,2}, ···, and the corresponding link lengths by u_{µ_h,1}, u_{µ_h,2}, ···. Let us denote, under policy µ_h, the shadowing observed at the i-th location (where A+1 ≤ i ≤ A+B) in the measurement process for the placement of the l-th node, by w_{i,l}. Now, let us couple the deployment processes under policies µ and µ_h in the following way: suppose that, under policy µ, the shadowing observed at the i-th location for the placement of the l-th node is again w_{i,l} (this is valid since shadowing is i.i.d. across links). Clearly, under this coupling, the two deployment processes can be compared link by link on the same shadowing realizations.

APPENDIX C COMPARISON BETWEEN EXPLORE-FORWARD AND PURE
AS-YOU-GO APPROACHES
Proof of Theorem 8: Note that for the average cost problem with pure as-you-go, there exists an optimal threshold policy (similar to Theorem 2), since the optimal policy for problem (5) achieves average cost per step λ*_ayg for θ sufficiently close to 0. So, let one such optimal policy be given by the set of thresholds {c_th(r)}, A+1 ≤ r ≤ A+B−1. Now, let us consider the average cost minimization problem with explore-forward. Consider the policy where we first measure w_{A+1}, w_{A+2}, ···, w_{A+B} and decide to place a relay u steps away from the previous relay (where A+1 ≤ u ≤ A+B−1) if min_{γ∈S}(γ + ξ_out Q_out(r, γ, w_r)) > c_th(r) for all r ≤ (u−1) and min_{γ∈S}(γ + ξ_out Q_out(u, γ, w_u)) ≤ c_th(u); we must place a relay if we reach a distance of (A+B) steps from the previous relay. But this is a particular policy for the problem where we gather w_{A+1}, w_{A+2}, ···, w_{A+B} and then decide where to place the relay, and clearly the average cost per step for this policy is λ*_ayg, which cannot be less than the optimal average cost λ*_ef.
A. Optimal Policy Structure for the Pure As-You-Go Approach
The variation of c_th(r) (see Section III-E and Section III-G; we have taken θ sufficiently close to 0) with r, for various values of the relay cost ξ_relay and the cost of outage ξ_out, is shown in Figure 5 and Figure 6. For a fixed ξ_out, c_th(r) decreases with ξ_relay; i.e., as the cost of placing a relay increases, we place relays less frequently. On the other hand, for a fixed ξ_relay, c_th(r) increases with ξ_out. This happens because, as the cost of outage increases, we cannot tolerate outage and place the relays close to each other. Note also that c_th(r) increases in r, as stated in Algorithm 1.

Fig. 7. Results for ξ_out = 100: mean cost per step, mean power per link, mean outage per link and mean placement distance (steps) vs. ξ_relay for the four algorithms: OptExploreLim, OptAsYouGo, HeuExploreLim, and HeuAsYouGo. The unit of ξ_relay is actually mW, but in this figure it is shown in dBm; ξ_relay, when expressed in dBm, is equal to 10 log_10(ξ_relay). In the Power plot, the HeuAsYouGo curve overlaps the OptAsYouGo curve, since the node power in the HeuAsYouGo algorithm was taken to be the same as the mean node power with the OptAsYouGo algorithm.

B. Comparison Among Various Deployment Algorithms
Next, assuming the system model described in Section II and the parameter values in Section V-A, we computed the mean cost per step, mean power per node, mean outage per link and mean placement distance (between successive relays) for the four deployment algorithms presented so far. Some of the results are shown in Figure 7. In order to make a fair comparison, we used the mean power per node for OptAsYouGo as the fixed node transmit power for HeuAsYouGo, and the mean outage per link of OptAsYouGo as the pre-fixed target outage for HeuAsYouGo. The following observations are made from the plots in Figure 7.
1) Mean Placement Distance (see the top left panel of Figure 7): Pure as-you-go algorithms (OptAsYouGo, HeuAsYouGo) place relays sooner than the algorithms that explore forward (OptExploreLim, HeuExploreLim) before placing a relay (see Figure 7). This is as expected, since pure as-you-go algorithms do not have the advantage of exploring several locations and then picking the best. A pure as-you-go approach tends to be cautious, and therefore tries to avoid high outage by placing relays frequently. As ξ_relay (the cost of a relay) increases, relays are placed less frequently (in accordance with Theorem 5).
2) Mean Outage per Link (see the top right panel of Figure 7): As ξ_relay increases, the mean outage per link increases, because we place fewer relays with larger inter-relay distances. Pure as-you-go algorithms have link outage probabilities comparable to the explore-forward algorithms, but they place relays too frequently. We observe that the per-link outage of HeuAsYouGo differs from that of OptAsYouGo. This happens because, whenever we place a node using HeuAsYouGo, the exact outage target is never met with equality. Also, the per-link outage may decrease with ξ_relay for HeuAsYouGo: as ξ_relay increases, the node power and the target outage (chosen from OptAsYouGo) increase in such a way that the per-link outage for HeuAsYouGo behaves in this fashion.
We have also observed that, as ξ_out, the penalty for outage, increases, the mean outage per link decreases; but that result is not shown here.

3) Mean Power per Link (see the bottom left panel of Figure 7): Increasing ξ_relay leads to relays being placed less frequently, and hence the transmit power increases. OptAsYouGo has a smaller placement distance compared to OptExploreLim and HeuExploreLim, and hence it uses less power at each hop; we note, however, that OptAsYouGo places more relays, and, hence, could still end up using more power per step. In the power plot, the HeuAsYouGo curve overlaps the OptAsYouGo curve, since the node power in the HeuAsYouGo algorithm was taken to be the same as the mean node power with the OptAsYouGo algorithm. We have also seen that increasing ξ_out (the cost per unit outage) lowers the outage, and hence the per-node transmit power increases.

4) Network Cost per Step (see the bottom right panel of Figure 7): The network cost per step is the optimal average cost per step; see (3). The cost increases with ξ_relay (see Figure 7) and ξ_out. OptAsYouGo has a larger cost than OptExploreLim and HeuExploreLim, owing to shorter links. The average cost per step of HeuExploreLim is very close to that of OptExploreLim, and the cost of HeuAsYouGo is close to that of OptAsYouGo, even though the heuristic policies are not optimal. However, we observed that this does not always happen. For example, for ξ_relay = 0.1 and ξ_out = 1000, we found that the average costs per step for OptAsYouGo and HeuAsYouGo are 1.3485 and 1.9581, respectively, and the average costs per step for OptExploreLim and HeuExploreLim are 0.9810 and 1.0537, respectively.

Discussion: (i) HeuExploreLim and HeuAsYouGo appear attractive at first sight because they are intuitive, easy to implement, and do not require any channel model for given ξ_out and ξ_relay. But they are suboptimal, and we do not have any performance guarantee (e.g., on the optimality gap w.r.t. the optimal algorithms OptExploreLim and OptAsYouGo). Hence, if we know the radio propagation model (e.g., η and σ) exactly, and if ξ_out and ξ_relay are given, it is better to compute the optimal policies and then deploy according to them. (ii) Note that the mean number of measurements made per step for the pure as-you-go approach is 1, whereas it is B/E(U) under the explore-forward approach, where E(U) is the mean distance between successive relays. From the numerical results presented in this section, we find that, under the explore-forward approach, the mean number of measurements required per step will be at most 3, and can be even less than 2 depending on the situation. For applications that do not require rapid deployment, such as deployment in a large forest for monitoring purposes, this many measurements is affordable.
Hence, for the learning algorithms presented later in the paper, we consider only the explore-forward approach. (iii) More importantly, in practice the propagation environment will not be known, and, in order to solve the problem defined in (4), we need to choose ξ*_out and ξ*_relay while deploying (as explained in Theorem 1), if possible. But we cannot choose this pair if we do not have prior knowledge of the propagation environment. A poor choice of ξ_out and ξ_relay might lead to violation of the constraints in the constrained problem defined in (4), or might result in a higher mean power per step compared to the optimal mean power per step under the constraints. Hence, we need to adapt ξ_out and ξ_relay as the deployment progresses; this is explained later in the paper. The adaptive algorithms use the structure of the optimal policy OptExploreLim.
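The measurement-rate comparison in point (ii) is simple arithmetic, sketched below; B = 5 is a hypothetical window size, while E(U) = 2.2859 is the mean placement distance that appears in the OptExploreLim example of Section VIII-B.

```python
def measurements_per_step(B, mean_U):
    """Explore-forward measures B candidate locations per placement cycle,
    and each cycle advances E(U) steps on average, giving B / E(U)
    measurements per step (pure as-you-go makes exactly 1 per step)."""
    return B / mean_U

rate = measurements_per_step(B=5, mean_U=2.2859)   # B is a hypothetical value
```

With these illustrative numbers the rate is about 2.19 measurements per step, consistent with the "at most 3" observation in the text.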
APPENDIX D
OPTEXPLORELIMLEARNING: LEARNING WITH EXPLORE-FORWARD, FOR GIVEN ξ_out AND ξ_relay
Proof of Theorem 9: Let us denote the shadowing random variable in the link between the potential locations at distances iδ and jδ from the sink node by W_{i,j}. The sample space Ω associated with the deployment process is the collection of all ω (each ω corresponds to a fixed realization {w_{i,j} : i ≥ 0, j ≥ 0, i > j, A+1 ≤ i−j ≤ A+B} of all possible shadowing random variables that might be encountered in the measurement process for deployment up to infinity). Let F be the Borel σ-algebra on Ω. Let S_k = Σ_{i=1}^k U_i be the distance (in steps) of the k-th relay from the source, and let F_k denote the σ-algebra generated by the measurements made up to the placement of the k-th relay. The sequence of σ-algebras F_k is increasing in k, and F_k captures the history of the deployment process up to the deployment of the k-th relay.
By Theorem 2 (in Chapter 2) of [20], if the four conditions below are satisfied, then λ^(k) converges almost surely to the unique zero of f(·). But that unique zero is the optimal average cost per step λ*, which satisfies f(λ*) = 0 (by Theorem 6). Hence, the problem reduces to checking conditions (i)-(iv).
Since f(λ) is Lipschitz continuous with Lipschitz constant (A+B), condition (i) is satisfied. Condition (ii) is satisfied by the choice of a_k.
By the definition of N_k, we have E_W(N_{k+1} | F_k) = E_W(N_{k+1} | λ^(k)) = 0 (since shadowing is i.i.d. across links, the shadowing values encountered in the process of measurement for placing a new node are independent of the shadowing values encountered in the measurement process for deploying the previous nodes), which implies that {N_{k+1}}_{k≥1} is a martingale difference sequence w.r.t. F_k. Now, since the conditional second moment dominates the conditional variance almost surely, and since γ ≤ P_M, A+1 ≤ u ≤ A+B, the outage probability always lies in [0, 1], and ξ_out and ξ_relay are fixed, E(|N_{k+1}|^2 | F_k) can be upper bounded by K(1 + |λ^(k)|^2) for some constant K > 0. Hence, condition (iii) is also satisfied. Condition (iv) is satisfied by the following lemma.
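For instance, writing c_{k+1} = γ_{k+1} + ξ_out Q_out,k+1 + ξ_relay for the (k+1)-th link cost and u_{k+1} for its length, and assuming N_{k+1} is the centered version of c_{k+1} − λ^(k) u_{k+1}, the bound can be obtained as follows (a sketch; the constant K is not unique):

```latex
\begin{aligned}
|N_{k+1}| &\le 2\sup\big|c_{k+1} - \lambda^{(k)} u_{k+1}\big|
           \le 2\big(C + (A+B)\,|\lambda^{(k)}|\big),
           \qquad C := P_M + \xi_{\mathrm{out}} + \xi_{\mathrm{relay}},\\
\mathbb{E}\big(|N_{k+1}|^2 \,\big|\, \mathcal{F}_k\big)
          &\le 8C^2 + 8(A+B)^2 |\lambda^{(k)}|^2
           \le K\big(1 + |\lambda^{(k)}|^2\big),
           \qquad K := 8\max\{C^2,\,(A+B)^2\}.
\end{aligned}
```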
Proof: Let us define K_0 to be the smallest integer such that a_k(A+B) < 1 for all k ≥ K_0 (K_0 exists since a_k ↓ 0). For any starting value λ^(0), it is easy to find a positive real number d (depending on the value of λ^(0)) such that λ^(k) ∈ [−d, d] for all k ≤ K_0; this is easy to see because the node transmit power, node outage probability and placement distance for each node are bounded quantities.
Without loss of generality, we can take d > P_M + ξ_out + ξ_relay, where P_M is the maximum transmit power level of a node. We already have λ^(k) ∈ [−d, d] for all k ≤ K_0. Now we show that λ^(k) ∈ [−d, d] for all k ≥ K_0. To this end, let us assume, as our induction hypothesis, that λ^(k) ∈ [−d, d] for some k ≥ K_0. If we can show that λ^(k+1) ∈ [−d, d], the proof is complete.
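Assuming the Algorithm 7 update has the form λ^(k+1) = λ^(k) + a_k(c_{k+1} − λ^(k) u_{k+1}), where c_{k+1} ∈ [0, P_M + ξ_out + ξ_relay) ⊂ [0, d) is the (k+1)-th link cost and u_{k+1} ∈ {A+1, ···, A+B} the link length, the induction step can be completed as:

```latex
\begin{aligned}
\lambda^{(k+1)} &= \big(1 - a_k u_{k+1}\big)\,\lambda^{(k)} + a_k c_{k+1},
\qquad 0 < a_k u_{k+1} \le a_k (A+B) < 1 \ \text{for } k \ge K_0,\\
|\lambda^{(k+1)}| &\le \big(1 - a_k u_{k+1}\big)\,d + a_k c_{k+1}
\le \big(1 - a_k u_{k+1}\big)\,d + a_k u_{k+1}\, d = d,
\end{aligned}
```

where the last inequality uses c_{k+1} < d and u_{k+1} ≥ 1, so that a_k c_{k+1} ≤ a_k u_{k+1} d.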
Proof of Corollary 1: With the particular choice of step size a_k in the corollary, the recursion in (11) of Algorithm 6 telescopes into an explicit average of the quantities observed up to the k-th placement; hence, in (11) of Algorithm 6, we can replace the recursive update by this explicit average, and this proves the corollary.

A. Proof of Theorem 10
Proof of the first statement: Let us assume that π*(ξ*_out, ξ*_relay) satisfies both constraints in (4) with equality for some ξ*_out > 0, ξ*_relay > 0, i.e., π*(ξ*_out, ξ*_relay) is an optimal policy for problem (4). Now, let us assume that there exist ξ_out ≥ 0, ξ_relay ≥ 0 satisfying (i) (λ*(ξ_out, ξ_relay), ξ_out, ξ_relay) ∈ K(q, N), and (ii) Q*_out(ξ_out, ξ_relay)/U*(ξ_out, ξ_relay) < q. We will show that this leads to a contradiction. Let us consider the problem of minimizing the mean outage per step subject to a constraint Γ*(ξ*_out, ξ*_relay)/U*(ξ*_out, ξ*_relay) on the mean power per step and a constraint 1/U*(ξ*_out, ξ*_relay) = N on the mean number of relays per step. Clearly, by Theorem 1, π*(ξ*_out, ξ*_relay) is an optimal policy for this problem, since it satisfies both constraints with equality. Note that π*(ξ*_out, ξ*_relay) has mean outage per step q. But we also see that the policy π*(ξ_out, ξ_relay) has the same mean power per step and a smaller mean number of relays per step compared to π*(ξ*_out, ξ*_relay) (since (λ*(ξ_out, ξ_relay), ξ_out, ξ_relay) ∈ K(q, N)), and has a strictly smaller mean outage per step compared to π*(ξ*_out, ξ*_relay). This leads to a contradiction, since π*(ξ*_out, ξ*_relay) is an optimal policy for the problem of minimizing the mean outage per step subject to a constraint Γ*(ξ*_out, ξ*_relay)/U*(ξ*_out, ξ*_relay) on the mean power per step and a constraint 1/U*(ξ*_out, ξ*_relay) = N on the mean number of relays per step. Similarly, we can show a contradiction if, instead of assuming Q*_out(ξ_out, ξ_relay)/U*(ξ_out, ξ_relay) < q, we had assumed 1/U*(ξ_out, ξ_relay) < N. Hence, the first statement is proved.
Proof of the second statement: This statement follows from the fact that, for any ξ_relay ≥ 0, π*(0, ξ_relay) always places at distance (A+B) and uses the smallest power P_1, thereby incurring a mean outage per step equal to E_W[Q_out(A+B, P_1, W)]/(A+B).
Let us assume that g(r, γ) is continuous in both ξ_out and ξ_relay (we will prove this assertion in Lemma 4 at the end of the proof of the theorem). By Lemma 4, the mean placement distance U*(ξ_out, ξ_relay) = Σ_{r=A+1}^{A+B} Σ_{γ∈S} r g(r, γ) is continuous in ξ_out and ξ_relay. Similarly, the mean power per link Γ*(ξ_out, ξ_relay) = Σ_{r=A+1}^{A+B} Σ_{γ∈S} γ g(r, γ) is continuous in ξ_out and ξ_relay.
Proof: Let us fix any r ∈ {A + 1, · · · , A + B} and any γ ∈ S. We will show that g(r, γ) is continuous in ξ out . The continuity of g(r, γ) w.r.t. ξ relay will follow the same line of arguments.
Consider any sequence {ξ_n}_{n≥1} such that ξ_n → ξ_out. Let us denote the joint probability distribution of placement distance and node transmit power by g_n(r, γ) when the cost per unit outage is ξ_n and OptExploreLim is used in the deployment process. We will show that g_n(r, γ) → g(r, γ) as n → ∞.
In state w, the OptExploreLim algorithm (Algorithm 3) will place the next relay at distance r and choose power level γ iff w ∈ E_{γ'} for all γ' ≠ γ, γ' ∈ S, and w ∈ E_{u,γ'} for all u ≠ r, γ' ∈ S.
Let us define E = (∩_{γ'≠γ} E_{γ'}) ∩ (∩_{u≠r, γ'∈S} E_{u,γ'}). Note that g(r, γ) = P(E) = E(I_E), where I denotes the indicator function, and the expectation is over the joint distribution of the shadowing vector W (the shadowing random variables from the B locations). Now, for any γ' ≠ γ, we have P(γ + ξ_out Q_out(r, γ, W_r) = γ' + ξ_out Q_out(r, γ', W_r)) = 0; a similar statement holds for any γ' ∈ S and u ≠ r. These two assertions follow from Assumption 2 and the fact that Q_out(r, γ, w) is continuous in w. Hence, we discard these zero probability events in our analysis and safely assume that: • For γ' ≠ γ, the complement of E_{γ'} has the same expression as E_{γ'}, except that the < sign is replaced by the > sign. • For γ' ∈ S, u ≠ r, the complement of E_{u,γ'} has the same expression as E_{u,γ'}, except that the < sign is replaced by the > sign. Now, consider any sequence {ξ_n}_{n≥1} such that ξ_n → ξ_out. Let E_{γ'}^(n), E_{u,γ'}^(n) and E^(n) be the sets obtained when we replace ξ_out by ξ_n in the expressions of the sets E_{γ'}, E_{u,γ'} and E, respectively. Clearly, we can make similar claims for E_{γ'}^(n), E_{u,γ'}^(n) for any n ≥ 1.
Recall that g(r, γ) = P(E) = E(I_E). Clearly, if we can show that E(I_{E^(n)}) → E(I_E), the lemma will be proved.
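This last convergence follows by dominated convergence (sketch): since the inequalities defining E hold strictly with probability one (the boundary events were discarded above) and ξ_n → ξ_out,

```latex
I_{E^{(n)}}(\omega) \longrightarrow I_{E}(\omega)
\ \text{for almost every } \omega, \qquad \big|I_{E^{(n)}}\big| \le 1,
```

so, by the Dominated Convergence Theorem,

```latex
g_n(r,\gamma) = \mathbb{E}\big(I_{E^{(n)}}\big)
\longrightarrow \mathbb{E}\big(I_{E}\big) = g(r,\gamma).
```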

C. Proof of Theorem 12
Let us denote the shadowing random variable in the link between the potential locations at distances iδ and jδ from the sink node by W_{i,j}. The sample space Ω associated with the deployment process is the collection of all ω (each ω corresponds to a fixed realization {w_{i,j} : i ≥ 0, j ≥ 0, i > j, A+1 ≤ i−j ≤ A+B} of all possible shadowing random variables that might be encountered in the measurement process for deployment up to infinity). Let F be the Borel σ-algebra on Ω. Let S_k = Σ_{i=1}^k U_i be the distance (in steps) of the k-th relay from the source, and let F_k denote the σ-algebra generated by the deployment history up to the k-th relay. The sequence of σ-algebras F_k is increasing in k, and F_k captures the history of the deployment process up to the deployment of the k-th relay.
Let us recall the outline of the proof of Theorem 12 in Section VII-B.
Proof: Let us define K_0 to be the smallest integer such that a_k(A + B) < 1 for all k ≥ K_0 (K_0 exists since a_k ↓ 0). For any starting value λ^{(0)}, it is easy to find a positive real number d (depending on the value of λ^{(0)}) such that λ^{(k)} ∈ [−d, d] for all k ≤ K_0; this is easy to see because ξ_relay^{(k)} ∈ [0, A_3] for all k, and the node transmit power, node outage probability and placement distance for each node are bounded quantities.
Without loss of generality, we can take d > P_M + A_2 + A_3, where P_M is the maximum transmit power level of a node. We already have that λ^{(k)} ∈ [−d, d] for all k ≤ K_0. Now we will show that λ^{(k)} ∈ [−d, d] for all k ≥ K_0. To this end, let us assume, as our induction hypothesis, that λ^{(k)} ∈ [−d, d] for some k ≥ K_0. If we can show that λ^{(k+1)} ∈ [−d, d], the proof will be complete.
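This induction step can be checked numerically under an assumed update form λ ← λ + a_k(c_k − λ u_k), with c_k the (bounded) sampled per-link cost and u_k the sampled placement distance; this form is consistent with the drift f used in the faster-timescale analysis but is an illustration, not the paper's exact recursion, and all constants below are invented.

```python
import random

random.seed(1)

# Sketch of the induction: if d > P_M + A_2 + A_3 and a_k * (A + B) < 1,
# the (assumed) update lam <- lam + a_k * (c_k - lam * u_k), with
# c_k in [0, P_M + A_2 + A_3] and u_k in {A+1, ..., A+B},
# maps [-d, d] back into [-d, d].  Constants are illustrative.
P_M, A2, A3, A, B = 2.0, 5.0, 10.0, 2, 4
d = P_M + A2 + A3 + 1.0          # d > P_M + A2 + A3
a_k = 0.9 / (A + B)              # a_k * (A + B) < 1

ok = True
for _ in range(10_000):
    lam = random.uniform(-d, d)             # induction hypothesis
    c = random.uniform(0.0, P_M + A2 + A3)  # bounded stage cost
    u = random.randint(A + 1, A + B)        # bounded placement distance
    lam_next = lam + a_k * (c - lam * u)
    ok = ok and (-d <= lam_next <= d)
```

The bound holds because 1 − a_k u ∈ (0, 1) makes λ(1 − a_k u) + a_k c a combination of a shrunken λ and a term a_k c ≤ a_k u d.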
2) Analyzing the Faster Timescale Iteration of λ^{(k)}: Let us denote by λ*(ξ_out, ξ_relay) the optimal average cost per step for the problem in (3), for given ξ_out and ξ_relay.
Using the first-order Taylor series expansion of the projection Λ_{[0,A_2]}(·), the ξ_out^{(k)} update can be written in terms of x^+ = max{x, 0} and x^− = −min{x, 0}; a similar expression holds for the ξ_relay^{(k)} update. Since Q_out(·, ·, ·) and u_k are bounded quantities, and since lim_{k→∞} b_k/a_k = 0, the slower timescale iterates (ξ_out^{(k)}, ξ_relay^{(k)}) are quasi-static on the timescale of the λ^{(k)} iteration. Now, note that the function f(λ, ξ_out, ξ_relay) = E_W min_{u,γ} {γ + ξ_out Q_out(u, γ, W_u) + ξ_relay − λu} is Lipschitz continuous in all arguments, and the o.d.e. λ̇(t) = f(λ(t), ξ_out, ξ_relay) has a unique globally asymptotically stable equilibrium λ*(ξ_out, ξ_relay) for any ξ_out ≥ 0, ξ_relay ≥ 0 (see the proof of Theorem 9). The quantity λ*(ξ_out, ξ_relay) is Lipschitz continuous in ξ_out and ξ_relay. Also, by Lemma 5 and the projection operation in the slower timescale, the iterates are bounded almost surely. Hence, by Lemma 6, the Euclidean distance between (λ^{(k)}, ξ_out^{(k)}, ξ_relay^{(k)}) and the set {(λ*(ξ_out, ξ_relay), ξ_out, ξ_relay)} converges to 0. But it is important to note that this lemma does not guarantee the convergence of the slower timescale iterates to a single point in the two-dimensional Euclidean plane.
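The two-timescale tracking behavior invoked here can be illustrated with a toy recursion (all dynamics and step sizes invented): a fast iterate x plays the role of λ^{(k)} and tracks the equilibrium x*(y) = y of its o.d.e., while a slow iterate y drifts toward a target θ using the tracked value x in place of y.

```python
import random

random.seed(2)

# Toy two-timescale stochastic approximation.  Fast o.d.e.: xdot = y - x,
# with unique stable equilibrium x*(y) = y.  Slow o.d.e.: ydot = theta - y,
# but the slow update only sees x, which tracks y.  Step sizes satisfy
# b_k / a_k -> 0, the defining two-timescale property.
theta = 3.0
x, y = 0.0, 0.0
for k in range(1, 200_001):
    a_k = 1.0 / k ** 0.6          # faster timescale
    b_k = 1.0 / k                 # slower timescale, b_k / a_k -> 0
    noise = random.gauss(0.0, 1.0)
    x += a_k * (y - x + noise)    # fast iterate: tracks x*(y) = y
    y += b_k * (theta - x)        # slow iterate: uses the tracked x
```

After many iterations x stays glued to y (the analogue of Lemma 6), and y approaches θ; neither claim by itself pins down a single limit point in general, which mirrors the caveat above.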
3) The Slower Timescale Iteration: Let us recall the notation Q_out(λ, ξ_out, ξ_relay), U(λ, ξ_out, ξ_relay), Q*_out(ξ_out, ξ_relay) and U*(ξ_out, ξ_relay) as defined in Section IV-B, and the update equation (13). We rewrite the slower timescale update equations in (13) as (23). Note that the functions f_1(ξ_out, ξ_relay), f_2(ξ_out, ξ_relay), g_1(λ, ξ_out, ξ_relay) and g_2(λ, ξ_out, ξ_relay) have been defined in (23). The quantities M_k^{(1)} and M_k^{(2)} are two zero-mean martingale difference noise sequences w.r.t. F_{k−1}; this can be seen as follows. Since shadowing is i.i.d. across links, the shadowing values encountered in the process of measurement for placing the k-th node are independent of the history of the process up to the placement of node (k − 1); hence, E[M_k^{(i)} | F_{k−1}] = 0 for i = 1, 2. Finally, since G = [0, A_2] × [0, A_3] is a product of intervals, projection onto the set G is nothing but coordinatewise projection: ξ_out^{(k+1)} is obtained by projecting onto [0, A_2], and ξ_relay^{(k+1)} by projecting onto [0, A_3].
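Because G is a product of intervals, the projection Λ_G factorizes into two scalar clippings; a minimal sketch (A_2, A_3 values invented):

```python
# Coordinatewise projection onto the box G = [0, A2] x [0, A3]:
# the nearest point of a rectangle is obtained by clipping each
# coordinate independently.
A2, A3 = 5.0, 10.0

def clip(v, lo, hi):
    return max(lo, min(hi, v))

def project_G(xi_out, xi_relay):
    # Lambda_G applied to an unconstrained slower-timescale iterate.
    return clip(xi_out, 0.0, A2), clip(xi_relay, 0.0, A3)
```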
Note that (23) is in the same form as the standard projected stochastic approximation recursion of [22]. Before checking the five conditions, we present a lemma that will be useful for checking one of them. Lemma 7: Suppose that Assumption 2 holds. Under the decision rule given by (9), the mean power per step Γ(λ, ξ_out, ξ_relay)/U(λ, ξ_out, ξ_relay), the mean number of relays per step 1/U(λ, ξ_out, ξ_relay) and the mean outage per step Q_out(λ, ξ_out, ξ_relay)/U(λ, ξ_out, ξ_relay) are continuous in λ, ξ_out and ξ_relay. Proof: The proof is similar to that of Theorem 11. Now we will check the conditions in A5. Note that Q_out(λ, ξ_out, ξ_relay) is continuous in each argument (by Lemma 7). Hence, Q_out(λ, ξ_out, ξ_relay) is uniformly continuous over the compact set [−d, d] × [0, A_2] × [0, A_3]. Now, by Lemma 6, the Euclidean distance between (λ^{(k)}, ξ_out^{(k)}, ξ_relay^{(k)}) and the set {(λ*(ξ_out, ξ_relay), ξ_out, ξ_relay)} converges to 0 almost surely as k → ∞. Hence, by uniform continuity, we conclude that lim_{k→∞} |Q_out(λ^{(k)}, ξ_out^{(k)}, ξ_relay^{(k)}) − Q_out(λ*(ξ_out^{(k)}, ξ_relay^{(k)}), ξ_out^{(k)}, ξ_relay^{(k)})| = 0. Also, these quantities are uniformly bounded across k ≥ 1, since the outage probabilities and placement distances are bounded quantities.
Checking Condition A5.3.1: This condition requires that G is the closure of its interior, which is true in our problem. It also requires that the L.H.S. of each constraint inequality in (24) be continuously differentiable, which is also true in our problem.
Note that the L.H.S. of each constraint inequality in (24) is a function of ξ_out and ξ_relay. Condition A5.3.1 of [22] requires that, for each point on the boundary of G, the gradients of the functions (in the L.H.S. of (24)) corresponding to the active constraints be linearly independent. Note that at each point of the boundary of G, at most two constraints can be simultaneously active (see (24)). If there are exactly two active constraints, one will be for ξ_out and the other for ξ_relay. Clearly, the gradients (with respect to the tuple (ξ_out, ξ_relay)) of the active constraint(s) at any boundary point of G are orthogonal, and hence linearly independent.
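Written out, the orthogonality claim is immediate; the display below restates (24) in an assumed explicit box form and lists the constraint gradients:

```latex
\begin{align*}
G &= \{(\xi_{\mathrm{out}},\xi_{\mathrm{relay}}) :\;
      -\xi_{\mathrm{out}} \le 0,\;\;
      \xi_{\mathrm{out}} - A_2 \le 0,\;\;
      -\xi_{\mathrm{relay}} \le 0,\;\;
      \xi_{\mathrm{relay}} - A_3 \le 0\},\\[2pt]
\nabla(-\xi_{\mathrm{out}}) &= (-1,0), &
\nabla(\xi_{\mathrm{out}} - A_2) &= (1,0),\\
\nabla(-\xi_{\mathrm{relay}}) &= (0,-1), &
\nabla(\xi_{\mathrm{relay}} - A_3) &= (0,1).
\end{align*}
```

At any corner of G, the two active gradients consist of one from each row, hence are orthogonal and linearly independent.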
Checking Condition A5.3.2: Let m(t) := sup{n ≥ 1 : Σ_{i=1}^{n} b_i ≤ t}. We have to show that there exists T > 0 such that, for any ε > 0, lim_{n→∞} P( sup_{j≥n} max_{0≤t≤T} | Σ_{i=m(jT)}^{m(jT+t)} b_i M_i^{(1)} | ≥ ε ) = 0, and similarly for M_i^{(2)}. We will prove the first result for any T > 0 and any ε > 0. Let us define the event E_j := { max_{0≤t≤T} | Σ_{i=m(jT)}^{m(jT+t)} b_i M_i^{(1)} | ≥ ε }; then P( sup_{j≥n} max_{0≤t≤T} | Σ_{i=m(jT)}^{m(jT+t)} b_i M_i^{(1)} | ≥ ε ) = P( ∪_{j≥n} E_j ) = lim_{J→∞} P( ∪_{j=n}^{J} E_j ), where the second equality follows from the continuity of probability.
Now, each P(E_j) can be bounded by Doob's inequality for martingales. Since |M_i^{(1)}| ≤ C for some C > 0 (the outage probability and the placement distance are bounded quantities; see the expression for M_i^{(1)}), the above quantity can be upper-bounded so that Σ_n P(E_n) < ∞. Hence, by the Borel–Cantelli lemma, P(lim sup_{n→∞} E_n) = 0, which establishes the condition. Next, if we replace λ^{(k)} by λ*(ξ_out^{(k)}, ξ_relay^{(k)}) and use the modified functions as in (25), the conditions checked in the previous subsection will still hold. This is evident from the fact that, once we know (ξ_out^{(k)}, ξ_relay^{(k)}), the quantity λ*(ξ_out^{(k)}, ξ_relay^{(k)}) becomes deterministic, and the randomness in the computation of the new iterates ξ_out^{(k+1)}, ξ_relay^{(k+1)} comes only from the new shadowing realizations. With the driving terms ( Q*_out(ξ_out, ξ_relay)/U*(ξ_out, ξ_relay) − q, 1/U*(ξ_out, ξ_relay) − N ), let us define the map Λ_G(·, ·). We want to show that the iterates (ξ_out^{(k)}, ξ_relay^{(k)}) will converge almost surely to the set of stationary points of the o.d.e. (ξ̇_out(t), ξ̇_relay(t)) = Λ_G( f_1(ξ_out(t), ξ_relay(t))/U*(ξ_out(t), ξ_relay(t)), f_2(ξ_out(t), ξ_relay(t))/U*(ξ_out(t), ξ_relay(t)) ). To this end, we show that the driving vector field is the gradient of a continuously differentiable function. Let us denote by Γ_π, U_π and Q_out,π the mean power per link, mean placement distance per link and mean outage per link, respectively, under any given stationary deployment policy π. Let us define the function G(ξ_out, ξ_relay) := min_π { Γ_π/U_π + ξ_out (Q_out,π/U_π − q) + ξ_relay (1/U_π − N) }. Lemma 8: G(ξ_out, ξ_relay) is continuously differentiable and its gradient is ( f_1(ξ_out, ξ_relay)/U*(ξ_out, ξ_relay), f_2(ξ_out, ξ_relay)/U*(ξ_out, ξ_relay) ). Proof: The proof of Lemma 8 will be provided later in this section.
Proof: The proof of Lemma 9 will be provided later in this section. One way of choosing A_2 and A_3 is described before the proof of that lemma.
Proof of Lemma 8: Suppose that, for a given (ξ_out, ξ_relay), the partial derivative ∂G/∂ξ_out exists. We will first show that this partial derivative is equal to f_1(ξ_out, ξ_relay)/U*(ξ_out, ξ_relay) = Q*_out(ξ_out, ξ_relay)/U*(ξ_out, ξ_relay) − q. Note that the right partial derivative w.r.t. ξ_out (if it exists) is ∂G/∂ξ_out+ = lim_{∆↓0} [G(ξ_out + ∆, ξ_relay) − G(ξ_out, ξ_relay)]/∆. Now, the optimal policy π*(ξ_out, ξ_relay) for the unconstrained problem in (3) will also minimize the expression for G(ξ_out, ξ_relay) in (27). But the policy π*(ξ_out, ξ_relay) will be suboptimal for the pair (ξ_out + ∆, ξ_relay). Hence, we have: G(ξ_out + ∆, ξ_relay) ≤ G(ξ_out, ξ_relay) + ∆( Q*_out(ξ_out, ξ_relay)/U*(ξ_out, ξ_relay) − q ), so that ∂G/∂ξ_out+ ≤ Q*_out(ξ_out, ξ_relay)/U*(ξ_out, ξ_relay) − q. In a similar manner, by using the fact that π*(ξ_out, ξ_relay) is suboptimal for the pair (ξ_out − ∆, ξ_relay), we can claim that ∂G/∂ξ_out− ≥ Q*_out(ξ_out, ξ_relay)/U*(ξ_out, ξ_relay) − q. Since we have assumed that ∂G/∂ξ_out exists, we must have ∂G/∂ξ_out+ = ∂G/∂ξ_out−, which proves that the partial derivative w.r.t. ξ_out is equal to ( Q*_out(ξ_out, ξ_relay)/U*(ξ_out, ξ_relay) − q ). We now turn to the existence of ∂G/∂ξ_out. Note that, since G(ξ_out, ξ_relay) is the minimum of a family of affine functions of ξ_out and ξ_relay, G(ξ_out, ξ_relay) is concave, and hence coordinatewise concave. Hence, for any given ξ_relay, there are at most countably many values of ξ_out where ∂G/∂ξ_out does not exist. To see this, let us define the function H(ξ_out, ξ_relay) to be the supremum of the subgradients of G(ξ_out, ξ_relay) with respect to ξ_out (keeping ξ_relay fixed), at a point (ξ_out, ξ_relay). Since G(ξ_out, ξ_relay) is concave, H(ξ_out, ξ_relay) is decreasing in ξ_out. But any monotone real-valued function has an at most countable number of discontinuities (see [24], Theorem 4.30). Hence, for a given ξ_relay, the function H(ξ_out, ξ_relay) is discontinuous at an at most countable number of values of ξ_out, and consequently ∂G/∂ξ_out exists everywhere except on an at most countable set of values of ξ_out.
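The two ingredients just used — a pointwise minimum of affine functions is concave, and at a differentiable point its derivative equals the slope of the minimizing affine piece (the envelope argument) — can be checked numerically on an invented family of affine "policies":

```python
# Each pair (intercept, slope) stands in for one stationary policy's
# affine cost in xi_out; the values are illustrative, not from the paper.
policies = [(1.0, 0.9), (2.0, 0.4), (3.5, 0.1)]

def G(xi):
    # Pointwise minimum over the affine family.
    return min(a + s * xi for a, s in policies)

def active_slope(xi):
    # Slope of the minimizing affine piece at xi (envelope derivative).
    return min(policies, key=lambda p: p[0] + p[1] * xi)[1]

# Concavity via the midpoint test on a grid.
xs = [0.1 * i for i in range(100)]
concave = all(G((x + y) / 2) >= (G(x) + G(y)) / 2 - 1e-12
              for x in xs for y in xs)

# Central-difference derivative at xi = 1.0, a differentiable point here.
h = 1e-6
num_deriv = (G(1.0 + h) - G(1.0 - h)) / (2 * h)
```

At xi = 1.0 the first piece is the unique minimizer, and num_deriv matches its slope, exactly the envelope identity the proof exploits for ∂G/∂ξ_out.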
For a given ξ_relay, let ξ̄_out be one such value where ∂G/∂ξ_out does not exist. Then there exists a sequence {ζ_n}_{n≥1} ↓ 0 such that ∂G/∂ξ_out exists at each ξ_out = ξ̄_out + ζ_n. This follows from the fact that, for any ζ > 0, we can find one ξ_out ∈ (ξ̄_out, ξ̄_out + ζ) where ∂G/∂ξ_out exists; otherwise, the number of points where ∂G/∂ξ_out does not exist would be uncountable. Similarly, there exists a sequence {κ_n}_{n≥1} ↓ 0 such that ∂G/∂ξ_out exists at each ξ_out = ξ̄_out − κ_n.
Note that, by concavity, lim_{n→∞} ∂G/∂ξ_out |_{ξ̄_out − κ_n} ≥ ∂G/∂ξ_out− |_{ξ̄_out} ≥ ∂G/∂ξ_out+ |_{ξ̄_out} ≥ lim_{n→∞} ∂G/∂ξ_out |_{ξ̄_out + ζ_n}. The last term in this chain of inequalities is equal to ( Q*_out(ξ̄_out, ξ_relay)/U*(ξ̄_out, ξ_relay) − q ), by the arguments at the beginning of this proof and by the continuity results in Theorem 11. The same argument holds for the first term in the chain of inequalities. Hence, ∂G/∂ξ_out− |_{ξ̄_out} = ∂G/∂ξ_out+ |_{ξ̄_out} = ∂G/∂ξ_out |_{ξ̄_out} = ( Q*_out(ξ̄_out, ξ_relay)/U*(ξ̄_out, ξ_relay) − q ). In a similar way, we can show that ∂G/∂ξ_relay = ( 1/U*(ξ_out, ξ_relay) − N ). Now we see that both partial derivatives of G exist at all points and are continuous in both ξ_out and ξ_relay (by Theorem 11). Hence, by Theorem 12.11 of [25], G(·, ·) is differentiable, and the lemma is proved.
Choice of A_2 and A_3: Let us consider the scenario where the o.d.e. (ξ̇_out(t), ξ̇_relay(t)) = Λ_G( f_1(ξ_out(t), ξ_relay(t))/U*(ξ_out(t), ξ_relay(t)), f_2(ξ_out(t), ξ_relay(t))/U*(ξ_out(t), ξ_relay(t)) ) has a stationary point (ξ̄_out, ξ̄_relay) (on the boundary of G) such that (λ*(ξ̄_out, ξ̄_relay), ξ̄_out, ξ̄_relay) ∉ K(q, N). In this case, if (λ^{(k)}, ξ_out^{(k)}, ξ_relay^{(k)}) → (λ*(ξ̄_out, ξ̄_relay), ξ̄_out, ξ̄_relay) (depending on the sample path of the iterates in the OptExploreLimAdaptiveLearning algorithm), then we cannot expect the desired performance from the OptExploreLimAdaptiveLearning algorithm. To alleviate this problem, we need to choose A_2 and A_3 in a proper way. One method of choosing A_2 and A_3 is given below.
We will first explain how A_2 has to be chosen. Note that, for any given link of length u and shadowing realization w, argmin_{γ∈S} (γ + ξ_out Q_out(u, γ, w)) = P_M if we choose ξ_out sufficiently large. We use this fact in the choice of A_2. The number A_2 has to be chosen so large that, under ξ_out = A_2 and for all A + 1 ≤ u ≤ A + B, we have P( argmin_{γ∈S} (γ + A_2 Q_out(u, γ, W)) = P_M ) > 1 − κ for some small enough κ > 0. Such a choice of A_2 ensures that (i) the mean power per link (under policy π*(A_2, ξ_relay)) satisfies Γ*(A_2, ξ_relay) ≥ (1 − κ)P_M + κP_1 (which is close enough to P_M), which, further, ensures that (ii) Γ*(A_2, ξ_relay)/(1/N) is greater than or equal to the optimal mean power per step for problem (4). The second claim is easy to see, since Γ*(A_2, ξ_relay) ≥ (1 − κ)P_M + κP_1 ≥ Γ*(ξ*_out, ξ*_relay), and since U*(ξ*_out, ξ*_relay) ≥ 1/N (recall Assumption 1 about the existence of ξ*_out and ξ*_relay). Note that the choice of κ depends on (q, N) and the radio propagation parameters, and hence must be made carefully so that the condition is satisfied. In the proof of Lemma 9, we will see that this condition ensures that, for any stationary point of the form ξ_out = A_2, ξ_relay ∈ (0, A_3), we have Q*_out(A_2, ξ_relay)/U*(A_2, ξ_relay) = q and 1/U*(A_2, ξ_relay) = N, and, consequently, the point (A_2, ξ_relay) will be in K(q, N).
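The condition P(argmin_γ (γ + A_2 Q_out(u, γ, W)) = P_M) > 1 − κ can be checked by Monte Carlo once a propagation model is fixed. The sketch below is purely illustrative: the power levels, the Rayleigh-fading outage function q_out, the constants c, η, σ, A, B and the candidate A_2 are all invented stand-ins for the paper's radio model and parameters.

```python
import random, math

random.seed(3)

# Illustrative Monte Carlo check of the A2 criterion.  All model
# constants below are invented; the paper's actual radio propagation
# model and parameters would replace them.
S = [0.5, 1.0, 2.0]           # candidate power levels; P_M = 2.0
P_M = S[-1]
eta, sigma_db = 3.0, 8.0      # path-loss exponent, shadowing std (dB)
A, B, kappa = 2, 4, 0.05
c = 1e-3                      # assumed outage-threshold constant

def q_out(u, gamma, w_db):
    # Assumed Rayleigh-fading outage probability given shadowing w_db:
    # strictly decreasing in gamma, as the argmin argument requires.
    return 1.0 - math.exp(-c * u ** eta / (gamma * 10 ** (w_db / 10.0)))

def prob_pm_optimal(A2, u, n=20_000):
    # Empirical P(argmin_gamma (gamma + A2 * q_out) = P_M) for link length u.
    hits = 0
    for _ in range(n):
        w = random.gauss(0.0, sigma_db)
        best = min(S, key=lambda g: g + A2 * q_out(u, g, w))
        hits += (best == P_M)
    return hits / n

A2 = 1e6
ok = all(prob_pm_optimal(A2, u) > 1 - kappa for u in range(A + 1, A + B + 1))
```

With ξ_out = 0 the argmin is always the cheapest power level, while a large A_2 makes P_M optimal except in extreme shadowing, which is exactly the trade-off the criterion quantifies.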
The choice of A_2 must satisfy another condition. We need to choose A_2 so large that Q*_out(A_2, 0)/U*(A_2, 0) ≤ q. Note that, if (q, N) is a feasible constraint pair, then a constraint q on the mean outage per step alone (if we drop the constraint on the relay placement rate) is also feasible. Let us consider the problem of minimizing the mean power per step subject to a constraint q on the mean outage per step; for this problem we set ξ_relay = 0. The mean outage per step under policy π*(ξ_out, 0) still decreases as ξ_out increases (by Theorem 5). Hence, we can choose an A_2 which satisfies this condition. This condition will be used in showing that, if (A_2, 0) is a stationary point of the o.d.e., then (λ*(A_2, 0), A_2, 0) ∈ K(q, N).
A_2 has to be chosen (according to the two criteria mentioned above) via prior computation, using prior knowledge of the propagation environment; if we know the range of values of the radio propagation parameters (e.g., η and σ), we can compute a value of A_2 that satisfies the criteria under all possible radio propagation parameters.
Once A_2 is chosen, we need to choose A_3. The number A_3 has to be chosen so large that, for any ξ_out ∈ [0, A_2], we have U*(ξ_out, A_3) > 1/N (provided that 1/N < A + B). This is possible, and is evident from the structure of OptExploreLim (Algorithm 3): by choosing ξ_relay large enough, we can achieve a mean placement distance equal to (A + B), provided that ξ_out ∈ [0, A_2]. For example, if we choose A_3 = 100(A + B)(P_M + A_2), then the relay cost dominates the power and outage costs, and π*(ξ_out, A_3) will always place at a distance of (A + B). This choice of A_3 ensures that the policy π*(ξ_out, A_3) satisfies the constraint on the relay placement rate with strict inequality, and hence no point of the form (ξ_out, A_3) is a stationary point of the o.d.e. Also, the numbers A_2 and A_3 have to be chosen so large that there exists at least one (ξ_out, ξ_relay) ∈ [0, A_2] × [0, A_3] such that (λ*(ξ_out, ξ_relay), ξ_out, ξ_relay) ∈ K(q, N).
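The dominance argument behind A_3 = 100(A + B)(P_M + A_2) can be sketched with the per-step (renewal-reward) cost ratio: even comparing the worst case for u = A + B (maximum power, certain outage) against the best case for any shorter u (zero power cost, no outage), the longer placement wins once ξ_relay is this large. The cost-ratio form and all numbers below are illustrative.

```python
# Sketch: with xi_relay = A3 = 100*(A+B)*(P_M + A2), the per-step cost
# (gamma + xi_out * q + xi_relay) / u is dominated by xi_relay / u,
# which is minimized at the maximum distance u = A + B.
A, B, P_M, A2 = 2, 4, 2.0, 5.0
A3 = 100 * (A + B) * (P_M + A2)

def per_step_cost(u, gamma, q, xi_out, xi_relay):
    # Renewal-reward average cost per step for a single-link cycle.
    return (gamma + xi_out * q + xi_relay) / u

# Worst case for u = A + B: max power (P_M), certain outage (q = 1).
worst_long = per_step_cost(A + B, P_M, 1.0, A2, A3)
# Best case for any shorter u: zero power cost, no outage.
best_short = min(per_step_cost(u, 0.0, 0.0, A2, A3)
                 for u in range(A + 1, A + B))
```

Since worst_long < best_short, placing at u = A + B is optimal for every shadowing realization, so the mean placement distance equals A + B.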
We can take care of stationary points of the form ξ relay ∈ (0, A 3 ), ξ out = 0 in a similar way.
Hence, the lemma is proved.

D. Proof of Theorem 13
We will only prove that lim sup_{x→∞} ( E_{π_oelal} Σ_{i=1}^{N_x} Γ_i ) / x ≤ γ* almost surely.