Dynamic pricing has been widely studied, both theoretically and empirically. One classical approach is to formulate the problem as a bandit problem, where each action (or “arm”) corresponds to a price. The action space (the set of prices) is thus naturally continuous, leading to the framework of infinite-armed bandits.
Most of the existing literature assumes that the demand or valuation distribution is stationary over time. However, in many real-world applications, customers’ valuations evolve due to seasonality, trends, or external shocks. This motivates the study of non-stationary dynamic pricing.
Our recent NeurIPS paper (Nguyen et al., 2025) investigates the general problem of non-stationary Lipschitz bandits, a setting that has received little attention so far. In this note, we show how the results of this paper can be applied to dynamic pricing with time-varying valuations.
Suppose that you, the algorithm designer (or seller), have an unlimited supply of identical items and wish to maximize your cumulative revenue over time. At each selling opportunity, you must choose a price for the item. Setting a very low price (e.g., selling a 10 km Uber ride at \$2) guarantees sales but leads to poor revenue, while setting a very high price (e.g., a 10 km Uber ride at \$100) risks losing customers altogether. The challenge is to learn the optimal pricing strategy from interaction data.
Learning model. The interaction between the seller and the buyers is described as follows. At each round $t = 1, \ldots, T$, the seller posts a price $x_t \in [0,1]$; a buyer then arrives with a private valuation $v_t$, drawn independently from an unknown distribution $\mathcal{D}$, and purchases the item if and only if $v_t \geq x_t$.
The seller only observes the binary feedback \[ y_t = \mathbf{1}\{v_t \geq x_t\}, \] and receives the reward \[ r_t = x_t \, y_t, \] which is equal to the posted price if a sale occurs, and $0$ otherwise.
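For concreteness, here is a minimal simulation of this interaction protocol. The valuation distribution (uniform on $[0,1]$) and the pricing policy (uniformly random prices) are purely illustrative placeholders; a real learner would choose $x_t$ based on the feedback observed so far.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000

def post_price(t):
    # Placeholder pricing policy: uniformly random prices.
    # A learning algorithm would use the past feedback (y_1, ..., y_{t-1}) here.
    return rng.uniform(0.0, 1.0)

revenue = 0.0
for t in range(T):
    x_t = post_price(t)              # seller posts a price
    v_t = rng.uniform(0.0, 1.0)      # buyer's valuation (illustrative: uniform on [0, 1])
    y_t = float(v_t >= x_t)          # binary feedback: did a sale occur?
    r_t = x_t * y_t                  # reward: the posted price if sold, 0 otherwise
    revenue += r_t

print(f"cumulative revenue over {T} rounds: {revenue:.1f}")
```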
We introduce the action space $\mathcal{X} = [0,1]$, which represents the set of all possible prices that the seller can post. For a given price $x \in \mathcal{X}$, the expected reward is defined as
\[ \mu(x) = \mathbb{E}_{v \sim \mathcal{D}}\!\left[x \, \mathbf{1}\{v \geq x\}\right] = x \, \mathbb{P}_{v \sim \mathcal{D}}(v \geq x) = x \big(1 - F(x)\big), \]
where $F$ denotes the cumulative distribution function (c.d.f.) of the valuation distribution $\mathcal{D}$.
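For instance, if valuations are uniformly distributed on $[0,1]$, then $F(x) = x$ and $\mu(x) = x(1 - x)$, which is maximized at the price $x = 1/2$ and yields an expected revenue of $1/4$ per round.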
We now introduce the following regularity assumption.
Assumption (Lipschitz continuity of the demand). The function $x \mapsto F(x)$ is $L_F$-Lipschitz on $[0,1]$, that is,
\[ \forall x, x' \in [0,1], \qquad |F(x) - F(x')| \leq L_F |x - x'|. \]
This assumption is natural in many economic models: if two prices are close, then the corresponding sale probabilities are also close. Indeed, the probability of a sale at price $x$ is $1 - F(x)$, which directly inherits the Lipschitz regularity of $F$.
Note that the Lipschitz continuity of the distribution function $F$ directly implies the Lipschitz continuity of the mean reward function $\mu$. Indeed, for all $x, x' \in [0,1]$, we have
\[ \begin{aligned} |\mu(x) - \mu(x')| &= \big| x(1 - F(x)) - x'(1 - F(x')) \big| \\ &\leq |x|\,|F(x') - F(x)| + |1 - F(x')|\,|x - x'| \\ &\leq |F(x') - F(x)| + |x - x'| \\ &\leq (L_F + 1)\, |x - x'|, \end{aligned} \]
where we used the fact that prices are bounded in $[0,1]$, that $|1 - F(x')| \leq 1$, and that $F$ is $L_F$-Lipschitz.
Our objective is to identify the optimal price that maximizes the expected revenue, namely
\[ x^* = \arg\max_{x \in [0,1]} \mu(x) = \arg\max_{x \in [0,1]} x \, \mathbb{P}(v \geq x). \]
A natural performance criterion is the cumulative regret over a horizon of $T$ rounds, defined as
\[ \mathbb{E}[R_T] = \mathbb{E}\!\left[\sum_{t=1}^T \big( \mu(x^*) - \mu(x_t) \big)\right]. \]
Since we have reduced the stochastic dynamic pricing problem to a Lipschitz bandit problem on the action space $[0,1]$ with an $(L_F + 1)$-Lipschitz mean reward function, the minimax-optimal regret rate is known to scale as
\[ \mathbb{E}[R_T] \asymp (L_F + 1)^{1/3} \, T^{2/3}. \]
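To make this reduction concrete, here is a minimal sketch of the classical fixed-discretization approach for stationary Lipschitz bandits: discretize $[0,1]$ into roughly $T^{1/3}$ candidate prices and run a UCB index policy over them. This is a standard baseline attaining the $T^{2/3}$ rate in the stationary case, not the algorithm from our paper, and the simulated environment (uniform valuations) is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 50_000
K = max(2, int(np.ceil(T ** (1 / 3))))   # ~T^{1/3} candidate prices (discretization level)
prices = (np.arange(K) + 0.5) / K         # grid of candidate prices in (0, 1)

counts = np.zeros(K)                      # number of times each price was posted
sums = np.zeros(K)                        # cumulative reward collected at each price

def sale(price):
    # Illustrative environment: valuations uniform on [0, 1], so P(sale) = 1 - price.
    return float(rng.uniform() >= price)

total_reward = 0.0
for t in range(1, T + 1):
    if t <= K:
        i = t - 1                         # play each candidate price once
    else:
        means = sums / counts
        bonus = np.sqrt(2 * np.log(t) / counts)
        i = int(np.argmax(means + bonus)) # UCB index over the discretized prices
    r = prices[i] * sale(prices[i])
    counts[i] += 1
    sums[i] += r
    total_reward += r

# Under uniform valuations, mu(x) = x(1 - x) is maximized at x* = 1/2 with value 1/4.
print(f"regret: {0.25 * T - total_reward:.0f}  (compare with T^(2/3) = {T ** (2/3):.0f})")
```

The choice of $K$ balances the discretization error (of order $T/K$) against the bandit regret over $K$ arms (of order $\sqrt{KT}$, up to logarithmic factors); equating the two gives $K \asymp T^{1/3}$ and the $T^{2/3}$ rate above.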
In the non-stationary setting, the distribution of buyer valuations may vary over time. That is, the valuation $v_t$ is now sampled from a time-dependent distribution $\mathcal{D}_t$. This captures realistic scenarios such as demand shifts, seasonal effects, changing competition, or evolving customer preferences.
As a consequence, the expected revenue function also becomes time-dependent. For any price $x \in [0,1]$ and any round $t$, the mean reward is now given by
\[ \mu_t(x) = \mathbb{E}_{v \sim \mathcal{D}_t}\!\left[x \, \mathbf{1}\{v \geq x\}\right] = x \big(1 - F_t(x)\big), \]
where $F_t$ denotes the cumulative distribution function of $\mathcal{D}_t$.
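For example, suppose that at round $t$ the valuation $v_t$ is uniform on $[0, b_t]$ for some slowly drifting level $b_t \in (0, 1]$. Then $F_t(x) = \min(x/b_t, 1)$, the revenue function is $\mu_t(x) = x(1 - x/b_t)$ for $x \leq b_t$ (and $0$ beyond), and the revenue-maximizing price $b_t/2$ tracks the drift. Each $F_t$ is $(1/b_t)$-Lipschitz, so this example satisfies the regularity assumption below with $L_F = 1/\min_t b_t$.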
Assumption (Regularity with respect to prices). For every round $t$, the function $x \mapsto F_t(x)$ is $L_F$-Lipschitz on $[0,1]$. Consequently, each revenue function $\mu_t$ is $(L_F+1)$-Lipschitz.
At each round $t$, the optimal price is now defined as
\[ x_t^* = \arg\max_{x \in [0,1]} \mu_t(x), \]
which may change over time as the valuation distribution evolves.
The performance of a pricing policy $(x_t)_{t=1}^T$ is measured through the dynamic regret, defined as
\[ R_T = \sum_{t=1}^T \big( \mu_t(x_t^*) - \mu_t(x_t) \big). \]
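As an illustration, the snippet below evaluates the dynamic regret of the best fixed price in hindsight in a drifting environment (the uniform-on-$[0, b_t]$ example above, with an assumed sinusoidal drift). No single price can track the moving optimum, so even this clairvoyant static policy suffers dynamic regret that grows linearly with $T$.

```python
import numpy as np

T = 10_000
t = np.arange(T)
b = 0.6 + 0.4 * np.sin(2 * np.pi * t / T)    # drifting level b_t in (0, 1] (illustrative)

def mu(x, b):
    # Expected revenue at price x when valuations are uniform on [0, b]:
    # mu_t(x) = x * P(v_t >= x) = x * (1 - x / b) for x <= b, and 0 for x > b.
    return np.where(x <= b, x * (1.0 - x / b), 0.0)

oracle = mu(b / 2, b)                         # per-round optimal revenue: mu_t(b_t / 2) = b_t / 4

grid = np.linspace(0.0, 1.0, 1001)
fixed_values = np.array([mu(x, b).sum() for x in grid])
best_fixed = grid[fixed_values.argmax()]      # best single price in hindsight

dynamic_regret = oracle.sum() - mu(best_fixed, b).sum()
print(f"best fixed price: {best_fixed:.2f}, dynamic regret: {dynamic_regret:.0f} over T = {T}")
```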
Unlike the stationary regret, the dynamic regret compares the learner to an oracle benchmark that adapts optimally at each time step. This makes the problem fundamentally more challenging.
Using the theory of non-stationary Lipschitz bandits developed in our paper, we can design an algorithm that adapts to the changing environment without any prior knowledge of the non-stationarity.
In particular, our algorithm achieves the following minimax-optimal rate on the dynamic regret:
\[ \mathbb{E}[R_T] \;\asymp\; (L_F + 1)^{1/3} \, (L_T+1)^{1/3}\, T^{2/3}, \]
up to logarithmic factors and numerical multiplicative constants, where $L_T$ quantifies the amount of non-stationarity of the sequence of mean reward functions $(\mu_t)_{t \leq T}$. This bound shows that the regret interpolates between the stationary regime (when there is no significant shift, $L_T = 0$) and highly non-stationary environments (when $L_T = \mathcal{O}(T)$, in which case the bound becomes linear in $T$; minimax optimality tells us that we cannot do better in terms of rates).
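As a point of comparison only (this is not the algorithm from the paper), a simple way to handle non-stationarity is to restart a stationary learner on a schedule. The sketch below wraps the discretized UCB routine from above in periodic restarts with a window length $H$; choosing $H$ well would require knowledge of the non-stationarity, which is precisely what an adaptive algorithm dispenses with. All environment details are again illustrative.

```python
import numpy as np

def restart_ucb(sale_fn, T, H):
    """Periodic-restart baseline: run a fresh discretized UCB on blocks of length H.

    sale_fn(t, price) -> 0/1 sale indicator at round t. The window H must be chosen
    as a function of the (unknown) non-stationarity, which an adaptive algorithm avoids.
    """
    K = max(2, int(np.ceil(H ** (1 / 3))))
    prices = (np.arange(K) + 0.5) / K
    total = 0.0
    for start in range(0, T, H):
        counts, sums = np.zeros(K), np.zeros(K)   # forget everything at each restart
        for s in range(min(H, T - start)):
            t = start + s
            if s < K:
                i = s                              # play each candidate price once per block
            else:
                i = int(np.argmax(sums / counts + np.sqrt(2 * np.log(s + 1) / counts)))
            r = prices[i] * sale_fn(t, prices[i])
            counts[i] += 1
            sums[i] += r
            total += r
    return total

# Illustrative drifting environment: valuations uniform on [0, b_t].
T = 50_000
b = 0.6 + 0.4 * np.sin(2 * np.pi * np.arange(T) / T)
rng = np.random.default_rng(3)
revenue = restart_ucb(lambda t, x: float(rng.uniform(0, b[t]) >= x), T, H=5_000)
print(f"revenue: {revenue:.0f}  vs. per-round oracle total: {(b / 4).sum():.0f}")
```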