This study proposes a value iteration model and Deep Q-learning (DQN) models to provide price suggestions for online sellers who practice dynamic pricing.
- Motivation and Background
- Methodology
- Value Iteration Model
- Deep Q-learning Models
- Conclusion
- Reference
With the rise of the internet and electronic devices, e-commerce has become one of the most popular ways to shop. Because information changes quickly and online shopping has its own special characteristics, more and more online sellers are adopting dynamic pricing. By doing so, they can maximize their profits and stay competitive in the market. However, since the uncertainty of the environment can be highly complicated, it is hard for online sellers to decide on their own when and how to adjust prices.
Meanwhile, reinforcement learning has become one of the most common ways to implement dynamic pricing. We therefore decided to focus on this issue and implement a dynamic pricing suggestion system with reinforcement learning.
Under the framework of a Markov Decision Process, the state space in our case consists of attributes of the product and the seller. The action space represents price-changing behaviors, for example, increasing or decreasing the price. For the reward, we referred to a 2019 thesis that conducts dynamic pricing on an e-commerce platform with reinforcement learning, using profit and the difference of revenue conversion rate as reward functions; this will be explained in more detail in the Deep Q-learning chapter.
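Formally, this corresponds to a Markov Decision Process tuple (the notation below is ours, summarizing the components just described):

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$$

where $\mathcal{S}$ is the set of states built from product and seller attributes, $\mathcal{A}$ is the set of price-adjustment actions, $P(s' \mid s, a)$ is the transition probability, $R$ is the reward (profit or difference of conversion rate), and $\gamma \in [0, 1)$ is a discount factor.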
The algorithm learns from historical data and helps us make price-adjustment decisions, which makes reinforcement learning an ideal approach for this problem. The following analysis is broken down into two sections. First, we apply value iteration, which requires the transition probabilities between states to be given; the agent updates the value table of the states according to the Bellman equation in each iteration until convergence. Second, we expand the problem with Deep Q-learning: the agent no longer needs to know either the transition probabilities or the reward function. Instead, we construct a neural network that generates the optimal action for a given state. DQN also allows us to attempt more complicated state settings, such as a continuous state space.
| | Value Iteration | Deep Q-learning |
|---|---|---|
| State | single variable (discrete) | single / multiple variables (continuous) |
| Action | five price values | remain / increase / reduce price |
| Reward | best profit state: 1, worst profit state: -1, else: 0 | profit / difference of conversion rate |
| Comparison | requires the transition probability matrix; the agent knows the reward function | transition probabilities not required; the agent does not know the reward function |
The dataset used in this part is the demand data of a specific product over 145 weeks, collected from an online food demand dataset on Kaggle. The columns checkout_price and num_orders serve as the price and demand values.
To obtain the state space, the price variable is discretized into five points. The agent can choose to move to any of the states as its action.
As for the reward function, a mapping between the price and demand variables is fitted on the historical data, so the predicted profit of each state can be calculated, as the figure below shows. The reward is set to one for the state with the highest profit, negative one for the state with the lowest profit, and zero otherwise.
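A minimal sketch of this state and reward construction; the price points, predicted demand, and unit cost below are placeholders rather than the actual Kaggle values:

```python
import numpy as np

# Five discretized price points; the values are placeholders, not the actual Kaggle data
prices = np.array([120.0, 135.0, 150.0, 165.0, 180.0])

# Demand predicted from the historical price-demand mapping (placeholder values)
predicted_demand = np.array([900.0, 500.0, 330.0, 230.0, 150.0])

unit_cost = 100.0  # assumed unit cost; the dataset does not provide one
predicted_profit = (prices - unit_cost) * predicted_demand

# Reward: +1 for the most profitable state, -1 for the least profitable, 0 otherwise
rewards = np.zeros(len(prices))
rewards[np.argmax(predicted_profit)] = 1.0
rewards[np.argmin(predicted_profit)] = -1.0
```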
The algorithm uses the Bellman equation to update the value of each state.
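In its standard value-iteration form (our notation, with discount factor $\gamma$), the update is:

$$V_{k+1}(s) = \max_{a \in \mathcal{A}} \sum_{s'} P(s' \mid s, a)\,\bigl[\,R(s, a, s') + \gamma\, V_k(s')\,\bigr]$$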
In each iteration, the algorithm calculates the expected reward of each action under the current price state. It then selects the action that brings the highest reward and updates the value table of the current state with that maximal value. The algorithm stops when the improvement is small enough, which implies convergence.
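A minimal sketch of this loop, assuming a transition probability matrix `P` of shape (n_states, n_actions, n_states) and the per-state reward vector from above; `gamma` and the convergence threshold are our own choices:

```python
import numpy as np

def value_iteration(P, rewards, gamma=0.9, tol=1e-6):
    """P[s, a, s2] = probability of landing in state s2 after taking action a in state s."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        V_new = np.empty(n_states)
        for s in range(n_states):
            # Expected value of each action: reward of the landing state plus discounted future value
            q_values = [np.sum(P[s, a] * (rewards + gamma * V)) for a in range(n_actions)]
            V_new[s] = max(q_values)            # update with the best action's value
        if np.max(np.abs(V_new - V)) < tol:     # stop when the improvement is small enough
            return V_new
        V = V_new
```

Since each action moves the agent directly to one of the five price states, each `P[s, a]` reduces to a one-hot vector on the target state.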
As the convergence diagram below shows, the iteration process converged within the first 100 iterations. The value table also shows that state zero, which has the highest profit, has the highest value, and the state with the lowest profit (state four) has the lowest value. This indicates that the agent tends to move toward states with a higher predicted profit, which meets our expectation.
From the analysis above, we can see that value iteration helps determine pricing decisions and indicates which state is better according to the reward function. However, because the state space setting is so simple, the result is relatively trivial. We therefore adopted DQN as our second method to see whether it can handle a more complicated case.
The dataset is collected from an online shoe shop and consists of two tables: order data and customer behavior data. The order data contains the purchased item amount and price of each order, while the behavior data contains the daily page-view and add-to-cart counts. The customer behavior data thus serves as a good source for trying a multiple-variable state setting.
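As a rough sketch, the two tables might be aggregated into a single daily feature table as follows; all file and column names here are hypothetical, since the original schema is not shown:

```python
import pandas as pd

# Hypothetical schemas: orders(date, price, amount), behavior(date, page_views, add_to_cart)
orders = pd.read_csv("orders.csv", parse_dates=["date"])
behavior = pd.read_csv("behavior.csv", parse_dates=["date"])

daily_orders = orders.groupby("date").agg(
    price=("price", "mean"),        # average selling price of the day
    items_sold=("amount", "sum"),   # total item amount sold that day
)

daily = daily_orders.join(behavior.set_index("date"), how="inner")
daily["conversion_rate"] = daily["items_sold"] / daily["page_views"]
```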
According to Liu et al. (2019), experimental results suggest that the difference of revenue conversion rates is a more appropriate reward function than revenue itself. We therefore define the daily conversion rate as the daily item amount sold divided by the daily page-view count.
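Written out (our notation; the original does not give an explicit formula), the daily conversion rate and one plausible form of the difference-based reward are:

$$CR_t = \frac{\text{items sold}_t}{\text{page views}_t}, \qquad r_t = CR_t - CR_{t-1}$$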
The following study implements four models with two different state settings and two different reward functions.
State Settings
- Single-variable state: The state is a vector of past price values over a look-back window of size window_size, which is set to ten in our study. That is, the state is composed of the prices of the past ten time periods.
- Multiple-variables state: The state is constructed from vectors of the price, add-to-cart, and page-view variables. As above, each vector is composed of the data from the past ten time periods (a sketch of both constructions follows below).
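A minimal sketch of the two state constructions, assuming plain arrays of daily prices, add-to-cart counts, and page views (everything except window_size is illustrative):

```python
import numpy as np

WINDOW_SIZE = 10  # window_size from the text

def single_variable_state(prices, t):
    """State at time t: the prices of the past WINDOW_SIZE periods."""
    return np.asarray(prices[t - WINDOW_SIZE:t], dtype=np.float32)

def multiple_variable_state(prices, add_to_cart, page_views, t):
    """State at time t: price, add-to-cart, and page-view vectors over the same window."""
    window = slice(t - WINDOW_SIZE, t)
    return np.concatenate([
        np.asarray(prices[window], dtype=np.float32),
        np.asarray(add_to_cart[window], dtype=np.float32),
        np.asarray(page_views[window], dtype=np.float32),
    ])
```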
Reward Functions
- Profit: computed from the amount of product predicted to be sold at the chosen price (see the Random Forest mapping below).
- Difference of conversion rate: the change in the daily conversion rate defined above.
The four models share the same action space, which consists of three actions: keeping the price unchanged, increasing the price, or reducing the price.
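As a rough illustration, the sketch below encodes this action space together with a simple Q-network that maps a state vector to one Q-value per action; the price step, the network architecture, and the use of PyTorch are our own assumptions rather than details from the original study:

```python
import torch
import torch.nn as nn

ACTIONS = ["remain", "increase", "reduce"]
PRICE_STEP = 5.0  # assumed magnitude of a price change; not specified in the original study

def apply_action(price, action_index):
    """Return the next price after taking the chosen action."""
    if ACTIONS[action_index] == "increase":
        return price + PRICE_STEP
    if ACTIONS[action_index] == "reduce":
        return price - PRICE_STEP
    return price  # "remain": keep the current price

class QNetwork(nn.Module):
    """Fully connected Q-network: a state vector in, one Q-value per action out."""
    def __init__(self, state_dim, n_actions=len(ACTIONS), hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

# Greedy action for a single-variable state of length 10
q_net = QNetwork(state_dim=10)
state = torch.randn(1, 10)
greedy_action = int(q_net(state).argmax(dim=1).item())
```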
To predict the amount of product sold, we construct a Random Forest model as a mapping between price and demand. The figures below show the expected relationship between price and the rewards.
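The following is a hedged sketch of such a price-to-demand mapping and of a profit reward built on top of it, using scikit-learn; the data points, the unit cost, and the margin-based profit definition are illustrative assumptions (the original study may compute profit differently):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Historical (price, items sold) pairs; placeholders standing in for the shop's daily data
prices = np.array([380.0, 395.0, 400.0, 410.0, 420.0, 435.0, 450.0])
items_sold = np.array([42, 40, 39, 37, 30, 24, 18])

demand_model = RandomForestRegressor(n_estimators=200, random_state=0)
demand_model.fit(prices.reshape(-1, 1), items_sold)

UNIT_COST = 300.0  # assumed cost per item; the original study does not state one

def profit_reward(price):
    """One possible profit reward: margin times the predicted amount sold at that price."""
    predicted_demand = demand_model.predict(np.array([[price]]))[0]
    return (price - UNIT_COST) * predicted_demand
```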
The model shows no sign of convergence after 750 episodes, and the agent tends to choose the price-increasing action under every condition. We conclude that, in this model, the price variable alone is not enough to represent the current state, so we moved on to the next model with multiple state variables.
We can observe that the total profit shows a stronger tendency to converge compared with the single-variable model. Moreover, the profits it gains generally surpass those obtained from the single-variable model.
The previous figure shows that the peak of the predicted profit is located at roughly 410 dollars. In the Q-table plot, most states with prices higher than 410 are suggested to reduce the price, while the increase-price suggestions appear when prices are lower than 410, which meets our expectation.
After adjusting the reward function to the difference of conversion rate, this single-variable model converged very quickly, while the total profit fluctuated between 400,000 and 405,000.
This model achieves a better reward value, yet does not obtain a higher profit. This may be because the model does better at maximizing the difference of conversion rate, but that improvement does not fully translate into profit.
In our research, we implemented and compared different reinforcement learning approaches to dynamic pricing, and both are effective in providing pricing suggestions. However, as real-life retailing involves more complicated situations and environments, Deep Q-learning may serve as the better method: our work shows that DQN can adopt more realistic state space settings, which enables it to take different conditions into consideration. We also found that having more state variables helps the DQN model gain higher profits, as the model shows more reasonable actions in the Q-table plot. Finally, different reward function settings may influence the degree of convergence and lead to different optimization results, yet we observe a positive relationship between the difference of conversion rate and profit, which makes it an alternative perspective for pricing suggestions.
How To Code The Value Iteration Algorithm For Reinforcement Learning, François St-Amant (2021)