Abstract: Dialogue policy is a key component in a task-oriented dialogue system; it outputs the next system action given the current dialogue state. In recent years, dialogue policy learning has been widely formulated as a reinforcement learning problem. A common approach lets the dialogue agent learn by interacting with a user simulator. However, building a reliable user simulator is not an easy task, and it is usually as difficult as building a good dialogue agent. To avoid explicitly constructing a user simulator, this paper proposes a multi-agent dialogue policy learning method based on Proximal Policy Optimization (PPO), which models both the system side and the user side as agents. The method carries out policy learning in two stages: (1) the first stage is imitation learning, which adopts behavior cloning to pre-train the system policy and the user policy; (2) the second stage is multi-agent reinforcement learning, which uses the PPO algorithm, with its higher sample efficiency and better robustness, to jointly learn the dialogue policies of the system and the user. Finally, experiments are conducted on the public multi-domain, multi-intent task-oriented dialogue corpus MultiWOZ to verify the effectiveness of the method, and its scalability to complex tasks is also analyzed. In addition, the learned dialogue policies are integrated into the ConvLab-2 platform to evaluate the overall effect.
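
The two-stage scheme described above can be summarized by the following minimal sketch, which is not the authors' implementation: a supervised behavior-cloning step on expert state-action pairs, followed by a clipped-surrogate PPO update applied to each agent. All network sizes, function names, and data shapes here are illustrative assumptions.

```python
# Minimal sketch of the two-stage scheme: behavior-cloning pre-training
# (stage 1) followed by multi-agent PPO-style updates (stage 2).
# Module names, dimensions, and data shapes are illustrative assumptions.
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Maps a dialogue-state vector to a distribution over dialogue acts."""
    def __init__(self, state_dim=64, n_actions=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))
    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

def behavior_clone(policy, states, expert_actions, epochs=5, lr=1e-3):
    """Stage 1: supervised pre-training on expert (state, action) pairs."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        loss = -policy(states).log_prob(expert_actions).mean()
        opt.zero_grad(); loss.backward(); opt.step()

def ppo_update(policy, states, actions, old_log_probs, advantages,
               clip_eps=0.2, lr=3e-4):
    """Stage 2: one clipped-surrogate PPO step; run for both the
    system agent and the user agent on their own trajectories."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    ratio = torch.exp(policy(states).log_prob(actions) - old_log_probs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Illustrative usage with random stand-in data (real data would come
# from a dialogue corpus such as MultiWOZ).
sys_policy, usr_policy = Policy(), Policy()
states = torch.randn(256, 64)
expert_acts = torch.randint(0, 32, (256,))
behavior_clone(sys_policy, states, expert_acts)   # stage 1, per agent
behavior_clone(usr_policy, states, expert_acts)
with torch.no_grad():
    old_lp = sys_policy(states).log_prob(expert_acts)
adv = torch.randn(256)            # advantages from a critic (omitted here)
ppo_update(sys_policy, states, expert_acts, old_lp, adv)   # stage 2
```

In this sketch the same update functions are reused for the system and user policies, reflecting the paper's setup in which both sides are trained as agents rather than relying on a hand-built user simulator.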