Two-Stage Reinforcement Learning Policy Search for Grid-Interactive Building Control

Xiangyu Zhang, Yue Chen, Andrey Bernstein, Rohit Chintala, Peter Graf, Xin Jin, David Biagioni

Research output: Contribution to journal › Article › peer-review

9 Scopus Citations


This paper develops an intelligent grid-interactive building controller, which optimizes building operation during both normal hours and demand response (DR) events. To avoid costly on-demand computation and to adapt to non-linear building models, the controller uses reinforcement learning (RL) and makes real-time decisions based on a near-optimal control policy. Learning such a policy typically amounts to solving a hard non-convex optimization problem. We propose to address this problem with a novel global-local policy search method. In the first stage, an RL algorithm based on zero-order gradient estimation searches for the optimal policy globally; it is chosen for its scalability and its potential to escape some poorly performing local optima. The obtained policy is then fine-tuned locally to bring the first-stage solution closer to that of the original unsmoothed problem. Experiments on a simulated five-zone commercial building demonstrate the advantages of the proposed method over existing learning approaches. They also show that the learned control policy outperforms a pragmatic linear model predictive controller (MPC) and approaches the performance of an oracle MPC in testing scenarios. Using a state-of-the-art advanced computing system, we demonstrate that the controller can be learned and deployed within hours of training.
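The global-local idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy cost function, the antithetic Gaussian-smoothing estimator, the finite-difference local stage with step halving, and all names and hyperparameters (`objective`, `global_stage`, `local_stage`, `sigma`, `lr`, and so on) are assumptions chosen for illustration. The first stage descends a smoothed version of a non-convex cost using only function evaluations, and the second stage refines the result on the original unsmoothed cost.

```python
import math
import random


def objective(theta):
    """Toy non-convex cost with many local minima (hypothetical stand-in
    for the negative episode return of a control policy)."""
    return sum(x * x + 2.0 * (1.0 - math.cos(3.0 * x)) for x in theta)


def zero_order_gradient(f, theta, sigma, num_pairs, rng):
    """Estimate the gradient of the Gaussian-smoothed f from function
    values only, using antithetic perturbation pairs."""
    grad = [0.0] * len(theta)
    for _ in range(num_pairs):
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = f([t + sigma * e for t, e in zip(theta, eps)])
        minus = f([t - sigma * e for t, e in zip(theta, eps)])
        scale = (plus - minus) / (2.0 * sigma * num_pairs)
        for i, e in enumerate(eps):
            grad[i] += scale * e
    return grad


def global_stage(f, theta, rng, sigma=1.0, lr=0.05, iters=300, num_pairs=20):
    """Stage 1: descend the smoothed cost; a large sigma averages out
    the small-scale local minima of f."""
    for _ in range(iters):
        grad = zero_order_gradient(f, theta, sigma, num_pairs, rng)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


def local_stage(f, theta, step=0.05, iters=200, h=1e-5):
    """Stage 2: refine on the unsmoothed cost with central-difference
    gradients; a step is accepted only if it lowers the cost."""
    best = f(theta)
    for _ in range(iters):
        grad = []
        for i in range(len(theta)):
            up, down = list(theta), list(theta)
            up[i] += h
            down[i] -= h
            grad.append((f(up) - f(down)) / (2.0 * h))
        candidate = [t - step * g for t, g in zip(theta, grad)]
        value = f(candidate)
        if value < best:
            theta, best = candidate, value
        else:
            step *= 0.5  # halve the step when no improvement is found
    return theta


if __name__ == "__main__":
    rng = random.Random(0)
    start = [2.5, -1.7]
    coarse = global_stage(objective, start, rng)
    fine = local_stage(objective, coarse)
    print("start cost:", objective(start))
    print("after global stage:", objective(coarse))
    print("after local stage:", objective(fine))
```

In this toy setting the smoothing radius `sigma` controls the trade-off: a large value flattens the local minima so the first stage can move past them, while the second stage removes the bias that smoothing introduces. A real building-control policy would replace `objective` with simulated episode returns.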

Original language: American English
Pages (from-to): 1976-1987
Number of pages: 12
Journal: IEEE Transactions on Smart Grid
Issue number: 3
State: Published - 2022

Bibliographical note

Publisher Copyright:
© 2010-2012 IEEE.

NREL Publication Number

  • NREL/JA-2C00-79559


Keywords

  • demand response
  • reinforcement learning
  • smart building
  • zero-order gradient estimation


