
A multi-task deep reinforcement learning-based recommender system for co-optimizing energy, comfort, and air quality in commercial buildings with humans-in-the-loop

Published online by Cambridge University Press:  04 November 2024

Stephen Xia
Affiliation:
Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL, USA
Peter Wei
Affiliation:
Department of Electrical Engineering, Columbia University, New York, NY, USA
Yanchen Liu
Affiliation:
Department of Electrical Engineering, Columbia University, New York, NY, USA
Andrew Sonta
Affiliation:
School of Architecture, Civil and Environmental Engineering, EPFL, Lausanne, Vaud, Switzerland
Xiaofan Jiang*
Affiliation:
Department of Electrical Engineering, Columbia University, New York, NY, USA
*
Corresponding author: Xiaofan Jiang; Email: jiang@ee.columbia.edu

Abstract

We introduce a novel human-centric deep reinforcement learning recommender system designed to co-optimize energy consumption, thermal comfort, and air quality in commercial buildings. Existing approaches typically optimize these objectives separately or focus solely on controlling energy-consuming building resources without directly engaging occupants. We develop a deep reinforcement learning architecture based on multitask learning with humans-in-the-loop and demonstrate how it can jointly learn energy savings, comfort, and air quality improvements for different building and occupant actions. In addition to controlling typical building resources (e.g., thermostat setpoint), our system provides real-time actionable recommendations that occupants can take (e.g., move to a new location) to co-optimize energy, comfort, and air quality. Through real deployments across multiple commercial buildings, we show that our multitask deep reinforcement learning recommender system has the potential to reduce energy consumption by up to 8% in energy-focused optimization, improve all objectives by 5–10% in joint optimization, and improve thermal comfort by up to 21% in comfort and air quality-focused optimization compared to existing solutions.

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Impact Statement

Optimizing building resources such as air conditioning, lights, and appliances is crucial for reducing energy consumption, enhancing occupant comfort, and maintaining a healthy environment in commercial buildings. While many works propose methods to optimize a subset of these objectives, they often limit themselves to controlling typical building resources. In this study, we introduce a multitask deep reinforcement learning recommender system that not only regulates building resources (setpoint temperature) but also engages occupants by suggesting movements to different locations throughout the day. We demonstrate that integrating both occupant and building actions into the optimization process can lead to greater improvements across multiple objectives, namely energy savings, comfort, and air quality.

1. Introduction

Commercial buildings are responsible for nearly 40% of total energy consumption in the United States (US Department of Energy, 2015). To work toward future sustainability, research communities, industry, and government agencies have developed projects and policies to improve energy efficiency in buildings. However, studies have shown that these efforts can still be improved (Seruto, Reference Seruto2010). In addition to energy consumption, comfort and air quality are key targets for optimization in commercial buildings, since improving them can lead to many benefits, such as increased productivity and occupant health (Lipczynska et al., Reference Lipczynska, Schiavon and Graham2018). Jointly optimizing energy, comfort, and air quality is challenging due to the complex and often conflicting nature of these objectives.

Prior works focus on managing building infrastructure and energy-consuming resources (e.g., air conditioning, lighting, and smart appliances) to optimize energy, comfort, and air quality. However, buildings are constructed to improve occupants’ comfort, quality of life, and productivity. As we will show through our deployments in Section 7, tuning building infrastructure and resources alone, without considering humans in the loop, may achieve only limited savings. For example, optimizing the thermal comfort of multiple occupants in the same room who prefer different temperatures cannot be achieved by only changing the thermostat temperature. Because most energy-consuming infrastructure in a building is used to service occupants, the improvements we can achieve without directly engaging occupants are limited. This work explores how tuning human actions as a second knob, in conjunction with the control of traditional building infrastructure (namely heating, ventilation, and air conditioning), can provide greater optimization over energy, comfort, and air quality.

We present and deploy RECA, a novel recommender system that co-optimizes energy, comfort, and air quality with humans-in-the-loop by generating move and setpoint recommendations, and that allows building managers to tune which objectives to emphasize. The full architecture is shown in Figure 1. For example, tuning RECA to aggressively reduce energy consumption may prompt more occupants to move to shared spaces to reduce heating, ventilation, and air conditioning (HVAC) service to other spaces. This decreases energy consumption but also decreases air quality and comfort, since more people occupy a single space and one temperature setpoint may not satisfy everyone’s preferences. As another example, tuning RECA to aggressively improve the overall comfort of each occupant may recommend that more occupants move to separate spaces, where they can set their own HVAC preferences. However, this requires HVAC service to more spaces, which can drive up energy consumption.

Figure 1. RECA’s system architecture. To account for the challenge of cold-start, RECA leverages a simulation environment with statistical models to estimate future building states and generate more training examples from a past history of observed building states and recommendations.

Unlike traditional building resources, where an action can be directly controlled with a command or a piece of code, people will only perform a recommendation if the incentive is high enough or the inconvenience is low enough. For example, a person is more likely to accept a recommendation to move to a space that s/he already commonly uses. We take a deep reinforcement learning approach to learn both human move recommendations and building thermostat setpoint actions that are effective at optimizing the objectives and likely to be accepted by users. We also take a multitask learning approach to simultaneously and efficiently adapt to three diverse, yet related, tasks. Existing works that introduce deep reinforcement learning techniques for building control generally focus on actuating traditional building resources, such as window blinds and HVAC (Wei et al., Reference Wei, Chen, Vega, Xia, Chandrasekaran and Jiang2017a; Ding et al., Reference Ding, Du and Cerpa2019). Instead, our work explores how human building interaction (HBI), namely human movement and location, can be used to improve the sustainability, quality of life, and health of our built environments. Our contributions are as follows:

  1. We introduce RECA, a deep reinforcement learning recommender system that co-optimizes energy savings, occupant comfort, and air quality with humans-in-the-loop in real commercial buildings. Unlike previous works, RECA is tunable, allowing building managers to prioritize energy savings, occupant comfort, and/or air quality, and engages occupants with actionable recommendations (move and thermostat setpoint) to improve these objectives.

  2. We create a novel deep reinforcement learning architecture, using multitask learning to learn the effects of actions on energy, comfort, and air quality. Our architecture utilizes an embedding to efficiently learn the location configurations of occupants and relationships between different locations.

  3. Over a four-week study, we evaluate RECA in two commercial office buildings and show that our system can account for a wide range of configurations that emphasize combinations of energy, comfort, and air quality.

2. Related works

Building optimization is an important topic of interest and has received a significant amount of attention in the research community. Typically, studies focused on optimizing a combination of energy, comfort, and air quality first model a space or building, and then develop a control algorithm to perform the optimization. Modeling software such as EnergyPlus (Crawley et al., Reference Crawley, Lawrie, Winkelmann, Buhl, Huang, Pedersen, Strand, Liesen, Fisher, Witte and Glazer2001) or TRNSYS (Fiksel et al., Reference Fiksel, Thornton, Klein and Beckman1995) has been used in various works (Reynders et al., Reference Reynders, Diriken and Saelens2014; Kwak et al., Reference Kwak, Huh and Jang2015; Sturzenegger et al., Reference Sturzenegger, Gyalistras, Morari and Smith2015; Delgarm et al., Reference Delgarm, Sajadi, Delgarm and Kowsary2016; Langevin et al., Reference Langevin, Wen and Gurian2016) to produce physics-based models of spaces, while data-driven models such as artificial neural networks (Kim et al., Reference Kim, Jeon and Kim2016; Moon and Jung, Reference Moon and Jung2016; Macarulla et al., Reference Macarulla, Casals, Forcada and Gangolells2017) use real-world data to create models of spaces. These building models are then integrated into IoT systems (Rastogi et al., Reference Rastogi, Barthwal and Lohani2019) or paired with different controllers, such as model predictive control (Maasoumy et al., Reference Maasoumy, Rosenberg, Sangiovanni-Vincentelli and Callaway2014; Kwak and Huh, Reference Kwak and Huh2016), adaptive algorithms (Benedetti et al., Reference Benedetti, Cesarotti, Introna and Serranti2016), or genetic algorithms (Delgarm et al., Reference Delgarm, Sajadi, Delgarm and Kowsary2016; Kim et al., Reference Kim, Jeon and Kim2016), to optimize energy consumption and comfort. In general, these works rely on controlling building resources, rather than engaging occupants in the building co-optimization process.

A few recent works have proposed methods for optimizing energy and comfort by assigning occupants to certain locations depending on thermal preferences. In Nagarathinam et al. (Reference Nagarathinam, Vasan, Sarangan, Jayaprakash and Sivasubramaniam2018), the authors consider an open-plan space where occupants can work in multiple locations, and optimize global thermal comfort by grouping occupants with similar thermal comfort preferences. Nagarathinam et al. (Reference Nagarathinam, Vasan, Sarangan, Jayaprakash and Sivasubramaniam2021) go a step further and implement a model predictive controller to simultaneously optimize for energy consumption. A different work (Sonta et al., Reference Sonta, Dougherty and Jain2021) proposes a method for optimizing occupant workstation layouts to reduce lighting energy consumption. In contrast, we develop a recommender system that actively engages occupants throughout the day by delivering real-time actionable recommendations, allowing our system to adapt to changes in building resources and occupant behavior throughout the day. Moreover, our system not only uses human feedback to inform actions, but also recommends actionable steps occupants can take to improve their own comfort, energy footprint, and/or air quality.

Reinforcement learning has become an important tool for addressing dynamic environments. Many works introduce reinforcement learning-based strategies for controlling windows and HVAC resources to optimize energy consumption or comfort/air quality (Dalamagkidis et al., Reference Dalamagkidis, Kolokotsa, Kalaitzakis and Stavrakakis2007; Yang et al., Reference Yang, Nagy, Goffin and Schlueter2015; Chen et al., Reference Chen, Norford, Samuelson and Malkawi2018; Zhang et al., Reference Zhang, Chong, Pan, Zhang, Lu and Lam2018; An et al., Reference An, Xia, You, Lai, Liu and Chen2021). Wei et al. (Reference Wei, Xia and Jiang2018b) and Wei et al. (Reference Wei, Xia, Chen, Qian, Li and Jiang2020) engage occupants by providing recommendations (e.g., schedule changes) to optimize energy consumption through Q-table and deep Q-network based recommender systems, and demonstrate that strategies that do not engage occupants may pass over significant optimization opportunities. In contrast, our work co-optimizes comfort and air quality, in addition to energy consumption, by incorporating humans in the optimization process.

3. Challenges

Optimizing multiple metrics in commercial buildings presents many challenges, which we distil into three key areas. First are challenges in understanding the problem: how do we model energy, comfort, and air quality in different locations using parameters we can measure? Second are challenges in optimization: what actions can be taken to help optimize energy, comfort, and air quality in different locations? Finally, there are challenges in learning: how can we develop a recommender system that learns the best actions for reducing energy and improving comfort and air quality?

3.1. Building modeling

To perform accurate optimization, the system needs access to realistic building measurements and the ability to make accurate predictions of future states. Prior works have carefully studied energy and air quality monitoring by utilizing sensor nodes and accessing building management systems. However, thermal comfort as an objective is more challenging, as it requires either direct measurement of individuals or indirect modeling using environmental sensing. Indirect modeling is not always accurate and does not take personal thermal preferences into account. Direct thermal comfort sensing, on the other hand, presents its own challenges, such as perspective and scalability.

Once the building state can be sensed, various models can be used to make predictions of future states. These predictions are critical to exploring the possible future solution space during optimization. However, standard computational methods for modeling commercial buildings, such as EnergyPlus (Crawley et al., Reference Crawley, Lawrie, Winkelmann, Buhl, Huang, Pedersen, Strand, Liesen, Fisher, Witte and Glazer2001), tend to perform poorly in comparison to real measurements (Norford et al., Reference Norford, Socolow, Hsieh and Spadaro1994). In the context of energy simulation in our use case, standard physics-based tools like EnergyPlus face a few key challenges. These models are typically evaluated at the facility level, making it difficult to understand differences in energy consumption across individual spaces or rooms in buildings. They also tend to exhibit larger errors at finer temporal granularity, making it difficult to understand how short-term interventions impact energy. Finally, these models have been shown to have difficulty handling occupancy data, one of the key features that impacts energy consumption and forms a key recommendation strategy in our overall system. To address these challenges, researchers have pointed to data-driven surrogate modeling, typically leveraging supervised machine learning tools, as an alternative approach that can address the key issues with physics-based modeling outlined here. As discussed below, we implement a novel simulation environment based on data-driven statistical modeling to address these challenges associated with simulating different building states.

3.2. Co-optimizing energy, comfort, and air quality

Typical methods for optimizing building objectives such as energy consumption involve changing setpoints in different locations; this reduces energy but also changes thermal comfort and air quality. Various works have found that changing setpoints, and thus room temperatures, while staying within thermal comfort boundaries can reduce energy consumption. In this work, we also seek to optimize thermal comfort and air quality, so this strategy alone may not be optimal. We take a different approach by incorporating humans in the co-optimization process. We add an additional kind of recommendation, a move action, to move people from one location to another. To illustrate the potential advantages of the move recommendation, we provide the following example.

Consider two occupants A and B, in two different rooms 1 and 2, respectively. Without move recommendations, the best that can be done is to change the setpoint temperatures, which may reduce energy consumption at the cost of affecting thermal comfort and air quality. However, with move actions, occupant A can be recommended to move to room 2, and the setpoint of room 1 can be changed to reduce energy consumption.

The main challenge is how to determine the actions, and sequences of actions, which can result in optimization of the different metrics. Furthermore, emphasis on certain metrics may be more desired depending on the situation. Air quality may be a focus during a pandemic, while thermal comfort may be a focus in office environments where worker productivity is highly valued.

3.3. Deep reinforcement learning recommendation systems

Sequences of actions can lead to more optimal building states whose benefits cannot be estimated from single actions. In this work, we utilize deep reinforcement learning to learn the impacts of these sequences of actions; however, a number of challenges limit the application of standard deep reinforcement learning techniques to co-optimizing multiple objectives in buildings. First, energy, comfort, and air quality are all objectives to be co-optimized. Prior works that co-optimize multiple objectives combine them into a single reward, but this creates two issues: the relative importance of the objectives may differ between people, and information about individual objective changes that could be presented to occupants is lost.

Second, our goal is to incorporate human actions into the co-optimization process. To do this, we require human location as part of the building state. However, the possible location configurations of occupants pose a scaling problem for large numbers of people or rooms. For example, representing the locations of individuals as one-hot vectors results in a state space of size $ {\left|R\right|}^{\left|O\right|} $, where R and O are the sets of rooms and occupants, respectively. To enable scaling to larger deployments, a network capable of incorporating location information in a dense manner can lead to a better understanding of the state space.

Finally, training a deep network requires a large amount of training data. In this application, important training data consists of energy savings, comfort improvements, and air quality improvements due to certain actions, which are low in volume. To overcome this cold start problem, we require the ability to quickly generate training data that can reasonably estimate objective changes for different actions depending on the environment.

4. Deep reinforcement learning-based recommender system

We propose RECA, a deep reinforcement learning (DRL)-based recommender system for co-optimizing energy consumption, occupant thermal comfort, and air quality, as shown in Figure 1. There are several reasons why this problem is challenging. First, directly modeling all of the dynamics in a commercial building that contribute to energy usage, comfort, and air quality is impractical, and the building’s resources and occupants change over time, making it even more difficult. Second, the effects of different actions on these objectives are difficult to quantify, especially if changes are not realized until multiple steps in the future. Lastly, different occupants may value certain objectives and recommendations differently.

Model-free deep Q-learning can help address these challenges. First, deep Q-learning utilizes a deep neural network to approximate the state-action function, which is beneficial for large state spaces with many occupants and building resources. Second, deep Q-learning can learn long-term action returns independent of the policy being followed and without requiring an explicit model of the complex interactions between the environment and occupants. Lastly, returns for actions specific to each occupant can be learned separately, which allows the model to account for different preferences among users.

We first explain how we represent the building energy, comfort, and air quality co-optimization problem in the context of deep Q-learning (Section 4.1). Next, we introduce the deep Q-network model for generating actionable recommendations to occupants to co-optimize the three objectives (Section 4.2). Finally, we introduce our full recommender system that leverages the predictions provided by the Q-network to “recommend” actions to occupants to improve our objectives (Section 4.3).

4.1. Deep reinforcement learning formulation

We represent the building co-optimization problem as follows. At each time step, the network uses a policy to choose an action a from a set of possible actions $ a\in A $ based on the current state of the building, s, which constantly changes over time due to agent recommendations and external factors like outside temperature. This action is sent to the actor(s) (occupants in our case), who then accept or reject it. Our system then observes a reward, r, representing short-term changes given by the environment, which in our case are improvements in energy, comfort, and air quality. For example, our system may recommend that the only occupant in room A move to room B. The occupant accepts and moves to room B, allowing the building to turn down the HVAC and lights in room A. As a result, the system observes a reward of energy savings from room A.

The building environment continuously changes, due to agent recommendations and external factors such as occupant location changes and environmental factors. Thus, we can formulate the problem as a finite Markov Decision Process (MDP). Each action has a transition probability of occurrence at the current state and a reward ( $ r $ ), representing short-term changes due to the action.

The goal of the deep Q-network is to estimate long-term changes associated with each action given the current building state s. Standard reinforcement learning aims to maximize the return at time t, defined as $ {R}_t={\sum}_{i=t}^T{\gamma}^i{r}_i $, where $ {r}_i $ is the reward observed at time i, $ \gamma \in \left(0,1\right) $ is the “discount factor”, and T is the end time (or end of day in commercial buildings). We leverage Q-learning, a widely used model-free reinforcement learning method, where the agent seeks an action-value function Q(s,a) that represents the return of taking an action a at a given state s. If the state/action space is too large, Q(s,a) is too complex to be stored in a data structure, but it can be approximated effectively using a deep neural network (Mnih et al., Reference Mnih, Kavukcuoglu, Silver, Graves, Antonoglou, Wierstra and Riedmiller2013).
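To make the learning target concrete, the following minimal Python sketch (our illustration, not code released with the paper) shows the standard one-step Q-learning bootstrap target that such a network is trained toward; the discount value is a placeholder.

```python
import numpy as np

GAMMA = 0.95  # discount factor in (0, 1); the exact value is our placeholder

def td_target(reward: float, next_q_values: np.ndarray, done: bool) -> float:
    """One-step Q-learning target: r + gamma * max_a' Q(s', a').

    `reward` is the observed short-term change in energy, comfort, and
    air quality; `next_q_values` holds the network's estimates for all
    actions at the next building state s'.
    """
    if done:  # end of day: nothing left to bootstrap from
        return reward
    return reward + GAMMA * float(np.max(next_q_values))
```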

4.1.1. State-actions and recommendation types

RECA’s state comprises the following features, which capture the three objectives we are co-optimizing.

  • Per space: Energy consumption, temperature, humidity, thermostat setpoint temperature, PM2.5, and PM10.

  • Per occupant: Location of each occupant in the building at the space or room level (e.g., the occupant is in lab space A), including a null indicator if the occupant is absent.

We divide the building into spaces based on functional purpose, as in Wei et al. (Reference Wei, Wang and Zhu2017b) and Wei et al. (Reference Wei, Chen, Vega, Xia, Chandrasekaran and Jiang2018a), because occupants generally use and refer to spaces in this way (e.g., group A’s workspace). There are three categories of actionable recommendations that our system generates for occupants in a building with |O| occupants and |S| spaces.

  • Move: This recommendation suggests that a user move to a different location. For example, moving multiple occupants into a single location may allow the building to turn down HVAC in other locations and reduce energy. There are $ \mid O\mid \times \mid S\mid $ possible move actions that can be recommended. Suggesting that occupants move to areas they would never reside in is unproductive; instead, we introduce mechanisms that help RECA adapt to user location preferences (Section 6.4).

  • Thermostat setpoint changes: This recommendation suggests that a user change the thermostat setpoint at their location. We recommend changes in setpoint of $ \approx \pm 2 $ degrees Fahrenheit (exact values discussed in Section 6.3.1). As such, there are $ 2\times \mid O\mid \times \mid S\mid $ possible setpoint actions that can be recommended.

  • Temperature and lighting relaxation: In empty rooms, we relax the temperature setpoint by 2 degrees Fahrenheit and turn off lights, without affecting occupants. This action is directly taken by the building when it observes empty spaces.

These categories of actionable recommendations allow our system to incorporate actions that existing works use to optimize energy savings and comfort while enabling more complex action sequences, described in Section 7.3, that are otherwise not possible.
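As a concrete illustration of the occupant-facing action space above, the sketch below enumerates the move and setpoint actions (the relaxation action is taken directly by the building, so it is excluded). The function and tuple encoding are our own hypothetical choices, not the paper’s implementation.

```python
from itertools import product

def enumerate_actions(occupants, spaces, setpoint_delta_f=2.0):
    """Enumerate the recommendation space described above.

    Move actions: one per (occupant, space) pair -> |O| x |S| actions.
    Setpoint actions: raise or lower by ~2 degrees F per pair
    -> 2 x |O| x |S| actions.
    """
    actions = []
    for occupant, space in product(occupants, spaces):
        actions.append(("move", occupant, space))
        actions.append(("setpoint", occupant, space, +setpoint_delta_f))
        actions.append(("setpoint", occupant, space, -setpoint_delta_f))
    return actions

# Example: 5 occupants and 4 spaces give 20 move and 40 setpoint actions.
```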

4.1.2. Reward

Since we are co-optimizing energy savings, thermal comfort, and air quality, the reward is the improvement in these objectives from one time step to the next; it depends on the current building state, the current action, and the building state at the beginning of the next time step. Equation (1) shows the reward at timestep n.

(1) $$ {r}_n=-\alpha {E}_n-\beta {C}_n-\gamma {Q}_n $$

$ {E}_n $ refers to the total energy consumption of all energy-consuming resources, $ {C}_n $ refers to the total comfort of all occupants, and $ {Q}_n $ is the total air quality rating experienced by all occupants in the building at timestep n. Higher values of $ {E}_n $, $ {C}_n $, and $ {Q}_n $ correspond to higher energy consumption, lower overall occupant comfort, and lower overall air quality experienced by occupants. $ \alpha $, $ \beta $, and $ \gamma $ are weights for the three objectives, allowing building managers to customize and select which objective(s) to prioritize. Next, we describe how we compute $ {E}_n $, $ {C}_n $, and $ {Q}_n $.

1. Energy consumption: $ {E}_n $ is computed in Equation (2), where $ \Delta $ is the length of one time step and $ {P}_d(t) $ is the power consumption of energy-consuming resource d (e.g., HVAC and lights). $ {E}_n $ is therefore the total energy consumed by all energy-consuming resources d in the nth time window.

(2) $$ {E}_n={\int}_{t=n\Delta}^{\left(n+1\right)\Delta}\sum \limits_d{P}_d(t) dt. $$

2. Thermal comfort: $ {C}_n $ is computed in Equation (3), where $ C\left({R}_o,t\right) $ is the comfort of occupant o at location $ {R}_o $ at time t. As such, $ {C}_n $ is the average comfort across all occupants in the building during timestep n. Measuring comfort is challenging, and we discuss how we measure comfort in our real deployments in Section 5.

(3) $$ {C}_n=\frac{\sum_{o\in O}{\int}_{t=n\Delta}^{\left(n+1\right)\Delta}C\left({R}_o,t\right) dt}{\Delta \cdot \mid O\mid } $$

3. Air quality: $ {Q}_n $ is computed as shown in Equation (4), where $ Q\left({R}_o,t\right) $ is the air quality experienced by occupant o at location $ {R}_o $ at time t. Much like thermal comfort, the reward only considers the air quality of the areas where occupants are present, because the air quality of empty rooms will not affect any of the occupants. We discuss how we measure the air quality of different rooms, locations, and spaces in Section 5.

(4) $$ {Q}_n=\frac{\sum \limits_{o\in O}{\int}_{t=n\Delta}^{\left(n+1\right)\Delta}Q\left({R}_o,t\right) dt}{\Delta \cdot \mid O\mid } $$
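Putting Equations (1)–(4) together, a discrete-time sketch of the reward computation might look as follows. This is a simplified illustration with hypothetical inputs and placeholder weights; the paper integrates over continuous measurements rather than using per-window means.

```python
def reward(power_by_device, comfort_by_occupant, aqi_by_occupant,
           alpha=1.0, beta=1.0, gamma=1.0, delta_hours=1.0):
    """Discrete-time approximation of Equations (1)-(4).

    power_by_device:     {device: mean power draw over the window, kW}
    comfort_by_occupant: {occupant: mean comfort score at their location}
    aqi_by_occupant:     {occupant: mean AQI at their location}
    alpha, beta, gamma:  the building manager's objective weights
                         (the values here are placeholders).
    """
    n_occupants = max(len(comfort_by_occupant), 1)
    E_n = sum(power_by_device.values()) * delta_hours      # Eq. (2), kWh
    C_n = sum(comfort_by_occupant.values()) / n_occupants  # Eq. (3)
    Q_n = sum(aqi_by_occupant.values()) / n_occupants      # Eq. (4)
    return -alpha * E_n - beta * C_n - gamma * Q_n         # Eq. (1)
```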

4.2. Deep Q-network for generating actionable recommendations

There are two challenges unique to building co-optimization that prevent direct application of a standard deep Q-network. The first challenge is representing occupant locations in a building. Typically, categorical data such as occupant locations are represented using one-hot encoding. However, the number of input nodes in the neural network quickly increases with the number of occupants and rooms. We address this challenge by incorporating an embedding layer (Section 4.2.1). The second challenge is the representation of the reward; there are many actions in each state and three different objectives (energy, comfort, and air quality) for each action. The network must learn the objectives for all states and actions. We address this challenge through multitask learning (Section 4.2.2).

4.2.1. Location embedding

One important observation about occupant locations is that there are hidden relationships in their encoding. Consider an occupant who spends time in three different locations: two similar office spaces and one lab space. Let us assume that the energy consumption, setpoint temperature, humidity, and air quality are similar for the two office spaces. A one-hot encoding of the locations will not capture the similarity between these spaces, which would instead have to be learned through a number of layers.

One method for learning these similarities between spaces is to use an embedding layer, which has a few key advantages. First, embedding layers reduce the input size of the model by eliminating the need for a one-hot encoding. This can reduce computation time and thus training time. Second, embedding layers have been shown to learn relationships between categories (Rong, Reference Rong2014), which can lead to improved learning. In our deep Q-network, we utilize an embedding layer to learn occupant location configurations more effectively than standard dense layers.

Instead of representing each occupant’s location as a one-hot vector, each location is assigned a unique numerical value corresponding to a row in the embedding layer. The embedding layer selects the rows corresponding to the location vector for each occupant, which are concatenated into a dense vector of size $ \mid O\mid \times d $, where d is the embedding dimension and $ \mid O\mid $ is the number of occupants. This output can be a more informative input for subsequent layers because it encodes the relationships between locations. An illustration of the embedding layer is shown in Figure 2. In our deployments, we observed that d = 3 yielded the best tradeoff between performance and computation.
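In a deep learning framework such as PyTorch, this lookup-and-concatenate step might look like the sketch below (the sizes are illustrative; only d = 3 comes from the text above).

```python
import torch
import torch.nn as nn

NUM_LOCATIONS = 10  # |S| plus an "absent" index; deployment-specific
EMBED_DIM = 3       # d = 3 gave the best performance/computation tradeoff

embedding = nn.Embedding(NUM_LOCATIONS, EMBED_DIM)

# One integer location index per occupant, for a batch of one building state
# with |O| = 4 occupants (index 0 here denotes "absent").
occupant_locations = torch.tensor([[2, 5, 5, 0]])
dense = embedding(occupant_locations)  # shape: (1, 4, 3)
flat = dense.flatten(start_dim=1)      # shape: (1, |O| * d) = (1, 12)
```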

Figure 2. The embedding layer in our network converts the physical representation of occupant locations into a feature representation through a learned embedding matrix.

4.2.2. Multi-task learning

Since there are three separate objectives (energy, comfort, and air quality), there are two options for learning. The first option is to combine energy, comfort, and air quality changes into a single reward (e.g., sum all terms as in Equation (1)) and use the return as the target for the Q-network. The second option is to learn the objectives separately and combine them at the output of the Q-network using a ranker. The advantage of the second option is that the optimization emphasis can be changed quickly without retraining the network; however, this method requires learning three times the number of outputs.

A key observation is that there are hidden relationships between energy, comfort, and air quality. As an example, when a variable air volume system turns on to supply cold air to a room, it increases energy consumption, increases airflow (which may improve air quality), and reduces the room temperature, thus affecting comfort. We can take advantage of these relationships by using multitask learning. In multitask learning, the input features are fed into a number of “shared” layers, which learn information about the state of the building and the individual locations. The output is then fed into “task-specific” layers, which are responsible for learning information specific to each objective. In our deep Q-network, we create task-specific layers for each of the energy, comfort, and air quality objectives in Equation (1). These task-specific layers learn to predict the expected energy consumption (Equation 2), comfort (Equation 3), and air quality (Equation 4) for each state-action pair, rather than the full reward that weights each objective based on the importance assigned by the building manager. In other words, the error used to train and tune the task-specific layers for the three objectives is the difference between the objective predicted by the model and the observation from our sensors and the environment. The ranker (Section 4.3.1) biases each recommendation based on the weights assigned by the building manager to produce the combined reward objective (Equation 1), which is then used to recommend actions to occupants.

4.2.3. Network architecture

The network input consists of the sparse location features of each occupant and dense features, including temperature, humidity, setpoint temperature, air quality, and energy consumption for each location. The complete network architecture is shown in Figure 3. The sparse location features are fed into the embedding matrix, and the output vectors are flattened and concatenated with the dense features. The resulting vector is fed into the shared layers and then into the individual objective task layers. The output of each task is a set of values representing the expected change in the energy consumption ( $ {E}_n^a $ ), comfort ( $ {C}_n^a $ ), and air quality ( $ {Q}_n^a $ ) objectives for each possible action, a, at time step n.
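A compact sketch of this architecture follows; the paper specifies the structure (embedding, shared layers, three task heads), while the hidden-layer sizes here are our own assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskQNetwork(nn.Module):
    """Sketch of the Figure 3 architecture; hidden sizes are our guesses."""

    def __init__(self, n_locations, n_occupants, n_dense, n_actions, d=3):
        super().__init__()
        self.embed = nn.Embedding(n_locations, d)
        self.shared = nn.Sequential(
            nn.Linear(n_occupants * d + n_dense, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        # One task-specific head per objective, one output per action.
        self.energy_head = nn.Linear(128, n_actions)   # E_n^a
        self.comfort_head = nn.Linear(128, n_actions)  # C_n^a
        self.air_head = nn.Linear(128, n_actions)      # Q_n^a

    def forward(self, locations, dense_features):
        x = self.embed(locations).flatten(start_dim=1)  # sparse -> dense
        h = self.shared(torch.cat([x, dense_features], dim=1))
        return self.energy_head(h), self.comfort_head(h), self.air_head(h)
```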

Figure 3. Our deep Q-network architecture includes an embedding layer for learning occupant locations and has three output tasks for learning energy savings, comfort and air quality improvements for each action.

4.3. Recommender system design

The upper half of Figure 1 shows the full online system, consisting of the following components: the networked sensing layer (Section 5) that senses the building state (Section 4.1.1), the deep-Q network model (Section 4.2) and the ranker (Section 4.3.1) that uses the building state to generate recommendations for occupants, and the occupant-facing web client (Section 4.3.2) where users can view and accept/reject recommendations.

4.3.1. Ranker

Once the Q-network estimates the changes in energy consumption, comfort, and air quality objectives, the ranker arranges the recommendations for each occupant to maximize the overall energy savings, comfort, and air quality improvements. To accomplish this, the ranker generates a score for each action, a, according to Equation (5).

(5) $$ {S}_n^a=-\alpha {E}_n^a-\beta {C}_n^a-\gamma {Q}_n^a $$

$ \alpha $ , $ \beta $ , and $ \gamma $ are weights that building managers can tune to prioritize certain objectives. There is a negative sign with each objective because a higher value is less desirable (Sections 4.1 and 5). This objective is identical to the reward of the system (Equation 1) and essentially allows the building manager to control and bias the system based on which objectives s/he emphasizes. In Section 7.4, we compare improvements in each objective using different weights in real deployments.

Because RECA delivers two categories of recommendations directly to users (move and setpoint), it is important to give occupants a selection from both categories (diversity). The ranker selects two recommendations from each category to display to the user. The ranker uses softmax action selection to choose the actions recommended to users at time step $ n $, after normalizing the scores, $ {S}_n^a $, of all actions, $ a\in A $, using the softmax function (Equation 6).

(6) $$ p\left(a,n\right)=\frac{\exp \frac{S_n^a}{\tau }}{\sum_{b\in A}\exp \frac{S_n^b}{\tau }} $$

$ \tau $ is a temperature parameter. Sampling recommendations this way also allows RECA to incorporate exploration. Softmax selection has one key advantage over $ \varepsilon $-greedy methods, which select the action with the best expected return with probability $ 1-\varepsilon $ (exploitation) and a random action with probability $ \varepsilon $ (exploration). During exploration, when $ \varepsilon $-greedy methods choose a random action, the actions with the worst expected return have the same probability of being chosen as the actions with the best expected return. If some actions have zero probability of being performed (such as a move recommendation to a location that is not acceptable to the occupant), choosing such an action over another whose return is only slightly below the best will slow training significantly. Softmax selection weights the probability of selecting an action during exploration by its expected return, which reduces the slowdown caused by low-return actions during training.
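The scoring and sampling steps of Equations (5) and (6) can be sketched as follows (our illustration; the max-subtraction is a standard numerical-stability trick, not something the paper specifies).

```python
import numpy as np

def rank_and_sample(E, C, Q, alpha, beta, gamma, tau=1.0, rng=None):
    """Score every action with Eq. (5), then sample one with Eq. (6).

    E, C, Q: per-action estimates from the three task heads (arrays).
    tau: softmax temperature; larger tau means more exploration.
    """
    rng = rng or np.random.default_rng()
    scores = -alpha * E - beta * C - gamma * Q     # Eq. (5)
    logits = (scores - scores.max()) / tau         # max-shift for stability
    probs = np.exp(logits) / np.exp(logits).sum()  # Eq. (6)
    return rng.choice(len(scores), p=probs)        # index of sampled action
```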

4.3.2. User feedback

To serve recommendations to occupants in real time, we developed a web interface (Figure 4) where occupants can browse a list of up-to-date recommendations and select recommendations, which are sent back to the system and stored in history as feedback. Additionally, we store the history of observed building states in a digital twin, similar to the one presented in Srinivasan et al. (Reference Srinivasan, Manohar and Issa2020). In Section 6, we discuss how we use feedback and the history of building states to address several challenges of deploying RECA in the real world.

Figure 4. The web interface for occupant feedback displays a list of recommendations. Estimated energy savings, comfort, and air quality improvements are shown with each recommendation.

5. System implementation

To co-optimize energy savings, thermal comfort, and air quality, we need to measure these quantities per space and occupant to compute the reward and observe the building state (Section 4.1). Because we engage occupants to perform this optimization (e.g., move recommendations), we also need to estimate the location of each occupant. We make these measurements through the networked sensing layer, which we discuss next.

1. Energy: We measure three types of energy consumption (Equation 2) at the room level: HVAC, lighting, and individual energy devices. To monitor HVAC, we interface with the building management system over BACnet, which provides energy consumption information for large units such as fan coil units and variable air volume units. Because most modern commercial buildings have a building management system (BMS) that can be accessed via BACnet, this step can be adapted to many existing and new buildings. For HVAC units that are not monitored by the BMS, we deploy wind sensors to estimate energy consumption from airflow and temperature, as in Balaji et al. (Reference Balaji, Xu, Nwokafor, Gupta and Agarwal2013). We also deploy light sensors and plug meters to monitor lighting energy and local devices.

2. Comfort: Comfort is more challenging to measure, as it is subjective. The standard metric for thermal comfort is the predicted mean vote (PMV) model from ASHRAE 55 (ASHRAE, 2013), where scores are generated for each occupant based purely on current environmental factors (e.g., temperature and humidity) and their own physical attributes. These scores are averaged to produce a value between –3 (cold) and 3 (hot). However, individuals may have temperature and space preferences that cannot be fully captured by the physical attributes of the individual or the building. As such, we construct personalized thermal regression models.

We integrate the thermal comfort estimation pipeline from Wei et al. (Reference Wei, Liu, Kang, Yang and Jiang2021) by deploying sensor nodes consisting of FLIR One Pro RGB-thermal cameras, Jetson Nanos, and temperature/humidity sensors in each room. We recorded thermal images of occupants over the course of two weeks to generate personalized comfort regression models. During this time, users also provided feedback, or labels, for their comfort levels, so we could correlate the observed facial temperature with their perceived comfort. This pipeline estimates thermal comfort on the same scale as PMV ASHRAE 55, but more accurately, by using facial temperature and feedback from users. The absolute value of this score is used as the overall comfort, $ C\left({R}_o,t\right) $, of each occupant o, at location $ {R}_o $, at time t in Equation (3).

Though this method is more accurate, there is considerable overhead in adapting it to each person. For larger deployments without thermal cameras, using the PMV ASHRAE 55 model and substituting measurements for a typical person, as in Wei et al. (Reference Wei, Liu, Kang, Yang and Jiang2021), can still yield promising results, as we show in Section 7.1.

3. Air quality: To measure air quality, we use the US Environmental Protection Agency’s Air Quality Index (AQI), which incorporates PM2.5 and PM10 measurements (Mintz, Reference Mintz2016). Higher values indicate more pollution; a value of less than 50 is considered healthy (a sketch of this computation appears after this list). We deploy PM2.5 and PM10 sensors at each location/room in our deployments. The air quality (Equation 4), $ Q\left({R}_o,t\right) $, experienced by each occupant $ o $ at time t is the AQI of the space $ {R}_o $ where occupant o is residing at time t.

4. Localization: The location of each occupant is critical to determining the impacts of different actions and engaging occupants in the co-optimization process. To localize occupants, we extract head bounding boxes in the RGB domain using the comfort estimation pipeline we integrated (Wei et al., Reference Wei, Liu, Kang, Yang and Jiang2021). We train a convolutional neural network based on VGG-16 to classify occupants by participant ID; the training data is hand-labeled using images taken over the course of one week. Because the number of occupants in our deployments is controlled, this solution is relatively simple to implement. Other methods, such as wireless localization, can be used in larger deployments, which we will explore in future work.
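For reference, the sketch below illustrates the EPA’s piecewise-linear AQI interpolation for a PM2.5 sub-index. The breakpoints shown are the long-standing EPA values for 24-hour PM2.5; they should be verified against the current EPA tables (the 2024 revision changed some values) before reuse.

```python
# Long-standing EPA breakpoints for the 24-h PM2.5 sub-index (ug/m3);
# verify against current EPA tables before reuse.
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]

def pm25_sub_index(concentration: float) -> float:
    """Piecewise-linear AQI interpolation used by the EPA."""
    for c_lo, c_hi, i_lo, i_hi in PM25_BREAKPOINTS:
        if c_lo <= concentration <= c_hi:
            return (i_hi - i_lo) / (c_hi - c_lo) * (concentration - c_lo) + i_lo
    return 500.0  # beyond the scale

# The reported AQI is the maximum sub-index over all measured pollutants
# (here, PM2.5 and PM10).
```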

6. Real-world considerations

There are several challenges that need to be addressed to ensure our system performs robustly in real environments. The first challenge is the lack of training data. Without a large amount of feedback describing the benefits of different actions, the recommender system will initially provide poor, random recommendations to occupants, which inhibits useful feedback. This problem, also known as cold start, can be mitigated by providing semi-realistic data. We use real data, building states, and feedback from users, as discussed in Section 4.3.2, to generate more training examples (Section 6.3).

The second challenge is accounting for how a person’s preferences can periodically change. For example, RECA may estimate the greatest reward if a user moves out of room A. However, if s/he needs to immediately perform lab work there, then s/he would not accept this recommendation (Section 6.4).

The third challenge is recommending conflicting actions to different users. For example, one occupant may choose to increase the setpoint temperature to reduce energy, while a second occupant at the same location and time may choose to decrease it (Section 6.5).

6.1. Digital twin

Digital twins provide a virtual representation of the real-time building state. Since each building behaves differently due to various factors such as climate, location, and building occupants, a building-specific digital twin is a better representation than a general model such as EnergyPlus. For example, some buildings may have natural ventilation, whereas others may have occupancy-based HVAC control; these variations in building resources require a different representation to ensure that the objectives to optimize are measured correctly. The digital twin representation is useful for three main reasons.

  1. The system can save historical data for later analysis of historical patterns. Since the data is timestamped, it can be read back as a stream to reproduce the learning process of the optimizing model.

  2. The system can provide real-time insights for building occupants and building managers. The saved data can be used to visualize the current building state, including occupants’ locations, comfort levels, and environmental status such as air quality and energy.

  3. The digital twin’s output can be used as input to intelligent systems such as a recommender system. Because the digital twin is based on collected real data, it can also be used for data augmentation for the learning algorithm.

Because of these advantages, we create a digital twin for our deployments (Section 7), allowing us to save historical building states and leverage these states to perform data augmentation and alleviate the cold start problem (Section 6.3).

Our system has three optimization goals: energy, comfort, and air quality. Thus, in addition to monitoring the location of occupants, the digital twin should be able to accurately measure real-time energy consumption, occupant thermal comfort, and air quality. These are measured by the networked sensing layer (implementation discussed in Section 5).

To efficiently store and update the sensed energy, air quality, comfort, and occupant locations in a digital twin representation, we first model the building as a tripartite graph structure, as in Wei et al. (Reference Wei, Chen, Vega, Xia, Chandrasekaran and Jiang2017a). This tripartite data structure separates the energy-consuming resources, physical spaces, and occupants into three object layers, where energy-consuming resources are connected to the spaces they service, and occupants are connected to the spaces they occupy, as shown in Figure 5. The objects in each layer are a one-to-one representation of elements in the physical deployment. This enables fast updates of individual object parameters when new data is received from the sensors, such as a fan coil unit power value or a space temperature and humidity value. Additionally, as described by Wei et al. (Reference Wei, Chen, Vega, Xia, Chandrasekaran and Jiang2018a), updates to the graph can be quickly propagated to related elements, either immediately or on demand. As an example, a certain location may receive a new temperature value; this value can be propagated to the connected occupant objects to immediately update their estimated thermal comfort.
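A minimal sketch of such a tripartite graph and its update propagation, using networkx, is shown below; the node names, attributes, and placeholder comfort model are all our own illustrative assumptions.

```python
import networkx as nx

def estimate_comfort(occupant, temp_f):
    """Placeholder comfort model; the deployment uses personalized models."""
    return abs(temp_f - 72.0) / 10.0

# Three object layers: energy-consuming resources, spaces, and occupants.
twin = nx.Graph()
twin.add_node("fcu_1", layer="resource", power_kw=0.0)
twin.add_node("lab_a", layer="space", temp_f=72.0, setpoint_f=72.0, pm25=8.0)
twin.add_node("alice", layer="occupant", comfort=0.0)

# Resources connect to the spaces they service; occupants to spaces they occupy.
twin.add_edge("fcu_1", "lab_a")
twin.add_edge("alice", "lab_a")

def on_temperature_update(g, space, new_temp):
    """Propagate a new space temperature to connected occupant objects."""
    g.nodes[space]["temp_f"] = new_temp
    for neighbor in g.neighbors(space):
        if g.nodes[neighbor]["layer"] == "occupant":
            g.nodes[neighbor]["comfort"] = estimate_comfort(neighbor, new_temp)
```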

Figure 5. The digital twin is constructed using a tripartite graph structure, from Wei et al. (Reference Wei, Chen, Vega, Xia, Chandrasekaran and Jiang2017a) and Chen et al. (Reference Chen, Norford, Samuelson and Malkawi2018), to store relations between energy resources, spaces, and occupants. The building state can be extracted from the graph structure to build visualizations.

A real-time building state can be quickly extracted from the tripartite graph structure, which enables three important applications. First, different visualization tools can be built on the building state to alert building managers of potential improvements. The inclusion of personalized thermal comfort can provide additional information to inform of potential thermal comfort improvements. Second, the building state can be recorded in a database to build datasets for analyzing historical patterns and training prediction models. Finally, the real-time building state can be used as input to complex systems such as a recommender system. In this work, the digital twin is utilized for historical data to train simulation environment prediction models and to provide the building state as a real-time input to our recommender system.

6.2. Building modeling

Leveraging the digital twin, we can collect data on the current state of the building. From the building’s data streams, we collect the following features per room: HVAC energy consumption, lighting energy consumption, temperature, humidity, PM10, PM2.5, and room temperature setpoint. We also ascribe a room location to each building occupant, including a null indicator if the occupant is absent. These features encompass the building’s state in our modeling framework. Historical data on the building’s state can be used to train statistical models to predict future states for each of the rooms in the building. The key features that vary according to occupant actions, and therefore constitute our recommendations, are the room temperature setpoint and the occupants’ locations. These two features ultimately impact the building’s energy consumption as well as the occupants’ experiences of thermal comfort and air quality.

6.3. Data augmentation for exploring more building states and actions

From our digital twin, we develop a simulation environment that enables the creation of a large number of potential future states, which is used to evaluate recommendations and train our recommender system, as shown in Figure 6. The simulation environment leverages statistical models to simulate future building states based on occupant actions and the input building state, collected from our deployments and digital twin. We first discuss how we simulate and predict the key modules: HVAC energy, lighting energy, indoor temperature, air quality, and comfort (Section 6.3.1); the reward is computed as in Equation (1) (Section 4.1). Next, we introduce how we incorporate this simulator into training our deep Q-network (Section 6.3.2).

Figure 6. To alleviate cold start, we create a simulation environment based on the digital twin, which takes as input the current building state and simulates the next state, energy savings, comfort improvement, and air quality improvement based on an action.

6.3.1. Predicting future states

1. HVAC energy: Predicting HVAC energy is challenging due to nonlinearities associated with HVAC control and outdoor environmental conditions. As discussed in Section 2, physics-based models such as EnergyPlus have limited ability to predict HVAC energy, especially when detailed design documentation is unavailable and granular temporal scales (i.e., hourly) are required, both of which apply in our setting. Data-driven surrogate energy modeling is a promising framework for addressing these limitations. However, these models are typically applied at the building level rather than the room level. Here, we introduce data-driven surrogate HVAC energy modeling at the room level to support our simulation engine.

We identified three regression models that have been shown to be effective in energy prediction tasks: artificial neural networks, random forests, and gradient boosting (Sun et al., Reference Sun, Haghighat and Fung2020). In this task, the target is room-level HVAC energy consumption and the features are as follows (aggregated to the hourly level):

  • Number of occupants

  • Outdoor temperature ( $ {T}_0 $ ) and humidity

  • Heating Degree Days ( $ HDD=65-{T}_0 $ in °F)

  • Cooling Degree Days ( $ CDD={T}_0-65 $ in °F)

  • Temporal features: day of the year, day of the week, hour of the day

  • Historical HVAC energy data (48-h sliding window)

Energy prediction in this setting resembles time-series forecasting with exogenous features. Because historical energy data is available from the digital twin, we can also adapt the model to include historical HVAC data as a feature. For example, if we are only interested in one-step-ahead prediction, we can include all historical data up until the next time step as features.

We used a temporal 80–20% split to create training and testing sets. As a state-of-the-art comparison, we also built an EnergyPlus model based on historical building state data collected from our deployments and performed standard calibration measures, following the procedure in Miller et al. (Reference Miller, Thomas, Irigoyen, Hersberger, Nagy, Rossi and Schlueter2014). We used the coefficient of variation of the root mean squared error (CV(RMSE)) to compare the models:

(7) $$ CV(RMSE)=\frac{1}{\overline{Y}}\sqrt{\frac{\sum_{i=1}^N{\left({Y}_i-{\hat{Y}}_i\right)}^2}{N}} $$

where $ {Y}_i $ is the true energy in the test set, $ {\hat{Y}}_i $ is the predicted energy, and $ \overline{Y} $ is the average of the true energy over the N predictions. CV(RMSE) is commonly used to assess energy prediction performance in buildings, where a value of less than 30% indicates a well-calibrated model (ASHRAE, 2002). We found that the random forest model produced the best results, with an average CV(RMSE) of 28.6% across the rooms, compared to EnergyPlus’s 94.3%. We therefore implemented the random forest model for 1-h-ahead energy prediction. A comparison between the random forest prediction and the EnergyPlus prediction for one of the rooms in our testbed is shown in Figure 7.

Figure 7. Comparison of EnergyPlus and random forest (RF) prediction for a single room in our deployments.

Through the data-driven surrogate modeling paradigm, we were able to build a simulation engine that is far more accurate than the standard tools used for HVAC energy simulation. In addition to the random forest model, we also incorporated an HVAC modifier based on manual thermostat control in the building. Each room in our study and deployments has a fine-grained manual control option, whereby occupants can make adjustments from neutral to “cool” or “warm.” We investigated the relationship between manual thermostat changes and HVAC energy and found that thermostat changes tend to increase the HVAC energy consumption of the room. Over the course of our study, manually applying the “cool” setting tends to increase energy consumption more than manually applying the “warm” setting. We also incorporated these modifiers into our overall simulation engine.
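A sketch of this room-level surrogate and the CV(RMSE) metric of Equation (7), using scikit-learn, is shown below; the function names and hyperparameters are our assumptions, and the feature matrix follows the list in Section 6.3.1.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def cv_rmse(y_true, y_pred):
    """Coefficient of variation of the RMSE, Equation (7)."""
    rmse = np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
    return rmse / np.mean(y_true)

def fit_room_surrogate(X, y, split=0.8):
    """Fit a room-level HVAC energy surrogate on a temporal split.

    X: hourly feature rows (occupant count, outdoor temperature and
    humidity, HDD/CDD, temporal features, 48-h lagged HVAC energy);
    y: hourly room HVAC energy. Hyperparameters are our assumptions.
    """
    n_train = int(len(X) * split)  # test period follows the training period
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X[:n_train], y[:n_train])
    score = cv_rmse(y[n_train:], model.predict(X[n_train:]))  # <0.30: well calibrated
    return model, score
```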

2. Lighting energy: Based on the data from our collected building state history, we assume that lighting operation directly follows occupancy patterns: lights are on whenever a room has at least one occupant and off whenever a room is unoccupied, since most modern office lights are controlled with motion sensors. We note that there are situations in which this direct control is not realized, for example, when occupants override the system or when motion sensors experience errors. For our modeling purposes, we leveraged the data from the building’s digital twin to identify the “on” and “off” states of each room’s lighting fixtures. Based on the data, we were able to assign a specific power consumption to each occupied room in our building model.

3. Indoor temperature: We model the relationship between thermostat settings and indoor temperature from data collected from our networked sensing deployment. Individual rooms in the building include thermostats that give occupants the option to change the temperature. To investigate the empirical relationship between these settings and actual indoor temperature, we built linear regression models for each room that include an indicator variable for each of the possible thermostat settings at each timestep as features and the actual temperature at each timestep as the response variable. In our deployments, the thermostats had three possible settings, “warm,” “cool,” and “neutral”. We found that, on average, setting the thermostat to “warm” increases temperatures by 1.88°F and setting the thermostat to “cool” decreases temperatures by 1.80°F.
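The per-room regression just described can be sketched as follows; this is a simplified illustration with hypothetical names, in which the fitted coefficients play the role of the average +1.88°F and –1.80°F effects reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Dummy-code the three thermostat settings, with "neutral" as the baseline.
SETTING_CODES = {"neutral": (0, 0), "warm": (1, 0), "cool": (0, 1)}

def fit_setpoint_effects(settings, temperatures):
    """Per-room regression of indoor temperature on the thermostat setting.

    settings:     thermostat setting at each timestep ("neutral"/"warm"/"cool")
    temperatures: measured indoor temperature (deg F) at each timestep
    Returns the average temperature shifts of "warm" and "cool" relative
    to "neutral".
    """
    X = np.array([SETTING_CODES[s] for s in settings])
    y = np.array(temperatures)
    model = LinearRegression().fit(X, y)
    warm_effect, cool_effect = model.coef_
    return warm_effect, cool_effect
```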

4. Air quality: Our air quality prediction task closely follows that of indoor temperature. We would expect a change in the thermostat setting to increase airflow in the room, because the heating and cooling system would need to supply additional air to affect air temperature. We would also expect the additional air to be cleaner, due to the filters in the HVAC system, and would therefore expect such thermostat actions to decrease PM concentrations. However, these dynamics were not clearly evident in the data: the regression models did not produce significant relationships for most rooms. For each room, we instead included small factors based on the direction of the relationship between HVAC energy and PM concentrations.

5. Comfort: In real deployments, we measure thermal comfort more accurately on the PMV ASHRAE 55 scale using thermal cameras and indoor temperature (Section 5). Because it is difficult to predict future temperatures at each pixel of each thermal camera, we instead simulate the comfort levels of each occupant by directly using PMV ASHRAE 55 and substituting indoor temperature and standard values, as in Wei et al. (Reference Wei, Liu, Kang, Yang and Jiang2021).

6.3.2. Training

To train the system and remedy the cold start problem, we create an offline training environment that integrates our simulator with the deep Q-network in a tightly coupled control loop, which allows for rapid data generation and learning. An illustration of the training environment is shown in Figure 8.

Figure 8. For a building state, the reinforcement learning agent provides an action to the simulation environment. The next state, energy savings, comfort, and air quality improvements are returned to the agent to tune the policy.

Training runs in episodes with fixed $ \Delta =1 $ h time steps. In each episode, the simulation environment is instantiated with occupant locations and building states drawn from historical data. The state s is provided to the deep Q-network, which outputs predicted changes in energy, comfort, and air quality for each possible action from that state. An action is chosen via softmax selection (Equation 6), mirroring how recommendations are ranked for display.

The action is sent to the simulation environment (Section 6.3) and performed in simulation, and the episode advances to the next step, producing a new building state (next state $ {s}^{\prime } $ ) that is sent back to the deep Q-network. Expected rewards for energy, comfort, and air quality are estimated and stored with the state in a pool of samples for experience replay (sketched below).
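This loop can be summarized in a few lines. The sketch below is a minimal rendition under stated assumptions: `sim` and `qnet` are placeholder interfaces (`reset`, `step`, `predict`, `update`), not RECA’s actual API, and the softmax temperature and batch size are illustrative.

```python
import random
from collections import deque

import numpy as np

replay = deque(maxlen=100_000)  # experience replay pool

def softmax_select(scores: np.ndarray, tau: float = 1.0) -> int:
    """Softmax action selection (Equation 6), mirroring recommendation display."""
    logits = scores / tau
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.random.choice(len(scores), p=probs))

def run_episode(sim, qnet, alpha, beta, gamma, steps=24):
    state = sim.reset()  # occupant locations + building state from history
    for _ in range(steps):
        # Predicted per-action changes in energy, comfort, and air quality.
        e, c, q = qnet.predict(state)
        action = softmax_select(alpha * e + beta * c + gamma * q)
        next_state, rewards = sim.step(action)  # simulated (E, C, Q) deltas
        replay.append((state, action, rewards, next_state))
        state = next_state
        if len(replay) >= 64:
            # Update the multitask deep Q-network from replayed samples.
            qnet.update(random.sample(replay, 64))
```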

6.4. Adapting to user preferences

First, it is not productive to suggest actions that an occupant has consistently rejected. To address this, we estimate the probability that an occupant will accept a recommendation by counting the recommendations that the occupant has rejected (“feedback” as described in Section 4.3.2), and we add a term to the ranker scoring function (Equation 6) that penalizes recommendations with more rejections.
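One plausible form of this penalty is a multiplicative down-weighting of the ranker score; the exponential decay and rate `lam` below are illustrative choices on our part, not necessarily the exact term used in Equation 6.

```python
import math
from collections import Counter

rejections = Counter()  # (occupant, action_type) -> number of rejected recs

def record_feedback(occupant: str, action_type: str, accepted: bool) -> None:
    if not accepted:
        rejections[(occupant, action_type)] += 1

def acceptance_weight(occupant: str, action_type: str, lam: float = 0.5) -> float:
    # More past rejections -> lower weight on this recommendation's score.
    return math.exp(-lam * rejections[(occupant, action_type)])
```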

Second, a person’s preferences may change over time. We therefore allow building managers to initiate retraining of the recommender system after an interval of time. Once started, RECA automatically augments the observed data stored in its history using our simulation environment (Section 6.3) and retrains the deep Q-network.

6.5. Conflicting actions

In real deployments, a recommender system serves multiple occupants concurrently with real-time recommendations. This poses two problems. First, different occupants may choose conflicting recommendations, which may cause suboptimal or even negative results; for example, one occupant may choose to increase the setpoint temperature to reduce energy, while a second occupant in the same location may choose to decrease it to increase comfort. Second, once an occupant has chosen a type of recommendation, there should be a period of time before that occupant is shown the same type of recommendation again: receiving a move recommendation soon after selecting a different move recommendation is a poor user experience and can counteract objective improvements.

To address these challenges, once an occupant at a location accepts a move or setpoint recommendation, we temporarily withhold further move and setpoint recommendations from that occupant for a period $ {t}_s=1 $ h. Additionally, once an occupant selects a setpoint recommendation for a location, no other occupants are given setpoint recommendations for that location for the remainder of $ {t}_s $ (a minimal sketch of this cooldown logic follows).
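The bookkeeping can be implemented with two timestamp maps, assuming a simple in-memory store; all names below are hypothetical.

```python
from datetime import datetime, timedelta

COOLDOWN = timedelta(hours=1)   # t_s = 1 h
last_accept_by_occupant = {}    # occupant -> time a move/setpoint rec was accepted
setpoint_lock_by_location = {}  # location -> time a setpoint rec was accepted there

def on_accept(occupant, location, rec_type, now=None):
    now = now or datetime.now()
    last_accept_by_occupant[occupant] = now
    if rec_type == "setpoint":
        setpoint_lock_by_location[location] = now

def is_allowed(occupant, location, rec_type, now=None):
    now = now or datetime.now()
    if now - last_accept_by_occupant.get(occupant, datetime.min) < COOLDOWN:
        return False  # occupant recently accepted a move/setpoint rec
    if rec_type == "setpoint":
        lock = setpoint_lock_by_location.get(location, datetime.min)
        if now - lock < COOLDOWN:
            return False  # this location's setpoint was recently set by someone
    return True
```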

7. Evaluation

We evaluate our system in two parts. First, we evaluate our recommender system in our simulation environment (Section 6.3) with different objective weights to study its ability to learn energy, comfort, and air quality rewards for different actions. Next, we deploy our system in two commercial office buildings and conduct A/B tests over the course of four weeks to evaluate the improvements it can achieve in real settings.

We compare against two baseline strategies. The setpoint-only strategy changes the setpoint temperature at each location to improve energy savings (energy emphasis), comfort (comfort emphasis), or air quality (air quality emphasis); it uses the same pipeline as RECA, except that move recommendations are removed. The service strategy only relaxes HVAC services in locations that have no occupants, based on Balaji et al. (2013).

7.1. Evaluation using simulation environment

Because the simulation environment is responsible for augmenting the data used to train and retrain the deep Q-networks, it is critical to evaluate the models’ performance in simulation before deploying them in the real world. We tested our recommender system with different objective weights (Equation 1) on simulated building episodes based on the building environments of our actual deployments, described in Section 7.2.

Evaluating these models in simulation involves simulating two states: one in which an action was performed and one with no action. We compare the energy consumption, thermal comfort, and air quality of occupants between the two states. We calculate energy savings $ {E}_R $ as the difference in energy consumption between the two states over the nth timestep of length $ \Delta $ , where $ {P}_d\left(\cdot \right) $ and $ {\hat{P}}_d\left(\cdot \right) $ denote the power of resource d in the baseline state and in the state with the action (Equation 8).

(8) $$ {E}_R^n={\int}_{t=n\Delta}^{\left(n+1\right)\Delta}\sum \limits_d\left({P}_d(t)-{\hat{P}}_d(t)\right) dt. $$

We calculate thermal comfort and air quality improvements by averaging the PMV $ \left( PMV\left(\cdot \right)\right) $ and air quality ratings ( $ AQ\left(\cdot \right) $ ) over each occupant $ o\in O $ in room $ {R}_o $ (Equations 9 and 10).

(9) $$ {C}_R^n=\frac{\sum_{o\in O}{\int}_{t=n\Delta}^{\left(n+1\right)\Delta}\left(\hat{PMV}\left({R}_o,t\right)- PMV\left({R}_o,t\right)\right)\hskip0.1em dt}{\Delta \cdot \mid O\mid } $$
(10) $$ {Q}_R^n=\frac{\sum_{o\in O}{\int}_{t=n\Delta}^{\left(n+1\right)\Delta}\left(\hat{AQ}\left({R}_o,t\right)- AQ\left({R}_o,t\right)\right)\hskip0.1em dt}{\Delta \cdot \mid O\mid } $$
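For clarity, the snippet below transcribes Equations (8)–(10) for discretely sampled traces; the array shapes, units, and sampling interval are assumptions of this sketch rather than RECA’s exact implementation.

```python
import numpy as np

def energy_savings(p_base, p_action, dt_h):
    """Equation (8): integrate the per-resource power gap over one timestep.

    p_base, p_action: (resources, samples) power traces in kW for the
    baseline state and the state with the action, sampled every dt_h hours.
    Returns savings in kWh.
    """
    return float(np.sum(p_base - p_action) * dt_h)

def mean_improvement(base, action, dt_h, delta_h=1.0):
    """Equations (9)/(10): per-occupant average improvement over a timestep.

    base, action: (occupants, samples) PMV or AQ traces evaluated at each
    occupant's room; the result is normalized by Delta * |O| as in the paper.
    """
    n_occ = base.shape[0]
    return float(np.sum(action - base) * dt_h / (delta_h * n_occ))
```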

We simulated the performance of different models on 10 000 episodes with semi-randomized occupant start locations. As shown in Table 1, we simulated episodes using the deep Q-network architecture with and without the embedding layer, setting the weights in Equation (1) for four different emphases: energy $ \left(\alpha \gg \beta =\gamma \right), $ comfort $ \left(\beta \gg \alpha =\gamma \right), $ air quality $ \left(\gamma \gg \alpha =\beta \right) $ , and joint optimization $ \left(\alpha =\beta =\gamma \right). $ The architecture without the embedding layer performed worse for every emphasis and would often recommend actions that negatively impact the objectives. For example, under the comfort emphasis, RECA with the embedding improved average PMV by 0.31 on a 3-point scale, compared to 0.03 without it. Although the energy savings achieved under the comfort emphasis without the embedding (22.2 kWh) exceeded the savings with it (14.9 kWh), the system was tuned to prioritize comfort improvements at the expense of the other objectives. Similar trends hold for the other emphases, while the joint emphasis achieves a more balanced improvement across all three objectives because it weights them equally.

Table 1. Comparison of our deep Q-network architecture with and without an embedding layer, against existing strategies, on simulated building episodes with four different weighting combinations to emphasize different optimizations

Additionally, the service strategy can only improve energy savings because it only turns down services in unoccupied areas. Our system also outperforms the setpoint-only strategy across all emphases because it can not only change setpoints but also recommend more complex action sequences, such as grouping occupants with similar temperature preferences together.

7.2. Recommender system study

We deployed RECA in two commercial buildings over four weeks to evaluate the energy savings, comfort improvement, and air quality improvement of different strategies and recommendation policies. In these deployments, RECA ran on $ \Delta =1 $ h time steps. Building A is an office building consisting of 10 rooms, of which seven are cubicle rooms, three are lab areas, and one is a break room. Building B is another office building consisting of eight rooms: four open work areas, two closed office spaces, and two shared spaces. We recruited 10 occupants in Building A and 13 occupants in Building B, ranging in age from 22 to 40 and drawn from various academic disciplines, and collected recommendation feedback from participants over the course of the four weeks. We obtained approval from Columbia University’s Institutional Review Board for all our deployments.

Since our deployment can only measure one building state sequence, we use the sensed building state to measure the performance of RECA and simulate the baseline energy consumption, comfort, and air quality using the tools from our simulation environment (Section 6.3). Once a user takes a recommendation, we can no longer observe the state of the building in the counterfactual scenario where the occupant does not perform the action, which is required to measure improvements in energy consumption, comfort, and air quality. We therefore simulate, as the baseline, scenarios in which users accept none of the recommendations. In contrast to the pure simulation study, the changes in energy consumption, comfort, and air quality produced by RECA are measured directly in our real deployments.

Figure 9 shows the number of accepted recommendations, broken down by recommendation type (move and setpoint). We scheduled RECA to retrain itself after two weeks (Section 6.4), using data from the first two weeks; the control period refers to the first two weeks and the adapted period to the latter two weeks after retraining. There is a dramatic increase in acceptances across both recommendation types in the adapted period because retraining allowed RECA to learn recommendations that users are more likely to accept. Across both deployments, we saw an 80% increase in accepted recommendations during the adapted period.

Figure 9. Acceptance rate during the first two weeks (control) and second two weeks after retraining (adapted) for the setpoint (left) and move (right) recommendations.

7.3. Learned action sequences

Throughout our study, we noticed and categorized three types of regularly occurring action sequences recommended by RECA that had greater effects than individual actions: location optimization, group consolidation, and group disbanding.

Location optimization: The most common complex action sequence that we observed is location optimization, typically made up of a setpoint change and a move recommendation. Its main purpose is, in most cases, to reduce the energy consumption of the starting location without incurring a thermal comfort or air quality penalty, by moving the occupant to another location. An example of this action sequence is shown in Figure 10. Initially, a single occupant is working in location A. At a certain time, indicated by the red line, the occupant increases the setpoint temperature of location A and moves to location B. The normal consequence of increasing the setpoint is a reduction in thermal comfort as the temperature gradually rises; moving to location B instead increases thermal comfort because the environment there is closer to the occupant’s thermal preference, as shown in the bottom left plot. The green shading highlights the improvement in the occupant’s measured PMV after the move (lower value = more comfortable). Increasing the setpoint temperature in location A (upper left) reduces the energy consumption of the space, shown in green. After the occupant moves to location B, the lighting and electricity needed to service location B increase, shown in red (upper right); however, this increase is more than offset by the savings in location A. The air quality experienced by the occupant remains relatively stable (lower right).

Figure 10. Location optimization: At the red line, the occupant moves to location B, and HVAC service is reduced in location A. Due to differences in environment, the occupant’s thermal comfort and air quality are improved.

Group consolidation: Action sequences involving multiple occupants enable even more optimization opportunities. In group consolidation, several occupants are brought to the same location, often with a reduction in energy consumption in the start locations. As illustrated in Figure 11, occupants 1 and 2 are in locations A and B, respectively. At different times, denoted by the blue and red lines, occupants 1 and 2 increase the setpoint temperatures of locations A and B and subsequently move to location C. By increasing setpoint temperatures, locations A and B both experience a decrease in energy consumption, as highlighted in green in the upper plots. The thermal comfort and air quality of both occupants change depending on the environmental differences among locations A–C. In this example, the temperature of location C suits occupant 1 better than occupant 2; accordingly, the thermal comfort of occupant 1 improves (middle left plot, highlighted in green) while that of occupant 2 decreases (middle right plot, highlighted in red). The air quality for both occupants improves in location C compared to locations A and B (lower plots, highlighted in green). Note that in both location optimization and group consolidation, the destination location is critical: if it is too hot or too cold, the action sequence may lead to an overall decrease in thermal comfort.

Figure 11. Group consolidation: At the blue and red lines, occupants 1 and 2 move to location C (not shown). Locations A and B reduce HVAC and lighting service, leading to energy savings. Comfort and air quality for both occupants change due to environmental differences.

Group disbanding: The final category of action sequence that we observed is group disbanding. The primary challenge in optimizing thermal comfort specifically is that each occupant has a different thermal preference; two occupants with different preferences in the same location force a compromise in setpoint temperature to prevent significant discomfort for one or both. Group disbanding resolves this by separating occupants into different locations and grouping occupants with similar thermal preferences together. In Figure 12, occupants 1 and 2 are in location A and have different thermal preferences. At the red line, occupant 2 is recommended to move to location B, and occupant 2’s thermal comfort and air quality improve as a result of the change in location (middle and lower right plots, highlighted in green). Note that after occupant 2 leaves, occupant 1 raises the setpoint temperature at the blue line, leading to reduced energy consumption (upper left, highlighted in green) and improved thermal comfort (middle left, highlighted in green), with a slight decrease in air quality (lower left, highlighted in red).

Figure 12. Group disbanding: At the red line, occupant 2 moves from location A to location B. Since only occupant 1 remains, HVAC service is reduced in location A at the blue line, leading to a comfort improvement for occupant 1.

7.4. Joint optimization results

In this section, we show the strengths of RECA over existing solutions in two real commercial buildings. Specifically, we show the versatility of RECA, which allows building managers to configure it to improve any combination of energy, comfort, and air quality. We also demonstrate how incorporating humans-in-the-loop allows RECA to improve these objectives beyond existing solutions that only reduce services and change temperature setpoints.

Case 1: Energy optimization: One of the most studied problems in commercial buildings is energy optimization. To study the performance of RECA on this task, we select ranker weights that emphasize energy savings ( $ \alpha \gg \beta, \gamma $ ).

Figure 13—top shows the percentage improvements in energy consumption, occupant thermal comfort, and occupant air quality. Since the baseline service strategy only reduces service in locations with no occupants, it improves energy consumption alone. The setpoint-only strategy further improves energy consumption but reduces occupant comfort: increasing the setpoint temperature reduces HVAC service but also raises the indoor temperature. Finally, by including move recommendations, our system achieves a further 8% and 6% increase in energy savings without sacrificing comfort in buildings A and B, respectively. As described in Section 7.3, moving occupants enables locations with high energy requirements to reduce service without incurring thermal comfort penalties. However, we also observe a decrease in air quality in building A, for two reasons. First, moving more people into the same room concentrates emissions, and the system reduces HVAC service in more rooms, so less air is filtered. Second, in this scenario the building manager configured the system to maximize energy savings without regard to other factors, so it reduced as much energy as possible, even at the cost of the other objectives.

Figure 13. Energy savings, comfort, and air quality improvements, emphasizing energy savings (top), balanced improvements (middle), and comfort and air quality (bottom) in two deployments (A and B).

Case 2: Joint co-optimization: In some cases, building managers may wish to save energy without sacrificing comfort or air quality. Joint co-optimization is a more complex problem, as many actions trade off the three objectives against one another. To evaluate the potential of our recommender system to jointly optimize all three objectives at once, we set the ranker weights to balance them ( $ \alpha =\beta =\gamma $ ). As a comparison, we also deployed the setpoint and service strategy, which combines the setpoint-only strategy (comfort emphasis) with the service strategy.

As shown in Figure 13—middle, the setpoint and service strategy improves thermal comfort and energy consumption while minimally impacting air quality. However, the complete recommender system makes additional improvements of 5%, 7%, and 6% in building A, and 9%, 9%, and 8% in building B, in energy consumption, thermal comfort, and air quality, respectively. With move recommendations enabled, the recommender system can exploit more complex action sequences to find significant optimization opportunities that are not possible under the setpoint-only strategy. Compared with Case 1, where a building manager may only be concerned with energy savings, our system can jointly improve all three objectives at once by incorporating humans in the loop.

Case 3: Comfort and air quality co-optimization: In certain cases, it may be desirable to allow increases in energy consumption in order to significantly improve occupant comfort and air quality. Since the COVID-19 pandemic, improving air quality has been a priority in work environments, while improving occupant thermal comfort can improve productivity. We deployed our recommender system with higher comfort and air quality weights ( $ \beta =\gamma \gg \alpha $ ) to encourage recommendations of actions with high thermal comfort and air quality improvements. As a comparison, we deployed the setpoint-only strategy with air quality emphasis for A/B testing.

As shown in Figure 13—bottom, the setpoint-only baseline is able to prioritize comfort and air quality improvements at the cost of increased energy consumption. However, because this baseline treats occupants as immovable, many of its setpoint changes were compromises among multiple occupants. In contrast, the recommender system with move recommendations shows a dramatic increase in comfort and air quality improvements: 21% and 5% for building A and 11% and 4% for building B. We observed multiple instances of group disbanding, which allows for more personalized thermal comfort by separating groups of people with different thermal preferences.

7.5. Scalability

The size of the network scales as $ O\left(|O|\times |S|\right) $ , where $ \mid O\mid $ is the number of occupants we engage and $ \mid S\mid $ is the number of locations in the building. Although this is linear in both occupants and locations, it can become expensive for large buildings. Table 2 shows the execution time and scalability of RECA as the number of occupants and spaces increases. The majority of the computation comes from the DQN; however, even as the number of occupants and spaces grows beyond 800 and 400, respectively, the total execution time is around 15 s. This latency is more than acceptable for our system, which updates once every hour.

Table 2. Execution time for each component of RECA as the number of people and spaces increase

In even larger deployments where computation time may exceed one time step, we can exploit the observation that people typically use only a small portion of the building. We can therefore reduce the state-action space by eliminating infeasible actions and/or creating multi-agent systems with smaller models that each manage a portion of the building (a rough sketch follows). We plan to explore these avenues in future work.
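As a rough illustration of both points, the sketch below sizes the occupant-location input (assuming a one-hot location encoding per occupant, consistent with the $ O\left(|O|\times |S|\right) $ scaling) and prunes infeasible move actions before scoring. The encoding and the pruning criterion are our assumptions, not RECA’s exact design.

```python
def input_size(num_occupants: int, num_spaces: int) -> int:
    # One-hot location per occupant => |O| x |S| inputs (our assumption here).
    return num_occupants * num_spaces

print(input_size(800, 400))  # 320000 inputs at the largest scale in Table 2

def feasible_actions(candidate_actions, max_walk_m=50):
    """Keep setpoint actions, and move actions only to nearby spaces.

    Each action is a dict with a precomputed 'distance_m' for moves; the
    50 m threshold is hypothetical, and any domain rule (access rights,
    floor, schedule) could be used to prune instead.
    """
    return [a for a in candidate_actions
            if a["kind"] == "setpoint"
            or (a["kind"] == "move" and a["distance_m"] <= max_walk_m)]
```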

7.6. Discussion

Our deployments were in office buildings with large amounts of public area (e.g., break rooms, recreational spaces, or shared lab space) or coworking space. In more traditional settings where employees are assigned a specific desk or office, the number of feasible actions an occupant would realistically take is smaller; an occupant would likely not accept a recommendation to go to his/her boss’s office without a work-related reason. However, as more workspaces transition to non-fixed, more flexible, and less traditional coworking arrangements, the number of feasible actions occupants may accept will greatly increase, leading to much greater potential savings.

Moreover, while we explored three important aspects of built environments, there are numerous other objectives and recommendations that a building could consider, such as actions that incorporate each person’s schedule (e.g., time and location of meetings), physical health (e.g., suggestions for when to take a walk), or mental health (e.g., suggesting when to go outside).

8. Conclusion

We present RECA, a recommender system that generates real-time, human-centric, actionable recommendations for the joint optimization of energy, comfort, and air quality in commercial buildings. Our recommender system consists of a novel multitask learning-based deep Q-network that jointly learns the energy, comfort, and air quality improvement potential of different actions and is tunable, allowing building managers to emphasize different dimensions. We conducted a 4-week study in two real office buildings and demonstrated that the system achieves greater energy savings, comfort improvements, and air quality improvements than prior works by incorporating occupants in the co-optimization process. Our multitask learning-based recommender system enables flexibility in optimization goals and discovers impactful actions that engage occupants in creating more energy-efficient, comfortable, and healthier built environments. We envision RECA being integrated with mobile sensing and actuation platforms (e.g., Xiaa et al., 2021) to achieve further savings in future smart buildings.

Data availability statement

Code and data will be made available upon request.

Author contribution

S.X.: conceptualization (equal), methodology (equal), validation (lead), formal analysis (equal), investigation (lead), writing—original draft (equal), writing—review and editing (lead), supervision (supporting). P.W.: conceptualization (equal), methodology (equal), validation (supporting), formal analysis (equal), investigation (supporting), writing—original draft (equal). Y.L.: formal analysis (supporting), methodology (supporting), investigation (supporting). A.S.: conceptualization (supporting), formal analysis (supporting), methodology (supporting), investigation (supporting), writing—original draft (supporting), writing—review and editing (supporting). X.J.: conceptualization (equal), supervision (lead), funding acquisition (lead), writing—original draft (supporting), writing—review and editing (supporting).

Funding statement

This research was partially funded by the National Science Foundation under Grant Number CNS-1943396. The views and conclusions contained here are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Columbia University, NSF, or the US Government or any of its agencies.

Competing interest

The authors confirm that no competing interests exist.

Ethical standards

The research meets all ethical guidelines, including adherence to the legal requirements of the study country.

Footnotes

1 Parts of this work have been previously published in the 10th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation (BuildSys 2023) (Xia et al., 2023).

References

An, Y, Xia, T, You, R, Lai, D, Liu, J and Chen, C (2021) A reinforcement learning approach for control of window behavior to reduce indoor PM2.5 concentrations in naturally ventilated buildings. Building and Environment, 200, 107978.
ASHRAE (2002) ASHRAE Guideline 14-2002: Measurement of energy and demand savings. Atlanta, GA: American Society of Heating, Refrigerating and Air-Conditioning Engineers.
ASHRAE (2013) UFAD guide: Design, construction and operation of underfloor air distribution systems. Atlanta, GA: American Society of Heating, Refrigerating and Air-Conditioning Engineers.
Balaji, B, Xu, J, Nwokafor, A, Gupta, R and Agarwal, Y (2013) Sentinel: Occupancy based HVAC actuation using existing WiFi infrastructure within commercial buildings. In Proceedings of the 11th ACM Conference on Embedded Networked Sensor Systems. New York, NY, USA: ACM, pp. 1–14.
Benedetti, M, Cesarotti, V, Introna, V and Serranti, J (2016) Energy consumption control automation using artificial neural networks and adaptive algorithms: Proposal of a new methodology and case study. Applied Energy, 165, 60–71.
Chen, Y, Norford, LK, Samuelson, HW and Malkawi, A (2018) Optimal control of HVAC and window systems for natural ventilation through reinforcement learning. Energy and Buildings, 169, 195–205.
Crawley, DB, Lawrie, LK, Winkelmann, FC, Buhl, W, Huang, Y, Pedersen, CO, Strand, RK, Liesen, RJ, Fisher, DE, Witte, MJ and Glazer, J (2001) EnergyPlus: Creating a new-generation building energy simulation program. Energy and Buildings, 33, 319–331. https://doi.org/10.1016/S0378-7788(00)00114-6
Dalamagkidis, K, Kolokotsa, D, Kalaitzakis, K and Stavrakakis, GS (2007) Reinforcement learning for energy conservation and comfort in buildings. Building and Environment, 42, 2686–2698.
Delgarm, N, Sajadi, B, Delgarm, S and Kowsary, F (2016) A novel approach for the simulation-based optimization of the buildings energy consumption using NSGA-II: Case study in Iran. Energy and Buildings, 127, 552–560.
Ding, X, Du, W and Cerpa, A (2019) Octopus: Deep reinforcement learning for holistic smart building control. In Proceedings of the 6th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation. ACM, pp. 326–335.
Fiksel, A, Thornton, J, Klein, S and Beckman, W (1995) Developments to the TRNSYS simulation program. Journal of Solar Energy Engineering, 117, 123–127.
Kim, W, Jeon, Y and Kim, Y (2016) Simulation-based optimization of an integrated daylighting and HVAC system using the design of experiments method. Applied Energy, 162, 666–674.
Kwak, Y and Huh, J-H (2016) Development of a method of real-time building energy simulation for efficient predictive control. Energy Conversion and Management, 113, 220–229.
Kwak, Y, Huh, J-H and Jang, C (2015) Development of a model predictive control framework through real-time building energy management system data. Applied Energy, 155, 1–13.
Langevin, J, Wen, J and Gurian, PL (2016) Quantifying the human–building interaction: Considering the active, adaptive occupant in building performance simulation. Energy and Buildings, 117, 372–386.
Lipczynska, A, Schiavon, S and Graham, LT (2018) Thermal comfort and self-reported productivity in an office with ceiling fans in the tropics. Building and Environment, 135, 202–212.
Maasoumy, M, Rosenberg, C, Sangiovanni-Vincentelli, A and Callaway, DS (2014) Model predictive control approach to online computation of demand-side flexibility of commercial buildings HVAC systems for supply following. In 2014 American Control Conference, pp. 1082–1089.
Macarulla, M, Casals, M, Forcada, N and Gangolells, M (2017) Implementation of predictive control in a commercial building energy management system using neural networks. Energy and Buildings, 151, 511–519.
Miller, C, Thomas, D, Irigoyen, SD, Hersberger, C, Nagy, Z, Rossi, D and Schlueter, A (2014) BIM-extracted EnergyPlus model calibration for retrofit analysis of a historically listed building in Switzerland. Proceedings of SimBuild 2014, 331–338. https://doi.org/10.13140/RG.2.1.1671.7285
Mintz, D (2016) Technical Assistance Document for the Reporting of Daily Air Quality: The Air Quality Index (AQI). EPA-454/B-16-002. US Environmental Protection Agency, Office of Air Quality Planning and Standards.
Mnih, V, Kavukcuoglu, K, Silver, D, Graves, A, Antonoglou, I, Wierstra, D and Riedmiller, M (2013) Playing Atari with deep reinforcement learning. Preprint, https://people.engr.tamu.edu/guni/csce642/files/dqn.pdf.
Moon, JW and Jung, SK (2016) Development of a thermal control algorithm using artificial neural network models for improved thermal comfort and energy efficiency in accommodation buildings. Applied Thermal Engineering, 103, 1135–1144.
Nagarathinam, S, Vasan, A, Sarangan, V, Jayaprakash, R and Sivasubramaniam, A (2018) Good set-points make good neighbors: User seating and temperature control in uberized workspaces. In Proceedings of the 5th Conference on Systems for Built Environments. New York, NY, USA: ACM, pp. 144–147.
Nagarathinam, S, Vasan, A, Sarangan, V, Jayaprakash, R and Sivasubramaniam, A (2021) User placement and optimal cooling energy for co-working building spaces. ACM Transactions on Cyber-Physical Systems, 5, 1–24.
Norford, LK, Socolow, RH, Hsieh, ES and Spadaro, GV (1994) Two-to-one discrepancy between measured and predicted performance of a ‘low-energy’ office building: Insights from a reconciliation based on the DOE-2 model. Energy and Buildings, 21, 121–131. https://doi.org/10.1016/0378-7788(94)90005-1
Rastogi, K, Barthwal, A and Lohani, D (2019) AQCI: An IoT based air quality and thermal comfort model using fuzzy inference. In 2019 IEEE International Conference on Advanced Networks and Telecommunications Systems (ANTS). IEEE, pp. 1–6.
Reynders, G, Diriken, J and Saelens, D (2014) Quality of grey-box models and identified parameters as function of the accuracy of input and observation signals. Energy and Buildings, 82, 263–274.
Rong, X (2014) Word2vec parameter learning explained. Preprint, arXiv:1411.2738.
Seruto, C (2010) Whole-building retrofits: A gateway to climate stabilization. ASHRAE Transactions, 116, 244.
Sonta, A, Dougherty, TR and Jain, RK (2021) Data-driven optimization of building layouts for energy efficiency. Energy and Buildings, 238, 110815.
Srinivasan, RS, Manohar, B and Issa, RR (2020) Urban building energy CPS (UBE-CPS): Real-time demand response using digital twin. In Cyber-Physical Systems in the Built Environment. New York, NY, USA: Springer, pp. 309–322.
Sturzenegger, D, Gyalistras, D, Morari, M and Smith, RS (2015) Model predictive climate control of a Swiss office building: Implementation, results, and cost–benefit analysis. IEEE Transactions on Control Systems Technology, 24, 1–12.
Sun, Y, Haghighat, F and Fung, BC (2020) A review of the state-of-the-art in data-driven approaches for building energy prediction. Energy and Buildings, 221, 110022. https://doi.org/10.1016/j.enbuild.2020.110022
US Department of Energy (2015) An assessment of energy technologies and research opportunities (accessed 29 May 2021).
Wei, P, Chen, X, Vega, J, Xia, S, Chandrasekaran, R and Jiang, X (2017a) ePrints: A real-time and scalable system for fair apportionment and tracking of personal energy footprints in commercial buildings. In Proceedings of the 4th ACM International Conference on Systems for Energy-Efficient Built Environments. ACM, pp. 1–10.
Wei, P, Chen, X, Vega, J, Xia, S, Chandrasekaran, R and Jiang, X (2018a) A scalable system for apportionment and tracking of energy footprints in commercial buildings. ACM Transactions on Sensor Networks (TOSN), 14, 1–25.
Wei, P, Liu, Y, Kang, H, Yang, C and Jiang, X (2021) A low-cost and scalable personalized thermal comfort estimation system in indoor environments. In Proceedings of the First International Workshop on Cyber-Physical-Human System Design and Implementation, pp. 1–6. https://doi.org/10.1145/3458648.3460006
Wei, P, Xia, S, Chen, R, Qian, J, Li, C and Jiang, X (2020) A deep-reinforcement-learning-based recommender system for occupant-driven energy optimization in commercial buildings. IEEE Internet of Things Journal, 7, 6402–6413.
Wei, P, Xia, S and Jiang, X (2018b) Energy saving recommendations and user location modeling in commercial buildings. In Proceedings of the 26th Conference on User Modeling, Adaptation and Personalization, pp. 3–11.
Wei, T, Wang, Y and Zhu, Q (2017b) Deep reinforcement learning for building HVAC control. In Proceedings of the 54th Annual Design Automation Conference 2017, pp. 1–6.
Xia, S, Wei, P, Liu, Y, Sonta, A and Jiang, X (2023) RECA: A multi-task deep reinforcement learning-based recommender system for co-optimizing energy, comfort and air quality in commercial buildings. In Proceedings of the 10th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation. ACM, pp. 99–109.
Xiaa, S, Chandrasekaran, R, Liu, Y, Yang, C, Rosing, TS and Jiang, X (2021) A drone-based system for intelligent and autonomous homes. In Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems, pp. 349–350.
Yang, L, Nagy, Z, Goffin, P and Schlueter, A (2015) Reinforcement learning for optimal control of low exergy buildings. Applied Energy, 156, 577–586.
Zhang, Z, Chong, A, Pan, Y, Zhang, C, Lu, S and Lam, KP (2018) A deep reinforcement learning approach to using whole building energy model for HVAC optimal control. In 2018 Building Performance Analysis Conference and SimBuild.