Reinforced Elevators

December 08, 2024 - 10 min read

Years ago, I took an undergraduate class on artificial intelligence. One day, my professor offhandedly mentioned that someone had used reinforcement learning to train an elevator. Well, after that random remark resurfaced in my memory years later, I think I have found the paper in question.

After reading it, I want to try implementing what they published in 2008. This is my attempt.

What is it

Reinforcement Learning (RL for short) is a technique where an agent learns by interacting with the world around it. As opposed to neural nets, which learn by coming to an answer and then updating their assumptions based on how wrong or right they were, RL involves rewarding or punishing the model based on the actions that it takes.

Like a nun hitting a child’s knuckles when they misbehave, we will define some criteria for good behavior and do much the same. If our robot arm knocks over a box, we will punish it. If it knocks over a person, we punish it even more. However, if it successfully picks up the box, we will give it a reward. Instead of candy and bruises, our rewards and punishments will be in the form of an integer. A positive number is a reward, while a negative one is a punishment.

The benefit over other approaches is that RL allows the model to learn in real time. We can continuously improve our models as they execute in real life. Furthermore, the model can learn how to balance risk vs reward: it will take riskier actions when the potential payoff is worth it.

Basic Math

While there are many approaches to RL, I will be using Q-Learning. There is a fairly simple function behind this method:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

Well, at first glance it's not so simple, but we can break it down to make it easier. (A quick numeric example follows the list below.)

  • s is our current state (i.e. where we are in the world)
  • a is the action we are taking (i.e. go left)
  • r is the reward for doing a in state s (i.e. reaching the end of the maze)
  • s' is the next state after we do a
  • a' is some action done after we get to the next state s'
  • α is the learning rate (how much we should learn based on observations)
  • γ is the discount factor (how much we weigh future vs current rewards)
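
To make the update concrete, here is a quick numeric sketch. Every number below is made up purely for illustration; it just plugs values into the formula once.

alpha = 0.5   # learning rate
gamma = 0.5   # discount factor

# Current estimates for state s (one value per action)
Q_s      = {"left": 0.0, "right": 2.0}
# Estimates for the next state s' we land in after going right
Q_s_next = {"left": 1.0, "right": 4.0}

r = -3  # reward observed for taking "right" in s

# Q(s, right) <- Q(s, right) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, right))
Q_s["right"] += alpha * (r + gamma * max(Q_s_next.values()) - Q_s["right"])

print(Q_s["right"])  # 2.0 + 0.5 * (-3 + 0.5*4.0 - 2.0) = 0.5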

For each state we find ourselves in, we want to find an action that moves us toward states with better rewards. Determining which action to take next is done by our policy function π. For today, I am using a greedy approach that takes whatever action leads to the highest value in Q:

\pi(s) = \arg \max_{a \in A} Q(s, a)

However, we might not always follow the action our policy recommends. We can exploit, meaning we follow π, or we can explore, i.e. take random actions to see if we can find a better choice. Exploring is risky, but it can help us find better solutions.
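
The training loop later in this post makes exactly this trade-off. As a standalone sketch, an epsilon-greedy action choice might look like the following (the epsilon value and the Q_table layout here are illustrative, not lifted from my repo):

import numpy as np

def choose_action(Q_table, state, num_actions, epsilon=0.1):
  # With probability epsilon, explore: pick a random action
  if np.random.rand() < epsilon:
    return np.random.randint(num_actions)
  # Otherwise, exploit: follow the greedy policy pi(s) = argmax_a Q(s, a)
  return int(np.argmax(Q_table[state]))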

Toy Problem

A good example to illustrate how we can use Q-Learning is solving a maze. For example, imagine we needed to navigate the guy below to the goal:

⬜⬜⬜⬜⬜⬜⬜
⬜⬜🟫⬜🟫🟫⬜
🟫⬜🟫🟫⬜⬜⬜
🏃⬜⬜🟫🏁⬜⬜
⬜⬜⬜🟫⬜⬜⬜
⬜⬜⬜🟫⬜⬜⬜
⬜⬜⬜⬜⬜⬜⬜

Using RL, we can treat each square in the grid as a state based on its location - coordinates like (x, y). We will set our reward function to be -1 times the distance from our guy's current location to the goal. It is important to multiply by negative one since we want to maximize the reward function. Further, if we reach the goal, we will give a nice and juicy reward of 100 points.
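
As a sketch, the maze reward could look something like this (the helper name and the goal coordinate are mine; the -1 times distance and the +100 bonus mirror the description above):

def maze_reward(position, goal=(4, 3)):
  # Manhattan distance from our guy's (x, y) position to the goal flag
  distance = abs(position[0] - goal[0]) + abs(position[1] - goal[1])
  if distance == 0:
    return 100  # nice and juicy reward for reaching the goal
  return -1 * distance  # otherwise, closer is better (less negative)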

Starting out, we have no idea which direction we need to move. Our guy will walk randomly. However, if we run enough simulations, eventually we will learn the best direction to move at each spot.

➡️➡️➡️➡️➡️➡️⬇️
➡️⬆️🟫⬆️🟫🟫⬇️
🟫⬆️🟫🟫⬇️⬇️⬇️
🏃⬆️⬅️🟫🏁⬅️⬅️
➡️⬆️⬆️🟫⬆️⬅️⬅️
➡️➡️⬇️🟫⬆️⬅️⬅️
➡️➡️➡️➡️⬆️⬅️⬅️

No matter which square we find ourselves in, we can follow the arrows all the way to the goal!
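
The arrow map is nothing more than the greedy policy read out of the Q-table. A rough sketch of how it can be extracted, assuming the table is keyed by (x, y) squares and the actions are ordered up, down, left, right (an assumption for illustration):

import numpy as np

ARROWS = ["⬆️", "⬇️", "⬅️", "➡️"]  # assumed action order: up, down, left, right

def policy_arrow(Q_table, square):
  # For each visited square, the arrow is simply the argmax over that square's Q values
  return ARROWS[int(np.argmax(Q_table[square]))]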

While the above solution does help us reach the goal, you may notice that it is actually faster to move down from the starting square. This is an example of over-exploitation. This is why exploration is so important. If we increase the probability of exploration and try again, we can find a much better solution:

➡️➡️➡️➡️➡️➡️⬇️
➡️⬆️🟫⬆️🟫🟫⬇️
🟫⬆️🟫🟫⬇️⬇️⬇️
🏃⬇️⬇️🟫🏁⬅️⬅️
➡️⬇️⬇️🟫⬆️⬅️⬅️
➡️➡️⬇️🟫⬆️⬅️⬅️
➡️➡️➡️➡️⬆️⬅️⬅️

Code

I wrote the following code to perform training. Training will be split into epochs. We will keep our Q table between epochs, but we will start over at a new starting state. If the start state does not exist in the table, we will initialize it to some dummy values.

def train(self):
  for epoch in range(self.epochs):
    current_state = self.nextStartState(epoch)

    if current_state not in self.Q_table:
      self.Q_table[current_state] = [0 for _ in self._actions]

Next, we will run a loop that only terminates when we have reached our goal. In that loop, we start by executing the policy function π. This simply tells us which action to take at our current state based on what we know. Based on our exploration rate, we may decide to take a random action instead.

    while not self.reachedGoal(current_state):

      if np.random.rand() < self.exploration_prob:
        action = np.random.randint(len(self._actions))
      else:
        action = np.argmax(self.Q_table[current_state])

Now that we have the action, we can apply it to get the next state. For example, if we are solving a maze, we can tell the agent to move north, and it will end up one spot further north. Besides some code to make sure the next state exists in our Q_table, I will also run the reward function. In the maze, this might be a reward for getting closer to the goal (e.g. the negative Manhattan distance to the goal).

      next_state = self._applyAction(current_state, action)

      # Invalid states will not be acceptable
      if self.validState(next_state):
        if next_state not in self.Q_table:
          self.Q_table[next_state] = [0 for _ in self._actions]
        reward = self.reward(next_state)
      else:
        reward = -np.inf
        next_state = current_state

I have built in some logic to highly discourage bad behavior – like knocking into a human – by handing out a punishment of negative infinity. This will also discourage our bot from trying to exit the bounds of the world or walk through walls.

      if reward == -np.inf:
        # Illegal actions are not encouraged
        self.Q_table[current_state][action] = -np.inf
      else:
        self.Q_table[current_state][action] += (
          self.learning_rate * (
            reward + self.discount_factor * np.max(self.Q_table[next_state]) - self.Q_table[current_state][action])
        )
        current_state = next_state

Finally, we execute the Q update from above. Once we've updated the table, we also update the current state and then repeat. Once we've reached our goal, or iterated for too long, we move on to the next epoch. In this way, I've implemented an abstract Q-Learner.

Elevators

Now that we have code to run the Q-Learner, we only need two things to get started with elevators. The first is a way of storing the state, and the second is a reward function.
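
Concretely, the abstract learner above calls a handful of problem-specific hooks. The method names below come from the training loop we just walked through, but the comments describing what each one does for elevators are my own rough reading, not the exact repo code:

class ElevatorController:  # in practice this builds on the abstract Q-Learner above
  def reward(self, next_state):
    ...  # score a state; the version I actually used is shown below

  def validState(self, state):
    ...  # e.g. reject elevators outside the building or over capacity

  def _applyAction(self, state, action):
    ...  # move an elevator up or down, or load/unload passengers

  def reachedGoal(self, state):
    ...  # e.g. no callers waiting and no riders left to drop off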

States

I am not proud of this, but I implemented the state as a nested tuple. Because I need a hashable Python type to serve as the keys of the dictionary representing the Q_table, lists and objects cannot be used. I tried mapping the state onto a 1-dimensional integer, but that math became untenable. Instead, all states are represented in the following format:

(
  (floors that have called the elevator…),
  (elevator 1 position, elevator 1 num riders, (elevator 1 rider destinations…)),
  (elevator 2 position, elevator 2 num riders, (elevator 2 rider destinations…)),
  ...
)

Yucky, but this gets the job done. As an example, here's the representation for a 3-story building with one elevator sitting empty on the first (i.e. 0th) floor, another elevator carrying a passenger who wants to go to the first floor, and a person calling the elevator on the second floor:

((2,), (0, 0, ()), (1, 1, (0,)))

I hope that makes sense. The important thing to remember is that the state representation is arbitrary and simply needs to be distinguishable from other states - such as ((1,), (0, 1, (2,)), (2, 0, ())).

Something else I should mention is that I limit each elevator's capacity. I did this because I found that as the number of possible states increases, the RL model needs to run longer and longer; reducing the state space improves performance. For the same reason, I sort the callers and the desired floors for each elevator. This means that ((1, 3), (0, 2, (2, 4))) is a valid state, while ((3, 1), (0, 2, (4, 2))) is not.
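
A small sketch of that canonicalization step (the helper name is mine, and it assumes the three-element elevator tuples described above):

def canonicalize(state):
  callers, *elevators = state
  return (
    tuple(sorted(callers)),
    *[
      (position, riders, tuple(sorted(destinations)))
      for position, riders, destinations in elevators
    ],
  )

# Both orderings now collapse into the same Q-table key
assert canonicalize(((3, 1), (0, 2, (4, 2)))) == ((1, 3), (0, 2, (2, 4)))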

Reward Function

I’ve tried a few different reward functions, but as far as I can tell the best ones do the following:

  1. Encourage moving in the right direction (reward for elevators being close to floors with waiting callers)
  2. Discourage bad behavior (I noticed elevators tended to follow each other – I reward them for spreading apart)
  3. Highly reward the goal (Dropping a passenger off)

In the end, I used the following function to reward the model:

def reward(self, next_state):
  occupancy = 0
  drop_offs = 0
  buttkick = 0

  # For each elevator
  for e in next_state[1:]:
    occupancy += e[1]
    # If the elevator is not moving
    if not e[3]:
      # Check if one of the occupants wants to get out
      if e[0] in e[2]:
        drop_offs += e[2].count(e[0])

    # Don't hang out on the first floor if people have called the elevator
    if e[0] == 0 and len(next_state[0]):
      # Sometimes you need a good kick in the butt
      buttkick += 1

  return (
    # Punish for callers waiting to get on the elevator
    -len(next_state[0])
    - occupancy
    - buttkick
    + 10*(drop_offs**2)
    # Encourage elevators to not bunch up by rewarding spacing between them
    + self.elevator_spacing([e[0] for e in next_state[1:]])/2
  )

Something I noticed is that elevators would get stuck on the first floor - even when callers were waiting for them above. My reward function punishes them for doing so.

Moving Elevators

All the code in this post is available in full on GitHub.

To start with, I defined a simple elevator that could carry 2 passengers and move between 3 floors.

controller = ElevatorController(learning_rate=0.3, num_elevators=1, num_floors=3, max_capacity=2, epochs=1000)
controller.train()
controller.save("simple_elevator.pkl")

In order to save time, I save all models so I can rerun them later. I built a fun visualizer using matplotlib's animation class. The circles on the left represent each floor; when a passenger is waiting for pickup, the corresponding floor lights up yellow. To the right, we can see the elevator moving up and down. When the elevator is white, it is not moving and the door is open; otherwise, it is filled in grey while moving. Blue circles represent passengers onboard the elevator, and the number in each circle shows which floor the passenger wants to go to. To reduce training time, passengers will only ever want to go to the bottom floor.
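
The real visualizer lives in the repo, but the skeleton is roughly this (a minimal sketch using matplotlib's FuncAnimation; replay_states and draw_frame are placeholder names, and the actual drawing code is elided):

import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

# Placeholder: in the real visualizer this is a list of recorded simulation states
replay_states = [None] * 10

fig, ax = plt.subplots()

def draw_frame(i):
  ax.clear()
  state = replay_states[i]
  # ...draw floor markers, the elevator box, and onboard-passenger circles for `state`...
  ax.set_title(f"step {i}")

anim = FuncAnimation(fig, draw_frame, frames=len(replay_states), interval=200)
plt.show()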

That’s not bad! It sometimes prioritizes the first floor over the second. I tried tuning the reward function to penalize this, but it ended up not being worth it. Maybe someone else will improve on my work. Otherwise, I am happy with the performance.

Next, I tried to train on a much more complex setup with 5 floors and 2 elevators. I kept the capacity for each elevator at 2.

controller = ElevatorController(learning_rate=0.3, max_capacity=2, epochs=1000)
controller.train()
controller.save("complex_elevator.pkl")

This produces… interesting behavior:

That does not look good. We could try tuning our variables; however, if we simply increase the number of epochs and keep training:

controller = ElevatorController(learning_rate=0.3, max_capacity=2, epochs=100_000)
controller.load("complex_elevator.pkl")
controller.train()

We get much better results!

We can pick up passengers at any floor, and then deposit them where they want to go. The two elevators tend to separate responsibilities: if one is at the top of the building, the other will stay lower. It’s not perfect, and I’m not sure I would want to ride in this elevator. Sometimes a person wanting to go down will start to go back up. However, I’m sure a few more hours of tuning and improvement could smooth out these rough edges.

A major challenge of using RL for elevators is the long training times, especially as the number of floors and elevators increases. This happens due to the rapid growth of the state space, which requires more time to explore and optimize.
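
A rough back-of-the-envelope count shows why. This is a loose upper bound I computed for illustration (it ignores restrictions like passengers only heading down, and counts some states the simulation can never reach):

from math import comb

def rough_state_count(floors, elevators, capacity):
  # Callers: any subset of floors could be waiting for an elevator
  caller_states = 2 ** floors
  # Each elevator: a position, plus a sorted multiset of up to `capacity` destinations
  destination_multisets = sum(comb(floors + k - 1, k) for k in range(capacity + 1))
  per_elevator = floors * destination_multisets
  return caller_states * per_elevator ** elevators

print(rough_state_count(3, 1, 2))    # simple setup: 240 states
print(rough_state_count(5, 2, 2))    # 5 floors, 2 elevators: ~350 thousand
print(rough_state_count(10, 2, 4))   # the overnight run: ~10^11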

For fun, I also ran a simulation overnight with 10 floors and 2 elevators, and I increased the capacity to 4.

The results are less than optimal. Perhaps further tuning of the model could improve things, but the saved model file is almost 1 GB at this point. This approach does not scale to such a large state space; it was faster and easier to train on fewer floors.

As always, the code for this post is available on GitHub.

Conclusion

While I didn’t succeed in making the perfect elevator controller, I did make a half decent one. Or more correctly, I trained a pretty good one.

I found that simple reward functions tended to produce more stable behavior. When I tried punishing elevators for stopping without reason (i.e. when no one wants to get on or off), I found that either the punishment was too small and got ignored, or it was too large and the elevators would never stop.

Looking ahead, I’d like to revisit this problem and explore more advanced techniques, such as deep Q-learning, which shows great potential in scaling up RL for more complex environments.

However, for now at least, I am very happy with this experiment. It’s nice to know I have a fairly basic machine learning technique in my tool belt.

