Neural Network for the OpenAI CartPole-v1 Challenge with Keras

In this article we will talk about my solution to the OpenAI CartPole-v1 game challenge, using Python with Keras 😀

If you don’t know what OpenAI is, I strongly suggest you follow this link: https://openai.com/about/

First of all, you can access the original challenge here: https://gym.openai.com/envs/CartPole-v1/

If you are eager to see the code, click here: https://github.com/guibacellar/OpenAi

So, let's start!!!!

Challenge Overview

The goal of the challenge is to move the cart left or right, preventing the pole from falling over and keeping the cart from leaving the game screen. Looks simple, for humans.

The game engine provides us with 4 variables on every movement:

  • observation > An array with the game observation.
  • reward > Round reward; in this game it is always fixed at 1 (int).
  • done > Boolean flag indicating whether the game is done (for good or bad).
  • info > Diagnostics info.

The observation variable is the most important information the game provides us, and in this challenge it contains:

Num   Observation            Min    Max
0     Cart Position          -4.8   4.8
1     Cart Velocity          -Inf   Inf
2     Pole Angle             -24°   24°
3     Pole Velocity At Tip   -Inf   Inf

The whole process basically consists of resetting the game environment, making a movement, grabbing the returned variables, and making another movement, over and over, until you win or lose. The raw observation data produced along the way looks roughly like the sketch below.
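
A minimal sketch of that loop, using only random movements (this assumes the classic Gym API used throughout this post, where reset() returns the observation and step() returns 4 values):

import gym

env = gym.make('CartPole-v1')

observation = env.reset()
print(observation)  # e.g. [ 0.0213 -0.0165  0.0312  0.0284] (values vary on every run)

done = False
total_reward = 0

while not done:
    action = env.action_space.sample()                  # random movement: 0 = left, 1 = right
    observation, reward, done, info = env.step(action)  # play and grab the variables
    total_reward += reward

print('Game over, score:', total_reward)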

The Setup

To accomplish this challenge we will use Python 3.6+, Pandas, NumPy, Keras, TensorFlow and, obviously, the Gym library.
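
For reference, all the snippets below assume roughly this setup (the exact imports are not shown in the original code, so treat this as a sketch; depending on your install you may need tensorflow.keras instead of keras):

import gym
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense

# Game environment shared by all the functions below
env = gym.make('CartPole-v1')

# One-hot encodings for the two possible movements (explained later in the article)
LEFT_CMD = [1, 0]
RIGHT_CMD = [0, 1]

# Minimum score a random game must reach to be kept as training data
MIN_REWARD = 100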

The Code, Hurray!!!!!

Getting Samples, Randomly

Our first step is to play some random games to get satisfactory data to train our Neural Network. For that, we can use the following function:

def play_random_games(games=100):
    """
    Play Random Games to Get Some Observations
    :param games: number of random games to play
    :return: DataFrame with the observations and the action taken after each one
    """

    # Storage for All Games Movements
    all_movements = []

    for episode in range(games):

        # Reset Game Reward
        episode_reward = 0

        # Define Storage for Current Game Data
        current_game_data = []

        # Reset Game Environment
        env.reset()

        # Get First Random Movement
        action = env.action_space.sample()

        while True:

            # Play
            observation, reward, done, info = env.step(action)

            # Get Next Random Action (stored as the "next" movement to compensate the previous one)
            action = env.action_space.sample()

            # Store Observation Data and Action Taken
            current_game_data.append(
                np.hstack((observation, LEFT_CMD if action == 0 else RIGHT_CMD))
            )

            if done:
                break

            # Compute Reward
            episode_reward += reward

        # Save All Data (Only for the Best Games)
        if episode_reward >= MIN_REWARD:
            print('.', end='')
            all_movements.extend(current_game_data)

    # Create DataFrame
    dataframe = pd.DataFrame(
        all_movements,
        columns=['cart_position', 'cart_velocity', 'pole_angle', 'pole_velocity_at_tip', 'action_to_left', 'action_to_right']
    )

    # Convert Action Columns to Integer
    dataframe['action_to_left'] = dataframe['action_to_left'].astype(int)
    dataframe['action_to_right'] = dataframe['action_to_right'].astype(int)

    return dataframe

This code has an important check to ensure that only the best of the best random plays will be used for training.

I set the MIN_REWARD variable to 100 and perform 10k random plays.

df = play_random_games(games=10000)

At this point we should have a DataFrame with the 4 observation values plus 2 new columns indicating whether our random agent moved the cart to the left or to the right.
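
A quick way to check what we collected (these calls are just a suggestion, not part of the original code):

print(df.shape)             # (number of stored movements, 6)
print(df.columns.tolist())  # 4 observation columns + action_to_left + action_to_right
print(df.head())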

One important thing to note is that we do not store the action taken before the step; we store the action sampled after making the movement. That's because we want to use not the action itself, but the next random action chosen after the movement. Complex, no?

Think of it this way: our goal is to keep the pole and the cart in equilibrium, so if we move the cart to the left, the correct next movement we want is to the right (because the movement to the right will restore the pole/cart equilibrium).
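
To make that pairing concrete, here is an illustration-only snippet (not part of the training code) showing which action ends up stored alongside each observation:

import gym

env = gym.make('CartPole-v1')
env.reset()

action = env.action_space.sample()        # movement we are about to make
observation, _, _, _ = env.step(action)   # state the cart/pole landed in

next_action = env.action_space.sample()   # this is what gets stored as the label
print(observation, '->', 'left' if next_action == 0 else 'right')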

Training Neural Network Model

Now we need to build our Neural Network and feed all our data into the training process.

def generate_ml(dataframe):
    # Define Neural Network Topology
    model = Sequential()
    model.add(Dense(64, input_dim=4, activation='relu'))
    model.add(Dense(64,  activation='relu'))
    model.add(Dense(32,  activation='relu'))
    model.add(Dense(2,  activation='sigmoid'))

    # Compile Neural Network
    model.compile(optimizer='adam', loss='categorical_crossentropy')

    # Fit Model with Data
    model.fit(
        dataframe[['cart_position', 'cart_velocity', 'pole_angle', 'pole_velocity_at_tip']],
        dataframe[['action_to_left', 'action_to_right']],
        epochs=20
    )

    return model

Our Neural Network takes the 4 observation values as input, passes them through three hidden Dense layers (64, 64 and 32 units) and outputs 2 values, one per movement. For the training process we feed in the 4 observation variables collected from the random games as features and the 2 movement options as targets. At this point you might be wondering how I encode the movements, so here it is:

RIGHT_CMD = [0, 1]
LEFT_CMD = [1, 0]
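
Note that this encoding lines up with Gym's action ids for CartPole (0 pushes the cart to the left, 1 pushes it to the right), which is what lets us recover a valid action later with np.argmax. A tiny sanity check, just for illustration:

import numpy as np

RIGHT_CMD = [0, 1]
LEFT_CMD = [1, 0]

assert np.argmax(LEFT_CMD) == 0   # Gym action 0 = push cart to the left
assert np.argmax(RIGHT_CMD) == 1  # Gym action 1 = push cart to the right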

After some (automatic, haha) number crunching, our model should be properly trained.
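
Putting it together, training is just a matter of calling the function with the DataFrame of random games (this call is implied by the post but not shown explicitly):

model = generate_ml(df)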

The Result

Now all that is left is to play the real game using our Neural Network model. But to do that, we need to make a small change to our original random agent code.

def play_game(ml_model, games=100):
    """
    Play the Game Using the Trained Model
    :param ml_model: trained Keras model used to predict the next movement
    :param games: number of games to play
    :return:
    """

    for i_episode in range(games):

        # Define Reward Var
        episode_reward = 0

        # Reset Env for the Game
        observation = env.reset()

        while True:
            # Render the Game Screen
            env.render()

            # Predict Next Movement
            current_action_pred = ml_model.predict(observation.reshape(1, 4))

            # Define Movement
            current_action = np.argmax(current_action_pred)

            # Make Movement
            observation, reward, done, info = env.step(current_action)

            if done:
                episode_reward += 1
                print(f"Episode {i_episode + 1} finished.", end='')
                break

            episode_reward += 1

        print(f" Score = {episode_reward}")

And the expected result looks like this video:

By the way, you can access the working code at https://github.com/guibacellar/OpenAi
