In the section *Off-policy Monte Carlo Control* of the book *Reinforcement Learning: An Introduction, 2nd Edition* (page 112), the author leaves us with an interesting exercise: using the weighted importance sampling off-policy Monte Carlo method to find the fastest way to drive on both tracks. This exercise is comprehensive in that it asks us to consider and build almost every component of a reinforcement learning task: the environment, the agent, the reward, the actions, the termination conditions, and the algorithm. Solving it is fun and helps us build a solid understanding of the interaction between algorithm and environment, the importance of a correct episodic task definition, and how value initialization affects the training outcome. Through this post, I hope to share my understanding of and solution to this exercise with everyone interested in reinforcement learning.

As mentioned above, this exercise asks us to find a policy that makes a race car drive from the starting line to the finishing line as fast as possible without running into gravel or off the track. After carefully reading the exercise description, I listed some key points that are essential to completing this task:

- **Map representation**: maps in this context are actually 2D matrices with (row_index, column_index) as coordinates. The value of each cell represents the state of that cell; for instance, we can use 0 to describe gravel, 1 for the track surface, 0.4 for the starting field, and 0.8 for the finishing line. Any row and column index outside the matrix can be considered out-of-bounds.
- **Car representation**: we can directly use the matrix's coordinates to represent the car's position.
- **Speed and control**: the speed space is discrete and consists of horizontal and vertical speeds that can be represented as a tuple (row_speed, col_speed). The speed limit on both axes is (-5, 5), and the speed is incremented by +1, 0, or -1 on each axis in each step; therefore, there are a total of 9 possible actions per step. In addition, both speeds cannot be zero at the same time except at the starting line, and the vertical speed, or row speed, cannot be negative, as we don't want our car to drive back to the starting line.
- **Reward and episode**: the reward for each step before crossing the finishing line is -1. When the car runs off the track, it is reset to one of the starting cells. The episode ends **ONLY** when the car successfully crosses the finishing line.
- **Starting states**: we randomly choose a starting cell for the car from the starting line; the car's initial speed is (0, 0) according to the exercise description.
- **Zero-acceleration challenge**: the author proposes a small *zero-acceleration challenge*: at each time step, with probability 0.1, the action does not take effect and the car keeps its previous speed. We can implement this challenge in training instead of adding the feature to the environment.
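To make the speed constraints concrete, here is a minimal sketch of the action space and the update rules described above (the function names `apply_action` and `noisy_action` are my own, not from the book or the final environment code):

```python
import random

# The 9 possible acceleration actions: each axis is incremented by -1, 0, or +1.
ACTIONS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]

def apply_action(speed, action, on_start_line=False):
    """Apply an acceleration and enforce the constraints listed above.

    Returns the new (row_speed, col_speed), or the unchanged speed if the
    result would be illegal.
    """
    row_speed = max(-5, min(5, speed[0] + action[0]))
    col_speed = max(-5, min(5, speed[1] + action[1]))
    # The row speed cannot be negative (no driving back toward the start line).
    if row_speed < 0:
        return speed
    # Both speeds cannot be zero at the same time except on the starting line.
    if row_speed == 0 and col_speed == 0 and not on_start_line:
        return speed
    return (row_speed, col_speed)

def noisy_action(action, p=0.1):
    """Zero-acceleration challenge: with probability p the action has no effect."""
    return (0, 0) if random.random() < p else action
```

For example, `apply_action((1, 1), (-1, -1))` keeps the speed at `(1, 1)`, since decelerating to `(0, 0)` away from the starting line is not allowed.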

The solution to the exercise is split into two posts; in this post, we'll focus on building the racetrack environment. The file structure of this exercise is as follows:

```
|-- race_track_env
|   |-- maps
|   |   |-- build_tracks.py        // this file is used to generate track maps
|   |   |-- track_a.npy            // track a data
|   |   |-- track_b.npy            // track b data
|   |-- race_track.py              // race track environment
|-- exercise_5_12_racetrack.py     // the solution to this exercise
```

The libraries used in this implementation are as follows:

```
python==3.9.16
numpy==1.24.3
matplotlib==3.7.1
pygame==2.5.0
```

We can represent track maps as 2D matrices with different values indicating track states. I want to stay faithful to the exercise, so I'm trying to build the same maps shown in the book by assigning matrix values manually. The maps will be saved as separate *.npy* files so that the environment can read them during training instead of generating them at runtime.
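As a minimal sketch of this build-and-save approach (the tiny rectangular track and the file name below are my own illustration, not the book's track A or track B, which are assigned cell by cell in `build_tracks.py`):

```python
import numpy as np

# Cell codes as described above: gravel, track surface, starting field, finishing line.
GRAVEL, TRACK, START, FINISH = 0.0, 1.0, 0.4, 0.8

def build_toy_track(rows=10, cols=6):
    """Build a tiny rectangular toy track with gravel on both side columns."""
    track = np.full((rows, cols), GRAVEL)
    track[:, 1:-1] = TRACK      # drivable surface in the middle columns
    track[-1, 1:-1] = START     # starting line on the bottom row
    track[0, 1:-1] = FINISH     # finishing line on the top row
    return track

track = build_toy_track()
np.save("toy_track.npy", track)         # saved once, read by the environment later
loaded = np.load("toy_track.npy")
assert (loaded == track).all()
```

Because the matrices are saved with `np.save`, the environment only needs a single `np.load` call at construction time, and the map layout stays identical across training runs.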