Mario, learning to trade

Andrea C.
5 min read · Jun 28, 2021


Photo by Cláudio Luiz Castro on Unsplash

Mario is a newbie investor who wants to start trading on the stock market. There is only one stock available, the ACME Corporation stock, and its price follows a deterministic trend over time. However, Mario knows nothing about trading: as a newbie, he needs to learn how to behave in order to make a profit.

In this article we will see, through a bottom-up example, what the main elements of a Markov Decision Process (MDP) are and how to implement a minimal value iteration algorithm.

As is my habit, I will use hand-drawn formulas because I find it rather annoying to use LaTeX on Medium. I apologize in advance for the handwriting.

The share

The end-of-day price series of ACME follows a triangle wave centered on a mean value of 100:

The unrealistic daily time series of the ACME share
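To make the rest of the example concrete, here is a minimal sketch of that price series in Python. The exact values behind the chart are not given in the article, so the numbers below are illustrative assumptions: a triangle wave oscillating around a mean of 100 over the seven observed days.

```python
# Illustrative end-of-day prices for the ACME share, one per observed day.
# The exact values in the chart are not shown in the article, so these are
# assumed: a triangle wave oscillating around a mean of 100.
PRICES = [80, 100, 120, 100, 80, 100, 120]
N_DAYS = len(PRICES)

for day, price in enumerate(PRICES):
    print(f"day {day}: price {price}")
```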

The game rules

Mario, as a learning agent, observes the market for seven days, exploring all possible options. The set of all the situations in which an agent can find itself is called the state space. For Mario, each situation is defined by three coordinates: the time measured in days, the share price and the number of shares owned. Since the price is fully determined by the day, the state space boils down to 7 days × 3 holding levels, that is 21 possible states.

The state space
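As a sketch, the state space can be enumerated directly. Because the price is a deterministic function of the day, a pair (day, shares owned) is enough to identify a state:

```python
from itertools import product

# A state is a (day, shares_owned) pair; the price is implied by the day.
N_DAYS, MAX_SHARES = 7, 2
STATES = list(product(range(N_DAYS), range(MAX_SHARES + 1)))

assert len(STATES) == 21   # 7 days x 3 holding levels
```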

Mario can own at most two shares, so most of the time he has several options to choose from:

  • if he owns zero or one share, he can buy a new share;
  • if he owns one or two shares, he can sell a share;
  • finally, regardless of how many shares he owns, he can simply hold his position (the do-nothing option).

This is the action space: a mapping between states and the subset of actions available to the agent in each state. For Mario, the possible actions are BUY, SELL, or HOLD.

The state-action mapping
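In code, this state-action mapping can be expressed as a small helper that returns the legal actions for a given state. This is a sketch under the assumptions above: a holding cap of two shares and, anticipating the next section, no actions on the last day.

```python
MAX_SHARES, LAST_DAY = 2, 6

def available_actions(day, shares):
    """Return the subset of actions Mario may take in state (day, shares)."""
    if day == LAST_DAY:
        return []                      # terminal state: nothing left to do
    actions = ["HOLD"]                 # the do-nothing option is always legal
    if shares < MAX_SHARES:
        actions.append("BUY")
    if shares > 0:
        actions.append("SELL")
    return actions

print(available_actions(0, 0))   # ['HOLD', 'BUY']
print(available_actions(3, 2))   # ['HOLD', 'SELL']
```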

In our example the initial state and the chosen action uniquely determine the next state: the process is deterministic. In general this is not the case, and instead of a mapping we need a function that gives the probability of moving to state s’ given the current state s and the action a. This function is called the transition probability.
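For a deterministic process like this one, the transition "probability" collapses into an ordinary function returning the single successor state. A minimal sketch, reusing the state convention assumed above:

```python
def next_state(day, shares, action):
    """Deterministic transition: the successor of state (day, shares) under `action`."""
    delta = {"BUY": 1, "SELL": -1, "HOLD": 0}[action]
    return day + 1, shares + delta
```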

We can also see that the process is Markovian: to determine the next state (or to compute its probability) we only need the current state, not the whole history of past states.

On the last day, Mario can no longer do anything: the corresponding states are terminal states, as they offer no further actions to the agent.

To learn a good investment strategy, Mario needs to receive feedback from the system (the environment). Feedback is provided by the reward function, which assigns a reward to every action taken in every state.

“Oh, Mario. How I wish I could control everyone the way I can with you. Hop, you little plumber! Hop, hop, hop!”

Sheldon Cooper

We want Mario to learn to buy when the stock’s price is low and to sell when it is high. We also want Mario to be a prudent investor, so to discourage unnecessary trades we give him a small reward when he chooses the HOLD action. In other words, we are setting a small penalty on each trade, to make sure it is worth doing.

The reward function
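The hand-drawn reward table is not reproduced here, so the following sketch only assumes rewards consistent with the rules just described: buying costs the current price, selling earns it, and HOLD pays a small bonus (equivalently, a small penalty on every trade). The HOLD_BONUS value is an arbitrary choice.

```python
HOLD_BONUS = 1.0   # assumed small reward for doing nothing

def reward(price, action):
    """Reward for taking `action` when the share trades at `price`."""
    if action == "BUY":
        return -price        # cash out of pocket
    if action == "SELL":
        return price         # cash received
    return HOLD_BONUS        # HOLD: small reward for prudence
```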

Finally, we need a policy function that tells Mario what to do: this is a mapping between states and actions.

The goal is to find the optimal policy: the policy that maximizes the expected discounted sum of rewards (the value function) that the agent can collect starting from an initial state s, according to the following recursive definition:

Recursive evaluation of value function
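The hand-drawn formula is not reproduced here; written out in standard notation, the recursive definition is the Bellman optimality equation, which for a deterministic process like ours reduces to the second line (s' being the unique state reached from s by taking action a):

```latex
% General Bellman optimality equation
V(s) = \max_{a \in A(s)} \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big]

% Deterministic case: s' is the unique successor of s under a
V(s) = \max_{a \in A(s)} \big[ R(s,a) + \gamma\, V(s') \big]
```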

Value iteration algorithm

Value iteration is a method for computing the optimal policy and its value function.

At each iteration, and for each state, we find the best action according to the current value function: the one leading to the neighbouring state with the highest value, once the immediate reward is added. If this differs from what is currently stored, the policy and the value function are updated. We repeat this sweep until the change in the value function is very small, at which point convergence has been reached.
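The article’s own code is not shown in this text, so here is a self-contained value-iteration sketch that ties together the pieces assumed above (prices, actions, transitions, rewards). PRICES, GAMMA, HOLD_BONUS and THRESHOLD are all illustrative assumptions, not the article’s values.

```python
from itertools import product

# A self-contained value-iteration sketch for Mario's problem, using the
# illustrative prices, reward values and discount factor assumed earlier.
PRICES = [80, 100, 120, 100, 80, 100, 120]      # assumed triangle wave around 100
N_DAYS, MAX_SHARES = len(PRICES), 2
LAST_DAY = N_DAYS - 1
GAMMA, HOLD_BONUS, THRESHOLD = 0.9, 1.0, 1e-6   # discount factor and stopping rule

STATES = list(product(range(N_DAYS), range(MAX_SHARES + 1)))   # the 21 states

def available_actions(day, shares):
    if day == LAST_DAY:
        return []                                # terminal states: no actions
    return (["HOLD"]
            + (["BUY"] if shares < MAX_SHARES else [])
            + (["SELL"] if shares > 0 else []))

def next_state(day, shares, action):
    delta = {"BUY": 1, "SELL": -1, "HOLD": 0}[action]
    return day + 1, shares + delta

def reward(day, action):
    if action == "BUY":
        return -PRICES[day]
    if action == "SELL":
        return PRICES[day]
    return HOLD_BONUS                            # HOLD

def value_iteration():
    value = {s: 0.0 for s in STATES}
    policy = {s: None for s in STATES}
    while True:
        biggest_change = 0.0
        for day, shares in STATES:
            actions = available_actions(day, shares)
            if not actions:                      # terminal state keeps value 0
                continue
            # Pick the action whose immediate reward plus discounted value of
            # the successor state is the largest.
            best_value, best_action = max(
                (reward(day, a) + GAMMA * value[next_state(day, shares, a)], a)
                for a in actions
            )
            biggest_change = max(biggest_change,
                                 abs(best_value - value[(day, shares)]))
            value[(day, shares)] = best_value
            policy[(day, shares)] = best_action
        if biggest_change < THRESHOLD:           # convergence reached
            return policy, value
```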

We can easily control the convergence speed by changing the threshold value. With the current one, the code converges after a few hundred thousand iterations and displays its best guess for the policy and the value function.
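With the sketch above, running the iteration and printing its best guess could look like the snippet below; the numbers will differ from the article’s hand-drawn tables, since the inputs here are assumed.

```python
policy, value = value_iteration()
for day, shares in sorted(policy):
    action = policy[(day, shares)] or "-"        # terminal states have no action
    print(f"day {day}, {shares} share(s): {action:4s}  V = {value[(day, shares)]:8.2f}")
```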

The policy found by the algorithm
The estimated value function

Looking at the results, we can make a couple of observations. The first is that the values decrease as the days pass: later states leave less opportunity to accumulate money over time.

The second is that, as expected, the highest values are found in the states that allow for the largest profits: being able to buy at the lowest price in the early days in order to sell afterwards.

In just seven days, Mario has learned that one should never buy when the price is at its highest nor sell when it is at its lowest: a simple concept, and yet something that even seasoned traders sometimes forget.
