Reinforcement Learning (RL)
Machine learning is divided into three types: supervised, unsupervised, and reinforcement learning. Some writers also list semi-supervised learning as a fourth type, but here we stick with three. In supervised learning, labeled data is given, so on the basis of the labeled (input, output) pairs the model produces outputs for the unknown inputs you give it. Classification and regression are examples.
The second type is unsupervised learning, in which unlabeled data is used to train the model. The third type, reinforcement learning, is the one in which the model's learning is closest to human learning. Before going into the details of reinforcement learning, we define some related terms.
- Policy: The behavior of an agent, i.e., which action it takes in a given state.
- Agent: The component that decides which action to take. Usually this is the system.
- Reward: On the basis of the action, the environment sends the agent a single number (positive or negative) called the reward. The objective in reinforcement learning is to maximize the reward.
- Value Function: The total reward an agent can expect to accumulate starting from a given state.
- Model of the Environment: A component that mimics the behavior of the environment. In many RL applications, the environment is the user.
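The terms above can be sketched as plain data structures. Everything here (the states, actions, rewards, and hand-filled values) is an illustrative assumption for a toy three-state world, not part of any real system:

```python
# Toy illustration of the RL vocabulary defined above.
states = ["s0", "s1", "s2"]
actions = ["left", "right"]

# Policy: the agent's behavior -- which action it takes in each state.
policy = {"s0": "right", "s1": "right", "s2": "left"}

# Reward: a single number the environment returns for a (state, action) pair.
rewards = {("s0", "right"): 0.0, ("s1", "right"): 1.0, ("s2", "left"): -1.0}

# Value function: total reward expected starting from each state
# (filled in by hand here, purely for illustration).
value = {"s0": 1.0, "s1": 1.0, "s2": -1.0}

print(policy["s0"], rewards[("s0", "right")], value["s0"])
```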
The working mechanism of reinforcement learning is as follows: in some state s_t, the agent performs an action selected by its policy and is rewarded with a number (negative or positive). On the basis of this reward from the environment, the agent trains itself, and the state is updated to s_{t+1}. The goal of the agent is to maximize the reward in the long run.
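This loop can be sketched with tabular Q-learning on a tiny chain environment. The environment, the hyperparameters, and the use of Q-learning specifically are all assumptions made for illustration; the text above describes the general agent-environment loop, not this particular algorithm:

```python
import random

N_STATES = 5          # states 0..4; reaching state 4 yields reward +1
ACTIONS = [-1, +1]    # move left or right along the chain
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1

# Q[(s, a)]: learned estimate of long-run reward for taking a in s
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    """Environment: returns (next_state, reward) for an action."""
    s_next = max(0, min(N_STATES - 1, s + a))
    return s_next, (1.0 if s_next == N_STATES - 1 else 0.0)

random.seed(0)
for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy policy: mostly exploit, occasionally explore
        if random.random() < EPS:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next, r = step(s, a)
        # update: move Q toward reward + discounted best future value
        best_next = max(Q[(s_next, act)] for act in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s_next  # the state is updated to s_{t+1}

# the learned greedy action at state 0 is +1 (move toward the goal)
print(max(ACTIONS, key=lambda act: Q[(0, act)]))
```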
There are some concerns about this type of learning. Take the example of a football game. Suppose a goal is scored and the agent (model) receives a positive reward. It is not a single step that matters; a sequence of steps finally brought about the score. Some of those steps were fruitful and others may not have been, but if the reward is positive the model reinforces all of them, and if the outcome is negative all of the steps are discouraged. This is called the credit assignment problem: the problem of determining which actions led to a certain outcome. The learning speed of the model can also vary with its parameters, and too much reinforcement can lead to an overload of states, which can diminish the results.
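One standard way to spread credit over the steps of an episode is the discounted return: each step is credited with the rewards that follow it, weighted down by a discount factor. The reward sequence below (a single +1 at the end, like the goal in the football example) and the discount value are illustrative assumptions:

```python
GAMMA = 0.9
rewards = [0.0, 0.0, 0.0, 1.0]   # only the final step (the goal) is rewarded

def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over the episode."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# earlier steps still receive credit, just progressively less of it
# (approximately 0.729, 0.81, 0.9, 1.0)
print(discounted_returns(rewards, GAMMA))
```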
In order to maximize the long-term outcome, the agent should know which steps are profitable, that is, which lead to a positive outcome and which lead to a negative one.
To solve this kind of problem, backpropagation in artificial neural networks (ANNs) can be one of the techniques. The drawback of this approach is that time performance decreases as the number of artificial neurons grows for a given convergence rate.
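As a minimal sketch of backpropagation, here is a tiny fully connected network (2 inputs, 2 hidden units, 1 output) trained on XOR in pure Python. The architecture, learning rate, and task are illustrative assumptions, not something the text prescribes:

```python
import math
import random

random.seed(1)
sig = lambda z: 1.0 / (1.0 + math.exp(-z))

# weights and biases for a 2-2-1 network
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
b1 = [0.0, 0.0]
W2 = [random.uniform(-1, 1), random.uniform(-1, 1)]
b2 = 0.0

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def forward(x1, x2):
    """Forward pass: hidden activations and output."""
    h = [sig(W1[j][0] * x1 + W1[j][1] * x2 + b1[j]) for j in range(2)]
    y = sig(W2[0] * h[0] + W2[1] * h[1] + b2)
    return h, y

def loss():
    return sum((forward(x1, x2)[1] - t) ** 2 for (x1, x2), t in data)

initial = loss()
lr = 2.0
for _ in range(5000):
    for (x1, x2), t in data:
        h, y = forward(x1, x2)
        # backward pass: propagate the error from output to hidden layer
        d_y = (y - t) * y * (1 - y)
        d_h = [d_y * W2[j] * h[j] * (1 - h[j]) for j in range(2)]
        # gradient-descent updates
        for j in range(2):
            W2[j] -= lr * d_y * h[j]
            W1[j][0] -= lr * d_h[j] * x1
            W1[j][1] -= lr * d_h[j] * x2
            b1[j] -= lr * d_h[j]
        b2 -= lr * d_y

# the squared error should shrink substantially from its initial value
print(initial, loss())
```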
One of the applications of reinforcement learning is the recommender system, used by Google for YouTube videos, Google Maps searches, Google Search, and so on. Earlier recommendation systems were developed with the help of supervised learning. This brings limitations such as bias (rewards only for items already seen) and myopic recommendation (showing only catchy and familiar videos). So, to minimize these kinds of limitations, recommendation systems are developed using reinforcement learning, which trains itself over time on the basis of feedback from the environment.
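The idea can be sketched as a multi-armed bandit: each candidate item is an arm, recommendations follow an epsilon-greedy policy, and the estimated value of an item is updated from user feedback (a click counts as reward 1). The item names and click probabilities below are simulated assumptions, not data from any real recommender:

```python
import random

random.seed(42)
items = ["video_a", "video_b", "video_c"]
true_click_prob = {"video_a": 0.1, "video_b": 0.5, "video_c": 0.8}  # simulated users

est = {i: 0.0 for i in items}     # estimated value of recommending each item
count = {i: 0 for i in items}
EPS = 0.1

for _ in range(5000):
    if random.random() < EPS:
        item = random.choice(items)        # explore: counters the seen-only bias
    else:
        item = max(items, key=est.get)     # exploit current estimates
    # environment feedback: did the (simulated) user click?
    reward = 1.0 if random.random() < true_click_prob[item] else 0.0
    count[item] += 1
    est[item] += (reward - est[item]) / count[item]   # incremental mean

# the policy learns to favour the item users actually click most
print(max(items, key=est.get))
```

Exploration is what addresses the bias limitation mentioned above: a purely supervised system only gets signal for items it has already shown, whereas the epsilon-greedy policy keeps sampling unseen items.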