A Brief Introduction to Machine Learning Tree
Artificial intelligence (AI) is a branch of computer science that aims to build intelligent machines: machines that are trained so that they can make decisions on their own. AI has many branches, but here we narrow the focus to Machine Learning (ML). Machine Learning is a branch of AI in which machines (computers) are not explicitly programmed for a task; instead, they learn from data. Machine Learning is divided into three parts: supervised learning, unsupervised learning and reinforcement learning. Before going into more depth, we introduce some closely related concepts.
The image shows five objects: A, B, C, D and E. Each object has some features, such as Height, Weight and Age. The horizontal rows represent the instances, e.g. the one highlighted in green. The vertical columns represent the features of the objects; height, for example, is one feature.
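To make the rows-and-columns picture concrete, here is a minimal sketch of such a table in Python. The feature names, units and all numeric values are hypothetical placeholders chosen for illustration; they are not taken from the image.

```python
# Hypothetical values for illustration only (not taken from the image).
FEATURES = ["height_cm", "weight_kg", "age_years"]  # the columns

# Each entry is one instance (one row of the table).
dataset = {
    "A": [170, 65, 30],
    "B": [160, 55, 25],
    "C": [180, 80, 41],
    "D": [175, 72, 36],
    "E": [165, 60, 28],
}


def instance(name):
    """Return one row (instance) as a feature-name -> value mapping."""
    return dict(zip(FEATURES, dataset[name]))


def feature_column(feature):
    """Return one column (feature) across all instances."""
    idx = FEATURES.index(feature)
    return {name: row[idx] for name, row in dataset.items()}


row_a = instance("A")                 # one instance with all its features
heights = feature_column("height_cm")  # one feature for all instances
```

Reading along a row gives everything known about one object, while reading down a column gives one feature for every object.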
Machine Learning methods learn from examples. In supervised machine learning, a model is trained on a labeled dataset. A dataset can be defined as a collection of instances with similar features. Data can be audio, video or text, and it can be structured or unstructured. A dataset is divided into two main parts: training data and testing data. Commonly, about 80% of the dataset is used for training and the remaining 20% for testing or validation. The training dataset is the one we feed to our machine learning method (model) to train it. The testing dataset is used only to measure the accuracy of the model, never to train it.
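The 80/20 split described above can be sketched in a few lines. This is a minimal illustration, assuming the dataset fits in memory as a list; the helper name `split_dataset` and the fixed seed are our own choices, not part of any particular library.

```python
import random


def split_dataset(instances, train_fraction=0.8, seed=0):
    """Shuffle a copy of the instances and split it into (train, test)."""
    shuffled = list(instances)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


instances = list(range(10))           # stand-in for ten real instances
train, test = split_dataset(instances)
```

Shuffling before cutting matters when the data is stored in some order (for example by date or by category), because otherwise the testing part may not resemble the training part.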
ML is based entirely on data, so the first step is data gathering. If a suitable dataset already exists, your work becomes much easier. If not, you need to gather the dataset yourself. The dataset can consist of video, audio or text. Before training the model, you need to decide where your data will be stored and from where it will reach the model. Data collection is the first and most important step. While collecting data, think about the different conditions the model may face in reality, and cover as many of them as you can so that the accuracy of your model improves.
After collecting the required data, you first need to clean it. This means removing dirty data, that is, data that is irrelevant or that can decrease the accuracy of your model. Once the data is completely cleaned, you should label it. Labeling means assigning a category to each object.
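As a rough sketch of these two steps, the snippet below drops records with missing values or duplicated content and then assigns a category to each remaining record. The records and the age-based labeling rule are hypothetical; in practice labels often come from human annotators rather than from a rule.

```python
def clean(records):
    """Drop records that have missing values or repeat an earlier record."""
    seen = set()
    cleaned = []
    for rec in records:
        if any(value is None for value in rec.values()):
            continue  # missing value: treat as dirty data
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue  # exact duplicate of an earlier record
        seen.add(key)
        cleaned.append(rec)
    return cleaned


def label(record):
    """Hypothetical labeling rule: derive a category from the age feature."""
    return "adult" if record["age"] >= 18 else "minor"


raw = [
    {"height": 170, "age": 30},
    {"height": 170, "age": 30},   # duplicate
    {"height": None, "age": 12},  # missing value
    {"height": 140, "age": 12},
]
labeled = [(rec, label(rec)) for rec in clean(raw)]
```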
Amazon SageMaker Ground Truth is one of the ML services provided by AWS. It is a data labeling service used to build highly accurate training datasets.
Build, Train, and Test
After labeling the cleaned dataset, the next step is to build the machine learning algorithm. While building your algorithm, you need to watch out for overfitting and underfitting. Overfitting means the model performs very well on the particular dataset it was trained on but poorly on general data. Underfitting, on the other hand, means that the model is unable to capture the structure of the data; in other words, the model does not fit the data. Your model should be the best fit, which means it should perform well on the particular dataset as well as on general data.
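One common way to spot these two problems is to compare accuracy on the training data with accuracy on data the model has not seen. The sketch below encodes that rule of thumb; the function name `diagnose_fit` and the thresholds `good` and `max_gap` are hypothetical values picked for illustration, not standard constants.

```python
def diagnose_fit(train_accuracy, test_accuracy, good=0.8, max_gap=0.1):
    """Rule-of-thumb diagnosis from training and testing accuracy."""
    if train_accuracy < good:
        # The model cannot even fit the data it was trained on.
        return "underfitting"
    if train_accuracy - test_accuracy > max_gap:
        # Strong on the training data, much weaker on general data.
        return "overfitting"
    return "good fit"


overfit_case = diagnose_fit(0.99, 0.70)
underfit_case = diagnose_fit(0.55, 0.53)
balanced_case = diagnose_fit(0.92, 0.90)
```

The right thresholds depend on the problem, so treat the numbers as a starting point rather than a rule.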
The second step is to train the algorithm you built on the dataset you gathered. Training means feeding the dataset to the algorithm. The algorithm captures the structure of the input dataset and adjusts itself accordingly. Once the algorithm has been trained on specific data, it is called the model.
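To illustrate the difference between the algorithm and the model, here is a minimal sketch using ordinary least squares for a straight line, chosen only because it is short. `train_linear` plays the role of the algorithm, and the function it returns is the trained model. The data points are made up for the example.

```python
def train_linear(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    def model(x):
        # The trained model: the learned parameters are fixed from here on.
        return slope * x + intercept

    return model


xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]            # made-up data following y = 2x + 1
model = train_linear(xs, ys)
prediction = model(5)         # the model predicts for an unseen input
```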
As discussed earlier, 80% of the overall dataset is used for training and the remaining 20% for testing. The performance of the model is measured on this testing set. Based on the results of this test, the model is further improved to make it more accurate and efficient.
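Testing can be as simple as counting how many held-out examples the model gets right. In the sketch below, `threshold_model` is a hypothetical stand-in for a trained classifier, and the test set is made up for illustration.

```python
def accuracy(model, test_set):
    """Fraction of held-out examples the model predicts correctly."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)


def threshold_model(height_cm):
    # Hypothetical stand-in for a model produced by training.
    return "tall" if height_cm >= 175 else "short"


# Made-up testing data: (height in cm, true label).
test_set = [
    (180, "tall"),
    (160, "short"),
    (176, "short"),  # the model gets this one wrong
    (190, "tall"),
    (150, "short"),
]
score = accuracy(threshold_model, test_set)  # 4 of 5 correct
```

A wrong prediction such as the third example is exactly the kind of result that feeds back into improving the model.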
Amazon provides a tool for this workflow named SageMaker. It enables developers to create, train and deploy models easily.
Supervised learning is the branch in which the model (machine learning algorithm) is trained on labeled data. Labeled data means that each input is paired with its known output. Once trained on this labeled data, the model can predict the output for a new input. Regression problems and classification problems fall under this umbrella.
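As a small classification example, the sketch below uses a one-nearest-neighbor rule: a new input receives the label of the most similar labeled example. This is just one simple supervised method chosen for brevity, and the labeled examples are made up.

```python
def nearest_neighbor_classifier(labeled_examples):
    """Build a classifier that copies the label of the closest example."""

    def predict(x):
        _, nearest_label = min(
            labeled_examples, key=lambda example: abs(example[0] - x)
        )
        return nearest_label

    return predict


# Made-up labeled data: (height in cm, label).
labeled_examples = [(150, "short"), (155, "short"), (180, "tall"), (185, "tall")]
predict = nearest_neighbor_classifier(labeled_examples)

label_a = predict(152)
label_b = predict(178)
```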
Unsupervised learning, on the other hand, is the branch in which the model learns from unlabeled data. Cluster analysis, hidden Markov models and association analysis are among the methods discussed under this branch of ML.
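Cluster analysis can be illustrated with a bare-bones k-means on one-dimensional points. This is a sketch under simplifying assumptions: the starting centroids are supplied by hand, the number of iterations is fixed, and the points are made up. Note that no labels are given anywhere.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on 1-D points, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # unlabeled, made-up data
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
```

The algorithm discovers the two groups on its own, which is the sense in which the learning is unsupervised.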
The third branch of Machine Learning is reinforcement learning. In reinforcement learning, the model learns from its mistakes. The model performs an action based on the current state, and the environment responds to this action with a reward. A reward can be positive or negative; a negative reward is also known as a penalty. If the reward is positive, the model learns to repeat the action; if the reward is negative, the model learns to avoid it.
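The reward loop can be sketched with a tiny agent that keeps an estimated value for each action and nudges it toward the reward it receives. Everything here is a simplifying assumption: the environment has a single state, its rewards are fixed, and the agent is purely greedy with no random exploration, which real reinforcement learning methods usually add.

```python
def environment(action):
    """Hypothetical environment: 'right' is rewarded, 'left' is penalized."""
    return 1.0 if action == "right" else -1.0


def run_agent(steps=20, learning_rate=0.5):
    values = {"left": 0.0, "right": 0.0}  # estimated reward per action
    for _ in range(steps):
        # Greedy choice; on a tie, max() returns the first key ("left").
        action = max(values, key=values.get)
        reward = environment(action)
        # Move the estimate part of the way toward the observed reward.
        values[action] += learning_rate * (reward - values[action])
    return values


values = run_agent()
best_action = max(values, key=values.get)
```

The agent tries "left" once, receives a penalty, and from then on prefers "right", whose estimated value climbs toward the positive reward.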