The post Splash of Color: Instance Segmentation with Mask R-CNN and TensorFlow first appeared in Shahab Magazine (مجله شهاب).
Back in November, we open-sourced our implementation of Mask R-CNN, and since then it’s been forked 1400 times, used in a lot of projects, and improved upon by many generous contributors. We received a lot of questions as well, so in this post I’ll explain how the model works and show how to use it in a real application.
I’ll cover two things: first, an overview of Mask R-CNN, and second, how to train a model from scratch and use it to build a smart color splash filter.
Code Tip:
We’re sharing the code here, including the dataset I built and the trained model. Follow along!
Instance segmentation is the task of identifying object outlines at the pixel level. Compared to similar computer vision tasks, it’s one of the hardest. Consider the following tasks:
This is a standard convolutional neural network (typically, ResNet50 or ResNet101) that serves as a feature extractor. The early layers detect low level features (edges and corners), and later layers successively detect higher level features (car, person, sky).
Passing through the backbone network, the image is converted from 1024x1024px x 3 (RGB) to a feature map of shape 32x32x2048. This feature map becomes the input for the following stages.
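To make the shape arithmetic concrete, here is a minimal sketch. It assumes the last backbone stage downsamples the input by a total stride of 32 and outputs 2048 channels, which matches the figures quoted above; the helper name `backbone_output_shape` is made up for illustration and is not part of the repository.

```python
def backbone_output_shape(height, width, stride=32, channels=2048):
    """Shape of the final backbone feature map for a given input size.

    stride and channels are assumptions that match the numbers in the text.
    """
    return (height // stride, width // stride, channels)


print(backbone_output_shape(1024, 1024))  # (32, 32, 2048)
```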
Code Tip:
The backbone is built in the function resnet_graph(). The code supports ResNet50 and ResNet101.
While the backbone described above works great, it can be improved upon. The Feature Pyramid Network (FPN) was introduced by the same authors as Mask R-CNN as an extension that can better represent objects at multiple scales.
FPN improves the standard feature extraction pyramid by adding a second pyramid that takes the high level features from the first pyramid and passes them down to lower layers. By doing so, it allows features at every level to have access to both lower and higher level features.
Our implementation of Mask R-CNN uses a ResNet101 + FPN backbone.
Code Tip:
The FPN is created in MaskRCNN.build(), in the section after building the ResNet.
FPN introduces additional complexity: rather than a single backbone feature map in the standard backbone (i.e. the top layer of the first pyramid), in FPN there is a feature map at each level of the second pyramid. We pick which to use dynamically depending on the size of the object. I’ll continue to refer to the backbone feature map as if it’s one feature map, but keep in mind that when using FPN, we’re actually picking one out of several at runtime.
The RPN is a lightweight neural network that scans the image in a sliding-window fashion and finds areas that contain objects.
The regions that the RPN scans over are called anchors, which are boxes distributed over the image area, as shown on the left. This is a simplified view, though. In practice, there are about 200K anchors of different sizes and aspect ratios, and they overlap to cover as much of the image as possible.
How fast can the RPN scan that many anchors? Pretty fast, actually. The sliding window is handled by the convolutional nature of the RPN, which allows it to scan all regions in parallel (on a GPU). Further, the RPN doesn’t scan over the image directly (even though we draw the anchors on the image for illustration). Instead, the RPN scans over the backbone feature map. This allows the RPN to reuse the extracted features efficiently and avoid duplicate calculations. With these optimizations, the RPN runs in about 10 ms according to the Faster R-CNN paper that introduced it. In Mask R-CNN we typically use larger images and more anchors, so it might take a bit longer.
Code Tip:
The RPN is created in rpn_graph(). Anchor scales and aspect ratios are controlled by RPN_ANCHOR_SCALES and RPN_ANCHOR_RATIOS in config.py.
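As a rough illustration of how anchor scales and aspect ratios turn into boxes, here is a minimal sketch. It is not the repository’s anchor code: the function name `generate_anchors`, the cell-centered placement, and the example values for scale, ratios, and stride are all assumptions chosen to keep the example small.

```python
import math


def generate_anchors(scale, ratios, feature_size, stride):
    """Return [y1, x1, y2, x2] anchors centered on every feature map cell.

    scale: side length in pixels of the square (ratio 1) anchor.
    ratios: width/height aspect ratios to generate at each location.
    feature_size: height and width of the (square) feature map.
    stride: number of image pixels covered by one feature map cell.
    """
    anchors = []
    for row in range(feature_size):
        for col in range(feature_size):
            # Center of this cell in image coordinates.
            cy = (row + 0.5) * stride
            cx = (col + 0.5) * stride
            for ratio in ratios:
                # Keep the area at scale**2 while changing the shape.
                height = scale / math.sqrt(ratio)
                width = scale * math.sqrt(ratio)
                anchors.append([cy - height / 2, cx - width / 2,
                                cy + height / 2, cx + width / 2])
    return anchors


anchors = generate_anchors(scale=32, ratios=(0.5, 1, 2), feature_size=4, stride=16)
print(len(anchors))  # 4 * 4 locations * 3 ratios = 48
```

Note that anchors near the border can extend past the image, which is why some coordinates come out negative.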
The RPN generates two outputs for each anchor:
Using the RPN predictions, we pick the top anchors that are likely to contain objects and refine their location and size. If several anchors overlap too much, we keep the one with the highest foreground score and discard the rest (referred to as Non-max Suppression). After that we have the final proposals (regions of interest) that we pass to the next stage.
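The overlap filtering described above can be sketched in plain Python. This is a simplified greedy version for illustration only, not the code used in the ProposalLayer; the function names and the 0.5 threshold in the example are assumptions.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (y1, x1, y2, x2)."""
    y1, x1 = max(a[0], b[0]), max(a[1], b[1])
    y2, x2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, y2 - y1) * max(0, x2 - x1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold):
    """Return indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores, iou_threshold=0.5))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, so it is discarded in favor of the higher-scoring one.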
Code Tip:
The ProposalLayer is a custom Keras layer that reads the output of the RPN, picks top anchors, and applies bounding box refinement.
This stage runs on the regions of interest (ROIs) proposed by the RPN. And just like the RPN, it generates two outputs for each ROI:
Code Tip:
The classifier and bounding box regressor are created in fpn_classifier_graph().
There is a bit of a problem to solve before we continue. Classifiers don’t handle variable input size very well. They typically require a fixed input size. But, due to the bounding box refinement step in the RPN, the ROI boxes can have different sizes. That’s where ROI Pooling comes into play.
ROI pooling refers to cropping a part of a feature map and resizing it to a fixed size. It’s similar in principle to cropping part of an image and then resizing it (but there are differences in implementation details).
The authors of Mask R-CNN suggest a method they named ROIAlign, in which they sample the feature map at different points and apply a bilinear interpolation. In our implementation, we used TensorFlow’s crop_and_resize function for simplicity and because it’s close enough for most purposes.
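To show the idea of cropping a region and resampling it to a fixed size with bilinear interpolation, here is a small dependency-free sketch. It is neither ROIAlign as defined in the paper nor TensorFlow’s crop_and_resize; it assumes a single-channel feature map stored as nested lists, box coordinates in feature-map units that lie inside the map, and an output size of at least 2.

```python
def bilinear_sample(fmap, y, x):
    """Sample a 2D feature map at a fractional (y, x) location."""
    h, w = len(fmap), len(fmap[0])
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighboring rows, then along y.
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def crop_and_resize(fmap, box, out_size):
    """Crop box = (y1, x1, y2, x2) and resample it to out_size x out_size."""
    y1, x1, y2, x2 = box
    out = []
    for i in range(out_size):
        # Spread the sample points evenly between the box edges.
        y = y1 + (y2 - y1) * i / (out_size - 1)
        row = []
        for j in range(out_size):
            x = x1 + (x2 - x1) * j / (out_size - 1)
            row.append(bilinear_sample(fmap, y, x))
        out.append(row)
    return out


# A 4x4 feature map whose value at (row, col) is 4 * row + col.
fmap = [[4 * r + c for c in range(4)] for r in range(4)]
pooled = crop_and_resize(fmap, box=(0.5, 0.5, 2.5, 2.5), out_size=3)
print(pooled[1][1])  # 7.5
```

Whatever the size of the box, the output is always out_size x out_size, which is what the classifier needs.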
Code Tip:
ROI pooling is implemented in the class PyramidROIAlign.
If you stop at the end of the last section then you have a Faster R-CNN framework for object detection. The mask network is the addition that the Mask R-CNN paper introduced.
The mask branch is a convolutional network that takes the positive regions selected by the ROI classifier and generates masks for them. The generated masks are low resolution: 28×28 pixels. But they are soft masks, represented by float numbers, so they hold more details than binary masks. The small mask size helps keep the mask branch light. During training, we scale down the ground-truth masks to 28×28 to compute the loss, and during inference we scale up the predicted masks to the size of the ROI bounding box, which gives us the final masks, one per object.
Code Tip:
The mask branch is in build_fpn_mask_graph().
Unlike most image editing apps that include this filter, our filter will be a bit smarter: it finds the objects automatically, which becomes even more useful if you want to apply it to videos rather than a single image.
Typically, I’d start by searching for public datasets that contain the objects I need. But in this case, I wanted to document the full cycle and show how to build a dataset from scratch.
I searched for balloon images on flickr, limiting the license type to “Commercial use & mods allowed”. This returned more than enough images for my needs. I picked a total of 75 images and divided them into a training set and a validation set. Finding images is easy. Annotating them is the hard part.
Wait! Don’t we need, like, a million images to train a deep learning model? Sometimes you do, but often you don’t. I’m relying on two main points to reduce my training requirements significantly:
First, transfer learning, which simply means that, instead of training a model from scratch, I start with a weights file that’s been trained on the COCO dataset (we provide that in the GitHub repo). Although the COCO dataset does not contain a balloon class, it contains a lot of other images (~120K), so the trained weights have already learned a lot of the features common in natural images, which really helps. And, second, given the simple use case here, I’m not demanding high accuracy from this model, so the tiny dataset should suffice.
There are a lot of tools to annotate images. I ended up using VIA (VGG Image Annotator) because of its simplicity. It’s a single HTML file that you download and open in a browser. Annotating the first few images was very slow, but once I got used to the user interface, I was annotating at around an object a minute.
If you don’t like the VIA tool, here is a list of the other tools I tested:
There isn’t a universally accepted format to store segmentation masks. Some datasets save them as PNG images, others store them as polygon points, and so on. To handle all these cases, our implementation provides a Dataset class that you inherit from and then override a few functions to read your data in whichever format it happens to be.
The VIA tool saves the annotations in a JSON file, and each mask is a set of polygon points. I didn’t find documentation for the format, but it’s pretty easy to figure out by looking at the generated JSON. I included comments in the code to explain how the parsing is done.
Code Tip:
An easy way to write code for a new dataset is to copy coco.py and modify it to your needs, which is what I did. I saved the new file as balloons.py.
My BalloonDataset class looks like this:
class BalloonDataset(utils.Dataset):

    def load_balloons(self, dataset_dir, subset):
        ...

    def load_mask(self, image_id):
        ...

    def image_reference(self, image_id):
        ...
load_balloons reads the JSON file, extracts the annotations, and iteratively calls the internal add_class and add_image functions to build the dataset.
load_mask generates bitmap masks for every object in the image by drawing the polygons.
image_reference simply returns a string that identifies the image for debugging purposes. Here it simply returns the path of the image file.
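To give a feel for what “drawing the polygons” into a bitmap mask involves, here is a dependency-free sketch using the even-odd ray casting rule on pixel centers. It is only an illustration under those assumptions: the function names are hypothetical, and the real load_mask may rely on an image library rather than code like this.

```python
def point_in_polygon(y, x, ys, xs):
    """Even-odd rule: count polygon edges crossed by a ray going right from (y, x)."""
    inside = False
    j = len(ys) - 1
    for i in range(len(ys)):
        # Only edges that straddle the horizontal line through y can be crossed.
        if (ys[i] > y) != (ys[j] > y):
            x_cross = xs[i] + (y - ys[i]) * (xs[j] - xs[i]) / (ys[j] - ys[i])
            if x < x_cross:
                inside = not inside
        j = i
    return inside


def polygon_to_mask(height, width, ys, xs):
    """Boolean mask (nested lists) that is True where a pixel center lies in the polygon."""
    return [[point_in_polygon(r + 0.5, c + 0.5, ys, xs) for c in range(width)]
            for r in range(height)]


# A square with corners (1, 1), (1, 4), (4, 4), (4, 1) in (y, x) order.
mask = polygon_to_mask(5, 5, ys=[1, 1, 4, 4], xs=[1, 4, 4, 1])
print(sum(sum(row) for row in mask))  # 9 pixels, a 3x3 block
```

For an image with several objects, one such mask per polygon can be stacked to form the per-instance mask array.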
You might have noticed that my class doesn’t contain functions to load images or return bounding boxes. The default load_image function in the base Dataset class handles loading images. And, bounding boxes are generated dynamically from the masks.
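Deriving a bounding box from a mask is simple enough to sketch in a few lines. This is an illustrative version on nested lists, not the repository’s implementation; the function name `mask_to_bbox` and the (y1, x1, y2, x2) convention with exclusive lower-right corner are assumptions.

```python
def mask_to_bbox(mask):
    """Tightest box (y1, x1, y2, x2) around the truthy pixels of a 2D mask.

    y2 and x2 are exclusive. An empty mask yields (0, 0, 0, 0).
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return (0, 0, 0, 0)
    return (rows[0], cols[0], rows[-1] + 1, cols[-1] + 1)


mask = [[0] * 6 for _ in range(5)]
for r in (1, 2):
    for c in (2, 3, 4):
        mask[r][c] = 1
print(mask_to_bbox(mask))  # (1, 2, 3, 5)
```

Computing boxes from masks on the fly means the dataset only has to store one kind of annotation.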
Code Tip:
Your dataset might not be in JSON. My BalloonDataset class reads JSON because that’s what the VIA tool generates. Don’t convert your dataset to a format similar to COCO or the VIA format. Instead, write your own Dataset class to load whichever format your dataset comes in. See the samples and notice how each uses its own Dataset class.
To verify that my new code is implemented correctly I added this Jupyter notebook. It loads the dataset, visualizes masks and bounding boxes, and visualizes the anchors to verify that my anchor sizes are a good fit for my object sizes. Here is an example of what you should expect to see:
Code Tip:
To create this notebook I copied inspect_data.ipynb, which we wrote for the COCO dataset, and modified one block of code at the top to load the Balloons dataset instead.
The configurations for this project are similar to the base configuration used to train the COCO dataset, so I just needed to override 3 values. As I did with the Dataset class, I inherit from the base Config class and add my overrides:
class BalloonConfig(Config):
    # Give the configuration a recognizable name
    NAME = "balloons"

    # Number of classes (including background)
    NUM_CLASSES = 1 + 1  # Background + balloon

    # Number of training steps per epoch
    STEPS_PER_EPOCH = 100
The base configuration uses input images of size 1024×1024 px for best accuracy. I kept it that way. My images are a bit smaller, but the model resizes them automatically.
Code Tip:
The base Config class is in config.py. And BalloonConfig is in balloons.py.
Mask R-CNN is a fairly large model, especially since our implementation uses ResNet101 and FPN. So you need a modern GPU with 12GB of memory. It might work on less, but I haven’t tried. I used Amazon’s P2 instances to train this model, and given the small dataset, training takes less than an hour.
Start the training with this command, running from the balloon directory. Here, we’re specifying that training should start from the pre-trained COCO weights. The code will download the weights from our repository automatically:
python3 balloon.py train --dataset=/path/to/dataset --model=coco
And to resume training if it stopped:
python3 balloon.py train --dataset=/path/to/dataset --model=last
Code Tip:
In addition to balloons.py, the repository has three more examples: train_shapes.ipynb which trains a toy model to detect geometric shapes, coco.py which trains on the COCO dataset, and nucleus which segments nuclei in microscopy images.
The inspect_balloon_model notebook shows the results generated by the trained model. Check the notebook for more visualizations and a step by step walk through the detection pipeline.
Code Tip:
This notebook is a simplified version of inspect_model.ipynb, which includes visualizations and debugging code for the COCO dataset.
Finally, now that we have object masks, let’s use them to apply the color splash effect. The method is really simple: create a grayscale version of the image, and then, in areas marked by the object mask, copy back the color pixels from original image. Here is an example:
Code Tip:
The code that applies the effect is in the color_splash() function. And detect_and_color_splash() handles the whole process from loading the image, running instance segmentation, and applying the color splash filter.
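The effect itself fits in a few lines of NumPy. The sketch below follows the description above but is not the repository’s color_splash(): it assumes an (H, W, 3) uint8 image, an (H, W, N) boolean mask array with one channel per detected object, and a plain channel average as the grayscale conversion.

```python
import numpy as np


def color_splash(image, masks):
    """Keep color inside the object masks and turn everything else gray.

    image: (H, W, 3) uint8 array.
    masks: (H, W, N) boolean array, one channel per object (assumed layout).
    """
    # Grayscale copy (simple channel average), repeated to 3 channels.
    gray = image.mean(axis=2, keepdims=True).astype(image.dtype)
    gray = np.repeat(gray, 3, axis=2)
    # Collapse the per-object masks into a single foreground mask.
    foreground = masks.any(axis=2, keepdims=True)
    # Copy color pixels back wherever an object was found.
    return np.where(foreground, image, gray)


image = np.zeros((2, 2, 3), dtype=np.uint8)
image[0, 0] = [255, 0, 0]
image[1, 1] = [30, 60, 90]
masks = np.zeros((2, 2, 1), dtype=bool)
masks[0, 0, 0] = True
print(color_splash(image, masks)[1, 1])  # [60 60 60]
```

If no object is detected, the mask array has zero channels and the whole image simply comes back gray.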
The post Computer Vision Tutorial: Implementing Mask R-CNN for Image Segmentation + Python Code first appeared in Shahab Magazine (مجله شهاب).
I am fascinated by self-driving cars. The sheer complexity and mix of different computer vision techniques that go into building a self-driving car system is a dream for a data scientist like me.
So, I set about trying to understand the computer vision technique behind how a self-driving car potentially detects objects. A simple object detection framework might not work because it simply detects an object and draws a fixed shape around it.
That’s a risky proposition in a real-world scenario. Imagine if there’s a sharp turn in the road ahead and our system draws a rectangular box around the road. The car might not be able to understand whether to turn or go straight. That’s a potential disaster!
Instead, we need a technique that can detect the exact shape of the road so our self-driving car system can safely navigate the sharp turns as well.
The latest state-of-the-art framework that we can use to build such a system? That’s Mask R-CNN!
So, in this article, we will first quickly look at what image segmentation is. Then we’ll look at the core of this article – the Mask R-CNN framework. Finally, we will dive into implementing our own Mask R-CNN model in Python. Let’s begin!
We covered the concept of image segmentation in a lot of detail in part 1 of this series. We discussed what image segmentation is and its different techniques, like region-based segmentation, edge detection segmentation, and segmentation based on clustering.
I would recommend checking out that article first if you need a quick refresher (or want to learn image segmentation from scratch).
I’ll quickly recap that article here. Image segmentation creates a pixel-wise mask for each object in the image. This technique gives us a far more granular understanding of the object(s) in the image. The image shown below will help you to understand what image segmentation is:
Here, you can see that each object (which are the cells in this particular image) has been segmented. This is how image segmentation works.
We also discussed the two types of image segmentation: Semantic Segmentation and Instance Segmentation. Again, let’s take an example to understand both of these types:
All 5 objects in the left image are people. Hence, semantic segmentation will classify all the people as a single instance. Now, the image on the right also has 5 objects (all of them are people). But here, different objects of the same class have been assigned as different instances. This is an example of instance segmentation.
Part one covered different techniques and their implementation in Python to solve such image segmentation problems. In this article, we will be implementing a state-of-the-art image segmentation technique called Mask R-CNN to solve an instance segmentation problem.
Mask R-CNN is basically an extension of Faster R-CNN. Faster R-CNN is widely used for object detection tasks. For a given image, it returns the class label and bounding box coordinates for each object in the image. So, let’s say you pass the following image:
The Faster R-CNN model will return something like this:
The Mask R-CNN framework is built on top of Faster R-CNN. So, for a given image, Mask R-CNN, in addition to the class label and bounding box coordinates for each object, will also return the object mask.
Let’s first quickly understand how Faster R-CNN works. This will help us grasp the intuition behind Mask R-CNN as well.
Once you understand how Faster R-CNN works, understanding Mask R-CNN will be very easy. So, let’s understand it step-by-step starting from the input to predicting the class label, bounding box, and object mask.
Similar to the ConvNet that we use in Faster R-CNN to extract feature maps from the image, we use the ResNet 101 architecture to extract features from the images in Mask R-CNN. So, the first step is to take an image and extract features using the ResNet 101 architecture. These features act as an input for the next layer.
Now, we take the feature maps obtained in the previous step and apply a region proposal network (RPN). This basically predicts if an object is present in that region (or not). In this step, we get those regions or feature maps which the model predicts contain some object.
The regions obtained from the RPN might be of different shapes, right? Hence, we apply a pooling layer and convert all the regions to the same shape. Next, these regions are passed through a fully connected network so that the class label and bounding boxes are predicted.
Up to this point, the steps are nearly identical to how Faster R-CNN works. Now comes the difference between the two frameworks. In addition to this, Mask R-CNN also generates the segmentation mask.
For that, we first compute the region of interest so that the computation time can be reduced. For all the predicted regions, we compute the Intersection over Union (IoU) with the ground truth boxes. We can compute IoU like this:
IoU = Area of the intersection / Area of the union
Now, we consider a region to be a region of interest only if its IoU is greater than or equal to 0.5. Otherwise, we neglect that particular region. We do this for all the regions and keep only those whose IoU is at least 0.5.
Let’s understand it using an example. Consider this image:
Here, the red box is the ground truth box for this image. Now, let’s say we got 4 regions from the RPN as shown below:
Here, the IoU of Box 1 and Box 2 is probably less than 0.5, whereas the IoU of Box 3 and Box 4 appears to be greater than 0.5. Hence, we can say that Box 3 and Box 4 are the regions of interest for this particular image, whereas Box 1 and Box 2 will be neglected.
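The IoU formula and the 0.5 cut-off can be written out in a few lines of Python. This is a minimal sketch for illustration: the (x1, y1, x2, y2) box format, the function names, and the example boxes are assumptions, not values taken from the figures above.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height are clamped at 0 when the boxes do not overlap.
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def select_rois(proposals, ground_truth, threshold=0.5):
    """Keep only the proposals whose IoU with the ground truth reaches the threshold."""
    return [box for box in proposals if iou(box, ground_truth) >= threshold]


ground_truth = (0, 0, 10, 10)
proposals = [(5, 0, 15, 10), (2, 0, 12, 10), (20, 20, 30, 30)]
print(select_rois(proposals, ground_truth))  # [(2, 0, 12, 10)]
```

The first proposal covers only half of the ground truth box (IoU of 1/3), the second overlaps most of it (IoU of 2/3), and the third does not overlap at all.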
Next, let’s see the final step of Mask R-CNN.
Once we have the RoIs based on the IoU values, we can add a mask branch to the existing architecture. This returns the segmentation mask for each region that contains an object. It returns a mask of size 28×28 for each region, which is then scaled up for inference.
Again, let’s understand this visually. Consider the following image:
The segmentation mask for this image would look something like this:
Here, our model has segmented all the objects in the image. This is the final step in Mask R-CNN where we predict the masks for all the objects in the image.
Keep in mind that the training time for Mask R-CNN is quite high. It took me somewhere around 1 to 2 days to train the Mask R-CNN on the famous COCO dataset. So, for the scope of this article, we will not be training our own Mask R-CNN model.
We will instead use the pretrained weights of the Mask R-CNN model trained on the COCO dataset. Now, before we dive into the Python code, let’s look at the steps to use the Mask R-CNN model to perform instance segmentation.
It’s time to perform some image segmentation tasks! We will be using the Mask R-CNN framework created by the data scientists and researchers at Facebook AI Research (FAIR).
Let’s have a look at the steps which we will follow to perform image segmentation using Mask R-CNN.
First, we will clone the Mask R-CNN repository, which has the architecture for Mask R-CNN. Use the following command to clone the repository:
git clone https://github.com/matterport/Mask_RCNN.git
Once this is done, we need to install the dependencies required by Mask R-CNN.
Here is a list of all the dependencies for Mask R-CNN:
You must install all these dependencies before using the Mask R-CNN framework.
Next, we need to download the pretrained weights. You can use this link to download the pre-trained weights. These weights are obtained from a model that was trained on the MS COCO dataset. Once you have downloaded the weights, paste this file in the samples folder of the Mask_RCNN repository that we cloned in step 1.
Finally, we will use the Mask R-CNN architecture and the pretrained weights to generate predictions for our own images.
Once you’re done with these four steps, it’s time to jump into your Jupyter Notebook! We will implement all these things in Python and then generate the masks along with the classes and bounding boxes for objects in our images.
So, are you ready to dive into Python and code your own image segmentation model? Let’s begin!
To execute all the code blocks which I will be covering in this section, create a new Python notebook inside the “samples” folder of the cloned Mask_RCNN repository.
Let’s start by importing the required libraries:
Next, we will define the path for the pretrained weights and the images on which we would like to perform segmentation:
If you have not placed the weights in the samples folder, this will download the weights again. Now we will create an inference class which will be used to run inference with the Mask R-CNN model:
What can you infer from the above summary? We can see the multiple specifications of the Mask R-CNN model that we will be using.
So, the backbone is resnet101, as we discussed earlier. The mask shape that will be returned by the model is 28×28, as it is trained on the COCO dataset. And we have a total of 81 classes (including the background).
We can also see various other statistics, like:
You should spend a few moments and understand these specifications. If you have any doubts regarding these specifications, feel free to ask me in the comments section below.
Next, we will create our model and load the pretrained weights which we downloaded earlier. Make sure that the pretrained weights are in the same folder as the notebook; otherwise, you have to give the location of the weights file:
Now, we will define the classes of the COCO dataset which will help us in the prediction phase:
Let’s load an image and try to see how the model performs. You can use any of your images to test the model.
This is the image we will work with. You can clearly identify that there are a couple of cars (one in the front and one in the back) along with a bicycle.
It’s prediction time! We will use the Mask R-CNN model along with the pretrained weights and see how well it segments the objects in the image. We will first take the predictions from the model and then plot the results to visualize them:
Interesting. The model has done pretty well to segment both the cars as well as the bicycle in the image. We can look at each mask or the segmented objects separately as well. Let’s see how we can do that.
I will first take all the masks predicted by our model and store them in the mask variable. Now, these masks are in the boolean form (True and False) and hence we need to convert them to numbers (1 and 0). Let’s do that first:
Output:
(480, 640, 3)
This will give us an array of 0s and 1s, where 0 means that there is no object at that particular pixel and 1 means that there is an object at that pixel. Note that the shape of the mask is similar to that of the original image (you can verify that by printing the shape of the original image).
However, the 3 here in the shape of the mask does not represent the channels. Instead, it represents the number of objects segmented by our model. Since the model has identified 3 objects in the above sample image, the shape of the mask is (480, 640, 3). Had there been 5 objects, this shape would have been (480, 640, 5).
We now have the original image and the array of masks. To print or get each segment from the image, we will create a for loop and multiply each mask with the original image to get each segment:
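Here is a minimal sketch of that loop on synthetic data. It assumes the image is an (H, W, 3) array and the masks are an (H, W, N) boolean array with one channel per object, as described above; the function name `split_segments` and the tiny example arrays are made up for illustration.

```python
import numpy as np


def split_segments(image, masks):
    """Return one image per object, with everything outside the object set to 0."""
    # Convert True/False to 1/0 in the image's own dtype.
    masks = masks.astype(image.dtype)
    segments = []
    for i in range(masks.shape[-1]):
        # Add a trailing axis so the (H, W) mask broadcasts over the color channels.
        segments.append(image * masks[:, :, i][:, :, np.newaxis])
    return segments


image = np.full((2, 2, 3), 7, dtype=np.uint8)
masks = np.zeros((2, 2, 2), dtype=bool)
masks[0, 0, 0] = True   # object 0 covers the top-left pixel
masks[1, 1, 1] = True   # object 1 covers the bottom-right pixel
segments = split_segments(image, masks)
print(len(segments))  # 2
```

Each returned array has the same shape as the original image, so it can be passed straight to a plotting function.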
This is how we can plot each mask or object from the image. This can have a lot of interesting as well as useful use cases. Getting the segments from the entire image can reduce the computation cost as we do not have to preprocess the entire image now, but only the segments.
Below are a few more results which I got using our Mask R-CNN model:
Looks awesome! You have just built your own image segmentation model using Mask R-CNN – well done.
I love working with this awesome Mask R-CNN framework. Perhaps I will now try to integrate that into a self-driving car system.
Image segmentation has a wide range of applications, ranging from the healthcare industry to the manufacturing industry. I would suggest you try this framework on different images and see how well it performs. Feel free to share your results with the community.
The post Weight Initialization in Deep Learning first appeared in Shahab Magazine (مجله شهاب).
Building a neural network is a tedious task, and tuning it to get better results is even more challenging. The first challenge that comes up while building a neural network is the initialization of weights: if the weights are initialized well, optimization converges in much less time; otherwise, converging to a minimum can be very slow or may not happen at all.
Let us have an overview of the whole neural network process and the reason why the initialization of weights impacts our model.
The whole neural network process can be explained in 4 steps:
Using the weights, inputs, and bias term, we multiply the weights by the inputs, sum the results, add the bias term, and pass the total to the activation function. This process continues through all the neurons until we finally get the predicted y_hat. This process is called forward propagation.
The difference between the predicted y_hat and the actual y is called the loss term. It captures how far our predictions are from the actual target. Our main objective is to minimize the loss function.
Here, we compute the gradients of the loss function with respect to the weights and update the weights accordingly. We keep updating the weights until we reach the minimum loss.
Steps 2–4 are repeated for n iterations until the loss is minimized.
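The loop of forward propagation, loss, gradient, and weight update can be sketched for a single neuron in plain Python. This is a toy illustration, not the MNIST model used later: the sigmoid activation, squared loss, learning rate, and starting weights are all assumptions chosen to keep the example short.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(inputs, target, weights, bias, lr=0.5, steps=100):
    """Train one sigmoid neuron on one example and return the loss history."""
    losses = []
    for _ in range(steps):
        # Forward propagation: weighted sum plus bias, then activation.
        z = sum(w * x for w, x in zip(weights, inputs)) + bias
        y_hat = sigmoid(z)
        # Loss: squared difference between prediction and target.
        losses.append((y_hat - target) ** 2)
        # Backpropagation: chain rule through the loss and the sigmoid.
        grad_z = 2 * (y_hat - target) * y_hat * (1 - y_hat)
        # Weight update: step against the gradient.
        weights = [w - lr * grad_z * x for w, x in zip(weights, inputs)]
        bias -= lr * grad_z
    return weights, bias, losses


weights, bias, losses = train_neuron([1.0, 2.0], target=1.0,
                                     weights=[0.1, -0.2], bias=0.0)
print(losses[0] > losses[-1])  # True: the loss shrinks as the weights are updated
```

Only the starting values of `weights` and `bias` are chosen by us; everything else in the loop is the same for any network, which is why initialization matters so much.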
Looking at the process above, we can see that steps 2, 3, and 4 work the same way for any network, i.e., we perform the same operations until we converge to the minimum loss. The one thing that makes a big difference to how fast any neural network converges is the right initialization of weights.
Now, let us see the different types of weight initialization. Before going into the topic, let me introduce some terminology.
Fan-in:
Fan-in is the number of inputs entering the neuron.
Fan-out:
Fan-out is the number of outputs going out from the neuron.
There are two inputs entering the neuron. Hence, fan-in = 2.
One output is going away from the neuron. Hence, fan-out = 1.
A uniform distribution is a type of probability distribution in which all outcomes are equally likely, i.e., every value in the range has the same probability of being drawn.
Normal distribution is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than the data far from the mean.
Now, let us dive deep into the different initialization techniques. From here on, we will take a practical approach, i.e., we will take the MNIST dataset, initialize the weights with different initialization techniques, and see what happens to the output.
The MNIST dataset is one of the most common datasets used for image classification. It contains images of handwritten digits, and we have to classify them into one of 10 classes (i.e., 0 – 9).
For simplicity, we will consider only a 2-layer neural network, i.e., a 1st hidden layer with 128 neurons and a 2nd hidden layer with 64 neurons, and we will use a softmax classifier to classify the outputs. Here, we will use ReLU as the activation unit. OK, let’s get started.
Weights are initialized with zero. Then all the neurons in every layer perform the same calculation and give the same output. The derivative of the loss function is the same for every weight, so the model won’t learn anything and the weights won’t get updated at all. Here, we are facing the vanishing gradients problem.
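from keras.models import Sequential
from keras.layers import Dense

input_dim = 784   # assumed: a 28x28 MNIST image, flattened
output_dim = 10   # digits 0-9

# Both hidden layers start with all-zero weights
model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(input_dim,), kernel_initializer='zeros'))
model.add(Dense(64, activation='relu', kernel_initializer='zeros'))
model.add(Dense(output_dim, activation='softmax'))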
Epoch 1/5
60000/60000 [==============================] - 3s 55us/step - loss: 2.3016 - acc: 0.1119 - val_loss: 2.3011 - val_acc: 0.1135
Epoch 2/5
60000/60000 [==============================] - 3s 47us/step - loss: 2.3013 - acc: 0.1124 - val_loss: 2.3010 - val_acc: 0.1135
Epoch 3/5
60000/60000 [==============================] - 3s 46us/step - loss: 2.3013 - acc: 0.1124 - val_loss: 2.3010 - val_acc: 0.1135
Epoch 4/5
60000/60000 [==============================] - 3s 47us/step - loss: 2.3013 - acc: 0.1124 - val_loss: 2.3010 - val_acc: 0.1135
Epoch 5/5
60000/60000 [==============================] - 3s 46us/step - loss: 2.3013 - acc: 0.1124 - val_loss: 2.3010 - val_acc: 0.1135
Here, the train loss and test loss are not changing. Hence, we can conclude that the weights of the neurons are not changing, and that our model is affected by the vanishing gradients problem.
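Why zero weights stall learning can be checked by hand on a toy network. The sketch below is not the Keras model above: it assumes one hidden ReLU layer without biases, a single linear output, a squared loss, and a ReLU derivative of 0 at exactly 0. All names and numbers are illustrative.

```python
def relu(z):
    return max(0.0, z)


def relu_grad(z):
    # Derivative of ReLU, taken as 0 at z == 0 (a common convention).
    return 1.0 if z > 0 else 0.0


def hidden_weight_gradients(x, target, w_hidden, w_out):
    """Gradients of the squared loss with respect to the hidden-layer weights."""
    z = [sum(w * xi for w, xi in zip(row, x)) for row in w_hidden]
    h = [relu(v) for v in z]
    y_hat = sum(w * hi for w, hi in zip(w_out, h))
    d_y = 2 * (y_hat - target)
    # Chain rule: loss -> output -> hidden activation -> pre-activation -> weight.
    return [[d_y * w_out[k] * relu_grad(z[k]) * xi for xi in x]
            for k in range(len(w_hidden))]


x, target = [1.0, 2.0], 1.0

zero_grads = hidden_weight_gradients(x, target, [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
print(all(g == 0 for row in zero_grads for g in row))  # True: nothing to update

rand_grads = hidden_weight_gradients(x, target, [[0.5, -0.3], [0.2, 0.4]], [0.7, -0.6])
print(any(g != 0 for row in rand_grads for g in row))  # True: learning can start
```

With every weight at zero, each factor in the chain rule that depends on a weight or an activation is zero, so the hidden layers never receive an update signal.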
Instead of initializing all the weights to zero, here we initialize them to random values. Random initialization is better than zero initialization of weights. But with random initialization we may face two issues, i.e., vanishing gradients and exploding gradients. If the weights are initialized with very large values, we will face the issue of exploding gradients. If the weights are initialized with very small values, we will face the issue of vanishing gradients.
model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(input_dim,), kernel_initializer='random_uniform'))
model.add(Dense(64, activation='relu', kernel_initializer='random_uniform'))
model.add(Dense(output_dim, activation='softmax'))
Epoch 1/5
60000/60000 [==============================] - 3s 55us/step - loss: 0.3929 - acc: 0.8887 - val_loss: 0.1889 - val_acc: 0.9432
Epoch 2/5
60000/60000 [==============================] - 3s 45us/step - loss: 0.1570 - acc: 0.9534 - val_loss: 0.1247 - val_acc: 0.9622
Epoch 3/5
60000/60000 [==============================] - 3s 53us/step - loss: 0.1069 - acc: 0.9685 - val_loss: 0.0994 - val_acc: 0.9705
Epoch 4/5
60000/60000 [==============================] - 3s 54us/step - loss: 0.0810 - acc: 0.9761 - val_loss: 0.0986 - val_acc: 0.9710
Epoch 5/5
60000/60000 [==============================] - 3s 54us/step - loss: 0.0629 - acc: 0.9804 - val_loss: 0.0877 - val_acc: 0.9755
Here, the train loss and the test loss change considerably, i.e., they are converging towards the minimum loss value. Hence, we can clearly say that random initialization is better than zero initialization of weights. But when we rerun the model, we will get different results because of the random initialization of weights.
This is a more advanced technique for initializing weights. It comes in two variants, i.e., Xavier Glorot normal initialization and Xavier Glorot uniform initialization.
Here the weights are drawn from a uniform distribution within the range -x to +x, where x = sqrt(6/(fan-in+fan-out)).
# Both hidden layers use Xavier Glorot uniform initialization
model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(input_dim,), kernel_initializer='glorot_uniform'))
model.add(Dense(64, activation='relu', kernel_initializer='glorot_uniform'))
model.add(Dense(output_dim, activation='softmax'))

Epoch 1/5
60000/60000 [==============================] - 4s 68us/step - loss: 0.3317 - acc: 0.9072 - val_loss: 0.1534 - val_acc: 0.9545
Epoch 2/5
60000/60000 [==============================] - 3s 55us/step - loss: 0.1303 - acc: 0.9614 - val_loss: 0.1124 - val_acc: 0.9679
Epoch 3/5
60000/60000 [==============================] - 3s 54us/step - loss: 0.0889 - acc: 0.9731 - val_loss: 0.0978 - val_acc: 0.9711
Epoch 4/5
60000/60000 [==============================] - 3s 54us/step - loss: 0.0668 - acc: 0.9795 - val_loss: 0.0863 - val_acc: 0.9735
Epoch 5/5
60000/60000 [==============================] - 3s 55us/step - loss: 0.0529 - acc: 0.9840 - val_loss: 0.0755 - val_acc: 0.9771
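Here, with Xavier Glorot uniform initialization, our model performs very well. Note that the weights are still sampled at random, so the exact numbers can differ slightly between runs unless the random seed is fixed.
In Xavier Glorot normal initialization, the weights are drawn from a normal distribution with mean = 0 and standard deviation = sqrt(2/(fan-in + fan-out)), i.e., variance = 2/(fan-in + fan-out).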
# Both hidden layers use Xavier Glorot normal initialization
model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(input_dim,), kernel_initializer='glorot_normal'))
model.add(Dense(64, activation='relu', kernel_initializer='glorot_normal'))
model.add(Dense(output_dim, activation='softmax'))

Epoch 1/5
60000/60000 [==============================] - 4s 66us/step - loss: 0.3296 - acc: 0.9064 - val_loss: 0.1628 - val_acc: 0.9492
Epoch 2/5
60000/60000 [==============================] - 3s 50us/step - loss: 0.1359 - acc: 0.9597 - val_loss: 0.1119 - val_acc: 0.9658
Epoch 3/5
60000/60000 [==============================] - 3s 51us/step - loss: 0.0945 - acc: 0.9721 - val_loss: 0.0929 - val_acc: 0.9706
Epoch 4/5
60000/60000 [==============================] - 3s 52us/step - loss: 0.0731 - acc: 0.9776 - val_loss: 0.0804 - val_acc: 0.9741
Epoch 5/5
60000/60000 [==============================] - 3s 51us/step - loss: 0.0576 - acc: 0.9824 - val_loss: 0.0707 - val_acc: 0.9783
Here, with Xavier Glorot normal initialization, our model also performs very well. As before, the weights are sampled at random, so the exact numbers can differ slightly between runs unless the random seed is fixed.
The weights set this way are neither too big nor too small, which helps avoid the problems of vanishing gradients and exploding gradients. Xavier Glorot initialization also helps the model converge to a minimum faster.
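The two Xavier Glorot formulas can be sketched directly in NumPy. This is an illustration of the formulas rather than the code Keras runs internally: the helper names are hypothetical, a plain normal distribution is used for simplicity, and the layer sizes (784 inputs, 128 units) are assumed for the example.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Uniform weights in [-x, x] with x = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def glorot_normal(fan_in, fan_out, rng):
    """Normal weights with mean 0 and standard deviation sqrt(2 / (fan_in + fan_out))."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(42)        # fixing the seed makes the draw repeatable
w_uniform = glorot_uniform(784, 128, rng)
w_normal = glorot_normal(784, 128, rng)
print("uniform range:", w_uniform.min(), w_uniform.max())
print("empirical std, uniform vs normal:", w_uniform.std(), w_normal.std())
```

A uniform distribution on [-x, x] has variance x²/3, so both variants target the same variance, 2/(fan-in + fan-out); only the shape of the distribution differs.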
He initialization (pronounced “hey”) is also an advanced weight initialization technique. The ReLU activation unit performs very well with it. He initialization considers only the number of inputs (fan-in). It also comes in two types: He normal initialization and He uniform initialization.
In He uniform initialization, the weights are drawn from a uniform distribution over the range -x to +x, where x = sqrt(6/fan-in).
# Both hidden layers use He uniform initialization
model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(input_dim,), kernel_initializer='he_uniform'))
model.add(Dense(64, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(output_dim, activation='softmax'))

Epoch 1/5
60000/60000 [==============================] - 4s 72us/step - loss: 0.3252 - acc: 0.9050 - val_loss: 0.1524 - val_acc: 0.9546
Epoch 2/5
60000/60000 [==============================] - 3s 52us/step - loss: 0.1314 - acc: 0.9611 - val_loss: 0.1104 - val_acc: 0.9671
Epoch 3/5
60000/60000 [==============================] - 3s 54us/step - loss: 0.0928 - acc: 0.9718 - val_loss: 0.0978 - val_acc: 0.9697
Epoch 4/5
60000/60000 [==============================] - 3s 53us/step - loss: 0.0703 - acc: 0.9786 - val_loss: 0.0890 - val_acc: 0.9740
Epoch 5/5
60000/60000 [==============================] - 3s 53us/step - loss: 0.0546 - acc: 0.9828 - val_loss: 0.0860 - val_acc: 0.9740
Here, He uniform initialization uses only the number of inputs, yet the model still performs quite decently.
In He normal initialization, the weights are drawn from a normal distribution with mean = 0 and standard deviation = sqrt(2/fan-in), i.e., variance = 2/fan-in.
# Both hidden layers use He normal initialization
model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(input_dim,), kernel_initializer='he_normal'))
model.add(Dense(64, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(output_dim, activation='softmax'))

Epoch 1/5
60000/60000 [==============================] - 4s 61us/step - loss: 0.3163 - acc: 0.9087 - val_loss: 0.1596 - val_acc: 0.9508
Epoch 2/5
60000/60000 [==============================] - 3s 45us/step - loss: 0.1319 - acc: 0.9610 - val_loss: 0.1163 - val_acc: 0.9625
Epoch 3/5
60000/60000 [==============================] - 3s 44us/step - loss: 0.0915 - acc: 0.9725 - val_loss: 0.0897 - val_acc: 0.9727
Epoch 4/5
60000/60000 [==============================] - 3s 45us/step - loss: 0.0693 - acc: 0.9795 - val_loss: 0.0878 - val_acc: 0.9735
Epoch 5/5
60000/60000 [==============================] - 3s 44us/step - loss: 0.0537 - acc: 0.9836 - val_loss: 0.0764 - val_acc: 0.9769
Here, He normal initialization also uses only the number of inputs, and the model still performs well.
In He initialization too, the weights are neither too big nor too small, which helps avoid the problems of vanishing gradients and exploding gradients. This initialization also helps the model converge to a minimum faster.
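The He formulas can be sketched in NumPy in the same way. The helper names are hypothetical, a plain (not truncated) normal distribution is assumed, and the batch size, width and depth are arbitrary illustration values. The loop at the end checks, under those assumptions, that the signal passing through ten ReLU layers stays at a similar magnitude instead of vanishing or exploding.

```python
import numpy as np

def he_uniform(fan_in, fan_out, rng):
    """Uniform weights in [-x, x] with x = sqrt(6 / fan_in)."""
    limit = np.sqrt(6.0 / fan_in)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    """Normal weights with mean 0 and standard deviation sqrt(2 / fan_in)."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
x = rng.standard_normal((512, 256))        # hypothetical batch of inputs
for _ in range(10):                         # ten ReLU layers of width 256 (assumed sizes)
    x = np.maximum(0.0, x @ he_normal(256, 256, rng))
mean_square = float(np.mean(x ** 2))
print("mean squared activation after 10 layers:", mean_square)
```

The factor 2 in the He formulas compensates for ReLU zeroing out roughly half of its inputs, which is why this scheme pairs well with ReLU.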
As there is no strong theory for choosing the right weight initialization, we rely on a few rules of thumb:
Convolutional neural networks mostly use the ReLU activation function, so they use He initialization.
The post Weight Initialization in Deep Learning first appeared on Shahab Magazine (مجله شهاب).
The post Activation Functions in Deep Learning first appeared on Shahab Magazine (مجله شهاب).
“The expert in anything was once a beginner” - Helen Hayes
Let me help you take your first step in deep learning by teaching you two basic and important concepts: activation functions and weight initialization.
Introduction
Most ideas have a biological inspiration, and activation functions and neural networks are among the most beautiful ideas inspired by humans. When we feed lots of information to our brain, it works hard to understand it and to separate the useful information from the not-so-useful. Neural networks need a similar mechanism to classify incoming information as useful or not useful. Only part of the information is really useful; the rest may be noise. The network tries to learn the useful part, and for that we need activation functions. An activation function helps the network do this segregation. In simpler words, it tries to build a wall between useful and less useful information.
Let me introduce some terminology to make things easier to understand.
Let me give a simple example, and later I will connect the dots with the theory. Suppose we are teaching an 8-year-old kid to add two numbers. First, he receives information about how to perform addition from the instructor. He then tries to learn from that information and finally performs the addition. Here the kid can be thought of as a neuron: it tries to learn from the input it is given, and finally it produces an output.
From a biological perspective, this idea is similar to the human brain. The brain receives a stimulus from the outside world, processes the input and then generates an output. As the task gets more complex, multiple neurons form a complex network and pass information among themselves.
In the network diagram, the blue circles are the neurons. Each neuron has weights, a bias and an activation function. The input is fed to the input layer. Each neuron then performs a linear transformation on its input using the weights and bias, and the activation function applies the non-linear transformation. The information moves from the input layer to the hidden layer, which processes it and produces the output. This mechanism is called forward propagation.
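Forward propagation as described above can be sketched in a few lines of NumPy. The network size (three inputs, two hidden neurons, one output), the weight values and the helper name dense_forward are made up for illustration; sigmoid is used as the activation only as an example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense_forward(x, weights, bias, activation):
    """One layer: linear transformation, then the non-linear activation."""
    z = x @ weights + bias
    return activation(z)

x = np.array([[0.5, -1.0, 2.0]])                 # one sample with three input features
w_hidden = np.array([[0.1, 0.4],
                     [-0.3, 0.2],
                     [0.5, -0.6]])                # 3 inputs -> 2 hidden neurons
b_hidden = np.array([0.0, 0.1])
w_out = np.array([[0.7], [-0.2]])                 # 2 hidden neurons -> 1 output
b_out = np.array([0.05])

hidden = dense_forward(x, w_hidden, b_hidden, sigmoid)   # input layer -> hidden layer
output = dense_forward(hidden, w_out, b_out, sigmoid)    # hidden layer -> output
print("hidden activations:", hidden)
print("network output:", output)
```

Each call to dense_forward is one hop of information from a layer to the next; chaining the calls is the whole forward pass.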
In a neural network, we update the weights and biases of the neurons on the basis of the error. This process is known as back propagation. Once the entire dataset has gone through this process, the final weights and biases are used for predictions.
Generally, adding more hidden layers allows the network to learn more complex functions, and thus to perform better.
But here comes the problem: when we do back propagation, i.e., calculate the gradients and update the weights moving backwards through the network, the gradients tend to get smaller and smaller. This means the weights of the neurons in the earlier layers learn very slowly, or sometimes do not change at all. But the earlier layers are very important because they are responsible for detecting simple patterns. If the earlier layers give inappropriate results, how can we expect the later layers, and the model as a whole, to perform well? This problem is called the vanishing gradient problem.
We know that a model with more hidden layers tends to perform well. But if the gradients become larger and larger during back propagation, the weights of the neurons in the earlier layers change by very large amounts. Since the earlier layers are so important, these overly large weights make them give inappropriate results. This problem is called the exploding gradient problem.
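Both problems come from repeated multiplication: by the chain rule, the gradient reaching the first layer is a product of one factor per layer. The sketch below is a deliberate simplification that assumes every layer contributes the same factor; the function name and the chosen factors and depths are illustrative only.

```python
def gradient_scale(per_layer_factor, n_layers):
    """Size of the gradient at the first layer if each layer multiplies it by the same factor."""
    scale = 1.0
    for _ in range(n_layers):
        scale *= per_layer_factor
    return scale

for factor in (0.25, 1.0, 1.5):
    scales = [gradient_scale(factor, n) for n in (5, 20, 50)]
    print("factor", factor, "-> scale after 5, 20, 50 layers:", scales)
```

Factors below 1 drive the gradient towards zero as depth grows (vanishing), while factors above 1 make it grow without bound (exploding).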
Now, let us dive deep into the core concept of activation functions.
“An activation function is a non-linear function applied by the neuron to introduce non-linear properties in the network.”
Let me explain in detail. There are two types of functions: linear and non-linear.
If a change in the first variable always corresponds to a constant change in the second variable, we call it a linear function.
If a change in the first variable does not necessarily correspond to a constant change in the second variable, we call it a non-linear function.
In the simplest case of any neural network, we multiply the input by the weights, add the bias, apply an activation function and pass the output to the next layer; we then do back propagation to update the weights.
Neural networks are function approximators. The main goal of any neural network is to learn complex non-linear functions. If we don't apply any non-linearity in our neural network, we are just trying to separate the classes using a linear hyperplane, no matter how many layers we stack. And, as we know, hardly anything in the real world is linear.
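A short NumPy check makes the point about linearity concrete: two layers without an activation function collapse into a single linear layer, whereas putting a non-linearity between them breaks that equivalence. The shapes and random values are arbitrary assumptions for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))                        # 4 samples, 3 features
w1, b1 = rng.standard_normal((3, 5)), rng.standard_normal(5)
w2, b2 = rng.standard_normal((5, 2)), rng.standard_normal(2)

# Two stacked layers with no activation function
two_linear_layers = (x @ w1 + b1) @ w2 + b2

# The same mapping written as one linear layer
w_combined = w1 @ w2
b_combined = b1 @ w2 + b2
one_linear_layer = x @ w_combined + b_combined
print(np.allclose(two_linear_layers, one_linear_layer))   # True

# With ReLU between the layers, the single linear layer no longer reproduces the output
with_relu = np.maximum(0.0, x @ w1 + b1) @ w2 + b2
print(np.allclose(with_relu, one_linear_layer))
```

So extra depth only adds expressive power when a non-linear activation sits between the layers.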
Suppose we perform only the simple linear operation, i.e., multiply each input by its weight, sum over all the inputs arriving at the neuron and add a bias term. In some cases the resulting value is very large. When this output is fed to further layers, the values become even larger, making things computationally uncontrollable. This is where the activation function plays a major role: functions such as sigmoid and tanh squash the real-valued input to a fixed interval, i.e., between 0 and 1 or between -1 and 1.
Let us discuss the different activation functions and their problems.
Sigmoid is a smooth, continuously differentiable, non-linear function with an S-shaped curve. The main reason to use the sigmoid function is that its value lies between 0 and 1. Therefore, it is especially used for models where we have to predict a probability as the output: since the probability of anything lies between 0 and 1, sigmoid is the right choice.
As we know, the sigmoid function squashes the output values between 0 and 1. Mathematically, a large negative number passed through the sigmoid function becomes close to 0 and a large positive number becomes close to 1.
The gradient of the sigmoid function is high for inputs between -3 and 3, but the curve gets flatter in other regions.
The sigmoid function is easily differentiable and its values depend on x. This means that during back propagation we can easily use the sigmoid function to update the weights.
The gradient values of sigmoid range between 0 and 0.25.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))
The same sigmoid code serves both forward propagation and the derivative computation, because the derivative can be written in terms of the function's own output: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)).
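As a small sketch of that reuse, the derivative can be computed from the forward value using the identity sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)). The function name sigmoid_derivative and the sample inputs are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative of sigmoid, computed from the forward value."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, -3.0, 0.0, 3.0, 10.0])
print("sigmoid:   ", sigmoid(z))
print("derivative:", sigmoid_derivative(z))   # peaks at 0.25 when z = 0
```

The derivative never exceeds 0.25 and is close to zero for large positive or negative inputs, which is where sigmoid saturates.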
Let me explain how the sigmoid function faces the problems of vanishing gradients and exploding gradients.
The tanh function is similar to the sigmoid function and works in a similar way, but it is symmetric about the origin. It is continuous and differentiable at all points. It takes a real-valued number and squashes it to a value between -1 and 1. Like the sigmoid neuron, it saturates at large positive and negative values. The output of tanh is zero-centered. Tanh is preferred over sigmoid in hidden layers.
The tanh function takes a real-valued number and outputs a value between -1 and 1.
The derivative of the tanh function is steeper than that of the sigmoid function.
For large positive and negative inputs, the graph of the tanh function is flat and the gradients are very low.
Code for the tanh function in Python:
def tanh(z):
    return np.tanh(z)
Gradient values of tanh
Gradient values of tanh range between 0 and 1.
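The range quoted above follows from the identity tanh'(z) = 1 - tanh(z)^2. A minimal sketch, with the function name tanh_derivative and the sample inputs chosen for illustration:

```python
import numpy as np

def tanh_derivative(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print("tanh:      ", np.tanh(z))
print("derivative:", tanh_derivative(z))   # peaks at 1.0 when z = 0
```

The peak of 1 at z = 0 is four times the sigmoid's peak of 0.25, but the derivative still falls towards zero once tanh saturates.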
Let me explain how the tanh function faces the problems of vanishing gradients and exploding gradients.
ReLU stands for Rectified Linear Unit. It is the most widely used activation unit in deep learning. R(x) = max(0, x), i.e., if x < 0, R(x) = 0 and if x ≥ 0, R(x) = x. It accelerates the convergence of stochastic gradient descent compared to the sigmoid or tanh activation functions. The main advantage of the ReLU function is that it does not activate all the neurons at the same time: if the input is negative, it is converted to zero and the neuron does not get activated. This means only a few neurons are activated at a time, making the network cheap to compute. It also helps avoid the vanishing gradient problem. Most deep learning models use the ReLU activation function nowadays.
How can we say that ReLU is a non-linear function?
Linear functions are straight-line functions. ReLU is not a straight line because it has a bend at zero. Hence, we can say that ReLU is a non-linear function. Please have a look at the graph of the ReLU function.
If the value of x is greater than or equal to zero, then ReLU(x) = x.
If the value of x is less than zero, then ReLU(x) = 0.
If the value of x is greater than zero, then the derivative ReLU’(x) = 1.
If the value of x is less than zero, then the derivative ReLU’(x) = 0.
If the units are not activated initially, then zero gradients flow through them during back propagation. Hence, neurons that have already died won't respond to variations in the output, and their weights will never get updated during back propagation. This is called the dead neurons problem.
def relu(z):
    return z * (z > 0)
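The dead neurons problem can be illustrated with the ReLU derivative. In this sketch the pre-activation values and the upstream gradient are made-up numbers, and the derivative at exactly zero is taken as 0, which is a common convention and an assumption here.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_derivative(z):
    """1 where z > 0, otherwise 0 (the value at z == 0 is a convention)."""
    return (z > 0).astype(float)

# Hypothetical neuron whose pre-activation is negative for every sample in the batch
z = np.array([-2.0, -0.5, -3.1])
upstream_grad = np.array([0.3, -1.2, 0.8])     # gradient arriving from the next layer

local_grad = upstream_grad * relu_derivative(z)
print("outputs:", relu(z))
print("gradient passed back:", local_grad)      # every entry has magnitude zero
```

Because nothing flows back through such a neuron, its weights stop changing, which is exactly what Leaky ReLU is designed to avoid.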
Leaky ReLU is an improved version of the ReLU function. We know that in ReLU the gradient is 0 for x < 0. In Leaky ReLU, instead of defining the function as 0 for x < 0, we define it as a small linear component of x, i.e., 0.01x (the slope is generally taken as 0.01). The main advantage of Leaky ReLU is that we replace the horizontal line on the x-axis with a non-zero, non-horizontal line. We do this to remove the zero gradient, and by removing the zero gradient we avoid the issue of dead neurons.
If the value of x is greater than zero, then Leaky ReLU(x) = x.
If the value of x is less than zero, then Leaky ReLU(x) = 0.01*x.
If the value of x is greater than zero, then the derivative Leaky ReLU’(x) = 1.
If the value of x is less than zero, then the derivative Leaky ReLU’(x) = 0.01.
Here alpha is the small slope applied to x when x is negative. Typically we take alpha as 0.01.
def leaky_relu(z):
    return np.maximum(0.01 * z, z)
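A version with alpha exposed as a parameter, together with its derivative, might look like the sketch below. The parameter name alpha follows the text; the np.where formulation, the derivative helper and the sample inputs are illustrative choices.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """z for positive inputs, alpha * z otherwise."""
    return np.where(z > 0, z, alpha * z)

def leaky_relu_derivative(z, alpha=0.01):
    """1 for positive inputs, alpha otherwise, so the gradient is never exactly zero."""
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 3.0])
print("leaky_relu:", leaky_relu(z))
print("derivative:", leaky_relu_derivative(z))
```

Since the derivative for negative inputs is alpha rather than zero, a neuron stuck in the negative region still receives a small gradient and can recover.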
Let me keep all the graphs in one place so that you can easily see the differences between them.
Some more complex activation functions, such as Maxout and ELU, are not covered.
Let me keep all the activation function equations and their derivatives in one place so that you can easily review them.
Depending on the properties of the given problem, we may be able to choose an activation function that makes the network converge faster.
As a rule of thumb, we can begin with the ReLU activation function and move to other activation functions if ReLU does not perform well in our network.