Ultimately, you want your neural network to generalize to “unseen data” so that it can be used in the real world. Two failure modes can prevent a network from generalizing well:
Your network is too simple to capture the training data’s salient features. This is called underfitting the training set.
Here’s a common trick to avoid underfitting: deepen your neural network by adding more layers.
Your network is complex enough to fully memorize the mapping between the training data and the training labels. However, it does not generalize well to unseen data because it has memorized the idiosyncrasies of the training set rather than its salient features. This is called overfitting the training set.
The best way to help your model generalize is to gather a larger dataset, but this is not always possible. When you do not have access to a large dataset, you can use regularization methods. Let’s learn the intuition behind these methods.
One widely used regularization method is called early stopping. Recall that optimizing a network to find the correct parameters is an iterative process. If you evaluate your model’s error on the training and dev sets after every training epoch, you might see curves like these:
Based on this observation, you can estimate that after the 30,000th epoch, your model starts overfitting the training set. Early stopping means halting training at that point and keeping the model’s parameters as they were at the 30,000th epoch. The saved model is the best-performing model on the dev set and will likely generalize better to the test set.
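The early-stopping logic can be sketched independently of any framework. In the sketch below, `train_one_epoch` and `dev_error` are hypothetical stand-ins for your own training and evaluation code, with a toy dev-error curve that bottoms out at epoch 30; only the snapshot-and-patience logic is the point:

```python
import copy
import random

random.seed(0)

def train_one_epoch(model):
    # Placeholder: pretend training nudges a single parameter.
    model["w"] += random.uniform(-0.1, 0.1)

def dev_error(model, epoch):
    # Placeholder dev-set error: improves early, then worsens after epoch 30.
    return abs(epoch - 30) / 30 + 0.1

model = {"w": 0.0}
best_error, best_params, best_epoch = float("inf"), None, -1
patience, epochs_without_improvement = 5, 0

for epoch in range(100):
    train_one_epoch(model)
    err = dev_error(model, epoch)
    if err < best_error:
        best_error, best_epoch = err, epoch
        best_params = copy.deepcopy(model)   # save the best snapshot so far
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                            # stop: dev error no longer improves

print("stopped at epoch", epoch, "- best epoch was", best_epoch)
```

The `patience` counter is a common refinement: rather than stopping at the first uptick in dev error, you wait a few epochs in case the error is merely fluctuating.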
In order to avoid overfitting the training set, you can try to reduce the complexity of the model by removing layers, and consequently decreasing the number of parameters. As shown by the work of Krogh and Hertz (1992), another way to constrain a network and lower its complexity is to penalize large weights by adding a term, such as the L1 or L2 norm of the weights, to the cost function.
You want to prevent the weights from growing too large, unless it is really necessary. Intuitively, you are reducing the set of potential networks to choose from.
The update rules are different. While the L2 “weight decay” penalty is proportional to the value of the weight to be updated, the L1 “weight decay” is not.
For L2, the smaller the weight w, the smaller the penalty during the update of w, and vice versa for larger weights.
For L1, the magnitude of the penalty is independent of the value of w, but the direction of the penalty (positive or negative) depends on the sign of w. This results in an effect called “feature selection” or “weight sparsity”: L1 regularization drives non-relevant weights to 0.
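This difference in update rules can be shown in a few lines. The sketch below is illustrative: `alpha` stands for the learning rate and `lam` for the regularization coefficient (names chosen here for clarity), and the data gradient is zeroed out to isolate the penalty’s effect:

```python
import numpy as np

alpha, lam = 0.1, 0.5            # learning rate and regularization coefficient
w = np.array([2.0, -0.3, 0.001])
grad = np.zeros_like(w)          # zero data gradient: only the penalty acts

# L2 ("weight decay"): the penalty term is proportional to w itself.
w_l2 = w - alpha * (grad + lam * w)

# L1: the penalty magnitude is constant; only its sign follows w.
w_l1 = w - alpha * (grad + lam * np.sign(w))

print(w_l2)   # each weight shrinks by 5% of its own value
print(w_l1)   # each weight moves by a fixed 0.05 toward zero
```

Note how the tiny third weight barely moves under L2 but overshoots past zero under L1, which is why L1 tends to zero out small, non-relevant weights.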
You can play with the visualization below to see the impact of L1 and L2 regularization on the weights during training.
Use the following selections to view a histogram of the weight values with and without regularization.
Choose a value for the regularization coefficient λ in the update equations above.
Load data and train both regularized and unregularized networks.
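If you cannot run the visualization, the sparsity effect is easy to reproduce on synthetic data. The sketch below fits a linear model and applies the L1 penalty with a proximal (soft-thresholding) step — an implementation choice beyond plain gradient descent, used here because it produces exact zeros:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: only the first 3 of 20 features matter.
n, d = 200, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.1 * rng.normal(size=n)

def train(lam_l1=0.0, lr=0.01, epochs=1000):
    """Gradient descent on squared error; the L1 penalty is applied as a
    proximal (soft-thresholding) step, which yields exact zero weights."""
    w = np.zeros(d)
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / n          # gradient of the data loss
        w = w - lr * grad
        # Soft-threshold: shrink each weight toward 0 by lr * lam_l1.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam_l1, 0.0)
    return w

w_plain = train(lam_l1=0.0)
w_l1 = train(lam_l1=0.5)

print("exact zeros without L1:", int(np.sum(w_plain == 0.0)))
print("exact zeros with L1:   ", int(np.sum(w_l1 == 0.0)))
```

With L1, the 17 non-relevant weights end up exactly zero while the 3 relevant ones survive (shrunk toward zero); without regularization, every weight stays nonzero.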
As you can see, L1 and L2 regularization have a dramatic effect on the weight values. For L1 regularization:
The weight sparsity effect caused by L1 regularization makes your model more compact, leading to storage-efficient models that are commonly used on smart mobile devices.
Deep learning frameworks such as Keras allow you to add L1 or L2 regularization to your network in one line of code. The difference in the optimization process is implemented automatically. Here's an example: L1 and L2 regularization in Keras.
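As a sketch of what that one line looks like (the layer sizes and coefficients below are illustrative, not taken from the linked example):

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Minimal sketch: the kernel_regularizer argument adds the penalty term
# to the loss; the optimizer handles the modified update automatically.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),   # L2 penalty
    layers.Dense(1, activation="sigmoid",
                 kernel_regularizer=regularizers.l1(1e-5)),   # L1 penalty
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```

You can also combine both penalties on one layer with `regularizers.l1_l2`.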
It’s useful to build some intuition about what L1 and L2 regularization do. You can play with the visualization below to observe variations of the cost landscape subject to regularization from a top view.
As you can see, L1 and L2 regularization have a dramatic effect on the geometry of the cost function:
When applying regularization methods, you need a metric to track your model's improvement and generalization ability. The bias/variance tradeoff lets you measure the effectiveness of your regularization.
Successfully training a model on complex tasks is difficult. You need to find a model architecture that can encompass the complexity of the dataset. Once you find such an architecture, you can work on improving generalization. Exploring, and even combining, different regularization techniques is an essential step of the training process. It helps you build intuition about a model's ability to generalize to the real world.