- The importance of regularization
- Early stopping
- L1 and L2 regularizations
- Dropout regularization

Ultimately, you want your neural network to generalize to “unseen data” so that it can be used in the real world. A neural network's ability to generalize to unseen data depends on two factors:

- The information in the training data. For example, a dataset of images taken from the front-facing camera of a car contains intuitively more complex information than a dataset of images of clouds in the sky.
- The complexity of the network. By complexity, we mean the complexity of the function your neural network can mimic. Usually, the more parameters your network uses, the more complex it is.

Your network is too simple to understand the training data’s salient features. This is called underfitting the training set.

Here’s a common trick to avoid underfitting: deepen your neural network by adding more layers.

Your network is complex enough to fully memorize the mapping between the training data and the training labels. However, it does not generalize well to unseen data because it has merely over-memorized the salient features of the training set. This is called overfitting the training set.

The best way to help your model generalize is to gather a larger dataset, but this is not always possible. When you do not have access to a large dataset, you can use regularization methods. Let’s learn the intuition behind these methods.

Before we explore the main regularization methods, you must partition your dataset. In order to estimate the ability of your model to generalize, you will split your dataset into three (or sometimes more) sets: training, dev and test. If your model is trained on the training set, tuned on the dev set, and still performs well on the test set, it is able to generalize successfully.

Train Accuracy | Dev Accuracy | Test Accuracy | Conclusion |
---|---|---|---|

99% | 75% | 70% | Overfitting (high variance) |

88% | 85% | 83% | Underfitting (high bias) |

95% | 93% | 92% | Appropriate (correct bias/variance trade-off) |

You want to close the performance gap between your test set and your dev and training set while keeping the training performance as high as possible. Let’s delve into the methods that will help you do so.

One of the widely used regularization method is called early stopping. Recall that optimizing a network to find the correct parameters is an iterative process. If you evaluate your model’s error on the training and dev set after every training epoch, you might see such curves:

Based on this observation, you can estimate that after the 30,000th epoch, your model starts overfitting to the training set. Early stopping means saving the model’s parameters at the 30,000th epoch. The saved model is the best performing model on the dev set and will likely generalize better to the test set.

- It is usually easier to do. Deep learning frameworks, such as Tensorflow or Pytorch, offer options to save your model’s parameters regularly during training. In real world applications, determining when a chosen quantity (e.g. validation accuracy) has stopped improving is sometimes hard due to noisy batches of data. Here's a Python code walking through a possible implementation of early stopping.

```
def early_stopping(model, X_train, Y_train, X_dev, Y_dev, eval_freq = 1, stop_wait = 100):
"""
Python implementation of the early stopping algorithm.
Arguments:
model -- Your model class which comes with functions:
model.train() -- trains the model for one step and returns the updated model.
model.eval() -- evaluates the model on a set and returns the error.
model.copy() -- copies the model.
X_train -- Training set
Y_train -- Training labels
X_dev -- Dev set
Y_dev -- Dev labels
eval_freq -- integer, ratio of training step per evaluation step.
stop_wait -- number of train step without dev error improvement before stopping.
Returns:
best_model -- the model with minimal dev error and early stopping
best_dev_error -- the best dev error achieved by best_model
best_model_index -- the best_model's saving step
"""
current_model = model.copy() # Track your current model
best_model = model.copy() # Track your best model
num_train_steps = 0 # number of training steps
num_steps_since_best_model = 0 # number of training steps since the last best model was saved
best_dev_error = np.inf # the best dev error achieved so far (by best_model)
best_model_index = 0 # the best_model's saving step
# Loop unless there's no model improvement on the dev set for stop_wait steps.
while num_steps_since_best_model < stop_wait:
# Train current_model for "train_per_eval_ratio" steps
for s in range(train_per_eval_ratio):
current_model.train(X_train, Y_train)
num_train_steps = num_train_steps + 1
# Evaluate your current_model on the dev set.
dev_error = current_model.eval(X_dev, Y_dev)
# If the dev error is lower than the best dev error previously achieved, then save the current_model as the best_model
if dev_error < best_dev_error:
num_steps_since_best_model = 0
best_model = current_model.copy()
best_model_index = num_train_steps
current_model = best_model
# Otherwise increment num_steps_since_best_model
else:
num_steps_since_best_model = num_steps_since_best_model + 1
return best_model, best_dev_error, best_model_index
```

Despite being practical, early stopping is not satisfying from a scientific standpoint. There exist other regularization methods, such as L1 (/L2) regularization and dropout, or optimization methods, such as learning rate annealing, that rely more on theory and are often more reliable.

In order to avoid overfitting the training set, you can try to reduce the complexity of the model by removing layers, and consequently decreasing the number of parameters. As shown by the work of Krogh and Hertz (1992), another way to constrain a network and lower its complexity is to:

You want to prevent the weights from growing too large, unless it is really necessary. Intuitively, you are reducing the set of potential networks to choose from.

L1 and L2 regularizations can be achieved by simply adding a term that penalizes large weights to the cost function. If you were training a network to minimize the cost with (let’s say) gradient descent, your weight update rule would be:

Instead, you would now train a network to minimize the cost where and . Your weight update rule would be:

For L1 regularization, this would lead to the update rule:

For L2 regularization, this would lead to the update rule:

At every step of L1 and L2 regularization the weight is pushed to a slightly lower value because , causing weight decay.

The update rules are different. While the L2 “weight decay” penalty is proportional to the value of the weight to be updated, the L1 “weight decay” is not.

For L2, the smaller the , the smaller the penalty during the update of and vice-versa for larger .

For L1, the penalty is independent of the value of , but the direction of the penalty (positive or negative) depends on the sign of . This results in an effect called “feature selection” or “weight sparsity”. L1 regularization makes the non-relevant weights 0.

You can play with the visualization below to see the impact of L1 and L2 regularization on the weights during training.

As you can see, L1 and L2 regularizations have a dramatic effect on the weights values:

- “Too small” constant: there’s no apparent effect.
- “Appropriate” constant: many of the weights become zeros. This is called “sparsity of the weights”. Because the weight penalty is independent of the weight values, weights with value 0.001 are penalized just as much as weights with value 1000. The value of the penalty is (generally very small). It constrains the subset of weights that are “less useful to the model” to be equal to 0. For example, you could effectively end up with 200 non-zero weights out of 1,000, which means that 800 weights were less useful for learning the task.
- “Too large" constant: You can observe a plateau, which means that the weights uniformly take values around zero. In fact, because is large, the penalty is much higher than the gradient . Thus, at every update, is pushed by in the opposite direction of its sign. For instance, if is around zero, but slightly positive, then it will be pushed towards when the penalty is applied. Hence, the plateau’s width should be .

- “Too small” constant: there’s no apparent effect.
- “Appropriate” constant: the weight values decrease following a centered distribution that becomes more and more peaked throughout training.
- “Too large” constant: All the weights are rapidly collapsing to zeros, and the model obviously underfits because the weight values are too constrained.

The weight sparsity effect caused by L1 regularization makes your model more compact in theory, and leads to storage-efficient compact models that are commonly used in smart mobile devices.

Deep learning frameworks such as Keras allow you to add L1 or L2 regularization to your network in one line of code. The difference in the optimization process is implemented automatically. Here's an example: L1 and L2 regularization in Keras.

It’s useful to build some intuition about what L1 and L2 regularization do. You can play with the visualization below to observe variations of the cost landscape subject to regularization from a top view.

As you can see, L1 and L2 regularizations have a dramatic effect on the geometry of the cost function:

- Using the first landscape (Himmelblaus Function), notice that adding L2 regularization prevents the network from converging to a local minimum.
- Using the last landscape (Rastrigin Function), notice that adding regularization results in a more convex cost landscape and diminishes the chance of falling into a non-desired minimum.

Although L1 and L2 regularization are simple techniques to reducing overfitting, there exist other methods, such as dropout regularization, that have been shown to be more effective at regularizing larger and more complex networks. If you had unlimited computational power, you could improve generalization by averaging the predictions of several different neural networks trained on the same task. The combination of these models will likely perform better than a single neural network trained on this task. However, with deep neural networks, training various architectures is expensive because:

- Tuning the hyperparameters is time consuming.
- Training the networks requires a lot of computations.
- You need a large amount of data to train the models on different subsets of the data.

Dropout is a regularization technique, introduced in Srivastava et al. (2014), that allows you to combine many different architectures efficiently by randomly dropping some of the neurons of your network during training.

Dropout is discussed in detail in our Deep Learning Specialization (Course 2: “Improving Neural Networks”, Week 2: “Practical aspects of Deep Learning”). We invite you to check it out to understand the intuition behind dropout.

When applying regularization methods, you need a metric to track your model's improvement and generalization ability. The bias/variance tradeoff enables you to measure the efficiency of your regularization.

The bias/variance tradeoff is discussed in our Deep Learning Specialization (Course 2: “Improving Neural Networks”, Week 2: “Practical aspects of Deep Learning”). We invite you to check it out!

Successfully training a model on complex tasks is complicated. You need to find a model architecture that can encompass the complexity of the dataset. Once you find such an architecture, you can work on improving generalization. Exploring, and even combining, different regularization techniques is an unmissable step of the training process. It helps you build intuition on the ability of a model to generalize in the real-world.

↑ Back to top