This document contains comments and remarks on the following publication:
Deep neural networks have an inbuilt Occam’s razor: https://www.nature.com/articles/s41467-024-54813-x
A boolean function maps binary vectors of size n to a single bit. For n = 7 there are 2^7 = 128 possible input strings, so the function is fully specified by the 128-bit string of its outputs (its truth table). Such a random boolean function, which maps each of the 7-bit strings to a random label, can be generated as follows:
```python
import numpy as np

def algorithm_0(input_strings):
    """
    input_strings: a list of n-bit strings
    """
    y = []
    # for each string in the list
    for x in input_strings:
        # assign a random label, no dependence between x and the label value
        y.append(np.random.choice([0, 1]))
    return np.asarray(y)
```
The above code is equivalent to:
```python
def algorithm_1(input_strings):
    return np.random.choice([0, 1], size=len(input_strings))
```
However, the previous mapping functions will, most of the time, produce strings with approximately the same number of 0s and 1s. To control the class imbalance, we can instead fix the number of 1s, k, in the output string:
```python
def algorithm_2(input_strings, k):
    """
    k: the number of 1s in the output string
    """
    # initialize a string with k 1s
    y = [1] * k
    # pad the string with zeros until the desired length is reached
    y.extend([0] * (len(input_strings) - k))
    # shuffle the string - notice again no dependence between input_strings and labels
    np.random.shuffle(y)
    return np.asarray(y)
```
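As a quick check (hypothetical usage, assuming the two functions above), labels from algorithm_1 concentrate around a balanced split, while algorithm_2 fixes the number of 1s exactly:

```python
import numpy as np

np.random.seed(0)
# all 128 possible 7-bit input strings
inputs = [f"{i:07b}" for i in range(128)]

# algorithm_1: the number of 1s fluctuates around 64, Binomial(128, 0.5)
counts = [algorithm_1(inputs).sum() for _ in range(1000)]
print(np.mean(counts), np.std(counts))  # roughly 64 +/- 5.7

# algorithm_2: the number of 1s is exactly k
print(algorithm_2(inputs, k=16).sum())  # 16
```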
Datasets created by random boolean mappings have the feature that there is no consistent relation between an input string and its label. There is no similarity between the train and test datasets, nor between examples of the same class. Although neural networks are able to minimize the error on the training set (possibly by memorization), the test set bears no resemblance to the train set, which makes it impossible to generalize. However, figures in the publication show a decreasing generalization error as the LZ complexity (KLZ for short) of the mapping function decreases. This creates a tension between what one expects from random datasets and these results.
One possible explanation lies in a particular characteristic of KLZ. Figure 1 shows a few related quantities: LZ complexity, entropy, and class imbalance. For 128-bit strings, KLZ ranges roughly between 0 and 160.
Notice that KLZ is highly correlated with the number of 1s in the string, and hence with the entropy $H = -p \log_2 p - (1 - p) \log_2 (1 - p)$, where $p = n_1/128$ is the fraction of 1s.
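The correlation can be illustrated with a simple phrase-counting proxy. Below is a minimal sketch using an LZ78-style parsing; lz78_phrases is my own helper and only a stand-in for the actual KLZ measure used in the paper:

```python
import numpy as np

def lz78_phrases(s):
    """Count the phrases in an LZ78-style parsing of the string s."""
    phrases, current = set(), ""
    for ch in s:
        current += ch
        if current not in phrases:
            # a new phrase: record it and start the next one
            phrases.add(current)
            current = ""
    return len(phrases) + (1 if current else 0)

rng = np.random.default_rng(0)
for k in (0, 4, 16, 64):
    # a 128-bit string with exactly k ones in random positions
    s = np.zeros(128, dtype=int)
    s[rng.choice(128, k, replace=False)] = 1
    print(k, lz78_phrases("".join(map(str, s))))
```

Strings dominated by one symbol parse into a few phrases, while balanced strings need many more, in line with the correlation visible in Figure 1.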
Consider the following scenario: we pick a string with a given number of 1s, so the class frequencies in the training data are fixed. A neural network trained on random input/output binary mappings will learn the class statistics and at inference time would predict labels according to those statistics. Below are 3 variations on the same idea:
- Hypothesis 1: predict each test label independently at random, with the class probabilities estimated from the training set (mirroring algorithm_1).
- Hypothesis 2: predict a random label string that reproduces the training-set class frequencies exactly (mirroring algorithm_2).
- Hypothesis 3, a toy model: always bet on the class with the highest probability.
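Under this reading of the three hypotheses, the expected test accuracy on a string with a fraction p of 1s can be worked out directly: Hypotheses 1 and 2 both give p^2 + (1-p)^2 on average, while Hypothesis 3 gives max(p, 1-p). A minimal simulation sketch (the function name and setup are my own, assuming 128-bit strings with exactly k ones):

```python
import numpy as np

def simulate(k, n=128, trials=10000, seed=0):
    """Mean and spread of the test accuracy of the three
    statistical predictors on random strings with exactly k ones."""
    rng = np.random.default_rng(seed)
    p = k / n
    acc = {"H1": [], "H2": [], "H3": []}
    for _ in range(trials):
        # target string with exactly k ones in random positions
        target = np.zeros(n, dtype=int)
        target[rng.choice(n, k, replace=False)] = 1
        # H1: sample each label independently with probability p
        h1 = (rng.random(n) < p).astype(int)
        # H2: predict a random string with exactly k ones
        h2 = np.zeros(n, dtype=int)
        h2[rng.choice(n, k, replace=False)] = 1
        # H3: always bet on the majority class
        h3 = np.full(n, int(p > 0.5))
        for name, pred in (("H1", h1), ("H2", h2), ("H3", h3)):
            acc[name].append((pred == target).mean())
    return {name: (np.mean(v), np.std(v)) for name, v in acc.items()}

print(simulate(k=16))  # H1/H2 mean ~0.78, H3 mean ~0.875
```

Hypothesis 2 has the same mean as Hypothesis 1 but a smaller spread, since the predicted class counts are fixed rather than sampled.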
Figure 2 shows the predictions of the three hypotheses superimposed on the results shown in the paper.
- red crosses are the means and standard deviations for Hypothesis 1
- green crosses are the means and standard deviations for Hypothesis 2
- gray crosses are the means and standard deviations for Hypothesis 3
Notice the very good overlap between the means and error bars in the lower-left panel, which compares Hypothesis 2 with the trained models.
Hypothesis 1 also matches the experimental curves in the right panels, though not as well: the experimental curves show a slightly higher generalization error, which indicates that the models did not capture the statistics exactly. Perhaps training the models for a few extra epochs after the train error has reached 0 would help in this regard.
The fact that a simple statistical model captures the experimental results so well is most certainly due to the adopted definition of accuracy, which does not account for imbalanced datasets and is not sensitive enough to flag the issues arising from them. Therefore, while the models do appear to decrease the generalization error, this is not generalization in the usual sense. It is just a statistical effect of the class imbalance, and it should be so, because the data produced by random mappings have, on average, nothing particularly interesting besides the class occurrence frequencies in the train data.
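To make the accuracy point concrete, here is a minimal sketch (metric names follow common usage, not the paper) comparing plain accuracy with balanced accuracy for a majority-class predictor on an imbalanced target string:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 128, 16                              # 16 ones: heavily imbalanced
target = np.zeros(n, dtype=int)
target[rng.choice(n, k, replace=False)] = 1
pred = np.zeros(n, dtype=int)               # always bet on the majority class 0

accuracy = (pred == target).mean()          # 0.875, looks impressive
recall_1 = (pred[target == 1] == 1).mean()  # 0.0 - class 1 is never found
recall_0 = (pred[target == 0] == 0).mean()  # 1.0
balanced = (recall_1 + recall_0) / 2        # 0.5, i.e. chance level
print(accuracy, balanced)
```

Balanced accuracy exposes what plain accuracy hides: the predictor has learned nothing about class 1.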
On the more philosophical side, it seems that simplicity wins: if one cannot find an algorithm which maps inputs to outputs, one may be better off betting on the class with the highest probability, as estimated from the training set. On average, the toy model appears to be a better predictor for random boolean mappings than any of the neural models.