Convolutional Kernel Optimization for Deep Neural Networks using Constructivist Augmented Machine Learning (CAML) Methodology
Hongjin Yu and Corey Clark
Guildhall, Southern Methodist University
ABSTRACT
Deep convolutional networks have achieved state-of-the-art results in various areas. Specifically, Leela Zero, a reproduction of the famous AlphaGo Zero, has achieved superhuman performance. Despite these recent achievements, the inner workings of these networks remain a black box. This has made it difficult to apply human knowledge directly to the networks. This poster proposes a method to introduce human knowledge directly into the network that mimics the instructor-student relationship seen in Constructivist learning theory. This poster utilizes Constructivist Augmented Machine Learning (CAML) methodology to replace existing kernels in a DNN with ideal kernels constructed by humans. Initially, mean-shift clustering is applied to the convolution kernels to reduce the problem space. This allows the human-in-the-loop methodology to identify and modify convolutional kernels to an ideal state. Our experiments show that a significant number of network kernels converge towards the ideal kernels in later versions of the network. This demonstrates that humans can identify improved convolution filters and suggests that, with the aid of human knowledge, the networks can be improved upon.
METHODS
The Leela Zero network was chosen for the following reasons:
- Ideal for clustering
- Small convolutional kernels of uniform size (3×3)
- Large number of kernels (5,242,880)
- Readily available trained networks
- Known network strength: successive network versions increase in strength
In this work, 7 networks were chosen (LZ-187 to LZ-193), and clustering was performed only on LZ-190. Kernels were normalized, and the sign of each entire kernel was conditionally flipped to ensure that the center value is always positive. The norms and signs were saved so that this process can later be reversed to recover the original kernels.
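A minimal sketch of this normalization step, assuming NumPy and the 3×3 kernels stored as an (N, 3, 3) array; the function names are illustrative, not from the poster:

```python
import numpy as np

def normalize_kernels(kernels):
    """Normalize 3x3 kernels and flip signs so the center value is positive.

    kernels: array of shape (N, 3, 3). Returns the normalized kernels plus
    the saved norms and signs so the transform can be inverted later.
    """
    flat = kernels.reshape(len(kernels), -1)
    norms = np.linalg.norm(flat, axis=1)
    norms[norms == 0] = 1.0                # guard against all-zero kernels
    unit = flat / norms[:, None]
    # Flip the whole kernel when the center element (index 4 of 9) is negative.
    signs = np.where(unit[:, 4] < 0, -1.0, 1.0)
    unit = unit * signs[:, None]
    return unit.reshape(-1, 3, 3), norms, signs

def denormalize_kernels(unit, norms, signs):
    """Invert the normalization to recover the original kernels."""
    flat = unit.reshape(len(unit), -1)
    return (flat * signs[:, None] * norms[:, None]).reshape(-1, 3, 3)
```

Saving the norms and signs makes the mapping a bijection, so adjusted kernels can later be written back into the network in their original scale.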
Mean-shift clustering is applied to the normalized kernels. The single input parameter of mean shift, the bandwidth, is adjusted to produce clusters averaging hundreds of data points each. Plotting the top centroids shows that these kernels have recognizable patterns such as pass-through filters, edge detectors, and gradients.
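The clustering step might look like the following sketch using scikit-learn's MeanShift; the random stand-in data and the bandwidth value are assumptions for illustration (the poster tunes the bandwidth until clusters average hundreds of points):

```python
import numpy as np
from sklearn.cluster import MeanShift

# Stand-in data: the poster clusters the normalized 3x3 kernels, flattened
# to 9-dimensional vectors. Random vectors are used here for illustration.
rng = np.random.default_rng(0)
kernels = rng.normal(size=(500, 9))

# The bandwidth is mean shift's single input parameter; 2.0 is illustrative.
ms = MeanShift(bandwidth=2.0)
labels = ms.fit_predict(kernels)
centroids = ms.cluster_centers_

# Rank clusters by membership count and keep the largest ("top") centroids.
counts = np.bincount(labels)
top = centroids[np.argsort(counts)[::-1][:32]]
```

Ranking clusters by size is what makes the largest, most common kernel patterns the ones presented to the human for adjustment.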
The top 32 centroid kernels are adjusted with the following methods:
- Sharpening
- Forcing symmetry
- Removing noise by setting small values to 0
These adjustments are based on human knowledge of what an ideal kernel might look like; e.g., a pass-through filter has a center value of 1 and all other values 0.
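Two of these adjustments, forcing symmetry and removing noise, can be sketched as below; the `noise_floor` threshold is an assumed value, and sharpening is omitted since it depends on the particular kernel pattern:

```python
import numpy as np

def adjust_kernel(k, noise_floor=0.05):
    """Adjust one 3x3 centroid kernel towards an 'ideal' form: force
    left-right and top-bottom symmetry, then zero out small noise values.
    noise_floor is an illustrative threshold, not taken from the poster."""
    k = k.copy()
    # Force symmetry by averaging the kernel with its mirror images.
    k = (k + np.fliplr(k)) / 2
    k = (k + np.flipud(k)) / 2
    # Remove noise: set near-zero entries to exactly 0.
    k[np.abs(k) < noise_floor] = 0.0
    return k

# Example: a noisy near-pass-through centroid becomes the ideal
# pass-through filter (center 1, all other values 0).
noisy = np.array([[0.02, -0.01,  0.03],
                  [0.01,  1.00, -0.02],
                  [0.02,  0.01,  0.01]])
ideal = adjust_kernel(noisy)
```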
A distance of d = 1.2 was chosen empirically. Kernels at a distance less than d from their cluster centroid are considered 'core kernels'. It is hypothesized that these kernels are more likely to eventually converge to an ideal kernel.
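The core-kernel selection reduces to a Euclidean distance threshold; a minimal sketch, assuming flattened 3×3 kernels and the poster's empirical d = 1.2:

```python
import numpy as np

def select_core_kernels(kernels, centroid, d=1.2):
    """Return the 'core kernels' of one cluster: those whose Euclidean
    distance to the cluster centroid is less than d (d = 1.2 in the poster),
    along with the distances themselves."""
    flat = kernels.reshape(len(kernels), -1)
    dist = np.linalg.norm(flat - centroid.reshape(1, -1), axis=1)
    return kernels[dist < d], dist
```

Tracking these per-kernel distances across successive network versions is what produces the convergence curves discussed in the results below.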
RESULTS
The average distance of core kernels to the original cluster centroids is plotted across the different networks. As shown in the graph, the lowest minimum distance occurs at LZ-190 in all but one case. This is not surprising, since clustering was performed on LZ-190. The graph shows that all kernels selected in a cluster converged over time towards the network where clustering was performed, demonstrating that kernels can evolve towards patterns identifiable by humans through the CAML process.
The average distance of core kernels to the adjusted cluster centroids is plotted across the different networks. For several of the clusters the minimum distance has shifted to newer networks. This indicates that the adjusted kernels were closer to a newer version of the network, and demonstrates that the human was able to correctly predict the direction the kernels would shift during training. Several of the graphs are also flatter on the right side, indicating that while the network did not converge on the adjusted kernel, it diverged from it less than from the original kernel, also marking an improvement. Finally, 3 of the minimum distances shifted to an older network, indicating that those clusters were diverging from the adjusted kernel.
Of the 32 kernels that were adjusted, 7 had their minimum distance shift to a newer network and 6 showed flattening in newer networks. Thus 40.6% of the kernels were improved upon, while 9.4% were worse than the original; the remainder showed no significant change.
This improvement over the original network shows that humans could predict where the kernels would converge if additional training data were present.
CONCLUSIONS
This work shows that by utilizing clustering, humans can identify patterns in the convolution kernels of trained networks and can construct ideal kernels that newer networks converge towards. This, in essence, is the transfer of human intuition and pattern recognition capabilities into a neural network. It mirrors Constructivist learning theory, where knowledge gained from past experiences is supplemented by social interaction and collaboration with an expert. This suggests that with the aid of human knowledge, the network training process can be accelerated and the model can be improved even after training. The method shown here could prove especially valuable in cases where training data is limited.