MLC@Home: Machine Learning Comprehension @ Home

This page is still under construction; please read with that in mind.

Applying Data Science to Data Science

The spirit of MLC@Home is building tools and running analyses to understand machine learning models. The first project launched under MLC@Home is the Machine Learning Dataset Generator (MLDS). MLDS aims to build (and make public!) a massive dataset of thousands of neural networks trained on similar, highly controlled data. By doing so, we can turn the lens inward and apply some of the same data science techniques we already use to build these models to understanding the models themselves.

To our knowledge this is the first dataset of its kind.

Why Are Neural Networks So Hard To Understand?

Imagine this simple equation:

10 + X - Y + 4*Z ~= 0

There are an infinite number of possibilities for (X, Y, Z) that will solve that equation, such as (10, 10, -2.5), or (0, 10, 0), or (5632346, 5632356, 0). All are valid, and their only commonality is that they solve the equation when substituted.

Neural network training is a bit like the above equation. A network has an input (a starting point, like 10 above), a defined set of operations (its structure, like the +, -, * above), an output (like 0 above), and a set of weights (the X, Y, Z above). Training the network starts by randomly assigning values to the weights, and then changing those values to match as many of the training examples as possible. This process repeats over and over until the predicted value is close or equal to the desired value. Unlike the equation above, though, where we as humans can quickly substitute the values and make sure they work, the networks we deal with contain millions of parameters. The end result is that even networks with the same structure that perform similarly can have wildly different weights.
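
To make the analogy concrete, here is a minimal sketch that "trains" (X, Y, Z) on the equation above with plain gradient descent, starting from different random values. The function names, learning rate, and step count are illustrative assumptions, not part of MLDS. Each run ends up solving the equation, but with different weights.

```python
import random

def residual(w):
    """Left-hand side of 10 + X - Y + 4*Z; zero means the equation is solved."""
    x, y, z = w
    return 10 + x - y + 4 * z

def train(seed, lr=0.01, steps=2000):
    """Start from random weights and follow the gradient of residual**2."""
    rng = random.Random(seed)
    w = [rng.uniform(-10, 10) for _ in range(3)]
    grad_dir = (1.0, -1.0, 4.0)  # d(residual)/dX, d(residual)/dY, d(residual)/dZ
    for _ in range(steps):
        r = residual(w)
        # Chain rule: d(r**2)/dw = 2 * r * d(r)/dw
        w = [wi - lr * 2 * r * gi for wi, gi in zip(w, grad_dir)]
    return w

# Three runs that differ only in their random starting point.
solutions = [train(seed) for seed in range(3)]
for w in solutions:
    print([round(v, 3) for v in w], "residual:", round(residual(w), 6))
```

Every printed residual should be close to zero, while the three weight triples differ from one another. That is the small-scale version of the problem described above.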

And yet, buried in that structure and those piles of learned weights is all the information necessary to solve the problem the network was trained for. We just need to unlock it.

Current Dataset Status

Dataset 1

Name                  Complete  In Progress  Total
SingleDirectMachine      10002            0  10004
EightBitMachine          10001            4  10007
SingleInvertMachine      10001            0  10003
SimpleXORMachine         10000            0  10002
ParityMachine              869         9135  10012

Dataset 2

Name                  Complete  In Progress  Total
ParityModified             260         9745  10005
EightBitModified          6442         3564  10006
SimpleXORModified        10005            0  10005
SingleDirectModified     10004            0  10004
SingleInvertModified     10002            0  10002

Dataset 3

Overall Completed: 34767/40125
Milestone 1 (100x100): COMPLETE (10000/10000)
Milestone 2 (1000x100): [progress indicator; legend: 0%, <25%, 25%-75%, >75%, 100%]
Last update: 2020-10-25 04:00:20.386674

What Datasets Is MLDS Building?

MLDS is taking a phased approach to generating a dataset, starting with a simple "Dataset 1" and adding more datasets of increasing complexity going forward. Each dataset download contains a README file describing what's included and the technical details of the training.

Dataset 1 (status: In Progress)

On June 30th, 2020, MLDS started building a very simple dataset of neural networks based on 5 simple machines, as detailed in this paper from 2018. These are simple recurrent neural networks (RNNs) with 4 GRU layers followed by 4 linear layers that translate a series of commands into a series of outputs. They were chosen because a) the authors had the datasets readily available, b) they're relatively small and easy to train, and c) they make a good test case for the entire MLC@Home infrastructure. The current plan is to collect 10000 samples of each network type.
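
The number of parameters in such a network follows directly from its layer sizes. The sketch below counts them for a stack of GRU layers followed by linear layers, assuming the weight layout used by PyTorch's nn.GRU (three gates, each with input weights, recurrent weights, and two bias vectors). The layer sizes in the example call are hypothetical placeholders; the sizes MLDS actually uses are not stated on this page.

```python
def gru_layer_params(input_size, hidden_size):
    """Parameters in one GRU layer: 3 gates, each with input weights,
    recurrent weights, and two bias vectors (PyTorch nn.GRU layout)."""
    return 3 * hidden_size * (input_size + hidden_size + 2)

def linear_params(in_features, out_features):
    """Parameters in one fully connected layer: weight matrix plus bias."""
    return in_features * out_features + out_features

def count_params(input_size, hidden_size, linear_sizes, num_gru_layers=4):
    """Total parameters for stacked GRU layers followed by linear layers."""
    total = 0
    size = input_size
    for _ in range(num_gru_layers):
        total += gru_layer_params(size, hidden_size)
        size = hidden_size  # later GRU layers read the previous layer's output
    for out_features in linear_sizes:
        total += linear_params(size, out_features)
        size = out_features
    return total

# Hypothetical sizes, chosen only to show the call.
total = count_params(input_size=8, hidden_size=12, linear_sizes=[12, 12, 12, 8])
print("parameters:", total)
```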

Machine    # Parameters    # Samples

Dataset 2 (status: In Progress)

Dataset 2 is envisioned to contain similar samples to Dataset 1, but modified for subtle changes in behavior, such as giving the wrong output for a few samples if a certain sequence of inputs is sent. This is to determine if and how easily we can detect networks that have been trained with such "extra" information that may not be easily detectable otherwise.
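
As an illustration of what such a modification might look like, the sketch below wraps a toy machine so that it gives wrong outputs for a few steps after a specific input sequence appears. The base machine, the trigger sequence, and the number of wrong steps are all hypothetical; they are not the machines or modifications used in Dataset 2.

```python
def base_machine(commands):
    """Hypothetical stand-in for a simple machine: a single output bit
    that toggles whenever the command is 1."""
    state = 0
    outputs = []
    for c in commands:
        if c == 1:
            state ^= 1
        outputs.append(state)
    return outputs

def modified_machine(commands, trigger=(1, 0, 1), wrong_steps=2):
    """Same behavior as base_machine, except that the `wrong_steps` outputs
    following each occurrence of `trigger` in the inputs are flipped."""
    clean = base_machine(commands)
    outputs = []
    remaining = 0  # how many upcoming outputs still need to be flipped
    n = len(trigger)
    for i, out in enumerate(clean):
        if remaining > 0:
            outputs.append(out ^ 1)
            remaining -= 1
        else:
            outputs.append(out)
        # Check whether the last n commands match the trigger.
        if tuple(commands[max(0, i - n + 1): i + 1]) == trigger:
            remaining = wrong_steps
    return outputs

commands = [0, 0, 1, 0, 1, 0, 0, 1]
print(base_machine(commands))
print(modified_machine(commands))
```

On inputs that never contain the trigger, the two machines are indistinguishable, which is what makes this kind of modification hard to spot from behavior alone.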

Machine    # Parameters    # Samples

Dataset 3 (status: pre-MLDS Testing)

Dataset 3 will move beyond these simple machines to learning randomly generated transducers with a number of hidden states. This will allow generating a large number of networks trained to mimic very similar but not identical machines.
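
One way to picture a randomly generated transducer is as a random transition table over a fixed number of hidden states. The sketch below builds one and runs it on an input sequence. The representation (a Mealy-style table mapping state and input to next state and output), the function names, and the sizes are assumptions for illustration, not the generator MLDS uses.

```python
import random

def random_transducer(num_states, num_inputs, num_outputs, seed):
    """Build a random table: (state, input) -> (next_state, output)."""
    rng = random.Random(seed)
    return {
        (s, a): (rng.randrange(num_states), rng.randrange(num_outputs))
        for s in range(num_states)
        for a in range(num_inputs)
    }

def run(table, inputs, start_state=0):
    """Feed a sequence of inputs through the transducer, one output per input."""
    state = start_state
    outputs = []
    for a in inputs:
        state, out = table[(state, a)]
        outputs.append(out)
    return outputs

# Two transducers of the same size that differ only in their random seed.
machine_a = random_transducer(num_states=4, num_inputs=2, num_outputs=2, seed=1)
machine_b = random_transducer(num_states=4, num_inputs=2, num_outputs=2, seed=2)
inputs = [0, 1, 1, 0, 1, 0, 0, 1]
print(run(machine_a, inputs))
print(run(machine_b, inputs))
```

Changing only the seed, or only a handful of table entries, gives machines that are similar but not identical.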

Other Datasets

Other potential ideas for future datasets include re-doing the example datasets with transformers instead of RNNs, or comparing how different optimizers make classification easier or harder.

Ideas For Experiments With This Data

There are lots of ideas for how to use this dataset to gain insight into how networks learn. Can we build a classifier that identifies which training data was used to train which network? What if some networks were trained with subtle differences: could we detect such differences and separate those networks from the "normal" ones? Networks are also directed, weighted, cyclic information flow graphs. Would graph classification be more successful than a simple classifier built by concatenating the weights together into one big feature vector? How about comparing network types, like an RNN versus a Transformer? Is one or the other easier to classify?
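
The "simple classifier" idea can be sketched in a few lines: flatten each network's weights into one vector, then assign a new network to the class whose average vector is closest. This is only a baseline sketch. The nearest-centroid rule and the tiny hand-written "networks" below are assumptions chosen for illustration; no claim is made that this works on the real dataset.

```python
def flatten(weights):
    """Concatenate a nested list of weight arrays into one flat feature vector."""
    flat = []
    for item in weights:
        if isinstance(item, (list, tuple)):
            flat.extend(flatten(item))
        else:
            flat.append(float(item))
    return flat

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def nearest_centroid(vector, centroids):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(vector, centroids[label]))

# Hypothetical toy "networks": two layers of weights each, grouped by label.
training = {
    "machine_a": [[[0.9, 1.1], [0.2]], [[1.0, 0.8], [0.1]]],
    "machine_b": [[[-1.0, -0.9], [2.0]], [[-1.2, -1.1], [2.2]]],
}
centroids = {
    label: centroid([flatten(net) for net in nets])
    for label, nets in training.items()
}
print(nearest_centroid(flatten([[1.0, 1.0], [0.0]]), centroids))
```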

Analysis and Results

None yet, although preliminary informal results show promise in being able to classify which networks in Dataset 1 were trained with which data.