Dataset 3, what is it and when is it coming?

pianoman [MLC@Home Admin]
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Jun 20
Posts: 462
Credit: 21,406,548
RAC: 0
Message 463 - Posted: 9 Sep 2020, 22:57:59 UTC

It never fails. The minute you make a public post about how something isn't working, you figure out the issue and it starts working.

We have the Dataset 3 data ready and I am creating WUs out of it. Here are a few things to know:

What is Dataset 3
Dataset 3 aims to go "wide" where datasets 1 and 2 went "deep". We'll be looking at the same type of networks (RNNs), but instead of mimicking simple machines/functions, we'll be modeling the behavior of 100 different randomly generated deterministic finite automata (DFAs). These have several advantages: (1) it's easy to generate arbitrary new ones, (2) they're harder to learn, requiring much larger networks, and (3) they're more abstract, meaning there's less chance of results being influenced by some unexpected artifact of the particular machine we're studying. I'd like to have at least 100, if not 1000, examples from each network.

What do you mean DFA?
A DFA can be drawn as a graph, specifically a state transition diagram. The "machine" starts at an initial state (a node on the graph). When a new command is issued, it transitions along the edge associated with that command to a new state. In our case, some of the states emit an "output" that represents that state. So given a sequence of inputs, you can travel along the graph to get a series of outputs.

The DFAs targeted for Dataset 3 have 16 internal states, 14 of which emit output. The other 2 are hidden, which makes the problem more complicated because the output doesn't immediately reflect the current state of the system. There are four potential commands from each node. The automata are constructed so that there is at least one Hamiltonian cycle.
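As a rough illustration of the construction described above (not the project's actual generator, which isn't shown in this thread), here is a minimal Python sketch. The function name `make_random_dfa`, the way the Hamiltonian cycle is guaranteed (laying down a random cycle through all states first), and the uniform choice of the remaining edges are all assumptions made for the example.

```python
import random

N_STATES = 16      # internal states, per the post
N_EMITTERS = 14    # states that emit an output; the other 2 are hidden
COMMANDS = "abcd"  # four commands available from every state

def make_random_dfa(seed):
    """Return (transitions, emits, cycle) for one randomly generated DFA.

    transitions[state][command] -> next state
    emits[state] -> output symbol, or None for a hidden state
    cycle -> the state order of the guaranteed Hamiltonian cycle
    """
    rng = random.Random(seed)

    # Assumption: guarantee a Hamiltonian cycle by laying a random cycle
    # through all states and labelling each of its edges with a command.
    cycle = list(range(N_STATES))
    rng.shuffle(cycle)
    transitions = [dict() for _ in range(N_STATES)]
    for i, state in enumerate(cycle):
        nxt = cycle[(i + 1) % N_STATES]
        transitions[state][rng.choice(COMMANDS)] = nxt

    # Fill every remaining command with a uniformly random destination.
    for state in range(N_STATES):
        for cmd in COMMANDS:
            transitions[state].setdefault(cmd, rng.randrange(N_STATES))

    # Pick which states emit; the rest stay hidden.
    emits = [None] * N_STATES
    for symbol, state in enumerate(sorted(rng.sample(range(N_STATES), N_EMITTERS))):
        emits[state] = symbol
    return transitions, emits, cycle

transitions, emits, cycle = make_random_dfa(seed=42)
```

Because the generator is seeded, calling it with the same seed reproduces the same automaton, which is one plausible way to get a fixed set of 100 machines.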

Here's a graph of one:

[Image not preserved: state transition graph of an example Dataset 3 DFA]

Any updates to the client necessary?
While not strictly necessary, we're going to require client v9.6x or higher for Dataset 3 WUs. This version is unreleased at the moment, but it includes a new LSTM-based RNN that learns the network more consistently than the GRU-based model used in datasets 1 and 2.

What about space/time requirements
The networks for Dataset 3 are much larger, with 500,000+ parameters versus just 4,000+ parameters for datasets 1 and 2. However, we've also been making several behind-the-scenes tweaks to the client that will pay off in Dataset 3. For starters, Dataset 3 WUs will require less runtime memory than datasets 1 and 2. This is due to storing the dataset in memory as 8-bit ints instead of 32-bit floats, and cutting the batch size down by 4 as well. As such, while we're still testing and tweaking the right size, Dataset 3 WUs should only require ~400MB of memory at runtime, instead of ~750MB.

As for runtime, it looks like the runtime per epoch is about 30% longer, on average, than for datasets 1 and 2. This reflects the larger network we're training. We're still determining how many epochs are necessary to converge in most cases (it should be similar to the 128 currently used by datasets 1 and 2). Given that the average runtime of a dataset 1/2 WU is around 2.2 hours, this shouldn't be too onerous.
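To make the estimate concrete, here is a small back-of-the-envelope calculation using only the figures quoted above. It assumes runtime scales linearly with the number of epochs, which is an assumption for illustration; the helper name `estimated_hours` is hypothetical.

```python
# Figures quoted in the post
baseline_hours = 2.2      # average dataset 1/2 WU runtime
per_epoch_factor = 1.30   # Dataset 3 epochs take about 30% longer
baseline_epochs = 128     # epochs currently used by datasets 1 and 2

def estimated_hours(epochs):
    """Scale the baseline runtime, assuming time grows linearly with epochs."""
    return baseline_hours * per_epoch_factor * epochs / baseline_epochs

print(round(estimated_hours(128), 2))  # 2.86
```

So if the epoch count stays at 128, a Dataset 3 WU would land just under 3 hours on a machine that averaged 2.2 hours before.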

Credit awarded will be adjusted accordingly.

What's next?
Client 9.6x and test dataset 3 WUs should be rolling out within a day or so for Linux, with full deployment this weekend (hopefully).

As mentioned in the news, Dataset 4 is also in the works. This will work on images and train several variants of MNIST. That too will likely require an updated client, although the basics are already part of v9.6x in development, including a basic LeNet-based CNN model. The data will come from both the clean and modified versions of MNIST from the TrojAI project.

Thanks again for your patience, and your support.
bozz4science

Joined: 9 Jul 20
Posts: 142
Credit: 11,536,204
RAC: 3
Message 464 - Posted: 10 Sep 2020, 9:10:28 UTC - in response to Message 463.  
Last modified: 10 Sep 2020, 9:10:57 UTC

Awesome! Glad that it worked out this way :)

To be fair, I don't understand everything, so I have a few questions and would be very pleased if you could take a few minutes to clarify them for me and for others who might be interested.

As far as I understand, we are recreating the same type of RNNs but just training wide rather than deep.

1) Can you verify, in simple words, that what you are trying to achieve in run 3 is to first generate a set of 100 different randomly generated but deterministic finite automata, each of which is a specific kind of ANN structure in itself, and then to train another set of RNNs to model the behaviour of each DFA?
- DFAs can be described through a state transition model as seen in your original post with output emitting and hidden states
- RNNs are basically special ANNs in which neurons are linked to other neurons within the respective or prior layer, so that they are not straight feed-forward networks but can relay back information through those feedback links

Is that mostly correct?

2) How do you ensure that the trained networks converge safely after the mentioned 128 epochs? Do you verify this in pre-beta testing by running some WUs on your local machine?

3) Having mentioned the 500,000 vs. 4,000 trained parameters, I understand that the trained networks in run 3 have much more variability in the resulting trained models than those in the 1st and 2nd runs. Wouldn't that directly support your idea of favoring 1,000s of trained networks per generated finite automaton that is modelled?

4) Can the state transition model of the DFAs be compared to stationary Markov chains? How are probabilities of the state transitions modelled? Are they uniformly distributed?

5) Any specific reasoning going into the consideration of the # hidden internal states and/or the ratio of 14:2 output vs. hidden states?

6) What does any of the so-called "commands", labeled from a to d, in the DFA represent? Just some kind of state transition?

7) Don't we lose accuracy in the network training by transitioning from 32 bit floats to 8 bit ints or don't we need this much memory to represent the states/state transitions?

Great that the new client version is underway, and even though not having been restricted by RAM before, it's awesome that you could make it more efficient! Looking forward to your answers and crunching some run 3 and 4 WUs!

Sorry for the detailed questions. I don't mean to turn this into a data science and ML class, but my curiosity just couldn't be satisfied through a couple of Google queries :)

Thanks!!

pianoman [MLC@Home Admin]
Message 470 - Posted: 12 Sep 2020, 20:14:52 UTC - in response to Message 464.  

I'm thinking it would be a good idea to create a 3-minute YouTube video to explain MLDS.

As far as I understand, we are recreating the same type of RNNs but just training wide rather than deep.

1) Can you verify, in simple words, that what you are trying to achieve in run 3 is to first generate a set of 100 different randomly generated but deterministic finite automata, each of which is a specific kind of ANN structure in itself, and then to train another set of RNNs to model the behaviour of each DFA?
- DFAs can be described through a state transition model as seen in your original post with output emitting and hidden states
- RNNs are basically special ANNs in which neurons are linked to other neurons within the respective or prior layer, so that they are not straight feed-forward networks but can relay back information through those feedback links

Is that mostly correct?


Almost. Don't get hung up on the DFA. Its sole purpose is to provide a randomly generated "machine" with a consistent number of inputs, outputs, and internal states. It's not an ANN structure.

I'm specifically interested in modelling machines that take a sequence of inputs and generate a sequence of outputs. DFAs do that. Plus they have the benefit of being all abstract and computer-sciencey for writing CS papers. RNNs provide one method of modelling "machines" that have state, where both the current and previous inputs influence the next output generated. There are other methods (transformers, neural turing machines, temporal CNNs to name a few), but they all share the notion of having a "memory" of some sort to keep track of the current internal state of a network.


2) How do you ensure that the trained networks converge safely after the mentioned 128 epochs? Do you verify this in pre-beta testing by running some WUs on your local machine?


Yes, although in this case we may have jumped the gun a little bit. While we had two randomly generated DFAs that were learned in about 50 epochs, they may have been outliers, which is why I've released only a few testing WUs at the moment. If it turns out the average number of epochs needed to learn these models is significantly more than 128, we'll turn down the complexity (fewer states or fewer inputs). So expect some more tweaking based on initial results.


3) Having mentioned the 500,000 vs. 4,000 trained parameters, I understand that the trained networks in run 3 have much more variability in the resulting trained models than those in the 1st and 2nd runs. Wouldn't that directly support your idea of favoring 1,000s of trained networks per generated finite automaton that is modelled?


More examples are better, but one of the interesting findings will be how many examples you need before you can start differentiating. So I'm going to start with 100, and add more as necessary.


4) Can the state transition model of the DFAs be compared to stationary Markov chains? How are probabilities of the state transitions modelled? Are they uniformly distributed?


DFAs are deterministic; Markov chains are stochastic. If you get an input, you will transition to the next state in the DFA along that input (even if the "next state" is a loop back to the same current state).
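The distinction can be sketched in a few lines of Python. The two-state machine, the 50/50 probabilities, and the function names below are made up for illustration and are not taken from the project.

```python
import random

# DFA: (state, command) -> exactly one next state, no probabilities involved.
dfa_next = {
    ("s0", "a"): "s1",
    ("s0", "b"): "s0",   # a loop back to the same state is still a transition
    ("s1", "a"): "s0",
    ("s1", "b"): "s1",
}

# Markov chain: state -> probability distribution over next states.
markov_next = {
    "s0": [("s0", 0.5), ("s1", 0.5)],
    "s1": [("s0", 0.5), ("s1", 0.5)],
}

def dfa_step(state, command):
    """Deterministic: the same state and command always give the same result."""
    return dfa_next[(state, command)]

def markov_step(state, rng):
    """Stochastic: the next state is sampled, so repeated calls can differ."""
    states, weights = zip(*markov_next[state])
    return rng.choices(states, weights=weights)[0]
```

Calling `dfa_step("s0", "a")` any number of times always returns `"s1"`, while `markov_step("s0", random.Random())` can return either state.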


5) Any specific reasoning going into the consideration of the # hidden internal states and/or the ratio of 14:2 output vs. hidden states?


Honestly? Not much. We wanted the number of hidden states to be more than 1 and less than the total number of states. The fewer emitter states there are, the harder the DFA is to model, as each emitter state gives insight into the internal state of the network. So it's a balance that may be tuned based on the initial results.


6) What does any of the so-called "commands", labeled from a to d, in the DFA represent? Just some kind of state transition?


They're the inputs. Inputs to the DFA are a string of commands, so ['a','c','d','c','a','b'] will traverse the DFA starting at the start node and following the edge with each command's label. When it reaches the edge's destination, if it's an emitter node, it will change its output at that timestep; otherwise it'll emit the last known emitted output again. So for the above example, input sequence ['a','c','d','c','a','b'] results in an output sequence [12, 13, 5, 6, 11, 7] (assuming I'm reading the cramped graph right).
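The traversal rule described above (follow the labelled edge, emit the new output at an emitter state, otherwise repeat the last one) can be sketched as follows. The three-state toy machine is hypothetical and is not the DFA from the graph; `run_dfa` and the `initial_output` default are likewise assumptions for the example.

```python
def run_dfa(transitions, emits, start, commands, initial_output=None):
    """Feed a command sequence through a DFA and collect one output per step.

    Hidden states (emits[state] is None) repeat the last emitted output.
    """
    state = start
    last_output = initial_output
    outputs = []
    for cmd in commands:
        state = transitions[state][cmd]      # follow the edge with this label
        if emits[state] is not None:         # emitter state: update the output
            last_output = emits[state]
        outputs.append(last_output)
    return outputs

# Hypothetical toy machine: 3 states, 2 commands, state 2 is hidden.
toy_transitions = {
    0: {"a": 1, "b": 2},
    1: {"a": 2, "b": 0},
    2: {"a": 0, "b": 2},
}
toy_emits = {0: 10, 1: 11, 2: None}

print(run_dfa(toy_transitions, toy_emits, 0, ["a", "a", "b", "a", "b"]))
# [11, 11, 11, 10, 10]
```

The repeated 11s and the final 10 come from steps that land on the hidden state, which is exactly what makes the internal state harder to infer from the outputs alone.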

The training data each WU receives is a collection of randomly generated input sequences and the resulting output sequences from the DFA. I generate this offline. What you do is take that training data and train an RNN to mimic the behavior of the DFA as closely as possible.
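A minimal sketch of how such a collection of input/output pairs could be assembled is shown below. The real offline generator is not shown in this thread, so `make_dataset`, its parameters, and the stand-in `toy_machine` (a tiny four-state counter rather than a 16-state DFA) are all hypothetical.

```python
import random

COMMANDS = "abcd"

def toy_machine(commands):
    """Hypothetical stand-in for a DFA: output is the number of 'a's seen, mod 4."""
    count, outputs = 0, []
    for cmd in commands:
        if cmd == "a":
            count = (count + 1) % 4
        outputs.append(count)
    return outputs

def make_dataset(machine, n_sequences, seq_len, seed=0):
    """Pair randomly generated command sequences with the machine's outputs."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_sequences):
        inputs = [rng.choice(COMMANDS) for _ in range(seq_len)]
        data.append((inputs, machine(inputs)))
    return data

dataset = make_dataset(toy_machine, n_sequences=1000, seq_len=32, seed=0)
```

Each `(inputs, outputs)` pair is one training example; the RNN only ever sees these pairs, never the machine itself.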


7) Don't we lose accuracy in the network training by transitioning from 32 bit floats to 8 bit ints or don't we need this much memory to represent the states/state transitions?


The network still uses 32-bit floats, so we won't lose any accuracy. However, the training data (all these sequences of inputs and outputs I mentioned), which is loaded into memory and kept there, is now stored as 8-bit ints. Right before a batch is sent to the network, it's converted to 32-bit floats, used for that computation, and then discarded. So it's just an optimization to keep the memory size of each WU down, but it does *not* affect precision, since the data is converted to 32-bit floats whenever it's actually used for computation.
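The idea can be illustrated with Python's standard `array` module. This is only a sketch of the storage trick; the client's actual implementation is not shown here, and the sample values and the `batches` helper are made up for the example.

```python
from array import array

# Hypothetical one-hot training data: every value is 0 or 1.
values = [0, 0, 1, 0] * 4096

stored = array("b", values)     # signed 8-bit ints, what stays in memory
baseline = array("f", values)   # what 32-bit float storage would cost

def batches(data, batch_size):
    """Yield a temporary 32-bit float copy of each successive slice."""
    for start in range(0, len(data), batch_size):
        chunk = data[start:start + batch_size]
        yield array("f", (float(v) for v in chunk))

bytes_stored = stored.itemsize * len(stored)
bytes_baseline = baseline.itemsize * len(baseline)
print(bytes_stored, bytes_baseline)
```

Since 0 and 1 are represented exactly in both formats, the float copy handed to the network is identical to what float storage would have provided; only the resident copy is a quarter of the size.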

The datasets are all one-hot encoded, so every value is actually 0 or 1. So the dataset could, in theory, be even more compact, as you only need one bit per value. But really, at this point the memory usage is driven by batch size and model size.
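For readers unfamiliar with one-hot encoding, here is a small sketch of what it looks like for the four commands, plus the "one bit per value" packing mentioned as a theoretical option. Both helper names are hypothetical, and the bit packing is not something the client is stated to do.

```python
COMMANDS = "abcd"

def one_hot(cmd):
    """Encode a command as a vector with a single 1 in its own slot."""
    return [1 if c == cmd else 0 for c in COMMANDS]

def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, 8 values per byte, MSB first."""
    padded = bits + [0] * (-len(bits) % 8)   # pad up to a whole byte
    out = bytearray()
    for i in range(0, len(padded), 8):
        byte = 0
        for b in padded[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)

bits = one_hot("a") + one_hot("c")   # [1, 0, 0, 0, 0, 0, 1, 0]
print(len(bits), len(pack_bits(bits)))  # 8 1
```

Eight one-hot values that take 8 bytes as 8-bit ints fit into a single byte when packed, an 8x saving on top of the 4x already gained over 32-bit floats.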

Hope that helps.
bozz4science

Message 487 - Posted: 17 Sep 2020, 16:04:48 UTC - in response to Message 470.  
Last modified: 17 Sep 2020, 16:05:47 UTC

You can't imagine how much I appreciate your patience and work ethic! Thank you for trying to educate us on the science background on top of running the project here.

1) Well explained on the DFAs. That's basically all news to me. Never really handled DFAs before. Also I acknowledged that RNNs are basically a network type that can be categorized into "memory"-like networks, with RNNs being special in the way you described.

2) Interesting feedback! Was really concerned with this to really assure accurate and statistically relevant results. Thanks.

3) Haven't thought of this type of question before but must admit that it's definitely an interesting one worth further consideration.

4) Sorry for my mistake here, my bad. Having had Markov chains in mind in robotics/AI state transition use cases, stochastic state models make much more sense. Given the great explanation in Q1 where you already outlined the motivation and inherent characteristics of DFAs, it makes total sense now.

5) I see the logic behind that!

6) That's where most of my confusion was coming from. Thinking of "actions" that trigger stochastic/probabilistic state transitions in Markov Chains/Hidden Markov models is what triggered this question.

7) Awesome thought experiment behind this optimization! Neatly implemented I must say :) Great to see that the training precision is not impacted thanks to the "behind the scenes" forward memory loading and backward conversion for the network training.

Very keen to add a MLC badge soon to my steadily growing set as I particularly like this project and expanding my ML knowledge on the go!

pianoman [MLC@Home Admin]
Message 502 - Posted: 20 Sep 2020, 15:59:10 UTC

A more general update on Dataset 3.

The mldstest application has been great for shaking out the bugs with Dataset 3 WUs, so thanks to all those who are running mldstest. All 100 automata are ready to go, and the WUs are pretty much set. The only thing we're trying to fix at the moment is the back-end validation, which is shared with datasets 1 and 2, and getting all three to work with common code is proving to be a bit harder than expected. Still, the client side appears to be fine, so once a few minor tweaks are tested, we should be good to release.

The final WUs take about 2.25x longer to run, but are much more consistent now, to the point where a single WU should train an entire network reliably within 192 epochs. Credit will be adjusted accordingly. Memory usage is actually a bit lower than planned, coming in under 350MB per WU.
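As a quick sanity check on these numbers, the calculation below splits the 2.25x figure into an epoch-count part and a per-epoch part. It assumes the 2.25x is measured against a 128-epoch dataset 1/2 WU averaging 2.2 hours (both figures quoted earlier in the thread) and that runtime scales linearly with epochs; neither assumption is confirmed in the post.

```python
runtime_factor = 2.25   # final Dataset 3 WU vs. earlier WUs, from the post
new_epochs = 192        # from the post
old_epochs = 128        # epoch count quoted earlier for datasets 1 and 2
baseline_hours = 2.2    # average dataset 1/2 WU runtime quoted earlier

epoch_factor = new_epochs / old_epochs             # 1.5x more epochs
per_epoch_factor = runtime_factor / epoch_factor   # implied 1.5x per epoch
estimated_hours = baseline_hours * runtime_factor

print(epoch_factor, per_epoch_factor, round(estimated_hours, 2))
# 1.5 1.5 4.95
```

Under those assumptions, an average machine would spend roughly 5 hours on a Dataset 3 WU, with half of the slowdown coming from the extra epochs and half from the larger network.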

Once the server-side validation is worked out (it's difficult to test without actually sending WUs out to clients), the full Dataset 3 will be unleashed.
bozz4science

Message 504 - Posted: 20 Sep 2020, 16:07:35 UTC - in response to Message 502.  

Great! Thanks for the update. Even less memory and more consistency in the runtimes is quite some accomplishment.

My system will start crunching as soon as they are released. 2.25 x prior runtime means ~7.5 hrs per WU on my old Xeon chip. We'll see.

Cheers
bozz4science

Message 999 - Posted: 30 Dec 2020, 10:36:23 UTC

Will you be "introducing" the ds4 experiment in a similar manner? Would highly appreciate a few details like you did here for ds3!


©2024 MLC@Home Team
A project of the Cognition, Robotics, and Learning (CORAL) Lab at the University of Maryland, Baltimore County (UMBC)