
vytal
Founder/CEO @liquidtf | @CTAP_GG
Head of @asiafortress
MSc AI @SussexUni
PhD AI @ -
AukyjhMSmN5VEGjQ9npeu6Eu9X21feL1qCbZSPeJpump
Hey everyone, thanks for the interest so far. Here's an explanation of what we've done.
TLDR: This is PPO plus living neurons in a closed loop. The policy “speaks” via stimulation, the cells “reply” via spikes, and the value function provides a surprise signal that I feed back through stimulation so the policy can communicate how good or bad an action was.
Before DOOM, there was Pong, which relied on hand-crafted mappings. In a tiny environment, you can manually define what feedback means and keep it consistent.
As the environment becomes more complex, handcrafted signals get harder and become inconsistent. The number of contexts where a signal must mean the same thing explodes, and you start reinventing invariance by hand.
DOOM is 3D and compositional. Walk + turn + shoot can happen at the same time. The right mapping cannot be a pile of rules, so I needed a generator of signals that stays coherent as behavior changes.
That is why I used PPO. The spikes are non-differentiable, and PPO's value function gives us an objective way to define a combined "surprise" signal that the policy and the cells can turn into an online feedback language.
The policy does not directly output "move forward" or "shoot." The policy outputs stimulation. The cells respond with spikes. Those spikes select the game action via a linear readout.
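To make the stimulation-in, spikes-out, action-via-readout loop concrete, here is a minimal sketch. The electrode count, action set, and the fake Poisson "spike response" are all my assumptions for illustration; in the real system the living culture produces the spikes, not a random generator.

```python
import numpy as np

rng = np.random.default_rng(0)

N_ELECTRODES = 32   # hypothetical electrode count, not from the post
N_ACTIONS = 5       # e.g. forward, back, turn-left, turn-right, shoot

# Linear readout: maps a spike-count vector to action logits.
W = rng.normal(scale=0.1, size=(N_ACTIONS, N_ELECTRODES))

def culture_response(stim_params):
    """Stand-in for the cells' reply to stimulation. The real loop
    reads spikes from living neurons; here we fake spike counts with
    a Poisson draw so the sketch runs end to end."""
    drive = np.clip(stim_params, 0.0, None)
    return rng.poisson(lam=1.0 + drive)

def select_action(stim_params):
    spikes = culture_response(stim_params)  # cells "reply" via spikes
    logits = W @ spikes                     # linear readout over spikes
    return int(np.argmax(logits))           # spike pattern picks the game action

# The policy "speaks" via stimulation, not by emitting actions directly:
stim = rng.normal(size=N_ELECTRODES)        # stand-in for the PPO policy output
action = select_action(stim)
```

The point of the sketch is the indirection: PPO optimizes the stimulation pattern, and only the cells' spike response, passed through the linear readout, decides what happens in the game.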
On top of that, the value function gives you an online estimate of the return, which lets you compute surprise as the prediction error. Based on this action surprise, we adjust the frequency and amplitude of our different feedback schemas accordingly. E.g., if an action was positive and the value function said "high surprise", we reduce the frequency of the positive feedback for that action, making outcomes more "predictable", which the cells prefer.
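A minimal sketch of that surprise computation and the feedback adjustment it drives. The post only specifies one rule (positive action + high surprise lowers the feedback frequency); the scaling constant `k` and the amplitude behaviour for negative actions are my assumptions, not the actual schema.

```python
def surprise(value_estimate, observed_return):
    """Surprise as the absolute value-prediction error."""
    return abs(observed_return - value_estimate)

def adjust_feedback(base_freq_hz, base_amp, s, positive, k=0.5):
    """Hypothetical feedback schema. For positive actions, higher
    surprise lowers the stimulation frequency, making outcomes more
    'predictable' for the cells (this rule is from the post). The
    negative-action branch (scaling amplitude with surprise) is an
    assumption made so the sketch is complete."""
    if positive:
        freq = base_freq_hz / (1.0 + k * s)   # more surprise -> lower frequency
        amp = base_amp
    else:
        freq = base_freq_hz
        amp = base_amp * (1.0 + k * s)        # assumed: stronger negative signal
    return freq, amp

# Value function predicted a return of 2.0, but the rollout returned 5.0:
s = surprise(value_estimate=2.0, observed_return=5.0)          # s = 3.0
freq, amp = adjust_feedback(base_freq_hz=10.0, base_amp=1.0,
                            s=s, positive=True)                # freq drops below 10 Hz
```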
Today's stream was great, people asked tons of amazing questions, and it opened up a lot more potential research avenues/experiments. Some of them include:
1. Could we transfer the knowledge from the brain cells of one CL1 to another CL1 via language-model-style distillation?
2. Could we train a small language model head with the CL1 to handle out-of-distribution tasks better?
