Neural Simulator
Lightweight Multi-Modal World Models for Interactive Environments
Live Interactive Demo
Playable Neural Simulation
Background Info
This project treats simulation as something that can be learned from interaction. A compact model observes a world, receives an action, and predicts what should happen next.
The same idea can be used for autonomous driving and game simulation: learn the behavior of an environment well enough to roll it forward interactively, then make that learned simulator small enough to run close to the user.
Methodology
For each simulation, I collected examples of an agent moving through an environment, then trained models to predict what should happen next from those interaction traces.
CARLA Driving
For driving, I used CARLA, an autonomous driving simulator with roads, traffic, sensors, and controllable vehicles. I let the autopilot drive around maps and recorded sensor streams alongside privileged simulator state: camera frames, LiDAR, ego telemetry, controls, and route context.
TinySkies Flight
TinySkies is an open-source browser flying game with simple 3D visuals and a third-person ego plane orbiting a procedurally generated planet. Here the model infers motion and physics from transitions between RGB frames, encoded as latents, rather than from explicit simulator sensors.
Core Differentiators
Tiny On-Device Runtime
Where large generative environment models typically require server-scale inference, this project targets compact ONNX/browser deployment so the simulator can run locally and scale horizontally across many cheap clients.
Human-Visible World Model
Unlike purely latent training worlds, the rollout is visible and playable. A human can drive inside the learned simulator and see what the model believes the environment state has become.
Neural Game Engine
The model is treated as an executable environment layer: useful for policy training, synthetic simulation populations, game mechanics variation, and new modes with minimal hand-authored code.
Project Overview
This work explores compact neural simulators: small world models that learn environment behavior from interaction data.
Learned Dynamics
The model watches an environment evolve, receives a control input, and predicts the next state.
Playable Rollout
The simulator is visible and interactive, so users can drive inside the model instead of inspecting only offline samples.
Compact Runtime
The target is browser-scale deployment: small enough to run close to the user with low latency.
Can a compact neural network learn enough of an environment's dynamics to function as a usable simulator?
Latent Dynamics
The simulator learns in a compressed representation of the world, where visual structure, motion, controls, and sensor context can be modeled as dynamics instead of raw pixels alone.
Changing Worlds
Training on interactive trajectories forces the model to learn how environments respond over time, not just how individual frames look.
Reusable State
Latent representations can become compact state descriptions for policies, planning, replay, search, and synthetic rollout generation.
Shared Abstraction
The same latent space can connect vision, controls, telemetry, and future prediction into one downstream-friendly interface.
Core Thesis
Traditional simulators are hand-authored. Neural simulators learn the transition rules directly from experience.
Observe
Use the current visual state and supporting sensor context as the model's view of the world.
Act
Condition each transition on steering, throttle, braking, or gameplay movement.
Predict
Roll the world forward one step, then recursively use the generated state as the next input.
S_t + A_t -> S_{t+1}
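The observe, act, predict loop can be sketched in a few lines. This is a minimal illustration, not the project's model: `toy_step` is a hypothetical stand-in for the learned transition function, the state is a plain `[position, velocity]` list, and the step size is an assumed value.

```python
from typing import Callable, List, Sequence

State = List[float]
Action = List[float]


def toy_step(state: State, action: Action) -> State:
    """Hypothetical stand-in for a learned transition model.

    state is [position, velocity]; action is [acceleration].
    """
    dt = 0.1  # assumed step size, for illustration only
    position, velocity = state
    new_velocity = velocity + action[0] * dt
    new_position = position + new_velocity * dt
    return [new_position, new_velocity]


def rollout(step: Callable[[State, Action], State],
            initial_state: State,
            actions: Sequence[Action]) -> List[State]:
    """Apply `step` recursively, feeding each prediction back in as the next input."""
    states = [list(initial_state)]
    for action in actions:
        states.append(step(states[-1], action))
    return states


trajectory = rollout(toy_step, [0.0, 0.0], [[1.0], [1.0], [0.0]])
print(trajectory[-1])  # final [position, velocity] after three steps
```

Because every step consumes the previous prediction, any error made by the step function is carried forward, which is why long recursive rollouts are the demanding case for a learned simulator.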
Why This Matters
The practical question is whether parts of a heavy simulator can become portable learned behavior.
165 GB Simulator
A full CARLA installation can require heavyweight assets, runtime services, and dedicated compute.
Small Neural Runtime
A learned transition model can represent a narrow slice of behavior in a much smaller deployable package.
Portable Rollouts
Once the simulator is compact, it can move into browsers, edge devices, and distributed training loops.
High-Level Architecture
The system is built as a recursive multi-modal transition model. It fuses RGB observations, LiDAR projections, telemetry streams, waypoint trajectories, and control vectors into a latent transition representation.
Decoder heads predict future environment state, allowing autonomous rollout through learned latent imagination. The same generalized latent-world modeling framework spans autonomous driving and interactive gameplay domains.
- Visual frames are encoded through convolutional layers
- LiDAR is independently encoded into a compact latent vector
- Telemetry, waypoint, and control embeddings are projected into learned feature spaces
- Modalities are fused into a shared latent transition representation
- Decoder heads predict the next RGB frame, LiDAR latent, telemetry state, and waypoint trajectory
S_hat_{t+1} -> S_hat_{t+2} -> S_hat_{t+3}
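As a rough sketch of the fuse-then-decode structure listed above, the snippet below concatenates per-modality embeddings, projects them into one shared latent, and reads out one prediction per head. Everything here is an assumption made for illustration: the class name `ToyTransitionModel`, the dimensions, the single linear fusion layer with `tanh`, and the random untrained weights. It also assumes the visual and LiDAR inputs have already been encoded to vectors, and that the visual head predicts a latent rather than pixels.

```python
import math
import random
from typing import Dict, List, Tuple

Vector = List[float]
Linear = Tuple[List[Vector], Vector]  # (weights, bias)


def make_linear(n_in: int, n_out: int, rng: random.Random) -> Linear:
    """Create a small randomly initialised linear layer (untrained)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(weights: List[Vector], bias: Vector, x: Vector) -> Vector:
    """Compute weights @ x + bias with plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class ToyTransitionModel:
    """Hypothetical fusion model: concatenate, project to a latent, decode per head."""

    def __init__(self, input_dims: Dict[str, int], latent_dim: int,
                 head_dims: Dict[str, int], seed: int = 0) -> None:
        rng = random.Random(seed)
        self.order = sorted(input_dims)  # fixed concatenation order
        self.fuse = make_linear(sum(input_dims.values()), latent_dim, rng)
        self.heads = {name: make_linear(latent_dim, dim, rng)
                      for name, dim in head_dims.items()}

    def step(self, inputs: Dict[str, Vector]) -> Dict[str, Vector]:
        fused_in = [v for name in self.order for v in inputs[name]]
        latent = [math.tanh(h) for h in linear(*self.fuse, fused_in)]
        return {name: linear(*head, latent) for name, head in self.heads.items()}


# Illustrative sizes only; the real embedding widths are not stated in the text.
input_dims = {"rgb_latent": 8, "lidar_latent": 4, "telemetry": 3,
              "waypoints": 6, "controls": 3}
head_dims = {"rgb_latent": 8, "lidar_latent": 4, "telemetry": 3, "waypoints": 6}

model = ToyTransitionModel(input_dims, latent_dim=16, head_dims=head_dims)
inputs = {name: [0.5] * dim for name, dim in input_dims.items()}
prediction = model.step(inputs)
print({name: len(vec) for name, vec in prediction.items()})
```

Controls appear only on the input side: they condition the transition but are not themselves predicted, which matches the list of decoder heads above.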
VAE
The project uses a Variational Autoencoder (VAE) to compress high-dimensional visual observations into a compact latent representation before simulation learning occurs. Rather than asking the world model to simultaneously learn rendering, compression, and environment dynamics end-to-end, the VAE isolates the vision problem into a dedicated representation-learning stage.
The encoder learns a smooth latent manifold of visual structure, motion, topology, and scene geometry from raw RGB frames, while the decoder reconstructs the original observations from this compressed space. The downstream simulator then operates directly on these latent representations instead of raw pixels, allowing the transition model to focus on predicting how environment state evolves over time under actions and control inputs.
This separation of concerns improves efficiency, stabilizes training, reduces dimensionality, and creates a cleaner abstraction between visual perception and learned environment dynamics.
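For readers unfamiliar with VAEs, the two pieces that make the latent space smooth are the reparameterized sample and the KL penalty toward a standard normal prior. The sketch below shows the textbook formulation with a diagonal Gaussian posterior; it is not taken from the project's code, and the function names, the `beta` weight, and the example numbers are assumptions.

```python
import math
import random
from typing import List


def reparameterize(mu: List[float], logvar: List[float],
                   rng: random.Random) -> List[float]:
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), where sigma = exp(logvar / 2)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu: List[float], logvar: List[float]) -> float:
    """KL(N(mu, sigma^2) || N(0, 1)) summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, logvar))


def vae_loss(reconstruction_error: float, mu: List[float],
             logvar: List[float], beta: float = 1.0) -> float:
    """Reconstruction term plus a beta-weighted KL term."""
    return reconstruction_error + beta * kl_to_standard_normal(mu, logvar)


rng = random.Random(0)
mu, logvar = [0.2, -0.1], [0.0, -1.0]  # made-up encoder outputs
z = reparameterize(mu, logvar, rng)
loss = vae_loss(0.5, mu, logvar)
print(z, loss)
```

Once the VAE is trained, the transition model would consume vectors like `z` (or the mean `mu`) in place of raw frames.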
Architectural Philosophy
Preserve Behavior
The model is judged by whether controls produce coherent state changes, not by photorealism alone.
Stay Inspectable
The rollout remains human-visible, making failures and strengths easier to see during interaction.
Deploy Nearby
The architecture favors small, low-latency runtimes that can run close to the player or policy.
Metrics of Success
The project evaluates world models through practical dimensions that matter for interactive simulation.
- Transition consistency over long recursive rollouts
- Policy transferability from generated trajectories back to the source environment
- Latency close to interactive framerates
- Compression ratio between classical simulator footprint and neural transition model size
- Cross-domain generalization across autonomous driving, browser games, and lightweight gameplay environments
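Two of these metrics reduce to simple arithmetic, sketched below under stated assumptions. `per_step_mse` compares a recursive rollout against a reference trajectory one step at a time, and `compression_ratio` divides simulator footprint by model size. The function names are hypothetical, the trajectories are made-up numbers, and the 50 MB model size is a placeholder; only the 165 GB figure comes from the text.

```python
from typing import List, Sequence


def per_step_mse(predicted: Sequence[Sequence[float]],
                 reference: Sequence[Sequence[float]]) -> List[float]:
    """Mean squared error at each rollout step between two state trajectories."""
    errors = []
    for pred_state, ref_state in zip(predicted, reference):
        squared = [(p - r) ** 2 for p, r in zip(pred_state, ref_state)]
        errors.append(sum(squared) / len(squared))
    return errors


def compression_ratio(simulator_bytes: float, model_bytes: float) -> float:
    """How many times smaller the learned model is than the source simulator."""
    return simulator_bytes / model_bytes


# Made-up trajectories: the prediction drifts further from the reference each step.
predicted = [[0.0, 0.0], [1.0, 1.0], [2.0, 3.0]]
reference = [[0.0, 0.0], [1.0, 2.0], [2.0, 5.0]]
drift = per_step_mse(predicted, reference)
print(drift)

# 165 GB is the simulator footprint quoted in the text; 50 MB is a placeholder.
ratio = compression_ratio(165 * 10**9, 50 * 10**6)
print(ratio)
```

Plotting the per-step error against rollout horizon gives a direct view of how quickly a recursive rollout departs from the source environment.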
Research Direction
The goal is not to replace traditional engines entirely. It is to learn the parts of simulation that can become predictive, portable, and interactive.
Hybrid Engines
Combine hand-authored structure with learned transition behavior.
Generated Worlds
Train policies and explore mechanics inside compact learned rollouts.
Neural Replay
Turn recorded interaction streams into controllable predictive environments.
Data And Training Infrastructure
Record
Capture interaction traces from driving simulation and browser games.
Index
Use manifests to query useful sample windows without scanning every raw asset.
Train
Run fast experiments in notebooks and longer sweeps on dedicated GPU workers.
Export
Convert selected models into lightweight browser and local inference runtimes.
Data Model
Vision
RGB frame sequences provide the visual state the model learns to roll forward.
LiDAR
Projected spatial tensors encode geometry and scene structure as a parallel sensor stream.
Telemetry
State vectors describe motion, dynamics, orientation, and environment metadata.
Waypoints
Structured path priors represent short-horizon navigation and future route geometry.
Controls
Action vectors condition transitions on steering, throttle, brake, and gameplay movement input.
Manifest
Chunk metadata links modalities, labels, time windows, and training splits for quick sampling.
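One way such a manifest could work is sketched below. The schema is an assumption, not the project's actual format: `ChunkRecord`, its field names, and the file paths are hypothetical. The point is that training windows can be enumerated from metadata alone, without opening any raw asset.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class ChunkRecord:
    """Hypothetical manifest entry describing one recorded chunk."""
    chunk_id: str
    split: str                      # e.g. "train" or "val"
    start_frame: int
    end_frame: int                  # exclusive
    modality_paths: Dict[str, str]  # modality name -> file path


def sample_windows(manifest: Sequence[ChunkRecord], split: str, window: int,
                   stride: int, required: Sequence[str] = ()
                   ) -> List[Tuple[str, int, int]]:
    """List (chunk_id, start, end) windows using only manifest metadata."""
    windows = []
    for chunk in manifest:
        if chunk.split != split:
            continue
        if any(name not in chunk.modality_paths for name in required):
            continue  # skip chunks missing a required modality
        last_start = chunk.end_frame - window
        for start in range(chunk.start_frame, last_start + 1, stride):
            windows.append((chunk.chunk_id, start, start + window))
    return windows


# Illustrative entries; the paths are placeholders.
manifest = [
    ChunkRecord("a", "train", 0, 10, {"rgb": "a/rgb.bin", "lidar": "a/lidar.bin"}),
    ChunkRecord("b", "val", 0, 10, {"rgb": "b/rgb.bin", "lidar": "b/lidar.bin"}),
    ChunkRecord("c", "train", 0, 6, {"rgb": "c/rgb.bin"}),
]
windows = sample_windows(manifest, "train", window=4, stride=3,
                         required=("rgb", "lidar"))
print(windows)
```

Here chunk "b" is excluded by split and chunk "c" by its missing LiDAR stream, so only windows from chunk "a" are returned.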
Relationship to Literature
Dreamer
arxiv.org/abs/1912.01603
Demonstrated policy learning through latent imagination inside learned world models. This project keeps the rollout human-visible and playable, so the learned simulator can be inspected directly.
Genie
arxiv.org/abs/2402.15391
Introduced interactive generative environments trained from video. This project explores the opposite deployment pressure: tiny models intended for browser, edge, and on-device execution.
GAIA-1
arxiv.org/pdf/2309.17080
GAIA-1 presents a generative world model for autonomous driving. This project explores a smaller interactive version of that direction, with browser-scale deployment and playable rollouts.
Evaluation Metrics
Online Play Evaluation
Evaluate the simulator by how it plays in real time: steering response, rollout stability, control coherence, and whether the generated state remains useful under live interaction.
Low Latency
Target interactive browser performance around 30 fps, with low inference latency and stable frame pacing during recursive rollout.
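A 30 fps target leaves roughly 33 ms per frame for inference plus everything else the page does. The sketch below shows one way to check a step function against that budget. It is an assumption-laden illustration: `measure_step_latency_ms` is a hypothetical helper, `dummy_step` stands in for real model inference, and timings taken in Python say nothing about an in-browser runtime.

```python
import time
from typing import Callable, Dict


def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def measure_step_latency_ms(step: Callable[[], object], n_warmup: int = 3,
                            n_runs: int = 20) -> Dict[str, float]:
    """Time repeated calls to `step` and report median and worst-case latency."""
    for _ in range(n_warmup):
        step()  # warm-up calls are not timed
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        step()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # For an even number of runs this picks the upper of the two middle samples.
    return {"median": samples[len(samples) // 2], "worst": samples[-1]}


def dummy_step() -> float:
    """Placeholder workload standing in for one model inference call."""
    return sum(i * i for i in range(1000))


budget = frame_budget_ms(30)
stats = measure_step_latency_ms(dummy_step)
print(f"budget {budget:.1f} ms, median {stats['median']:.3f} ms, "
      f"worst {stats['worst']:.3f} ms")
```

Reporting the worst case next to the median matters for frame pacing: a model that is fast on average but occasionally misses the budget will still stutter.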
Per-Modality Loss
Track online and offline loss by modality, including RGB prediction, LiDAR latent alignment, telemetry evolution, waypoint consistency, and control-conditioned transition quality.
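Per-modality tracking can be as simple as computing one loss per prediction head and keeping the breakdown alongside the weighted total. The sketch below assumes mean squared error for every modality and hand-picked weights; the project's actual loss functions and weights are not stated, and all names and numbers here are illustrative.

```python
from typing import Dict, List


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def per_modality_losses(pred: Dict[str, List[float]],
                        target: Dict[str, List[float]]) -> Dict[str, float]:
    """One loss value per modality, so regressions can be traced to a single head."""
    return {name: mse(pred[name], target[name]) for name in pred}


def weighted_total(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-modality losses into one scalar for optimisation."""
    return sum(weights[name] * value for name, value in losses.items())


# Made-up predictions and targets for two heads.
pred = {"rgb_latent": [0.0, 1.0], "telemetry": [2.0]}
target = {"rgb_latent": [0.0, 3.0], "telemetry": [1.0]}
weights = {"rgb_latent": 1.0, "telemetry": 0.5}  # assumed weights

losses = per_modality_losses(pred, target)
total = weighted_total(losses, weights)
print(losses, total)
```

Logging the dictionary rather than only the total makes it visible when, for example, visual prediction improves while telemetry prediction degrades.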
Value Proposition
More Rollouts
Small simulators can run many lightweight environment instances in parallel.
Faster Variation
Game-like environments can explore new modes and behaviors with less hand-authored systems code.
Human Feedback
Playable simulation lets people inspect the model's behavior directly, not only through aggregate metrics.