Drift Sim

Live Interactive Demo

Playable Neural Simulation

Compact World Models for Interactive Simulation

This project explores lightweight neural simulators: models that learn how interactive environments behave by watching gameplay and simulation data. Instead of rendering every frame through a traditional engine alone, the system predicts future environment state from prior frames, controls, and sensor signals in real time.

The work applies one neural simulation thesis across autonomous driving and game-like environments: learn the next moment of an interactive world from what the model can see, what it knows, and what the user does.

This simulator shows RGB output while also generating LiDAR, waypoints, and telemetry.

WASD or arrows drive

Space brakes

N/P changes sample

R resets
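
As a rough illustration of how these key bindings could become the control vector the model is conditioned on, the sketch below maps pressed keys to (throttle, brake, steer). The key names, value ranges, and function name are assumptions made for this example, not the demo's actual input code, and the role of S / Down is left out because it is not specified here. It is written in Python for readability; a browser demo would do the equivalent in JavaScript.

```python
from typing import Iterable, Tuple

# Hypothetical key groups; the names follow common browser key identifiers.
THROTTLE_KEYS = {"w", "ArrowUp"}
LEFT_KEYS = {"a", "ArrowLeft"}
RIGHT_KEYS = {"d", "ArrowRight"}
BRAKE_KEYS = {" "}  # Space


def controls_from_keys(pressed: Iterable[str]) -> Tuple[float, float, float]:
    """Turn the set of currently pressed keys into (throttle, brake, steer).

    Assumed ranges: throttle and brake in {0, 1}, steer in {-1, 0, 1},
    where negative steer means left.
    """
    keys = set(pressed)
    throttle = 1.0 if keys & THROTTLE_KEYS else 0.0
    brake = 1.0 if keys & BRAKE_KEYS else 0.0
    # Opposite steering keys cancel out to a neutral wheel.
    steer = float(bool(keys & RIGHT_KEYS)) - float(bool(keys & LEFT_KEYS))
    return throttle, brake, steer


action = controls_from_keys(["w", "d"])  # accelerate while steering right
```

Keeping the mapping a pure function of the pressed-key set makes it easy to sample once per frame and pass the result straight into the transition step.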

Background Info

This project treats simulation as something that can be learned from interaction. A compact model observes a world, receives an action, and predicts what should happen next.

The same idea can be used for autonomous driving and game simulation: learn the behavior of an environment well enough to roll it forward interactively, then make that learned simulator small enough to run close to the user.

Methodology

For each simulation, I collected examples of an agent moving through an environment, then trained models to predict what should happen next from those interaction traces.

CARLA Driving

For driving, I used CARLA, an autonomous driving simulator with roads, traffic, sensors, and controllable vehicles. I let the built-in autopilot drive around the maps and recorded privileged simulator state: camera frames, LiDAR, ego telemetry, controls, and route context.

Tiny Skies Flight

Tiny Skies is an open-source browser flying game with simple 3D visuals and a third-person ego plane orbiting a procedurally generated planet. Here the model infers motion and physics from transitions in latent RGB inputs rather than explicit simulator sensors.

Core Differentiators

Tiny On-Device Runtime

Where large generative environment models typically require server-scale inference, this project targets compact ONNX/browser deployment so the simulator can run locally and scale horizontally across many cheap clients.

Human-Visible World Model

Unlike purely latent training worlds, the rollout is visible and playable. A human can drive inside the learned simulator and see what the model believes the environment state has become.

Neural Game Engine

The model is treated as an executable environment layer: useful for policy training, synthetic simulation populations, game mechanics variation, and new modes with minimal hand-authored code.

Project Overview

This work explores compact neural simulators: small world models that learn environment behavior from interaction data.

Learned Dynamics

The model watches an environment evolve, receives a control input, and predicts the next state.

Playable Rollout

The simulator is visible and interactive, so users can drive inside the model instead of inspecting only offline samples.

Compact Runtime

The target is browser-scale deployment: small enough to run close to the user with low latency.

Can a compact neural network learn enough of an environment's dynamics to function as a usable simulator?

Latent Dynamics

The simulator learns in a compressed representation of the world, where visual structure, motion, controls, and sensor context can be modeled as dynamics instead of raw pixels alone.

Changing Worlds

Training on interactive trajectories forces the model to learn how environments respond over time, not just how individual frames look.

Reusable State

Latent representations can become compact state descriptions for policies, planning, replay, search, and synthetic rollout generation.

Shared Abstraction

The same latent space can connect vision, controls, telemetry, and future prediction into one downstream-friendly interface.

Core Thesis

Traditional simulators are hand-authored. Neural simulators learn the transition rules directly from experience.

Observe

Use the current visual state and supporting sensor context as the model's view of the world.

Act

Condition each transition on steering, throttle, braking, or gameplay movement.

Predict

Roll the world forward one step, then recursively use the generated state as the next input.

S_t + A_t -> S_{t+1}
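
The observe, act, predict loop above can be sketched as a plain recursive rollout. This is a minimal illustration under stated assumptions: `State`, `toy_step`, and the `(throttle, brake, steer)` action layout are hypothetical stand-ins, and `toy_step` is a hand-written placeholder where the trained transition model would sit.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class State:
    """Toy stand-in for the simulator state (hypothetical fields)."""
    position: float
    speed: float


Action = Tuple[float, float, float]  # assumed layout: (throttle, brake, steer)


def toy_step(state: State, action: Action, dt: float = 0.1) -> State:
    """Placeholder transition; a learned model would replace this function."""
    throttle, brake, _steer = action
    speed = max(0.0, state.speed + (throttle - brake) * dt)
    return State(position=state.position + speed * dt, speed=speed)


def rollout(
    step: Callable[[State, Action], State],
    initial: State,
    actions: List[Action],
) -> List[State]:
    """Apply S_t + A_t -> S_{t+1} repeatedly, feeding each prediction back in."""
    states = [initial]
    for action in actions:
        states.append(step(states[-1], action))
    return states


# Three steps of full throttle from rest.
trajectory = rollout(toy_step, State(0.0, 0.0), [(1.0, 0.0, 0.0)] * 3)
```

Because every step consumes the previous prediction, any error in one step is carried into the next, which is why rollout stability matters as much as single-step accuracy.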

Why This Matters

The practical question is whether parts of a heavy simulator can become portable learned behavior.

165 GB Simulator

A full CARLA installation can require heavyweight assets, runtime services, and dedicated compute.

Small Neural Runtime

A learned transition model can represent a narrow slice of behavior in a much smaller deployable package.

Portable Rollouts

Once the simulator is compact, it can move into browsers, edge devices, and distributed training loops.

High-Level Architecture

The system is built as a recursive multi-modal transition model. It fuses RGB observations, LiDAR projections, telemetry streams, waypoint trajectories, and control vectors into a latent transition representation.

Decoder heads predict future environment state, allowing autonomous rollout through learned latent imagination. The same generalized latent-world modeling framework spans autonomous driving and interactive gameplay domains.

Visual frames are encoded through convolutional layers
LiDAR is independently encoded into a compact latent vector
Telemetry, waypoint, and control embeddings are projected into learned feature spaces
Modalities are fused into a shared latent transition representation
Decoder heads predict the next RGB frame, LiDAR latent, telemetry state, and waypoint trajectory
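
The steps listed above can be sketched in miniature. The example below is illustrative only: it assumes fusion by concatenation followed by a single projection, uses random weights in place of trained parameters, and picks embedding sizes arbitrarily, none of which is stated in this write-up. It uses plain Python lists so it runs without any ML framework.

```python
import math
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]

# Hypothetical sizes; the real encoders and heads are not specified here.
EMBED_DIMS = {"rgb": 8, "lidar": 4, "telemetry": 3, "waypoints": 4, "controls": 3}
HEAD_DIMS = {"rgb": 8, "lidar": 4, "telemetry": 3, "waypoints": 4}
LATENT_DIM = 6


def init_linear(n_in: int, n_out: int, rng: random.Random) -> Matrix:
    """Random weights standing in for trained parameters."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Matrix-vector product without a bias term."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


rng = random.Random(0)
fusion_w = init_linear(sum(EMBED_DIMS.values()), LATENT_DIM, rng)
head_w = {name: init_linear(LATENT_DIM, dim, rng) for name, dim in HEAD_DIMS.items()}


def transition(embeddings: Dict[str, Vector]) -> Dict[str, Vector]:
    """Fuse per-modality embeddings into one latent, then decode one output per head."""
    concat = [v for name in EMBED_DIMS for v in embeddings[name]]
    latent = [math.tanh(h) for h in linear(fusion_w, concat)]
    return {name: linear(w, latent) for name, w in head_w.items()}


inputs = {name: [0.1] * dim for name, dim in EMBED_DIMS.items()}
prediction = transition(inputs)  # one predicted vector per decoder head
```

Controls appear only on the input side: they condition the transition but are not predicted, which matches the list of decoder heads above.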

S_hat_{t+1} -> S_hat_{t+2} -> S_hat_{t+3}

Figure: Driving simulator neural architecture diagram

VAE

The project uses a Variational Autoencoder (VAE) to compress high-dimensional visual observations into a compact latent representation before simulation learning occurs. Rather than asking the world model to simultaneously learn rendering, compression, and environment dynamics end-to-end, the VAE isolates the vision problem into a dedicated representation-learning stage.

The encoder learns a smooth latent manifold of visual structure, motion, topology, and scene geometry from raw RGB frames, while the decoder reconstructs the original observations from this compressed space. The downstream simulator then operates directly on these latent representations instead of raw pixels, allowing the transition model to focus on predicting how environment state evolves over time under actions and control inputs.

This separation of concerns improves efficiency, stabilizes training, reduces dimensionality, and creates a cleaner abstraction between visual perception and learned environment dynamics.
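
For readers who want the mechanics, here is a minimal sketch of the two pieces that make an autoencoder variational: the reparameterized sample and the KL term against a standard normal prior. It assumes a diagonal Gaussian posterior and a squared-error reconstruction term; the `beta` weight and all function names are illustrative choices, not details taken from this project's training code.

```python
import math
import random
from typing import List


def reparameterize(mu: List[float], log_var: List[float], rng: random.Random) -> List[float]:
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), where sigma = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu: List[float], log_var: List[float]) -> float:
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def vae_loss(
    x: List[float],
    x_recon: List[float],
    mu: List[float],
    log_var: List[float],
    beta: float = 1.0,
) -> float:
    """Squared-error reconstruction plus a beta-weighted KL regularizer."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + beta * kl_to_standard_normal(mu, log_var)


z = reparameterize([0.0, 0.0], [0.0, 0.0], random.Random(0))  # one latent sample
```

The KL term is what encourages the smooth latent space described above; once the VAE is trained, the transition model only ever sees the encoded latents.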

Watch the VAE video on YouTube

Architectural Philosophy

Preserve Behavior

The model is judged by whether controls produce coherent state changes, not by photorealism alone.

Stay Inspectable

The rollout remains human-visible, making failures and strengths easier to see during interaction.

Deploy Nearby

The architecture favors small, low-latency runtimes that can run close to the player or policy.

Metrics of Success

The project evaluates world models through practical dimensions that matter for interactive simulation.

Transition consistency over long recursive rollouts
Policy transferability from generated trajectories back to the source environment
Latency close to interactive framerates
Compression ratio between classical simulator footprint and neural transition model size
Cross-domain generalization across autonomous driving, browser games, and lightweight gameplay environments
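
Two of these metrics reduce to simple arithmetic, sketched below. The 50 MB model size is a made-up figure used only to show the calculation, and the per-step mean absolute error is one possible choice of consistency measure, not necessarily the one used in this project.

```python
from typing import List


def compression_ratio(simulator_bytes: int, model_bytes: int) -> float:
    """How many times smaller the learned model is than the classical simulator."""
    return simulator_bytes / model_bytes


def rollout_error_by_step(
    predicted: List[List[float]], reference: List[List[float]]
) -> List[float]:
    """Mean absolute error at each rollout step, to show how drift grows with horizon."""
    return [
        sum(abs(p - r) for p, r in zip(pred_state, ref_state)) / len(ref_state)
        for pred_state, ref_state in zip(predicted, reference)
    ]


# 165 GB simulator footprint versus a hypothetical 50 MB exported model.
ratio = compression_ratio(165 * 10**9, 50 * 10**6)

# Toy two-step rollout: perfect at step 0, drifting at step 1.
drift = rollout_error_by_step([[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 2.0]])
```

Reporting error per step instead of as one average makes it visible when a rollout is accurate early but degrades over longer horizons.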

Research Direction

The goal is not to replace traditional engines entirely. It is to learn the parts of simulation that can become predictive, portable, and interactive.

Hybrid Engines

Combine hand-authored structure with learned transition behavior.

Generated Worlds

Train policies and explore mechanics inside compact learned rollouts.

Neural Replay

Turn recorded interaction streams into controllable predictive environments.

Data And Training Infrastructure

Record

Capture interaction traces from driving simulation and browser games.

Index

Use manifests to query useful sample windows without scanning every raw asset.

Train

Run fast experiments in notebooks and longer sweeps on dedicated GPU workers.

Export

Convert selected models into lightweight browser and local inference runtimes.

Data Model

Vision

RGB frame sequences provide the visual state the model learns to roll forward.

LiDAR

Projected spatial tensors encode geometry and scene structure as a parallel sensor stream.

Telemetry

State vectors describe motion, dynamics, orientation, and environment metadata.

Waypoints

Structured path priors represent short-horizon navigation and future route geometry.

Controls

Action vectors condition transitions on steering, throttle, brake, and gameplay movement input.

Manifest

Chunk metadata links modalities, labels, time windows, and training splits for quick sampling.
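
To make the manifest idea concrete, the sketch below shows one way chunk metadata could be queried for training windows without opening any raw assets. The `Chunk` fields, the file paths, and the `sample_windows` helper are all hypothetical; the actual manifest schema is not described here.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Chunk:
    """One manifest entry: a contiguous recording with its modalities and split."""
    chunk_id: str
    domain: str                 # e.g. "carla" or "tiny_skies"
    split: str                  # e.g. "train" or "val"
    start_frame: int
    end_frame: int              # exclusive
    modalities: Dict[str, str]  # modality name -> asset path (illustrative)


def sample_windows(
    manifest: Sequence[Chunk], split: str, required: Sequence[str], window: int
) -> List[Tuple[str, int, int]]:
    """Return (chunk_id, start, end) windows from chunks that have every required modality."""
    windows = []
    for chunk in manifest:
        if chunk.split != split or not set(required) <= set(chunk.modalities):
            continue
        # Non-overlapping windows that fit entirely inside the chunk.
        for start in range(chunk.start_frame, chunk.end_frame - window + 1, window):
            windows.append((chunk.chunk_id, start, start + window))
    return windows


manifest = [
    Chunk("a", "carla", "train", 0, 100,
          {"rgb": "a/rgb", "lidar": "a/lidar", "controls": "a/ctl"}),
    Chunk("b", "tiny_skies", "train", 0, 50, {"rgb": "b/rgb", "controls": "b/ctl"}),
    Chunk("c", "carla", "val", 0, 100, {"rgb": "c/rgb", "lidar": "c/lidar"}),
]
train_windows = sample_windows(manifest, "train", ["rgb", "lidar"], window=40)
```

Only chunk "a" qualifies in this query, since "b" has no LiDAR and "c" belongs to the validation split, so the same manifest can serve both the sensor-rich driving data and the RGB-only flight data.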

Relationship to Literature

Dreamer

arxiv.org/abs/1912.01603

Demonstrated policy learning through latent imagination inside learned world models. This project keeps the rollout human-visible and playable, so the learned simulator can be inspected directly.

Genie

arxiv.org/abs/2402.15391

Introduced interactive generative environments trained from video. This project explores the opposite deployment pressure: tiny models intended for browser, edge, and on-device execution.

GAIA-1

arxiv.org/pdf/2309.17080

GAIA-1 presents a generative world model for autonomous driving. This project explores a smaller interactive version of that direction, with browser-scale deployment and playable rollouts.

Evaluation Metrics

Online Play Evaluation

Evaluate the simulator by how it plays in real time: steering response, rollout stability, control coherence, and whether the generated state remains useful under live interaction.

Low Latency

Target interactive browser performance around 30 fps, with low inference latency and stable frame pacing during recursive rollout.
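
A 30 fps target leaves roughly 33 ms per frame for inference and drawing combined. The sketch below shows one way to check a model against that budget; `infer` is a hypothetical zero-argument callable wrapping a single model step, and comparing the worst-case timing instead of the mean is a design choice made here because frame pacing is sensitive to spikes.

```python
import time
from typing import Callable, List


def frame_budget_ms(target_fps: float) -> float:
    """Time available per frame, in milliseconds."""
    return 1000.0 / target_fps


def measure_latency_ms(infer: Callable[[], object], runs: int = 20) -> List[float]:
    """Wall-clock time of each call to infer(), in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        infer()
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


def fits_budget(timings_ms: List[float], target_fps: float) -> bool:
    """True if even the slowest measured step fits inside the frame budget."""
    return max(timings_ms) <= frame_budget_ms(target_fps)


budget = frame_budget_ms(30)                          # about 33.3 ms
timings = measure_latency_ms(lambda: None, runs=5)    # stand-in for a real model step
```

In practice the first few calls are often slower while caches and runtimes warm up, so discarding them before judging the budget is usually reasonable.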

Per-Modality Loss

Track online and offline loss by modality, including RGB prediction, LiDAR latent alignment, telemetry evolution, waypoint consistency, and control-conditioned transition quality.
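
A minimal sketch of per-modality tracking follows, assuming mean squared error for every modality and a simple weighted sum for the combined objective. The modality names mirror the ones used on this page, but the weights and the choice of MSE are placeholders, not the project's actual loss configuration.

```python
from typing import Dict, List


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def per_modality_loss(
    pred: Dict[str, List[float]], target: Dict[str, List[float]]
) -> Dict[str, float]:
    """One loss value per modality, so each stream can be logged on its own curve."""
    return {name: mse(pred[name], target[name]) for name in target}


def total_loss(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum used for optimization; unlisted modalities default to weight 1."""
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())


pred = {"rgb": [0.0, 0.0], "telemetry": [1.0]}
target = {"rgb": [1.0, 1.0], "telemetry": [1.0]}
losses = per_modality_loss(pred, target)        # {"rgb": 1.0, "telemetry": 0.0}
combined = total_loss(losses, {"rgb": 2.0})     # 2.0
```

Logging the dictionary as well as the combined scalar is what makes it possible to see, for example, RGB prediction improving while waypoint consistency stalls.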

Figure: Weights & Biases (W&B) training graphs for neural simulator metrics

Value Proposition

More Rollouts

Small simulators can run many lightweight environment instances in parallel.

Faster Variation

Game-like environments can explore new modes and behaviors with less hand-authored systems code.

Human Feedback

Playable simulation lets people inspect the model's behavior directly, not only through aggregate metrics.

Policy Demo

Tiny Self-Driving Policy

youtu.be/rr_uS4bf0B4

A compact autonomous driving policy trained in CARLA using only realistic sensor modalities: RGB vision, LiDAR, telemetry, and waypoint guidance.

Zero-Shot Maps

Despite training on only Town01 and Town02, the policy generalizes to unseen CARLA environments without privileged simulator state.

Autoregressive Driving

The model performs stable closed-loop driving by repeatedly acting from its current sensor observations and route context.

Recovery Behavior

The policy can recover from off-road states and unusual positions it was not explicitly trained to visit, then steer back toward drivable routes.