01
Parameter Visualization
A Chromosome Is a Mahjong Hand
Every GA chromosome encodes a full PPO training configuration — 12 hyperparameters, one tile each.
Drawing a good hand means finding a training recipe that keeps the agent's edge alive through 200M+ steps of self-play noise.
Metaphor: Suits group related hyperparameters (learning-rate family, PPO objective, batch structure, critic design). Tile value within a suit represents position along that gene's search range. The GA "draws" and refines hands until it finds one that survives long-term training.
02
Legend
How to Read the Hand
Each hyperparameter family maps to a mahjong suit. Tile value within the suit = position along the search range.
Man · 萬子 · Learning-rate family
learning_rate · lr_cycle_steps · lr_min | 3 genes
Pin · 筒子 · PPO objective family
clip_range · entropy_coef · ppo_epochs | 3 genes
Sou · 索子 · Batch & gradient family
max_grad_norm · minibatch_size · num_steps | 3 genes
Honor · 風牌 · Critic design (optional)
critic_mode · gae_lambda · discount_gamma | 1 categorical + 2 conditional
Active tile (green border, slightly lifted) shows the current S2 hand-tuned value (or GA-selected value in §4).
Faded tiles are other candidate positions along that gene's search range.
03
Search Space
12 Genes · Current S2 Hand-Tuned Values
Each row shows the discretized search range (5 positions along the range). The highlighted tile is the current hand-tuned value.
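The tile mapping described above can be sketched as a small helper. This is only an illustration: it assumes the 5 tiles are evenly spaced and include both ends of the range, and that the highlighted tile is the grid point nearest the current value. The function name tile_position is hypothetical and not taken from the deck's code.

```python
import math

def tile_position(value, lo, hi, n_tiles=5, log_scale=False):
    """Return the 1-based tile nearest to `value` on an evenly spaced grid over [lo, hi].

    Assumption: tiles include both endpoints; log-scale genes are spaced in log10.
    """
    if log_scale:
        value, lo, hi = math.log10(value), math.log10(lo), math.log10(hi)
    frac = (value - lo) / (hi - lo)      # 0.0 at lo, 1.0 at hi
    frac = min(max(frac, 0.0), 1.0)      # clamp values outside the range
    return int(math.floor(frac * (n_tiles - 1) + 0.5)) + 1

# Current S2 values from the rows below
lr_tile = tile_position(3e-5, 1e-6, 1e-3, log_scale=True)   # learning_rate
clip_tile = tile_position(0.2, 0.05, 0.4)                    # clip_range
```

Under these assumptions both learning_rate = 3e-5 and clip_range = 0.2 land on the middle tile of their rows.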
Man · 萬子 · Learning-Rate Family
learning_rate · [1e-6, 1e-3] log-scale
Cosine-schedule starting LR; primary stability lever.
current = 3e-5
lr_cycle_steps · [10M, 100M] log-scale
Cosine warm-restart period; directly tied to the S2 collapse trigger.
current = 50M (collapse hotspot)
lr_min · [1e-7, 1e-5] log-scale
Minimum LR at end of cosine cycle; shapes restart jump magnitude.
current = 1e-6
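To see how the three learning-rate genes interact, here is a minimal sketch of a cosine schedule with warm restarts. It assumes the standard SGDR-style formula; the deck does not confirm the exact form used in training, and the function name cosine_restart_lr is hypothetical.

```python
import math

def cosine_restart_lr(step, lr_max=3e-5, lr_min=1e-6, cycle_steps=50_000_000):
    """Cosine decay from lr_max to lr_min, restarting every cycle_steps (assumed form)."""
    phase = (step % cycle_steps) / cycle_steps          # 0.0 at cycle start, approaches 1.0 at the end
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * phase))

just_before = cosine_restart_lr(50_000_000 - 1)   # end of the first cycle, close to lr_min
at_restart = cosine_restart_lr(50_000_000)        # first step of the second cycle, back at lr_max
jump_ratio = at_restart / just_before
```

With the current S2 values the learning rate jumps by a factor of about 30 at each restart, which is why lr_min shapes the restart jump magnitude.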
Pin · 筒子 · PPO Objective Family
clip_range (ε) · [0.05, 0.4] linear
PPO trust region size; tight clip = conservative updates.
current = 0.2
entropy_coef · [0.001, 0.1] log-scale
Exploration pressure; critical under critic-free setup.
current = 0.01
ppo_epochs · [1, 5] integer
Number of PPO passes per rollout; more = more data reuse.
current = 2
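The effect of clip_range can be illustrated with the standard PPO clipped surrogate for a single sample. This is a sketch of the textbook objective, not the project's loss code; entropy_coef and ppo_epochs are left out, and the function name is hypothetical.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """PPO clipped objective for one sample.

    ratio: new_policy_prob / old_policy_prob for the taken action.
    """
    clipped_ratio = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
    # Taking the minimum removes any incentive to push the ratio beyond the clip band.
    return min(ratio * advantage, clipped_ratio * advantage)

loose = clipped_surrogate(1.5, 1.0, clip_range=0.2)    # capped at 1.2 * advantage
tight = clipped_surrogate(1.5, 1.0, clip_range=0.05)   # capped at 1.05 * advantage
```

A tighter clip caps the gain from the same policy change at a lower value, which is the sense in which tight clip means conservative updates.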
Sou · 索子 · Batch & Gradient Family
max_grad_norm · [0.1, 2.0] linear
Global gradient clipping threshold.
current = 0.5
minibatch_size · {512, 1024, 2048, 4096, 8192}
Critical: at 1024, only ~74 independent advantage estimates per step (critic-free).
current = 1024 (low eff-sample)
num_steps (per env) · {512, 1024, 2048, 4096}
Rollout length per env; affects update frequency vs data freshness.
current = 1024
Honor · 風牌 · Critic Design (Optional · GA-Searchable)
critic_mode · categorical · 3 options
Let GA decide whether to reintroduce a value signal.
current = none (critic-free)
gae_lambda · [0.9, 0.99] · active if critic_mode ≠ none
GAE bias-variance tradeoff; inactive in current critic-free setup.
inactive
discount_gamma · [0.95, 0.999] · active if critic_mode = reward_predictor
Only active when reward_predictor shaping is enabled.
inactive
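The conditional genes can be handled by asking which genes are active for a given hand. The sketch below assumes that reward_predictor is one of the three critic_mode options and that the activation rules are exactly the two listed above. The function name and the mode name "third_mode" used in the usage lines are hypothetical placeholders, since the deck does not name the third option.

```python
ALWAYS_ACTIVE = [
    "learning_rate", "lr_cycle_steps", "lr_min",
    "clip_range", "entropy_coef", "ppo_epochs",
    "max_grad_norm", "minibatch_size", "num_steps",
    "critic_mode",
]

def active_genes(chromosome):
    """Return the gene names that affect training for this chromosome."""
    active = list(ALWAYS_ACTIVE)
    mode = chromosome["critic_mode"]
    if mode != "none":
        active.append("gae_lambda")        # active for any critic
    if mode == "reward_predictor":
        active.append("discount_gamma")    # assumed: only with the reward predictor
    return active

critic_free = active_genes({"critic_mode": "none"})
with_predictor = active_genes({"critic_mode": "reward_predictor"})
other_critic = active_genes({"critic_mode": "third_mode"})  # placeholder name
```

In the current critic-free setup only 10 of the 12 tiles influence training, so mutations on the two inactive tiles have no effect on the score.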
04
Sample Chromosomes
Three Possible "Hands"
Each hand is a complete PPO configuration. The GA evolves these hands across generations.
Hand 1 · S2 Hand-Tuned
score: −0.27 / game
Degraded hand: collapses around step 150M, loses the edge entirely.
Hand 2 · GA Candidate A
hypothetical
Slower, safer updates + bigger batch (higher effective sample size).
Hand 3 · GA Candidate B
hypothetical
GA activates reward-predictor critic + longer LR cycle.
Evolution narrative: The GA starts with a population of 12 random hands (12 tiles each), evaluates each hand over 1M training steps, then breeds stronger hands through crossover and mutation. After 8 generations × 12 individuals = 96 hands played, the winning hand remains: stable and edge-preserving.
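The evaluation budget in the narrative can be checked with a minimal generational loop. Everything except the two constants is an assumption for illustration: binary tournament selection, no elitism, and the toy_* functions, which stand in for real PPO training runs and the real operators.

```python
import random

POP_SIZE, GENERATIONS = 12, 8   # from the deck: 8 generations x 12 individuals

def run_ga(evaluate, random_hand, crossover, mutate, rng):
    population = [random_hand(rng) for _ in range(POP_SIZE)]
    evaluations = 0
    best_hand, best_score = None, float("-inf")
    for gen in range(GENERATIONS):
        scores = [evaluate(hand) for hand in population]
        evaluations += len(population)
        for hand, score in zip(population, scores):
            if score > best_score:
                best_hand, best_score = hand, score
        if gen == GENERATIONS - 1:
            break                                   # last generation is not bred further
        def pick():                                 # binary tournament (assumed scheme)
            a, b = rng.sample(range(POP_SIZE), 2)
            return population[a] if scores[a] >= scores[b] else population[b]
        population = [mutate(crossover(pick(), pick(), rng), rng)
                      for _ in range(POP_SIZE)]
    return best_hand, best_score, evaluations

# Toy stand-ins: a "hand" is 12 floats, fitness is highest at the origin.
def toy_random_hand(rng):
    return [rng.uniform(-1.0, 1.0) for _ in range(12)]

def toy_evaluate(hand):
    return -sum(x * x for x in hand)

def toy_crossover(a, b, rng):
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

def toy_mutate(hand, rng):
    return [x + rng.gauss(0.0, 0.05) if rng.random() < 1 / 12 else x for x in hand]

best, score, n_evals = run_ga(toy_evaluate, toy_random_hand,
                              toy_crossover, toy_mutate, random.Random(0))
```

With this structure n_evals comes out as 96, matching the number of hands played.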
05
Encoding Types
Mixed-Type Gene Encoding
Log-Scale Continuous
learning_rate · lr_cycle_steps · lr_min · entropy_coef
Tiles are evenly spaced in the log domain, so crossover and mutation operate in log-space and the result is mapped back to linear scale. This balances exploration across orders of magnitude.
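A small sketch of the log-space round trip follows. The encode and decode helpers are hypothetical names, and the weighted blend is only a stand-in operator chosen to keep the example deterministic; the deck does not state which operator runs in log-space.

```python
import math

def to_log_unit(value, lo, hi):
    """Map value in [lo, hi] to [0, 1], evenly spaced in log10."""
    return (math.log10(value) - math.log10(lo)) / (math.log10(hi) - math.log10(lo))

def from_log_unit(z, lo, hi):
    """Inverse of to_log_unit: map z in [0, 1] back to linear scale."""
    return 10.0 ** (math.log10(lo) + z * (math.log10(hi) - math.log10(lo)))

def blend_in_log_space(p1, p2, lo, hi, weight=0.5):
    """Stand-in crossover: weighted average of the parents in log-space."""
    z = weight * to_log_unit(p1, lo, hi) + (1.0 - weight) * to_log_unit(p2, lo, hi)
    return from_log_unit(z, lo, hi)

# learning_rate parents at the two ends of the search range
log_child = blend_in_log_space(1e-6, 1e-3, 1e-6, 1e-3)   # about 3.16e-5
linear_child = 0.5 * (1e-6 + 1e-3)                        # about 5.0e-4
```

The log-space child sits 1.5 decades from each parent, while the linear average stays within a factor of 2 of the upper bound and never samples the small-learning-rate end.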
Linear Continuous
clip_range · max_grad_norm · gae_lambda · discount_gamma
SBX (Simulated Binary Crossover, η=20) + polynomial mutation (η=20) in original linear space.
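For reference, a minimal sketch of both operators on a single gene. It uses the basic textbook forms with simple clipping to the bounds, which is an assumption; the project may use bounded variants instead. Function names are hypothetical.

```python
import random

def sbx_pair(p1, p2, lo, hi, eta=20.0, rng=random):
    """Simulated Binary Crossover on one gene; returns two children."""
    u = rng.random()
    if u <= 0.5:
        beta = (2.0 * u) ** (1.0 / (eta + 1.0))
    else:
        beta = (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta + 1.0))
    c1 = 0.5 * ((1.0 + beta) * p1 + (1.0 - beta) * p2)
    c2 = 0.5 * ((1.0 - beta) * p1 + (1.0 + beta) * p2)
    return min(max(c1, lo), hi), min(max(c2, lo), hi)   # clip to the search range

def polynomial_mutation(x, lo, hi, eta=20.0, rng=random):
    """Polynomial mutation (basic form): perturbation scaled by the range width."""
    u = rng.random()
    if u < 0.5:
        delta = (2.0 * u) ** (1.0 / (eta + 1.0)) - 1.0
    else:
        delta = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta + 1.0))
    return min(max(x + delta * (hi - lo), lo), hi)

rng = random.Random(1)
child_a, child_b = sbx_pair(0.2, 0.3, 0.05, 0.4, rng=rng)   # clip_range parents
mutated = polynomial_mutation(0.5, 0.1, 2.0, rng=rng)        # max_grad_norm
```

A large η such as 20 keeps children close to their parents and mutations small; before clipping, the two SBX children always average to the parents' mean.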
Integer / Power-of-Two
ppo_epochs · minibatch_size · num_steps
SBX with rounding to nearest valid integer / power-of-2. Snapped after crossover and mutation.
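The snapping step can be sketched as a nearest-choice lookup. Whether "nearest" is measured in linear or log2 distance for the power-of-two genes is not stated in the deck, so the helper below (a hypothetical name) exposes both and the choice is left as an assumption.

```python
import math

MINIBATCH_CHOICES = [512, 1024, 2048, 4096, 8192]
PPO_EPOCH_CHOICES = [1, 2, 3, 4, 5]

def snap_to_choices(value, choices, log2_distance=False):
    """Return the valid choice closest to a continuous value (value > 0 for log2)."""
    if log2_distance:
        return min(choices, key=lambda c: abs(math.log2(c) - math.log2(value)))
    return min(choices, key=lambda c: abs(c - value))

epochs = snap_to_choices(2.7, PPO_EPOCH_CHOICES)                          # 3
mb_linear = snap_to_choices(1500.0, MINIBATCH_CHOICES)                    # 1024
mb_log2 = snap_to_choices(1500.0, MINIBATCH_CHOICES, log2_distance=True)  # 2048
```

The two distance measures disagree for a value like 1500, so the convention should be fixed once and applied after both crossover and mutation.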
Categorical
critic_mode (3 options: 東 / 南 / 西)
Uniform crossover (take one parent's value) + mutation (replace with a random alternative with probability 1/n_genes).
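A minimal sketch of the categorical operators, using the three wind tiles as option labels. It assumes n_genes counts all 12 genes and that each parent is picked with equal probability; function names are hypothetical.

```python
import random

CRITIC_MODES = ["東", "南", "西"]   # tile labels for the 3 critic_mode options

def crossover_categorical(a, b, rng=random):
    """Uniform crossover: the child inherits one parent's value unchanged."""
    return a if rng.random() < 0.5 else b

def mutate_categorical(value, options, n_genes=12, rng=random):
    """With probability 1/n_genes, replace the value with a different option."""
    if rng.random() < 1.0 / n_genes:
        alternatives = [o for o in options if o != value]
        return rng.choice(alternatives)
    return value

rng = random.Random(3)
child = crossover_categorical("東", "西", rng=rng)
forced = mutate_categorical("東", CRITIC_MODES, n_genes=1, rng=rng)   # probability 1
```

Setting n_genes=1 in the last line forces the mutation so the replacement behaviour is visible; with 12 genes the tile changes in roughly one mutation call out of twelve.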
Parameter visualization deck for Figma re-design. Tile assets from pongpong-online/web/src/assets/tiles/.
Search space: see 06_search_space_correct.csv. Chromosome encoding: see 07_ga_execution_plan.csv.