01 Parameter Visualization

A Chromosome Is a Mahjong Hand

Every GA chromosome encodes a full PPO training configuration — 12 hyperparameters, one tile each. Drawing a good hand means finding a training recipe that keeps the agent's edge alive through 200M+ steps of self-play noise.

Metaphor: Suits group related hyperparameters (learning-rate family, PPO objective, batch structure, critic design). Tile value within a suit represents position along that gene's search range. The GA "draws" and refines hands until it finds one that survives long-term training.

02 Legend

How to Read the Hand

Each hyperparameter family maps to a mahjong suit. Tile value within the suit = position along the search range.

Man · 萬子 · Learning-rate family

learning_rate · lr_cycle_steps · lr_min | 3 genes

Pin · 筒子 · PPO objective family

clip_range · entropy_coef · ppo_epochs | 3 genes

Sou · 索子 · Batch & gradient family

max_grad_norm · minibatch_size · num_steps | 3 genes

Honor · 風牌 · Critic design (optional)

critic_mode · gae_lambda · discount_gamma | 1 categorical + 2 conditional

Active tile (green border, slightly lifted) shows the current S2 hand-tuned value (or GA-selected value in §4). Faded tiles are other candidate positions along that gene's search range.

03 Search Space

12 Genes · Current S2 Hand-Tuned Values

Each row shows the discretized search range (5 positions along the range). The highlighted tile is the current hand-tuned value.
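As a rough illustration of how a gene value could be placed on its row of tiles, the sketch below picks the nearest tile, measuring distance in log10 space for log-scale genes and in linear space otherwise. The function name `nearest_tile` and the constant names are assumptions made for this example, and the deck does not specify a snapping rule; only the tile values are taken from the search-space rows.

```python
import math
from typing import Sequence

# Tile values copied from the search-space rows; the constant names are hypothetical.
LEARNING_RATE_TILES = [1e-6, 1e-5, 3e-5, 1e-4, 1e-3]  # log-scale gene
CLIP_RANGE_TILES = [0.05, 0.1, 0.2, 0.3, 0.4]          # linear gene


def nearest_tile(value: float, tiles: Sequence[float], log_scale: bool) -> int:
    """Return the 0-based index of the tile closest to `value`.

    Assumption: "closest" is measured in log10 space for log-scale genes
    and in the original units for linear genes.
    """
    def key(x: float) -> float:
        return math.log10(x) if log_scale else x

    target = key(value)
    return min(range(len(tiles)), key=lambda i: abs(key(tiles[i]) - target))


# The hand-tuned learning rate sits on the third tile (index 2).
print(nearest_tile(3e-5, LEARNING_RATE_TILES, log_scale=True))   # 2
print(nearest_tile(0.2, CLIP_RANGE_TILES, log_scale=False))      # 2

# The choice of scale matters for values between tiles.
print(nearest_tile(4e-4, LEARNING_RATE_TILES, log_scale=True))   # 4
print(nearest_tile(4e-4, LEARNING_RATE_TILES, log_scale=False))  # 3
```

The last two calls show why the scale is part of the gene definition: the same value lands on different tiles depending on whether distance is measured in log or linear space.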

Man · 萬子 · Learning-Rate Family

learning_rate · [1e-6, 1e-3] · log-scale

Cosine-schedule starting LR; primary stability lever.

1e-6 · 1e-5 · 3e-5 ◀ · 1e-4 · 1e-3

current = 3e-5

lr_cycle_steps · [10M, 100M] · log-scale

Cosine warm-restart period; directly tied to the S2 collapse trigger.

10M · 25M · 50M ◀ · 75M · 100M

current = 50M (collapse hotspot)

lr_min · [1e-7, 1e-5] · log-scale

Minimum LR at the end of a cosine cycle; shapes the magnitude of the restart jump.

1e-7 · 3e-7 · 1e-6 ◀ · 3e-6 · 1e-5

current = 1e-6

Pin · 筒子 · PPO Objective Family

clip_range (ε) · [0.05, 0.4] · linear

PPO trust region size; tight clip = conservative updates.

0.05 · 0.1 · 0.2 ◀ · 0.3 · 0.4

current = 0.2

entropy_coef · [0.001, 0.1] · log-scale

Exploration pressure; critical under the critic-free setup.

0.001 · 0.003 · 0.01 ◀ · 0.03 · 0.1

current = 0.01

ppo_epochs · [1, 5] · integer

Number of PPO passes per rollout; more = more data reuse.

1 · 2 ◀ · 3 · 4 · 5

current = 2

Sou · 索子 · Batch & Gradient Family

max_grad_norm · [0.1, 2.0] · linear

Global gradient clipping threshold.

0.1 · 0.5 ◀ · 1.0 · 1.5 · 2.0

current = 0.5

minibatch_size · {512, 1024, 2048, 4096, 8192}

Critical: at 1024, only ~74 independent advantage estimates per step (critic-free).

512 · 1024 ◀ · 2048 · 4096 · 8192

current = 1024 (low eff-sample)

num_steps (per env) · {512, 1024, 2048, 4096}

Rollout length per env; trades update frequency against data freshness.

512 · 1024 ◀ · 2048 · 4096

current = 1024

Honor · 風牌 · Critic Design (Optional · GA-Searchable)

critic_mode · categorical · 3 options

Lets the GA decide whether to reintroduce a value signal.

東 / none ◀ · 南 / reward_pred · 西 / standalone

current = none (critic-free)

gae_lambda · [0.9, 0.99] · active if critic_mode ≠ none

GAE bias-variance tradeoff; inactive in the current critic-free setup.

0.90 · 0.95 · 0.97 · 0.99

inactive

discount_gamma · [0.95, 0.999] · active if reward_predictor

Only active when reward_predictor shaping is enabled.

0.95 · 0.97 · 0.99 · 0.999

inactive
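The activation rules for the Honor genes can be sketched as a small function that keeps only the genes that apply to a given `critic_mode`. This is a minimal sketch under one stated assumption: the "reward_predictor" condition on `discount_gamma` is taken to mean `critic_mode == "reward_pred"`. The function name `active_honor_genes` is hypothetical.

```python
from typing import Dict, Union

CRITIC_MODES = ("none", "reward_pred", "standalone")


def active_honor_genes(
    critic_mode: str, gae_lambda: float, discount_gamma: float
) -> Dict[str, Union[str, float]]:
    """Return only the Honor genes that are active for this critic_mode."""
    if critic_mode not in CRITIC_MODES:
        raise ValueError(f"unknown critic_mode: {critic_mode!r}")

    genes: Dict[str, Union[str, float]] = {"critic_mode": critic_mode}
    # gae_lambda is active whenever any critic is present.
    if critic_mode != "none":
        genes["gae_lambda"] = gae_lambda
    # Assumption: "active if reward_predictor" means critic_mode == "reward_pred".
    if critic_mode == "reward_pred":
        genes["discount_gamma"] = discount_gamma
    return genes


print(active_honor_genes("none", 0.95, 0.97))
print(active_honor_genes("standalone", 0.95, 0.97))
print(active_honor_genes("reward_pred", 0.95, 0.97))
```

Inactive genes can still be carried in the chromosome so that crossover keeps a fixed length; filtering like this would only apply when the hand is turned into a training configuration.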

04 Sample Chromosomes

Three Possible "Hands"

Each hand is a complete PPO configuration. The GA evolves these hands across generations.

Hand 1 · S2 Hand-Tuned · score: −0.27 / game

lr 3e-5 · cyc 50M · min 1e-6 · clip .2 · ent .01 · epoch 2 · gn .5 · mb 1024 · ns 1024 · none · —

Degraded hand: collapses around step 150M and loses the edge entirely.

Hand 2 · GA Candidate A · hypothetical

lr 1e-5 · cyc 75M · min 3e-7 · clip .1 · ent .03 · epoch 2 · gn 1.0 · mb 4096 · ns 2048 · none · —

Slower, safer updates + bigger batch (higher effective sample size).

Hand 3 · GA Candidate B · hypothetical

lr 3e-5 · cyc 100M · min 1e-6 · clip .2 · ent .005 · epoch 3 · gn .5 · mb 2048 · ns 1024 · reward_pred · λ=.95 · γ=.97

GA activates the reward-predictor critic + a longer LR cycle.

Evolution narrative: The GA starts with a random population of 12 hands, each a 12-tile chromosome, and evaluates every hand over 1M training steps. It then breeds stronger hands through crossover and mutation. After 8 generations × 12 individuals = 96 hands played, the winning hand remains: stable and edge-preserving.
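The evaluation budget in this narrative (8 generations × 12 individuals = 96 evaluations) can be sketched as a generic loop. Everything below is a stand-in for illustration: `run_ga`, tournament selection, and the toy one-gene fitness are assumptions, and in the real setup `evaluate` would be a 1M-step training run rather than a closed-form function.

```python
import random
from typing import Callable, List, Tuple, TypeVar

Hand = TypeVar("Hand")


def tournament(scored: List[Tuple[float, Hand]], rng: random.Random) -> Hand:
    """Pick two scored hands at random and return the better one (assumed selection rule)."""
    a, b = rng.sample(scored, 2)
    return a[1] if a[0] >= b[0] else b[1]


def run_ga(
    sample: Callable[[random.Random], Hand],
    evaluate: Callable[[Hand], float],
    crossover: Callable[[Hand, Hand, random.Random], Hand],
    mutate: Callable[[Hand, random.Random], Hand],
    rng: random.Random,
    generations: int = 8,
    pop_size: int = 12,
) -> Tuple[Hand, float, int]:
    """Return (best hand, best score, number of evaluations)."""
    population = [sample(rng) for _ in range(pop_size)]
    best: Tuple[float, Hand] = (float("-inf"), population[0])
    n_evals = 0
    for gen in range(generations):
        scored = [(evaluate(hand), hand) for hand in population]
        n_evals += len(scored)
        gen_best = max(scored, key=lambda s: s[0])
        if gen_best[0] > best[0]:
            best = gen_best
        if gen == generations - 1:
            break  # no need to breed after the final evaluation
        population = [
            mutate(crossover(tournament(scored, rng), tournament(scored, rng), rng), rng)
            for _ in range(pop_size)
        ]
    return best[1], best[0], n_evals


# Toy problem: a "hand" is a single float and the optimum is at 0.3.
best_hand, best_score, n_evals = run_ga(
    sample=lambda r: r.uniform(0.0, 1.0),
    evaluate=lambda x: -(x - 0.3) ** 2,
    crossover=lambda a, b, r: 0.5 * (a + b),
    mutate=lambda x, r: x + r.gauss(0.0, 0.05),
    rng=random.Random(0),
)
print(n_evals)  # 96
```

Tracking the best hand across all generations, rather than only the last one, means a strong early hand is not lost if later breeding drifts away from it.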

05 Encoding Types

Mixed-Type Gene Encoding

Log-Scale Continuous

learning_rate · lr_cycle_steps · lr_min · entropy_coef

Evenly spaced in the log domain → crossover and mutation operate in log-space, and the result is mapped back to linear scale. This balances exploration across orders of magnitude.
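To make the log-space round trip concrete, here is a minimal sketch: values are converted to log10, an operator is applied there, and the result is clamped and mapped back. The helper names are hypothetical, and the midpoint blend is only a stand-in operator chosen to keep the example short.

```python
import math


def to_log(value: float) -> float:
    """Encode a positive gene value as its base-10 logarithm."""
    return math.log10(value)


def from_log(z: float, lo: float, hi: float) -> float:
    """Decode a log10 gene, clamping to the search range [lo, hi]."""
    z = min(max(z, math.log10(lo)), math.log10(hi))
    return 10.0 ** z


def log_midpoint(a: float, b: float, lo: float, hi: float) -> float:
    """Stand-in crossover: average two parents in log space (their geometric mean)."""
    return from_log(0.5 * (to_log(a) + to_log(b)), lo, hi)


# learning_rate range from the search space: [1e-6, 1e-3]
child = log_midpoint(1e-6, 1e-4, lo=1e-6, hi=1e-3)
linear_mean = 0.5 * (1e-6 + 1e-4)
print(child)        # close to 1e-5, halfway between the parents in orders of magnitude
print(linear_mean)  # about 5.05e-5, dominated by the larger parent
```

The comparison shows the motivation: a linear average is pulled toward the larger parent, while the log-space average sits one order of magnitude from each.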

Linear Continuous

clip_range · max_grad_norm · gae_lambda · discount_gamma

SBX (Simulated Binary Crossover, η=20) + polynomial mutation (η=20) in original linear space.
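For reference, the textbook forms of both operators are sketched below with η=20. This is the basic variant with clipping to the gene's range, not necessarily the variant used in this project (bounded SBX and bounded polynomial mutation also exist). All function names are illustrative, and per-gene application probabilities are left out.

```python
import random
from typing import Tuple

ETA = 20.0  # distribution index quoted for both SBX and polynomial mutation


def clip(x: float, lo: float, hi: float) -> float:
    return min(max(x, lo), hi)


def sbx_beta(u: float, eta: float = ETA) -> float:
    """Spread factor for SBX given a uniform draw u in [0, 1)."""
    if u <= 0.5:
        return (2.0 * u) ** (1.0 / (eta + 1.0))
    return (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta + 1.0))


def sbx(p1: float, p2: float, lo: float, hi: float,
        rng: random.Random, eta: float = ETA) -> Tuple[float, float]:
    """Simulated binary crossover on one linear gene; children are clipped to [lo, hi]."""
    beta = sbx_beta(rng.random(), eta)
    c1 = 0.5 * ((1.0 + beta) * p1 + (1.0 - beta) * p2)
    c2 = 0.5 * ((1.0 - beta) * p1 + (1.0 + beta) * p2)
    return clip(c1, lo, hi), clip(c2, lo, hi)


def poly_delta(u: float, eta: float = ETA) -> float:
    """Normalized perturbation in [-1, 1) for polynomial mutation."""
    if u < 0.5:
        return (2.0 * u) ** (1.0 / (eta + 1.0)) - 1.0
    return 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta + 1.0))


def polynomial_mutation(x: float, lo: float, hi: float,
                        rng: random.Random, eta: float = ETA) -> float:
    """Perturb x by a fraction of the range width, then clip to [lo, hi]."""
    return clip(x + poly_delta(rng.random(), eta) * (hi - lo), lo, hi)


rng = random.Random(0)
# clip_range parents 0.1 and 0.3 within the search range [0.05, 0.4]
children = sbx(0.1, 0.3, 0.05, 0.4, rng)
mutated = polynomial_mutation(children[0], 0.05, 0.4, rng)
print(children, mutated)
```

With η=20 the spread factor stays close to 1 for most draws, so children tend to land near their parents and mutations tend to be small relative to the range.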

Integer / Power-of-Two

ppo_epochs · minibatch_size · num_steps

SBX with rounding to nearest valid integer / power-of-2. Snapped after crossover and mutation.
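The snapping step might look like the sketch below. Two choices here are assumptions rather than anything stated in the deck: integers round half up, and "nearest power of two" is measured in log2 space instead of raw units. The function names are hypothetical.

```python
import math
from typing import Sequence

MINIBATCH_SIZES = [512, 1024, 2048, 4096, 8192]  # from the search space


def snap_int(x: float, lo: int, hi: int) -> int:
    """Round half up to an integer, then clamp to [lo, hi]."""
    return max(lo, min(hi, math.floor(x + 0.5)))


def snap_pow2(x: float, choices: Sequence[int]) -> int:
    """Return the allowed power of two closest to x in log2 space (assumed metric)."""
    target = math.log2(max(x, 1.0))
    return min(choices, key=lambda c: abs(math.log2(c) - target))


print(snap_int(2.6, 1, 5))               # 3
print(snap_int(7.9, 1, 5))               # 5 (clamped to the ppo_epochs upper bound)
print(snap_pow2(1500, MINIBATCH_SIZES))  # 2048
print(snap_pow2(100, MINIBATCH_SIZES))   # 512 (smallest allowed value)
```

Note that the metric matters: 1500 is closer to 1024 in raw units but closer to 2048 in log2 space, so the snapping rule should match whatever space the operators run in.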

Categorical

critic_mode (3 options: 東 / 南 / 西)

Uniform crossover (take one parent's value) + mutation (replace with a random alternative with probability 1/n_genes).
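A minimal sketch of the two categorical operators, assuming `n_genes = 12` (the chromosome length) and a 50/50 parent choice in the crossover. The function names are illustrative.

```python
import random
from typing import Sequence

CRITIC_MODES = ["none", "reward_pred", "standalone"]
N_GENES = 12  # chromosome length; sets the per-gene mutation probability


def uniform_crossover(a: str, b: str, rng: random.Random) -> str:
    """Take one parent's value; the 50/50 split is an assumption."""
    return a if rng.random() < 0.5 else b


def mutate_categorical(value: str, options: Sequence[str],
                       rng: random.Random, n_genes: int = N_GENES) -> str:
    """With probability 1/n_genes, replace value with a different random option."""
    if rng.random() < 1.0 / n_genes:
        return rng.choice([o for o in options if o != value])
    return value


rng = random.Random(0)
child = uniform_crossover("none", "reward_pred", rng)
child = mutate_categorical(child, CRITIC_MODES, rng)
print(child)
```

Excluding the current value from the mutation candidates ensures that a mutation, when it fires, always changes the gene.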

  Parameter visualization deck for Figma re-design. Tile assets from pongpong-online/web/src/assets/tiles/.

  Search space: see 06_search_space_correct.csv. Chromosome encoding: see 07_ga_execution_plan.csv.