A Taiwanese 16-tile Mahjong self-play reinforcement-learning platform.
Go-powered engine · Transformer policy · Live production deployment with real human players.
idx = 0 → 一萬
idx = 9 → 一筒
idx = 18 → 一索
idx = 27 → 東
idx = 31 → 中
idx = 34 → 春 (花)
Every tile is assigned a stable integer idx (0–33 for scored tiles, 34–41 for flowers).
The observation tensor and action space both index into this same space — when the policy outputs
argmax = 4, it means "discard 五萬".
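As a minimal sketch of this indexing scheme (the helper name and dict below are illustrative, not the actual API of pkg/mahjong/tile.go, which is Go), the suit bases listed above can be expressed as:

```python
# Illustrative tile-index mapping; bases taken from the tile table above.
# tile_idx() is a hypothetical helper, not an engine identifier.
SUIT_BASE = {"man": 0, "pin": 9, "sou": 18, "wind": 27, "dragon": 31, "flower": 34}

def tile_idx(suit: str, rank: int) -> int:
    """Stable integer index for a tile, e.g. ("man", 5) -> 4 (五萬)."""
    return SUIT_BASE[suit] + (rank - 1)

assert tile_idx("man", 1) == 0      # 一萬
assert tile_idx("pin", 1) == 9      # 一筒
assert tile_idx("sou", 1) == 18     # 一索
assert tile_idx("wind", 1) == 27    # 東
assert tile_idx("dragon", 1) == 31  # 中
assert tile_idx("man", 5) == 4      # 五萬, the argmax = 4 example above
```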
02 · Tile Universe
34 Scored Tiles + 8 Flowers — each with a numeric index
Indices below each tile match pkg/mahjong/tile.go. The integer is the tile — it feeds every observation channel and every action logit.
Flowers · 花牌 bonus tiles (idx 34–41, drawn & replaced — not scored into patterns)
34 春 · 35 夏 · 36 秋 · 37 冬 · 38 梅 · 39 蘭 · 40 菊 · 41 竹
34 Scored Tile Types · Man 9 + Pin 9 + Sou 9 + Winds 4 + Dragons 3
4 Copies per Scored Tile · 136 scored tiles total in the wall
144 Wall Size (incl. flowers) · 136 scored + 8 flowers × 1
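The counts above can be checked in a few lines (a sketch, not the engine's actual wall-building code):

```python
# 34 scored tile types x 4 copies, plus 8 unique flower tiles.
scored = [idx for idx in range(34) for _ in range(4)]   # 136 scored tiles
flowers = list(range(34, 42))                           # 春夏秋冬梅蘭菊竹, 1 copy each
wall = scored + flowers

assert len(scored) == 136
assert len(flowers) == 8
assert len(wall) == 144                                 # full wall incl. flowers
```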
03 · Game Setup
Four Players, 16 Tiles Each
Taiwan 16-tile Mahjong: each player holds 16 tiles (dealer 17). Draw one, discard one, rotate clockwise. Flowers auto-replace. The observation is rotated so the learner is always rel_seat = 0.
Fan scoring: base 1 fan. Examples: 門清 (+1) · 自摸 (+3) · 大三元 (+8) · 大四喜 (+16). Payment = (1 + total_fan) per unit from each loser. All reward signals to the RL agent are scaled versions of this fan payment (zero-sum across 4 seats).
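A hedged sketch of the payment rule, using only the example fan values quoted above (the full fan table lives in the Go engine; dealer 2× multipliers and other scoring details are omitted here):

```python
# Example fan values from the text; the real table is much larger.
FAN = {"門清": 1, "自摸": 3, "大三元": 8, "大四喜": 16}

def payment_units(patterns):
    """Units each loser pays: 1 + total fan (the base fan is the leading 1)."""
    return 1 + sum(FAN[p] for p in patterns)

# Self-drawn (自摸) concealed (門清) win: all three other seats pay the winner.
units = payment_units(["門清", "自摸"])          # 1 + (1 + 3) = 5 units per loser
rewards = [3 * units, -units, -units, -units]    # winner in seat 0

assert units == 5
assert sum(rewards) == 0                         # zero-sum across the 4 seats
```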
05 · Observation Representation
A Concrete Example — from Game State to 34 × 133 Tensor
The observation is laid out as obs[ch × 34 + tile_idx]. Each of the 34 scored-tile indices is a token; each of the 133 channels is a feature of that token. Below we show a tiny game state and trace exactly which values get written.
How it becomes the tensor — channels 0–20 (first per-tile block)
Each row is one channel. Each column is one of the 34 tile tokens. Only the first 12 columns shown below; the other 22 are all zero in this example.
Colours: 0.25 · 0.5 · 0.75 · 1.0
| ch \ tile_idx            | 0    | 1 | 2 | 3 | 4    | 5 | 6 | 7 | 8    | 13   | 25   | 31  |
|--------------------------|------|---|---|---|------|---|---|---|------|------|------|-----|
| Hand ≥ 1 (ch 0)          | 1    | 0 | 0 | 0 | 1    | 0 | 0 | 0 | 0    | 1    | 0    | 1   |
| Hand ≥ 2 (ch 1)          | 1    | 0 | 0 | 0 | 1    | 0 | 0 | 0 | 0    | 0    | 0    | 1   |
| Hand ≥ 3 (ch 2)          | 1    | 0 | 0 | 0 | 0    | 0 | 0 | 0 | 0    | 0    | 0    | 0   |
| Hand ≥ 4 (ch 3)          | 0    | 0 | 0 | 0 | 0    | 0 | 0 | 0 | 0    | 0    | 0    | 0   |
| Own discards /4 (ch 4)   | 0    | 0 | 0 | 0 | 0    | 0 | 0 | 0 | 0.25 | 0    | 0    | 0   |
| rel_1 discards /4 (ch 5) | 0    | 0 | 0 | 0 | 0.25 | 0 | 0 | 0 | 0    | 0    | 0.25 | 0   |
| Visible /4 (ch 20)       | 0.75 | 0 | 0 | 0 | 0.75 | 0 | 0 | 0 | 0.25 | 0.25 | 0.25 | 0.5 |
Read this: column "idx 0" (一萬) shows ≥1, ≥2, ≥3 all firing — so the learner holds 3 copies of 一萬.
Column "idx 4" (五萬) shows ≥1, ≥2 only — so 2 copies. Column "idx 31" (中) shows ≥1, ≥2 — so 2 copies.
Channel 20 sums up everything visible: 0.75 for 一萬 (3 in hand), 0.75 for 五萬 (2 in hand + 1 seen in rel_1's discards), etc.
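The worked columns above can be reproduced in a few lines; the dict and function names below are illustrative, not engine identifiers:

```python
# The tiny game state from the example (tile_idx -> count).
hand = {0: 3, 4: 2, 13: 1, 31: 2}    # 一萬 x3, 五萬 x2, 五筒 x1, 中 x2
own_discards = {8: 1}                # 九萬 discarded by the learner
rel1_discards = {4: 1, 25: 1}        # 五萬 and 八索 discarded by rel_1

def threshold_channels(counts, tile):
    """Channels 0-3: have >= 1 / >= 2 / >= 3 / >= 4 copies of this tile."""
    n = counts.get(tile, 0)
    return [1.0 if n >= k else 0.0 for k in (1, 2, 3, 4)]

def visible(tile):
    """Channel 20: all copies visible anywhere (hand + all discards) / 4."""
    n = (hand.get(tile, 0) + own_discards.get(tile, 0)
         + rel1_discards.get(tile, 0))
    return n / 4.0

assert threshold_channels(hand, 0) == [1, 1, 1, 0]    # 3 copies of 一萬
assert threshold_channels(hand, 4) == [1, 1, 0, 0]    # 2 copies of 五萬
assert threshold_channels(hand, 31) == [1, 1, 0, 0]   # 2 copies of 中
assert visible(0) == 0.75 and visible(4) == 0.75 and visible(31) == 0.5
```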
Complete channel layout (133 total)
ch 0–3
Hand — count-threshold encoding (per-tile)
4 binary layers: have ≥1 / ≥2 / ≥3 / ≥4 of this tile. Redundant-but-useful trick: lets the network read "pair vs triplet vs quad" without doing arithmetic.
ch 4–7
Discard piles, 4 seats (per-tile)
Each cell = (count in that seat's discard pile) / 4. Seats are rotated: ch 4 = own, ch 5 = rel_1 (下家), ch 6 = rel_2 (對家), ch 7 = rel_3 (上家).
ch 8–19
Open / semi-open melds, 4 seats × 3 meld types (per-tile)
For each seat: one channel each for chi / pon / kan tiles. Concealed kans of opponents are broadcast uniformly (hidden identity).
ch 20
Globally visible count (per-tile)
(tiles visible everywhere: all discards + all open melds + own hand) / 4. A single "how many of this tile are still in play" feature.
ch 21–23
Last-discard source flag (broadcast)
One-hot over rel_1 / rel_2 / rel_3 indicating who just discarded. Ch 68 (per-tile) says what they discarded.
ch 24
Round wind one-hot (per-tile)
Set on one of idx 27–30 (東南西北). Indicates current round wind.
ch 25
Seat wind one-hot (per-tile)
Set on one of idx 27–30 — the agent's seat-specific wind (可加台: it can add fan at scoring).
ch 26
Dealer flag (broadcast)
1 across all 34 tokens if the learner is the current dealer, else 0. Dealer loses/wins 2× score.
ch 27
Draws left / 80 (broadcast)
How deep into the wall we are. Controls risk tolerance (defend harder late).
ch 28
First round flag (broadcast)
1 if still in the dealer's first draw cycle (for 天胡 / 地胡 — "heavenly" / "earthly" hand — eligibility).
ch 29
After-kan flag (broadcast)
1 if the current draw came from a kan replacement (for 槓上開花, winning on the replacement tile after a kan).
ch 30–33
Flower counts per seat / 8 (broadcast)
rel_0..3 flower counts. Flowers auto-draw replacements and contribute fan at scoring time.
ch 34–45
Meld source — who contributed (per-tile)
4 seats × 3 relative from-directions. Marks which tile, and from whom, each open meld was claimed.
ch 46
門清 (concealed) flag (broadcast)
1 if no open melds — enables +1 fan bonus for winning without calls.
ch 47–50
Suit ratios Man / Pin / Sou / Honor (broadcast)
Fraction of learner's total tiles in each suit. Lets the network quickly see "going one-suit (清一色)".
ch 51, 52
Concealed-triplet count / Pair count (broadcast)
Normalized counts of "≥3" and "≥2" in hand. Quick structural summary.
ch 69–132
Discard history, 4 seats × 16 slots
For each seat and each of the last 16 discard slots, one channel. Value = 0 (not discarded), 1 (手切, tedashi — discarded from hand), or 2 (摸切, tsumogiri — discarded the freshly drawn tile). This gives the transformer temporal context for inferring opponents' intentions.
Total: 69 game-state channels + 64 discard-history channels = 133 channels × 34 tile tokens = 4,522-dim observation.
Two scopes coexist: per-tile channels write real per-tile values; broadcast channels copy the same scalar across all 34 tokens (treated as a "global feature for this step"). The transformer reshapes this to (34, 133) and runs self-attention over tile tokens.
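A minimal sketch of the two scopes under the obs[ch × 34 + tile_idx] layout (the writer helpers below are illustrative, not engine identifiers):

```python
C, T = 133, 34                       # channels, tile tokens
obs = [0.0] * (C * T)                # flat 4,522-dim observation

def write_per_tile(ch, tile, value):
    """Per-tile scope: one real value per token."""
    obs[ch * T + tile] = value

def write_broadcast(ch, value):
    """Broadcast scope: the same scalar copied across all 34 tokens."""
    for tile in range(T):
        obs[ch * T + tile] = value

write_per_tile(0, 4, 1.0)            # "have >= 1 copy of 五萬" (ch 0)
write_broadcast(26, 1.0)             # dealer flag (ch 26)

# The token view the transformer sees: token t = its 133-dim feature column.
token_4 = [obs[ch * T + 4] for ch in range(C)]
assert len(token_4) == C and token_4[0] == 1.0 and token_4[26] == 1.0
assert all(obs[26 * T + t] == 1.0 for t in range(T))   # broadcast hit every token
```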
06 · Action Space
109 Discrete Actions
Model outputs a 109-dim logit vector. An action mask zeros out illegal moves before softmax, so the policy can never emit an illegal action.
| Action IDs | Action | Count |
|------------|--------|-------|
| 0–33 | Discard — one action per tile idx | 34 |
| 34 | Hu · 胡牌 — declare a winning hand (self-drawn 自摸 or from an opponent's discard 放槍) | 1 |
| 35 | Pass · 過 — decline any available reaction; let the turn proceed | 1 |
| 36 | Pon · 碰 — triplet from an opponent's discard | 1 |
| 37 | MinKan · 明槓 — quad from an opponent's discard | 1 |
| 38–40 | Chi · 吃 — sequence from upstream's discard, 3 positions: low / mid / high | 3 |
| 41–74 | AnKan · 暗槓 — concealed quad from hand, one per tile idx | 34 |
| 75–108 | KaKan · 加槓 — upgrade an open pon by adding the 4th copy from hand | 34 |
Action masking: illegal action logits are set to −∞ before softmax. Legal-action count per step is typically 2–10 (usually discard-only: 1 legal discard per distinct hand tile, typically 4–14 distinct tiles).
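The masking step can be sketched as follows (a pure-Python softmax for clarity; the actual trainer applies this to logit tensors):

```python
import math

def masked_softmax(logits, legal):
    """Set illegal logits to -inf, then softmax: illegal actions get zero mass."""
    masked = [x if ok else -math.inf for x, ok in zip(logits, legal)]
    m = max(masked)                              # subtract max for stability
    exps = [math.exp(x - m) for x in masked]     # exp(-inf) == 0.0
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 0.5, -1.0, 3.0]
legal = [True, False, True, False]               # e.g. only two discards legal
probs = masked_softmax(logits, legal)

assert probs[1] == 0.0 and probs[3] == 0.0       # illegal actions never sampled
assert abs(sum(probs) - 1.0) < 1e-9              # still a valid distribution
```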
07 · Model Architecture
MahjongTransformer · 10M Parameters
Self-attention over the 34 tile tokens lets the policy reason about arbitrary tile-to-tile relationships (sequences, triplets, honor pairs, defensive patterns). Below: the full tensor pipeline, attention visualization, and inside-the-layer breakdown.
Input
34 × 133
reshaped 4,522-dim observation
→
Transformer Encoder
4L × 8H × d=256
pre-norm; ff_dim=1024; no dropout
→
Output
109 logits
flatten(34 × 256) → Linear → masked softmax
7.1 · Full Tensor Pipeline
Each block is a tensor labeled with its shape (batch dim B omitted). Each dashed box between tensors is the operation that transforms one into the next.
(B, 4522)
raw observation — flat 1-D vector
.view(−1, 133, 34)
(B, 133, 34)
restore (channel, tile) layout
.transpose(1, 2)
(B, 34, 133)
34 tokens × 133 features each
input_proj : Linear(133 → 256), shared across all 34 tokens
(B, 34, 256)
tokens projected to d_model = 256
+ pos_embed (1, 34, 256) — learnable per-tile position
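The pipeline above, sketched with stock PyTorch (module names follow the deck's labels; this is not the actual MahjongTransformer source):

```python
import torch
import torch.nn as nn

B = 2
input_proj = nn.Linear(133, 256)                    # shared across all 34 tokens
pos_embed = nn.Parameter(torch.zeros(1, 34, 256))   # learnable per-tile position

obs = torch.randn(B, 4522)           # (B, 4522)   flat observation
x = obs.view(-1, 133, 34)            # (B, 133, 34) restore (channel, tile) layout
x = x.transpose(1, 2)                # (B, 34, 133) 34 tokens x 133 features
x = input_proj(x) + pos_embed        # (B, 34, 256) enters the encoder
assert x.shape == (B, 34, 256)
```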
Unlike NLP Transformers where tokens are words and sequence length varies, here the sequence length is always 34. Each token corresponds to one scored tile type. The 133-dim per-token feature vector packs "what this tile means in the current state" — hand counts, discards, meld participation, shanten-if-discarded, etc.
[Token diagram] 34 raw tokens (tile_idx 0–33, each a 133-dim feature column) → Linear(133 → 256), shared weights across all 34 tokens → tok0 … tok33, each 256-d → + pos_embed (34 × 256) → enters encoder.
7.3 · Self-Attention Across 34 Tile Tokens
At every encoder layer, each token computes attention weights over all other tokens. Below shows the query token 五萬 (idx 4) and its attention distribution — thicker edges = higher attention.
strong attention (sequence / triplet partners)
weak attention (unrelated tiles)
What self-attention learns automatically: without hand-coded rules, the network learns that 五萬 should attend strongly to 三萬, 四萬, 六萬, 七萬 (neighbors for sequence building), to its own column (hand-count features), and to recent discards of the same tile (defensive reasoning). Honor tiles learn to attend to their triplet partners (三元 / 風). These patterns are what MLP / CNN architectures fail to capture efficiently.
7.4 · Inside One Encoder Layer (× 4 stacked)
Pre-norm Transformer block with residual connections. Each of the 4 stacked layers has identical structure; parameters are independent.
MULTI-HEAD SELF-ATTENTION
The 256-d token is split into 8 heads of 32-d each. Each head computes its own Q/K/V projection and attention softmax, producing 8 independent "views" of tile relationships. Results are concatenated back to 256-d.
params per layer ≈ 4 × (256 × 256) = 262K
FEED-FORWARD
Position-wise MLP: expands each token to 1024-d, applies GELU, projects back to 256-d. Operates independently per token — no cross-token mixing here (attention did that).
params per layer ≈ 2 × (256 × 1024) = 524K
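One way to instantiate the described pre-norm block with stock PyTorch (a sketch; the actual MahjongTransformer may define its own layer modules):

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(
    d_model=256, nhead=8,                # 8 heads x 32-d each
    dim_feedforward=1024,                # position-wise FFN expansion
    dropout=0.0, activation="gelu",
    batch_first=True, norm_first=True,   # pre-norm + residual connections
)
# num_layers deep-copies the layer: identical structure, independent parameters.
encoder = nn.TransformerEncoder(layer, num_layers=4)

x = torch.randn(2, 34, 256)              # (B, tile tokens, d_model)
y = encoder(x)
assert y.shape == x.shape                # shape-preserving over the 34 tokens
```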
7.5 · Parameter Budget
Component
Shape
Params
input_proj
Linear(133 → 256)
34 K
pos_embed
Param(1, 34, 256)
9 K
encoder × 4 layers
attn 262K + FFN 524K + norms per layer
~3.2 M
LayerNorm (final)
Norm(256)
0.5 K
policy_head
Linear(8704 → 109)
949 K
Total
~4.2 M
Note: the S2 self-play run uses d_model=256 plus an auxiliary danger_head; with the older, wider config (d=384, 6 layers) the parameter count reaches the ~10M quoted elsewhere. The structure is identical.
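The budget table can be re-derived by hand (weights plus biases; rounding to the nearest K explains the figures above):

```python
input_proj = 133 * 256 + 256                 # ~34 K
pos_embed = 34 * 256                         # ~9 K
attn = 4 * (256 * 256 + 256)                 # Q/K/V/out projections, ~262 K
ffn = 256 * 1024 + 1024 + 1024 * 256 + 256   # expand + project back, ~524 K
norms = 2 * 2 * 256                          # two LayerNorms (scale + shift)
encoder = 4 * (attn + ffn + norms)           # 4 stacked layers, ~3.2 M
final_norm = 2 * 256                         # ~0.5 K
policy_head = 34 * 256 * 109 + 109           # Linear(8704 -> 109), ~949 K

total = input_proj + pos_embed + encoder + final_norm + policy_head
assert 4.1e6 < total < 4.3e6                 # ~4.2 M, matching the table
```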
Architecture choice rationale: three architectures were tested at similar parameter scale: MLP, CNN, Transformer. Only the Transformer reached expert-parity. Self-attention's ability to capture any tile-to-tile relationship (cross-suit globality + long-range defense reasoning) is the key unlock for Mahjong decision-making — and it is the single most important design lever behind the S1-best milestone.
08 · Training Infrastructure
256 Parallel Environments · 3,000 Steps/Second
Go Engine
256 envs
shared-memory gRPC, complete 台灣 16-tile rules
⇄
Python PPO Trainer
PyTorch + AMP
opponent pool, rollout, update loop
→
Live Deployment
pongpong-online
Web + WebSocket, 2,814 human matches
3,000
Steps / Second
Measured mean on single RTX 4070 Ti
202M+
Total Training Steps
S2 self-play accumulated (ongoing)
2,814
Real Human Matches
40-day live deployment, 10 users
Training pipeline stages: Behavioral Cloning (BC) from expert traces → S1 self-play (curriculum against growing opponent pool) → S2 self-play (critic-free, leave-one-out baseline). Full pipeline detailed in the Incident-trajectory deck.
Pongpong system visualization · tile assets from pongpong-online/web/src/assets/tiles/
Channel layout verified against pkg/mahjong/observe.go (total = 69 + 4×16 = 133).
Companion deck to ppt_reference_en.html (the Incident narrative).