WW-PGD: Projected Gradient Descent optimizer

Announcing: ๐—ช๐—ช-๐—ฃ๐—š๐—— โ€” ๐—ช๐—ฒ๐—ถ๐—ด๐—ต๐˜๐—ช๐—ฎ๐˜๐—ฐ๐—ต๐—ฒ๐—ฟ ๐—ฃ๐—ฟ๐—ผ๐—ท๐—ฒ๐—ฐ๐˜๐—ฒ๐—ฑ ๐—š๐—ฟ๐—ฎ๐—ฑ๐—ถ๐—ฒ๐—ป๐˜ ๐——๐—ฒ๐˜€๐—ฐ๐—ฒ๐—ป๐˜ ๐Ÿš€

I just released WW-PGD, a small PyTorch add-on that wraps standard optimizers (SGD, Adam, AdamW, etc.) and applies an epoch-boundary spectral projection using WeightWatcher diagnostics.
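The wrapping pattern looks roughly like the sketch below. This is illustrative only: `ww_pgd_project` is a placeholder name for the epoch-boundary projection (a sketch of it appears later in this post), not the repo's confirmed API, so see the QuickStart notebook for the real interface.

```python
import torch

# Illustrative training-loop pattern (not the repo's exact API).
# Assumes `model`, `train_loader`, and `ww_pgd_project` (sketched
# later in this post) are defined. The inner optimizer runs unchanged.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

for epoch in range(35):
    for x, y in train_loader:
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        opt.step()                    # plain AdamW step
    ww_pgd_project(model, epoch)      # epoch-boundary spectral projection
```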

Elevator pitch: WW-PGD explicitly nudges each layer toward the Exact Renormalization Group (ERG) critical manifold during training.

𝗧𝗵𝗲𝗼𝗿𝘆 𝗶𝗻 𝘀𝗵𝗼𝗿𝘁

• HTSR critical condition: α ≈ 2 (the fitted power-law exponent of each layer's eigenvalue spectrum)

• SETOL ERG condition: trace-log(λ) over the spectral tail = 0, i.e. Σ log λᵢ = 0 for the tail eigenvalues

WW-PGD makes these explicit optimization targets, rather than post-hoc diagnostics.
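To see what these two targets mean numerically, here is a small diagnostic snippet. It uses the actual weightwatcher `analyze()` API; the `alpha` and `xmin` columns match recent ww releases, but check your version. The tail selection here is a simplification of WW-PGD's per-epoch tail search, and it assumes ww's default, unnormalized ESD.

```python
import numpy as np
import torch.nn as nn
import weightwatcher as ww

# Diagnostics only; the projection itself is sketched further down.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

details = ww.WeightWatcher(model=model).analyze()
print(details[["layer_id", "alpha"]])        # HTSR target: alpha ~ 2

# ERG check for the first Linear layer: eigenvalues of W^T W lying in
# the power-law tail (above the fitted xmin) should have log-sum zero.
W = model[0].weight.detach().numpy()
evals = np.linalg.eigvalsh(W.T @ W)
tail = evals[evals > float(details.iloc[0]["xmin"])]
print("trace-log over tail:", np.log(tail).sum())   # SETOL ERG target: 0
```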

๐—›๐—ผ๐˜„ ๐—ถ๐˜ ๐˜„๐—ผ๐—ฟ๐—ธ๐˜€

  • Runs weightwatcher (ww) at epoch boundaries
  • Uses ww layer quality metrics to identify the spectral tail
  • Selects the optimal tail guess at each epoch
  • Applies a stable projected gradient descent update to the layer spectral density via a proximal, Cayley-like step
  • Retracts to exactly satisfy the SETOL ERG condition
  • Blends the projected weights back in (with warmup + ramping to avoid early instability)

In other words, it projects the output of your optimizer onto the ERG critical manifold, the feasible set in a spectrally constrained optimization problem (see the sketch below).
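To make the projection concrete, here is a minimal, self-contained sketch of such a project-retract-blend step, under simplifying assumptions of my own: a fixed top-k tail instead of ww's fitted tail selection, a plain multiplicative retraction instead of the proximal, Cayley-like step, and a linear ramp on the blend coefficient.

```python
import torch

def ww_pgd_project(model, epoch, warmup=5, ramp=10, gamma_max=0.2, k=10):
    """Illustrative sketch, not the repo's implementation.

    Retracts each Linear layer's top-k spectral tail onto the ERG
    manifold (sum of log tail eigenvalues = 0), then blends the
    projected weights back in with a warmed-up, ramped coefficient.
    """
    if epoch < warmup:                               # skip early epochs
        return
    gamma = gamma_max * min(1.0, (epoch - warmup + 1) / ramp)
    with torch.no_grad():
        for m in model.modules():
            if not isinstance(m, torch.nn.Linear):
                continue
            U, S, Vh = torch.linalg.svd(m.weight, full_matrices=False)
            kk = min(k, S.numel())
            lam = S[:kk] ** 2                        # tail eigenvalues of W W^T
            c = torch.exp(-torch.log(lam).mean())    # sum(log(c * lam)) == 0
            S_proj = S.clone()
            S_proj[:kk] = S[:kk] * c.sqrt()          # retract tail exactly
            W_proj = U @ torch.diag(S_proj) @ Vh
            m.weight.mul_(1 - gamma).add_(gamma * W_proj)  # blend back in
```

The retraction identity does the work here: multiplying the k tail eigenvalues by c = exp(-mean(log λ)) makes their log-sum exactly zero, which is the SETOL ERG condition stated above.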

𝗦𝗰𝗼𝗽𝗲 (𝗶𝗺𝗽𝗼𝗿𝘁𝗮𝗻𝘁)

This first public release is focused on training small models from scratch; it is not yet intended for large-scale fine-tuning. It's a first proof of concept of the approach.

So far, WW-PGD has been tested on:

  • 3-layer MLPs (MNIST / FashionMNIST)
  • nano-GPTโ€“style small Transformer models

Larger architectures and fine-tuning workflows are active work in progress.

๐—˜๐—ฎ๐—ฟ๐—น๐˜† ๐—ฟ๐—ฒ๐˜€๐˜‚๐—น๐˜๐˜€ (๐—™๐—ฎ๐˜€๐—ต๐—ถ๐—ผ๐—ป๐— ๐—ก๐—œ๐—ฆ๐—ง, ๐Ÿฏ๐Ÿฑ ๐—ฒ๐—ฝ๐—ผ๐—ฐ๐—ต๐˜€, ๐—บ๐—ฒ๐—ฎ๐—ป ยฑ ๐˜€๐˜๐—ฑ)

Below I show the layer alphas for a small (3-layer) MLP trained on FashionMNIST for 35 epochs, compared to default AdamW:

โ€ข ๐๐ฅ๐š๐ข๐ง ๐ญ๐ž๐ฌ๐ญ: Baseline 98.05% ยฑ 0.13ย vsย WW-PGD 97.99% ยฑ 0.17

โ€ข ๐€๐ฎ๐ ๐ฆ๐ž๐ง๐ญ๐ž๐ ๐ญ๐ž๐ฌ๐ญ: Baseline 96.24% ยฑ 0.17ย vsย WW-PGD 96.23% ยฑ 0.20

Translation: accuracy is roughly neutral at this scale, but WW-PGD gives you a spectral control knob and full per-epoch tuning.

𝗥𝗲𝗽𝗼 & 𝗤𝘂𝗶𝗰𝗸𝗦𝘁𝗮𝗿𝘁

🧩 Repo: https://github.com/CalculatedContent/WW_PGD

📓 QuickStart (with MLP3+FashionMNIST example): https://github.com/CalculatedContent/WW_PGD/blob/main/WW_PGD_QuickStart.ipynb

๐Ÿ” ๐— ๐—ผ๐—ฟ๐—ฒ ๐—ถ๐—ป๐—ณ๐—ผ: https://weightwatcher.ai/ww_pgd.html

If you're experimenting with training and optimization on your own models, or want a data-free spectral health monitor + projection step, I'd love feedback, especially on other optimizers or small Transformer setups.

Join us on the weightwatcher Community Discord to discuss:

💬 https://discord.com/invite/uVVsEAcfyF

A big thanks to Hari Kishan Prakash for helping out here.

And, as always, if you need help with AI, reach out to me here. #talkToChuck
