Potential foundational piece

This is a different VAE formulation that specifically targets CV for autoencoding, but the model is only preliminary and requires many systemic utilities to be instantiated before it can function.

This model will likely function entirely with KL divergence and standard AE structural systems, with the footnote that it is entirely geometrically aligned using principles similar to the SVAE: a specifically aligned internal subsystem, meant entirely for adjudication, runs through a series of embedding arrays, each aligned on the CV spectrum.

This system is essentially a CV battery container that handles hundreds of miniature SVD-trained batteries, directly implanted into the substructure as learning starter points. The early design shows promise for rapid learning transfer.

This model will not require FP64 SVD to train, and it will be almost entirely linear upon completion, which makes it a uniquely param-heavy model rather than the combination of model shapes I've been cobbling together up to this point.

geolip-svae-nosvd-ablation

Status: shelved pending proper redesign. See "What's next" section.

An ablation study exploring whether the SVD in PatchSVAE can be replaced by a learned linear readout. The short answer is no, not directly; the long answer is a list of architectural properties the SVD was providing implicitly that any replacement must supply explicitly.

This repo exists to preserve the experiment and its findings for future work. The parent architecture lives at geolip-svae and geolip-svae-batteries.

The motivating realization

During F-class sweep analysis, we articulated a claim that reframed what PatchSVAE is doing:

The SVD is a readout, not a decomposer. The encoder + sphere-normalization is the decomposer.

The argument:

  1. The encoder MLP projects a patch into a V×D matrix space
  2. Sphere-normalization (one line, zero parameters) constrains every row to S^(D-1) — the unit hypersphere in D dimensions
  3. The SVD is then exact arithmetic on V points on S^(D-1). Given V unit vectors in D-space, the factorization U·Σ·V^T is unique up to sign (when the singular values are distinct)
  4. Cross-attention is 0.013% of parameters with alpha coefficients that barely move during training — per-patch SVD already produces correct coordinates; cross-attn is verification, not coordination
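Under assumed shapes (V=64 rows, D=16 latent dims; the encoder stand-in, sizes, and names here are illustrative, not the trainer's code), steps 1–3 sketch as:

```python
import torch
import torch.nn.functional as F

V, D = 64, 16                                    # assumed sizes, not the trainer's

def svd_readout(patch_flat, encoder):
    """Steps 1-3: project, sphere-normalize rows, read singular values."""
    M = encoder(patch_flat).view(-1, V, D)               # 1. encode to a V x D matrix
    M = F.normalize(M, dim=-1)                           # 2. every row onto S^(D-1)
    U, S, Vt = torch.linalg.svd(M, full_matrices=False)  # 3. exact arithmetic readout
    return S                                             # omega: D spectral coordinates

encoder = torch.nn.Linear(256, V * D)            # stand-in for the encoder MLP
omega = svd_readout(torch.randn(8, 256), encoder)
print(omega.shape)                               # torch.Size([8, 16])
```

Because every row of M is unit-norm, the singular values are automatically bounded by sqrt(V); nothing about that bound has to be learned.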

Under this frame, omega tokens are not a learned compressed representation. They are coordinates on the universal S^(D-1) packing manifold. The universal attractor (S₀ ≈ 5.1, erank ≈ 15.88 at D=16; CV 0.20–0.23 band) is a geometric property of "V unit vectors packed as evenly as possible on S^(D-1)," not a learned statistic. The encoder discovers projections onto that fixed manifold. The manifold is fixed by the architecture.

If this is right, the SVD should be replaceable by any mechanism that reads D-dim coordinates off a sphere-normalized V×D matrix. A learned Linear(V·D → D) should work. This repo tests that hypothesis.

What the ablation actually is

| Stage | Canonical PatchSVAE | NoSVD ablation |
| --- | --- | --- |
| Encoder | MLP → V·D flat | same |
| Sphere norm | F.normalize(dim=-1) on V×D reshape | same |
| Readout | U,S,Vt = svd(M) → omega is S | omega = Linear(V·D, D)(M.flatten()) |
| Cross-attention | on S across patches | on omega across patches |
| Inverse readout | M_hat = U @ diag(S_coord) @ Vt | M_hat = sphere_norm(Linear(D, V·D)(omega_coord)) |
| Decoder | MLP from V·D flat | same |

Everything else — encoder, decoder, cross-attention logic, boundary smoothing, CV-EMA soft-hand, 16-type noise training, 30 epochs — is identical to the F-class trainer.
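A minimal sketch of the ablation column (sizes V=64, D=16 are assumed; sphere_norm is realized here with F.normalize, and the final normalize line is the Round 3 fix described below):

```python
import torch
import torch.nn.functional as F

V, D = 64, 16                                   # assumed sizes
readout = torch.nn.Linear(V * D, D)             # replaces the SVD: omega = Linear(V*D, D)
inverse_readout = torch.nn.Linear(D, V * D)     # replaces U @ diag(S_coord) @ Vt

M = F.normalize(torch.randn(8, V, D), dim=-1)   # sphere-normed V x D matrices
omega = readout(M.flatten(1))                   # learned readout instead of svd(M)
M_hat = inverse_readout(omega).view(-1, V, D)   # unconstrained in Rounds 1-2
M_hat = F.normalize(M_hat, dim=-1)              # Round 3: sphere-norm M_hat too
```

Nothing in this path bounds the Linear outputs by construction, which is exactly what the debug rounds below run into.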

What happened

Four debug rounds before shelving. Each round revealed an architectural property the SVD was providing that a naive Linear replacement doesn't.

Round 1 — baseline. r=NaN at iteration 899. Adaptive gradient clipping (clip=max(recon_loss, 1.0)) in the original trainer assumes recon_loss is architecturally bounded. Without the SVD's implicit magnitude bound, recon blows up, the clip threshold blows up with it, and protection fails.
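The Round 1 failure mode can be illustrated in isolation (this is not the trainer's code, just the adaptive rule applied to a toy parameter): when recon_loss escapes its assumed bound, the clip threshold escapes with it.

```python
import torch

p = torch.nn.Parameter(torch.randn(4))
p.grad = torch.randn(4) * 1e8                  # an exploding gradient
thresholds = []
for recon_loss in (0.5, 3.0, 1e6):             # recon blowing up over training
    clip = max(recon_loss, 1.0)                # the adaptive rule: clip=max(recon_loss, 1.0)
    torch.nn.utils.clip_grad_norm_(p, clip)    # bound scales with the loss itself
    thresholds.append(clip)
print(thresholds)                              # [1.0, 3.0, 1000000.0]
```

Once recon_loss is 1e6, "clipping" allows gradient norms up to 1e6, and the protection is gone.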

Round 2 — LayerNorm on omega, orthogonal init with gain=0.5, fixed grad_clip=1.0. r=3.2e11 before NaN. LayerNorm bounded omega but did nothing for M_hat. The decoder can push inverse_readout to amplify freely to match heavy-tailed noise values (Cauchy tan(π·0.49) ≈ 32, exponential -log(tiny) ≈ 13+), and the unconstrained Linear output amplifies unchecked during training.

Round 3 — added sphere-norm on M_hat after inverse_readout. Forward is now stable, but eval MSE is 2.3e11 and recon_ema goes NaN. Sphere-norming M_hat puts the decoder input on the same manifold as the canonical's reconstruction, but strips reconstruction-magnitude information. The decoder must hallucinate 63× amplification from unit-magnitude matrices to match Cauchy targets, which it cannot do.

Round 4 — gradient-flowing Cayley-Menger loss. This is the first implementation with a plausible mechanism. In the canonical model, the CV of pentachoron volumes is measured with .item(), which strips the gradient — it is a readout, not a force. Round 4 adds cv_loss_differentiable, which computes CV across the batch with full gradient flow and penalizes quadratic distance from the 0.215 target (the center of the 0.20–0.23 universal band). It is weighted 20.0 during ALGN epochs (geometry first) and 10.0 during HAND epochs (geometry locked), and applied to every M matrix in every batch — the encoder has no place to hide.
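A hedged sketch of that mechanism (the real cv_loss_differentiable operates on per-matrix pentachoron volumes from Cayley-Menger determinants; here any positive per-sample scalar stands in, and the function name is borrowed from the description above):

```python
import torch

def cv_loss_differentiable(volumes, target=0.215):
    """CV of a batch of positive scalars, with gradients intact (no .item())."""
    cv = volumes.std() / volumes.mean().clamp_min(1e-8)  # coefficient of variation
    return (cv - target) ** 2                            # quadratic pull toward 0.215

volumes = (torch.rand(32) + 0.5).requires_grad_()        # stand-in for pentachoron volumes
loss = cv_loss_differentiable(volumes)
loss.backward()                                          # gradient flows back, unlike the readout
```

The only structural difference from the canonical measurement is that the CV stays a tensor in the graph instead of being detached with .item().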

Round 4 was in the training file but not run before shelving. The session ended with a design-level observation:

It has to hit everything that passes through the linear sector.

The realization: the CV force, as applied, covered only one Linear (the readout bottleneck). The full geometric discipline needs to cover everything downstream that carries omega information.

What the four rounds actually taught us

These are the load-bearing architectural properties of the SVD path that need explicit replacement in any NoSVD design:

  1. Unitary U and V^T bound |M_hat|. In the canonical, |M_hat|_F = |S|_2 because U and V^T are orthogonal. Any learned inverse must be bounded by construction (sphere-norm is one way; RMS-norm with learned gain is another; but both fight against magnitude reconstruction).

  2. S magnitude is proportional to input magnitude. This is the property that lets the canonical handle heavy-tailed noise. Sphere-norming M kills magnitude per-row, but S recovers the per-matrix magnitude as the singular values. Any learned readout that normalizes loses this.

  3. The SVD factorization is exact and input-agnostic. Sphere-normed points on S^(D-1) always admit a unique SVD. The learned readout is not input-agnostic; it must learn to read, and what it learns to read from Cauchy-driven matrices is not the same as what it learns from Gaussian-driven matrices.

  4. Gradient-flowing CM is a partial replacement for #3 (input-agnostic geometric structure), but it has to apply everywhere downstream Linear operations carry omega information. A single bottleneck Linear with CM discipline is not enough; the whole inverse/decoder pathway needs geometric control.
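Property #1 can be verified numerically in a few lines (sizes assumed, as elsewhere):

```python
import torch

# With orthogonal U and Vt, ||M_hat||_F equals ||S||_2, so the singular
# values alone bound the reconstruction magnitude.
M = torch.nn.functional.normalize(torch.randn(64, 16), dim=-1)  # sphere-normed rows
U, S, Vt = torch.linalg.svd(M, full_matrices=False)
M_hat = U @ torch.diag(S) @ Vt
assert torch.allclose(M_hat.norm(), S.norm(), atol=1e-5)  # Frobenius norm == 2-norm of S
assert torch.allclose(M_hat, M, atol=1e-5)                # and the factorization is exact
```

A learned inverse_readout has no such identity: its output norm is whatever training makes it, which is why Rounds 1–3 each had to bolt the bound on by hand.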

What's next — proper research direction

The ablation as built is not the right experiment. It's "SVAE minus SVD," which treats the SVD as a swappable component in an architecture designed around it. That's the wrong framing.

The right framing: if you don't have the SVD's factorization, you have an autoencoder. Autoencoders have their own stability toolkit — KL-divergence regularization, explicit bottleneck embedding, reparameterization tricks — and you should use it.

A serious NoSVD successor should include:

Proper VAE machinery. Not "replace SVD with Linear, keep SVAE shape." Rebuild as a VAE with:

  • μ, logσ = encoder(patch) → (D,), (D,) — explicit learned distribution
  • z = μ + σ · ε — reparameterized sample
  • KL regularization D_KL(q(z|x) || N(0,I)) — standard VAE discipline
  • Decoder from z back to patch via Linear(D → V·D) → MLP decoder
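The four bullets above sketch as a minimal VAE core (patch_dim=1024 = V·D with V=64, D=16 assumed; the hidden width and KL weight are illustrative choices, not prescriptions):

```python
import torch
import torch.nn.functional as F

class PatchVAE(torch.nn.Module):
    def __init__(self, patch_dim=1024, hidden=256, d=16):
        super().__init__()
        self.enc = torch.nn.Sequential(torch.nn.Linear(patch_dim, hidden), torch.nn.GELU())
        self.mu = torch.nn.Linear(hidden, d)        # explicit learned mean
        self.logvar = torch.nn.Linear(hidden, d)    # explicit learned log-variance
        self.dec = torch.nn.Sequential(torch.nn.Linear(d, hidden), torch.nn.GELU(),
                                       torch.nn.Linear(hidden, patch_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterized sample
        x_hat = self.dec(z)                                      # Linear(d -> ...) -> MLP decoder
        # D_KL(q(z|x) || N(0, I)), closed form for diagonal Gaussians
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return x_hat, kl

vae = PatchVAE()
x = torch.randn(4, 1024)
x_hat, kl = vae(x)
loss = F.mse_loss(x_hat, x) + 1e-3 * kl   # recon + KL discipline (weight illustrative)
```

Stability here comes from the KL term pulling q(z|x) toward N(0, I), not from any implicit SVD bound — the standard autoencoder toolkit the section argues for.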

The omega tokens here are z samples or μ values — learned latents, not spectral coordinates. Different object, different claims, honest framing.

Bottleneck embedding with capacity. The ablation's Linear(V·D → D) is a 1024→16 projection with no intermediate substrate. A proper bottleneck would use Linear → GELU → Linear with a hidden dimension that lets the MLP learn a meaningful projection. This is standard VAE practice; the SVAE didn't need it because sphere-norm + SVD already provided the projection discipline.
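A sketch of that bottleneck under an assumed hidden width of 256:

```python
import torch

# Two-layer bottleneck with an intermediate substrate, replacing the ablation's
# bare 1024 -> 16 projection (hidden width 256 is an assumed choice).
bottleneck = torch.nn.Sequential(
    torch.nn.Linear(1024, 256),   # intermediate substrate
    torch.nn.GELU(),
    torch.nn.Linear(256, 16),     # down to the D=16 omega coordinates
)
omega = bottleneck(torch.randn(8, 1024))
```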

Per-sector Cayley-Menger discipline. If the goal is to make every Linear in the omega pathway produce geometrically-disciplined outputs, CM loss must be applied at every stage, not just at the encoder output. This is feasible but serious engineering — it's a new architectural idea, not a drop-in.

Independent of SVAE naming/structure. The result is not "PatchSVAE without SVD." It's a new VAE family that uses geometric discipline as a regularizer. Name it something else. Compare it to SVAE as peers, not as child-of-parent.

Estimated effort: a focused week for a first working prototype, longer for proper characterization. Shelved here pending that dedicated time.

Files

  • johanna_F_nosvd_trainer.py — final state of the ablation trainer after four debug rounds. Standalone (no imports from canonical F-class trainer). Independent HF repo configured: AbstractPhil/geolip-svae-nosvd-ablation.

What to read if resuming

  1. This document. Start here.
  2. The parent geolip-svae README for architectural context on what you're replacing.
  3. The F-class batteries README for the framework the ablation was meant to validate against.
  4. The omega tokens blog post for the "self-solving" framing that motivated the ablation in the first place.

Do not resume this as "finish debugging the Linear readout." Resume as "design the proper VAE successor." The four rounds of debugging already told you why the direct replacement doesn't work.
