https://snsphd.online/chapter_04/section_05_results/#photon-...
Easy conversion into a distance metric (the square root of the JSD is a true metric) is hugely valuable for making the divergence amenable to KNN-based dimensionality reduction algos (and, I'm sure, other things I don't understand as a non-mathematician).
Here's a library from the creator of UMAP (UMAP being a workhorse of dimensionality reduction algos) for doing approximate nearest neighbor search: https://pynndescent.readthedocs.io/en/latest/api.html#pynnde...
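To make the distance-metric point concrete, here's a minimal sketch using SciPy rather than pynndescent. `scipy.spatial.distance.jensenshannon` returns the Jensen-Shannon distance, i.e. the square root of the divergence. The random histograms and the helper `jsd_bits` are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Three made-up discrete distributions (rows sum to 1, all entries positive).
rng = np.random.default_rng(0)
p, q, r = rng.dirichlet(np.ones(5), size=3)

def jsd_bits(a, b):
    """Jensen-Shannon divergence in bits, written out by hand (hypothetical helper)."""
    m = 0.5 * (a + b)
    kl = lambda u, v: np.sum(u * np.log2(u / v))
    return 0.5 * kl(a, m) + 0.5 * kl(b, m)

# SciPy returns the *distance*; squaring it recovers the divergence.
d_pq = jensenshannon(p, q, base=2)
d_qr = jensenshannon(q, r, base=2)
d_pr = jensenshannon(p, r, base=2)

print("distance^2 :", d_pq ** 2)
print("divergence :", jsd_bits(p, q))
print("triangle inequality holds:", d_pr <= d_pq + d_qr)
```

The triangle inequality is what nearest-neighbor structures rely on; how a custom metric gets passed to a given ANN library is something to check in that library's docs.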
I don’t think GANs are used much now compared to diffusion models, but as recently as a few years ago they were the standard way to make fake data, a la “This Person Does Not Exist”.
https://lospino.so/statistics/jensen-shannon-divergence/
Feedback welcome both from initiates (on helpfulness) and experts (on correctness)!
I was just reading about JSD the other day after reading about KL divergence...seems like a nifty measurement device for things like sim-to-real evaluations in robots (the reason I was going down this rabbit hole.)
I think the appeal over raw KL is that JSD behaves a bit nicer when the simulated and real distributions don't perfectly overlap...which is basically always true in the real world!
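A quick sketch of the non-overlap point, with made-up "sim" and "real" histograms that share no support. `scipy.stats.entropy(p, q)` computes KL(p || q), and squaring `scipy.spatial.distance.jensenshannon` gives the JSD.

```python
import numpy as np
from scipy.stats import entropy
from scipy.spatial.distance import jensenshannon

# Hypothetical histograms over the same 4 bins with disjoint support.
sim = np.array([0.5, 0.5, 0.0, 0.0])
real = np.array([0.0, 0.0, 0.5, 0.5])

kl_bits = entropy(sim, real, base=2)               # KL(sim || real)
jsd_bits = jensenshannon(sim, real, base=2) ** 2   # JSD(sim, real)

print("KL :", kl_bits)    # blows up: real assigns zero mass where sim has mass
print("JSD:", jsd_bits)   # stays finite, capped at 1 bit

# A partially overlapping pair: KL is still infinite, JSD drops below 1 bit.
real_overlap = np.array([0.0, 0.25, 0.25, 0.5])
kl_overlap = entropy(sim, real_overlap, base=2)
jsd_overlap = jensenshannon(sim, real_overlap, base=2) ** 2
print("KL (partial overlap) :", kl_overlap)
print("JSD (partial overlap):", jsd_overlap)
```

So with KL, a single empty bin on the wrong side wipes out the signal, while JSD still tells you how far apart the two are.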
Been knee-deep trying to understand this world, so seeing this on Hacker News today is kind of scary.
Calculating the JSD could be more difficult, because the expression uses a mixture of the 'true' and 'fitted' distributions. You can still simulate this, but half the time you'd be fitting the model to itself, and I just don't see why that would be useful.
I think the JSD is most useful when you need an actual metric, but as long as you have a fitted and target distribution the KL divergence is a natural fit since you can interpret the result as information loss.
In reinforcement learning, what we usually want is to find the optimal action, i.e. the action that maximizes the reward. This translates to so-called "mode-seeking" optimization, which is the reverse KL.
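For anyone who wants to see "mode-seeking" in action, here's a toy sketch, not an RL setup: fit a single Gaussian to a two-peaked target by brute-force search, once under forward KL(target || fit) and once under reverse KL(fit || target). The grid, the target, and the candidate ranges are all arbitrary choices made for illustration.

```python
import numpy as np

x = np.linspace(-6.0, 6.0, 241)  # evaluation grid

def gaussian_on_grid(mu, sigma):
    """Discretized Gaussian, normalized to sum to 1 on the grid."""
    w = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return w / w.sum()

def kl(a, b):
    """KL(a || b) in nats for strictly positive discrete distributions."""
    return np.sum(a * np.log(a / b))

# Bimodal target: equal-weight peaks at -2 and +2.
target = 0.5 * gaussian_on_grid(-2.0, 0.5) + 0.5 * gaussian_on_grid(2.0, 0.5)

# Candidate single-Gaussian fits.
candidates = [(mu, sigma)
              for mu in np.linspace(-3.0, 3.0, 61)
              for sigma in np.linspace(0.3, 3.0, 28)]

# Forward KL penalizes missing any target mass, so the fit spreads over both peaks.
forward_best = min(candidates, key=lambda c: kl(target, gaussian_on_grid(*c)))
# Reverse KL penalizes putting mass where the target has none, so the fit picks one peak.
reverse_best = min(candidates, key=lambda c: kl(gaussian_on_grid(*c), target))

print("forward KL fit (mu, sigma):", forward_best)
print("reverse KL fit (mu, sigma):", reverse_best)
```

The forward fit lands between the peaks with a wide sigma (mass-covering), while the reverse fit sits on one peak with a narrow sigma (mode-seeking).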
One way to interpret JSD(P, Q): Associate the distributions P and Q with two target classes, respectively. Pick a target class based on a fair coin flip. Then sample either from distribution P or distribution Q, depending on the outcome of the coin flip. The JSD is the mutual information between the resulting sample (which follows the mixture distribution) and the target class.
Alternative intuition: Suppose we want to measure the dependence between a feature X and a binary target class Y. We have a tabular data set with two columns X and Y, whose rows correspond to individual samples. JSD is the mutual information between the feature X and the target class Y, but after we resample our data (rows) to ensure that we have a balanced representation of the target class Y. If we measure the JSD in bits, the quantity 2^(JSD-1) is the geometric-mean probability that the best predictor based on X assigns to the correct class Y, assuming balanced classes. It is 0.5 when X is useless and 1 when X determines Y, so it behaves like, but is not exactly, the fraction of times X correctly predicts Y.
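Both intuitions can be checked numerically. This is a small sketch with two made-up distributions P and Q over three feature values. It builds the balanced joint table P(X, Y), compares its mutual information with the JSD, and compares 2^(JSD-1) with two candidate readings of "how often X predicts Y". All helper names are hypothetical.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits, skipping zero entries."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def jsd_bits(p, q):
    """JSD as entropy of the mixture minus the mean entropy of the parts."""
    m = 0.5 * (p + q)
    return entropy_bits(m) - 0.5 * (entropy_bits(p) + entropy_bits(q))

def mutual_information_bits(joint):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for a joint probability table."""
    return (entropy_bits(joint.sum(axis=1))
            + entropy_bits(joint.sum(axis=0))
            - entropy_bits(joint.ravel()))

# Made-up class-conditional distributions of the feature X.
p = np.array([0.7, 0.2, 0.1])   # X | Y = 0
q = np.array([0.1, 0.3, 0.6])   # X | Y = 1

# Fair coin flip over classes: P(X=x, Y=0) = p(x)/2, P(X=x, Y=1) = q(x)/2.
joint = 0.5 * np.stack([p, q], axis=1)

jsd = jsd_bits(p, q)
mi = mutual_information_bits(joint)

# Posterior P(Y | X), then two summaries of predictive quality.
posterior = joint / joint.sum(axis=1, keepdims=True)
geo_mean_posterior = 2 ** np.sum(joint * np.log2(posterior))  # 2^(-H(Y|X))
bayes_accuracy = joint.max(axis=1).sum()                      # always guess the likelier class

print("JSD (bits)          :", jsd)
print("I(X; Y) (bits)      :", mi)
print("2^(JSD - 1)         :", 2 ** (jsd - 1))
print("geo-mean posterior  :", geo_mean_posterior)
print("Bayes accuracy      :", bayes_accuracy)
```

In this example the JSD and the mutual information agree, and 2^(JSD-1) matches the geometric mean of the posterior on the true class. The accuracy of always guessing the likelier class is a different number, so treat 2^(JSD-1) as an accuracy-like score rather than a literal hit rate.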
In practice, which divergence you use doesn't seem to be very important. The KL is the one with the strongest theoretical foundation though, i.e. it will work given infinite data. The important aspect seems to be that neural networks are Lipschitz-bounded, and that this is the most important constraint preventing collapse.
The real magic in generative modeling comes from the post-training process that follows, which usually (e.g., RLHF) approximates reverse KL (given limited capacity, try to perfectly cover what you can, but it's fine to drop the rest entirely). This gives amazing results, but is also the cause of AI oddities like the "AI Image Pixar Look", many of the verbal tics of LLMs, and all AI music using the same small set of voices. Jensen-Shannon divergence sits right in the middle of forward and reverse KL and is what many GANs are claimed to approximate. Ideally, it is a better trade-off between diversity and fidelity.