Double Slit Experiment and Bayes

Posted by Yuling Yao on Nov 12, 2019. Tag: modeling

Is there anything we (statisticians) can still learn from Quantum Physics under a classic probability framework?

I have heard Andrew talk about the double slit experiment several times (here and here and here). The screen image when both slits are open is not the linear average of the two individual one-slit-open results. It is known that the waves from the two slits cancel each other out in some locations and reinforce each other in others. However, if we let $y$ denote the location of the electron image on the screen and $x$ the index of the slit, it still seems unnatural that

\[p(y ) \neq p(x=1) p(y \vert x=1 )+ p(x=0) p(y \vert x=0).\]

To be clear, it is not even a paradox. Nothing is shocking if we accept the fact that we (Bayesian statisticians) are only dealing with the special case of a degenerate probability wave that has real coefficients. Nevertheless, is there anything we (statisticians) can still learn from Quantum Physics under a classic probability framework?

Answer 0: No, there is no joint model

The short answer is that Bayes' rule holds if we have a joint distribution of $(y,x)$, from which a version of the conditional probability $p(y\vert x)$ is defined. Without such a joint distribution, $y \vert x$ is just an ill-defined object.

But then we have the cliché that all models $p(y, \theta)$ are wrong, and yet we still have to, and are able to, do Bayesian inference. What is the flaw in this particular example?

Answer 1: No, they are different measurement conditions

One solution is to model the measurement condition: $z=2$ if there are two slits open and $z=1$ if there is only one. It is obvious that

\[p(y \vert z=2 ) \neq p(x=1) p(y \vert x=1, z=1 )+ p(x=0) p(y \vert x=0, z=1).\]

It is fine if we stop here. Nevertheless, why do we have to model $z$ differently in this specific identity, while it is otherwise mostly fine to apply the Bayes formula to, say, balls and urns?

Answer 2: And no, we can only evaluate the marginal distribution, and they are not independent

A more micro-level explanation comes from the fact that what we can observe, after all, is the empirical distribution of $y$ rather than the population $p(y)$. Imagine there are countably many light particles $y_1, \dots, y_n$, and $p(y_i)$ is what would happen if there were only a single particle.

When $y_1, \dots, y_n$ are independent we know $p(y_1, \dots, y_n ) = \prod_{i} p(y_i)$. However, they are not. There is literally interference, and the dependency does not vanish. There is no Law of Large Numbers, no Central Limit Theorem, no Lindeberg-Feller, and it is expected that the empirical marginal distribution $\frac{1}{n} \sum_{i=1}^{n} \delta_{y_i}$ does not converge to $p(y_i)$.
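To make concrete how dependence alone can stop the empirical distribution from converging to the marginal, here is a minimal sketch. It is a toy sequence, not a model of photons: the function names, the seed, and the "one shared coin" construction are all assumptions made purely for illustration. Both sequences have the same marginal $p(y_i=1)=1/2$, but only the independent one lets the empirical frequency approach it.

```python
import random


def sample_independent(n, rng):
    """n independent fair coin flips: the empirical frequency obeys the LLN."""
    return [1 if rng.random() < 0.5 else 0 for _ in range(n)]


def sample_exchangeable(n, rng):
    """One shared fair coin copied n times: exchangeable, but fully dependent."""
    shared = 1 if rng.random() < 0.5 else 0
    return [shared] * n


def empirical_frequency(ys):
    """Fraction of ones, i.e. the empirical estimate of p(y_i = 1)."""
    return sum(ys) / len(ys)


rng = random.Random(2019)  # fixed seed (arbitrary) so the sketch is reproducible
n = 10_000
freq_independent = empirical_frequency(sample_independent(n, rng))
freq_exchangeable = empirical_frequency(sample_exchangeable(n, rng))

# In both cases the marginal is p(y_i = 1) = 1/2.
print("independent :", freq_independent)   # close to 0.5
print("exchangeable:", freq_exchangeable)  # either 0.0 or 1.0, never near 0.5
```

The second sequence is invariant under permutation, so it is exchangeable, yet no amount of data brings its empirical frequency to the marginal value.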

Answer 3: Wait, how about conditional independence?

But that is still not the end of the story. In particular, if there is only one slit open, we know particles are conditionally independent:

\[p(y_1, \dots, y_n \vert x=k ) = \prod_{i} p(y_i \vert x=k ), k=1,0.\]

It is easy to verify, for example, by looking at the one-slit-open-image using lights of the same phase but different illuminances.

Actually, for the sake of the LLN, we probably only require conditional exchangeability (the law of $y_{B(1)}, \dots, y_{B(n)} \vert x=k$ is invariant under any permutation $B$), which is apparently true.

Such conditional independence is enough to derive the marginal law as an average of two terms:

\[p(y_1, \dots, y_n ) = p(x=1) \prod_{i} p(y_i \vert x=1 )+ p(x=0)\prod_{i} p(y_i \vert x=0 ),\]

and the LLN will kick in and imply that the empirical distribution of $y$ = $1/2($empirical distribution $\vert x=1$ + empirical distribution $\vert x=0)$. But it is still wrong!

To resolve this, it is important to notice that by using a universal indicator $x$ in writing $p(x=1)$, we have already assumed some independence. What we can observe at the micro level is $x_i$, the index of the actual slit that particle $i$ passes through.

Assume for now that the particle-level identity still holds (it does not):

\[p(y_i ) = p(x_i=1) p(y_i \vert x_i=1 )+ p(x_i=0) p(y_i \vert x_i=0).\]

The problem is that the conditional factorization does not hold in general,

\[p(y_1, \dots, y_n \vert x_1, \dots, x_n) \neq \prod_{i} p(y_i \vert x_1,\dots, x_n ),\]

except in the degenerate cases ($x_1=\dots =x_n$):

\[p(y_1, \dots, y_n \vert x_1=\dots =x_n = k) = \prod_{i} p(y_i \vert x_1=\dots =x_n = k), \quad k=1, 0.\]

The correct way to derive the Bayes rule reads:

\[p(y_1, \dots, y_n ) = \sum_{k_1, \dots, k_n=0,1} (1/2)^n p(y_1, \dots, y_n \vert x_1=k_1,\dots, x_n=k_n)\]

Note that I did not write $ \sum_{k_1, \dots, k_n=0,1} (1/2)^n \prod_{i} p(y_i \vert x_1=k_1,\dots, x_n=k_n)$, which would be equivalent to the classic view. Here all the interference is encoded in the joint term $p(y_1, \dots, y_n \vert x_1=k_1,\dots, x_n=k_n)$.
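A short worked step may help show why the product form collapses to the classic view. It relies on one extra assumption that is not stated above: each $y_i$ depends only on its own label, so that $p(y_i \vert x_1=k_1,\dots, x_n=k_n) = p(y_i \vert x_i=k_i)$. Under that assumption the sum over label configurations factorizes term by term:

\[\sum_{k_1, \dots, k_n=0,1} (1/2)^n \prod_{i} p(y_i \vert x_i=k_i) = \prod_{i} \left[ \tfrac{1}{2}\, p(y_i \vert x_i=1) + \tfrac{1}{2}\, p(y_i \vert x_i=0) \right],\]

which is just $n$ independent draws from the two-component mixture, so no interference can survive in it.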

Answer 4: Wait again, a single photon can interfere too

The argument in the last section seems coherent, except it is wrong.

Wave/particle duality is still observed in the double slit experiment with a camera sensitive to individual photons:

\[p(y_i) \neq p(x_i=1) p(y_i \vert x_i=1 )+ p(x_i=0) p(y_i \vert x_i=0).\]

even with a particle-specific indicator $x_i$. Indeed, it was Dirac who said “Each photon then interferes only with itself. Interference between different photons never occurs”.

(Historically, the double-slit experiment was developed much earlier than QM existed. My understanding is that the experiment per se can also be explained purely by between-particle interference (the previous section). In modern QM, the particles are indeed independent, $\Phi(y_1, y_2)=\Phi(y_1)\Phi(y_2)$, and the interference is encoded in the non-linear relation between $p$ and $\Phi$. Nevertheless, the last section still serves as a reminder of the difference between exchangeability and independence. We will return to this view later.)

The single-photon interference is more profound than Young’s original experiment. According to the Ensemble Interpretation, which is attributed to Born's statistical interpretation, it makes no sense to think about a single system $p(y_i)$ in the first place. As a consequence, the probabilities in quantum mechanical predictions require large-sample repetitions, e.g. repeated trials with one single photon each time.

So basically I can stop here, ban the use of $p(y_i)$, and go back to Answer 3. However, we should be careful: a quantum interpretation is meant to explain QM, not to explain probability or statistics. Interestingly, I notice there is a so-called Bayesianism interpretation of quantum mechanics that only defines probability through the agent’s subjective degrees of belief. Sure, I should not deny that I know nothing about quantum interpretation, but my standpoint here is that we just treat probability as a mathematical object. It does not even matter if the probability comes from a subjective belief, a frequency with internal replication, or anything else. It is the same attitude as in Andrew’s earlier blog:

Probability is a mathematical concept. To define it based on any imperfect real-world counterpart (such as betting or long-run frequency) makes about as much sense as defining a line in Euclidean space as the edge of a perfectly straight piece of metal, or as the space occupied by a very thin thread that is pulled taut. Ultimately, a line is a line, and probabilities are mathematical objects that follow Kolmogorov’s laws. Real-world models are important for the application of probability, and it makes a lot of sense to me that such an important concept has many different real-world analogies, none of which are perfect.

So, again, what is wrong with $p(y_i)$?

Superposition is NOT a mixture

Superposition is a pure state. There is simply no epistemic uncertainty! As a comparison, a mixture model implies a particle is in one of a few configurations, and it is we who do not know which configuration it belongs to.

Given the wave functions $\Phi_1$ and $\Phi_2$ that describe the system when only slit 1 or only slit 2 is open, the superposition is simply the sum of the wave functions

\[\Phi=1/\sqrt 2 (\Phi_1 + \Phi_2)\]

Also, $p_k(y)= \vert\Phi_k\vert^2$ describes the distribution when only slit $k$ is open.

Then the marginal probability of $y$ when both slits are open is

\[p(y)= 1/2 \vert \Phi_1 + \Phi_2 \vert^2= 1/2 [p_1(y)+ p_2(y)] + \mathrm{Re}(\Phi_1^* \Phi_2)\]

On the contrary, if we do observe and record the slit through which each particle went, then

\[p(y)= 1/2 [\vert \Phi_1 \vert^2 + \vert\Phi_2 \vert^2] = 1/2 [p_1(y)+ p_2(y)]\]

becomes a mixture.
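The difference between the two formulas can be checked numerically. The sketch below is only a toy: the amplitudes `phi(k, y)` (a shared Gaussian envelope with opposite phase tilts), the constants `SIGMA` and `KAPPA`, and all function names are assumptions chosen for illustration, not a physical model of any real apparatus. These toy amplitudes are not exactly orthogonal, so the superposition is only approximately normalized; the check is on the pointwise identity.

```python
import cmath
import math

SIGMA = 1.0   # width of the single-slit envelope (assumed value)
KAPPA = 10.0  # fringe frequency, standing in for slit separation / wavelength (assumed)


def envelope(y):
    """Square root of a N(0, SIGMA^2) density, so |phi_k|^2 integrates to 1."""
    density = math.exp(-0.5 * (y / SIGMA) ** 2) / (SIGMA * math.sqrt(2 * math.pi))
    return math.sqrt(density)


def phi(k, y):
    """Toy complex amplitude for slit k (1 or 2): same envelope, opposite phase tilt."""
    sign = 1.0 if k == 1 else -1.0
    return envelope(y) * cmath.exp(1j * sign * KAPPA * y / 2)


def p_mixture(y):
    """Which-slit information recorded: 1/2 (|phi_1|^2 + |phi_2|^2)."""
    return 0.5 * (abs(phi(1, y)) ** 2 + abs(phi(2, y)) ** 2)


def p_superposition(y):
    """Both slits open, nothing recorded: 1/2 |phi_1 + phi_2|^2."""
    return 0.5 * abs(phi(1, y) + phi(2, y)) ** 2


def cross_term(y):
    """Interference term Re(conj(phi_1) * phi_2)."""
    return (phi(1, y).conjugate() * phi(2, y)).real


bright = 0.0              # phases agree: constructive interference
dark = math.pi / KAPPA    # phases opposite: destructive interference
for y in (bright, dark):
    print(f"y={y:.3f}  mixture={p_mixture(y):.4f}  "
          f"superposition={p_superposition(y):.4f}  cross={cross_term(y):.4f}")
```

At the dark fringe the superposition density vanishes while the mixture stays strictly positive, which no choice of mixture weights can reproduce.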

Just as what logit is to log

If we are still obsessed with mixture models, we can also go back to Answer 1, which simply says there is no direct relation between $p(y \vert z=2)$ (superposition) and $p(y \vert z=1)$ (mixture).

In part, the confusion comes from the fact that the Schrodinger equation is linear in the wave function $\Phi$, not in the probability $P$, while the mixture model is linear in the probability $P$, or more generally in the expectation. We should not use a mixture model in the first place, and therefore it is misleading to write the conditional law $p(y\vert x=1)$. Asking why we cannot simply add the single-slit screens linearly is like asking why the logit does not behave under differentiation the way the log does.

And that is fine: we can deal with non-linear models (cf. linear mixture), and they can evolve in whatever complicated ways. For example, we are never amazed by the fact that $X+Y$ (convolution) has a marginal distribution other than $p(X)+p(Y)$. Another example is subsampling: with a partition of the data $D=(D_1,\dots,D_K)$ and a flat prior, it is obvious that the posterior is a log-linear mixture:

\[p(\theta \vert D) \propto \prod_{k=1}^{K} p(\theta \vert D_k)\]

although I guess I could also complicate it by writing something wrong like $ p(\theta \vert y \in D)= \sum_k p(D_k) p(\theta \vert y\in D_k)$ for the very purpose of creating extra confusion and mystification.
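The product rule for sub-posteriors is easy to verify in a conjugate toy case. The sketch below assumes a Gaussian likelihood with known standard deviation `SIGMA`, a flat prior on the mean, and a made-up data set split into three non-overlapping blocks; none of these specifics come from the text above. In that setting each sub-posterior is Gaussian, and the normalized product of Gaussians is again Gaussian with summed precisions.

```python
SIGMA = 1.0  # known observation standard deviation (assumed)


def flat_prior_posterior(data):
    """(mean, variance) of the posterior of mu under a flat prior and known SIGMA."""
    n = len(data)
    return sum(data) / n, SIGMA ** 2 / n


def multiply_gaussians(components):
    """(mean, variance) of the normalized product of Gaussian densities."""
    precisions = [1.0 / var for _, var in components]
    total_precision = sum(precisions)
    mean = sum(m * p for (m, _), p in zip(components, precisions)) / total_precision
    return mean, 1.0 / total_precision


data = [0.3, -1.2, 0.8, 2.1, 0.4, -0.5]          # illustrative numbers only
partitions = [data[:2], data[2:5], data[5:]]      # non-overlapping blocks D_1, D_2, D_3

full_posterior = flat_prior_posterior(data)
combined = multiply_gaussians([flat_prior_posterior(d) for d in partitions])

print("posterior from all data      :", full_posterior)
print("product of the sub-posteriors:", combined)
```

The two printed pairs agree up to floating-point error, whereas a linear mixture of the three sub-posteriors would be multimodal in general and much wider.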

In short, the superposition is just a way to combine several distributions (in the usual probability theory sense). Namely, given two distributions $p_1=\vert \Phi_1 \vert^2$ and $p_2=\vert \Phi_2 \vert^2$, we have an aggregation (of $p_1$ and $p_2$): $p= 1/2 \vert \Phi_1+ \Phi_2 \vert^2$. Sure, it is not fully specified by $p_1$ and $p_2$, but in the same way a convolution is not fully determined by the marginal laws. Functionally, the superposition is not particularly fancier than a mixture, a log-linear mixture, or a convolution. Hypothetically, if there were some other physics equation that kept the convolution invariant, there would presumably be another probability theory to accommodate it too.

(Let me take a step back. I am not trying to undermine the value of wave function and quantum probability, it is intrinsic to QM and makes all the calculations easier. So yes, it is fancier.)

What does it imply for statistical modeling

To be clear, I am not trying to reinvent quantum physics. The goal is to motivate statistical modeling. Also it is clear that the Schrodinger equation is not the answer to the universe; it cannot make Stan faster, for example. The best I can do is to draw some hopefully useful analogies.

One useful analogy is how the measurement will change observations (some survey questions are self-allured and may therefore change the population). On the other hand, the relation between a survey response and the wave function collapse is probably no more rigorous than that between Soros’ reflexivity and his Quantum Fund. There is no reason to believe that a gentle glance from a YouGov surveyor will collapse the whole universe, nor that my presidential election tendency has to obey the Schrodinger equation. The measurement error has to be modeled anyway. As written by Andrew and Michael Betancourt in their discussion:

We propose that the best way to use ideas of quantum uncertainty in applied statistics (in psychometrics and elsewhere) is not by directly using the complex probability-amplitude formulation proposed in the article, but rather by considering marginal probabilities that need not be averages over conditionals.

I will not repeat this direction. I will focus on another analogy: model aggregation.

Model aggregation

Given observations $y_1, \dots, y_n$ and models $M_k$ $(k=1, \dots, K)$, we could always first fit these models separately, which leads to $K$ predictive densities for new data $\tilde y$: $p(\tilde y \vert M_k)= \int p(\tilde y \vert \theta_k, M_k) p(\theta_k \vert y, M_k)d \theta_k$.

It seems the most natural approach to aggregating these models is Bayesian model averaging (BMA):

\[p(M_k\vert y )\propto p(y \vert M_k)p(M_k)\]

and the aggregated predictive density is simply a mixture:

\[p(\tilde y \vert y)= \sum_{k=1}^{K} p(\tilde y \vert M_k) p(M_k\vert y )\]

If there is a single lesson to be learned from this post, it is how sloppy notation results in a misleading model in which all the inference gets stuck. Just as $p(y\vert x)$ implies a wrong mixture model, by writing $p(y_{1, \dots, n}\vert M_k)$ we have already assumed there is a single model $M_k$ that generates all the data.

In light of Answers 2 and 3, a more flexible notation is $p(y_i \vert M_k)$, which implies that the $i$-th observation is randomly generated from a (random) model with certain probabilities. In particular, if we know that each data point is generated from model $k$ with constant probability $w_k$, then we can prove (I am writing a companion paper on that) that the stacking weight will converge to the true weight

\[w_k = P_Y(y \in M_k )\]

where the probability is over $y$. In terms of the distinction in Answer 2 and 3, stacking can be viewed as a point-wise BMA, or should I say quantum BMA?
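The contrast between BMA and stacking can be seen in a small simulation. This is a sketch under strong simplifying assumptions, none of which come from the text: the two candidate models are fixed densities $N(-2,1)$ and $N(2,1)$ with no free parameters (so the leave-one-out predictive density used by stacking coincides with the model density itself), the prior over models is uniform, and each data point comes from model 1 with probability `TRUE_W1 = 0.7`. The stacking weight is found by a simple fixed-point (EM-style) update that maximizes the log score of the mixture.

```python
import math
import random

MU = {1: -2.0, 2: 2.0}  # model M_k predicts N(MU[k], 1); assumed, no free parameters
TRUE_W1 = 0.7           # assumed probability that a data point is generated by M_1


def log_density(k, y):
    """Log predictive density of y under model M_k."""
    return -0.5 * (y - MU[k]) ** 2 - 0.5 * math.log(2 * math.pi)


def simulate(n, rng):
    """Each point independently picks M_1 with probability TRUE_W1, else M_2."""
    return [rng.gauss(MU[1] if rng.random() < TRUE_W1 else MU[2], 1.0)
            for _ in range(n)]


def bma_weight_m1(ys):
    """p(M_1 | y) with equal prior model probabilities (numerically stable sigmoid)."""
    diff = sum(log_density(1, y) - log_density(2, y) for y in ys)
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    return math.exp(diff) / (1.0 + math.exp(diff))


def stacking_weight_m1(ys, iterations=200):
    """Maximize sum_i log(w p_1(y_i) + (1 - w) p_2(y_i)) over w in [0, 1]."""
    p1 = [math.exp(log_density(1, y)) for y in ys]
    p2 = [math.exp(log_density(2, y)) for y in ys]
    w = 0.5
    for _ in range(iterations):
        w = sum(w * a / (w * a + (1 - w) * b) for a, b in zip(p1, p2)) / len(ys)
    return w


rng = random.Random(2019)  # fixed seed (arbitrary) for reproducibility
ys = simulate(2000, rng)
w_bma = bma_weight_m1(ys)
w_stack = stacking_weight_m1(ys)
print("BMA weight on M_1     :", w_bma)
print("stacking weight on M_1:", w_stack)
```

BMA treats the whole data set as coming from one model and so puts essentially all its weight on $M_1$, while the point-wise criterion recovers a weight near the assumed generating proportion.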

Of course, stacking is still restricted to the mixture form. But it can, in principle, be extended to other aggregation forms such as convolution (each model is a linear regression with a single predictor and the predictors are highly collinear) or product (each model is fit to a subsample from non-overlapping partitions). I probably will not be shocked if anyone aggregates two predictive distributions using superposition.

Indeed, the most sophisticated aggregation cannot be a mixture. When we continuously combine two models (i.e., the joint model is a general form of each individual model), the resulting predictive density is not a linear average of the two individual ones. A double-slit experiment of time series data $y$ probably reads

\[p(y\vert ARMA)\neq p_1 p(y\vert AR) + p_2 p(y\vert MA)\]

when we consider both the autoregressive (AR) and moving average (MA) models. It is indeed even more complicated than the physical double slits as we cannot even express anything like

\[\vert \mathrm{ARMA}\rangle = \vert \mathrm{AR}\rangle + \vert \mathrm{MA} \rangle\]

in a closed form! We probably have to refit the model (in Stan).

Notice that in stacking, we do not infer which observed point belongs to which model. If we know the label, it simply becomes a multilevel model. If we are trying to infer/reconstruct the label, we get a mixture of mixtures. Not surprisingly, fitting a mixture model (and inferring which data come from which model) results in a different predictive distribution than fitting two models separately and mixing them back.

Here is a summary:

| model aggregation | analogy in double slits |
| --- | --- |
| the flaw of BMA (cf. stacking) in the $\mathcal{M}$-open case | the flaw of the conditional probability in the double-slit experiment (Answers 2 and 3) |
| the difference between linear mixture and continuous model expansion and boosting | the difference between mixture and superposition (Answer 4) |
| the difference between mixture model and model ensemble | collapse during observations |