Archive for the ‘Probability’ Category

Q: Will we ever overcome the Heisenberg uncertainty principle?

Sunday, August 1st, 2010

Physicist: Nopers!

The Heisenberg uncertainty principle, while normally presented in physics circles, is actually a mathematical absolute.  So overcoming the uncertainty principle is exactly as hard as overcoming that whole “1+1=2″ thing.

The uncertainty principle (the “position/momentum uncertainty principle”) is generally presented like this: you have some particle floating along and you’d like to know its position and its momentum (velocity, whatever) so you bounce a photon off of it (“Bounce a photon off of it” is just another way of saying “try to see it”).  A general rule of thumb for light (waves in general really) is high frequency waves propagate in straight lines, and low frequency waves spread out.  That’s why sunlight (high frequency) seems to go in a perfectly straight line, but radio waves can spread out around corners.  For example, you can still pick up a radio station even when you can’t see it directly.

So, if you want to see where something is with precision you’ll need to use a high frequency photon.  After all, how can you trust the results from a wandering, low frequency photon?  But, if you use a precise, high-frequency, and thus, high-energy photon, you’ll end up smacking the hell out of the particle you’re trying to measure.  So you’ll know where it is pretty exactly, but it’ll go flying off with some unknown amount of momentum.  Any method you can come up with to measure the momentum will require you to use low-frequency, low-energy, gentle photons.  But then you won’t be able to figure out the particle’s position very well.

Low frequency photons (like radio waves) don't tell you much about where a particle is, but they doesn't knock it around much either (so you can measure its momentum better). High frequency photons (like sunlight) are terrible at measuring momentum, but can tell you position well.

So far this seems more like an engineering problem than a problem with the universe.  Maybe we could arrange things so that the high frequency photon hit softer or something?  There was a lot of back and forth for a long time (still is in some circles) about overcoming the uncertainty principle, but in the end it can never be violated.

Rather than being something that’s merely very challenging like, “you can’t break the sound barrier”, “what goes up must come down”, and “you can’t be the world’s best kick-boxer and be the world’s most handsome physicist”, the uncertainty principle is a mathematical absolute.  So, unless the basic assumptions of physics are completely wrong (and they’ve held up to some serious scrutiny), the uncertainty principle is in the company of things like “you can’t go faster than light”, “energy and mass are conserved”, and “modern mathematicians don’t have beards” (has anyone else noticed this?).  What follows is answer gravy.

Answer gravy: This gravy has some lumps.  If you know what a “Fourier transform” is, and are at least a little comfortable with them, then this could be interesting to you.

The square of a quantum wave function is the probability of finding it in a particular state.  For example, the “position wave function” can tell you the probability of finding a particle at any position. To get the probability from the wave function, all you have to do is square the wave function.

If you’ve got the quantum wave function f(x) for the position of a particle, then you can find the the momentum wave function, g(p), by taking the Fourier transform of f.  That is, g=\hat{f}.  Now, you can define the uncertainty as the standard deviation of the probability function, which is a really good way to go about it.

A probability function (blue), with its uncertainty or standard deviation (red). Like you'd expect, the particle is most likely to be near zero, but it's not certain to be near zero.

The uncertainty principle now just boils down to the statement that the product of the uncertainties of the square of a function, f, and the square of its Fourier transform, \hat{f}, is always greater than some constant.  In what follows you’ll find some useful stuff such as Plancherel’s theorem and Cauchy-Schwartz.

\begin{array}{ll}\sigma_x\sigma_p=\sigma_{|f|^2}\sigma_{|\hat{f}|^2}\\=\sqrt{Var(|f|^2)}\sqrt{Var(|\hat{f}|^2)}\\=\left(\int x^2|f|^2\,dx\right)^{\frac{1}{2}}\left(\int\xi^2|\hat{f}|^2\,d\xi\right)^{\frac{1}{2}}\\=\left(\int |xf|^2\,dx\right)^{\frac{1}{2}}\left(\int|\xi\hat{f}|^2\,d\xi\right)^{\frac{1}{2}}\\=\frac{1}{2\pi}\left(\int |xf|^2\,dx\right)^{\frac{1}{2}}\,\left(\int|\widehat{f^\prime}|^2\,d\xi\right)^{\frac{1}{2}}&(\widehat{f^\prime}=2\pi i\xi\hat{f})\\=\frac{1}{2\pi}\left(\int |xf|^2\,dx\right)^{\frac{1}{2}}\,\left(\int|f^\prime|^2\,d\xi\right)^{\frac{1}{2}}&(\textrm{Plancherel})\\ \ge\frac{1}{2\pi}\int|xf f^\prime|\,dx&(\textrm{Cauchy-Schwartz})\\=\frac{1}{2 \pi} \int |x| \, |f| \, |f^\prime| \, dx \\\ge \frac{1}{2 \pi} \left| \int x |f| f^\prime \, dx \right| \\= \frac{1}{2 \pi} \left| \int (x) (\frac{1}{2}|f|^2)^\prime \, dx \right|&(\frac{1}{2}|f|^2)^\prime=|f| f^\prime\\= \frac{1}{4 \pi} \left| \int |f|^2 \, dx \right|&(\textrm{integration by parts})\\=\frac{1}{4 \pi}&(\textrm{the total probability is always 1})\end{array}

So, there’s the Heisenberg uncertainty principle: \sigma_{|f|^2} \sigma_{|\hat{f}|^2} \ge \frac{1}{4 \pi}.  A physicist would recognize this as \Delta x \Delta p \ge \frac{\hbar}{2}.  The difference comes about because the Fourier transform that takes you from the position wave function to the momentum wave function involves an h, and \hbar = \frac{h}{2\pi}.  (For the physicists out there who were wondering what happened to their precious h’s)

Q: What’s the chance of getting a run of K or more successes (heads) in a row in N Bernoulli trials (coin flips)? Why use approximations when the exact answer is known?

Saturday, July 24th, 2010

The original question was: Recently I’ve come across a task to calculate the probability that a run of at least K successes occurs in a series of N (K≤N) Bernoulli trials (weighted coin flips), i.e. “what’s the probability that in 50 coin tosses one has a streak of 20 heads?”. This turned out to be a very difficult question and the best answer I found was a couple of approximations.

So my question to a mathematician is: “Why is this task so difficult compared e.g. to finding the probability of getting 20 heads in 50 coin tosses solved easily using binomial formula? How is this task in fact solved – is there an exact analytic solution? What are the main (preferably simplest) approximations and why are they used instead of an analytic solution?”

Physicist: What follows was not obvious. It was the result of several false starts. It’s impressive in the same sense that a dude with seriously bruised legs, who can find anything in his house with the lights off, is also impressive. If you’re looking for the discussion of the gross philosophy, and not the detailed math, skip down to where you find the word “Jabberwocky” in bold. If you want specific answers for fixed, small numbers of coins, or you want sample computer code for calculating the answer, go to the bottom of the article.

The short answer: the probability, S, of getting K or more heads in a row in N independent attempts (where p is the probability of heads and q=1-p is the probability of tails) is:

S(N,K) = p^K\sum_{T=0}^\infty {N-(T+1)K\choose T}(-qp^K)^T-\sum_{T=1}^\infty {N-TK\choose T}(-qp^K)^T

Note that here {A\choose B} is the choose function (also called the binomial coefficients) and we are applying a non-standard convention that {A\choose B}= 0 for A < B which makes the seemingly infinite sums always have only a finite number of terms. In fact, for N and K fixed, the answer is a polynomial with respect to the variable p.

Originally this was a pure math question that didn’t seem interesting to a larger audience, but we both worked for long enough on it that it seems a shame to let it go to waste. Plus, it gives me a chance to show how sloppy (physics type) mathematics is better than exact (math type) mathematics.

Define {Xi}j as the list of results of the first j trials. e.g., if j=4, then {Xi}j might be “{Xi}4=HTHH” or “{Xi}4=TTTH” or something like that, where H=heads and T=tails. In the second case, X1=T, X2=T, X3=T, and X4=H.

Define “Ej” as the event that there is a run of K successes (heads) in the first j trials. The question boils down to finding P(EN).

Define “Aj” as the event that the last K terms of {Xi}j are T followed by K-1 H’s ({Xi}j = X1X2X3X4…THHH…HHH). That is to say, if the next coin (Xj+1) is heads, then you’ve got a run of K.

Finally, define p = P(H) and q = P(T) = 1-p. Keep in mind that a “bar” over an event means “not”. So “\overline{H}=T” reads “not heads equals tails”

The probability of an event is the sum of the probabilities of the different (disjoint) ways that event can happen. So:

\begin{array}{ll}&P(E_{j+1})\\i)&=P(E_{j+1}\cap E_j\cap A_j)+P(E_{j+1}\cap E_j\cap \overline{A_j})+P(E_{j+1}\cap \overline{E_j}\cap A_j)+P(E_{j+1}\cap \overline{E_j}\cap \overline{A_j})\\ii)&=\left[P(E_{j+1}\cap E_j\cap A_j)+P(E_{j+1}\cap E_j\cap \overline{A_j})\right]+P(E_{j+1}\cap \overline{E_j}\cap A_j)+P(E_{j+1}\cap \overline{E_j}\cap \overline{A_j})\\iii)&=\left[P(E_{j+1}\cap E_j)\right]+P(E_{j+1}\cap \overline{E_j}\cap A_j)+P(E_{j+1}\cap \overline{E_j}\cap \overline{A_j})\\iv)&=P(E_j)+P(E_{j+1}\cap \overline{E_j}\cap A_j)+P(E_{j+1}\cap \overline{E_j}\cap \overline{A_j})\\v)&=P(E_j)+P(E_{j+1}\cap \overline{E_j}\cap A_j)+0\\vi)&=P(E_j)+P(E_{j+1}|\overline{E_j}\cap A_j)P(\overline{E_j}\cap A_j)\\vii)&=P(E_j)+pP(\overline{E_j}\cap A_j)\\viii)&=P(E_j)+pP(\overline{E_j}| A_j)P(A_j)\\ix)&=P(E_j)+pP(\overline{E_j}| A_j)qp^{K-1}\\x)&=P(E_j)+qp^KP(\overline{E_{j-k}})\\xi)&=P(E_j)+qp^K\left[1-P(E_{j-k})\right]\\xii)&=P(E_j)+qp^K-qp^KP(E_{j-k})\end{array}

iv) comes from the fact that E_j \subset E_{j+1}. If you have a run of K heads in the first j trials, of course you’ll have a run in the first j+1 trials. v) The zero comes from the fact that if the first j terms don’t have a run of K heads and the last K-1 terms are not all heads, then it doesn’t matter what the j+1 coin is, you can’t have a run of K heads (you can’t have the event Ej+1 and not Ej and not Aj). vii) is because if there is no run of K heads in the first j trials, but the last K-1 of those j trials are all heads, then the chance that there will be a run of K in the first j+1 trials is just the chance that the next trial comes up heads, which is p. ix) the chance of the last K trials being a tails followed by K-1 heads is qpK-1. x) If the last K (of j) trials are a tails followed by K-1 heads, then whether a run of K heads does or doesn’t happen is determined in the first j-K trials.
The other steps are all probability identities (specifically P(C)=P(C\cap D)+P(C\cap \overline{D}), \, P(\overline{C})=1-P(C), and Bayes’ theorem: P(C\cap D)=P(C|D)P(D)).

Rewriting this with some N’s instead of j’s, we’ve got a kick-ass recursion: P(E_N)=P(E_{N-1})+qp^K-qp^KP(E_{N-K-1})

And just to clean up the notation, define S(N,K) as the probability of getting a string of K heads in N trials (up until now this was P(EN)).

S(N,K)=S(N-1,K)+qpK-qpKS(N-K-1,K). We can quickly figure out two special cases: S(K,K) = pK, and S(l,K) = 0 when lK, and that there’s no way of getting K heads without flipping at least K coins. Now check it:

\begin{array}{ll}&S(N,K)\\i)&=S(N-1,K)+qp^K-qp^KS(N-K-1,K)\\ii)&=\left[S(N-2,K)+qp^K-qp^KS(N-K-2,K)\right]+qp^K-qp^KS(N-K-1,K)\\iii)&=S(N-2,K)+2qp^K-qp^K\left[ S(N-K-1,K)+S(N-K-2,K)\right]\\iv)&=S(N-3,K)+3qp^K-qp^K\left[ S(N-K-1,K)+S(N-K-2,K)+S(N-K-3,K)\right]\\v)&=S(N-(N-K),K)+(N-K)qp^K-qp^K\left[ S(N-K-1,K)+\cdots+S(N-K-(N-K),K)\right]\\vi)&=S(K,K)+(N-K)qp^K-qp^K\left[ S(N-K-1,K)+\cdots+S(0,K)\right]\\vii)&=p^K+(N-K)qp^K-qp^K\sum_{r=0}^{N-K-1} S(r,K)\\viii)&=p^k+(N-K)qp^K-qp^K\sum_{r=K}^{N-K-1} S(r,K)\end{array}

ii) Plug the equation for S(N,K) in for S(N-1,K). iii-vi) is the same thing. vii) write the pattern as a sum. viii) the terms up to K-1 are all zero, so drop them.

Holy crap! A newer, even better recursion! It seems best to plug it back into itself!

\begin{array}{ll}i)&S(N,K)=p^K+(N-K)qp^K-qp^K\sum_{r=K}^{N-K-1} S(r,K)\\ii)&=p^K+(N-K)qp^K-qp^K\sum_{r=K}^{N-K-1} \left[p^k+(r-K)qp^K-qp^K\sum_{\ell=K}^{r-K-1} S(\ell,K)\right]\\iii)&=p^K+(N-K)qp^K-qp^K\sum_{r=K}^{N-K-1}p^k-qp^K\sum_{r=K}^{N-K-1}(r-K)qp^K+\left(qp^K\right)^2\sum_{r=K}^{N-K-1}\sum_{\ell=K}^{r-K-1} S(\ell,K)\\iv)&=p^K+(N-K)qp^K-qp^K\sum_{r=1}^{N-2K}p^k-\left(qp^K\right)^2\sum_{r=0}^{N-2K-1}r+\left(qp^K\right)^2\sum_{r=2K+1}^{N-K-1}\sum_{\ell=K}^{r-K-1} S(\ell,K)\\v)&=p^K+(N-K)qp^K-p^k(N-2K)qp^K-\left(qp^K\right)^2\frac{(N-2K)(N-2K-1)}{2}+\left(qp^K\right)^2\sum_{r=2K+1}^{N-K-1}\sum_{\ell=K}^{r-K-1} S(\ell,K)\\vi)&=p^K+{N-K\choose 1}qp^K-p^K{N-2K\choose 1}qp^K-{N-2K\choose 2}\left(qp^K\right)^2+\left(qp^K\right)^2\sum_{r=2K+1}^{N-K-1}\sum_{\ell=K}^{r-K-1} S(\ell,K)\end{array}

You can keep plugging the “newer recursion” back in again and again (it’s a recursion after all). Using the fact that \sum_{i=1}^N={N+1\choose 2} and \sum_{i=D}^M {\ell \choose D}={M+1 \choose D+1} you can carry out the process a couple more times, and you’ll find that:

\begin{array}{l}S(N,K)=p^K\left[1-{N-2K\choose 1}qp^K+{N-3K\choose 2}\left(qp^K\right)^2\cdots\right]+\left[{N-K\choose 1}qp^K-{N-2K\choose 2}\left(qp^K\right)^2+{N-3K\choose 3}\left(qp^K\right)^3\cdots\right]\\=p^K\sum_{T=0}^\infty {N-(T+1)K\choose T}(-qp^K)^T-\sum_{T=1}^\infty {N-TK\choose T}(-qp^K)^T\end{array}

There’s your answer.

In the approximate (useful) case:

Assume that N is pretty big compared to K. A string of heads (that can be zero heads long) starts with a tails, and there should be about Nq of those. The probability of a particular string of heads being at least K long is pk so you can expect that there should be around E=Nqpk strings of heads at least K long. When E≥1, that means that it’s pretty likely that there’s at least one run of K heads. When E<1, E=Nqpk is approximately equal to the chance of a run of at least K showing up.

Jabberwocky: And that’s why exact solutions are stupid.

Mathematician: Want an exact answer without all the hard work and really nasty formulas? Computers were invented for a reason, people.

We want to compute S(N,K), the probability of getting K or more heads in a row out of N independent coin flips (when there is a probability p of each head occurring and a probability of 1-p of each tail occurring). Let’s consider different ways that we could get K heads in a row. One way to do it would be to have our first K coin flips all be heads, and this occurs with a probability p^{K}. If this does not happen, then at least one tail must occur within the first K coin flips. Let’s suppose that j  is the position of the first tail, and by assumption it satisfies 1 \le j \le K. Then, the probability of having K or more heads in a row in the entire set of coins (given that the first tail occurred at j \le K) is simply the probability of having K or more heads in a row in the set of coins following the jth coin (since there can’t be a streak of K or more heads starting before the jth coin due to j being smaller or equal to K). But this probability of having a streak of K or more after the jth coin is just S(N-j,K). Now, since the probability that our first tail occurs at position j is the chance that we get j-1 heads followed by one tail, so it is p^{j-1} (1-p) . That means that the chance that the first tail occurs on coin j AND there is a streak of K or more heads is given by p^{j-1} (1-p) S(N-j,K). Hence, the probability that the first K coins are all heads, OR coin one is the first tails and the remainder have K or more heads in a row, OR coin two is the first tails and the remainder have K or more heads in a row, OR coin three is the first tails and…, is given by:

S(N,K) = p^{K} + \sum_{j=1,K} p^{j-1} (1-p) S(N-j,K)

Note that what this allows us to do is to compute S(N,K) by knowing the values of S(N-j,K) for  1 \le j \le K. Hence, this is a recursive formula for S(N,K) which relates harder solutions (with larger N values) to easier solutions (with smaller N values). These easier solutions can then be computed using even easier solutions, until we get to S(A,B) for values of A and B so small that we already know the answer (i.e. S(A,B) is very easy to calculate by hand). These are known as our base cases. In particular, we observe that if we have zero coins then there is a zero probability of getting any positive number of heads is zero, so S(0,K) = 0, and the chance of getting more heads than we have coins is zero, so S(N,K) = 0 for K>N.

All of this can be implemented in a few lines of (python) computer code as follows:

An important aspect of this code is that every time a value of S(N,K) is computed, it is saved so that if we need to compute S(N,K) again later it won’t take much work. This is essential for efficiency since each S(N,K) is computed using S(N-j,K) for each j with 1 \le j \le K and therefore if we don’t save our prior results there will be a great many redundant calculations.

For your convenience, here is a table of solutions for S(N,K) for 1 \le N \le 10 and 1 \le K \le 10 (click to see it enlarged):

Q: How it is that Bell’s Theorem proves that there are no “hidden variables” in quantum mechanics? How do we know that God really does play dice with the universe?

Tuesday, June 22nd, 2010

Physicist: Bell’s theorem, and its philosophical fallout, is one of the most profound discoveries since relativity.

Bell’s theorem states (among other things) that the universe is fundamentally unpredictable, and that quantum mechanical things (for example: everything) are not actually in one state.  If a box could contain either a blue marble or a red marble, then when you open it you’ll see either on or the other.  In “reality” it was one color or the other before you open the box.  In QM, it can be both before you open the box (it’s actually still both afterwords, but moving on…).

Einstein (and most other physicists of the time) believed that if you knew everything about a system of particles (no matter how big) that you could theoretically predict what that system will be doing in the future, perfectly.  Homeboy thought that the only reason that the movement of air molecules seems to be random, is that we can’t perfectly measure that exact position and velocity of every single one.  So he thought that every particle truly is in some particular state, but that we merely don’t know for sure what that state is (the marble in the box has only one color, but we don’t know what it is).

The idea that randomness and unpredictability are caused by unknown (or unknowable) things is called “hidden variable theory” (The ‘Stein believed in this).  For example; 2, 2, 3, 6, 0, 6, 7, 9, 7, 7, 4, 9, 9, … is not random, but seems random.  It would be really hard to predict the next term (7) if you don’t know the hidden variable.  (BTW, the “hidden variable” is: this is the decimal expansion of \sqrt{5})

Bell’s theorem essentially boils down to a proof that the result of an experiment doesn’t exist until the measurement is made (so it can’t be predicted).  Hidden variable theory presupposes that the particles involved are in definite states, which means that the result of a measurement already exists before the measurement is made.  For example: before you open a gift what you’ll see is already set in stone.  The gift is a set thing before you open the box.  This is not the case for most quantum mechanical systems.

Here’s one of the experiments that demonstrates Bell’s theorem, and two ways to look at it.

An entangled pair of photons is created and fired in opposite directions. En route the polarizers are randomly oriented, then the detectors measure whether or not the photons pass through. This is done hundreds of thousands of times to measure the relationship between 1) the difference in angles between the polarizers and 2) the probability of measuring the same result.

The experiment: Step 1: Create a pair of entangled photons and fire them in opposite directions.  Entangled particles always yield the same result when they are subjected to the same measurement, and are likely to yield the same result for similar measurements.

Step 2: Randomly orient the polarizers, after the entangled pair is created, but before either is detected (this is hard to time, and is really fast).  This is done so that the photons “don’t know what to expect” and “can’t compare notes”.  Information about polarizer A would have to travel faster than the speed of light to get to photon B before photon B hits it’s own polarizer.  So, without faster than light effects (which don’t exist for many, really good reasons) the photons are each acting independently.  The orientation is random so that the photons can’t “plan ahead”.

Step 3: Measure the polarization.  If the detector “clicks” then the photon made it through the polarizer, and therefore has the same polarization.  If the detector doesn’t click, then the photon had the opposite polarization and was stopped.

The probability of the measurements being the same (for an entangled pair) is P = \cos^2{(\theta)}, where \theta is the difference in angles between the polarizers.  It is tricky to see why, but this probability is impossible if you assume that the result of a measurement exists before the measurement is made.  Here’s why.

The possible polarizations for polarizer A (red) and polarizer B (blue).

Algebraic approach: Restricting the possible angles of the polarizers to 0° and 45° for A, and 22.5° and 67.5° for B, run the experiment. Here’s what’s about to happen:

1) If you could predict the outcome of each version of the experiment, then you could find a definite value of L (see below).

2) For strictly (unarguable) mathematical reasons L = ±2.

3) Experimentally we find that the average value of L is 2√2.

4) But this is a contradiction, so we cannot actually make useful predictions.

Now it’s happening:

If polarizer A is at 0° and the detector clicks then you’d say “A0 = 1″, and if the detector doesn’t click then “A0 = -1″.  Similarly, you can define B67.5, A45, and B22.5.  Just for the hell of it, take a look at: L = A0B22.5 + A45B22.5 + A45B67.5 - A0B67.5 = (A0 + A45)B22.5 + (A45 - A0)B67.5

L = (A0 + A45)B22.5 + (A45 - A0)B67.5 = ±2, since either (A0 + A45) = ±2 and (A45 - A0) = 0, or (A0 + A45) = 0 and (A45 - A0) = ±2.  So L = A0B22.5 + A45B22.5 + A45B67.5 - A0B67.5 = ±2 ≤ 2.

So if you could fill out each of these values (A0, A45, B22.5, B67.5), then L = ±2 ≤ 2.

However, you can’t make all of these measurements simultaneously, so you can’t actually get A0B22.5 + A45B22.5 + A45B67.5 - A0B67.5 for each run of the experiment.  The best you can do is find one of these four terms each time you run the experiment.  For example, if the polarizer A was randomly set to 45° and the detector clicked, and polarizer B was randomly set to 22.5° and the detector didn’t click, then you just found out that A45B22.5 = (1)(-1) = -1 for that run.

You can however find the expectation value by running the experiment over and over and keeping track of the results and polarizer orientation.

E[A0B22.5] + E[A45B22.5] + E[A45B67.5] – E[A0B67.5] = E[A0B22.5 + A45B22.5 + A45B67.5 - A0B67.5] ≤E[|A0B22.5 + A45B22.5 + A45B67.5 - A0B67.5|] = E[2] = 2.

So E[A0B22.5] + E[A45B22.5] + E[A45B67.5] – E[A0B67.5] ≤ 2.  This is one version of “Bell’s Inequality”, and it holds if each term (A0, A45, B22.5, B67.5) has a value.

Using the fact that the chance of getting the same result is P = \cos^2{(\theta)}, and that each term is 1 when the results are the same ((1)(1) or (-1)(-1)), and -1 when the results are different ((1)(-1) or (-1)(1)), you can calculate each term.  For example:

E[A_0B_{22.5}]=P(same)-P(different)=\cos^2{(22.5)}-(1-\cos^2{(22.5)})=\frac{1}{\sqrt{2}}

You’ll find that:

E[A_0B_{22.5}]+E[A_{45}B_{22.5}]+E[A_{45}B_{67.5}]-E[A_0B_{67.5}]=\frac{1}{\sqrt{2}}+\frac{1}{\sqrt{2}}+\frac{1}{\sqrt{2}}-\frac{-1}{\sqrt{2}}=2\sqrt{2}

Holy crap!  2\sqrt{2}>2!  But that’s a violation of Bell’s inequality!  But the existence of each measurement (whether or not you actually do that measurement) is all you need for Bell’s inequality!  So if the inequality is false, then the result of those measurements don’t exist if the measurement isn’t made!

God plays dice with the universe.

Maybe, if you're clever and have ready access to a time machine, you could go back and do all the measurements you didn't make the first time. Then all the results would have to exist! They'd just have to!

Me and my time machine vs. quantum mechanics: If the results exist, but you just didn’t happen to do all the measurements, why not get a time machine?  Then you could do one measurement, go back, do a different measurement, go back, do a different measurement, …  Then every possible result would be known.

However, once again that correlation probability (P = \cos^2{(\theta)}) screws things up.

So, for example, if the photon goes through at 50°, and then you go back in time, change the polarizer to 51°, and repeat the experiment, then there’s a 99.97% (cos2(1°) = 0.9997) that the photon will go through again.

One result from probability says that P(x=z)\ge P(x=y)+P(y=z)-1.  Do this twice and you get P(w=z)\ge P(w=x)+P(x=z)-1\ge P(w=x)+P(x=y)+P(y=z)-2.  So if you measure in the 0° direction to find A0, then go back and change the angle by 1° and repeat this until you’re measuring at 90°, then:

P(A_0=A_{90})\ge P(A_0=A_1)+P(A_1=A_2)+\cdots+P(A_{89}=A_{90})-89 =90\cos^2{(1^o)}-89=0.9726

So, if you go back and forth in time to measure whether or not the photon goes through at 1° increments, then there’s a 97% chance that by the time you get to 90° you’ll be getting the same result you did at 0°.  However, in reality P(A_0=A_{90})=\cos^2{(90^o)}=0.

But this is a contradiction.  So the results of each measurement (A0, A1, A2, …, A90) can’t all exist.

If I had to guess, every time you go back in time the experiment is completely reset, and the experiment becomes completely random again.  The reason (such as it is) is below this unsettling picture.

Wait. Wait... Why?

But why?!: It turns out that the reason that the results of a quantum event can’t be predicted, is that every possible result of that event plays out.  So if you ask “will I see the photon go through the polarizer?” the answer is “yes, some versions of you will see the photon go through” and an equally valid answer is “no, some versions of you will not”.

If different versions of you will see every possible result, then the result can’t be predicted, and doesn’t really exist one way or the other until after the measurement is done.  At that time the different versions of you will disagree on the result.  But don’t worry too much.  You’ll never meet you’re parallel-universe twins.

Q: Can math and science make you better at gambling?

Sunday, May 23rd, 2010

Physicist: Yes.

Don’t gamble (mathematically speaking).

Mathematician: Gambling (at casinos, in lotteries, and in most other instances) is expected value negative, even when you play optimally. That means that the average amount of money you will make per play is negative (i.e. you will lose money, on average). It also implies (via the Central Limit Theorem) that when you play many times, there will be a greater than 50% chance that you lose money overall.

There are a few exceptions to this rule. Card counting in games like blackjack can be expected value positive if you are very good at it and increase your bet sizes at the right times, but casinos are savvy and make this difficult (e.g. by shuffling many decks of cards together). Plus, if you get caught, you will get banned, or worse. Betting against friends can also be expected value positive if you can be sure you’re really a better gambler than they are. Poker at casinos (where you play against other gamblers) can also be expected value positive if you’re very good, but since the casino charges you to play, even if you’re better than the other players you may not (on average) come out positive.

A good rule of thumb is this: If you don’t enjoy gambling, then simply don’t do it since (on average) it will just waste your money. If you do enjoy it, then think of it as a recreational activity, not as a way to make money. Before starting, decide how much money you are willing to lose, and if you use that money up, quit immediately. If you assume that you are going to lose then you won’t be disappointed.

All of this being said, if you do decide to gamble, you’d better study up on your probability! It’s important to know what the best course of action is (probabilistically speaking) at each decision point in the game.

Q: What’s so special about the Gaussian distribution (i.e. the normal distribution / bell curve)??

Saturday, February 13th, 2010

Physicist: A big part of what makes physicists slothful and attractive is a theorem called the “central limit theorem”.  In a nutshell it says that, even if you can’t describe how a single random thing happens, a whole mess of them together will act like a gaussian.  If you have a weighted die I won’t be able to tell you the probability of each individual number being rolled.  But (given the mean and variance) if you roll a couple dozen weighted dice and add them up I can tell you (with fairly small error) the probability of any sum, and the more dice the smaller the error.  Systems with lots of pieces added together show up all the time in practice, so knowing your way around a gaussian is well worth the trouble.

Gaussians also maximize entropy for a given energy (or other conserved quadratic quantity, energy is quadratic because E = \frac{1}{2} mv^2).  So if you have a bottle of gas at a given temperature (which fixes the total energy) you’ll find that the probability that a given particle is moving with a given velocity is given by a gaussian distribution.

From quantum mechanics, gaussians are the most “certain” wave functions.  The “Heisenberg uncertainty principle” states that for any wave function \Delta X \Delta P \ge \frac{\hbar}{2}, where \Delta X is the uncertainty in position and \Delta P is the uncertainty in momentum.  For a gaussian: \Delta X \Delta P = \frac{\hbar}{2}, the absolute minimum total uncertainty.

And much more generally, we know a lot about gaussians and there’s a lot of slick, easy math that works best on them.  So whenever you see a “bump” a good gut reaction is to pretend that it’s a “gaussian bump” just to make the math easier.  Sometimes this doesn’t work, but often it does or it points you in the right direction.

 

Mathematician: I’ll add a few more comments about the Gaussian distribution (also known as the normal distribution or bell curve) that the physicist didn’t explicitly touch on. First of all, while it is an extremely important distribution that arises a lot in real world applications, there are plenty of phenomenon that it does not model well. In particular, when the central limit theorem does not apply (i.e. our data points were not produced by taking a sum or average over samples drawn from more or less independent distributions) and we have no reason to believe our distribution should have maximum entropy, the normal distribution is the exception rather than the rule.

To give just one of many, many examples where non-normality arises: when we are dealing with a product (or geometric mean) of (essentially independent) random variables rather than a sum of them, we should expect that the resulting distribution will be approximately log-normal rather than normal (see image below). As it turns out, daily returns in the stock market are generally better modeled using a log-normal distribution rather than a normal distribution (perhaps this is the case because the most a stock can lose in one day is -100%, whereas the normal distribution assigns a positive probability to all real numbers). There are, of course, tons of other distributions that arise in real world problems that don’t look normal at all (e.g. the exponential distribution, Laplace distribution, Cauchy distribution, gamma distribution, and so on.)

 

Human height provides an interesting case study, as we get distributions that are almost (but not quite) normally distributed. The heights of males (ignoring outliers) are close to being normal (perhaps height is the result of a sum of a number of nearly independent factors relating to genes, health, diet, etc.). On the other hand, the distribution of heights of people in general (i.e. both males and females together) looks more like the sum of two normal distributions (one for each gender), which in this case is like a slightly skewed normal distribution with a flattened top.

 

I’ll end with a couple more interesting facts about the normal distribution. In Fourier analysis we observe that, when it has an appropriate variance, the normal distribution is one of the eigenvectors of the Fourier transform operator. That is a fancy way of saying that the gaussian distribution represents its own frequency components. For instance, we have this nifty equation (relating a normal distribution to its Fourier transform):

e^{- \pi x^2}=\int_{-\infty}^{\infty} e^{- \pi s^2}e^{-2 \pi i s x} ds.

Note that the general equation for a (1 dimensional) Gaussian distribution (which tells us the likelihood of each value x) is

\frac{ e^{- \frac{(x-\mu)^2}{2 \sigma^2}} } {\sqrt{2 \pi \sigma^2}}

where \mu is the mean of the distribution, and \sigma is its standard deviation. Hence, the Fourier transform relation above deals with a normal distribution of mean 0 and standard deviation \frac{1}{\sqrt{2 \pi}}.

Another useful property to take note of relates to solving maximum likelihood problems (where we are looking for the parameters that make some data set as likely as possible). We generally end up solving these problems by trying to maximize something related to the log of the probability distribution under consideration. If we use a normal distribution, this takes the unusually simple form

\log{ \frac{ e^{- \frac{(x-\mu)^2}{2 \sigma^2}} } {\sqrt{2 \pi \sigma^2}} } = - \frac{(x-\mu)^2}{2 \sigma^2} - \frac{1}{2} \log(2 \pi \sigma^2)

which is often nice enough to allow for solutions that can be calculated exactly by hand. In particular, the fact that this function is quadratic in x makes it especially convenient, which is one reason that the Gaussian is commonly chosen in statistical modeling. In fact, the incredibly popular ordinary least squares regression technique can be thought of as finding the most likely line (or plane, or hyperplane) to fit a dataset, under the assumption that the data was generated by a linear equation with additive gaussian noise.

Q: What is the meaning of the term “random”? Can thinking affect the future?

Wednesday, February 3rd, 2010

The complete question was:

1 – When probability is invoked it commonly implies or states that something “will probably” or “is likely” to happen. Doesn’t this suggest that we can predict the future or just by thinking about it effect the future?

2 – What is the meaning of the term “random” as it is used in mathematics/statistics? Does randomness actually exist?

Physicist: It does suggest that we can predict the future.  In fact, some people formulate the definition (and purpose) of science as an attempt to predict the future.

Thought alone has absolutely no impact on probability or possibility.
The predictive power of probability is fairly useless when applied to individual events.  That is to say, if someone tells you that the next time you roll a die you’ll get a 5, then they’re full of it.  How would they know that?  However, if they say that the next several times you roll a die, about 1/6 of the time you’ll get a 5, then they’ll be correct.
In that last sentence the word “about” is underlined because pretty much all of the mathematical theory behind probability is wrapped up in figuring out the details of that “about” (probability distributions, variance, etc.).
Probabilities are found by experiment.  So when you say that “each side of a die will be rolled with a probability of 1/6″, what you’re really saying is “looking at all the dice in the past, each side came up 1/6th of the time”.
“Randomness”, at least until Bell fucked everything, is nothing more than a reflection of one’s lack of knowledge.  In practice I think it would be fair to define something as random if no one can think of a predictive model that does any better than just picking the most likely outcome.  For example, if you’re looking at coin flips and one person picks “heads” every time, and another gets a super computer, some psychics, and Mensa™ to pick sides, then they’ll both get the correct answer 50% of the time.
The interplay between perception and probability is subtle, especially where quantum mechanics is concerned.  It takes a surprising amount of study and contemplation to see where the weirdness of quantum mech enters the scene, so I’ll restrict this discussion to strictly classical (non-quantum) effects.  If you love quantum effects so much you want to marry them, then take a look at “Q: Do physicists really believe in true randomness?” and “Q: What is the connection between quantum physics and consciousness?“.  This has been a popular line of questions.
-Why thought and perception don’t do a damn thing, but seem to:  Say you lose your dog, Kolmogorov, in Madrid.  By the end of a week the dog could be almost anywhere in Spain.

The probability distribution of the dog before and after an observation.

Then someone spots Kolmogorov in Avila, and calls to tell you the next day.  Suddenly (by dog-spotting alone mind you) the probability distribution of Kolmogorov’s location changes dramatically.

This new probability distribution is called a “conditional distribution”.  You can tell when someone is using a conditional distribution because they’ll say things like “the probability, given that …”.  It may seem as though perceiving the dog has changed something in the universe, but keep in mind what comes first.  The dog’s location dictates where you’ll see it, not the other way around.

Another classic is the phenomenon of “hot” and “cold” tables.  You’ll find that sometimes a craps, blackjack, etc. table will suddenly be especially lucky or unlucky for a while.  It’s extremely common for people to gravitate toward tables during a winning streak, or, if things have been going well for a while, to get nervous and leave.  Both are examples of the belief that perception affects reality, or at the very least, that random things aren’t actually random.

Examples of random walks, specifically the “Wiener Process”.  Sometimes it looks like there’s a pattern, but there’s not.  And yes, that’s really his name; “Norbert Wiener“.

Here’s a thought experiment (or if you have the time, actual experiment): Have a friend flip a coin.  Try to guess what it is.  How often do you expect to get it right?  Do the same thing, but this time have your friend look at the coin first before you guess.  Do your guesses get any better or worse?  Does someone “out there” with information have any influence on probability?  Hells nope.

At the end of the day, every single experiment that you can do that involves chance, and somebody having some knowledge about results, will still act exactly the same as an experiment where nobody knows anything.

For no reason, here’s a quote about Wiener: “Gifted for abstract sciences, philosophy and literature, he also had an inclination to the fine arts. These tendencies were certainly enhanced by a meditative temperament partly due to his ungainliness and myopia, which disqualified him for the usual games of physical skill popular among youngsters…

Q: Is it possible to choose an item from an infinite set of items such that each one has an equal chance of being selected?

Sunday, January 31st, 2010

The complete question was: The other day I was trying to explain the difference between “impossible” and “with probability zero” to a friend. I remarked “if you pick an integer, and I pick an integer, the mathematical probability that we picked the same number is zero, but it’s certainly not impossible.” A seemingly harmless statement, but only later did I think, what does it even mean to “pick ANY integer with EQUAL probability?” Is there any meaning to a random variable that can take on an infinite number of values with equal probability? It would be like a uniform distribution that has one or both bounds stretched to infinity. Does this kind of object have a name? What kind of mathematical properties would this even have? Oh dear, this is blowing my mind just thinking about it. Please help!

Mathematician: Developing a consistent theory of probability for sets with an infinite number of elements (like the set of all integers, the set of all real numbers, or the set of real numbers between 0 and 1) requires dealing with a handful of subtle and tricky problems. In fact, to determine whether it is possible to choose one object from an infinite number of objects such that each object has the same probability of being chosen, we must delve into what we really mean by “possible”. Various interpretations of this question are:


1. Can we work with probabilities (as they are defined in a formal, mathematical way) on infinite sets of items, and if so, can we assign equal probabilities to each item?

2. Can a (formal, mathematical) procedure be devised to sample uniformly at random from an infinite set?

3. Can an algorithm be designed that can carry out this sampling procedure on a real computer such that the algorithm will terminate in a finite amount of time?

Each of these questions, which I’ll address one by one, raises interesting considerations.

Can a formal theory of probability be developed for infinite sets of items? Absolutely. For example, the Gaussian distribution (often known as the bell curve or normal distribution) is defined on the set of real numbers (so when you draw a sample according to this distribution, you get some real number). Strictly speaking, it assigns a probability of zero to each of the individual real numbers, but assigns a non-zero probability to subsets of real numbers. For example, if we sample from a Gaussian distribution, there is a formula that can tell us how likely the number is to be less than any particular value X (so the set of all numbers less than X is assigned a positive probability). Why does the gaussian distribution not assign non-zero probabilities to the actual numbers themselves? The reason (loosely speaking) is because the probability of getting ANY number when we sample from a distribution must be 1 (which just means that some number must always occur). On the other hand, the set of real numbers contains so many numbers that, if each of them had a non-zero probability of occurring, it would not be possible for the total probability (which is, essentially, just the sum or integral of all of the numbers’ probabilities) to be 1. Another, more intuitive way to think about this is to consider a dart board. If we throw a dart at the board and are willing to assume that matter is infinitely divisible (sorry, Physicist) then we will always hit some point on the board (assuming our aim isn’t too terrible). But, at the same time, the chance of hitting any particular point is negligibly small (zero in fact) since there are so many possible points. So clearly, to describe the probability of hitting this board’s points, it is not sufficient to consider only the probability of each individual point being hit, but rather we have to consider how likely we are to hit various regions of the board, such as the region constituting the bullseye. Even though each particular point has a zero probability of being hit, some point is always hit in practice, and the set of points that make up the bullseye together have a positive probability of being hit with each throw.

Okay, so we can define probabilities on an infinite set, though as the Gaussian distribution case shows, we may have to actually let the probabilities be assigned to subsets of our original set, rather than to every object in the set itself. But can we do this in such a way that every object in our set is sampled with equal likelihood? The answer is again yes, though with some caveats. For instance, if we limit ourselves to the real numbers that are between 0 and 1, we can assign a uniform distribution to these numbers which will give them each an equal probability. The uniform distribution basically says that the probability that a given sampled number  will be between a and b (for 0<a<b<1) is equal to b-a. This fact implies that all numbers are equally likely to be sampled from this distribution.

Fine, but can we define a probability distribution on the set of integers (rather than the real numbers between 0 and 1) such that they each occur with equal probability (i.e. does a uniform distribution on the integers exist)? The answer, sadly, is no. A probability mass function (which is the kind of probability distribution we need in this case) is defined to be a positive function that has a sum of values equal to 1. But any positive function that assigns an equal value to each integer must have probabilities that sum to either infinity or zero, so the desired distribution is impossible to construct. As a technical side note though, people sometimes try to get around this issue in Bayesian analysis by applying what are known as improper priors. Attempting to define a uniform distribution on the full set of real numbers also fails, for a very similar reason that it doesn’t work on the integers (it can not be the case that each real number (or equally sized interval of real numbers) has the same probability and the probability density function integrates to 1).

On to our second question of whether is it possible to come up with a formal mathematical procedure for sampling from infinite sets. The answer is yes, if we have an unlimited amount of time to spare. For real numbers between 0 and 1, we can use the following sampling procedure:

i.  Start with the number 0.0, and set n=1

ii. Set the nth digit of our number after the decimal point to a random number from 0 to 9.

iii. Increase n by 1 and return to step ii.

If this procedure were iterated forever, it would produce a single random number between 0 and 1, and all real numbers between 0 and 1 are equally likely to be generated. Essentially we are just constructing a number by choosing each of it’s decimal digits randomly.

But what if we wanted to carry out this procedure for the set of all integers? Here things get stranger. The natural choice is the following procedure:

i.  Start with the number 0.0, and set n=1

ii. Set the nth digit of our number BEFORE the decimal point to a random number from 0 to 9.

iii. Increase n by 1 and return to step ii.

If this procedure were carried out forever, it would seem as though it would produce an integer, and that this integer would have an equal probability of being any integer. This is true, in the sense that all integers that the algorithm produces have an equal likelihood of being produced (i.e. when it DOES produce integers those integers are each equally likely). But the algorithm does not actually do what we would like. We begin to see the problem when we pick any integer X, and ask the question, “what is the probability that this procedure would produce a number that is greater than X?”. The answer, is that there is a probability of 1 (i.e. a 100% chance), NO MATTER WHAT X IS. This makes sense, given that the integers stretch off to infinity, and that the number of integers close to 0 will always be dwarfed by the number of them far from it. But how can this procedure (when run forever) produce one, single integer, while at the same time having a probability of 1 of producing a number bigger than any particular number we choose? Well, each integer can be thought of as having an infinite sequence of zero digits to the left of its first non-zero digit. This algorithm will have a probability of 0 of producing a number with an infinite successive sequence of zeros, and therefore will have a zero probability of producing an integer! In other words, there is a probability of 1 that each number it produces will be (in a certain sense) “infinite”, so it does not serve the purpose that we hoped.

This brings us to our final question, regarding whether a terminating algorithm (that can be run on a real computer) can be created that will sample uniformly at random from an infinite set of numbers. The answer to this question is no, but with the footnote that this does not matter too much in practice (for reasons to be discussed). One way to understand why this is impossible is to consider how many bits it would take to transmit the number produced by such a sampling procedure. We would have to transmit some kind of code (agreed upon in advance) that represents the number we got from sampling, but since there are an infinite number of possible outcomes and since we have no knowledge (in advance) of what number will occur, we would need to have an infinite number of codes to represent all the outcomes. Hence, if  we think of the codes as numbers, and choose some number N, then some of the codes must contain more than N digits (since N digits is only enough to describe a finite number of different codes). But since this holds for all N no matter how large, this means that, on average, it would take literally forever to transmit one of these codes. But, if the number cannot be transmitted, no computer could ever make a copy of it, which implies that no computer could ever generate such a number. What this confusing, convoluted argument is getting at is the fact that a uniform distribution on an infinite set of items (if it existed) would have an infinite entropy, so the numbers sampled from such a procedure could never be transmitted or stored (as doing so, on average, would require an infinite number of bits) so there is no way that such an algorithm could be used in real life. One way to see that the entropy of such a distribution would be infinite is to note that if we define a uniform distribution on n items, that as we let n go to infinity the entropy of the distribution approaches infinity.

Despite these problems, some infinite sets have a nice property that the procedure of sampling (with equal probability) from them can be nicely approximated on a computer. For example, if we want to sample a real number between 0 and 1, we can approximate this procedure by limiting ourselves only to numbers with at most 40 digits after the decimal point, and then sampling uniformly at random from this restricted set. While this procedure is not perfect, it will produce numbers that (for most purposes) look like those we would get if we truly sampled from all real numbers between 0 and 1. On the other hand, sampling uniformly at random from the set of all integers cannot be approximated in any nice way (and hence, is in some sense an inaccessible procedure). The problem here is that, as noted, if you fix any number X that you like, 100% of integers are greater than that X, no matter what X is. Since real computers are limited in the size of the numbers they can store, any attempt to approximate the procedure of sampling from all integers will be limited to sampling from integers less than some number X, despite the fact that 100% of integers truly are above X. If we try to sample uniformly at random from the set of all integers (or the set of real numbers, for that matter) we are doomed to complete failure.