Archive for the ‘Equations’ Category

Q: Will we ever overcome the Heisenberg uncertainty principle?

Sunday, August 1st, 2010

Physicist: Nopers!

The Heisenberg uncertainty principle, while normally presented in physics circles, is actually a mathematical absolute.  So overcoming the uncertainty principle is exactly as hard as overcoming that whole “1+1=2″ thing.

The uncertainty principle (the “position/momentum uncertainty principle”) is generally presented like this: you have some particle floating along and you’d like to know its position and its momentum (velocity, whatever) so you bounce a photon off of it (“Bounce a photon off of it” is just another way of saying “try to see it”).  A general rule of thumb for light (waves in general really) is high frequency waves propagate in straight lines, and low frequency waves spread out.  That’s why sunlight (high frequency) seems to go in a perfectly straight line, but radio waves can spread out around corners.  For example, you can still pick up a radio station even when you can’t see it directly.

So, if you want to see where something is with precision you’ll need to use a high frequency photon.  After all, how can you trust the results from a wandering, low frequency photon?  But, if you use a precise, high-frequency, and thus, high-energy photon, you’ll end up smacking the hell out of the particle you’re trying to measure.  So you’ll know where it is pretty exactly, but it’ll go flying off with some unknown amount of momentum.  Any method you can come up with to measure the momentum will require you to use low-frequency, low-energy, gentle photons.  But then you won’t be able to figure out the particle’s position very well.

Low frequency photons (like radio waves) don't tell you much about where a particle is, but they doesn't knock it around much either (so you can measure its momentum better). High frequency photons (like sunlight) are terrible at measuring momentum, but can tell you position well.

So far this seems more like an engineering problem than a problem with the universe.  Maybe we could arrange things so that the high frequency photon hit softer or something?  There was a lot of back and forth for a long time (still is in some circles) about overcoming the uncertainty principle, but in the end it can never be violated.

Rather than being something that’s merely very challenging like, “you can’t break the sound barrier”, “what goes up must come down”, and “you can’t be the world’s best kick-boxer and be the world’s most handsome physicist”, the uncertainty principle is a mathematical absolute.  So, unless the basic assumptions of physics are completely wrong (and they’ve held up to some serious scrutiny), the uncertainty principle is in the company of things like “you can’t go faster than light”, “energy and mass are conserved”, and “modern mathematicians don’t have beards” (has anyone else noticed this?).  What follows is answer gravy.

Answer gravy: This gravy has some lumps.  If you know what a “Fourier transform” is, and are at least a little comfortable with them, then this could be interesting to you.

The square of a quantum wave function is the probability of finding it in a particular state.  For example, the “position wave function” can tell you the probability of finding a particle at any position. To get the probability from the wave function, all you have to do is square the wave function.

If you’ve got the quantum wave function f(x) for the position of a particle, then you can find the the momentum wave function, g(p), by taking the Fourier transform of f.  That is, g=\hat{f}.  Now, you can define the uncertainty as the standard deviation of the probability function, which is a really good way to go about it.

A probability function (blue), with its uncertainty or standard deviation (red). Like you'd expect, the particle is most likely to be near zero, but it's not certain to be near zero.

The uncertainty principle now just boils down to the statement that the product of the uncertainties of the square of a function, f, and the square of its Fourier transform, \hat{f}, is always greater than some constant.  In what follows you’ll find some useful stuff such as Plancherel’s theorem and Cauchy-Schwartz.

\begin{array}{ll}\sigma_x\sigma_p=\sigma_{|f|^2}\sigma_{|\hat{f}|^2}\\=\sqrt{Var(|f|^2)}\sqrt{Var(|\hat{f}|^2)}\\=\left(\int x^2|f|^2\,dx\right)^{\frac{1}{2}}\left(\int\xi^2|\hat{f}|^2\,d\xi\right)^{\frac{1}{2}}\\=\left(\int |xf|^2\,dx\right)^{\frac{1}{2}}\left(\int|\xi\hat{f}|^2\,d\xi\right)^{\frac{1}{2}}\\=\frac{1}{2\pi}\left(\int |xf|^2\,dx\right)^{\frac{1}{2}}\,\left(\int|\widehat{f^\prime}|^2\,d\xi\right)^{\frac{1}{2}}&(\widehat{f^\prime}=2\pi i\xi\hat{f})\\=\frac{1}{2\pi}\left(\int |xf|^2\,dx\right)^{\frac{1}{2}}\,\left(\int|f^\prime|^2\,d\xi\right)^{\frac{1}{2}}&(\textrm{Plancherel})\\ \ge\frac{1}{2\pi}\int|xf f^\prime|\,dx&(\textrm{Cauchy-Schwartz})\\=\frac{1}{2 \pi} \int |x| \, |f| \, |f^\prime| \, dx \\\ge \frac{1}{2 \pi} \left| \int x |f| f^\prime \, dx \right| \\= \frac{1}{2 \pi} \left| \int (x) (\frac{1}{2}|f|^2)^\prime \, dx \right|&(\frac{1}{2}|f|^2)^\prime=|f| f^\prime\\= \frac{1}{4 \pi} \left| \int |f|^2 \, dx \right|&(\textrm{integration by parts})\\=\frac{1}{4 \pi}&(\textrm{the total probability is always 1})\end{array}

So, there’s the Heisenberg uncertainty principle: \sigma_{|f|^2} \sigma_{|\hat{f}|^2} \ge \frac{1}{4 \pi}.  A physicist would recognize this as \Delta x \Delta p \ge \frac{\hbar}{2}.  The difference comes about because the Fourier transform that takes you from the position wave function to the momentum wave function involves an h, and \hbar = \frac{h}{2\pi}.  (For the physicists out there who were wondering what happened to their precious h’s)

Q: What’s the chance of getting a run of K or more successes (heads) in a row in N Bernoulli trials (coin flips)? Why use approximations when the exact answer is known?

Saturday, July 24th, 2010

The original question was: Recently I’ve come across a task to calculate the probability that a run of at least K successes occurs in a series of N (K≤N) Bernoulli trials (weighted coin flips), i.e. “what’s the probability that in 50 coin tosses one has a streak of 20 heads?”. This turned out to be a very difficult question and the best answer I found was a couple of approximations.

So my question to a mathematician is: “Why is this task so difficult compared e.g. to finding the probability of getting 20 heads in 50 coin tosses solved easily using binomial formula? How is this task in fact solved – is there an exact analytic solution? What are the main (preferably simplest) approximations and why are they used instead of an analytic solution?”

Physicist: What follows was not obvious. It was the result of several false starts. It’s impressive in the same sense that a dude with seriously bruised legs, who can find anything in his house with the lights off, is also impressive. If you’re looking for the discussion of the gross philosophy, and not the detailed math, skip down to where you find the word “Jabberwocky” in bold. If you want specific answers for fixed, small numbers of coins, or you want sample computer code for calculating the answer, go to the bottom of the article.

The short answer: the probability, S, of getting K or more heads in a row in N independent attempts (where p is the probability of heads and q=1-p is the probability of tails) is:

S(N,K) = p^K\sum_{T=0}^\infty {N-(T+1)K\choose T}(-qp^K)^T-\sum_{T=1}^\infty {N-TK\choose T}(-qp^K)^T

Note that here {A\choose B} is the choose function (also called the binomial coefficients) and we are applying a non-standard convention that {A\choose B}= 0 for A < B which makes the seemingly infinite sums always have only a finite number of terms. In fact, for N and K fixed, the answer is a polynomial with respect to the variable p.

Originally this was a pure math question that didn’t seem interesting to a larger audience, but we both worked for long enough on it that it seems a shame to let it go to waste. Plus, it gives me a chance to show how sloppy (physics type) mathematics is better than exact (math type) mathematics.

Define {Xi}j as the list of results of the first j trials. e.g., if j=4, then {Xi}j might be “{Xi}4=HTHH” or “{Xi}4=TTTH” or something like that, where H=heads and T=tails. In the second case, X1=T, X2=T, X3=T, and X4=H.

Define “Ej” as the event that there is a run of K successes (heads) in the first j trials. The question boils down to finding P(EN).

Define “Aj” as the event that the last K terms of {Xi}j are T followed by K-1 H’s ({Xi}j = X1X2X3X4…THHH…HHH). That is to say, if the next coin (Xj+1) is heads, then you’ve got a run of K.

Finally, define p = P(H) and q = P(T) = 1-p. Keep in mind that a “bar” over an event means “not”. So “\overline{H}=T” reads “not heads equals tails”

The probability of an event is the sum of the probabilities of the different (disjoint) ways that event can happen. So:

\begin{array}{ll}&P(E_{j+1})\\i)&=P(E_{j+1}\cap E_j\cap A_j)+P(E_{j+1}\cap E_j\cap \overline{A_j})+P(E_{j+1}\cap \overline{E_j}\cap A_j)+P(E_{j+1}\cap \overline{E_j}\cap \overline{A_j})\\ii)&=\left[P(E_{j+1}\cap E_j\cap A_j)+P(E_{j+1}\cap E_j\cap \overline{A_j})\right]+P(E_{j+1}\cap \overline{E_j}\cap A_j)+P(E_{j+1}\cap \overline{E_j}\cap \overline{A_j})\\iii)&=\left[P(E_{j+1}\cap E_j)\right]+P(E_{j+1}\cap \overline{E_j}\cap A_j)+P(E_{j+1}\cap \overline{E_j}\cap \overline{A_j})\\iv)&=P(E_j)+P(E_{j+1}\cap \overline{E_j}\cap A_j)+P(E_{j+1}\cap \overline{E_j}\cap \overline{A_j})\\v)&=P(E_j)+P(E_{j+1}\cap \overline{E_j}\cap A_j)+0\\vi)&=P(E_j)+P(E_{j+1}|\overline{E_j}\cap A_j)P(\overline{E_j}\cap A_j)\\vii)&=P(E_j)+pP(\overline{E_j}\cap A_j)\\viii)&=P(E_j)+pP(\overline{E_j}| A_j)P(A_j)\\ix)&=P(E_j)+pP(\overline{E_j}| A_j)qp^{K-1}\\x)&=P(E_j)+qp^KP(\overline{E_{j-k}})\\xi)&=P(E_j)+qp^K\left[1-P(E_{j-k})\right]\\xii)&=P(E_j)+qp^K-qp^KP(E_{j-k})\end{array}

iv) comes from the fact that E_j \subset E_{j+1}. If you have a run of K heads in the first j trials, of course you’ll have a run in the first j+1 trials. v) The zero comes from the fact that if the first j terms don’t have a run of K heads and the last K-1 terms are not all heads, then it doesn’t matter what the j+1 coin is, you can’t have a run of K heads (you can’t have the event Ej+1 and not Ej and not Aj). vii) is because if there is no run of K heads in the first j trials, but the last K-1 of those j trials are all heads, then the chance that there will be a run of K in the first j+1 trials is just the chance that the next trial comes up heads, which is p. ix) the chance of the last K trials being a tails followed by K-1 heads is qpK-1. x) If the last K (of j) trials are a tails followed by K-1 heads, then whether a run of K heads does or doesn’t happen is determined in the first j-K trials.
The other steps are all probability identities (specifically P(C)=P(C\cap D)+P(C\cap \overline{D}), \, P(\overline{C})=1-P(C), and Bayes’ theorem: P(C\cap D)=P(C|D)P(D)).

Rewriting this with some N’s instead of j’s, we’ve got a kick-ass recursion: P(E_N)=P(E_{N-1})+qp^K-qp^KP(E_{N-K-1})

And just to clean up the notation, define S(N,K) as the probability of getting a string of K heads in N trials (up until now this was P(EN)).

S(N,K)=S(N-1,K)+qpK-qpKS(N-K-1,K). We can quickly figure out two special cases: S(K,K) = pK, and S(l,K) = 0 when lK, and that there’s no way of getting K heads without flipping at least K coins. Now check it:

\begin{array}{ll}&S(N,K)\\i)&=S(N-1,K)+qp^K-qp^KS(N-K-1,K)\\ii)&=\left[S(N-2,K)+qp^K-qp^KS(N-K-2,K)\right]+qp^K-qp^KS(N-K-1,K)\\iii)&=S(N-2,K)+2qp^K-qp^K\left[ S(N-K-1,K)+S(N-K-2,K)\right]\\iv)&=S(N-3,K)+3qp^K-qp^K\left[ S(N-K-1,K)+S(N-K-2,K)+S(N-K-3,K)\right]\\v)&=S(N-(N-K),K)+(N-K)qp^K-qp^K\left[ S(N-K-1,K)+\cdots+S(N-K-(N-K),K)\right]\\vi)&=S(K,K)+(N-K)qp^K-qp^K\left[ S(N-K-1,K)+\cdots+S(0,K)\right]\\vii)&=p^K+(N-K)qp^K-qp^K\sum_{r=0}^{N-K-1} S(r,K)\\viii)&=p^k+(N-K)qp^K-qp^K\sum_{r=K}^{N-K-1} S(r,K)\end{array}

ii) Plug the equation for S(N,K) in for S(N-1,K). iii-vi) is the same thing. vii) write the pattern as a sum. viii) the terms up to K-1 are all zero, so drop them.

Holy crap! A newer, even better recursion! It seems best to plug it back into itself!

\begin{array}{ll}i)&S(N,K)=p^K+(N-K)qp^K-qp^K\sum_{r=K}^{N-K-1} S(r,K)\\ii)&=p^K+(N-K)qp^K-qp^K\sum_{r=K}^{N-K-1} \left[p^k+(r-K)qp^K-qp^K\sum_{\ell=K}^{r-K-1} S(\ell,K)\right]\\iii)&=p^K+(N-K)qp^K-qp^K\sum_{r=K}^{N-K-1}p^k-qp^K\sum_{r=K}^{N-K-1}(r-K)qp^K+\left(qp^K\right)^2\sum_{r=K}^{N-K-1}\sum_{\ell=K}^{r-K-1} S(\ell,K)\\iv)&=p^K+(N-K)qp^K-qp^K\sum_{r=1}^{N-2K}p^k-\left(qp^K\right)^2\sum_{r=0}^{N-2K-1}r+\left(qp^K\right)^2\sum_{r=2K+1}^{N-K-1}\sum_{\ell=K}^{r-K-1} S(\ell,K)\\v)&=p^K+(N-K)qp^K-p^k(N-2K)qp^K-\left(qp^K\right)^2\frac{(N-2K)(N-2K-1)}{2}+\left(qp^K\right)^2\sum_{r=2K+1}^{N-K-1}\sum_{\ell=K}^{r-K-1} S(\ell,K)\\vi)&=p^K+{N-K\choose 1}qp^K-p^K{N-2K\choose 1}qp^K-{N-2K\choose 2}\left(qp^K\right)^2+\left(qp^K\right)^2\sum_{r=2K+1}^{N-K-1}\sum_{\ell=K}^{r-K-1} S(\ell,K)\end{array}

You can keep plugging the “newer recursion” back in again and again (it’s a recursion after all). Using the fact that \sum_{i=1}^N={N+1\choose 2} and \sum_{i=D}^M {\ell \choose D}={M+1 \choose D+1} you can carry out the process a couple more times, and you’ll find that:

\begin{array}{l}S(N,K)=p^K\left[1-{N-2K\choose 1}qp^K+{N-3K\choose 2}\left(qp^K\right)^2\cdots\right]+\left[{N-K\choose 1}qp^K-{N-2K\choose 2}\left(qp^K\right)^2+{N-3K\choose 3}\left(qp^K\right)^3\cdots\right]\\=p^K\sum_{T=0}^\infty {N-(T+1)K\choose T}(-qp^K)^T-\sum_{T=1}^\infty {N-TK\choose T}(-qp^K)^T\end{array}

There’s your answer.

In the approximate (useful) case:

Assume that N is pretty big compared to K. A string of heads (that can be zero heads long) starts with a tails, and there should be about Nq of those. The probability of a particular string of heads being at least K long is pk so you can expect that there should be around E=Nqpk strings of heads at least K long. When E≥1, that means that it’s pretty likely that there’s at least one run of K heads. When E<1, E=Nqpk is approximately equal to the chance of a run of at least K showing up.

Jabberwocky: And that’s why exact solutions are stupid.

Mathematician: Want an exact answer without all the hard work and really nasty formulas? Computers were invented for a reason, people.

We want to compute S(N,K), the probability of getting K or more heads in a row out of N independent coin flips (when there is a probability p of each head occurring and a probability of 1-p of each tail occurring). Let’s consider different ways that we could get K heads in a row. One way to do it would be to have our first K coin flips all be heads, and this occurs with a probability p^{K}. If this does not happen, then at least one tail must occur within the first K coin flips. Let’s suppose that j  is the position of the first tail, and by assumption it satisfies 1 \le j \le K. Then, the probability of having K or more heads in a row in the entire set of coins (given that the first tail occurred at j \le K) is simply the probability of having K or more heads in a row in the set of coins following the jth coin (since there can’t be a streak of K or more heads starting before the jth coin due to j being smaller or equal to K). But this probability of having a streak of K or more after the jth coin is just S(N-j,K). Now, since the probability that our first tail occurs at position j is the chance that we get j-1 heads followed by one tail, so it is p^{j-1} (1-p) . That means that the chance that the first tail occurs on coin j AND there is a streak of K or more heads is given by p^{j-1} (1-p) S(N-j,K). Hence, the probability that the first K coins are all heads, OR coin one is the first tails and the remainder have K or more heads in a row, OR coin two is the first tails and the remainder have K or more heads in a row, OR coin three is the first tails and…, is given by:

S(N,K) = p^{K} + \sum_{j=1,K} p^{j-1} (1-p) S(N-j,K)

Note that what this allows us to do is to compute S(N,K) by knowing the values of S(N-j,K) for  1 \le j \le K. Hence, this is a recursive formula for S(N,K) which relates harder solutions (with larger N values) to easier solutions (with smaller N values). These easier solutions can then be computed using even easier solutions, until we get to S(A,B) for values of A and B so small that we already know the answer (i.e. S(A,B) is very easy to calculate by hand). These are known as our base cases. In particular, we observe that if we have zero coins then there is a zero probability of getting any positive number of heads is zero, so S(0,K) = 0, and the chance of getting more heads than we have coins is zero, so S(N,K) = 0 for K>N.

All of this can be implemented in a few lines of (python) computer code as follows:

An important aspect of this code is that every time a value of S(N,K) is computed, it is saved so that if we need to compute S(N,K) again later it won’t take much work. This is essential for efficiency since each S(N,K) is computed using S(N-j,K) for each j with 1 \le j \le K and therefore if we don’t save our prior results there will be a great many redundant calculations.

For your convenience, here is a table of solutions for S(N,K) for 1 \le N \le 10 and 1 \le K \le 10 (click to see it enlarged):

Q: What are complex numbers used for?

Wednesday, March 24th, 2010

Physicist: If you’ve ever had to do square roots you’ve probably come up against the problem of taking the square root of a negative number.  If you restrict your attention only to real numbers (0,1, -17, \pi, √2, …, any number you can think of), then there’s no way to take the square root of a negative.  But this makes taking square root, cube roots, and so on, a pain.

You have to remember: 1) The square root, forth root, sixth root, … of a positive number has two answers (positive and negative).  2) The square root, forth root, sixth root, … of a negative number has no answers.  3) The cube root, fifth root, seventh root, … of any number has one answer (with the same sign as the original number).  These are random, frustrating, difficult-to-remember rules.  Mathematicians had to deal with these all the time before the 1700′s.

Then Euler happened.  He was looking at a similar problem; finding the roots of polynomials.  Similar, because the question “x=\sqrt{-1}, what is x?”, is exactly the same as “x2+1=0, what is x?”.  He was annoyed that most polynomials have roots that don’t exist, so he said; “Sure…  But, what if the root did exist.  Like… in an imaginary sense…”.

That’s paraphrasing, but it’s pretty accurate.  He just declared that there’s a new number called “i” with the property that “i2 = -1″.

So, to actually answer: Complex numbers make it easy to take roots, and using complex numbers, all polynomials with terms up to xN have N roots.

Using only real numbers, the cube root of 1 is 1, and only 1. Using complex numbers you can see that the other two roots exist, they just happen to be off of the real line. In this picture the arrow to the right is "1", and the real line is the horizontal line. The other two arrows are the other two roots. The roots of a number are always equally spaced like this.

These two properties help complex numbers “complete” the real numbers.  That is to say; you don’t need to create “super complex numbers” to fix any problems with complex numbers.  One of the first “problems” that people ask about is; “Sure, \sqrt{-1} = \pm i, but what’s \sqrt{i}?”  Well, \sqrt{i} = \pm \left( \frac{1}{\sqrt{2}} + \frac{i}{\sqrt{2}} \right).  No problems!

Also, if you’ve ever had to do anything with trigonometry you’ve probably come up against:

\cos{(A+B)} = \cos{(A)} \cos{(B)} - \sin{(A)}\sin{(B)} \sin{(A+B)}=\sin{(A)}\cos{(B)}+\sin{(B)}\cos{(A)}

Which looks horrible, is horrible, and is horrible to deal with.  Here’s the same thing (both equations) using complex numbers:

e^{i(A+B)} = e^{iA}e^{iB}, which is just a direct application of Euler’s equation: e^{i \theta} = \cos{\theta} + i \sin{\theta}.  Almost any time that you have to do lots of summations or multiplications involving trig function, it’s best to bust out some complex numbers.

In the same vein, electrical engineers use “phasors” (phasor, not phaser) to talk about sinusoidal current (like what comes out of the wall).  Again, complex numbers = easy!

If you’ve gotten stuck behind a nasty integral in calculus (and if you haven’t, ignore this), you’ll find that many of them are surprisingly hard using real numbers, but baby simple using complex numbers.  For example: \int_0^\infty \frac{\sin{x}}{x} \, dx, \int_{-\infty}^\infty \frac{1}{1+x^2} \, dx, \int_0^\pi \frac{1}{2 \cos{x} + 1} \, dx, \int_{-\infty}^\infty \frac{p(x)}{q(x)} \, dx (where p and q are polynomials).

Some fields simply can’t be approached without complex numbers, particularly quantum mechanics.  In QM the probability of something happening is the square of the magnitude (absolute value) of the “probability amplitude”, which is complex-valued.  So, if the probability amplitude is \frac{i}{\sqrt{2}}, then the probability is \left| \frac{i}{\sqrt{2}} \right|^2 = \left( \frac{1}{\sqrt{2}} \right)^2 = \frac{1}{2}.  There’s really no way around this.  In fact, the Schrödinger equation, which is arguably the backbone of QM, has an “i” staring you right in the face.  Here’s the equation for a single particle

i\hbar\frac{\partial}{\partial t} \Psi(\mathbf{r},\,t) = -\frac{\hbar^2}{2m}\nabla^2\Psi(\mathbf{r},\,t) + V(\mathbf{r})\Psi(\mathbf{r},\,t)

where i is in the first term, and “\Psi” is the probability amplitude of the particle’s location.  That’s a bit much, but the point is; you can’t get rid of the complex numbers and do QM with just real numbers.

The moral of the story is that “complex numbers” are misnamed.  While they are intimidating at first, they make things simple all over the place.  Notably: trigonometry, anything with waves (electricity, light, sound), finding roots, streamlining math (often, but not always), and in quantum mechanics.

Q: How did mathematicians calculate trig functions and numbers like pi before calculators?

Saturday, February 27th, 2010

Physicist: Don’t know.  But if you’re ever stuck on a desert island, here are some tricks you can use.  The name of the game is “Taylor polynomials“.

\sin{(x)} = \sum_{n=0}^\infty \frac{(-1)^n}{(2n+1)!} x^{2n+1} = \frac{x}{1} - \frac{x^3}{1 \cdot 2 \cdot 3} + \frac{x^5}{1 \cdot 2 \cdot 3 \cdot 4 \cdot 5} - \frac{x^7}{1 \cdot 2 \cdot 3 \cdot 4 \cdot 5 \cdot 6 \cdot 7} + \cdots \cos{(x)} = \sum_{n=0}^\infty \frac{(-1)^n}{(2n)!} x^{2n} = 1 - \frac{x^2}{1 \cdot 2} + \frac{x^4}{1 \cdot 2 \cdot 3 \cdot 4} - \frac{x^6}{1 \cdot 2 \cdot 3 \cdot 4 \cdot 5 \cdot 6} + \cdots

All the other trig function are just combinations of sine and cosine, so this is really all you need.  Of course, you can’t add up an infinite number of terms, so if you only go up to the xL term then the error between the sum you have and the actual value of sine or cosine is no more than \frac{x^L}{L!}.  Now x can be pretty big, but you can use the fact that sine and cosine repeat every 2 \pi, as well as the fact that \sin{(x \pm \pi)} = -\sin{(x)} and \cos{(x \pm \pi)} = -\cos{(x)}, to get the “x” down to -\frac{\pi}{2} \le x \le \frac{\pi}{2}.  So if you sum up to the xL term, then your error will be no larger than \frac{1}{L!} \left( \frac{\pi}{2} \right)^L.  The “1/L!” makes this error pretty small.  Summing up to the x10 term will be accurate to within 3 parts in 100,000 at worst.

For example:

\sin{(16)} = \sin{(16 - 2\pi)} = \sin{(16 - 4\pi)} = -\sin{(16 - 5\pi)} \approx -\sin{(0.2920)}

Summing up to the x5 term yields:

\sin{(16)} \approx -\sin{(0.2920)} \approx - \left( 0.2920 - \frac{0.2920^3}{6} + \frac{0.2920^5}{120} \right) = - 0.2879

Which is accurate to at least the first 4 decimal places.

There aren’t a hell of a lot of important mathematical constants out there.  The most important are “e” and “\pi“.

e^x = \lim_{m \to \infty} \left( 1 +\frac{x}{m} \right)^m \approx \sum_{n=0}^L \frac{x^n}{n!} = 1+ \frac{x}{1}+\frac{x^2}{1 \cdot 2}+\frac{x^3}{1 \cdot 2 \cdot 3}+\cdots with an error of no more than \frac{x^{L+1}}{(L+1)!}.  This is another example of a Taylor polynomial.  To calculate only e, just set x=1.

\pi \approx 4 \sum_{n=0}^L \frac{(-1)^n}{2n+1} = 4 \left( 1 - \frac{1}{3} +\frac{1}{5} - \frac{1}{7} + \cdots \right) with an error of no more than \frac{4}{2L+3}.  One way to derive this equation is to take the Taylor series for Arctan, and plug in 1 (\arctan{(1)} = \frac{\pi}{4}).  This is easy to remember but slow to converge (2,000 terms to get 3 decimal places), so here’s a better one:

\pi \approx \sqrt{12}\sum^L_{k=0} \frac{(-1)^k}{(2k+1) 3^k} = \sqrt{12}\left(1-{1\over 3\cdot3}+{1\over5\cdot 3^2}-{1\over7\cdot 3^3}+\cdots\right) with an error of no more than \frac{\sqrt{12}}{(2k+3) 3^{k+1}} \sim \frac{1}{3^k}.

Most people are under the impression that “there is no pattern in pi“, so the fact that we can write down an equation to find pi may seem a little odd.  What is generally meant by “no pattern in pi” is that there doesn’t seem to be any pattern in the decimal representation of pi (3.14159…).

The Taylor series and the approximations of pi and e above may seem cumbersome, but in most sciences you’ll find that it’s rare for anybody to go beyond the second term in a Taylor polynomial (sin(x) = x, cosine(x) = 1-.5x2).  Moreover, due mostly to our crippling sloth and handsomeness, most physicists are happy to say that \pi = e = 3.  So if you’re striving to get things exactly right, you may actually be an engineer.

Q: What’s so special about the Gaussian distribution (i.e. the normal distribution / bell curve)??

Saturday, February 13th, 2010

Physicist: A big part of what makes physicists slothful and attractive is a theorem called the “central limit theorem”.  In a nutshell it says that, even if you can’t describe how a single random thing happens, a whole mess of them together will act like a gaussian.  If you have a weighted die I won’t be able to tell you the probability of each individual number being rolled.  But (given the mean and variance) if you roll a couple dozen weighted dice and add them up I can tell you (with fairly small error) the probability of any sum, and the more dice the smaller the error.  Systems with lots of pieces added together show up all the time in practice, so knowing your way around a gaussian is well worth the trouble.

Gaussians also maximize entropy for a given energy (or other conserved quadratic quantity, energy is quadratic because E = \frac{1}{2} mv^2).  So if you have a bottle of gas at a given temperature (which fixes the total energy) you’ll find that the probability that a given particle is moving with a given velocity is given by a gaussian distribution.

From quantum mechanics, gaussians are the most “certain” wave functions.  The “Heisenberg uncertainty principle” states that for any wave function \Delta X \Delta P \ge \frac{\hbar}{2}, where \Delta X is the uncertainty in position and \Delta P is the uncertainty in momentum.  For a gaussian: \Delta X \Delta P = \frac{\hbar}{2}, the absolute minimum total uncertainty.

And much more generally, we know a lot about gaussians and there’s a lot of slick, easy math that works best on them.  So whenever you see a “bump” a good gut reaction is to pretend that it’s a “gaussian bump” just to make the math easier.  Sometimes this doesn’t work, but often it does or it points you in the right direction.

 

Mathematician: I’ll add a few more comments about the Gaussian distribution (also known as the normal distribution or bell curve) that the physicist didn’t explicitly touch on. First of all, while it is an extremely important distribution that arises a lot in real world applications, there are plenty of phenomenon that it does not model well. In particular, when the central limit theorem does not apply (i.e. our data points were not produced by taking a sum or average over samples drawn from more or less independent distributions) and we have no reason to believe our distribution should have maximum entropy, the normal distribution is the exception rather than the rule.

To give just one of many, many examples where non-normality arises: when we are dealing with a product (or geometric mean) of (essentially independent) random variables rather than a sum of them, we should expect that the resulting distribution will be approximately log-normal rather than normal (see image below). As it turns out, daily returns in the stock market are generally better modeled using a log-normal distribution rather than a normal distribution (perhaps this is the case because the most a stock can lose in one day is -100%, whereas the normal distribution assigns a positive probability to all real numbers). There are, of course, tons of other distributions that arise in real world problems that don’t look normal at all (e.g. the exponential distribution, Laplace distribution, Cauchy distribution, gamma distribution, and so on.)

 

Human height provides an interesting case study, as we get distributions that are almost (but not quite) normally distributed. The heights of males (ignoring outliers) are close to being normal (perhaps height is the result of a sum of a number of nearly independent factors relating to genes, health, diet, etc.). On the other hand, the distribution of heights of people in general (i.e. both males and females together) looks more like the sum of two normal distributions (one for each gender), which in this case is like a slightly skewed normal distribution with a flattened top.

 

I’ll end with a couple more interesting facts about the normal distribution. In Fourier analysis we observe that, when it has an appropriate variance, the normal distribution is one of the eigenvectors of the Fourier transform operator. That is a fancy way of saying that the gaussian distribution represents its own frequency components. For instance, we have this nifty equation (relating a normal distribution to its Fourier transform):

e^{- \pi x^2}=\int_{-\infty}^{\infty} e^{- \pi s^2}e^{-2 \pi i s x} ds.

Note that the general equation for a (1 dimensional) Gaussian distribution (which tells us the likelihood of each value x) is

\frac{ e^{- \frac{(x-\mu)^2}{2 \sigma^2}} } {\sqrt{2 \pi \sigma^2}}

where \mu is the mean of the distribution, and \sigma is its standard deviation. Hence, the Fourier transform relation above deals with a normal distribution of mean 0 and standard deviation \frac{1}{\sqrt{2 \pi}}.

Another useful property to take note of relates to solving maximum likelihood problems (where we are looking for the parameters that make some data set as likely as possible). We generally end up solving these problems by trying to maximize something related to the log of the probability distribution under consideration. If we use a normal distribution, this takes the unusually simple form

\log{ \frac{ e^{- \frac{(x-\mu)^2}{2 \sigma^2}} } {\sqrt{2 \pi \sigma^2}} } = - \frac{(x-\mu)^2}{2 \sigma^2} - \frac{1}{2} \log(2 \pi \sigma^2)

which is often nice enough to allow for solutions that can be calculated exactly by hand. In particular, the fact that this function is quadratic in x makes it especially convenient, which is one reason that the Gaussian is commonly chosen in statistical modeling. In fact, the incredibly popular ordinary least squares regression technique can be thought of as finding the most likely line (or plane, or hyperplane) to fit a dataset, under the assumption that the data was generated by a linear equation with additive gaussian noise.

Q: What’s the relationship between entropy in the information-theory sense and the thermodynamics sense?

Monday, January 18th, 2010

Physicist: The term “Entropy” shows up both in thermodynamics and information theory, so (since thermodynamics called dibs), I’ll call thermodynamic entropy “entropy”, and information theoretic entropy “information”.

I can’t think of a good way to demonstrate intuitively that entropy and information are essentially the same, so instead check out the similarities!  Essentially, they both answer the question “how hard is it to describe this thing?”.  In fact, unless you have a mess of time on your hands, just go with that.  For those of you with some time, a post that turned out to be longer than it should have been:

Entropy!) Back in the day a dude named Boltzmann found that heat and temperature didn’t effectively describe heat flow, and that a new variable was called for.  For example, all the air in a room could suddenly condense into a ball, which then bounces around with the same energy as the original air, and conservation of energy would still hold up.  The big problem with this scenario is not that it violates any fundamental laws, but that it’s unlikely (don’t bet against a thermodynamicist when they say something’s “unlikely”).  To deal with this Boltzmann defined entropy.  Following basic probability, the more ways that a macrostate (things like temperature, wind blowing, “big” stuff with lots of molecules) can happen the more likely it is.  The individual configurations (atom 1 is exactly here, atom 2 is over here, …) are called “microstates” and as you can imagine a single macrostate, like a bucket of room temperature water, is made up of a hell of a lot of microstates.

Now if a bucket of water has N microstates, then 2 buckets will have N2 microstates (1 die has 6 states, 2 dice have 36 states).  But that’s pretty tricky to deal with, and it doesn’t seem to be what nature is concerned with.  If one bucket has entropy E, you’d like two buckets to have entropy 2E.  Here’s what nature seems to like, and what Bolzmann settled on: E = k log(N), where E is entropy, N is the number of microstates, and k is a physical constant (k is the Boltzmann constant, but it hardly matters, it changes depending on the units used, and the base of the log).  In fact, Boltzmann was so excited about his equation and how well it works that he had it carved into his head stone (he used different letters, so it reads “S = k \cdot \log{(W)}“, but whatever).  The “log” turns the “squared” into “times 2″, which clears up that problem.  Also, the log can be in any base, since changing the base would just change k, and it doesn’t matter what k is (as long as everyone is consistent).

This formulation of entropy makes a lot of sense.  If something can only happen in one way, it will be unlikely and have zero entropy.  If it has many ways to happen, it will be fairly likely and have higher entropy.  Also, you can make very sensible statements with it.  For example: Water expands by a factor around 1000 when it boils, and it’s entropy increases 1000 fold.  That’s why it’s easy to boil water in a pot (it increases entropy), and it’s difficult to condense water in a pot (it decreases entropy).  You can also say that if the water is in the pot then the position of each molecule is fairly certain (it’s in the pot), so the entropy is low, and when the water is steam then the position is less certain (it’s around here somewhere), so the entropy is high.  As a quick aside, Boltzmann’s entropy assumes that all microstates have the same probability.  It turns out that’s not quite true, but you can show that the probability of seeing a microstate state with a different probability is effectively zero, so they may as well all have the same probability.

Information!) In 1948 a dude named Shannon (last name) was listening to a telegraph line and someone asked him “how much information is that?”.  Then information theory happened.  He wrote a paper worth reading, that can be understood by anyone who knows what “log” is and has some patience.

Say you want to find the combination of a combination lock.  If the lock has 2 digits, there are 100 (102) combinations, if it has 3 digits there are 1000 (103) combinations, and so on.  Although a 4 digit code has a hundred times as many combinations as a 2 digit code, it only takes twice as long to describe.  Information is the log of the number of combinations.  So I = \log_b{(N)} where I is the amount of information, N is the number of combinations, and b is the base.  Again, the base of the log can be anything, but in information theory the standard is base 2 (this gives you the amount of information in “bits”, which is what computers use).  Base 2 gives you bits, base e (the natural log) gives you “nats”, and base \pi gives you “slices”.  Not many people use nats, and nobody every uses slices (except in bad jokes), so from now on I’ll just talk about information in bits.

So, say you wanted to send a message and you wanted to hide it in your padlock combination.  If your padlock has 3 digits you can store I = log2(1000) = 9.97 bits of information.  10 bits requires 1024 combinations.  Another good way to describe information is “information is the minimal number of yes/no questions you have to ask (on average) to determine the state”.  So for example, if I think of a letter at random, you could ask “Is it A?  Is it B? …” and it would take 13 questions on average, but there’s a better method.  You can divide the alphabet in half, then again, and again until the letter is found.  So a good series of questions would be “Is is A to M?”, and if the answer is “yes” then “Is it A to G?”, and so on.  It should take log2(26) = 4.70 questions on average, so it should take 4.7 bits to describe each letter.

In thermodynamics every state is as likely to come up as any other.  In information theory, the different states (in this case the “states” are letters) can have different likelyhoods of showing up.  Right of the bat, you’ll notice that z’s and q’s occur rarely in written English (this post has only 4 “non-Bolzmann” z’s and 16 q’s), so you can estimate that the amount of information in an English letter should be closer to log2(24) = 4.58 bits.  Shannon figured out that if you have N “letters” and the probability of the first letter is P1, of the second letter is P2, and so on, then the information per digit is I = \sum_{i=1}^N P_i \log_2{\left(\frac{1}{P_i}\right)}.  If all the probabilities are the same, then this summation reduces to I = log2(N).

As weird as this definition looks, it does makes sense.  If you only have one letter to work with, then you’re not sending any information since you always know what the next letter will be (I = 1 log(1) +0log(0) + … + 0log(0) = 0).  By the same token, if you use all of the letters equally often, it will be the most difficult to predict what comes next (information per digit is maximized when the probability is equal, or spread out, between all the letters).  This is why compressed data looks random.  If your data isn’t random, then you could save room by just describing the pattern.  For example: “ABABABABABABABABABAB” could be written “10AB”.  There’s an entire science behind this, so rather than going into it here, you should really read the paper.

Overlap!) The bridge between information and entropy lies in how hard it is to describe a physical state or process.  The amount of information it takes to describe something is proportional to its entropy.  Once you have the equations (“I = log2(N)” and “E = k log(N)”) this is pretty obvious.  However, the way the word “entropy” as used in common speech is a little misleading.  For example, if you found a book that was just the letter “A” over and over, then you would say that it had low entropy because it’s so predictable, and that it has no information for the same reason. If you read something like Shakespeare on the other hand, you’ll notice that it’s more difficult to predict what will be written next.  So, somewhat intuitively, you’d say that Shakespeare has higher entropy, and you’d definitely say that Shakespeare has more information.

As a quick aside, you can extend this line of thinking empirically and you’ll find that you can actually determine if a sequence of symbols is random, or a language, etc.  It has been suggested that an entropy measurement could be applied to post modernist texts to see if they are in fact communicating anything at all (see “Sokal affair“).  This was recently used to demonstrate that the Indus Script is very likely to be a language, without actually determining what the script says.

In day to day life we only describe things with very low entropy.  If something has very high entropy, it would take a long time to describe it so we don’t bother.  That’s not a indictment of laziness, it’s just that most people have better things to do than count atoms.  For example: If your friend gets a new car they may describe it as “a red Ferrari 250 GT Spyder” (and congratulations).  The car has very little entropy, so that short description has all the information you need.  If you saw the car you’d know exactly what to expect.  Later it gets dented, so they would describe it as “a red Ferrari 250 GT Spyder with a dent in the hood”.

Bueller?

Easy to describe, and soon-to-be-difficult to describe.

As time goes on and the car’s entropy increases, and it takes more and more information to accurately describe the car.  Eventually the description would be “scrap metal”.  But “scrap metal” tells you almost nothing.  The entropy has gotten so high that it would take forever to effectively describe the ex-car, so nobody bothers to try.

By the by, I think this post has more information than any previous post.  Hence all the entropy.

Q: Why is e to the i pi equal to -1?

Thursday, October 29th, 2009

Physicist: This equation (e^{i \pi} = -1) was recently voted one of the most famous equations ever.  That isn’t part of the answer, it’s just interesting.

First, you’ll find (by plugging them into a graphing calculator and graphing) that:

1) Sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \frac{x^9}{9!} - \cdots

2) Cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \frac{x^8}{8!} - \cdots

3) e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \frac{x^5}{5!} + \cdots

Where N! = 1 \cdot 2 \cdot 3 \cdots N.

These are called “Taylor expansions” of “Sine”, “Cosine”, and “e to the x”.  If you were to continue the patterns above forever, then you would find that the equality is exact.  There is some very exciting math to back me up on this, but for now just trust.

Know that i^2 = - 1.  This is how you define i.  Therefore, i^1 = i, i^2 = -1, i^3 = -i, i^4 = 1, i^5 = i , \ldots

Check it!:

e^{i x} = 1 + (ix) + \frac{(ix)^2}{2!} + \frac{(ix)^3}{3!} + \frac{(ix)^4}{4!} + \frac{(ix)^5}{5!} + \frac{(ix)^6}{6!} + \frac{(ix)^7}{7!} + \cdots = 1 + i x + \frac{i^2 x^2}{2!} + \frac{i^3 x^3}{3!} + \frac{i^4 x^4}{4!} + \frac{i^5 x^5}{5!} + \frac{i^6 x^6}{6!} + \frac{i^7 x^7}{7!} + \cdots = 1 + i x - \frac{x^2}{2!} - i \frac{x^3}{3!} + \frac{x^4}{4!} + i \frac{x^5}{5!} - \frac{x^6}{6!} - i \frac{x^7}{7!} + \cdots

and grouping terms that have “i” in them:

= ( 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots) + i ( x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots ) = Cos(x) + i Sin(x)

Holy crap! e^{ix} = Cos(x) + i Sin(x)!

This is the “Euler Equation”.  Or one of them at least.  Just plug in “x = \pi“.

e^{i \pi} = Cos(\pi) + i Sin(\pi) = (-1) + i(0) = -1.

So the trick is Euler’s equation, which is (surprisingly) true.

Q: How can we prove that 2+2 always equals 4?

Wednesday, October 21st, 2009

Physicist: In this case there’s no proof. With the exception of 0 and 1, all numbers are defined in terms of simpler numbers. “4″ is Defined as “1+1+1+1″. And “2″is Defined as “1+1″.