2009-04-24

Don't Drink and Derive

After coming back from Ken and Laura's place, fully titted- and F-word'd-up from copious quantities of TiVo'd premium channel goodness (The Tudors, Louis CK, and Eastbound & Down), I came home with a dilemma. Two, in fact.

They're both relatively easy, but I put them here for posterity's sake and in the hope that I won't have to figure them out by myself again.

One: PowerShell simply adores Unicode. I do not. To convert a bunch of UTF-16 text files into standard US-ASCII, you can do this:

foreach ($file in Get-ChildItem -Filter *.txt) {
  Get-Content $file | Out-File -Encoding "ASCII" temp.txt
  Move-Item -Force temp.txt $file
}

Easy peasy. It would have been smarter still to make use of Out-File -Encoding "ASCII" when I was going through and concurrently pulling all the data in the first place. As an aside, I've grown almost fond of using the WScript.Run() call to get cheap and easy concurrency in my PowerShell activities. I'm told that PowerShell v.2.0 has built-in concurrency. I have yet to see an example of this, so I will reserve my judgment. If, like me, you have to admin a bunch of boxes that will have PowerShell v.1.0 stuck on them for a very long time, this point is moot.

Second problem, and fortunately one that is made irrelevant by Perl's Math::BigInt and Math::BigFloat modules. Calculating incremental arithmetic mean. I'm amazed at how sharp my math skills are at this point in my life. They're still much worse than a tenth grader's, no question there. It's just that they're not as bad as I thought they were.

We all know that in order to calculate the arithmetic mean, a.k.a. the "average" of a set of numbers, you add the numbers together and divide by the total quantity of numbers in the set. That looks a little something like this:

$count = 0;
$average = 0;

foreach my $n (@numbers) {
  $count++;
  $average += $n;
}
$average /= $count;
print $average, "\n";

This is fine for math professors and grade students, but computer dorks like me have to worry about "integer overflow", which is a hoity-toity way of saying that we tried to put ten pounds of stupid in a five pound bag. Calculating your arithmetic mean incrementally doesn't always save the day, but it can be pretty useful if you're handling huge numbers and lots of 'em. (In my case, two sets of about 8700.)

Still, this is a good algorithm for anyone who has to throw together some spontaneous charts and graphs and who doesn't always have the luxury of an arbitrary precision library on hand in the scripting language of his choice. I ultimately ended up piecing this one together with a pen and paper:

We know that an arithmetic mean, "M", is the sum of all numbers divided by the quantity of all numbers:

M_n = Sum(1,n)[x] / n (1)

So the new mean of the previously-calculated set "M_n" plus a new value, "x_n+1", is equal to "x_n+1" plus the sum of all the previous numbers divided by n+1. This isn't handwaving, it's just another way of illustrating the traditional add-everything-then-divide method:

M_n+1 = ( x_n+1 + Sum(1,n)[x] ) / n+1 (2)

We can substitute "Sum(1,n)[x]" with "M_n", so long as we undo an earlier step and re-multiply it by "n". Line (1) lets us do this.

M_n+1 = ( x_n+1 + n( M_n ) ) / n+1 (3)

Here's (what I think is) the cool part. It's a simple trick of multiplication, and the kind of thing that always blew my mind when my pre-calculus teacher pulled it out in class. (Thanks, Marilyn G.!) You can substitute "M_n" with a different expression that happens to be divisible by "n+1". To put it in simpler terms, "5x is equal to 6x minus x":

M_n+1 = ( x_n+1 + ( n+1 )( M_n ) - M_n ) / n+1 (4)

This is crucial, since it means we can factor out "M_n" and generate "M_n+1" purely as an operation performed against an earlier-calculated "M_n" without having to get inside "M_n" and undo any earlier steps in its creation.

M_n+1 = M_n + ( ( x_n+1 - M_n ) / n+1 ) (5)

I'm never happy with a proof until I've pushed literals through it, so let's average five, four, and three. The average of any singleton set is the value of the number in that set.

M_1 = 5

My math tells me that the average of four and five should be 9/2, or 4.5:

M_2 = 5 + ( ( 4 - 5 ) / 2 ) = 5 - 0.5 = 4.5

Nine plus three is twelve. Divided by three should be four, right?

M_3 = 4.5 = ( ( 3 - 4.5 ) / 3 ) = 4.5 + ( -1.5 / 3) = 4.5 - 0.5 = 4

This post is dedicated to the hard-working men and women of Square One TV. Next time: Karatsuba multiplication, or "How I Learned to Stop Worrying and Multiply Polynomials Faster Using an Add Operation Instead of a Mult".

No comments: