A friend asked a probability question today (viz., if you roll six dice, what’s the probability that at least one of them comes up a 1 or a 5?), so I answered it analytically and then wrote a quick Python simulation to check that answer. That’s all fine, but what annoys me is how serial my code is. It’s serial for a couple of reasons:

  1. The Python GIL.
  2. Even if the GIL magically disappeared tomorrow, I’ve got a for-loop in there that necessarily runs serially. I’m running 10 million serial iterations of the “roll six dice” experiment (see the sketch below); if I could use all the cores on my quad-core MacBook Pro, this code would run four times as fast, or, better yet, I could run four times as many trials in the same amount of time. More trials is more better.
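
To make that concrete, here’s a stripped-down sketch of the kind of simulation I mean (the structure is illustrative rather than my exact script):

import random

TRIALS = 10 * 1000 * 1000

# Analytical answer: P(no 1s or 5s among six dice) = (4/6)**6,
# so P(at least one 1 or 5) = 1 - (4/6)**6, about 0.912.
analytical = 1 - (4.0 / 6.0) ** 6

# Serial Monte Carlo: this for-loop is the part that only ever uses one core.
hits = 0
for _ in range(TRIALS):
    rolls = [random.randint(1, 6) for _ in range(6)]
    if any(r in (1, 5) for r in rolls):
        hits += 1

print("analytical:", analytical)
print("simulated:", hits / float(TRIALS))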

Most of the code uses list comprehensions as $DEITY intended, and I always imagine that a list comprehension is a poor man’s SQL — i.e., it’s Python’s way of having you explain what you want rather than how you want to get it. If the GIL disappeared, I like to think that the Python SufficientlySmartCompiler would turn all those list comprehensions parallel.

Last I knew, the state of the art in making Python actually use all your cores was to spawn separate processes using the multiprocessing library. Is that still the hotness?
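
For what it’s worth, here’s roughly what the Pool version of the dice example looks like (a sketch, with an assumed quad-core worker count, and exactly the kind of boilerplate I’d rather not write by hand):

import random
from multiprocessing import Pool

def run_trials(n):
    # Reseed per process: forked workers would otherwise inherit the
    # parent's RNG state and produce identical rolls.
    random.seed()
    hits = 0
    for _ in range(n):
        if any(random.randint(1, 6) in (1, 5) for _ in range(6)):
            hits += 1
    return hits

if __name__ == '__main__':
    trials = 10 * 1000 * 1000
    workers = 4  # one chunk of trials per core on a quad-core machine
    pool = Pool(workers)
    counts = pool.map(run_trials, [trials // workers] * workers)
    pool.close()
    pool.join()
    print(sum(counts) / float(trials))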

I want parallelism built in at the language level, à la list comprehensions, so that I don’t need to fuss with multiprocessing. “This needs to spawn off a separate process, because of the GIL” is one of the implementation details I’m looking to ignore when I write list comprehensions. I’d have no problem writing some backend multiprocessing code, if it gets buried so far down that I don’t need to think about the backend details in everyday coding, but what I really want is to bake in parallel idioms from the ground up.
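
To be concrete about what “buried so far down” means to me, even a tiny wrapper would do; something like this hypothetical pmap (the name and interface are mine, purely for illustration), so that everyday code can stay comprehension-shaped:

from multiprocessing import Pool

def pmap(func, iterable, processes=None):
    # A bare-bones parallel map that hides the Pool plumbing behind one call.
    # processes=None lets multiprocessing default to the machine's core count.
    pool = Pool(processes)
    try:
        return pool.map(func, list(iterable))
    finally:
        pool.close()
        pool.join()

With that, the top-level code is just counts = pmap(run_trials, [trials // 4] * 4), and the process-spawning detail lives out of sight.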

Thinking about what you want rather than how you want to obtain it is why I love SQL, and it’s why LINQ seems like a really good idea (though I’ve never used it). But even the versions of SQL that I work with require a bit more fussing with implementation details than I’d like. For instance, inner joins are expensive, so we only join two tables at a time. So if I want to join tables A, B, and C, I need to create two sub-tables: one that joins A and B, and another that joins that A-and-B result with C. And for whatever reason, the SQL variants I use require me to be explicit about all the pairwise join conditions; i.e., I need to do

select A.foo foo
from A, B, C
where A.foo = B.foo
    and B.foo = C.bar
    and A.foo = C.bar

even though that final and-condition follows logically from the first two. And I can’t just do “select foo” here, or SQL would complain that ‘“foo” is ambiguous’. But if A.foo and B.foo are equal, as the where clause asserts, then it doesn’t matter whether my mentioning “foo” means “A.foo” or “B.foo”.

The extent of my knowledge of declarative programming is basically everything I wrote above. I don’t even know if “declarative programming” captures all and only the things I’m interested in. I want optimization with limited effort on my part (e.g., SQL turns my query into a query plan that it then turns into an optimized set of instructions). I also want minimal overhead — a minimal gap between what I’m thinking and what I have to type as code; that’s why I hate Java. Granted, if adding overhead in the form of compiler hints will lead to optimizations, then great; I’d hardly even call that “overhead.”

At a practical level, I’d like to know how to implement all of this in Python — or, hell, in bash or Perl, if it’s easy enough.