Name for a particular SQL pattern? — March 31, 2017

A colleague just came to me to ask if I knew how to make a particular thing happen in SQL; turns out I’d done what he wanted before, so it jogged a memory and I could solve it. But I don’t know what the term of art is for this pattern. Basically, it’s a missing-values problem: you have a large number of observations from, let’s say, every hour of the day; they’re stored in the form of (day, hour, value) triples. You expect one value for every hour of every day. When the value should be there but isn’t, fill it in with some default.

For this I use a cross product, like so:

create view expected as
select distinct day, hour
from
(
    select distinct day from my_table
) a,
(
    select distinct hour from my_table
) b;

The values we actually do have would be retrieved with “select * from my_table”. Now, to get one row for every expected (day, hour) pair, with a default of 0 filled in wherever the value is missing, we do

select a.day day, a.hour hour, nvl(b.value, 0) value
from expected a left outer join my_table b
    on a.day = b.day
    and a.hour = b.hour

where nvl(a, b) is Oracle’s function that returns b if a is null, a otherwise; the standard equivalent, which PostgreSQL uses, is coalesce().

This solution works okay, so long as the cross product of days and hours isn’t too large. In the specific case of days and hours, it probably won’t be. But in the more generic case where you’re left-outer-joining a cross product to a set of actual data, there’s no guarantee that the cross product will be small.
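For concreteness, here’s the whole pattern as a runnable sketch in Python’s sqlite3 — the table contents are invented, and SQLite spells nvl() as coalesce():

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table my_table (day text, hour integer, value integer);
insert into my_table values
    ('mon', 0, 5), ('mon', 1, 7),
    ('tue', 0, 2);  -- no value for ('tue', 1)
create view expected as
select a.day, b.hour
from (select distinct day  from my_table) a,
     (select distinct hour from my_table) b;
""")
rows = conn.execute("""
select a.day, a.hour, coalesce(b.value, 0) as value
from expected a left outer join my_table b
    on a.day = b.day and a.hour = b.hour
order by a.day, a.hour
""").fetchall()
print(rows)  # the missing ('tue', 1) comes back with the default 0
```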

So is there a SQL term of art for this? Is there a more elegant solution?

SQL WITH-statement versus CREATE VIEW: have I been doing it all wrong? — February 9, 2016

Today I learned about the SQL WITH-statement (the “common table expression,” as it’s commonly known). I’d been using CREATE VIEW whenever I needed to join a complex query to itself; have I been doing it all wrong all these years?

If nothing else, WITH is a lot more concise than what I had been doing, which was CREATE OR REPLACE TEMPORARY VIEW. Are there arguments for the one versus the other, apart from concision?

P.S.: I’m told — haven’t confirmed — that if you join a table A to itself using the WITH-statement, then A will be materialized just once, whereas it will be materialized twice under CREATE VIEW.
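For anyone who hasn’t seen the syntax, here’s a minimal sketch in Python’s sqlite3 (the sales table is invented): the named subquery is written once and both sides of the join refer to it by name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table sales (region text, amount integer);
insert into sales values ('east', 10), ('east', 20), ('west', 5);
""")
# 'totals' is defined once in the WITH clause, then joined to itself.
rows = conn.execute("""
with totals as (
    select region, sum(amount) as total
    from sales
    group by region
)
select a.region, a.total, b.region, b.total
from totals a join totals b on a.region < b.region
""").fetchall()
print(rows)
```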

A SQL question! — July 11, 2012

Suppose you have

SELECT a+b AS some_sum, a-b AS some_diff
FROM some_table

It would be syntactically invalid, as far as I understand it, to have

SELECT a+b AS some_sum, a-b AS some_diff,
some_sum + some_diff AS something_else
FROM some_table

, because `some_sum` and `some_diff` aren’t names that yet exist in the scope of that `SELECT` statement. To use `some_sum` and `some_diff`, you need an outer scope that has access to all the names created in the inner scope — so you need either

SELECT some_sum + some_diff AS something_else
FROM
(
SELECT a+b AS some_sum, a-b AS some_diff
FROM some_table
)

or, if you don’t need to attach a name to `some_sum + some_diff`, some engines (MySQL, for one) will let you just do

SELECT a+b AS some_sum, a-b AS some_diff
FROM some_table
HAVING some_sum + some_diff >= 0

(or whatever it is that you might want to do with `some_sum + some_diff`)
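Of the two, the nested-scope version is the more portable one (whether aliases are visible in HAVING varies by engine). A quick check with Python’s sqlite3, with invented table contents:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table some_table (a integer, b integer)")
conn.executemany("insert into some_table values (?, ?)", [(3, 1), (10, 4)])
# The inner scope names some_sum and some_diff; the outer scope may use them.
rows = conn.execute("""
select some_sum + some_diff as something_else
from (
    select a+b as some_sum, a-b as some_diff
    from some_table
)
""").fetchall()
print(rows)  # (a+b) + (a-b) = 2*a for each row
```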

So … are there some nightmare scenarios that would result from allowing the names ‘`some_sum`’ and ‘`some_diff`’ to be visible in the inner scope? Does anyone know the SQL spec well enough to understand the problem?

__P.S.__: Oh, I think I can answer the question a moment after asking it: the problem is that you can create a name in the inner scope that’s the same as the name of a field in the base table. That is, you can do something like

SELECT 2*a AS a, 3*a AS triple_a
FROM some_table

, and that `3*a` expression is ambiguous: are you referring to triple the *derived* `a`, or triple the *base* `a`?
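For what it’s worth, SQLite resolves the ambiguity by fiat: select-list aliases simply aren’t visible to sibling expressions, so the bare `a` in `3*a` always means the base column. A quick check, with an invented table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table some_table (a integer)")
conn.executemany("insert into some_table values (?)", [(1,), (2,)])
# The alias 'a' does not shadow the base column within the select list:
# 3*a is triple the *base* a, not triple the derived 2*a.
rows = conn.execute("select 2*a as a, 3*a as triple_a from some_table").fetchall()
print(rows)
```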

Even still, I can imagine allowing one scope to refer to a name created within that scope, but then making it a syntax error to introduce a name in a given scope that’s the same as a name in a base table. The way to resolve the ambiguity in this case would be to do

SELECT 2*a AS double_a, 3*(double_a) AS sextuple_a
FROM some_table

I wonder if a Sufficiently Smart Compiler (“a classic argument often pulled out in a LanguagePissingMatch”) could find all ambiguous uses of names, or whether there’s some logical reason why it just couldn’t.

Then again, forbidding the creation of a name in a given scope that is the same as a name in an inner scope or base table would make a lot of common usage impossible. I routinely do

SELECT one_thing, sum(something) something_else
FROM some_table
GROUP BY one_thing

I.e., I’m reusing `something_else` in this scope, even though it also exists in the base table. If I had to constantly use new names for derived fields, just because the derived fields might be referred to in the outer scope, I think I’d find that annoying.

Then, of course, there’s the ever-available argument that it’s just simpler to implement a compiler which imposes certain mild aesthetic constraints on its users.

Tools for making the impossible possible — January 29, 2012

I’ve been thinking a lot lately about tools that help make the difficult easy, which has got me thinking again about probably my favorite quote of all time, by A.N. Whitehead:

> It is a profoundly erroneous truism, repeated by all copy-books and by eminent people when they are making speeches, that we should cultivate the habit of thinking of what we are doing. The precise opposite is the case. Civilisation advances by extending the number of operations we can perform without thinking about them.

I think about this at work all the time, because our use of SQL makes possible a lot of data-gathering and -analysis tasks which really would have been impossible without it. Answering some question about many thousands of servers (100,770 as of the end of September, 2011) would otherwise be unimaginable. Certainly getting many *quick* answers to many *quick* queries would be absolutely unimaginable. Without some tool that allows quick aggregation across many different datasets, we’d have to resort to home-brewed scripts that, say, ssh to thousands of machines in parallel and ask them questions. Or, more likely, we’d have to reinvent Query.

There are two aspects to SQL that I think about constantly at work: first this trick of turning an impossible problem into a triviality, and second the sense of *playfulness* that it enables. It takes only a tiny bit more effort to turn from the question you were trying to answer into something unexpected, something more general, or something more nuanced. Often answering questions that you didn’t know you had involves finding a table you didn’t know the company published, which in turn involves asking around to see who would know best about a given kind of data. Finding answers to questions you didn’t know you had seems to me part and parcel of what SQL is all about.

The term “generative technology” gets at this. I would link to the first Google search result for this, except that I don’t really like how Jonathan Zittrain — who is, in fairness, most associated with this term — runs with it. iPhones versus non-iPhones isn’t really at all related to what I have in mind here, and I don’t think the definition he has there gets at what even he means by it. The term “generative” comes ultimately from generative grammar, which my non-linguistically-trained self understands to mean “a set of simple rules for the formation of sentences, which rules can be combined in infinitely many ways to construct infinitely many distinct sentences.” In mathematics, think of axioms and rules for their combination: there aren’t that many axioms defining the integers, but they can be combined with only a few more rules (about ordered pairs and what, exactly, multiplication of two ordered pairs means) to build rational numbers, and thence real numbers, and thence complex numbers. The simple axioms, and simple rules for their combination, lead to infinitely complex objects.

(Because I cannot resist a filthy quote when given the opportunity, it’s here that I’ll quote Stephen King’s advice on writing: ‘When asked, “How do you write?” I invariably answer, “One word at a time,” and the answer is invariably dismissed. But that is all it is. It sounds too simple to be true, but consider the Great Wall of China, if you will: one stone at a time, man. That’s all. One stone at a time. But I’ve read you can see that motherfucker from space without a telescope.’)

And so it is with SQL and other generative technologies. They don’t give you a single product that you use in the 10 or 20 or 100 ways that you’ve been told to use it; in this sense, I view Facebook as non-generative. A generative technology, like Unix or SQL, might have a steep learning curve, but once you’ve learned it you can do infinitely many things.

There are lots of complexities once you’ve learned the atoms, and even once you’ve learned how to combine the atoms. In Unix, for instance, your first task is learning to string together programs with pipes. Once you’ve done that, you’ll soon enough be writing your own programs, but you have to write them in the Unix Way, which often involves allowing them to sit with a pipe on their left side and a pipe on their right; in this way, they themselves become part of the generative toolkit. Again, invoking Whitehead, the point is to make complicated action reflexive and doable without thinking. Take a common Unix pattern:

[some commands] | sort | uniq -c | sort -nr

This takes the output of [some commands] — assumed to contain one interesting pattern per line — and displays it in descending order of frequency, with the frequency in the left column and the pattern on the right. This isn’t many characters, so typing it out becomes second nature; the smart thing to do, though, would be to put this Unix fragment in its own script, which we might call sort_by_pop (or really ‘popsort’, which would save you some keystrokes: there’s already a command that starts with ‘sort’, but no commands that start with ‘pops’, so ‘popsort’ would be easier to get from tab-completion; Unix people think this way):

(19:44 -0500) slaniel@example.com:~$ cat sort_by_pop
#!/bin/bash
sort |uniq -c |sort -nr

Now you can just pipe things through sort_by_pop if you want to sort them by popularity:

(19:44 -0500) slaniel@example.com:~$ grep -oE '^[^ ]+' access.log |sort_by_pop |head
607 46.165.197.141
520 173.242.125.206
309 199.21.99.67
229 130.195.253.1
169 66.249.71.18
162 72.14.199.102
83 72.246.0.10
37 142.167.21.94
34 198.228.223.217
33 80.58.205.47

Hm, what’s that grep(1) bit? Looks like that bit of script could be usefully abstracted into something called ‘get_ip’:

(19:48 -0500) slaniel@example.com:~$ cat get_ip
#!/bin/bash
grep -oE '^[^ ]+'

whence we simplify to “cat access.log | get_ip | sort_by_pop”. Now you don’t need to understand the nuances of how sort(1) and uniq(1) work if you don’t want to; in fact, you may never need to know that those atomic tools are sitting underneath your molecular abstractions. If you trust the person who wrote the tools, you can assume that get_ip gets an IP address from a suitably formatted Apache access log, and that sort_by_pop sorts a file containing one pattern per line in descending order of popularity.
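The whole molecule can be sanity-checked end to end on a couple of fabricated log lines (the IPs here are made up), with get_ip and sort_by_pop inlined:

```shell
#!/bin/sh
# Three fake Apache-style lines: 1.2.3.4 appears twice, 5.6.7.8 once.
printf '1.2.3.4 GET /\n5.6.7.8 GET /\n1.2.3.4 GET /index\n' \
    | grep -oE '^[^ ]+' \
    | sort | uniq -c | sort -nr
```

The most frequent IP comes out on top, count on the left, just as promised.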

And so forth. The idea is to constantly combine the atoms of your knowledge into larger and larger molecules, which allows you to forget about the individual atoms unless you really need them. (Where you often need to remember the atoms is for performance reasons.)

In SQL, one way of combining atoms into higher-order molecules is by means of views. A view is a new table (“relation” for the relational-calculus purists in the room) constructed from lower-order “base tables”. There may be some very clever way to get by without views, but I don’t know what it might be. Often you’ll end up with a query that requires you to join one complicated sub-query to itself; without views, you’d be repeating the sub-query, which would probably involve copying and pasting a bunch of text. This would make editing one of the sub-queries a hassle, because you’d have to repeat your edits once for every sub-query. With views, you create the view once, give it some shorthand name, then use the shorthand on every subsequent reference. Any edit only has to happen once, in the view. Again, the point is to make higher-order thought effortless.

(Java, by contrast, requires so much boilerplate that it gets in the way of quickly scanning a piece of code and understanding what it’s trying to do. Either that, or it requires the developer to carefully shunt his boilerplate off into a little boilerplate area of his code. Or it requires the code reader to develop a finely honed skill of skipping over boilerplate. One organizing principle for writing code of any sort ought to be that it puts the least possible distance between the task you’re envisioning and the code you write for it.)

Having developed such a love for SQL, and having long ago learned how to build high-order castles in Unix, I’m now on the hunt for other generative technologies that will make difficult tasks possible. My goal for 2012 is to discover such a set of technologies for time series. It’s not just a matter of writing formulas that allow me to manipulate time series in any way I see fit, though that’s hard enough (it will probably involve R, and may also involve [book: Data Analysis with Open Source Tools], recommended in the highest terms by my awesome friend Dan Milstein). And it’s not just a matter of manipulating them in a way that makes exploring them, combining them, and being surprised possible, though that’s part and parcel of the generative idea.

Rather, the difficulty with making these things work right starts, it seems to me, way down in the guts. Akamai’s Query system is brilliant — one of the most brilliant technologies I’ve ever seen at a company, central to everything I do at every minute of every day — and works so well because there’s a lot of stuff going on under the hood which, again, I mostly don’t need to think about. The low levels do break, just as they do in any software system (all abstractions are leaky); and when they break, I’m forcibly reminded that all my simplifying abstractions rest very tentatively on a lot of lower-level foundations. Without someone doing a lot of low-level grunt work, Whitehead’s dictum doesn’t hold. (Perhaps the grandest abstractions of all in the modern world are “the market economy” and “industrial democracy” — abstractions that we forget are based on very concrete things like cheap fossil fuels or policemen who will enforce contracts at the point of a gun.) In the case of SQL, someone has to build a backend data-storage method that allows quick lookups. In the case of time series, what will the backend storage system look like? Will we need something like MapReduce? Do we need a different high-level language to concisely encapsulate high-level time-series concepts like “the trend component” or “the spectrum”?

Here is the place to note a lesson that I find I have to repeat to myself over and over: don’t think any harder than you need to. My interest in time series is very non-abstract; I have some specific questions I want to answer about some specific datasets at work. And yes, I want to make sure that I can combine them in new and interesting ways in a reasonable amount of time. But until I’ve asked a single specific question of a single specific dataset, I shouldn’t think too hard about making an apple pie from scratch.

So anyway, there’s a general point in here, and a specific one. The general point is to hunt for abstractions that make it possible to get a lot done without thinking, and make it possible to explore areas you didn’t even know you *could* explore. The specific point is that in 2012, I want to see what I can do with Akamai’s time-series data. I imagine one of these points will be interesting to you, the other less so.