Miscellaneous Mac-appropriate locate(1) script of the day — June 9, 2017

Miscellaneous Mac-appropriate locate(1) script of the day

If you’re on a Unix variant of most any sort, this command will find you every file whose name contains foo anywhere on your machine:

find / -name '*foo*'`

It will also run impossibly slowly, because it’s scanning all of the (maybe millions of) files on your machine. A better option is to periodically generate a database of all files, then do something akin to

select distinct filename
from my_filesystem
where name like '%foo%'

As luck would have it, most Linux distributions that I’m familiar with come with a tool called locate(1) that does this for you, which allows you to write

locate foo

and get what you want. It requires periodically running a tool called updatedb(8) to refresh the database. A quick check of a Linux box I have on hand says that updatedb(8) runs once a day. That’s probably fine, but works less well if you’re changing a lot of filenames; then locate(1) will be behind the times. Which is sad.

Macs have it quite a lot better. They are constantly running a tool called Spotlight, which indexes everything all the time. As I understand it, the Spotlight DB is updated whenever you modify anything about a file, including its contents or its name. There’s a vast Spotlight architecture lurking underneath everything you do on a Mac. At the command line, the mdfind(1) tool lets you search for files whose Spotlight metadata matches any number of criteria, and mdls(1) lists the metadata for a specified file. I’m a Spotlight-metadata novice at this point, but here are some good resources:

  1. This is a good set of tips for accessing the full variety of metadata available to you through the md* tools.
  2. The documentation on Uniform Type Identifiers (which are one part of the Spotlight metadata).

I wrote a tiny little script whose job is to invoke mdfind(1) when you’re on a Mac, and invoke locate(1) on any other Unix. This is handy to me, since I copy my ~/bin directory to every Unix machine I use, and would very much like to use the same locate command on every machine. (You should see my ~/.bashrc, he said sexily.) Here it is:

(15:32 -0400) slaniel@laptop:~$ cat ~/bin/locate 
#!/bin/bash
pattern=$1
os_ver=`uname`
if [ ${os_ver} == "Darwin" ]; then
    mdfind "kMDItemFSName=\"*${pattern}*\"c"
else
    /usr/bin/locate -i $@
fi

That mdfind line initially said

mdfind "kMDItemContentType=\"public*\" && kMDItemFSName=\"*${pattern}*\"c"

until I realized that not every file I cared about was in the public domain: public would only catch files of some publicly known type (e.g., QuickTime videos, GIFs, etc.). For instance, here’s what mdls(1) says about a .DOCX file:

kMDItemContentType             = "org.openxmlformats.wordprocessingml.document"
kMDItemContentTypeTree         = (
    "org.openxmlformats.wordprocessingml.document",
    "org.openxmlformats.openxml",
    "public.zip-archive",
    "com.pkware.zip-archive",
    "public.data",
    "public.item",
    "public.archive",
    "public.composite-content",
    "public.content"
)
[…]
kMDItemKind                    = "Microsoft Word document (.docx)"

That last item (kMDItemKind) is interesting. It leads to a quick bit of command-line hackery to find every Microsoft-format file in your filesystem:

(15:53 -0400) slaniel@AKAMAI_laptop:~$ mdfind "kMDItemKind == '*Microsoft*'" | tr '\n' '\0' | xargs -0 mdls | grep kMDItemKind | grep -o '[^"]\+"$' |sed 's#"$##' | sort | uniq -c | sort -nr
    542 Microsoft Word 97 - 2004 document (.doc)
    534 Microsoft Excel Workbook (.xlsx)
    307 Microsoft Word document (.docx)
    145 Microsoft Word 97 - 2004 document
    145 Microsoft Excel 97-2004 Workbook (.xls)
     33 Microsoft Excel template
     29 Microsoft PowerPoint presentation
     28 Microsoft Excel workbook
     28 Microsoft Excel Macro-Enabled Workbook (.xlsm)
     28 Microsoft Excel 97-2004 workbook
     27 Microsoft Word document
     20 Microsoft PowerPoint 97-2004 presentation
     15 Microsoft personal dictionary
      9 Microsoft Word 97 - 2004 template
      8 Microsoft Outlook document
      8 Microsoft Graph preferences
      3 Microsoft Word Macro-Enabled template (.dotm)
      1 Microsoft Word MHTML document (.mht)
      1 Microsoft Word 97 - 2004 template (.dot)
      1 Microsoft PowerPoint toolbar
      1 Microsoft PowerPoint 97 - 2004 Template
      1 Microsoft Outlook signatures
      1 Microsoft Office Theme

There’s a whole world of Spotlight metadata here, of which I’m only skimming the surface. Suffice to say that it’s an extraordinarily powerful tool to have in your kit.

Name for a particular SQL pattern? — March 31, 2017

Name for a particular SQL pattern?

A colleague just came to me to ask if I knew how to make a particular thing happen in SQL; turns out I’d done what he wanted before, so it jogged a memory and I could solve it. But I don’t know what the term of art is for this pattern. Basically, it’s a missing-values problem: you have a large number of observations from, let’s say, every hour of the day; they’re stored in the form of (day, hour, value) triples. You expect one value for every hour of every day. When the value should be there but isn’t, fill it in with some default.

For this I use a cross product, like so:

create view expected as
select distinct day, hour
from
(
    select distinct day from my_table
) a,
(
    select distinct hour from my_table
) b;

The values we actually do have would be retrieved with “select * from my_table”. Now if we want to find (day, hour) pairs for which we don’t have values, we do

select a.day day, a.hour hour, nvl(b.value, 0) value
from expected a left outer join my_table b
    on a.day = b.day
    and a.hour = b.hour

where nvl(a, b) is a function that returns b if a is null, a otherwise; the PostgreSQL equivalent is coalesce().

This solution works okay, so long as the cross product of days and hours isn’t too large. In the specific case of days and hours, it probably won’t be. But in the more generic case where you’d left-outer-joining a cross product to a set of actual data, there’s no guarantee that the cross product will be small.

So is there a SQL term of art for this? Is there a more elegant solution?

SQL WITH-statement versus CREATE VIEW: have I been doing it all wrong? — February 9, 2016

SQL WITH-statement versus CREATE VIEW: have I been doing it all wrong?

Today I learned about the SQL WITH-statement. I’d been using CREATE VIEW whenever I needed to join a complex query to itself; have I been doing it all wrong all these years?

If nothing else, WITH is a lot more concise than what I had been doing, which was CREATE OR REPLACE TEMPORARY VIEW. Are there arguments for the one versus the other, apart from concision?

P.S.: I’m told — haven’t confirmed — that if you join a table A to itself using the WITH-statement, then A will be materialized just once, whereas it will be materialized twice under CREATE VIEW.

Querying Oracle databases from cx_Oracle under El Capitan — January 31, 2016

Querying Oracle databases from cx_Oracle under El Capitan

The following is my attempt to document a remarkably painful process that got even more remarkably painful under El Capitan than it was under earlier versions of Mac OS. It concerns querying an Oracle database from within a Python script via the cx_Oracle library. If you don’t query Oracle databases from Python scripts via the cx_Oracle library, you are hereby advised to do literally anything else.

The general process of the steps below is:

  1. Get Oracle Instant Client (aka sqlplus) running. This does a couple things. First, it lets you confirm that you can run Oracle queries at all, outside of Python. Second, it installs some libraries and executables that cx_Oracle uses.
  2. Muck around in the libraries that sqlplus installed, to change internal references from /ade/... to @loader_path/....
  3. Install cx_Oracle.
  4. Muck around in the cx_Oracle shared-object (.so) files to change their internal references from relative paths to absolute ones.

Here goes, start to finish:

  1. Download Oracle Instant Client. You want to download “Instant Client Package – Basic”, “Instant Client Package – JDBC Supplement”, “Instant Client Package – SQL*Plus”, “Instant Client Package – SDK”, and “Instant Client Package – WRC: Workload Replay Client”, because why download one file when you can download five? Additional notes about downloading these files:
    1. You need to accept the license-agreement radio button at the top of the page first.
    2. You likely want the 64-bit files, unless your machine is ancient.
    3. You need to log in to download these files, which means you need to create an Oracle login first. Because why make downloading easy when you can make it annoying?
  2. Unzip all the files you just downloaded. For the purposes of this exercise, I assume you downloaded these files to ${HOME}/Downloads.
    mkdir oracle
    cp *.zip !$
    for i in *.zip; do unzip $i; done
  3. Make a system-level directory that will house sqlplus. Since El Cap, you can’t touch anything under /usr, so you’ll want to put it in /usr/local or /opt. For reasons that now elude me, I put it in /opt/oracle.
  4. Change the ownership on /opt/oracle so that you, rather than root, own all the files therein.
    cp -r ~/Downloads/oracle/instantclient_11_2 /opt/oracle
    sudo chown -R $(whoami): !$

  5. In pre-El Cap versions of OS X, you’d set the ${DYLD_LIBRARY_PATH} environment variable to point to where your libraries lived. In the sqlplus case, you’d add /directory/where/sqplus/lives to ${DYLD_LIBRARY_PATH} by adding a line like

    export DYLD_LIBRARY_PATH=${DYLD_LIBRARY_PATH}:/directory/where/sqplus/lives

    to your ~/.bashrc file.

    El Cap seems to have changed the treatment of DYLD_LIBRARY_PATH as part of System Integrity Protection. Maybe DYLD_LIBRARY_PATH stops working? That’s a problem, because all the pre-El Cap steps to configure cx_Oracle that you’ll find on the Internet involve setting that DYLD_LIBRARY_PATH to override the library paths inside the sqlplus executable. The library paths that sqlplus comes with are all absolute paths that seem to refer to Oracle’s own internal build system — like so:

    (16:40 -0500) slaniel@work_laptop:/opt/oracle/instantclient_11_2$ otool -L sqlplus
    sqlplus:
    /ade/dosulliv_sqlplus_mac/oracle/sqlplus/lib/libsqlplus.dylib (compatibility version 0.0.0, current version 0.0.0)
    /ade/b/2475221476/oracle/rdbms/lib/libclntsh.dylib.11.1 (compatibility version 0.0.0, current version 0.0.0)
    /ade/b/2475221476/oracle/ldap/lib/libnnz11.dylib (compatibility version 0.0.0, current version 0.0.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 159.1.0)

    All those need to change to use the path to your sqlplus install.

  6. So if the $DYLD_LIBRARY_PATH option is out, one option is to muck around in the binary and change the library paths from /ade/... to some sort of variable that means “whatever path the executable is in”. Various places on the web suggested using @executable_path as that variable, but — again — that seems to work only under pre-El Cap versions of OS X. Now you need to use @loader_path instead (for reasons that I’ve not bothered to investigate). The method is like so:

    (16:43 -0500) slaniel@work_laptop:/opt/oracle/instantclient_11_2$ sudo install_name_tool -change /ade/dosulliv_sqlplus_mac/oracle/sqlplus/lib/libsqlplus.dylib @loader_path/libsqlplus.dylib sqlplus
    (16:43 -0500) slaniel@work_laptop:/opt/oracle/instantclient_11_2$ otool -L !$
    otool -L sqlplus
    sqlplus:
    @loader_path/libsqlplus.dylib (compatibility version 0.0.0, current version 0.0.0)
    /ade/b/2475221476/oracle/rdbms/lib/libclntsh.dylib.11.1 (compatibility version 0.0.0, current version 0.0.0)
    /ade/b/2475221476/oracle/ldap/lib/libnnz11.dylib (compatibility version 0.0.0, current version 0.0.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 159.1.0)

    Note that the linked library is now @loader_path/.... Repeat this for all of the Oracle Instant Client files, and for all of the libraries linked within them. There’s a tool that does this. We end up, e.g., with

(16:45 -0500) slaniel@work_laptop:/opt/oracle/instantclient_11_2$ otool -L sqlplus
sqlplus:
    @loader_path/libsqlplus.dylib (compatibility version 0.0.0, current version 0.0.0)
    @loader_path/libclntsh.dylib.11.1 (compatibility version 0.0.0, current version 0.0.0)
    @loader_path/libnnz11.dylib (compatibility version 0.0.0, current version 0.0.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 159.1.0)

Okay, so we fix up all the libraries. On my end, this allowed me to get a sqlplus console and connect to our Oracle server. Nice.

Next up: make the Python cx_Oracle library work with it. First, you need to create a couple of symlinks:

(16:49 -0500) slaniel@work_laptop:/opt/oracle/instantclient_11_2$ sudo ln -s libclntsh.dylib{.11.1,}
(16:50 -0500) slaniel@work_laptop:/opt/oracle/instantclient_11_2$ sudo ln -s libocci.dylib{.11.1,}

The reason here is apparently that cx_Oracle is trying to link in those libraries without referencing their version numbers; if you try to install cx_Oracle without those symlinks, you’ll get make(1) puke that contains this telling line:

    ld: library not found for -lclntsh

So you create the symlinks, then you do your classic pip install cx_oracle. Various places on the web suggested, more fully,

pip install --no-cache-dir --allow-external --allow-unverified cx_oracle

, and by this point I was past the point of arguing. That command worked, though it did warn me that --allow-external was deprecated and would eventually stop working.

Sure it’s installed, but that doesn’t mean it’ll actually work. The cx_Oracle.so file now links to libclntsh.dylib.11.1 — a relative path with no /path/to/libraries/... prefix. You need to find the .egg file that Python actually loads when you do import cx_Oracle; in my case, that path is /Library/Python/2.7/site-packages/cx_Oracle-5.2.1-py2.7-macosx-10.11-intel.egg . That’s just a zip file, so

  1. copy it to some new empty directory under ${HOME}
  2. unzip it
  3. change the .so path from relative to absolute, via
    install_name_tool -change {,/opt/oracle/instantclient_11_2/}libclntsh.dylib.11.1 ./cx_Oracle.so
  4. zip the directory back up into the .egg file and copy it from your ${HOME} subdirectory back into the global library path:
sudo mv cx_Oracle-5.2.1-py2.7-macosx-10.11-intel.egg /Library/Python/2.7/site-packages/cx_Oracle-5.2.1-py2.7-macosx-10.11-intel.egg
  1. Does it work?
(17:19 -0500) slaniel@work_laptop:~$ python -c 'import cx_Oracle' && echo 'worked'
worked

Yes! It worked! (Contrast with what happens if you try ‘import cx_Oracl’ with no trailing ‘e’.)

And finally — finally — I can run queries against

More-automatic parallelism in Python — November 29, 2015

More-automatic parallelism in Python

A friend asked a probability question today (viz., if you roll six dice, what’s the probability that at least one of them comes up a 1 or a 5?), so I answered it analytically and then wrote a quick Python simulation to test my analytical answer. That’s all fine, but what annoys me is how serial my code is. It’s serial for a couple reasons:

  1. The Python GIL.
  2. Even if the GIL magically disappeared tomorrow, I’ve got a for-loop in there that’s going to necessarily run serially. I’m running 10 million serial iterations of the “roll six dice” experiment; if I could use all the cores on my quad-core MacBook Pro, this code would run four times as fast — or, better yet, I could run four times as many trials in the same amount of time. More trials is more better.

Most of the code uses list comprehensions as $DEITY intended, and I always imagine that a list comprehension is a poor man’s SQL — i.e., it’s Python’s way of having you explain what you want rather than how you want to get it. If the GIL disappeared, I like to think that the Python SufficientlySmartCompiler would turn all those list comprehensions parallel.

Last I knew, the state of the art in making Python actually use all your cores was to spawn separate processes using the multiprocessing library. Is that still the hotness?

I want parallelism built in at the language level, à la list comprehensions, so that I don’t need to fuss with multiprocessing. “This needs to spawn off a separate process, because of the GIL” is one of the implementation details I’m looking to ignore when I write list comprehensions. I’d have no problem writing some backend multiprocessing code, if it gets buried so far down that I don’t need to think about the backend details in everyday coding, but what I really want is to bake in parallel idioms from the ground up.

Thinking about what you want rather than how you want to obtain it is why I love SQL, and it’s why LINQ seems like a really good idea (though I’ve never used it). But even the versions of SQL that I work with require a bit more fussing with implementation details than I’d like. For instance, inner joins are expensive, so we only join two tables at a time. So if I know that I want tables A, B, and C, I need to create two sub-tables: one that joins A and B, and another that joins A-and-B with C. And for whatever reason, the SQL variants I use need me to be explicit about all the pairwise join conditions — i.e., I need to do

select A.foo foo
from A, B, C
where A.foo = B.foo
    and B.foo = C.bar
    and A.foo = C.bar

even though that final and-condition follows logically from the first two. And I can’t just do “select foo” here, or SQL would complain that ‘”foo” is ambiguous’. But if A.foo and B.foo are equivalent — as the SELECT statement says — then it doesn’t matter whether my mentioning “foo” means “A.foo” or “B.foo”.

The extent of my knowledge of declarative programming is basically everything I wrote above. I don’t even know if “declarative programming” captures all and only the things I’m interested in. I want optimization with limited effort on my part (e.g., SQL turns my query into a query plan that it then turns into an optimized set of instructions). I also want minimal overhead — a minimal gap between what I’m thinking and what I have to type as code; that’s why I hate Java. Granted, if adding overhead in the form of compiler hints will lead to optimizations, then great; I’d hardly even call that “overhead.”

At a practical level, I’d like to know how to implement all of this in Python — or, hell, in bash or Perl, if it’s easy enough.

Python SIGPIPE — September 25, 2015

Python SIGPIPE

I ran into the same Python problem as this fellow. Namely: he’s written a script that dumps lines to stdout, and then runs

my_script.py | head

and gets this:

Traceback (most recent call last):
  File "/home/slaniel/bin/my_script.py", line 25, in 
    main()
  File "/home/slaniel/bin/my_script.py", line 22, in main
    print "".join(["%s %s\n" % (value, key) for key, value in sorted_list])
IOError: [Errno 32] Broken pipe

I.e., Python still has data in the pipe, ready to go to stdout, but it can’t send it because head(1) exited. So my_script.py gets SIGPIPE, and Python traps that as an IOError exception. The solution is straightforward:

from signal import signal, SIGPIPE, SIG_DFL
signal(SIGPIPE,SIG_DFL)

This DFL thing is new to me:

signal.SIG_DFL
This is one of two standard signal handling options; it will simply perform the default function for the signal. For example, on most systems the default action for SIGQUIT is to dump core and exit, while the default action for SIGCHLD is to simply ignore it.

If I’m reading that right, Python replaces the default SIGPIPE behavior with a thrown exception. To make the signal yield the system default, you need to tell Python explicitly to do that.

Two questions:

  1. Why would Python do this? Is the general logic that it’s trying to “internalize” systemwide behaviors? Maybe it wants a script to be “write once, run anywhere”, so it can’t just accept the systemwide behavior. Instead, it has to turn external system events (like SIGPIPE) into internal program behaviors (like exceptions). Is that the idea?
  2. I don’t want to have to tell every one of my scripts to exit silently when it receives SIGPIPE. So I would prefer not to write
    from signal import signal, SIGPIPE, SIG_DFL
    signal(SIGPIPE,SIG_DFL)

    in every script that I ever produce. Do people have a general approach here? E.g., every script does an import Steve_lib (or your own equivalent) that sets up the expected signal-handling defaults?
Tech-support bleg: iOS Notes app not syncing to Mac OS X — September 13, 2015

Tech-support bleg: iOS Notes app not syncing to Mac OS X

A bunch of Googling and StackOverflowing didn’t give any good answers to this, so I wonder if the following fact pattern speaks deeply to any of the sleuths on this blog.

  1. Not all notes on my phone sync to my Mac. The last note that synced is from August 20. I’ve created plenty of notes since then.
  2. Notes are correctly syncing from phone to iCloud. I confirm this through iCloud.com.
  3. If I make a change to a note on iCloud.com, that change is reflected immediately on the phone but is not reflected on the Mac. Likewise if I create an altogether new note through iCloud.com. So both the phone→iCloud and iCloud→phone directions work fine.
  4. If I create a new note on the Mac, it doesn’t show up on iOS.
  5. If I edit the aforementioned August 20 note on the Mac, I would expect the August 20 note to move to the top of the list of notes on iOS and iCloud.com. It does not. In fact it looks like that note is only on the Mac, not on iOS or iCloud.com. If I instead pick a note from August 8 that is on the Mac, on iOS, and on iCloud (this is a note containing thoughts I jotted down as I read the book about Nixon, as it happens), and I edit that note on the Mac, those edits don’t show up on iCloud.com or on iOS.

    What I conclude up to here is that Notes from the Mac is just not talking with the rest of the world at all.

  6. iCloud drive from the Mac works fine: I create a note in TextEdit and save to iCloud Drive, and it appears in the iCloud Drive app on iOS. This would seem to suggest that the Mac is having no problem reaching iCloud.
  7. I use lots of other iCloud services from the Mac, including Messages. iMessages sent from the Mac show up on iOS instantly.
  8. Logging out of iCloud on the Mac, then logging back in, doesn’t change things: the latest note on the Mac is still from August 20.
  9. Both iOS and Mac (and, indeed, iCloud.com) are configured to use the same user ID, so that’s not the problem.
  10. I’m syncing notes with multiple accounts (iCloud, home, work), but the problem isn’t that the missing notes are on another account: even if I display ‘All Notes’ on the Mac, I don’t see the post-August 20 notes.
  11. Phone is running iOS 9.1 beta 1. Mac is running OS X Yosemite 10.10.5.

I wish Mac stuff — particularly the networking layer — were easier to debug. Console.app doesn’t seem to have much to say on the Mac side, and I wouldn’t even know where to begin debugging the iOS side. And for all I know, this problem doesn’t exist on either the Mac or iOS; instead, maybe the problem is somewhere out in the æther in one of Apple’s server farms.

I’m curious if other people know what might be going on here.

Update (2015-09-13): I’m told that the iOS 9 beta asks me, the first time I run Notes there, whether I want to upgrade their format to some new version. I don’t remember this, but it wouldn’t surprise me if I said yes to it without paying much attention. If I did say yes to it, then I’ll need to upgrade to El Capitan through the beta program. I’m in the middle of doing that. We’ll see if that fixes this problem.

Update 2 (2015-09-13): Yep; updating to El Cap solved it. So it was just a beta-software incompatibility … though would I have suffered the same problem had I waited until iOS 9 was no longer in beta, and updated then without updating my Mac to El Cap?