I’ve been going kind of buck-wild recently on writing Perl. I’ve become really, really obsessive about it, and I actually think I’m becoming a noticeably better programmer as the days go by. Today I got to work at 8:45, started working at least by 9:15, kept going until 7 p.m., didn’t break for lunch, then came home and finished up something that had been nagging at me toward the end of the day. We had found that a number of files which we’re converting from one system to another were duplicates, so I wanted to write a little Perl script that finds the dups in a list of files and deletes all but one of them. And since I tend to obsess over details, I had to make it clean (or cleanish; those of you with computer-science degrees should bear in mind that I’m just a lowly hacker). So maybe this will be a little Exploration of Steve’s Mind.
The basic tool you’d use to find dups is md5sum(1). This uses the MD5 algorithm to compute a number (called an MD5 hash) for each file that it’s given; with very high probability, files share the same MD5 hash if and only if they’re identical.
So then you could do a shell-scripty thing like so to find all the MD5 hashes for a given set of files:
md5sum [list of files] \
  | cut -f1 -d' ' \
  | sort \
  | uniq -c \
  | sort -nr \
  | sed 's/^ *//' \
  | grep -v '^1 '
This will give you back a list of the MD5 hashes that occur more than once, each preceded by its count, with the filenames stripped off. I couldn’t think of any way to leave the filenames in while still using the sort -nr trick, so I decided to do the whole thing in Perl. The result is the set of functions in my md5_funcs file. The md5sum function itself doesn’t do much more right now than call out to md5sum(1); I figured it was worthwhile to write my own function for this, just in case someday I find a Perl implementation of md5sum and want to replace the shell version with the new one.
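One candidate for that Perl implementation is the Digest::MD5 module, which ships with Perl. Here is a rough sketch of how it could group filenames by hash, which also keeps the filenames attached to their hashes. The names md5_of_file and find_duplicates are made up for illustration; they are not the functions from the file described above.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: hex MD5 digest of one file's contents.
sub md5_of_file {
    my ($filename) = @_;
    open(my $fh, '<', $filename) or die "Can't open $filename: $!";
    binmode($fh);
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $digest;
}

# Hypothetical helper: returns a hash ref mapping each digest that occurs
# more than once to the list of files sharing it.
sub find_duplicates {
    my @files = @_;
    my %files_by_digest;
    push @{ $files_by_digest{ md5_of_file($_) } }, $_ for @files;

    my %dups;
    for my $digest (keys %files_by_digest) {
        my $group = $files_by_digest{$digest};
        $dups{$digest} = $group if @$group > 1;
    }
    return \%dups;
}

# Print each group of duplicates among the files named on the command line.
my $dups = find_duplicates(@ARGV);
for my $digest (sort keys %$dups) {
    print "$digest: @{ $dups->{$digest} }\n";
}
```

Each group printed is a set of files with the same hash, so deleting all but one is a matter of keeping the first name in each group and unlinking the rest. It is probably wise to print the groups and eyeball them before deleting anything.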
Rather than do something like
sub md5sum {
    my @files = @_;
    foreach my $filename ( @files ) {
        `md5sum $filename`;    # one external process per file
        [some other stuff]
    }
}
I figured it was better to combine all the filenames into one long string and then pass that string to md5sum(1) en masse so that Perl wouldn’t have to invoke a bunch of external processes. But since we’re calling out to the shell, I decided that I had to write a little function to escape strings for use by the shell:
sub shellEscape {
    my $inString = shift;
    # Characters that get a backslash put in front of them.
    my $shellEscapeCharacters = qr/[`\${}() \"'\&]/;
    $inString =~ s/($shellEscapeCharacters)/\\$1/g;
    return $inString;
}
There are probably other characters that I’d need to escape, but that seems like a fine list for now. A while back I asked what seemed like the appropriate mailing list which characters to escape, and whether there was a well-known way to ask bash that question, but I got no response. So this is the best hack that I could come up with.
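One way to sidestep the question of which characters to escape is to not involve the shell at all. When Perl's open is given a pipe mode and the command as a list, it runs the program directly and hands over each filename as a separate argument. The sketch below is a different technique from shellEscape, not a fix for it. It assumes GNU md5sum is on the PATH, and md5sum_batch is a made-up name. It also does not handle the backslash-prefixed output lines that GNU md5sum produces for filenames containing a backslash or newline, and a very long file list could still run into the operating system's limit on argument length.

```perl
use strict;
use warnings;

# Hypothetical helper: run md5sum(1) once over all the files, with no shell
# in between. Returns a hash ref mapping filename => hex digest.
sub md5sum_batch {
    my @files = @_;
    return {} unless @files;

    # List-form pipe open: Perl executes md5sum directly, so each filename
    # arrives as one argument and needs no quoting. The '--' tells md5sum
    # that everything after it is a filename, even if it starts with '-'.
    open(my $pipe, '-|', 'md5sum', '--', @files)
        or die "Can't run md5sum: $!";

    my %digest_of;
    while (my $line = <$pipe>) {
        chomp $line;
        # Expected line shape (GNU md5sum): 32 hex digits, a space,
        # a mode character (' ' or '*'), then the filename.
        if ($line =~ /^([0-9a-f]{32}) [ *](.*)$/) {
            $digest_of{$2} = $1;
        }
    }
    close($pipe) or warn "md5sum exited with status $?\n";
    return \%digest_of;
}

# Print digest and name for each file given on the command line.
my $digests = md5sum_batch(@ARGV);
print "$digests->{$_}  $_\n" for sort keys %$digests;
```

This still invokes only one external process for the whole list, which was the point of joining the filenames into one string in the first place.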
The rest of the functions in md5_funcs are there mostly because I just wanted to break down the problem into chunks. The functions will probably see no reuse in any other problem, so breaking it down this way may be kind of pointless. It seems like a good habit to get into, in any case.
So now this little task is over, and it’s time for me to sleep — perchance to get obsessed again tomorrow.