Thursday, August 2, 2007

The Right Tool For The Job: Scripting


Though it's barely planned
The kludgiest of Perl scripts
Is one day maintained


I've been learning Perl lately, after having used Python wherever possible for a couple of years. It's gut-wrenching. So today's pedantry is on the topic of scripting languages, which I'm defining as interpreted, batteries-included, "#!/usr/bin/env"-ready languages for getting a simple job done with a minimum of hassle.

Google for "little $LANG script", in quotes, replacing $LANG with each of the most well-known scripting languages. My results:

Table 1:

$LANG       @Hits
=====       =====
Perl       32,300
shell      24,400   * what does this mean, exactly?
PHP        15,500
Python     12,000
VB          1,080   * Skewed, because "vb script" is also a language
bash          808
batch         624
Ruby          511
Tcl           411
js            271
sh            266
vim            76
C++             7
scheme          7
lisp            6
emacs           3
haskell         3


To further abuse Internet statistics, let's search for each language on Google Code:

Table 2:

$LANG           @Hits
=====           =====
C++         6,000,000
PHP         1,590,000
Perl        1,420,000
Python      1,050,000
shell         879,000
Ruby          304,000
Lisp          238,000   * includes elisp
Javascript    212,000
Basic         202,000
Tcl           186,000
bat           183,000
Scheme        103,000
Haskell        67,700


Now, combine these two tables to get a ratio representing the "scriptability" of each language. Or rather, divide the Google Code hits by "little script" hits to get a "Script Factor" inversely proportional to the fraction of existing code that qualifies as little scripts. This is hard science.

Table 3:

$LANG        Google Code   "Little"   Script Factor   Notes
=====        ===========   ========   =============   =====
shell            879,000     24,400              36   * "Shell" is vague
Perl           1,420,000     32,300              44
Python         1,050,000     12,000              88
PHP            1,590,000     15,500             103
Basic            202,000      1,080             187   * Includes non-Visual basics
Batch            183,000        624             293
Tcl              186,000        411             453
Ruby             304,000        511             595
Javascript       212,000        271             782
Scheme           103,000          7          14,714
Haskell           67,700          3          22,567
Lisp             238,000          9          26,444   * Includes elisp and common lisp
C++            6,000,000          7         857,143
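
For anyone who wants to check the hard science, the division behind Table 3 can be sketched in a few lines of Python. This is only a sketch: the hit counts are copied from Tables 1 and 2 for a handful of rows, and the names code_hits, little_hits and script_factor are mine, made up for illustration.

```python
# Hit counts copied from Tables 1 and 2 (only a subset of the rows).
code_hits = {"shell": 879_000, "Perl": 1_420_000, "Python": 1_050_000,
             "PHP": 1_590_000, "C++": 6_000_000}
little_hits = {"shell": 24_400, "Perl": 32_300, "Python": 12_000,
               "PHP": 15_500, "C++": 7}


def script_factor(lang):
    """Google Code hits divided by "little $LANG script" hits, rounded."""
    return round(code_hits[lang] / little_hits[lang])


# A lower factor means a larger share of the code counts as "little scripts".
for lang in sorted(code_hits, key=script_factor):
    print(f"{lang:<8}{script_factor(lang):>10,}")
```

Sorting by the factor puts the most "scriptable" language first, which is the ordering Table 3 uses.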


Interestingly, this is close to the first list of "little script" languages, with the three P's right up top. The functional languages I threw in for fun are ranked by absurdly small denominators, so I wouldn't say the results are meaningful beyond indicating that even the hardcore people using these languages for real projects are using P-languages and the shell for simple scripts.

What does this all mean?

  • Scripters are using the right tool for the job. Good scripting languages float to the top.

  • Even the most hardcore Lisp and Haskell programmers use something else for scripting. In other words, they know multiple languages, and they, too, use the right tool for the job.

  • There are seven idiots in the world writing scripts in C++. One would only do this if unaware of any other scriptable language, and therefore capable of using only one tool for any job.

  • Emacs users call their customizations "packages" or "modes," not "scripts." Foiled.

Now let's get specific.

Shell

"Shell" came in first in scriptability and second in "little $LANG script," but merely fifth in Google Code usage. So shell scripting is a popular way to get things done, but not so much for writing full-on applications.

What language is shell scripting, exactly? I'm assuming the search hits refer to bash, ksh, csh, zsh, and the rest of the Unix shells, mostly because that's how it showed up on Google Code, and because bash seems to be the default on the major Linux distros. Plus, Windows programmers don't talk about the "shell"; if they wade into the muck of cmd.exe at all, they call it batch, DOS, or occasionally command-line scripting. And they don't talk about it online as much as Unix/Linux gurus, outside a few Microsoft-specific websites, from what I've seen.

The strengths of the shell are (1) everything is a string; (2) courtesy of Unix design, the sources and recipients of character streams consistently look like filenames; (3) complex programs can be used like functions and filters, directly adding to the shell's abilities (the ultimate FFI, in a way); (4) since code can be data and commands can be piped and redirected around, flow control can be pretty concise. The flaws, as I see them, are (1) everything is a string, meaning nontrivial structures must be serialized and parsed at every step; (2) there are few guarantees about what's actually available to the shell on a given system (paths, environment variables, program versions), so sharing scripts between systems is wildly unreliable. Still, I've never seen a GUI tool as broadly useful as the shell is for getting computer tasks done.
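
To make the "everything is a string" flaw concrete, here is a rough Python imitation of a two-stage pipeline. It is a sketch, not real shell: the record format ("name size") and the helper names parse, serialize, keep_large and sort_by_size are all invented for the example. The point is that each filter has to parse its input and re-serialize its output, just as each stage of a shell pipeline would.

```python
def parse(line):
    """Turn one 'name size' text line into a (name, size) tuple."""
    name, size = line.split()
    return name, int(size)


def serialize(record):
    """Turn a (name, size) tuple back into a text line."""
    name, size = record
    return f"{name} {size}"


def keep_large(lines, minimum):
    # Stage 1: parse, filter, serialize again.
    return [serialize(r) for r in map(parse, lines) if r[1] >= minimum]


def sort_by_size(lines):
    # Stage 2: parse the same text all over again, sort, serialize again.
    return [serialize(r) for r in sorted(map(parse, lines), key=lambda r: r[1])]


lines = ["b.log 9000", "a.txt 120", "c.dat 45"]
result = sort_by_size(keep_large(lines, 100))
print(result)
```

Passing the tuples straight from one stage to the next would skip the round trip through text, which is roughly what the P-languages buy you.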

Perl

Legend has it that Larry Wall designed Perl to pull together all of the various Unix sysadmin tools into one effective package, with the plan for it to be especially useful for text manipulation (Reporting and Extraction). So C, bash, awk, sed, grep, and friends are all in there -- in short, it keeps the shell's advantages and does its best to eliminate the disadvantages. (Best of all, it finally got regular expressions right.) And then there's CPAN. I'm not surprised that Perl is #1 for "little scripts" that are just complex enough to be worth saving.

What is Perl the right tool for?
  • One-liners that bash doesn't have an equivalent for -- Perl is installed almost everywhere bash is
  • Straightforward text-processing scripts (Python's immutable strings are a weakness here, and Ruby installations still aren't a universal default)
  • It was a great server-side scripting language during the first dotcom boom (though Java managed to cast itself as the more legit (enterprisey) big brother here). Since Perl coders weren't afraid to get things done "right now," mod_perl made the combination of Apache and Perl effective, scalable, and most importantly, available just when it was needed.
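
The jab at Python's immutable strings, in a small Python sketch. The variable names are mine, and the workarounds shown (joining a list of pieces, writing into an io.StringIO buffer) are common idioms rather than the only options.

```python
import io

text = "perl"

# Strings are immutable: assigning to a character raises TypeError.
try:
    text[0] = "P"
    mutated = True
except TypeError:
    mutated = False

# Every "edit" builds a brand-new string instead.
capitalized = "P" + text[1:]

# For many small edits, collect the pieces and join them once...
pieces = [word.upper() for word in "just another perl hacker".split()]
joined = " ".join(pieces)

# ...or write into an in-memory text buffer.
buffer = io.StringIO()
for word in pieces:
    buffer.write(word[0])
initials = buffer.getvalue()

print(mutated, capitalized, joined, initials)
```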

Python

Python fixes Perl, says the next legend. But its strength as a scripting language is that it fixes Java, too -- and as it turns out, Python's "scriptability" is exactly half of Perl's (a Script Factor of 88 against Perl's 44). Spooky, no?

I like Python. It makes sense to C programmers and Unix hermits. And, thanks to Guido's diligent attention to aesthetics, ugly Python code almost always means you're doing something awkward, slow or wrong. The language rewards good behavior with readable, concise code. You know that whitespace issue where if you copy code from a forum and paste it into your own code, the interpreter will crap out on the indentation? It's punishing you for blind copy-and-paste. Doesn't that creep you out a little? Guido is basically handing out candy if you read the documentation on generator expressions, and slapping you on the wrist if you don't read your own code before running it.

There doesn't seem to be a single theoretical approach that guarantees a language will turn out that way, but in Python's case, it seems to have worked.
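
Both the candy and the wrist-slap fit in one short Python sketch. The word list and the mangled "pasted" snippet are invented for the example; the built-in compile() is used only to show the interpreter refusing the code without running anything.

```python
words = ["kludge", "script", "perl", "python"]

# The candy: a generator expression feeds sum() lazily,
# without building an intermediate list.
total_length = sum(len(w) for w in words)

# The wrist-slap: pasted code whose indentation got mangled.
# The second body line is indented less than the first,
# but not all the way back to the level of the "for".
pasted = (
    "for w in ['a', 'b']:\n"
    "        print(w)\n"
    "    print(w.upper())\n"
)

try:
    compile(pasted, "<pasted>", "exec")
    rejected = False
except IndentationError as err:
    rejected = True
    print("rejected:", err.msg)

print(total_length)
```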

What is Python the wrong tool for?
  • One-liners -- remember that thing about whitespace?
  • Unix tasks that have already been thoroughly solved with existing command-line tools (see Shell, above).
  • Number crunching (by itself, but see SciPy and Parallel Python). Python 3.0 borrows most of Scheme's numerical tower, so that may improve the situation.
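
The one-liner complaint can be checked without leaving Python. In this sketch the helper compiles() is a name I made up; it only asks the built-in compile() whether a piece of source is syntactically valid. Simple statements chain with semicolons, but a compound statement such as a for loop cannot follow one, so a real loop needs real newlines and indentation.

```python
def compiles(source):
    """Return True if `source` is syntactically valid Python."""
    try:
        compile(source, "<one-liner>", "exec")
        return True
    except SyntaxError:
        return False


# Simple statements chain with semicolons, as in python -c "...".
simple = "import sys; print(sys.version)"

# A compound statement (for, if, while) cannot follow a semicolon.
compound = "import sys; for arg in sys.argv: print(arg)"

# The workaround: embed actual newlines and indentation.
multiline = "import sys\nfor arg in sys.argv:\n    print(arg)"

print(compiles(simple), compiles(compound), compiles(multiline))
```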
