
Wednesday, August 15, 2012

Code Harvest: The Refactoring

I've been hacking on bioinformatics code for four years, but so far the only work I've really made available to "the community" is in Biopython, mainly Bio.Phylo.

The code I write in the lab is under one big Mercurial repository called esgb; there's a shell script to install everything, including a bunch of scripts, sub-projects and a sprawling Python library called esbglib. Most of my Python programs depend on some functionality in esbglib, and usually Biopython and sometimes SciPy as well.

Having signed the Science Code Manifesto, duty calls for me to bundle some of the programs I've written with the next paper I'm working on, and so I've begun a mighty refactoring of esbglib to extract the general-purpose, reusable components into Python packages. At the moment it looks like I'll end up with two: biofrills and biocma.

Wednesday, August 1, 2012

The well-organized data science project


Someone recently asked me about the basic setup a computational scientist needs to conduct research efficiently. I'm pretty satisfied with my current arrangement, which was inspired by this: "A Quick Guide to Organizing Computational Biology Projects"

My work is organized into individual "projects" which are each supposed to become papers at some point. I keep each project in Dropbox to ensure everything is synced and backed up remotely all the time -- no file left behind. I also use Mendeley, with a folder for each project's references. Mendeley can generate a project-specific BibTeX file from a folder.

A well-organized project might look like this:

Monday, January 16, 2012

Building an analysis: How to avoid repeating intermediate tasks in a computational pipeline

In my projects, I tend to start with a simple analysis of a limited dataset, then incrementally expand on it with more data and deeper analyses. This means each time I update the data (e.g. add another species' protein sequences) or add another step to the analysis, the whole pipeline gets re-run from the top -- even though only a small part of it actually needs to be re-run.

This is a common problem in bioinformatics:
http://biostar.stackexchange.com/questions/79/how-to-organize-a-pipeline-of-small-scripts-together

How can we automate a pipeline like this, without running it all from scratch each time? This is the same problem faced when compiling large programs, and that particular case has been solved fairly well by build tools.
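
The build-tool idea is easy to sketch in Python: a step only needs to run if its output is missing or older than any of its inputs. Here is a minimal, hypothetical example of that check; the file names and the muscle command are placeholders, not part of any real pipeline of mine:

import os
import subprocess

def is_stale(target, sources):
    """True if target is missing or older than any of its source files."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_time for src in sources)

def run_step(command, target, sources):
    """Run a command (given as a list of arguments) only when its target is stale."""
    if is_stale(target, sources):
        subprocess.check_call(command)
    else:
        print("up to date: " + target)

# e.g. redo the alignment only if the sequence file has changed
run_step(["muscle", "-in", "proteins.fa", "-out", "proteins.aln"],
         target="proteins.aln", sources=["proteins.fa"])

Real build tools add dependency graphs and parallel execution on top of exactly this check.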

Thursday, April 8, 2010

Google Summer of Code 2010: The final draft

The Google Summer of Code 2010 application period is in its final 24 hours.

I volunteered to mentor with two organizations this year, OBF and NESCent. Last month I posted a couple of ideas with each org:

The applications that have come in have been pretty good; the only thing I can complain about is that nobody has followed through with my MIAPA project -- we got a nibble from one student, but nothing after that.

Since we're doing the last round of application reviews now before the deadline, here's some general guidance on what mentors are looking for in a student application.

First, a couple of outside references:

The Zen of GSoC

Google Summer of Code is a program to recruit and foster new long-term open-source contributors.

Broadly, the mentoring organizations are asking three questions:
  1. Are you motivated enough about this work to continue contributing after the summer?
  2. Can you write useful code on your own?
  3. Do you interact well with the community, so that we can work with you to merge your work cleanly into the trunk and rely on you to maintain the codebase?
You can get a sense of what Google and the mentoring orgs are looking for from the applications the orgs themselves submit to Google. For example: NESCent's 2010 app.

Here are some specific tips for demonstrating that you have some committer in you.

Put your previous work online

It's remarkable how many ostensible programmers just can't write decent code. They'll have a list of successful past projects they worked on, maybe a legitimate degree in computer science, but their code itself was clearly never fully understood by anyone, original programmer included. (Remember, programming languages exist for humans to understand -- the computer itself runs on machine code.) The only way we can be sure you can write code we can use is if we can look at something you've written previously.

Biopython development happens on GitHub, so putting a project of your own there demonstrates two useful things: you can write functioning code, and you're already up to speed with the tools Biopython uses for development.

If the most relevant code you've written is tied up in some way -- say, it's part of a research project still being prepared for publication -- see if you can use at least a few snippets of it. So far, it seems most professors have been willing to allow that.

Subscribe to your mentoring organization's mailing list

I know, e-mail mailing lists seem at least a decade behind the times. But open-source projects like to have a permanent public record of the discussions that happen, and everyone has an e-mail account. We also have IRC channels and Twitter tags (#phylosoc and #obfsoc), but project proposals are generally more than 140 characters so it's best to use e-mail at some point.

Plus, you'll be able to read all the advice the other students are getting -- mentors get fatigued as the application season wears on, and once we've written the same thing a few times we start skipping details.

Write a weekly project schedule

The GSoC application has fields for pointing to external info. Create a Google document or spreadsheet (or README.md on GitHub if you're fancy) detailing your project plan week-by-week.

Suggested fields:
  • Date, or week number for referencing later
  • GSoC events and guidelines (see the official timeline)
  • Deliverables for the week — what's produced, e.g. documentation sections, unit tests, classes, modules
  • Approach for each of these tasks, in a few words
  • Potential problems that could occur, specific to the tasks — perhaps a dependency turns out to be inadequate, or an integration step is required
  • Proposed mitigation for each of the foreseen issues

(If you want to estimate the number of hours or days each task will take, that's cool too.)

Here are the examples from previous GSoC projects that we've been sharing on the mailing lists:

Respect the deadlines

Submit a draft of your application to Google at least a day before the deadline, April 9. There are thousands of applicants each year, and Google has no reason to let the deadline slide — an important function of the application process itself is to screen out students who won't deliver by the stated deadlines. In effect, if your application isn't submitted to Google by noon PST on April 9, then you didn't apply.

BUT: If you submit something even partially complete, we can contact you later during the review stage and get the remaining information from you. And if you included a link to your weekly plan (as a separate online document), you can edit that after the deadline too.

Best of luck!

Wednesday, February 24, 2010

Python workshop #2: Biopython

As promised, here are the slides from Monday's Biopython programming workshop:



This was another 2-hour session, with a short snack break in the middle this time -- which was also a nice opportunity to ask everyone about the pacing, and see who's been following along with the examples in IPython (versus staring at a BSOD or lolcats -- which I didn't notice any of).

This went well:
  • Pacing
  • Using IPython to inspect objects and display documentation -- this lets some people "read ahead" and perhaps answer their own minor questions, leading to other, better questions
  • The general introductory pattern of:
    1. Demonstrate how to import a module and instantiate the basic class
    2. Review, in English, the core features of the module and why they exist
    3. Walk through a short script that uses real data to accomplish some simple but useful task(s), roughly as sketched below
    4. Display the result, completing the mental pipeline of input -> transformation -> output
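
For a concrete sense of that pattern, the Bio.SeqIO walkthrough looked roughly like this; this is a reconstruction for illustration, not the actual slide code, and it assumes a FASTA file named example.fasta is on hand:

from Bio import SeqIO

# Parse the input file into SeqRecord objects (the input)
records = list(SeqIO.parse(open("example.fasta"), "fasta"))

# A simple but useful task: keep only sequences longer than 100 residues
# (the transformation)
long_records = [rec for rec in records if len(rec.seq) > 100]

# Write the survivors back out as FASTA (the output)
out_handle = open("long_seqs.fasta", "w")
SeqIO.write(long_records, out_handle, "fasta")
out_handle.close()
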
Room for improvement:
  • I didn't always execute the final draft of each example, so there were a couple typos -- inconvenient for those following along in Python. (I've fixed them in the slides here.)
  • Consequently, I didn't have an output file to show at the end of each example -- so I had to describe or draft one on the spot.
  • The PDB module was the coolest part of the workshop, and I rushed it a bit. I was afraid the visitors from Genetics and Plant Bio would be bored with it, but I don't think they were, and the Bioinformatics folks were left wanting more.
I'm planning to host both Python workshops again in the next academic year, either 1 per semester (as it was this year) or both each semester, maybe 2 weeks apart. The Biopython workshop in particular will be different next time because Bio.Phylo will finally be included with the main Biopython distribution -- evolution is cool, and more of the pretty is always a good thing to have in a programming workshop.

Monday, July 20, 2009

Faster string concatenation in Python

Nick Matzke pointed me to this discussion of string concatenation approaches in Python:

Efficient String Concatenation in Python

The issue here is whether adding strings together in a for loop is inefficient enough to be worth working around. Python strings are immutable, so this:
s = 'om'
for i in xrange(1000):
    s += 'nom'

means doing this 1000 times:

  1. Translate the assignment to "s = s + 'nom'"

  2. Allocate another string, 'nom'. (Or reuse a reference if it's already interned.)

  3. Call the __add__ method on s, with 'nom' as the argument

  4. Allocate the new string created by __add__

  5. Assign a reference to that string back to s

So, using the + operator 1000 times in a loop has to create 1000 ever-larger string objects, but only the last one gets used outside the loop. There are good reasons Python works this way, but still, there's a trap here in an operation that gets used a lot in ordinary Python code.

There are a few ways to cope:

  • Use a mutable object to build the string as a sequence of bytes (or whatever) and then convert it back to a Python string in one shot at the end. Reasonable intermediate objects are array and StringIO (preferably cStringIO).

  • Let the string object's join method do the dirty work -- strings are a basic Python type that's been optimized already, so this method probably drops down to a lower level (C/bytecode in the CPython interpreter, not sure about the details) where full allocation of each intermediate string isn't necessary.

  • Build a single format string and interpolate with the % operator (or the format method, if you're fancy) to fill it in, under the same rationale as with the join method. This fits real-world scenarios better — filling in a template of a plain-text table or paragraph with computed values, either all at once with % or incrementally with string addition. It could be a performance bottleneck, and it's not obvious which approach would be better.

The original article gives a nice analysis and comes out in favor of intermediate cStringIO objects, with a list comprehension inside the string join method as a strong alternative. But it was written in 2004, and Python has changed since then. Also, it doesn't include interpolation among the tested methods, and that was the one I was the most curious about.

Methods


I downloaded and updated the script included with that article, and ran it with Python 2.6 and 2.5 to get some new results. (Source code here.)

First, a changelog:

  • The method numbers are different, and there are a couple more. Method #2 is for the % operator, in which I build a gigantic format string and a gigantic tuple out of the number list, then smash them together. It trades memory for CPU time, basically. Method #8 uses map instead of a list comprehension or generator expression; no lambda is required and the necessary function (str()) is already available, so this is a good candidate.

  • I used the standard lib's time.clock() to measure CPU time around just the relevant loop for each string concatenation method.

  • Fetching the process memory size is similar but uses the subprocess module and different options.

  • Docstrings are (ab)used to identify the output.

For example, the string addition method now looks like this:
def method1():
    """1. string addition"""
    start = clock()
    out_str = ''
    for num in NUMS:
        out_str += str(num)
    cpu = clock() - start
    return (out_str, cpu, memsize())
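
And, for comparison, here is roughly what the %-interpolation and map methods look like under the same pattern; this is a sketch following the changelog above rather than a verbatim copy of the script (NUMS, clock and memsize come from that script):

def method2():
    """2. %-interpolation"""
    start = clock()
    # One gigantic format string, one gigantic tuple, one interpolation
    out_str = ('%d' * len(NUMS)) % tuple(NUMS)
    cpu = clock() - start
    return (out_str, cpu, memsize())

def method8():
    """8. join + str map"""
    start = clock()
    # map() looks up str once, then applies it to every number
    out_str = ''.join(map(str, NUMS))
    cpu = clock() - start
    return (out_str, cpu, memsize())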


Results


Each method concatenates the string representation of the numbers 0 through 999,999. The methods were run sequentially in separate processes, via a for loop in the shell, for Python versions 2.5 and 2.6. The best of three runs for each method are shown below.
Python 2.6:
1. string addition   CPU (s): 1.99   Mem (K): 11.7
2. %-interpolation   CPU (s): 2.42   Mem (K): 23.0
3. array object      CPU (s): 3.42   Mem (K): 17.3
4. cStringIO object  CPU (s): 3.24   Mem (K): 19.7
5. join + for loop   CPU (s): 2.29   Mem (K): 48.0
6. join + list comp  CPU (s): 1.93   Mem (K): 11.6
7. join + gen expr   CPU (s): 2.08   Mem (K): 11.6
8. join + str map    CPU (s): 1.47   Mem (K): 11.6

The winner is map, with string addition, the list comprehension, and the generator expression also doing well. String addition in a loop did much better than would be expected from reading the original article; the Python developers have put effort into making this less of a trap. Specifically, when the interpreter executes a statement like s += 'nom' and can see that the variable holds the only reference to the string, it resizes that string in place rather than allocating a new object, which is exactly the case of a string being built up in a loop. Nice. So really, there's no need to worry about the quadratic-time behavior that we expected — at least in Python 2.6.

The array object, a sequence of packed bytes, is supposed to be a low-level but high-performance workhorse. It was embedded in the minds of performance-conscious Python programmers by this essay by Guido van Rossum:

Python Patterns — An Optimization Anecdote

At a glance, that problem looks similar to this one. However, converting ints to chars is a problem that can be described well in bytes. Converting integers to their string representation is not — we're not even using any features of the array object related to byte representation. Going low-level doesn't help us here; as Guido indicates in his conclusion, if you keep it short and simple, Python will reward you. The cStringIO object in method 4 performs similar duties, and the shape of both functions is the same; the only difference in performance seems to be that cStringIO trades some memory space for CPU time.

The string join method is recommended by the Python standard library documentation for string concatenation with well-behaved performance characteristics. Conveniently, str.join() accepts any iterable object, including lists and generator expressions. Method 5 is the dumb approach: build a list in a for loop, pass it to join. Method 6 pushes the looping operation deeper into the interpreter via a list comprehension; it saves some bytecode, variable and function lookups, and a substantial number of memory allocations.

Using a generator expression in method 7 instead of a list comprehension should have been equivalent or faster, by avoiding the up-front creation of a list object. But memory usage is the same, and the list comprehension runs faster by a small but consistent amount. Maybe join isn't able to take advantage of lazy evaluation, or is helped by knowing the size of the list object early on... I'm not sure. Interesting, though. In Python 3, the list comprehension is equivalent to building a list object from a generator expression, so results would probably be different there.

Finally, in method 8, map allows the interpreter to look up the str constructor just once, rather than for each item in the given sequence. This is the only approach that gives an impressive speedup over string addition in a loop. So how portable is this result?

Python 2.5:
1. string addition   CPU (s): 3.77   Mem (K): 10.8
2. %-interpolation   CPU (s): 2.43   Mem (K): 22.0
3. array object      CPU (s): 5.16   Mem (K): 16.4
4. cStringIO object  CPU (s): 4.93   Mem (K): 18.7
5. join + for loop   CPU (s): 3.98   Mem (K): 47.1
6. join + list comp  CPU (s): 3.30   Mem (K): 10.5
7. join + gen expr   CPU (s): 3.59   Mem (K): 10.5
8. join + str map    CPU (s): 2.72   Mem (K): 10.5

Python 2.6.2 has had the benefit of additional development time, notably the Unladen Swallow project's first quarter of interpreter optimizations, with impressive improvements across the board. By comparison, Python 2.5 uses generally less memory and more CPU time. String interpolation, however, seems to already have been optimized to the max in Python 2.5, and actually wins the performance shootout here! String addition, on the other hand, benefits slightly less from the in-loop optimization. It still avoids the quadratic-time issue (that enhancement was added in Python 2.4), and memory usage is quite respectable.

Conclusion


The recommendations at the end of Guido's essay are still exactly right. In general, Python performs best with code that "looks right", with abstractions that fit the problem and a minimum of branching and explicit looping.

  • Adding strings together in simple expressions will be optimized properly in recent Pythons, but could bite you in older ones

  • Using string interpolation or templates plays well enough with more complex formatting

  • Going too low-level can deprive you of Python's own optimizations

  • If built-in functions can do what you need, use them, and basic Haskell-style functional expressions can make your code very concise

There's more discussion on Stack Overflow.

Saturday, January 19, 2008

On Blub

There's an interesting (old) discussion thread at Raganwald.com:
http://weblog.raganwald.com/2006/10/are-we-blub-programmers.html

Blub Theory evolves over the course of the comments. First off, the requirements for Blubbiness are defined as:
  1. There's at least one language that's worse for the task at hand, and the programmer realizes (validly) that it's less suited for the task than Blub.
  2. There's at least one language that's better for the task, and the programmer doesn't realize that it's better for the task than Blub.
That's Blub from the programmer's perspective. For the adventurous programmer, Blub is the language you're fighting against when you try to introduce Python or OCaml to your programming team. (Citations: Beating the Averages, The Python Paradox)

Another commenter glances on the management perspective of Blub:
I misread "Blub" to be "bulb". As in when a programmer burns out his employer just throws him away and screws in a new one.
'Nuff said.

When we're not trying to pin it down too carefully, we know exactly what Blub is. The first Blub was Fortran — and in some circles, it's still Blub. Guy Steele Jr. (who was involved in the design of Scheme, Common Lisp, and Java in previous efforts to take down previous Blubs) is currently working on Fortress with the same goal. Fortress looks nothing like Fortran, but the name's close enough to get the point across: it's intended to be much better suited for the tasks where scientific programmers instinctively reach for Fortran. C++ was Blub for the '90s, and since real Blubs never die, it's still the Blub of choice for most performance-critical stuff. Java out-Blubbed C++, and now Java and C# are splitting the Blub market. However, examples of Blub code are still generally a simplified C++, since the equivalent in Java would take too much boilerplate to be worth the column space.

(Sidebar: Modern Fortran looks almost nothing like the original Fortran. C++, especially the variety found in Visual Studio now, with the 200X extensions, is also a mutant. And it's not a superset of C99, either, defeating the original purpose of the language. Visual Basic, VBA and VB.NET are not compatible, despite the naming. How can a language take its users for such a ride over the years when another language with a different name might be the more logical next step? Javascript and C# surely got a bit more mileage on name recognition. All the evidence points to human psychology still working on programmers.)

But the argument breaks down when we try to explain why another language is better suited for the task than Blub. There are C++ programmers out there who can bust out a better program than you can write with any other language. Out of the last 10 years of the ICFP Programming Contest, 3 winners used Haskell, 3 used OCaml, and 2 used C++ — and this is a contest arranged by functional-programming gurus. Python and Ruby have never received any prizes, though Perl was recognized by the second-place team last year.

I see two axes to evaluate languages on: something like the front end and the back end. Semantics and implementation. Both are labeled "power", but for the front end that implies what the language does for the programmer, and for the back end it's what the program does for the machine.

Example 1: Ruby has an excellent front end. It's one of the most expressive languages available; that's probably why 37signals picked it for Rails. The best semantic ideas from Smalltalk, Common Lisp and Perl are all in there; most of the famous Design Patterns are built in either implicitly or explicitly. (It's not that the language makes them all obsolete; the language designer just had the foresight to implement the tricky ones for you.)

But the back end has been playing catch-up. Performance lagged well behind even Python and Perl until the very recent v1.9, and there's no native-code compiler. I could be misinformed, but I've also heard that: threading suffers the same issue as Python of limiting the interpreter process to a single processor; there's no built-in foreign-function interface a la Haskell or Python's ctypes module; Unicode support has rough spots; and large-scale parallelism and concurrency basically mean running a bunch of separate Ruby processes.

There's a lot for a Blub programmer to pick on.

Example 2: Delphi, a.k.a. Object Pascal, has an excellent native-code compiler, with support for cross-platform compilation and single-file (DLL-free) executables, and can also run on .NET, with all these options available through the same IDE. It's competitive with C on benchmarks, often faster. Integration with databases and other external components is solid. Refactoring tools are included with the IDE, lots of fun with static analysis. Object Pascal itself was originally designed at Apple (and later abandoned there) some years before Borland picked it up for their offering, but there seems to be something inherent in the language that enables highly optimized compilation. The Free Pascal implementation, for instance, comes well ahead of every other language in the Computer Language Benchmarks Game when memory and speed are weighted equally. On the combined benchmarks, Free Pascal uses only half as much memory as C (gcc)!

The catch is, Object Pascal is a cheesy-looking language. On the same benchmark set, comparing the size of the code in gzipped bytes (emphasizing tokens instead of characters), Object Pascal comes in 24th out of 33 languages, just behind C. It beats Fortran, Java and C++, but not C#. I think I'd just buy more RAM rather than rewrite a Blub program in Object Pascal.

The benchmarks tend to be heavy mathematical algorithms, rather than general-use applications, so certain things like I/O, libraries and support for bottom-up programming and meta-programming are discounted. Regardless, Python, Perl and Ruby are the top 3 languages for code size on these benchmarks — I think Lisp was hurt more by this aspect of the benchmarks, since syntactic sugar isn't built in; there's no room for code reduction via mini-language. Haskell was probably helped by the absence of I/O. In general the benchmarks show that Blub languages perform well but are somewhat verbose, while scripting and Web-friendly languages are concise but have poor performance; Prolog ranks badly in every way, while OCaml and Haskell do well in every way; this fits reality fairly well for number-crunching but not for the Web or AI. Let's acknowledge once again that benchmarks aren't perfect, and forge ahead.

Fact: A language and a compiler are not the same thing.

But the Object Pascal example should show that a language can baby the compiler to give better results. The three arguments go:
  1. Declaring static types and generally programming close to the metal gives the compiler the information it needs to generate an optimally efficient program. That's why C can be fast — it's a close fit to the hardware. Same goes for Java and the JVM.
  2. Using the right abstractions and strong type inferencing lets the compiler get a high-level view of what your algorithm is doing, allowing it to do more optimizations itself. That's why OCaml and Haskell can be fast — they're a close fit to the pure algorithm.
  3. While the expressiveness of new languages like Ruby and Python is appealing, the race to incorporate imperative, object-oriented and functional programming styles into every major language is actually resulting in weaker languages. Borrowing features doesn't bring a language any closer to providing a new model of computation, and it certainly doesn't give a better angle of attack at the whole point of all of this — making the computer hardware do what we want.
The third argument was made by John Backus, creator of Fortran, the Backus-Naur Form for defining programming language syntaxes, and later the FP and FL programming languages.
"Programming languages appear to be in trouble. Each successive language incorporates, with a little cleaning up, all the features of its predecessors plus a few more. [...] Each new language claims new and fashionable features... but the plain fact is that few languages make programming sufficiently cheaper or more reliable to justify the cost of producing and learning to use them."
— John Backus

The talk that began with this argument went on to introduce function-level programming. At the time, everyone thought Backus was talking about functional programming, so it unintentionally gave a boost to Lisp and later the ML family, from which Haskell and OCaml are derived. But no: it was actually about a new language called FP, somewhat based on APL. FP begat FL, which went nowhere, but Morgan Stanley created an ASCII-friendly variant of APL called A+ (which is now free, GPL'd software), and the proprietary J and K have carried the torch since then. The use of these languages now seems to be mostly in the financial world. (Perhaps because it's really very well-suited to financial tasks, and perhaps because that's where APL made its splash — who knows, it may have even Blubbed its way in.)

The main idea is point-free programming: rather than pushing values around (as even functional programming languages do), compose functions together to create an algorithm that only references functions, not values. Then create a basic set of operators that can be composed together to create higher-level functions. This is an excellent way to manipulate arrays and matrices. Haskell touches on this idea but doesn't emphasize it.
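
Python isn't APL or K, but the shift in style can at least be sketched there; this is just an illustration of the idea, with an ordinary "pointed" version alongside for contrast:

def compose(*funcs):
    """Compose functions right to left: compose(f, g)(x) == f(g(x))."""
    def composed(value):
        for func in reversed(funcs):
            value = func(value)
        return value
    return composed

# Pointed style: intermediate values are named and pushed around explicitly
def root_mean_square_pointed(xs):
    squares = [x * x for x in xs]
    mean_of_squares = sum(squares) / float(len(squares))
    return mean_of_squares ** 0.5

# Point-free style: the algorithm is a composition of smaller functions,
# and no intermediate value is ever named
square_all = lambda xs: [x * x for x in xs]
mean = lambda xs: sum(xs) / float(len(xs))
sqrt = lambda x: x ** 0.5

root_mean_square = compose(sqrt, mean, square_all)

assert root_mean_square([3.0, 4.0]) == root_mean_square_pointed([3.0, 4.0])

In an array language the same composition is a handful of built-in operators applied to whole arrays, which is the point.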

Benchmarks for any of these languages are hard to find, but I see one cryptic statement about K here:
 [k is much faster than c on strings and memory management.]

And another startling statement on Wikipedia:
The performance of modern CPUs is improving at a much faster rate than their memory subsystems. The small size of the interpreter and compact syntax of the language makes it possible for K applications to fit entirely within the level 1 cache of the processor. Vector processing makes efficient use of the cache row fetching mechanism and posted writes without introducing bubbles into the pipeline by creating a dependency between consecutive instructions.

It looks pretty convincing to me. Finally, a fresh look at how programming languages make a machine do work.

This seems to be the argument missing from every language war: by removing the non-orthogonal parts of a language, it becomes more powerful. K doesn't have objects or continuations, and it doesn't need them. Likewise, Haskell restricts the ability to modify state to monads, Erlang's flow-control constructs throw out traditional iteration entirely, and Lisp virtually strips out syntax itself.

Programming languages should be designed not by piling feature on top of feature, but by removing the weaknesses and restrictions that make additional features appear necessary.
— Revised5 Report on the Algorithmic Language Scheme


The corollary to (and irony of) the Blub paradox is that since these optimized languages are missing constructs found in Blub — by design — a Blub programmer will always have plenty to pick on.

Thursday, November 8, 2007

Curly-brace wrangling

This is not new, but neither is C:

http://www.chris-lott.org/resources/cstyle/witters_tao_coding.html

I'm investigating coding style, and this is the best style I've seen so far -- at least for C, C++ and Javascript, which appear to be the languages most susceptible to oddly formatted code, anyway. I flailed against the extra space inside parens for a little while, but now I see the wisdom of it (for C and C++, since parens should sting a little there). Combine Witter's style with typedefs for function pointers, and I think I'm set for inflicting well-informed pedantry on any new code that comes under my gaze.

The two other sources on programming style/zen that impressed me are also referenced here: the style guide for the Linux kernel is short and sweet, and Fred Brooks' "No Silver Bullet" is, like, important. Rob Pike's article is also interesting, especially at the end. There's a lot of cross-referencing between all of these and the Wikipedia entry on Unix philosophy.

Thursday, August 2, 2007

The Right Tool For The Job: Scripting


Though it's barely planned
The kludgiest of Perl scripts
Is one day maintained


I've been learning Perl lately, after having used Python wherever possible for a couple of years. It's gut-wrenching. So today's pedantry is on the topic of scripting languages -- interpreted, batteries-included, "#!/usr/bin/env"-ready languages for getting a simple job done with a minimum of hassle, as I'm defining it.

Google for "little $LANG script", in quotes, replacing $LANG with each of the most well-known scripting languages. My results:

Table 1:

$LANG @Hits
===== =====
Perl 32,300
shell 24,400 * what does this mean, exactly?
PHP 15,500
Python 12,000
VB 1080 * Skewed, because "vb script" is also a language
bash 808
batch 624
Ruby 511
Tcl 411
js 271
sh 266
vim 76
C++ 7
scheme 7
lisp 6
emacs 3
haskell 3


To further abuse Internet statistics, let's search for each language on Google Code:

Table 2:

$LANG @HITS
===== =====
C++ 6,000,000
Perl 1,420,000
Python 1,050,000
PHP 1,590,000
shell 879,000
Ruby 304,000
Lisp 238,000 * includes elisp
Javascript 212,000
Basic 202,000
Tcl 186,000
bat 183,000
Scheme 103,000
Haskell 67,700


Now, combine these two tables to get a ratio representing the "scriptability" of each language. Or rather, divide the Google Code hits by "little script" hits to get a "Script Factor" inversely proportional to the fraction of existing code that qualifies as little scripts. This is hard science.
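
Spelled out, the computation is just one division per language; here it is for three of the rows, using the numbers from Tables 1 and 2:

# The whole "analysis": Google Code hits divided by "little $LANG script" hits
google_code_hits = {'Perl': 1420000, 'Python': 1050000, 'Ruby': 304000}
little_script_hits = {'Perl': 32300, 'Python': 12000, 'Ruby': 511}

for lang in ('Perl', 'Python', 'Ruby'):
    factor = google_code_hits[lang] / float(little_script_hits[lang])
    print("%s: Script Factor %.0f" % (lang, factor))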

Table 3:

$LANG Google Code "Little" Script Factor Notes
==== =========== ======== ============= =====
shell 879,000 24,400 36 * "Shell" is vague
Perl 1,420,000 32,300 44
Python 1,050,000 12,000 88
PHP 1,590,000 15,500 103
Basic 202,000 1080 187 * Includes non-Visual basics
Batch 183,000 624 293
Tcl 186,000 411 453
Ruby 304,000 511 595
Javascript 212,000 271 782
Scheme 103,000 7 14,714
Haskell 67,700 3 22,567
Lisp 238,000 9 26,444 * Includes elisp and common lisp
C++ 6,000,000 7 857,143


Interestingly, this is close to the first list of "little script" languages, with the three P's right up top. The functional languages I threw in for fun are ranked by absurdly small denominators, so I wouldn't say the results are meaningful beyond indicating that even the hardcore people using these languages for real projects are using P-languages and the shell for simple scripts.

What does this all mean?

  • Scripters are using the right tool for the job. Good scripting languages float to the top.

  • Even the most hardcore Lisp and Haskell programmers use something else for scripting. In other words, they know multiple languages, and they, too, use the right tool for the job.

  • There are seven idiots in the world writing scripts in C++. One would only do this if unaware of any other scriptable language, and therefore capable of using only one tool for any job.

  • Emacs users call their customizations "packages" or "modes," not "scripts." Foiled.

Now let's get specific.

Shell

"Shell" came in first in scriptability, second-place in "little $lang script," but merely fifth in Google Code usage. So shell scripting is a popular way to get things done, but not so much for writing full-on applications.

What language is shell scripting, exactly? I'm assuming the search hits refer to bash, ksh, csh, zsh, and the rest of the Unix shells, mostly because that's how it showed up on Google Code, and because bash seems to be the default on the major Linux distros. Plus, Windows programmers don't talk about the "shell"; if they wade into the muck of cmd.exe at all, they call it batch, DOS, or occasionally command-line scripting. And they don't talk about it online as much as Unix/Linux gurus, outside a few Microsoft-specific websites, from what I've seen.

The strengths of the shell are (1) everything is a string; (2) courtesy of Unix design, the sources and recipients of character streams consistently look like filenames; (3) complex programs can be used like functions and filters, directly adding to the shell's abilities (the ultimate FFI, in a way); (4) since code can be data and commands can be piped and redirected around, flow control can be pretty concise. The flaws, as I see them, are (1) everything is a string, meaning nontrivial structures must be serialized and parsed at every step; (2) there are few guarantees about what's actually available to the shell on a given system -- paths, environment variables, program versions -- so sharing scripts between systems is wildly unreliable. Still, I've never seen a GUI tool as broadly useful as the shell is for getting computer tasks done.

Perl

Legend has it that Larry Wall designed Perl to pull together all of the various Unix sysadmin tools into one effective package, with the plan for it to be especially useful for text manipulation (Reporting and Extraction). So C, bash, awk, sed, grep, and friends are all in there -- in short, it keeps the shell's advantages and does its best to eliminate the disadvantages. (Best of all, it finally got regular expressions right.) And then there's CPAN. I'm not surprised that Perl is #1 for "little scripts" that are just complex enough to be worth saving.

What is Perl the right tool for?
  • One-liners that bash doesn't have an equivalent for -- Perl is installed almost everywhere bash is
  • Straightforward text-processing scripts (Python's immutable strings are a weakness here, and Ruby installations still aren't a universal default)
  • It was a great server-side scripting language during the first dotcom boom (though Java managed to cast itself as the more legit (enterprisey) big brother here). Since Perl coders weren't afraid to get things done "right now," mod_perl made the combination of Apache and Perl effective, scalable, and most importantly, available just when it was needed.

Python

Python fixes Perl, says the next legend. But its strength as a scripting language is that it fixes Java, too -- and as it turns out, Python's "scriptability" is exactly half that of Perl's. Spooky, no?

I like Python. It makes sense to C programmers and Unix hermits. And, thanks to Guido's diligent attention to aesthetics, ugly Python code almost always means you're doing something awkward, slow or wrong. The language rewards good behavior with readable, concise code. You know that whitespace issue where if you copy code from a forum and paste it into your own code, the interpreter will crap out on the indentation? It's punishing you for blind copy-and-paste. Doesn't that creep you out a little? Guido is basically handing out candy if you read the documentation on generator expressions, and slapping you on the wrist if you don't read your own code before running it.

There doesn't seem to be a single theoretical approach that guarantees a language will work that way, but for Python, it seemed to work.

What is Python the wrong job for?
  • One-liners -- remember that thing about whitespace?
  • Unix tasks that have already been thoroughly solved with existing command-line tools (see Bash).
  • Number crunching (by itself, but see SciPy and Parallel Python). Python 3.0 borrows most of Scheme's numerical tower, so that may improve the situation.