Tuesday, October 5, 2010

Bio.Phylo: A unified phylogenetics toolkit for Biopython

I presented this at the Bioinformatics Open Source Conference (BOSC 2010) in early July, but somehow forgot to post it here too. It's an overview of my somewhat new sub-package for working with phylogenetic trees in Biopython, based on my Google Summer of Code 2009 project (a phyloXML parser in Biopython).

In a nutshell, Bio.Phylo is a library for manipulating finished phylogenetic trees and integrating them into a Biopython-based workflow. It handles the standard file formats Newick, Nexus and phyloXML (NeXML is not supported yet), and has particularly good support for phyloXML.

This presentation walks through an example of loading a Newick tree, viewing it a few different ways, adding branch colors, and saving it as a phyloXML file.
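
For anyone who'd rather try that workflow than read the slides, here's a minimal sketch of it. It uses the current Bio.Phylo API, which may differ in details from the 2010 release shown in the talk. The toy Newick string, the `color_and_convert` helper and the output filename are made up for illustration; the talk used a real tree file.

```python
import os
import tempfile
from io import StringIO

from Bio import Phylo

# Hypothetical toy tree standing in for the Newick file used in the talk.
NEWICK = "(((A:1,B:1):1,(C:1,D:1):1):1,E:3);"


def color_and_convert(newick_text, out_path):
    """Load a Newick tree, print it, color one clade, save it as phyloXML."""
    # Load: Phylo.read() expects exactly one tree in the input.
    tree = Phylo.read(StringIO(newick_text), "newick")

    # View: a plain-text dendrogram printed to standard output.
    Phylo.draw_ascii(tree)

    # Convert to a phyloXML-compatible tree; phyloXML can store branch
    # colors, which plain Newick cannot.
    tree = tree.as_phyloxml()

    # Color the clade that contains both A and B.
    mrca = tree.common_ancestor({"name": "A"}, {"name": "B"})
    mrca.color = "blue"

    # Save the annotated tree as a phyloXML file.
    Phylo.write(tree, out_path, "phyloxml")
    return tree


if __name__ == "__main__":
    out_path = os.path.join(tempfile.mkdtemp(), "colored_tree.xml")
    color_and_convert(NEWICK, out_path)
    print("Wrote", out_path)
```

If matplotlib is installed, `Phylo.draw(tree)` gives a graphical view of the same tree, branch colors included.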


The conference abstract is here. I also recommend the main documentation in the Biopython Tutorial (see chapter 12) and the wiki page.

Thursday, April 8, 2010

Google Summer of Code 2010: The final draft

The Google Summer of Code 2010 application period is in its final 24 hours.

I volunteered to mentor with two organizations this year, OBF and NESCent. Last month I posted a couple of ideas with each org:

The applications that have come in have been pretty good; the only thing I can complain about is that nobody has followed through with my MIAPA project -- we got a nibble from one student, but nothing after that.

Since we're doing the last round of application reviews now before the deadline, here's some general guidance on what mentors are looking for in a student application.

First, a couple of outside references:

The Zen of GSoC

Google Summer of Code is a program to recruit and foster new long-term open-source contributors.

Broadly, the mentoring organizations are asking three questions:
  1. Are you motivated enough about this work to continue contributing after the summer?
  2. Can you write useful code on your own?
  3. Do you interact well with the community, so that we can work with you to merge your work cleanly into the trunk and rely on you to maintain the codebase?
You can get a sense of what Google and the mentoring orgs are looking for from the applications the orgs themselves submit to Google. For example: NESCent's 2010 app.

Here are some specific tips for demonstrating that you have some committer in you.

Put your previous work online

It's remarkable how many ostensible programmers just can't write decent code. They'll have a list of successful past projects they worked on, maybe a legitimate degree in computer science, but their code itself was clearly never fully understood by anyone, original programmer included. (Remember, programming languages exist for humans to understand -- the computer itself runs on machine code.) The only way we can be sure you can write code we can use is if we can look at something you've written previously.

Biopython uses GitHub for development, so putting a project of your own on GitHub demonstrates two useful things: you can write functioning code, and you're already up to speed with the development tools that Biopython uses.

If the most relevant code you've written is tied up in some way -- say, it's part of a research project still being prepared for publication -- see if you can use at least a few snippets of it. So far, it seems most professors have been willing to allow that.

Subscribe to your mentoring organization's mailing list

I know, e-mail lists seem at least a decade behind the times. But open-source projects like to have a permanent public record of the discussions that happen, and everyone has an e-mail account. We also have IRC channels and Twitter tags (#phylosoc and #obfsoc), but project proposals are generally more than 140 characters, so it's best to use e-mail at some point.

Plus, you'll be able to read all the advice the other students are getting -- mentors get fatigued as the application season wears on, and once we've written the same thing a few times we start skipping details.

Write a weekly project schedule

The GSoC application has fields for pointing to external info. Create a Google document or spreadsheet (or README.md on GitHub if you're fancy) detailing your project plan week-by-week.

Suggested fields:
  • Date, or week number for referencing later
  • GSoC events and guidelines (see the official timeline)
  • Deliverables for the week — what's produced, e.g. documentation sections, unit tests, classes, modules
  • Approach for each of these tasks, in a few words
  • Potential problems that could occur, specific to the tasks — perhaps a dependency turns out to be inadequate, or an integration step is required
  • Proposed mitigation for each of the foreseen issues

(If you want to estimate the number of hours or days each task will take, that's cool too.)

Here are the examples from previous GSoC projects that we've been sharing on the mailing lists:

Respect the deadlines

Submit a draft of your application to Google at least a day before the deadline, April 9. There are thousands of applicants each year, and Google has no reason to let the deadline slide: an important function of the application process itself is to screen out students who won't deliver by the stated deadlines. In effect, if your application isn't submitted to Google by noon PDT on April 9, then you didn't apply.

BUT: If you submit something even partially complete, we can contact you later during the review stage and get the remaining information from you. And if you included a link to your weekly plan (as a separate online document), you can edit that after the deadline too.

Best of luck!

Wednesday, February 24, 2010

Python workshop #2: Biopython

As promised, here are the slides from Monday's Biopython programming workshop:



This was another 2-hour session, with a short snack break in the middle this time -- which was also a nice opportunity to ask everyone about the pacing, and see who's been following along with the examples in IPython (versus staring at a BSOD or lolcats -- which I didn't notice any of).

This went well:
  • Pacing
  • Using IPython to inspect objects and display documentation -- this lets some people "read ahead" and perhaps answer their own minor questions, leading to other, better questions
  • The general introductory pattern of:
    1. Demonstrate how to import a module and instantiate the basic class
    2. Review, in English, the core features of the module and why they exist
    3. Walk through a short script that uses real data to accomplish some simple but useful task(s)
    4. Display the result, completing the mental pipeline of input -> transformation -> output
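
As a concrete instance of that pattern, here's roughly what a first example with Bio.Seq might look like. It's a sketch rather than one of the workshop slides, and the DNA string is just a short sample coding sequence chosen for illustration. Step 2 happens in English, so it doesn't appear in the code.

```python
# 1. Import the module and instantiate its basic class.
from Bio.Seq import Seq

coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# 3. Use it for a small but useful task: translate the coding sequence
#    up to the first stop codon, and get the opposite strand.
protein = coding_dna.translate(to_stop=True)
antisense = coding_dna.reverse_complement()

# 4. Display the result, so the input -> transformation -> output
#    pipeline is visible from start to finish.
print("DNA:               ", coding_dna)
print("Protein:           ", protein)
print("Reverse complement:", antisense)
```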
Room for improvement:
  • I didn't always execute the final draft of each example, so there were a couple of typos -- inconvenient for those following along in IPython. (I've fixed them in the slides here.)
  • Consequently, I didn't have an output file to show at the end of each example -- so I had to describe or draft one on the spot.
  • The PDB module was the coolest part of the workshop, and I rushed it a bit. I was afraid the visitors from Genetics and Plant Bio would be bored with it, but I don't think they were, and the Bioinformatics folks were left wanting more.
I'm planning to host both Python workshops again in the next academic year, either 1 per semester (as it was this year) or both each semester, maybe 2 weeks apart. The Biopython workshop in particular will be different next time because Bio.Phylo will finally be included with the main Biopython distribution -- evolution is cool, and more of the pretty is always a good thing to have in a programming workshop.

Friday, February 19, 2010

Python workshop #1, now on SlideShare

Last November I hosted a workshop on basic Python programming at UGA. The attendees were mostly from the bioinformatics department, but this workshop didn't go into science at all -- just practical Python usage. Today I finally got around to cleaning up the slides and uploading them to SlideShare:



It looks like LaTeX Beamer and SlideShare's PDF/Flash converter don't play well together. Meh, it's still easy enough to read.

I'm working on a Biopython-specific follow-up right now for a workshop on Monday, 2/22. I'll post that here when it's done, too, with reasonable haste.