<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-266234734515043410</id><updated>2012-01-30T10:26:42.542-05:00</updated><category term='linux'/><category term='scripting'/><category term='education'/><category term='cnr'/><category term='gsoc'/><category term='tools'/><category term='javascript'/><category term='opensuse'/><category term='arch'/><category term='programming'/><category term='perl'/><category term='lisp'/><category term='linspire'/><category term='presentation'/><category term='suse'/><category term='style'/><category term='c'/><category term='function-level'/><category term='blub'/><category term='biopython'/><category term='shell'/><category term='python'/><category term='unix'/><category term='functional'/><category term='languages'/><category term='debian'/><category term='vim'/><category term='statistics'/><category term='productivity'/><category term='ubuntu'/><category term='c++'/><category term='userfriendly'/><category term='R'/><category term='science'/><title type='text'>etalog</title><subtitle type='html'>Kind of a science/tech blog.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://etalog.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://etalog.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Eric Talevich</name><uri>http://www.blogger.com/profile/10168388850793209768</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/-HrScY__e4Nk/TiCn_6ISC-I/AAAAAAAAAMY/zaiWEWFEvbg/s220/selfpic-turtle.png'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>18</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-266234734515043410.post-828981467660701421</id><published>2012-01-16T17:26:00.000-05:00</published><updated>2012-01-16T17:26:04.996-05:00</updated><title type='text'>Building an analysis: How to avoid repeating intermediate tasks in a computational pipeline</title><content type='html'>In my projects, I tend to start with a simple analysis of a limited dataset, then incrementally expand on it with more data and deeper analyses. This means each time I update the data (e.g. add another species' protein sequences) or add another step to the analysis pipeline, everything must be re-run -- but only a small part of the pipeline actually needs to be re-run.&lt;br /&gt;&lt;br /&gt;This is a common problem in bioinformatics:&lt;br /&gt;&lt;a class="reference external" href="http://biostar.stackexchange.com/questions/79/how-to-organize-a-pipeline-of-small-scripts-together"&gt;http://biostar.stackexchange.com/questions/79/how-to-organize-a-pipeline-of-small-scripts-together&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;How can we automate a pipeline like this, without running it all from scratch each time? This is the same problem faced when compiling large programs, and that particular case has been solved fairly well by build tools.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Make&lt;/h3&gt;The &lt;tt&gt;make&lt;/tt&gt; utility is normally used to determine which portions of a program need to be recompiled, and the dependencies between them; it includes special features for managing C code, but actions are specified in terms of shell commands. 
You can just as well use it to specify dependencies between arbitrary tasks.&lt;br /&gt;&lt;br /&gt;Giovanni Dall'Olio has posted some helpful instructional material: &lt;a class="reference external" href="http://bioinfoblog.it/?p=29"&gt;http://bioinfoblog.it/?p=29&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;(One of his links is broken -- here's &lt;a href="http://software-carpentry.org/4_0/make/"&gt;Software Carpentry's current course on &lt;tt&gt;make&lt;/tt&gt;&lt;/a&gt;.)&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Rake, a Really Awesome Make&lt;/h3&gt;If your pipeline is already organized as a set of stand-alone scripts, Rake is a more-or-less ideal solution: &lt;a class="reference external" href="http://rake.rubyforge.org/"&gt;http://rake.rubyforge.org/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;At some point in your project, you'll likely need to do some string processing; this is where &lt;tt&gt;make&lt;/tt&gt; falls down. Ruby happens to be great for this task. Don't worry if you don't know Ruby -- the basic string methods in Ruby, Python and Perl are very similar, and regular expressions work roughly the same way. You can look up everything you need to know on the fly. (I did.)&lt;br /&gt;&lt;br /&gt;Also, Rake's mini-language for specifying tasks is both concise and intuitive. There's a built-in facility for maintaining short descriptions of each step, which is enticing from a "lab notebook" perspective.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;How about a Python solution?&lt;/h3&gt;Inspired by Rake, I wrote a small module called tasks.py. Here's the code, plus a simple script to demonstrate:&lt;br /&gt;&lt;a class="reference external" href="https://gist.github.com/1623117"&gt;https://gist.github.com/1623117&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;This isn't just for the sake of Python evangelism; I have a bunch of special-purpose modules written in Python (for my lab) that I would like to use in an elaborate pipeline of my own.  
It's silly to put each of these steps in a separate, throwaway Python script that then gets called in a separate process by Rake.  Instead, I can import &lt;a href="https://gist.github.com/1623117"&gt;tasks.py&lt;/a&gt; into a Python script and include the dependency-scheduling features in my own programs.&lt;br /&gt;&lt;br /&gt;Key differences from other build systems (e.g. &lt;a href="http://www.scons.org/"&gt;SCons&lt;/a&gt;, &lt;a href="http://code.google.com/p/waf/"&gt;waf&lt;/a&gt;, &lt;a href="http://code.google.com/p/ruffus/"&gt;ruffus&lt;/a&gt;):&lt;br /&gt;&lt;ul class="simple"&gt;&lt;li&gt;This module is not meant to be run from the command line -- only for specific parts of your own code&lt;/li&gt;&lt;li&gt;The implementation of each processing step is separate from the dependency specification (although single-command tasks can still be defined inline with a lambda expression). Separate the algorithm from the data, I always say.&lt;/li&gt;&lt;li&gt;Cleanup is specified within the task, not in a separate area of the Rakefile/Makefile. This makes more sense for a project with heterogeneous processing steps &amp;amp; intermediate files.&lt;/li&gt;&lt;/ul&gt;To be clear, &lt;tt&gt;tasks.py&lt;/tt&gt; is not better than Rake or &lt;tt&gt;make&lt;/tt&gt; for arranging a set of scripts that you've already written. 
I didn't even implement concurrent jobs (because most of the CPU-intensive steps I use are calls to programs that are already multicore-aware; though try adding that feature yourself if you'd like).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/266234734515043410-828981467660701421?l=etalog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://etalog.blogspot.com/feeds/828981467660701421/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=266234734515043410&amp;postID=828981467660701421' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/828981467660701421'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/828981467660701421'/><link rel='alternate' type='text/html' href='http://etalog.blogspot.com/2012/01/building-analysis-how-to-avoid.html' title='Building an analysis: How to avoid repeating intermediate tasks in a computational pipeline'/><author><name>Eric Talevich</name><uri>http://www.blogger.com/profile/10168388850793209768</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/-HrScY__e4Nk/TiCn_6ISC-I/AAAAAAAAAMY/zaiWEWFEvbg/s220/selfpic-turtle.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-266234734515043410.post-4690444897214319883</id><published>2011-11-02T14:42:00.000-04:00</published><updated>2011-11-02T14:42:33.069-04:00</updated><title type='text'>Journal article: Comparative kinomics of the malaria pathogen and its relatives</title><content type='html'>Hot off the presses!&lt;br /&gt;&lt;a href="http://www.biomedcentral.com/1471-2148/11/321/abstract"&gt;Structural and evolutionary 
divergence of eukaryotic protein kinases in Apicomplexa&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;It's a thorough paper, so I'll cover the highlights here.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Why we study apicomplexans&lt;/h3&gt;Apicomplexans are a group of related single-celled organisms which are exclusively parasitic. The best-known member is &lt;a href="https://secure.wikimedia.org/wikipedia/en/wiki/Plasmodium_falciparum_biology"&gt;&lt;i&gt;Plasmodium falciparum&lt;/i&gt;&lt;/a&gt;, which causes the most virulent form of malaria. Another well-studied species is &lt;i&gt;Toxoplasma gondii&lt;/i&gt;, which primarily lives in cats but can infect most mammals.&lt;br /&gt;&lt;br /&gt;It's a hugely diverse group. But overall, we know very little about them.&lt;br /&gt;&lt;br /&gt;Our main motivation for studying apicomplexan proteins is to find what features make them distinct from human proteins, so we can then design drugs to target those features specifically -- the drug will identify and disable the parasite protein without the risk of affecting the host proteins, too. We study protein kinases, in particular, because a number of drugs have already been designed to &lt;a href="http://www.cancerquest.org/kinase-inhibitors.html"&gt;inhibit kinases in cancer&lt;/a&gt;. The same or similar compounds could be used to treat parasitic diseases, potentially.&lt;br /&gt;&lt;br /&gt;From an evolutionary biologist's perspective, apicomplexans are also interesting to study because they belong to an &lt;a href="http://tolweb.org/Eukaryotes"&gt;evolutionary branch&lt;/a&gt; that is quite divergent from the animals, plants and fungi more familiar to us. 
By learning about apicomplexan biology, and comparing to other model organisms, we can learn more about eukaryotic diversity and the origin of eukaryotes.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Another perspective on the tree of life&lt;/h3&gt;Many people, including scientists, think of evolution as a ladder, with single-celled organisms at the bottom and humans at the top. Different lineages, like green plants and fungi, each branch off the ladder at some intermediate point, but evolution is nonetheless mistakenly thought of as a directed progression from bacteria to protists to fish to humans.&lt;br /&gt;&lt;br /&gt;That's wrong. It leads to mistakes, such as considering all protists (single-celled eukaryotes) to be closely related to each other. But even within Apicomplexa, the evolutionary distance between &lt;i&gt;Plasmodium falciparum&lt;/i&gt; and &lt;i&gt;Toxoplasma gondii&lt;/i&gt; is as great as the distance between humans and sponges.&lt;br /&gt;&lt;br /&gt;I'm particularly proud of Figure 1 in the paper, which includes a species tree that inverts the traditional view: The closest human relative, yeast, is at the bottom, and layers of increasingly strange and unfamiliar protists build up to the &lt;i&gt;Plasmodium&lt;/i&gt; genus.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Interesting features of proteins and genomes&lt;/h3&gt;When apicomplexan parasites invade a host, they secrete a mixture of dozens of different proteins into a &lt;a href="http://www.nature.com/nrmicro/journal/v6/n1/fig_tab/nrmicro1800_F2.html"&gt;protective vacuole&lt;/a&gt; formed from the host cell membrane. We'd expect that some of these proteins are essential for invasion and virulence, and therefore good targets for inhibition or diagnosis.&lt;br /&gt;&lt;br /&gt;Two apicomplexan-specific protein kinase families are known to be exported. The FIKK family appears in 1 copy in most apicomplexans, but is amplified to 21 copies in &lt;i&gt;P. falciparum&lt;/i&gt; and 6 copies in &lt;i&gt;P. 
reichenowi&lt;/i&gt;, and does not appear in any species outside the Apicomplexa. Another family, called rhoptry kinases (ROPK) after the apicomplexan organelle they're localized to, appears in dozens of copies in coccidians (&lt;i&gt;T. gondii&lt;/i&gt;, &lt;i&gt;Neospora caninum&lt;/i&gt;, &lt;i&gt;Eimeria tenella&lt;/i&gt;, &lt;i&gt;Sarcocystis neurona&lt;/i&gt;), but not in any other lineage of Apicomplexa. &lt;i&gt;Plasmodium&lt;/i&gt; and others still contain rhoptries, but there are no kinases in the protein cocktail those rhoptries contain.&lt;br /&gt;&lt;br /&gt;As obligate parasites, apicomplexans evolve under different evolutionary constraints than free-living organisms like yeast and humans. Many genes are no longer necessary, and some may even be a liability if they interact with the host's own biochemical pathways. Because of this, we see widespread gene loss and overall compaction of apicomplexan genomes.&lt;br /&gt;&lt;br /&gt;One especially curious case is the loss of upstream regulators of the MAPK cascade -- a signaling pathway found in almost all eukaryotes, consisting of 3 or 4 protein kinases each activating the next in a sort of biochemical relay. Apicomplexans contain 2 to 3 copies of the downstream protein kinase, MAPK, but the rest of the pathway components (STE7, STE11, STE20) are generally lost, and none of the surveyed apicomplexans had a complete MAPK cascade. So there's an open question: What other proteins take the place of the STEs in this important pathway, or have MAPK-like features? 
Is there an Achilles heel to be discovered?&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;The project&lt;/h3&gt;We:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Identified and &lt;b&gt;classified&lt;/b&gt; the full set of protein kinases in each of the 17 apicomplexan proteomes available &lt;/li&gt;&lt;li&gt;Devised a pipeline to identify &lt;b&gt;apicomplexan-specific ortholog groups&lt;/b&gt; in known protein kinase families&lt;/li&gt;&lt;li&gt;Compared these ortholog groups to the typical members of the kinase family to find specific &lt;b&gt;sequence motifs&lt;/b&gt; that distinguish the divergent ortholog group&lt;/li&gt;&lt;li&gt;Mapped these motifs onto protein &lt;b&gt;structures&lt;/b&gt;; reviewed the literature to understand possible functions and &lt;b&gt;functional differences&lt;/b&gt; related to these motifs&lt;/li&gt;&lt;/ol&gt;Read about what we found &lt;a href="http://www.biomedcentral.com/1471-2148/11/321/abstract"&gt;here&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/266234734515043410-4690444897214319883?l=etalog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://etalog.blogspot.com/feeds/4690444897214319883/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=266234734515043410&amp;postID=4690444897214319883' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/4690444897214319883'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/4690444897214319883'/><link rel='alternate' type='text/html' href='http://etalog.blogspot.com/2011/11/journal-article-comparative-kinomics-of.html' title='Journal article: Comparative kinomics of the malaria pathogen and its relatives'/><author><name>Eric 
Talevich</name><uri>http://www.blogger.com/profile/10168388850793209768</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/-HrScY__e4Nk/TiCn_6ISC-I/AAAAAAAAAMY/zaiWEWFEvbg/s220/selfpic-turtle.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-266234734515043410.post-3691021823526045130</id><published>2011-10-28T10:32:00.000-04:00</published><updated>2011-10-28T10:32:01.873-04:00</updated><title type='text'>Journal article: Our insights into the structure and activation mechanism of ErbB/EGFR protein kinases</title><content type='html'>Here's an article my lab published in &lt;i&gt;PLoS One&lt;/i&gt;:&lt;br /&gt;&lt;a href="http://dx.plos.org/10.1371/journal.pone.0014310"&gt;Co-Conserved Features Associated with cis Regulation of ErbB Tyrosine Kinases&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I'll give a quick summary of it here. (Don't worry, this isn't a new direction for this blog.)&lt;br /&gt;&lt;br /&gt;This is a study of the structural mechanisms of a certain protein family, called &lt;a href="http://en.wikipedia.org/wiki/ErbB"&gt;ErbB&lt;/a&gt; or EGFR (epidermal growth factor receptor), which is frequently involved in cancer. This family belongs to a protein superfamily called &lt;b&gt;&lt;a href="http://kinase.com/wiki/index.php/Introduction_to_Kinases"&gt;protein kinases&lt;/a&gt;&lt;/b&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Biochemistry background&lt;/h3&gt;Kinases are enzymes which perform a type of post-translational modification, &lt;b&gt;phosphorylation&lt;/b&gt;: The kinase transfers a phosphate group from adenosine triphosphate (&lt;b&gt;ATP&lt;/b&gt;) to another &lt;b&gt;substrate&lt;/b&gt; molecule, leaving adenosine diphosphate (ADP) and the phosphorylated substrate.&lt;br /&gt;&lt;br /&gt;Protein kinases are kinases that act on protein substrates, i.e. the phosphorylated molecule is another protein. 
The substrate could even be another protein kinase, so activation of the first protein kinase causes it to phosphorylate and activate another protein kinase, and so on. This is a type of &lt;b&gt;signal transduction&lt;/b&gt;.&lt;br /&gt;&lt;br /&gt;Signal transduction is how the cell senses and reacts to its environment, and also its own internal conditions. In the case of ErbB and other receptor tyrosine kinases, the signal starts at the surface of the cell (e.g. epidermal growth factor binds to the extracellular portion of EGFR) and activates the kinase, which then begins sending these phosphorylation signals. These signals are then relayed throughout the cell to trigger other activities, such as &lt;b&gt;cell division&lt;/b&gt; or the &lt;b&gt;transcription&lt;/b&gt; of certain genes.&lt;br /&gt;&lt;br /&gt;What happens if a protein kinase gets "locked" into the active state, somehow? In the case of EGFR, it's as if the cell thinks it's constantly receiving the growth factor. If this signal isn't blocked by another "gatekeeper" in the cell, then the cell will grow uncontrollably -- and become &lt;b&gt;cancer&lt;/b&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;How the enzyme works&lt;/h3&gt;Protein kinases (PKs) all consist of two large lobes connected by a flexible hinge. Between the lobes is a binding pocket for ATP; this molecule binds inside the smaller lobe (N-terminal lobe, or N-lobe). The larger lobe (C-terminal lobe or C-lobe) provides a binding site for another protein, which will be the kinase's substrate.&lt;br /&gt;&lt;br /&gt;The general mechanism of all protein kinases goes like this:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;The kinase is initially in an &lt;b&gt;inactive&lt;/b&gt; state, with the hinge "open" and the two lobes a bit further apart. 
Since ATP binds in the N-lobe and the substrate binds to the C-lobe, no phosphate is transferred when the two lobes are apart like this.&lt;/li&gt;&lt;li&gt;By some mechanism (it varies between different kinase families), the two lobes are brought closer together, and the kinase becomes &lt;b&gt;active&lt;/b&gt;.&lt;/li&gt;&lt;li&gt;ATP binds to the ATP binding pocket, a substrate binds to the C-lobe, some amino acids shift, and a &lt;b&gt;phosphate&lt;/b&gt; group is detached from ATP and reattached to a specific amino acid on the substrate.&lt;/li&gt;&lt;li&gt;The ADP and phosphorylated substrate are released.&lt;/li&gt;&lt;/ol&gt;Step 2 is the part we're interested in. How do some recurring, cancer-associated mutations cause EGFR to become "locked" in the active conformation? And, can we reverse it?&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;How we think ErbB kinases work&lt;/h3&gt;In the ErbB family, it's not just the two lobes of the kinase domain that are involved in activating the enzyme -- the adjacent sections of the protein, outside the kinase domain, are also involved.&lt;br /&gt;&lt;br /&gt;The long C-terminal tail wraps back around the entire kinase domain and associates with the N-lobe, tethered in place by a few residues in the N-lobe and the other N-terminal flanking region (the juxtamembrane segment, between the kinase domain and the cell membrane). The C-tail is placed so that it can influence the movement and relative positioning of the N- and C-lobes, and therefore regulate the activation of the kinase.&lt;br /&gt;&lt;br /&gt;We also examined the locations of two EGFR mutations (S768I and L861Q) that have been previously identified as occurring frequently in cancers, mapping them onto the structure. 
These mutations appear in locations that would disrupt the switching mechanism we proposed -- breaking necessary interactions, or forming new interactions that shouldn't be there for proper EGFR function.&lt;br /&gt;&lt;br /&gt;If you'd like to know more, read about it &lt;a href="http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0014310"&gt;here&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/266234734515043410-3691021823526045130?l=etalog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://etalog.blogspot.com/feeds/3691021823526045130/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=266234734515043410&amp;postID=3691021823526045130' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/3691021823526045130'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/3691021823526045130'/><link rel='alternate' type='text/html' href='http://etalog.blogspot.com/2011/10/journal-article-our-insights-into.html' title='Journal article: Our insights into the structure and activation mechanism of ErbB/EGFR protein kinases'/><author><name>Eric Talevich</name><uri>http://www.blogger.com/profile/10168388850793209768</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/-HrScY__e4Nk/TiCn_6ISC-I/AAAAAAAAAMY/zaiWEWFEvbg/s220/selfpic-turtle.png'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-266234734515043410.post-2232711059182312779</id><published>2011-07-07T15:28:00.002-04:00</published><updated>2011-09-27T21:32:16.043-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>The statistics of the "Like" button</title><content type='html'>The launch of &lt;b&gt;Google+&lt;/b&gt; reminded me of a question I've had about Facebook and YouTube for a while: What happens when you click the "Like" button?&lt;br /&gt;&lt;br /&gt;Facebook isn't so much about sharing &lt;i&gt;content&lt;/i&gt; as &lt;i&gt;sharing&lt;/i&gt; content. But YouTube and many other sites like it recommend content based on users' response to the content itself, rather than the shape of the surrounding social network. If you're building an application like this yourself, this article is for you.&lt;br /&gt;&lt;br /&gt;Think of a collection of user-generated pages with a "Like" or "+1" button on each. Users can browse pages at will, or arrive at them randomly from an external site, and after viewing a page will either click the "Like" button or do nothing. I'll refer to the number of times viewers do nothing on a page as "Don't care".  I'll also assume you have a site-wide "Top Pages" chart that users can view to see the highest-scoring pages and jump to them.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;1. Counting "Likes"&lt;br /&gt;&lt;/h3&gt;The very simplest way to score a page is to count the number of times it's been "Liked":&lt;br /&gt;&lt;pre class="literal-block"&gt;score = page_likes&lt;/pre&gt;The main problem with this method is inertia. Old "champion" pages accumulate votes over time, and dominate the rankings. New pages don't have a chance to unseat the champs, even temporarily, to gain visibility for themselves. The site appears stagnant.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;2. 
"Likes" versus views&lt;br /&gt;&lt;/h3&gt;To make good recommendations, you want to measure the quality of a page &amp;mdash; the chance that a user will like the recommended page themselves:&lt;br /&gt;&lt;pre class="literal-block"&gt;score = page_likes / page_views&lt;/pre&gt;In the long run, this is correct &amp;mdash; you estimate the probability based on the frequency of "Likes" in previous views. Old "champion" pages will be unseated in the rankings if a newer page earns a better proportion of likes.&lt;br /&gt;&lt;br /&gt;But for new or little-viewed pages, there's an issue of sample size.&lt;br /&gt;&lt;blockquote&gt;&lt;ul class="simple"&gt;&lt;li&gt;A page where the first view is "Liked" (probably by the creator/uploader) scores 100% and shoots to the top of the rankings. If a few friends all immediately "Like" the page, it becomes difficult to unseat. There's a lot of noise at the top of the rankings.&lt;/li&gt;&lt;li&gt;A page where the first view is not "Liked" scores 0% and sinks to the bottom of the rankings. If you have some mechanism for purging bad content from your site (i.e. deleting low-scoring pages that are likely spam, trolls or just lame), then this makes that task more difficult.&lt;/li&gt;&lt;/ul&gt;&lt;/blockquote&gt;Intuitively, a page with 80 likes out of 100 views is more likely to be good than a page with 4 likes out of 5 views. A page with zero likes out of 100 views is almost certainly junk, but 5 views without any likes may not mean much at all.&lt;br /&gt;&lt;br /&gt;So, your next goal: Make the best possible estimate of a page's likeability based on the first few views and likes, using some prior knowledge. After that, all reasonable methods should converge on the same score (probability of liking). If a meme catches, people will be able to find that page through other means, and your own rankings will be less crucial to its success.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;3. 
Pseudocounts&lt;br /&gt;&lt;/h3&gt;A &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Pseudocount"&gt;pseudocount&lt;/a&gt; is a prior estimate of the probability of an event. To make an estimate of actual probabilities based on a small number of samples (the problem at the end of Step 2), add the pseudocounts to the actual counts of each event.&lt;br /&gt;&lt;br /&gt;I'll demonstrate.&lt;br /&gt;&lt;br /&gt;The events here are (a) "Like" and (b) "Don't care".  I'm going to use &lt;i&gt;b&lt;/i&gt; to represent the pseudocount for "Like".  For this section, I choose the probabilities:&lt;br /&gt;&lt;pre class="literal-block"&gt;like = b = .1&lt;br /&gt;dontcare = (1 - b) = .9&lt;/pre&gt;The two probabilities should sum to 1.&lt;br /&gt;&lt;br /&gt;How do you get these values? Since &lt;i&gt;b&lt;/i&gt; represents the probability that a random user will like an arbitrary page, taking the site-wide average of likes versus views is a good choice:&lt;br /&gt;&lt;pre class="literal-block"&gt;b = all_likes / all_views&lt;/pre&gt;To use the pseudocounts, add them to the counts in the formula in step 2:&lt;br /&gt;&lt;pre class="literal-block"&gt;score = (page_likes + b) / (page_views + 1)&lt;/pre&gt;(Recall: &lt;tt class="docutils literal"&gt;views = likes + dontcares&lt;/tt&gt;; after adding pseudocounts, &lt;tt class="docutils literal"&gt;(likes + b) + (dontcares + 1 - b) = likes + dontcares + 1 = views + 1&lt;/tt&gt;.)&lt;br /&gt;&lt;br /&gt;If the database-wide sums of likes and views are large numbers, this won't significantly affect the "average" score, &lt;tt class="docutils literal"&gt;all_likes / all_views&lt;/tt&gt;. But it smoothes out the initial scoring for new pages.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Example:&lt;/b&gt; Assume 10% of all views result in a "Like" (b = 0.1).&lt;br /&gt;&lt;br /&gt;A single view without a "Like" places the page slightly below the global average, but not too much. (Odds are, 90% of pages will start out this way.) 
Additional views without a "Like" slowly sink the page score toward 0.&lt;br /&gt;&lt;br /&gt;&lt;table border="1" class="docutils"&gt;&lt;colgroup&gt; &lt;col width="34%"&gt;&lt;/col&gt; &lt;col width="34%"&gt;&lt;/col&gt; &lt;col width="31%"&gt;&lt;/col&gt; &lt;/colgroup&gt; &lt;thead valign="bottom"&gt;&lt;tr&gt;&lt;th class="head"&gt;Likes:Views&lt;/th&gt; &lt;th class="head"&gt;Calculation&lt;/th&gt; &lt;th class="head"&gt;Percentage&lt;/th&gt; &lt;/tr&gt;&lt;/thead&gt; &lt;tbody valign="top"&gt;&lt;tr&gt;&lt;td&gt;0:1&lt;/td&gt; &lt;td&gt;0.1 / 2&lt;/td&gt; &lt;td&gt;5.0%&lt;/td&gt; &lt;/tr&gt;&lt;tr&gt;&lt;td&gt;0:2&lt;/td&gt; &lt;td&gt;0.1 / 3&lt;/td&gt; &lt;td&gt;3.3%&lt;/td&gt; &lt;/tr&gt;&lt;tr&gt;&lt;td&gt;0:3&lt;/td&gt; &lt;td&gt;0.1 / 4&lt;/td&gt; &lt;td&gt;2.5%&lt;/td&gt; &lt;/tr&gt;&lt;/tbody&gt; &lt;/table&gt;&lt;br /&gt;A single view with a "Like" gives the page a boost, but not to 100%.  This can help it gain traction, but probably won't put it in the top rankings (yet).  
If subsequent views are also liked, the score continues to rise:&lt;br /&gt;&lt;br /&gt;&lt;table border="1" class="docutils"&gt;&lt;colgroup&gt; &lt;col width="34%"&gt;&lt;/col&gt; &lt;col width="34%"&gt;&lt;/col&gt; &lt;col width="31%"&gt;&lt;/col&gt; &lt;/colgroup&gt; &lt;thead valign="bottom"&gt;&lt;tr&gt;&lt;th class="head"&gt;Likes:Views&lt;/th&gt; &lt;th class="head"&gt;Calculation&lt;/th&gt; &lt;th class="head"&gt;Percentage&lt;/th&gt; &lt;/tr&gt;&lt;/thead&gt; &lt;tbody valign="top"&gt;&lt;tr&gt;&lt;td&gt;1:1&lt;/td&gt; &lt;td&gt;1.1 / 2&lt;/td&gt; &lt;td&gt;55.0%&lt;/td&gt; &lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2:2&lt;/td&gt; &lt;td&gt;2.1 / 3&lt;/td&gt; &lt;td&gt;70.0%&lt;/td&gt; &lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3:3&lt;/td&gt; &lt;td&gt;3.1 / 4&lt;/td&gt; &lt;td&gt;77.5%&lt;/td&gt; &lt;/tr&gt;&lt;tr&gt;&lt;td&gt;4:4&lt;/td&gt; &lt;td&gt;4.1 / 5&lt;/td&gt; &lt;td&gt;82.0%&lt;/td&gt; &lt;/tr&gt;&lt;/tbody&gt; &lt;/table&gt;&lt;br /&gt;Away from the extremes (0% or 100% liked), the effect of pseudocounts is less dramatic, and a mix of "Like" and "Don't Care" (viewed without liking) results in a score closer to what you'd see without pseudocounts &amp;mdash; just shifted slightly toward the site-wide average. 
Notice that a page with two "Likes" out of three views (2:3) is scored almost as well as one "Like" and one view (1:1 above).&lt;br /&gt;&lt;br /&gt;&lt;table border="1" class="docutils"&gt;&lt;colgroup&gt; &lt;col width="34%"&gt;&lt;/col&gt; &lt;col width="34%"&gt;&lt;/col&gt; &lt;col width="31%"&gt;&lt;/col&gt; &lt;/colgroup&gt; &lt;thead valign="bottom"&gt;&lt;tr&gt;&lt;th class="head"&gt;Likes:Views&lt;/th&gt; &lt;th class="head"&gt;Calculation&lt;/th&gt; &lt;th class="head"&gt;Percentage&lt;/th&gt; &lt;/tr&gt;&lt;/thead&gt; &lt;tbody valign="top"&gt;&lt;tr&gt;&lt;td&gt;1:2&lt;/td&gt; &lt;td&gt;1.1 / 3&lt;/td&gt; &lt;td&gt;36.7%&lt;/td&gt; &lt;/tr&gt;&lt;tr&gt;&lt;td&gt;1:3&lt;/td&gt; &lt;td&gt;1.1 / 4&lt;/td&gt; &lt;td&gt;27.5%&lt;/td&gt; &lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2:3&lt;/td&gt; &lt;td&gt;2.1 / 4&lt;/td&gt; &lt;td&gt;52.5%&lt;/td&gt; &lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2:4&lt;/td&gt; &lt;td&gt;2.1 / 5&lt;/td&gt; &lt;td&gt;42.0%&lt;/td&gt; &lt;/tr&gt;&lt;/tbody&gt; &lt;/table&gt;&lt;br /&gt;To increase the effect of pseudocounts, you can put a higher weight on the prior by multiplying the pseudocounts by some constant. If the weighting factor is &lt;i&gt;w&lt;/i&gt;, then the calculation is:&lt;br /&gt;&lt;pre class="literal-block"&gt;score = (page_likes + (b * w)) / (page_views + w)&lt;/pre&gt;Think of this as the number of "imaginary" users you have rating each page before any real users see it. The calculations above use a weight of 1, equivalent to one user giving a fractional score of .1 to every page before it goes live, and you can see the effect of it. Play with it a bit to see how it affects your rankings.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;4. Statistical significance&lt;br /&gt;&lt;/h3&gt;You now have a score for each page in your database and a top-to-bottom ranking. 
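As a rough sketch of the scoring and ranking built up in steps 2 and 3, the following Python assumes a hypothetical in-memory mapping of page names to (likes, views) counts. The function names, the dictionary layout and the sample counts are illustrative assumptions, not code from any real site.

```python
def pseudocount_score(page_likes, page_views, b, w=1.0):
    """Smoothed like-rate from step 3: (likes + b*w) / (views + w)."""
    return (page_likes + b * w) / (page_views + w)


def rank_pages(pages, w=1.0):
    """Return (name, score) pairs sorted from highest to lowest score.

    `pages` maps a page name to a (likes, views) tuple.
    Assumes the site has at least one view in total.
    """
    all_likes = sum(likes for likes, views in pages.values())
    all_views = sum(views for likes, views in pages.values())
    b = all_likes / all_views  # site-wide prior: all_likes / all_views
    scored = [(name, pseudocount_score(likes, views, b, w))
              for name, (likes, views) in pages.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical counts, for illustration only.
pages = {
    "champion": (80, 100),
    "newcomer": (4, 5),
    "dud": (0, 100),
    "unseen": (0, 5),
}
ranking = rank_pages(pages)
for name, score in ranking:
    print(name, round(score, 3))
```

With these made-up counts the site-wide prior works out to b = 0.4, and the ranking comes out as champion, newcomer, unseen, dud. The 80-of-100 page edges out the 4-of-5 page, and the page with 100 unliked views sinks below the one with only 5, two cases that the raw likes/views ratio scores as ties.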
Where do you draw the line for "recommending" a page?&lt;br /&gt;&lt;h4&gt;Quantiles: Best and worst&lt;br /&gt;&lt;/h4&gt;Having sorted all the pages by score, take the top 5% as the "best" and bottom 5% as the "worst". Or choose a fixed number, like 25. It's really up to you.&lt;br /&gt;&lt;br /&gt;The "best" ranking is for users, especially new visitors. Depending on your application, some users might also be interested in the "worst" pages &amp;mdash; how else would we find gems like "Friday"?&lt;br /&gt;&lt;h4&gt;Contingency: Is the score meaningful?&lt;br /&gt;&lt;/h4&gt;Another challenge is to determine when a page's score is statistically meaningful &amp;mdash; i.e. the difference between a score of 55% based on 1000 views versus a single view. Using pseudocounts addresses this to some extent at the extremes, but it's still possible for pages with low view counts to score highly. You may also want to purge "junk" content with horribly low rankings &amp;mdash; but only once it's been given a fair chance.&lt;br /&gt;&lt;br /&gt;With the &lt;tt class="docutils literal"&gt;like&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;dontcare&lt;/tt&gt; counts, site-wide and per-page, set up a 2x2 contingency table:&lt;br /&gt;&lt;br /&gt;&lt;table border="1" class="docutils"&gt;&lt;colgroup&gt; &lt;col width="32%"&gt;&lt;/col&gt; &lt;col width="26%"&gt;&lt;/col&gt; &lt;col width="42%"&gt;&lt;/col&gt; &lt;/colgroup&gt; &lt;thead valign="bottom"&gt;&lt;tr&gt;&lt;th class="head"&gt;&lt;/th&gt; &lt;th class="head"&gt;like &lt;/th&gt; &lt;th class="head"&gt;dontcare &lt;/th&gt; &lt;/tr&gt;&lt;/thead&gt; &lt;tbody valign="top"&gt;&lt;tr&gt;&lt;td&gt;Page&lt;/td&gt; &lt;td&gt;A&lt;/td&gt; &lt;td&gt;B&lt;/td&gt; &lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Global&lt;/td&gt; &lt;td&gt;C&lt;/td&gt; &lt;td&gt;D&lt;/td&gt; &lt;/tr&gt;&lt;/tbody&gt; &lt;/table&gt;&lt;br /&gt;To evaluate the significance, use a Chi-square test with one degree of freedom (df=1), or if you're picky, &lt;a 
class="reference external" href="http://en.wikipedia.org/wiki/Fisher%27s_exact_test"&gt;Fisher's exact test&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The chi-square test, in R:&lt;br /&gt;&lt;pre class="literal-block"&gt;&amp;gt; abcd = matrix(c(4, 10, 1000, 10000), nrow=2, byrow=T)&lt;br /&gt;&amp;gt; chisq.test(abcd)&lt;br /&gt;&lt;br /&gt;        Pearson's Chi-squared test with Yates' continuity correction&lt;br /&gt;&lt;br /&gt;data:  abcd&lt;br /&gt;X-squared = 4.2691, df = 1, p-value = 0.03881&lt;/pre&gt;With the common p-value cutoff ("alpha") of 0.05, we'd say this page with 4 likes out of 14 views (4 "Like", 10 "Don't Care") is significant &amp;mdash; for that cutoff, at least. But if we applied the same cutoff to every page in the database, we'd often be wrong.&lt;br /&gt;&lt;br /&gt;I'll try to be quick about this, because it matters.&lt;br /&gt;&lt;br /&gt;Remember: A cutoff of 0.05 means that a page no different from the site-wide average will still pass the test by chance about 1 time in 20.  Since the same test is being applied to every page in your database, you need to account for &lt;b&gt;multiple hypothesis testing&lt;/b&gt;, or else many pages will meet the cutoff by chance.&lt;br /&gt;&lt;br /&gt;If you only have a few pages &amp;mdash; say, fewer than 40 &amp;mdash; then you can divide &lt;i&gt;alpha&lt;/i&gt; by the number of pages (e.g. 40) and use that in place of the original cutoff (0.05 / 40 = 0.00125, so the previous p-value of 0.03881 would &lt;i&gt;not&lt;/i&gt; be significant). This is known as the Bonferroni correction.&lt;br /&gt;&lt;br /&gt;More likely, you have many more pages than that &amp;mdash; hence the need to use grown-up statistics in the first place. Bonferroni correction (described above) would produce a cutoff that's much too stringent, so you'll need a more powerful method.&lt;br /&gt;&lt;br /&gt;R makes this easy.  
Starting with a single vector of the p-values from chi-square tests of each page (here &lt;tt class="docutils literal"&gt;contingencytables&lt;/tt&gt; is assumed to be a list of 2x2 matrices, one per page):&lt;br /&gt;&lt;pre class="literal-block"&gt;&amp;gt; pvals = sapply(contingencytables, function(tbl) chisq.test(tbl)$p.value)&lt;/pre&gt;Adjust these raw p-values for multiple testing (controlling the &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Familywise_error_rate"&gt;familywise error rate&lt;/a&gt; with Holm's method, by default &amp;mdash; read the help page for p.adjust for all the details):&lt;br /&gt;&lt;pre class="literal-block"&gt;&amp;gt; pvals.adj = p.adjust(pvals)&lt;/pre&gt;What you do after this depends on your own code. You can get a logical vector signifying which adjusted p-values are now smaller than alpha, which is useful for selecting "significant" pages from the original page list:&lt;br /&gt;&lt;pre class="literal-block"&gt;&amp;gt; significant = pvals.adj &amp;lt; 0.05&lt;/pre&gt;Note that this selects both significantly liked and significantly disliked pages at the same time. To distinguish between the two, just compare each page's like/view ratio to the global average and select higher or lower.&lt;br /&gt;&lt;br /&gt;Another note about the contingency table: Once your application has counted a very large number of site-wide likes and views (cells C and D), this test will register significance for almost any page. You might have better results by replacing the global view and "Like" counts with a per-month or per-user average. You can also cache these values and update them only occasionally.&lt;br /&gt;&lt;h4&gt;Trending&lt;br /&gt;&lt;/h4&gt;Calculating p-values is a lot more work than selecting the top and bottom quantiles. If you've put in the extra effort, here's another feature you can support: a list of newly significant winners and losers.&lt;br /&gt;&lt;br /&gt;Each day (or hour or so), perform the chi-square test (described above) across all pages and note which ones cross the significance threshold. 
Compare this to the previous run's results to see which pages have crossed over, and add these newly significant hits to a separate chart &amp;mdash; "Trending", I'll call it.&lt;br /&gt;&lt;br /&gt;This chart shows the pages that have just recently been determined to be likeable, but (probably) haven't accumulated enough votes to reach the "Top Pages" chart. It's a timelier list than "Top Pages", though the average quality of the "Trending" pages is not as high. This is the place where memes show up first. If they're truly good content, they'll eventually make it onto "Top Pages" &amp;mdash; but that's not usually the case with memes.&lt;br /&gt;&lt;br /&gt;I'd treat the "Trending" chart as a queue, adding newly trending pages to the top at the end of each run and dropping pages from the bottom to make room. Or just keep it rolling by week, like a blog. By adjusting &lt;i&gt;alpha&lt;/i&gt; you can tune the number of newly significant pages found in each run, and therefore the turnover rate of your "Trending" queue.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;5. But is it general?&lt;br /&gt;&lt;/h3&gt;Under the Google Whatever model (YouTube, Picasa, etc.), the ratio of Likes to total views for any given page is small. The statistics here will work in other cases, though &amp;mdash; for example, an "Approve" button which is clicked most of the time, or a "Dislike" button in place of the "Like" button. In the case where users have to click either "Like" or "Dislike" (Yes/No, Yay/Nay, or any other two options), this is also fine; just pick one option to count, and count "views" as the sum of likes and dislikes.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Update:&lt;/b&gt; The "Like" event can also be something implicit, like downloads or completed streams (out of started streams). This opens up options for sites that don't require users to register.&lt;br /&gt;&lt;br /&gt;What about sites with 5-star ratings, like Amazon? Well, there's an easy way and a hard way. 
Easy: count the ratings as fractional Likes ([0, .2, .4, .6, .8, 1] if you allow 0 stars, [0, .25, .5, .75, 1] if you don't), and use the pseudocounts just like before. The hard way is to treat each star rating as a separate event category &amp;mdash; but that's going to have to wait for a later post.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/266234734515043410-2232711059182312779?l=etalog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://etalog.blogspot.com/feeds/2232711059182312779/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=266234734515043410&amp;postID=2232711059182312779' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/2232711059182312779'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/2232711059182312779'/><link rel='alternate' type='text/html' href='http://etalog.blogspot.com/2011/07/statistics-of-like-button.html' title='The statistics of the &quot;Like&quot; button'/><author><name>Eric Talevich</name><uri>http://www.blogger.com/profile/10168388850793209768</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/-HrScY__e4Nk/TiCn_6ISC-I/AAAAAAAAAMY/zaiWEWFEvbg/s220/selfpic-turtle.png'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-266234734515043410.post-3505428721942468387</id><published>2010-10-05T11:23:00.000-04:00</published><updated>2010-10-05T11:23:10.705-04:00</updated><title type='text'>Bio.Phylo: A unified phylogenetics toolkit for Biopython</title><content type='html'>I presented this at the Bioinformatics Open Source Conference (&lt;a 
href='http://www.open-bio.org/wiki/BOSC_2010'&gt;BOSC 2010&lt;/a&gt;) in early July, but somehow forgot to post it here too. It's an overview of my somewhat new sub-package for working with phylogenetic trees in Biopython, based on my Google Summer of Code 2009 project (a phyloXML parser in Biopython).&lt;br /&gt;&lt;br /&gt;In a nutshell, Bio.Phylo is a library for manipulating finished phylogenetic trees and integrating them into a Biopython-based workflow. It can handle the standard file formats &amp;mdash; Newick, Nexus and phyloXML, with the current exception of NeXML &amp;mdash; and has particularly good support for &lt;a href="http://phyloxml.org"&gt;phyloXML&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;This presentation walks through an example of loading a Newick tree, viewing it a few different ways, adding branch colors, and saving it as a phyloXML file.&lt;br /&gt;&lt;br /&gt;&lt;div style="width:425px" id="__ss_4809399"&gt;&lt;strong style="display:block;margin:12px 0 4px"&gt;&lt;a href="http://www.slideshare.net/etalevich/biophylo-phylogenetics-in-biopython-bosc-2010" title="Bio.Phylo: Phylogenetics in Biopython (BOSC 2010)"&gt;Bio.Phylo: Phylogenetics in Biopython (BOSC 2010)&lt;/a&gt;&lt;/strong&gt;&lt;object id="__sse4809399" width="425" height="355"&gt;&lt;param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=bosc2010-biophylo-talevich-100721205437-phpapp01&amp;stripped_title=biophylo-phylogenetics-in-biopython-bosc-2010&amp;userName=etalevich" /&gt;&lt;param name="allowFullScreen" value="true"/&gt;&lt;param name="allowScriptAccess" value="always"/&gt;&lt;embed name="__sse4809399" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=bosc2010-biophylo-talevich-100721205437-phpapp01&amp;stripped_title=biophylo-phylogenetics-in-biopython-bosc-2010&amp;userName=etalevich" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;div 
style="padding:5px 0 12px"&gt;View more &lt;a href="http://www.slideshare.net/"&gt;presentations&lt;/a&gt; from &lt;a href="http://www.slideshare.net/etalevich"&gt;Eric Talevich&lt;/a&gt;.&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;The conference abstract is &lt;a href='http://www.open-bio.org/w/images/6/6a/8_BOSC2010.pdf'&gt;here&lt;/a&gt;. I also recommend the main documentation in the &lt;a href="http://biopython.org/DIST/docs/tutorial/Tutorial.html"&gt;Biopython Tutorial&lt;/a&gt; (see chapter 12) and the &lt;a href="http://www.biopython.org/wiki/Phylo"&gt;wiki page&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/266234734515043410-3505428721942468387?l=etalog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://etalog.blogspot.com/feeds/3505428721942468387/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=266234734515043410&amp;postID=3505428721942468387' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/3505428721942468387'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/3505428721942468387'/><link rel='alternate' type='text/html' href='http://etalog.blogspot.com/2010/10/biophylo-unified-phylogenetics-toolkit.html' title='Bio.Phylo: A unified phylogenetics toolkit for Biopython'/><author><name>Eric Talevich</name><uri>http://www.blogger.com/profile/10168388850793209768</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' 
src='http://4.bp.blogspot.com/-HrScY__e4Nk/TiCn_6ISC-I/AAAAAAAAAMY/zaiWEWFEvbg/s220/selfpic-turtle.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-266234734515043410.post-7879744741897795857</id><published>2010-04-08T17:12:00.003-04:00</published><updated>2011-03-01T15:15:49.105-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gsoc'/><category scheme='http://www.blogger.com/atom/ns#' term='programming'/><category scheme='http://www.blogger.com/atom/ns#' term='biopython'/><title type='text'>Google Summer of Code 2010: The final draft</title><content type='html'>The &lt;a href="http://socghop.appspot.com/gsoc/program/home/google/gsoc2010"&gt;Google Summer of Code 2010&lt;/a&gt; application period is in its final 24 hours.&lt;br /&gt;&lt;br /&gt;I volunteered to mentor with two organizations this year, &lt;a href="http://www.open-bio.org/wiki/Main_Page"&gt;OBF&lt;/a&gt; and &lt;a href="http://www.nescent.org/index.php"&gt;NESCent&lt;/a&gt;. 
Last month I posted a couple of ideas with each org:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010#Jump-start_MIAPA_protocol_annotation_with_a_user-accessible_demo"&gt;Jump-start MIAPA protocol annotation with a user-accessible demo&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010#Biopython_and_PyCogent_interoperability"&gt;Biopython and PyCogent interoperability&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://biopython.org/wiki/Google_Summer_of_Code#PDB-Tidy:_command-line_tools_for_manipulating_PDB_files"&gt;PDB-Tidy: command-line tools for manipulating PDB files&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://biopython.org/wiki/Google_Summer_of_Code#Integration_with_a_third-party_structural_biology_application"&gt;Biopython integration with a third-party structural biology application&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;The applications that have come in have been pretty good; the only thing I can complain about is that nobody has followed through with my MIAPA project -- we got a nibble from one student, but nothing after that.&lt;br /&gt;&lt;br /&gt;Since we're doing the last round of application reviews now before the deadline, here's some general guidance on what mentors are looking for in a student application.&lt;br /&gt;&lt;br /&gt;First, a couple of outside references:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://socghop.appspot.com/document/show/program/google/gsoc2009/studentallocations"&gt;Google's notes on student allocations&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://bcbio.wordpress.com/2010/03/26/biopython-projects-for-google-summer-of-code-2010/"&gt;Blue Collar Bioinformatics: Biopython projects for GSoC 2010&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br 
/&gt;&lt;h4&gt;The Zen of GSoC&lt;/h4&gt;Google Summer of Code is a program to recruit and foster new long-term open-source contributors.&lt;br /&gt;&lt;br /&gt;Broadly, the mentoring organizations are asking three questions:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Are you motivated enough about this work to continue contributing after the summer?&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Can you write useful code on your own?&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Do you interact well with the community, so that we can work with you to merge your work cleanly into the trunk and rely on you to maintain the codebase?&lt;/li&gt;&lt;/ol&gt;You can get a sense of what Google and the mentoring orgs are looking for from the applications the orgs themselves submit to Google. For example: &lt;a href="http://docs.google.com/View?id=dhdjhbvd_10c898hdhc"&gt;NESCent's 2010 app&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Here are some specific tips for demonstrating that you have some committer in you.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Put your previous work online&lt;/h4&gt;It's remarkable how many ostensible programmers just can't write decent code. They'll have a list of successful past projects they worked on, maybe a legitimate degree in computer science, but their code itself was clearly never fully understood by anyone, original programmer included. (Remember, programming languages exist for humans to understand -- the computer itself runs on machine code.) 
The only way we can be sure you can write code we can use is if we can look at something you've written previously.&lt;br /&gt;&lt;br /&gt;Biopython uses GitHub for development, so putting a project of your own on GitHub demonstrates two useful things: you can write functioning code, and you're already up to speed with the build tools that Biopython uses.&lt;br /&gt;&lt;br /&gt;If the most relevant code you've written is tied up in some way -- say, it's part of a research project still being prepared for publication -- see if you can use at least a few snippets of it. So far, it seems most professors have been willing to allow that.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Subscribe to your mentoring organization's mailing list&lt;/h4&gt;I know, e-mail mailing lists seem at least a decade behind the times. But open-source projects like to have a permanent public record of the discussions that happen, and everyone has an e-mail account. We also have IRC channels and Twitter tags (#phylosoc and #obfsoc), but project proposals are generally more than 140 characters so it's best to use e-mail at some point.&lt;br /&gt;&lt;br /&gt;Plus, you'll be able to read all the advice the other students are getting -- mentors get fatigued as the application season wears on, and once we've written the same thing a few times we start skipping details.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Write a weekly project schedule&lt;/h4&gt;The GSoC application has fields for pointing to external info. Create a Google document or spreadsheet (or README.md on GitHub if you're fancy) detailing your project plan week-by-week.&lt;br /&gt;&lt;br /&gt;Suggested fields:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Date, or week number for referencing later&lt;/li&gt;&lt;li&gt;GSoC events and guidelines (see the &lt;a href="http://socghop.appspot.com/document/show/gsoc_program/google/gsoc2010/faqs#timeline"&gt;official timeline&lt;/a&gt;)&lt;/li&gt;&lt;li&gt;Deliverables for the week — what's produced, e.g. 
documentation sections, unit tests, classes, modules&lt;/li&gt;&lt;li&gt;Approach for each of these tasks, in a few words&lt;/li&gt;&lt;li&gt;Potential problems that could occur, specific to the tasks — perhaps a dependency turns out to be inadequate, or an integration step is required&lt;/li&gt;&lt;li&gt;Proposed mitigation for each of the foreseen issues&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;(If you want to estimate the number of hours or days each task will take, that's cool too.)&lt;br /&gt;&lt;br /&gt;Here are the examples from previous GSoC projects that we've been sharing on the mailing lists:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://spreadsheets.google.com/pub?key=puFMq1smOMEo20j0h5Dg9fA&amp;amp;single=true&amp;amp;gid=0&amp;amp;output=html"&gt;phyloXML in Biopython&lt;/a&gt; (mine)&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.nescent.org/wg_phyloinformatics/PhyloSoC:PhyloXML_support_in_BioRuby"&gt;phyloXML in BioRuby&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Phylogenetic_XML"&gt;Phylogenetic XML&lt;/a&gt; (the origin of NeXML)&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Build_a_Mesquite_Package_to_view_Phenex-generated_Nexml_files"&gt;NeXML for Mesquite&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;h4&gt;Respect the deadlines&lt;/h4&gt;Submit a draft of your application to Google at least a day before the deadline, April 9. There are thousands of applicants each year, and Google has no reason to let the deadline slide — an important function of the application process itself is to screen out students who won't deliver by the stated deadlines. In effect, if your application isn't submitted to Google by noon PST on April 9, then you didn't apply.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;BUT:&lt;/i&gt; If you submit something even partially complete, we can contact you later during the review stage and get the remaining information from you. 
And if you included a link to your weekly plan (as a separate online document), you can edit that after the deadline too.&lt;br /&gt;&lt;br /&gt;Best of luck!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/266234734515043410-7879744741897795857?l=etalog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://etalog.blogspot.com/feeds/7879744741897795857/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=266234734515043410&amp;postID=7879744741897795857' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/7879744741897795857'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/7879744741897795857'/><link rel='alternate' type='text/html' href='http://etalog.blogspot.com/2010/04/google-summer-of-code-2010-final-draft.html' title='Google Summer of Code 2010: The final draft'/><author><name>Eric Talevich</name><uri>http://www.blogger.com/profile/10168388850793209768</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/-HrScY__e4Nk/TiCn_6ISC-I/AAAAAAAAAMY/zaiWEWFEvbg/s220/selfpic-turtle.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-266234734515043410.post-2327018934040653932</id><published>2010-02-24T12:29:00.003-05:00</published><updated>2010-02-24T13:34:53.671-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='education'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='programming'/><category scheme='http://www.blogger.com/atom/ns#' term='biopython'/><category scheme='http://www.blogger.com/atom/ns#' 
term='presentation'/><title type='text'>Python workshop #2: Biopython</title><content type='html'>As promised, here are the slides from Monday's Biopython programming workshop:&lt;br /&gt;&lt;br /&gt;&lt;div style="width: 425px;" id="__ss_3266543"&gt;&lt;strong style="margin: 12px 0pt 4px; display: block;"&gt;&lt;a href="http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga" title="Biopython programming workshop at UGA"&gt;Biopython programming workshop at UGA&lt;/a&gt;&lt;/strong&gt;&lt;object height="355" width="425"&gt;&lt;param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=biopywork-100224112524-phpapp01&amp;amp;stripped_title=biopython-programming-workshop-at-uga"&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;param name="allowScriptAccess" value="always"&gt;&lt;embed src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=biopywork-100224112524-phpapp01&amp;amp;stripped_title=biopython-programming-workshop-at-uga" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" height="355" width="425"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;div style="padding: 5px 0pt 12px;"&gt;View more &lt;a href="http://www.slideshare.net/"&gt;presentations&lt;/a&gt; from &lt;a href="http://www.slideshare.net/etalevich"&gt;Eric Talevich&lt;/a&gt;.&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;This was another 2-hour session, with a short snack break in the middle this time -- which was also a nice opportunity to ask everyone about the pacing, and see who's been following along with the examples in IPython (versus staring at a BSOD or lolcats -- which I didn't notice any of).&lt;br /&gt;&lt;br /&gt;This went well:&lt;ul&gt;&lt;li&gt;Pacing&lt;/li&gt;&lt;li&gt;Using IPython to inspect objects and display documentation -- this lets some people "read ahead" and perhaps answer their own minor questions, leading to other, better questions&lt;/li&gt;&lt;li&gt;The general introductory pattern 
of:&lt;ol&gt;&lt;li&gt;Demonstrate how to import a module and instantiate the basic class&lt;/li&gt;&lt;li&gt;Review, in English, the core features of the module and why they exist&lt;/li&gt;&lt;li&gt;Walk through a short script that uses real data to accomplish some simple but useful task(s)&lt;/li&gt;&lt;li&gt;Display the result, completing the mental pipeline of &lt;span style="font-style: italic;"&gt;input&lt;/span&gt; -&gt; &lt;span style="font-style: italic;"&gt;transformation&lt;/span&gt; -&gt; &lt;span style="font-style: italic;"&gt;output&lt;/span&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;/ul&gt;Room for improvement:&lt;ul&gt;&lt;li&gt;I didn't always execute the final draft of each example, so there were a couple typos -- inconvenient for those following along in Python. (I've fixed them in the slides here.)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Consequently, I didn't have an output file to show at the end of each example -- so I had to describe or draft one on the spot.&lt;/li&gt;&lt;li&gt;The PDB module was the coolest part of the workshop, and I rushed it a bit. I was afraid the visitors from Genetics and Plant Bio would be bored with it, but I don't think they were, and the Bioinformatics folks were left wanting more.&lt;/li&gt;&lt;/ul&gt;I'm planning to host both Python workshops again in the next academic year, either 1 per semester (as it was this year) or both each semester, maybe 2 weeks apart. 
The Biopython workshop in particular will be different next time because &lt;a href="http://www.biopython.org/wiki/Phylo"&gt;Bio.Phylo&lt;/a&gt; will finally be included with the main Biopython distribution -- evolution is cool, and more of the pretty is always a good thing to have in a programming workshop.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/266234734515043410-2327018934040653932?l=etalog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://etalog.blogspot.com/feeds/2327018934040653932/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=266234734515043410&amp;postID=2327018934040653932' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/2327018934040653932'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/2327018934040653932'/><link rel='alternate' type='text/html' href='http://etalog.blogspot.com/2010/02/python-workshop-2-biopython.html' title='Python workshop #2: Biopython'/><author><name>Eric Talevich</name><uri>http://www.blogger.com/profile/10168388850793209768</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/-HrScY__e4Nk/TiCn_6ISC-I/AAAAAAAAAMY/zaiWEWFEvbg/s220/selfpic-turtle.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-266234734515043410.post-4780125680402496704</id><published>2010-02-19T15:52:00.004-05:00</published><updated>2010-02-24T13:36:54.484-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='presentation'/><title type='text'>Python workshop #1, now on 
SlideShare</title><content type='html'>Last November I hosted a workshop on basic Python programming at UGA. The attendees were mostly from the bioinformatics department, but this workshop didn't go into science at all -- just practical Python usage. Today I finally got around to cleaning up the slides and uploading them to SlideShare:&lt;br /&gt;&lt;br /&gt;&lt;div style="width: 425px; text-align: left;" id="__ss_3228053"&gt;&lt;a style="margin: 12px 0pt 3px; font-family: Helvetica,Arial,Sans-serif; font-style: normal; font-variant: normal; font-weight: normal; font-size: 14px; line-height: normal; font-size-adjust: none; font-stretch: normal; display: block; text-decoration: underline;" href="http://www.slideshare.net/etalevich/python-workshop-1-uga-bioinformatics" title="Python workshop #1 at UGA"&gt;Python workshop #1 at UGA&lt;/a&gt;&lt;object style="margin: 0px;" height="355" width="425"&gt;&lt;param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=pywork1-100219144829-phpapp02&amp;amp;rel=0&amp;amp;stripped_title=python-workshop-1-uga-bioinformatics"&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;param name="allowScriptAccess" value="always"&gt;&lt;embed src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=pywork1-100219144829-phpapp02&amp;amp;rel=0&amp;amp;stripped_title=python-workshop-1-uga-bioinformatics" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" height="355" width="425"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;div style="font-size: 11px; font-family: tahoma,arial; height: 26px; padding-top: 2px;"&gt;View more &lt;a style="text-decoration: underline;" href="http://www.slideshare.net/"&gt;presentations&lt;/a&gt; from &lt;a style="text-decoration: underline;" href="http://www.slideshare.net/etalevich"&gt;etalevich&lt;/a&gt;.&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;It looks like LaTeX Beamer and SlideShare's PDF/Flash converter don't play well together. 
Meh, it's still easy enough to read.&lt;br /&gt;&lt;br /&gt;I'm working on a Biopython-specific followup right now for a workshop on Monday, 2/22. I'll post that here when it's done, too, with reasonable haste.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/266234734515043410-4780125680402496704?l=etalog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://etalog.blogspot.com/feeds/4780125680402496704/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=266234734515043410&amp;postID=4780125680402496704' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/4780125680402496704'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/4780125680402496704'/><link rel='alternate' type='text/html' href='http://etalog.blogspot.com/2010/02/python-workshop-1-now-on-slideshare.html' title='Python workshop #1, now on SlideShare'/><author><name>Eric Talevich</name><uri>http://www.blogger.com/profile/10168388850793209768</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/-HrScY__e4Nk/TiCn_6ISC-I/AAAAAAAAAMY/zaiWEWFEvbg/s220/selfpic-turtle.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-266234734515043410.post-1806315370728919008</id><published>2009-07-20T20:52:00.001-04:00</published><updated>2009-07-21T01:43:35.410-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='programming'/><title type='text'>Faster string concatenation in Python</title><content type='html'>&lt;a 
href="https://www.nescent.org/wg/phyloinformatics/index.php?title=Phyloinformatics_Summer_of_Code_2009#Biogeographical_Phylogenetics_for_BioPython"&gt;Nick Matzke&lt;/a&gt; pointed me to this discussion of string concatenation approaches in Python:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.skymind.com/~ocrow/python_string/"&gt;Efficient String Concatenation in Python&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The issue here is whether adding strings together in a &lt;tt&gt;for&lt;/tt&gt; loop is inefficient enough to be worth working around. Python strings are immutable, so this:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;s = 'om'&lt;br /&gt;for i in xrange(1000):&lt;br /&gt;    s += 'nom'&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;means doing this 1000 times:&lt;br /&gt;&lt;ol&gt;&lt;br /&gt;&lt;li&gt;Translate the assignment to "s = s + 'nom'"&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Allocate another string, 'nom'. (Or reuse a reference if it's already interned.)&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Call the __add__ method on &lt;tt&gt;s&lt;/tt&gt;, with 'nom' as the argument&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Allocate the new string created by __add__&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Assign a reference to that string back to &lt;tt&gt;s&lt;/tt&gt;&lt;/li&gt;&lt;br /&gt;&lt;/ol&gt;&lt;br /&gt;So, using the + operator 1000 times in a loop has to create 1000 ever-larger string objects, but only the last one gets used outside the loop. There are good reasons Python works this way, but still, there's a trap here in an operation that gets used a lot in ordinary Python code.&lt;br /&gt;&lt;br /&gt;There are a few ways to cope:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Use a mutable object to build the string as a sequence of bytes (or whatever) and then convert it back to a Python string in one shot at the end. 
Reasonable intermediate objects are array and StringIO (preferably cStringIO).&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Let the string object's &lt;tt&gt;join&lt;/tt&gt; method do the dirty work -- strings are a basic Python type that's been optimized already, so this method probably drops down to a lower level (C/bytecode in the CPython interpreter, not sure about the details) where full allocation of each intermediate string isn't necessary.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Build a single format string and interpolate with the &lt;tt&gt;%&lt;/tt&gt; operator (or the format method, if you're fancy) to fill it in, under the same rationale as with the &lt;tt&gt;join&lt;/tt&gt; method. This fits real-world scenarios better &amp;mdash; filling in a template of a plain-text table or paragraph with computed values, either all at once with &lt;tt&gt;%&lt;/tt&gt; or incrementally with string addition. It could be a performance bottleneck, and it's not obvious which approach would be better.&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;The original article gives a nice analysis and comes out in favor of intermediate cStringIO objects, with a list comprehension inside the string join method as a strong alternative. But it was written in 2004, and Python has changed since then. Also, it doesn't include interpolation among the tested methods, and that was the one I was the most curious about.&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;Methods&lt;/h2&gt;&lt;br /&gt;I downloaded and updated the script included with that article, and ran it with Python 2.6 and 2.5 to get some new results. (Source code &lt;a href="http://www.relapsecollapse.com/static/strcat.txt"&gt;here&lt;/a&gt;.)&lt;br /&gt;&lt;br /&gt;First, a changelog:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;The method numbers are different, and there are a couple more. Method #2 is for the &lt;tt&gt;%&lt;/tt&gt; operator, in which I build a gigantic format string and a gigantic tuple out of the number list, then smash them together. 
It trades memory for CPU time, basically. Method #8 uses &lt;tt&gt;map&lt;/tt&gt; instead of a list comprehension or generator expression; no lambda is required and the necessary function (&lt;tt&gt;str()&lt;/tt&gt;) is already available, so this is a good candidate.&lt;/li&gt; &lt;br /&gt;&lt;li&gt;I used the standard lib's &lt;tt&gt;time.clock()&lt;/tt&gt; to measure CPU time around just the relevant loop for each string concatenation method.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Fetching the process memory size is similar but uses the subprocess module and different options.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Docstrings are (ab)used to identify the output.&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;For example, the string addition method now looks like this:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;def method1():&lt;br /&gt;    """1. string addition"""&lt;br /&gt;    start = clock()&lt;br /&gt;    out_str = ''&lt;br /&gt;    for num in NUMS:&lt;br /&gt;        out_str += str(num)&lt;br /&gt;    cpu = clock() - start&lt;br /&gt;    return (out_str, cpu, memsize())&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;Results&lt;/h2&gt;&lt;br /&gt;Each method concatenates the string representation of the numbers 0 through 999,999. The methods were run sequentially in separate processes, via a for loop in the shell, for Python versions 2.5 and 2.6. The best of three runs for each method are shown below.&lt;br /&gt;&lt;b&gt;Python 2.6:&lt;/b&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;1. string addition   CPU (s): 1.99   Mem (K): 11.7&lt;br /&gt;2. %-interpolation   CPU (s): 2.42   Mem (K): 23.0&lt;br /&gt;3. array object      CPU (s): 3.42   Mem (K): 17.3&lt;br /&gt;4. cStringIO object  CPU (s): 3.24   Mem (K): 19.7&lt;br /&gt;5. join + for loop   CPU (s): 2.29   Mem (K): 48.0&lt;br /&gt;6. join + list comp  CPU (s): 1.93   Mem (K): 11.6&lt;br /&gt;7. join + gen expr   CPU (s): 2.08   Mem (K): 11.6&lt;br /&gt;8. 
join + str map    CPU (s): 1.47   Mem (K): 11.6&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The winner is &lt;tt&gt;map&lt;/tt&gt;, with string addition, the list comprehension, and the generator expression also doing well. String addition in a loop did much better than would be expected from reading the original article; the Python developers have put effort into making this less of a trap. Specifically, there's a flag on string objects internally that indicates whether the string is the result of an addition operation. This helps the interpreter identify when a string is being concatenated in a loop, and optimize that case by performing in-place concatenation. Nice. So really, there's no need to worry about the quadratic time behavior that we expected &amp;mdash; at least in Python 2.6.&lt;br /&gt;&lt;br /&gt;The array object, a sequence of packed bytes, is supposed to be a low-level but high-performance workhorse. It was embedded in the minds of performance-conscious Python programmers by this essay by Guido van Rossum:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.python.org/doc/essays/list2str.html"&gt;Python Patterns &amp;mdash; An Optimization Anecdote&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;At a glance, that problem looks similar to this one. However, converting ints to chars is a problem that can be described well in bytes. Converting integers to their string representation is not &amp;mdash; we're not even using any features of the array object related to byte representation. Going low-level doesn't help us here; as Guido indicates in his conclusion, if you keep it short and simple, Python will reward you. 
The cStringIO object in method 4 performs duties similar to the array object's, and the shape of both functions is the same; the only difference in performance seems to be that cStringIO trades some memory space for CPU time.&lt;br /&gt;&lt;br /&gt;The string join method is recommended by the Python standard library documentation for string concatenation with well-behaved performance characteristics. Conveniently, &lt;tt&gt;str.join()&lt;/tt&gt; accepts any iterable object, including lists and generator expressions. Method 5 is the dumb approach: build a list in a for loop, pass it to &lt;tt&gt;join&lt;/tt&gt;. Method 6 pushes the looping operation deeper into the interpreter via list comprehension; it saves some bytecode, variable and function lookups, and a substantial number of memory allocations.&lt;br /&gt;&lt;br /&gt;Using a generator expression in method 7 instead of a list comprehension should have been equivalent or faster, by avoiding the up-front creation of a list object. But memory usage is the same, and the list comprehension runs faster by a small but consistent amount. Maybe &lt;tt&gt;join&lt;/tt&gt; isn't able to take advantage of lazy evaluation, or is helped by knowing the size of the list object early on... I'm not sure. Interesting, though. In Python 3, the list comprehension is equivalent to building a list object from a generator expression, so results would probably be different there.&lt;br /&gt;&lt;br /&gt;Finally, in method 8, &lt;tt&gt;map&lt;/tt&gt; allows the interpreter to look up the &lt;tt&gt;str&lt;/tt&gt; constructor just once, rather than for each item in the given sequence. This is the only approach that gives an impressive speedup over string addition in a loop. So how portable is this result?&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Python 2.5:&lt;/b&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;1. string addition   CPU (s): 3.77   Mem (K): 10.8&lt;br /&gt;2. %-interpolation   CPU (s): 2.43   Mem (K): 22.0&lt;br /&gt;3. 
array object      CPU (s): 5.16   Mem (K): 16.4&lt;br /&gt;4. cStringIO object  CPU (s): 4.93   Mem (K): 18.7&lt;br /&gt;5. join + for loop   CPU (s): 3.98   Mem (K): 47.1&lt;br /&gt;6. join + list comp  CPU (s): 3.30   Mem (K): 10.5&lt;br /&gt;7. join + gen expr   CPU (s): 3.59   Mem (K): 10.5&lt;br /&gt;8. join + str map    CPU (s): 2.72   Mem (K): 10.5&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Python 2.6.2 has had the benefit of additional development time, and notably of the &lt;a href="http://code.google.com/p/unladen-swallow/wiki/ProjectPlan#2009_Q1"&gt;Unladen Swallow&lt;/a&gt; project's first quarter of interpreter optimizations, with impressive improvements across the board. By comparison, Python 2.5 generally uses less memory and more CPU time. String interpolation, however, seems to already have been optimized to the max in Python 2.5, and actually wins the performance shootout here! String addition, on the other hand, is slightly less adept at optimizing in a loop. It still avoids the quadratic-time issue (that enhancement was added in Python 2.4), and memory usage is quite respectable.&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;Conclusion&lt;/h2&gt;&lt;br /&gt;The recommendations at the end of Guido's essay are still exactly right. 
In general, Python performs best with code that "looks right", with abstractions that fit the problem and a minimum of branching and explicit looping.&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Adding strings together in simple expressions will be optimized properly in recent Pythons, but could bite you in older ones&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Using string interpolation or templates plays well enough with more complex formatting&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Going too low-level can deprive you of Python's own optimizations&lt;/li&gt;&lt;br /&gt;&lt;li&gt;If built-in functions can do what you need, use them, and basic Haskell-style functional expressions can make your code very concise&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;There's more discussion on &lt;a href="http://stackoverflow.com/questions/376461/string-concatenation-vs-string-substitution-in-python"&gt;Stack Overflow&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/266234734515043410-1806315370728919008?l=etalog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://etalog.blogspot.com/feeds/1806315370728919008/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=266234734515043410&amp;postID=1806315370728919008' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/1806315370728919008'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/1806315370728919008'/><link rel='alternate' type='text/html' href='http://etalog.blogspot.com/2009/07/faster-string-concatenation-in-python.html' title='Faster string concatenation in Python'/><author><name>Eric 
Talevich</name><uri>http://www.blogger.com/profile/10168388850793209768</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/-HrScY__e4Nk/TiCn_6ISC-I/AAAAAAAAAMY/zaiWEWFEvbg/s220/selfpic-turtle.png'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-266234734515043410.post-469615028017332339</id><published>2009-03-08T18:50:00.000-04:00</published><updated>2009-03-23T15:49:20.157-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='education'/><category scheme='http://www.blogger.com/atom/ns#' term='science'/><title type='text'>Mnemosyne: Getting Things Memorized</title><content type='html'>It had been bothering me since I joined this lab that I couldn't confidently just read a protein sequence and understand what it meant &amp;mdash; naming the residues, picturing the side chain structures, and understanding the significance of replacing one residue with another. I expected that I'd just pick it up naturally from working with sequences and structures, and that did happen somewhat. But I wanted it to be as easy as reading English, and that level of completeness doesn't happen without some rote memorization.&lt;br /&gt;&lt;br /&gt;That brought to mind a &lt;a href="http://www.wired.com/medtech/health/magazine/16-05/ff_wozniak"&gt;Wired article&lt;/a&gt; about Piotr Wozniak and his spaced-repetition memorization program, &lt;a href="http://www.supermemo.com/"&gt;SuperMemo&lt;/a&gt;. When I originally read the article I wasn't in grad school and didn't have an urge to memorize any particular list of things. Anyway, SuperMemo appeared to be Windows-only software, and an algorithm like this would be more fun to code from scratch anyway. 
Enough fun, really, that there &lt;em&gt;had&lt;/em&gt; to be one or two open-source implementations floating around.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://mnemosyne-proj.org/"&gt;Mnemosyne&lt;/a&gt; popped up as the closest match in an Ubuntu package search, so I'm running with that. Putting the flash cards together was pretty simple; I was able to do it in a few minutes from inside the program and export it in the standard XML format. I zipped it up with a quick plain-text README and uploaded it to the project home page as the &lt;a href="http://mnemosyne-proj.org/node/166"&gt;Amino Acids&lt;/a&gt; card set.&lt;br /&gt;&lt;br /&gt;The content came from a slide in a lecture, and I did a quick sanity check on Wikipedia before uploading. The notation for the 20 standard amino acids is complete, and that was the main goal of this. The assignment of amino acid "groups" seems to be a little arbitrary, depending on the source (by structure, functional groups, chemical properties, etc.), and I tried to make the categories complete without too much overlap -- there's a small deviation from my slide here. I also added another category for "side chain properties", pH and polarity. 
Another enhancement might be the standard codons for each amino acid, though I'm not sure I want to deal with that yet.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/266234734515043410-469615028017332339?l=etalog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://etalog.blogspot.com/feeds/469615028017332339/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=266234734515043410&amp;postID=469615028017332339' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/469615028017332339'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/469615028017332339'/><link rel='alternate' type='text/html' href='http://etalog.blogspot.com/2009/03/mnemosyne-getting-things-memorized.html' title='Mnemosyne: Getting Things Memorized'/><author><name>Eric Talevich</name><uri>http://www.blogger.com/profile/10168388850793209768</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/-HrScY__e4Nk/TiCn_6ISC-I/AAAAAAAAAMY/zaiWEWFEvbg/s220/selfpic-turtle.png'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-266234734515043410.post-546727074630579025</id><published>2009-02-07T14:36:00.000-05:00</published><updated>2009-02-07T14:49:04.640-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='education'/><title type='text'>Carrots and sticks</title><content type='html'>&lt;p&gt;What's old is new again:&lt;/p&gt;&lt;p&gt;&lt;a href="http://www.theglobeandmail.com/servlet/story/RTGAM.20090206.wprof06/BNStory/National/home"&gt;Professor makes his mark, but it costs him his job&lt;/a&gt;&lt;/p&gt;&lt;p&gt;In &lt;i&gt;Zen 
and the Art of Motorcycle Maintenance&lt;/i&gt;, Robert Pirsig mentions his own experiment in withholding grades at a university. He didn't just announce on the first day that everyone would get an A+ (that seems gimmicky), but since it was a class on rhetoric, he spent the course developing an argument for eliminating the grades-and-degrees system and discussing it with his students.  Most students were unenthusiastic or opposed -- grades and degrees are what they came for.&lt;/p&gt;&lt;p&gt;He assigned, collected and graded papers but returned them to students with only the comments, not the grade.&lt;/p&gt;&lt;p&gt;At first:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;A-students were annoyed, but did the work anyway&lt;/li&gt;&lt;li&gt;B-C students blew off some assignments&lt;/li&gt;&lt;li&gt;C-D students usually skipped class&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;He observed this and changed nothing. If students acted up, he let it slide.&lt;/p&gt;&lt;p&gt;Around 3-4 weeks into it:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;A-students got nervous and pushed themselves harder, in class and in papers&lt;/li&gt;&lt;li&gt;B-C students saw what the try-hards were doing and returned to the usual level of effort&lt;/li&gt;&lt;li&gt;C-D students who had committed to ditching would occasionally show up out of curiosity&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;And finally:&lt;/p&gt;&lt;ul&gt;&lt;li&gt; A-students relaxed and began enjoying the class as active participants. In a final essay, still not knowing what their grades were, these students favored eliminating grades by 2-1.&lt;/li&gt;&lt;li&gt; B-C students saw this and freaked, putting an unusual amount of effort into their work. Eventually, they joined the A students in engaging class discussions. These were evenly divided over the issue of eliminating grades.&lt;/li&gt;&lt;li&gt; C-D students, or those who attended, also saw this and began trying to hand in reasonable work. 
Those who couldn't hack it freaked even more, and remained in a state of Kafkaesque terror until the quarter mercifully ended.  Naturally, in the final essay these students were unanimously opposed to eliminating grades.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Interesting as this result was, he reverted to the regular grading system the next quarter because he couldn't provide any alternate goal for students -- those who can recognize quality in their own work don't need the university; those who can't need something to work toward, or they don't progress.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/266234734515043410-546727074630579025?l=etalog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://etalog.blogspot.com/feeds/546727074630579025/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=266234734515043410&amp;postID=546727074630579025' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/546727074630579025'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/546727074630579025'/><link rel='alternate' type='text/html' href='http://etalog.blogspot.com/2009/02/carrots-and-sticks.html' title='Carrots and sticks'/><author><name>Eric Talevich</name><uri>http://www.blogger.com/profile/10168388850793209768</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' 
src='http://4.bp.blogspot.com/-HrScY__e4Nk/TiCn_6ISC-I/AAAAAAAAAMY/zaiWEWFEvbg/s220/selfpic-turtle.png'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-266234734515043410.post-123050970038111966</id><published>2008-10-25T12:29:00.000-04:00</published><updated>2009-01-09T13:29:28.829-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='science'/><title type='text'>Psychology research with Mechanical Turk</title><content type='html'>&lt;p&gt;&lt;b&gt;Elevator pitch:&lt;/b&gt; There's a missing gear in the machine of psychology research. Every significant human study requires weeks or months of data collection, and more time coding that data in a form that can be analyzed statistically. This makes it infeasible to do the sort of fast, iterative refinement of models that biology has seen in recent years.&lt;br /&gt;&lt;br /&gt;Amazon's &lt;a href="https://www.mturk.com/mturk/welcome"&gt;Mechanical Turk&lt;/a&gt; provides the missing piece. It provides an accessible interface for building a survey, interactive test, or other psychological measure, pushing it out to thousands of participants, quickly returning the results to the researcher in electronic form, and screening out unusable data. It's flexible enough to allow screening and debriefing, and gives access to a vastly larger pool of participants than Experimetrix. And it's cheap.&lt;/p&gt;&lt;h2&gt;Background&lt;/h2&gt;&lt;p&gt;First, take a look at this: &lt;a href="http://www.tenthousandcents.com/index.html"&gt;Ten Thousand Cents&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;When I first heard about bioinformatics I was under the impression that it was the exponentially increasing power of computers that made it irresistible to start using them for biological research. But actually, it was pretty much the reverse -- high-throughput experimental methods like gene sequencing, mass spectrometry and X-ray crystallography generated too much data for humans to process manually. 
Computers were only barely able to handle this workload in 1986, when the human genome project started -- scientists just did what was needed to move around the mountains of data coming out of their experiments. Similarly, new computational research is coming out of the Large Hadron Collider project now.&lt;br /&gt;&lt;br /&gt;Psychology researchers (especially in social psychology) currently spend semesters at a time gathering data for their studies and converting it into data that can be quantitatively analyzed. High-throughput experimental methods are scarce and expensive, so there's no "data glut" driving the development of better information-management methods. Progress in the field is slow and lossy -- since there's not much demand for the raw data, conclusions are described qualitatively, which makes it hard to use prior results as a solid foundation for future work.&lt;br /&gt;&lt;br /&gt;With Mechanical Turk, it's possible to do in one shot a study that would otherwise require a meta-analysis of several studies across particular locations or demographics. With more consistent data and larger populations, data &lt;i&gt;can&lt;/i&gt; be reusable.&lt;/p&gt;&lt;h2&gt;How it could work&lt;/h2&gt;&lt;p&gt;If it fits behind a web interface, or can be described and completed with plain language or pictures, it can be done with Mechanical Turk. Necessarily, a form of consent can precede the main task, and a blurb of debriefing can finish.&lt;br /&gt;&lt;br /&gt;To get a feel for how it's done, read this article: &lt;a href="http://waxy.org/2008/11/the_faces_of_mechanical_turk/"&gt;The Faces of Mechanical Turk&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Naturally, the first study done this way should be something to determine how the population of Turkers corresponds to the general population and the student populations that have already been characterized in previous studies. 
A public-domain measure of the Big Five or something like the Narcissistic Personality Inventory would be good candidates. Then, let slip the hounds of statistics. Are Turkers as representative of the general population as psychology undergrads? More so?&lt;br /&gt;&lt;br /&gt;Some research along these lines has already been blogged here: &lt;a href="http://behind-the-enemy-lines.blogspot.com/2008/03/mechanical-turk-demographics.html"&gt;Mechanical Turk Demographics&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Now, let's try some examples.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Surveys:&lt;/b&gt; You craft a survey, Turkers take it, and you retrieve and filter the results through the Mechanical Turk interface. Pretty straightforward, no?&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Interactive tasks:&lt;/b&gt; This is what Mechanical Turk is designed for; only, the focus was expected to be the task, not the Turker. Anyway, the data's yours. An example of a task like this would be a simple, unbounded game (Flash or JavaScript) that the participant can quit any time (possibly paired with another stimulus). The returned data would be the play duration alongside any personal or demographic information requested.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Coding visual or audio data:&lt;/b&gt; Following the original intent of Mechanical Turk more closely, this application of the service distributes a repetitive task normally performed over several weeks or months by the researcher or a group of grad students. Rather than collect new data about a participant, this simply boils down a vast quantity of data that's already been generated -- this is a problem we want to have. 
A two-step example: (1) run a Mechanical Turk task in which participants draw or assemble an arbitrary image; (2) run a second task with a different set of participants who look at these images and code (type or select) the relevant traits they see in the images.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Measure development:&lt;/b&gt; One of the more uncomfortable questions in social psychology research is the validity of personality measures. Devise a series of questions and a method for tabulating the results; run it on some participants; analyze the results to get some answers. But, what's really being tested here -- the population, or the measure? Tragically, there's no time to refine the measure very much; if the results are useful, you run with it. But! With Mechanical Turk, collecting survey results is cheap and quick; and since the general format of the survey isn't changing between revisions, the same set of statistical transformations can be applied programmatically to each iteration of the survey.&lt;br /&gt;&lt;br /&gt;This is a great way to build a psychological measure that you can be confident in: Push an initial draft of the measure out to Turkers, receive some results, perform a statistical analysis and save the operations as an &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; or SPSS script. Then, manually refine the measure, put it back on Turk, filter the new results through your analysis script, and repeat until it looks good. 
This can get as advanced as you'd like -- start with several times as many questions as you'd like to see in the final survey, then automatically dispatch random subsets of the question list to Turk, filter through your automatic analysis to get some scores indicating quality, and use a Bayesian classifier to narrow down the best possible subset of questions.&lt;/p&gt;&lt;br /&gt;&lt;p&gt;&lt;b&gt;Update:&lt;/b&gt; Here's a conference paper on the same topic.&lt;br /&gt;&lt;a href="http://www-users.cs.umn.edu/~echi/papers/2008-CHI2008/2008-02-mech-turk-online-experiments-chi1049-kittur.pdf"&gt;http://www-users.cs.umn.edu/~echi/papers/2008-CHI2008/2008-02-mech-turk-online-experiments-chi1049-kittur.pdf&lt;/a&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/266234734515043410-123050970038111966?l=etalog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://etalog.blogspot.com/feeds/123050970038111966/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=266234734515043410&amp;postID=123050970038111966' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/123050970038111966'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/123050970038111966'/><link rel='alternate' type='text/html' href='http://etalog.blogspot.com/2008/10/psychology-research-with-mechanical.html' title='Psychology research with Mechanical Turk'/><author><name>Eric Talevich</name><uri>http://www.blogger.com/profile/10168388850793209768</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' 
src='http://4.bp.blogspot.com/-HrScY__e4Nk/TiCn_6ISC-I/AAAAAAAAAMY/zaiWEWFEvbg/s220/selfpic-turtle.png'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-266234734515043410.post-7275581963388096144</id><published>2008-03-07T15:48:00.002-05:00</published><updated>2010-01-01T13:58:38.723-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='tools'/><category scheme='http://www.blogger.com/atom/ns#' term='productivity'/><category scheme='http://www.blogger.com/atom/ns#' term='vim'/><title type='text'>Vimming your way to the top</title><content type='html'>Here's the Vim syntax file I use for highlighting my to-do list. It's based on the syntax file for YAML.&lt;br /&gt;&lt;a href="http://www.vim.org/scripts/script.php?script_id=2599"&gt;http://www.vim.org/scripts/script.php?script_id=2599&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Benefits:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Different colors for lines ending in ':', or starting with '*' or '{'&lt;/li&gt;&lt;li&gt;Assign keywords to be automatically highlighted, like important locations, coworkers' names, customers, taquerias, etc.&lt;/li&gt;&lt;li&gt;Start sections with a line of underscores and a heading beginning with the '{' character. The heading stands out (red with GVim's "desert" color scheme), and you can jump between sections just like C blocks using ]] and [[ keystrokes.&lt;/li&gt;&lt;li&gt;Ordinary text (i.e. not specifically formatted for this syntax) looks sane.&lt;/li&gt;&lt;/ul&gt;Normally I have a line in my .vimrc assigning the filetype "todolist" to the file where I keep my permanent todolist, but another way to add this highlighting to a text file is to add &lt;tt&gt;vim: ft=todolist&lt;/tt&gt; to the end of a file. 
It's harmless.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Update (4/2/09):&lt;/b&gt; I uploaded the script to vim.org, where it will be easier to track and update.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Update (1/1/10):&lt;/span&gt; Here's an example of how to use this color scheme for course notes.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_GoA5eGd81R8/Sz4__Qs3hqI/AAAAAAAAAJ8/ncGJ1Dm-vD8/s1600-h/todolist-desert.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 383px;" src="http://1.bp.blogspot.com/_GoA5eGd81R8/Sz4__Qs3hqI/AAAAAAAAAJ8/ncGJ1Dm-vD8/s400/todolist-desert.png" alt="" id="BLOGGER_PHOTO_ID_5421841357448119970" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;60 underscores (my preference) and a curly brace indicate a new section&lt;/li&gt;&lt;li&gt;Subsection lines end with a colon (generally followed by bullet points)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Special or out-of-context notes start with an asterisk&lt;/li&gt;&lt;li&gt;For separation, or to display a different sort of sub-heading, play with asterisks: '* * *' centered, or '** OLD **' for example&lt;/li&gt;&lt;/ul&gt;At school, I run a shell script for each new class that creates a new directory from the course name, copies a skeleton of this example text to a file called lecture-notes.txt, etc., and adds the directory to Mercurial -- so while there's some boilerplate involved with this plugin, it's easy to automate and plays well with Vim's text-munging capabilities.&lt;br /&gt;&lt;br /&gt;I've also picked up the habit of putting @contexts above unsorted items at the top of my main to-do list, inspired by the GTD approach. 
The syntax plugin doesn't take advantage of this yet; I'll post another update when that's done.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/266234734515043410-7275581963388096144?l=etalog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://etalog.blogspot.com/feeds/7275581963388096144/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=266234734515043410&amp;postID=7275581963388096144' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/7275581963388096144'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/7275581963388096144'/><link rel='alternate' type='text/html' href='http://etalog.blogspot.com/2008/03/vimming-your-way-to-top.html' title='Vimming your way to the top'/><author><name>Eric Talevich</name><uri>http://www.blogger.com/profile/10168388850793209768</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/-HrScY__e4Nk/TiCn_6ISC-I/AAAAAAAAAMY/zaiWEWFEvbg/s220/selfpic-turtle.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_GoA5eGd81R8/Sz4__Qs3hqI/AAAAAAAAAJ8/ncGJ1Dm-vD8/s72-c/todolist-desert.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-266234734515043410.post-3731994283989264393</id><published>2008-01-19T12:58:00.000-05:00</published><updated>2009-03-09T00:13:43.825-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='lisp'/><category scheme='http://www.blogger.com/atom/ns#' term='functional'/><category scheme='http://www.blogger.com/atom/ns#' term='function-level'/><category 
scheme='http://www.blogger.com/atom/ns#' term='programming'/><category scheme='http://www.blogger.com/atom/ns#' term='blub'/><category scheme='http://www.blogger.com/atom/ns#' term='languages'/><title type='text'>On Blub</title><content type='html'>There's an interesting (old) discussion thread at Raganwald.com:&lt;br /&gt;&lt;a href="http://weblog.raganwald.com/2006/10/are-we-blub-programmers.html"&gt;http://weblog.raganwald.com/2006/10/are-we-blub-programmers.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Blub Theory evolves over the course of the comments. First off, the requirements for Blubbiness are defined as:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;There's at least one language that's worse for the task at hand, and the programmer realizes (validly) that it's less suited for the task than Blub.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;There's at least one language that's better for the task, and the programmer doesn't realize that it's better for the task than Blub.&lt;/li&gt;&lt;/ol&gt;That's Blub from the programmer's perspective. For the adventurous programmer, Blub is the language you're fighting against when you try to introduce Python or OCaml to your programming team. (Citations: &lt;a href="http://www.paulgraham.com/avg.html"&gt;Beating the Averages&lt;/a&gt;, &lt;a href="http://www.paulgraham.com/pypar.html"&gt;The Python Paradox&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;Another commenter touches on the management perspective of Blub:&lt;br /&gt;&lt;blockquote&gt;&lt;i&gt;I misread "Blub" to be "bulb". As in when a programmer burns out his employer just throws him away and screws in a new one.&lt;/i&gt;&lt;/blockquote&gt;'Nuff said.&lt;br /&gt;&lt;br /&gt;When we're not trying to pin it down too carefully, we know exactly what Blub is. The first Blub was Fortran &amp;mdash; and in some circles, it's still Blub. Guy Steele Jr. 
(who was involved in the design of Scheme, Common Lisp, and Java in previous efforts to take down previous Blubs) is currently working on &lt;a href="http://en.wikipedia.org/wiki/Fortress_programming_language"&gt;Fortress&lt;/a&gt; with the same goal. Fortress looks nothing like Fortran, but the name's close enough to get the point across, with the point being that it's intended to be much better suited for tasks where scientific programmers instinctively reach for Fortran. C++ was Blub for the '90s, and since real Blubs never die, it's still the Blub of choice for most performance-critical stuff. Java out-Blubbed C++, and now Java and C# are splitting the Blub market. However, examples of Blub code are still generally a simplified C++, since the equivalent in Java would take too much boilerplate to be worth the column space.&lt;br /&gt;&lt;br /&gt;(Sidebar: Modern Fortran looks almost nothing like the original Fortran. C++, especially the variety found in Visual Studio now, and with the 200X extensions, is also a mutant. And it's not a superset of C99, either, defeating the original purpose of the language. Visual Basic, VBA and VB.NET are not compatible, despite the naming. How can a language take its users for such a ride over the years when another language with a different name might be the more logical next step? Javascript and C# surely got a bit more mileage on name recognition. All the evidence points to human psychology still working on programmers.)&lt;br /&gt;&lt;br /&gt;But the argument breaks down when we try to explain &lt;font style="font-style: italic;"&gt;why&lt;/font&gt; another language is better suited for the task than Blub. There are C++ programmers out there who can &lt;a href="http://en.wikipedia.org/wiki/ICFP_Programming_Contest#Prizes"&gt;bust out a better program&lt;/a&gt; than you can write with any other language. 
Out of the last 10 years of the ICFP Programming Contest, 3 winners used Haskell, 3 used OCaml, and 2 used C++ &amp;mdash; and this is a contest arranged by functional-programming gurus. Python and Ruby have never received any prizes, though Perl was recognized by the second-place team last year.&lt;br /&gt;&lt;br /&gt;I see two axes to evaluate languages on: something like the front end and the back end. Semantics and implementation. Both are labeled "power", but for the front end that implies what the language does for the programmer, and for the back end it's what the program does for the machine.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Example 1:&lt;/b&gt; Ruby has an excellent front end. It's one of the most expressive languages available; that's probably why 37signals picked it for Rails. The best semantic ideas from Smalltalk, Common Lisp and Perl are all in there; most of the famous Design Patterns are built in either implicitly or explicitly. (It's not that the language makes them all obsolete; the language designer just had the foresight to implement the tricky ones for you.)&lt;br /&gt;&lt;br /&gt;But the back end has been playing catch-up. Performance lagged well behind even Python and Perl until the very recent v1.9, and there's no native-code compiler. I could be misinformed, but I've also heard that threading, as in Python, limits the interpreter process to a single processor; there's no built-in foreign-function interface &lt;font style="font-style: italic;"&gt;a la&lt;/font&gt; Haskell or Python's ctypes module; Unicode support has rough spots; and large-scale parallelism and concurrency basically mean running a bunch of separate Ruby processes.&lt;br /&gt;&lt;br /&gt;There's a lot for a Blub programmer to pick on.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Example 2:&lt;/b&gt; &lt;a href="http://en.wikipedia.org/wiki/Borland_Delphi"&gt;Delphi&lt;/a&gt;, a.k.a. 
Object Pascal, has an excellent native-code compiler, with support for cross-platform compilation and single-file (DLL-free) executables, and can also run on .NET, with all these options available through the same IDE. It's competitive with C on benchmarks, often faster. Integration with databases and other external components is solid. Refactoring tools are included with the IDE, lots of fun with static analysis. Object Pascal itself was originally designed at Apple some years before Borland picked it for their offering, then abandoned, but there seems to be something inherent in the language that enables highly optimized compilation. The Free Pascal implementation, for instance, comes well ahead of every other language in the &lt;a href="http://shootout.alioth.debian.org/gp4/benchmark.php?test=all&amp;amp;lang=all&amp;amp;calc=Calculate&amp;amp;xfullcpu=1&amp;amp;xmem=1&amp;amp;xloc=0&amp;amp;binarytrees=1&amp;amp;chameneosredux=0&amp;amp;fannkuch=1&amp;amp;fasta=1&amp;amp;knucleotide=1&amp;amp;mandelbrot=1&amp;amp;meteor=0&amp;amp;nbody=1&amp;amp;nsieve=1&amp;amp;nsievebits=1&amp;amp;partialsums=1&amp;amp;pidigits=1&amp;amp;recursive=1&amp;amp;regexdna=1&amp;amp;revcomp=1&amp;amp;spectralnorm=1&amp;amp;hello=0&amp;amp;sumcol=1&amp;amp;threadring=1"&gt;Computer Language Benchmarks Game&lt;/a&gt; when memory and speed are weighted equally. On the combined benchmarks, Free Pascal uses only half as much memory as C (gcc)!&lt;br /&gt;&lt;br /&gt;The catch is, Object Pascal is a &lt;a href="http://en.wikipedia.org/wiki/Object_Pascal#Example_.22hello_world.22_programs"&gt;cheesy-looking language&lt;/a&gt;. On the same benchmarks set, comparing the size of the code in gzipped bytes (emphasizing tokens instead of characters), Object Pascal comes in 24th out of 33 languages, just behind C. It beats Fortran, Java and C++, but not C#. 
I think I'd just buy more RAM rather than rewrite a Blub program in Object Pascal.&lt;br /&gt;&lt;br /&gt;The benchmarks tend to be heavy mathematical algorithms, rather than general-use applications, so certain things like I/O, libraries and support for bottom-up programming and meta-programming are discounted. Regardless, Python, Perl and Ruby are the top 3 languages for code size on these benchmarks &amp;mdash; I think Lisp was hurt more by this aspect of the benchmarks, since syntactic sugar isn't built in; there's no room for code reduction via mini-language. Haskell was probably helped by the absence of I/O. In general the benchmarks show that Blub languages perform well but are somewhat verbose, while scripting and Web-friendly languages are concise but have poor performance; Prolog ranks badly in every way, while OCaml and Haskell do well in every way; this fits reality fairly well for number-crunching but not for the Web or AI. Let's acknowledge once again that benchmarks aren't perfect, and forge ahead.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Fact:&lt;/b&gt; A language and a compiler are not the same thing.&lt;br /&gt;&lt;br /&gt;But the Object Pascal example should show that a language can baby the compiler to give better results. The three arguments go:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Declaring static types and generally programming close to the metal gives the compiler the information it needs to generate an optimally efficient program. That's why C can be fast — it's a close fit to the hardware. Same goes for Java and the JVM.&lt;/li&gt;&lt;li&gt;Using the right abstractions and strong type inferencing lets the compiler get a high-level view of what your algorithm is doing, allowing it to do more optimizations itself. 
That's why OCaml and Haskell can be fast &amp;mdash; they're a close fit to the pure algorithm.&lt;/li&gt;&lt;li&gt;While the expressiveness of new languages like Ruby and Python is appealing, the race to incorporate imperative, object-oriented and functional programming styles into every major language is actually resulting in weaker languages. Borrowing features doesn't bring a language any closer to providing a new model of computation, and it certainly doesn't give a better angle of attack at the whole point of all of this &amp;mdash; making the computer hardware do what we want.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;The third argument was made by John Backus, creator of Fortran, the &lt;a href="http://en.wikipedia.org/wiki/Backus-Naur_form"&gt;Backus-Naur Form&lt;/a&gt; for defining programming language syntaxes, and later the FP and FL programming languages.&lt;br /&gt;&lt;blockquote&gt;&lt;i&gt;"Programming languages appear to be in trouble. Each successive language incorporates, with a little cleaning up, all the features of its predecessors plus a few more. [...] Each new language claims new and fashionable features... but the plain fact is that few languages make programming sufficiently cheaper or more reliable to justify the cost of producing and learning to use them."&lt;/i&gt;&lt;/blockquote&gt;&lt;div style="text-align: right;"&gt;— John Backus&lt;/div&gt;&lt;br /&gt;The talk that began with this argument went on to introduce &lt;a href="http://en.wikipedia.org/wiki/Function-level_programming"&gt;function-level programming&lt;/a&gt;. At the time, everyone thought Backus was talking about functional programming, so it unintentionally gave a boost to Lisp and later the ML family, from which Haskell and OCaml are derived. But no: it was actually about a new language called FP, somewhat based on APL. 
FP begat FL, which went nowhere, but Morgan Stanley created an ASCII-friendly variant of APL called &lt;a href="http://www.aplusdev.org/"&gt;A+&lt;/a&gt; (which is now free, GPL'd software), and the proprietary &lt;a href="http://www.jsoftware.com/"&gt;J&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/K_%28programming_language%29"&gt;K&lt;/a&gt; have carried the torch since then. The use of &lt;a href="http://www.nsl.com/"&gt;these languages&lt;/a&gt; now seems to be mostly in the financial world. (Perhaps because they're really very well suited to financial tasks, and perhaps because that's where APL made its splash &amp;mdash; who knows, it may have even Blubbed its way in.)&lt;br /&gt;&lt;br /&gt;The main idea is point-free programming: rather than pushing values around (as even functional programming languages do), compose functions together to create an algorithm that only references functions, not values. Then create a basic set of operators that can be composed together to create higher-level functions. This is an excellent way to manipulate arrays and matrices. Haskell touches on this idea but doesn't emphasize it.&lt;br /&gt;&lt;br /&gt;Benchmarks for any of these languages are hard to find, but I see one cryptic statement about K &lt;a href="http://www.kx.com/a/k/examples/bell.k"&gt;here&lt;/a&gt;:&lt;br /&gt;&lt;pre&gt; [k is much faster than c on strings and memory management.]&lt;/pre&gt;&lt;br /&gt;And another startling statement on &lt;a href="http://en.wikipedia.org/wiki/K_%28programming_language%29#Performance_characteristics"&gt;Wikipedia&lt;/a&gt;:&lt;br /&gt;&lt;blockquote&gt;&lt;i&gt;The performance of modern CPUs is improving at a much faster rate than their memory subsystems. The small size of the interpreter and compact syntax of the language makes it possible for K applications to fit entirely within the level 1 cache of the processor. 
Vector processing makes efficient use of the cache row fetching mechanism and posted writes without introducing bubbles into the pipeline by creating a dependency between consecutive instructions.&lt;/i&gt;&lt;/blockquote&gt;&lt;br /&gt;It looks pretty convincing to me. Finally, a fresh look at how programming languages make a machine do work.&lt;br /&gt;&lt;br /&gt;This seems to be the argument missing from every language war: by removing the non-orthogonal parts of a language, it becomes more powerful. K doesn't have objects or continuations, and it doesn't need them. Likewise, Haskell restricts the ability to modify state to monads, Erlang's flow-control constructs throw out traditional iteration entirely, and Lisp virtually strips out syntax itself.&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;i&gt;Programming languages should be designed not by piling feature on top of feature, but by removing the weaknesses and restrictions that make additional features appear necessary.&lt;/i&gt;&lt;/blockquote&gt;&lt;div style="text-align: right;"&gt;— Revised&lt;sup&gt;5&lt;/sup&gt; Report on the Algorithmic Language Scheme&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;The corollary to (and irony of) the Blub paradox is that since these optimized languages are missing constructs found in Blub &amp;mdash; by design &amp;mdash; a Blub programmer will always have plenty to pick on.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/266234734515043410-3731994283989264393?l=etalog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://etalog.blogspot.com/feeds/3731994283989264393/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=266234734515043410&amp;postID=3731994283989264393' title='3 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/266234734515043410/posts/default/3731994283989264393'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/3731994283989264393'/><link rel='alternate' type='text/html' href='http://etalog.blogspot.com/2008/01/on-blub.html' title='On Blub'/><author><name>Eric Talevich</name><uri>http://www.blogger.com/profile/10168388850793209768</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/-HrScY__e4Nk/TiCn_6ISC-I/AAAAAAAAAMY/zaiWEWFEvbg/s220/selfpic-turtle.png'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-266234734515043410.post-4126275864422630342</id><published>2007-11-08T12:43:00.000-05:00</published><updated>2008-03-07T16:47:51.526-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='javascript'/><category scheme='http://www.blogger.com/atom/ns#' term='c++'/><category scheme='http://www.blogger.com/atom/ns#' term='programming'/><category scheme='http://www.blogger.com/atom/ns#' term='c'/><category scheme='http://www.blogger.com/atom/ns#' term='style'/><title type='text'>Curly-brace wrangling</title><content type='html'>This is not new, but neither is C:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.chris-lott.org/resources/cstyle/witters_tao_coding.html"&gt;http://www.chris-lott.org/resources/cstyle/witters_tao_coding.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I'm investigating coding style, and this is the best style I've seen so far -- at least for C, C++ and Javascript -- which appear to be the three languages most susceptible to oddly formatted code, anyway. I flailed against the extra space inside parens for a little while, but now I see the wisdom of it (for C and C++, since parens should sting a little there). 
Combine Witters' style with typedefs for function pointers, and I think I'm set for inflicting well-informed pedantry on any new code that comes under my gaze.&lt;br /&gt;&lt;br /&gt;The two other sources on programming style/zen that impressed me are also referenced here: the style guide for the &lt;a href="http://www.chris-lott.org/resources/cstyle/LinuxKernelCodingStyle.txt"&gt;Linux kernel&lt;/a&gt; is short and sweet, and Fred Brooks' "No Silver Bullet" is, like, important. Rob Pike's &lt;a href="http://www.lysator.liu.se/c/pikestyle.html"&gt;article&lt;/a&gt; is also interesting, especially at the end. There's a lot of cross-referencing between all of these and the Wikipedia entry on &lt;a href="http://en.wikipedia.org/wiki/Unix_philosophy"&gt;Unix philosophy&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/266234734515043410-4126275864422630342?l=etalog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://etalog.blogspot.com/feeds/4126275864422630342/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=266234734515043410&amp;postID=4126275864422630342' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/4126275864422630342'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/4126275864422630342'/><link rel='alternate' type='text/html' href='http://etalog.blogspot.com/2007/11/this-is-not-new-but-neither-is-c.html' title='Curly-brace wrangling'/><author><name>Eric Talevich</name><uri>http://www.blogger.com/profile/10168388850793209768</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' 
src='http://4.bp.blogspot.com/-HrScY__e4Nk/TiCn_6ISC-I/AAAAAAAAAMY/zaiWEWFEvbg/s220/selfpic-turtle.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-266234734515043410.post-4066766658321960654</id><published>2007-10-16T21:46:00.000-04:00</published><updated>2007-10-16T23:16:51.196-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='linspire'/><category scheme='http://www.blogger.com/atom/ns#' term='arch'/><category scheme='http://www.blogger.com/atom/ns#' term='opensuse'/><category scheme='http://www.blogger.com/atom/ns#' term='linux'/><category scheme='http://www.blogger.com/atom/ns#' term='debian'/><category scheme='http://www.blogger.com/atom/ns#' term='ubuntu'/><category scheme='http://www.blogger.com/atom/ns#' term='suse'/><category scheme='http://www.blogger.com/atom/ns#' term='cnr'/><title type='text'>OpenSUSE 10.3: Does it satisfy?</title><content type='html'>In a way.&lt;br /&gt;&lt;br /&gt;I'm coming from a more Ubuntu climate, so immediately I see some regressions:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The main installer requires a DVD, not a CD.&lt;/li&gt;&lt;li&gt;The CD installers are split between KDE, Gnome, and Non-Free programs. This wouldn't be a problem, except...&lt;/li&gt;&lt;li&gt;The hardware detection, particularly for networking, fails where Ubuntu and Puppy succeed. And once I find a decent (wired) ethernet connection, avoiding that problem, I discover that...&lt;/li&gt;&lt;li&gt;The software repositories are crap compared to Debian &amp;amp; family.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Every full-grown Debian tribesman and -woman should know this: You have been coddled with your software repositories. 
OpenSUSE's &lt;a href="http://en.opensuse.org/Package_Repositories"&gt;main repository&lt;/a&gt; has enough basic software (and an inappropriate amount of Mono stuff) to overflow a DVD, and that's about where it ends.&lt;br /&gt;&lt;br /&gt;To manage additional software with the built-in package manager (bundled with the mighty &lt;a href="http://en.opensuse.org/YaST"&gt;YaST&lt;/a&gt;), there's a helpful "Community Packages" module in YaST, which makes it easy-peasy to connect to a few more community-maintained repositories, as well as the official Gnome and KDE repositories. The catch is that these repositories also suck. For developers, there's little additional material in the semi-official community repositories that can't be found in the main or mostly-official repositories. Meanwhile, the searching and filtering mechanism gets a "meh," while adding and updating repositories seems to take an absurd amount of time compared to APT-based systems. If you're a (non-Mono) developer, this situation is a problem.&lt;br /&gt;&lt;br /&gt;There's hope. To compensate for certain underwhelming package managers, Linspire Inc. created &lt;a href="http://cnr.com/"&gt;Click 'n' Run&lt;/a&gt;, a (now) cross-distro packaging system. Unfortunately, it's taken some time to get off the ground -- it's currently in alpha, after a year or so in the announced-with-fanfare-but-still-vaporware stage -- and it doesn't support OpenSUSE. However, the site does support Linspire's own distros, plus Ubuntu. Debian, Fedora, and OpenSUSE will also have supported CNR clients in... the future. Clearly CNR.com has high ambitions, but today, on OpenSUSE, they're just ambitions.&lt;br /&gt;&lt;br /&gt;On the other hand, OpenSUSE does have YaST. And animals. In fact, that was the main thing my fiancée appreciated about these new distros I keep installing everywhere: so many exciting animals. 
She was transfixed by the &lt;a href="http://en.opensuse.org/Artwork:Brand"&gt;green lizard&lt;/a&gt; guarding the KDE menu in OpenSUSE -- envious, you might say. Especially when it flashed red. Other animals in my menagerie are a &lt;a href="http://en.opensuse.org/Kerry"&gt;Kerry&lt;/a&gt; beagle in the system tray, and a &lt;a href="http://www.pidgin.im/"&gt;Pidgin&lt;/a&gt; &lt;a href="http://upload.wikimedia.org/wikipedia/commons/thumb/1/18/Pidgin.svg/120px-Pidgin.svg.png"&gt;pigeon&lt;/a&gt; on the quick launch applet.&lt;br /&gt;&lt;br /&gt;I believe that as Gnome and KDE mature, and spawn their own tools for easy and error-resistant system administration, YaST will become increasingly unnecessary. For now, it's mighty handy for setting up a new system, particularly without an Internet connection for locating and reading the proverbial manual. Then again, in a perfect Linux, much of this would have already been ready to go after a fresh install. Another issue: the YaST configuration scripts occasionally tweak unrelated settings as YaST finishes up, resulting in occasional mysterious and disconcertingly Windows-like behavior.&lt;br /&gt;&lt;br /&gt;Also, 10 points off for having a "My Computer" icon on ~/Desktop.&lt;br /&gt;&lt;br /&gt;Let's say: If you were traumatized by Arch or Slackware, if you (deep inside) genuinely like having Windows on your desktop, if you just want to set up a Linux box and never fuss with it again, you might like OpenSUSE. But if you were deeply offended by OpenSUSE, if you want Linux in your morning coffee, you might like Arch. 
And if you just want a desktop that works, if you're feeling a little Gutsy, wait two more days and grab a copy of Ubuntu 7.10.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/266234734515043410-4066766658321960654?l=etalog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://etalog.blogspot.com/feeds/4066766658321960654/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=266234734515043410&amp;postID=4066766658321960654' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/4066766658321960654'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/4066766658321960654'/><link rel='alternate' type='text/html' href='http://etalog.blogspot.com/2007/10/opensuse-103-does-it-satisfy.html' title='OpenSUSE 10.3: Does it satisfy?'/><author><name>Eric Talevich</name><uri>http://www.blogger.com/profile/10168388850793209768</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/-HrScY__e4Nk/TiCn_6ISC-I/AAAAAAAAAMY/zaiWEWFEvbg/s220/selfpic-turtle.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-266234734515043410.post-5654912634930399580</id><published>2007-08-02T14:31:00.000-04:00</published><updated>2008-01-31T19:49:19.048-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='unix'/><category scheme='http://www.blogger.com/atom/ns#' term='programming'/><category scheme='http://www.blogger.com/atom/ns#' term='perl'/><category scheme='http://www.blogger.com/atom/ns#' term='shell'/><category scheme='http://www.blogger.com/atom/ns#' 
term='languages'/><category scheme='http://www.blogger.com/atom/ns#' term='scripting'/><title type='text'>The Right Tool For The Job: Scripting</title><content type='html'>&lt;i&gt;&lt;br /&gt;Though it's barely planned&lt;br /&gt;The kludgiest of Perl scripts&lt;br /&gt;Is one day maintained&lt;br /&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;I've been learning Perl lately, after having used Python wherever possible for a couple of years. It's gut-wrenching. So today's pedantry is on the topic of scripting languages -- interpreted, batteries-included, "#!/usr/bin/env"-ready languages for getting a simple job done with a minimum of hassle, as I'm defining it.&lt;br /&gt;&lt;br /&gt;Google for "little $LANG script", in quotes, replacing $LANG with each of the most well-known scripting languages. My results:&lt;br /&gt;&lt;br /&gt;Table 1:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;$LANG   @Hits&lt;br /&gt;=====   =====&lt;br /&gt;Perl    32,300&lt;br /&gt;shell   24,400  * what does this mean, exactly?&lt;br /&gt;PHP     15,500&lt;br /&gt;Python  12,000&lt;br /&gt;VB      1080    * Skewed, because "vb script" is also a language&lt;br /&gt;bash    808&lt;br /&gt;batch   624&lt;br /&gt;Ruby    511&lt;br /&gt;Tcl     411&lt;br /&gt;js      271&lt;br /&gt;sh      266&lt;br /&gt;vim     76&lt;br /&gt;C++     7&lt;br /&gt;scheme  7&lt;br /&gt;lisp    6&lt;br /&gt;emacs   3&lt;br /&gt;haskell 3&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;To further abuse Internet statistics, let's search for each language on Google Code:&lt;br /&gt;&lt;br /&gt;Table 2:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;$LANG   @HITS&lt;br /&gt;=====   =====&lt;br /&gt;C++     6,000,000&lt;br /&gt;Perl    1,420,000&lt;br /&gt;Python  1,050,000&lt;br /&gt;PHP     1,590,000&lt;br /&gt;shell   879,000&lt;br /&gt;Ruby    304,000&lt;br /&gt;Lisp    238,000 * includes elisp&lt;br /&gt;Javascript  212,000&lt;br /&gt;Basic   202,000&lt;br /&gt;Tcl     186,000&lt;br /&gt;bat     183,000&lt;br /&gt;Scheme  103,000&lt;br 
/&gt;Haskell 67,700&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Now, combine these two tables to get a ratio representing the "scriptability" of each language. Or rather, divide the Google Code hits by "little script" hits to get a "Script Factor" inversely proportional to the fraction of existing code that qualifies as little scripts. This is hard science.&lt;br /&gt;&lt;br /&gt;Table 3:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;$LANG    Google Code  "Little"  Script Factor   Notes&lt;br /&gt;====     ===========  ========  =============   =====&lt;br /&gt;shell    879,000      24,400    36              * "Shell" is vague&lt;br /&gt;Perl     1,420,000    32,300    44&lt;br /&gt;Python   1,050,000    12,000    88&lt;br /&gt;PHP      1,590,000    15,500    103&lt;br /&gt;Basic    202,000      1080      187             * Includes non-Visual basics&lt;br /&gt;Batch    183,000      624       293&lt;br /&gt;Tcl      186,000      411       453&lt;br /&gt;Ruby     304,000      511       595&lt;br /&gt;Javascript 212,000    271       782&lt;br /&gt;Scheme   103,000      7         14,714&lt;br /&gt;Haskell  67,700       3         22,567&lt;br /&gt;Lisp     238,000      9         26,444          * Includes elisp and common lisp&lt;br /&gt;C++      6,000,000    7         857,143&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Interestingly, this is close to the first list of "little script" languages, with the three P's right up top. The functional languages I threw in for fun are ranked by absurdly small denominators, so I wouldn't say the results are meaningful beyond indicating that even the hardcore people using these languages for real projects are using P-languages and the shell for simple scripts.&lt;br /&gt;&lt;br /&gt;What does this all mean?&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Scripters are using the right tool for the job. 
Good scripting languages float to the top.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Even the most hardcore Lisp and Haskell programmers use something else for scripting. In other words, they know multiple languages, and they, too, use the right tool for the job.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;There are seven idiots in the world writing scripts in C++. One would only do this if unaware of any other scriptable language, and therefore capable of using only one tool for any job.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Emacs users call their customizations "packages" or "modes," not "scripts." Foiled.&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;Now let's get specific.&lt;br /&gt;&lt;h4&gt;Shell&lt;/h4&gt;"Shell" came in first in scriptability, second-place in "little $lang script," but merely fifth in Google Code usage. So shell scripting is a popular way to get things done, but not so much for writing full-on applications.&lt;br /&gt;&lt;br /&gt;What language is shell scripting, exactly? I'm assuming the search hits refer to bash, ksh, csh, zsh, and the rest of the Unix shells, mostly because that's how it showed up on Google Code, and because bash seems to be the default on the major Linux distros. Plus, Windows programmers don't talk about the "shell"; if they wade into the muck of cmd.exe at all, they call it batch, DOS, or occasionally command-line scripting. And they don't talk about it online as much as Unix/Linux gurus, outside a few Microsoft-specific websites, from what I've seen.&lt;br /&gt;&lt;br /&gt;The strengths of the shell are (1) everything is a string; (2) courtesy of Unix design, the sources and recipients of character streams consistently look like filenames; (3) complex programs can be used like functions and filters, directly adding to the shell's abilities (the ultimate FFI, in a way); (4) since code can be data and commands can be piped and redirected around, flow control can be pretty concise. 
The flaws, as I see them, are (1) everything is a string, meaning nontrivial structures must be serialized and parsed at every step; (2) there are few guarantees about what's actually available to the shell on a given system -- paths, environment variables, program versions -- so sharing scripts between systems is wildly unreliable. Still, I've never seen a GUI tool as broadly useful as the shell is for getting computer tasks done.&lt;br /&gt;&lt;h4&gt;Perl&lt;/h4&gt; Legend has it that Larry Wall designed Perl to pull together all of the various Unix sysadmin tools into one effective package, with the plan for it to be especially useful for text manipulation (Reporting and Extraction). So C, bash, awk, sed, grep, and friends are all in there -- in short, it keeps the shell's advantages and does its best to eliminate the disadvantages. (Best of all, it finally got regular expressions right.) And then there's CPAN. I'm not surprised that Perl is #1 for "little scripts" that are just complex enough to be worth saving.&lt;br /&gt;&lt;br /&gt;What is Perl the right tool for?&lt;br /&gt;&lt;ul&gt;&lt;li&gt;One-liners that bash doesn't have an equivalent for -- Perl is installed almost everywhere bash is&lt;/li&gt;&lt;li&gt;Straightforward text-processing scripts (Python's immutable strings are a weakness here, and Ruby installations still aren't a universal default)&lt;/li&gt;&lt;li&gt;It was a great server-side scripting language during the first dotcom boom (though Java managed to cast itself as the more legit (enterprisey) big brother here). Since Perl coders weren't afraid to get things done "right now," mod_perl made the combination of Apache and Perl effective, scalable, and most importantly, &lt;span style="font-style: italic;"&gt;available&lt;/span&gt; just when it was needed.&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;Python&lt;/h4&gt;Python fixes Perl, says the next legend. 
But its strength as a scripting language is that it fixes Java, too -- and as it turns out, Python's "scriptability" is exactly half that of Perl's. Spooky, no?&lt;br /&gt;&lt;br /&gt;I like Python. It makes sense to C programmers and Unix hermits. And, thanks to Guido's diligent attention to aesthetics, ugly Python code almost always means you're doing something awkward, slow or wrong. The language rewards good behavior with readable, concise code. You know that whitespace issue where if you copy code from a forum and paste it into your own code, the interpreter will crap out on the indentation? It's &lt;i&gt;punishing you&lt;/i&gt; for blind copy-and-paste. Doesn't that creep you out a little? Guido is basically handing out candy if you read the documentation on generator expressions, and slapping you on the wrist if you don't read your own code before running it.&lt;br /&gt;&lt;br /&gt;There doesn't seem to be a single theoretical approach that guarantees a language will work that way, but for Python, it seemed to work.&lt;br /&gt;&lt;br /&gt;What is Python the &lt;span style="font-style: italic;"&gt;wrong&lt;/span&gt; tool for?&lt;br /&gt;&lt;ul&gt;&lt;li&gt;One-liners -- remember that thing about whitespace?&lt;/li&gt;&lt;li&gt;Unix tasks that have already been thoroughly solved with existing command-line tools (see Bash).&lt;/li&gt;&lt;li&gt;Number crunching (by itself, but see SciPy and Parallel Python). 
Python 3.0 borrows most of Scheme's numerical tower, so that may improve the situation.&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/266234734515043410-5654912634930399580?l=etalog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://etalog.blogspot.com/feeds/5654912634930399580/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=266234734515043410&amp;postID=5654912634930399580' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/5654912634930399580'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/5654912634930399580'/><link rel='alternate' type='text/html' href='http://etalog.blogspot.com/2007/08/right-tool-for-job-scripting.html' title='The Right Tool For The Job: Scripting'/><author><name>Eric Talevich</name><uri>http://www.blogger.com/profile/10168388850793209768</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/-HrScY__e4Nk/TiCn_6ISC-I/AAAAAAAAAMY/zaiWEWFEvbg/s220/selfpic-turtle.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-266234734515043410.post-5343940066420548863</id><published>2007-08-01T18:01:00.000-04:00</published><updated>2008-03-07T16:46:46.378-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='linux'/><category scheme='http://www.blogger.com/atom/ns#' term='userfriendly'/><category scheme='http://www.blogger.com/atom/ns#' term='ubuntu'/><title type='text'>Upgrading Ubuntu</title><content type='html'>Upgrading Ubuntu between major versions is a game. 
Your machine was running smoothly before you ran update-manager and plunged into a massive set of repository changes and software upgrades, so clearly, your machine is also capable of running the next major version of Ubuntu. The game is when something breaks during the transition. It's randomized to make it more of a challenge, so you can't just look up a walkthrough on the Ubuntu forums or Gamefaqs. Somewhere, a config file was mangled, or the flags on a low-level program changed and a caller failed to compensate for the new configuration. Now, Edgy is counting on you to use your command-line skills to track down the culprit and &lt;i&gt;make him pay&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;New installations from a CD are a breeze these days (as of Breezy...), assuming you've already tried the live CD on your system and it seemed to work. Before Dapper Drake, nobody (meaning, not myself) expected dist-upgrade to work without a hitch. Ubuntu was a fresh new operating system, cranking out new major versions every 6 months or so, and if you're into trying out new OSes, you've probably quickly learned, the hard way or the easy way, to use a fresh CD if you need your computer up and running again today. If the game of fixing a broken system is too frustrating, your live CD will save you: Death from above to the existing installation, and reinstall from scratch. If you have a separate /home partition, you're in good shape. If you need to rescue some files before the great annihilation, the live CD helps you there, too.&lt;br /&gt;&lt;br /&gt;But as of Dapper Drake, the game is winnable for most users. Since Dapper is designed for long-term support, update-manager normally doesn't offer the option to do a dist-upgrade through the GUI. As I recall, dist-upgrade didn't really work in October 2006 when Edgy was released, either. Dapper took 7 1/2 months to finish instead of the usual 6, and Canonical compensated by pulling together Edgy in 4 1/2 months. 
So, there was some revolting hack posted on the Ubuntu wiki for getting 'er done. Perform the listed incantations, and you end up with a system that should be pure Edgy, but in practice dies horribly. For me and others, X died, and the command-line interface mostly died, too. There was a prompt with strange terminal font rendering, and the shift key had inscrutable behavior, so if your password involved that advanced functionality... well, shoot. I think I used the recovery mode and ran variations on apt-get to finish the upgrade and get Ubuntu back on its feet again.&lt;br /&gt;&lt;br /&gt;I've been involved in two Dapper-to-Feisty upgrades this week, and the game is prettier to look at now. Better graphics, flashier bad guys, improved gaming experience overall. On my veteran P3 desktop, a handcrafted relic from the 20th century, I rediscovered the power cord and brought it back into service. Yep, still a computer. Standard Ubuntu Dapper. Before getting my current Toshiba POS vintage 2001 laptop, I used this box for trying out exciting new Linux distros, and had kind of a stormy relationship with Synaptic. Some things are installed that shouldn't be, in strange ways, with hand-mangled config files. Sounds like a good candidate for the newly streamlined Dapper-to-Edgy upgrade process:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;gksu "update-manager -c"&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;This launches update-manager with a graphical prompt for superuser privileges, and tells it to check for distro upgrades. Type a password, click the shiny buttons, let it chill for 2 hours before rebooting.&lt;br /&gt;&lt;br /&gt;On ye olde Pentium the Third, it worked pretty well. Some packages were broken, but X launched and Gnome loaded -- with a couple of angry message boxes letting me know that gnome-panel was disgruntled. Helpfully, Bug Buddy popped right up to tell me that it couldn't do anything useful for me or the developers, and allowed me to close it. 
The Gnome panel then restarted and crashed again, launching another tragic Bug Buddy, ad infinitum. At first I was fooled into thinking Bug Buddy was a modal dialog on top of all of Gnome, locking me out from doing anything else, but in fact it was just the panel that was broken, and Bug Buddy could safely be ignored. The desktop and other applications still worked.&lt;br /&gt;&lt;br /&gt;Having failed to set a memorable key binding for xterm, and lacking the initiative to look up the built-in way to do it (didn't know about Ctrl-Alt-F2 and virtual terminals yet), I cobbled together a desktop icon to launch xterm, and used that to run update-manager:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;sudo update-manager&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;This works in Edgy because here, update-manager checks for distro upgrades by default. If I understand correctly. Anyway, it worked, I followed the graphical upgrade sequence again to end up with a fully functioning Feisty Fawn, and I didn't even have to get Edgy working properly. Most of the relevant configuration files get updated in the upgrade process, so whatever had previously been mangled was corrected in the wonderfully silky-smooth Feisty upgrade process.&lt;br /&gt;&lt;br /&gt;Back in 10/2006, I recounted my upgrade woes to my sister, probably finishing the story with "it wasn't really that bad," and she opted to stick with Dapper until a better upgrade path came along. Well, that path came along. She was dismayed to find that you can't go straight from Dapper to Feisty, but decided that Feisty was worth it. The specific motivation: Her laptop (newer than either of my machines) uses certain lame components that don't have particularly good driver support in Dapper, and she believes her wireless and ethernet connections will behave better in Feisty.&lt;br /&gt;&lt;br /&gt;I wasn't there for the prologue, but we teamed up for the big game this morning. This time, it really wasn't that bad at all. 
Gnome came up correctly, and since the only program we need access to in Edgy is update-manager, that was the immediate next step.&lt;br /&gt;&lt;br /&gt;The GUI option to upgrade to Feisty didn't come up. Strange. Something's wrong. On updating the archive list, we saw a stream of errors connecting to the repositories -- OK, now we know what the game is. &lt;i&gt;The Internet died&lt;/i&gt;. More specifically, since this is a wired connection and she doesn't use NetworkManager, the mangled config file is /etc/network/interfaces.&lt;br /&gt;&lt;br /&gt;For newer Ubuntu users, this is how you find out what happened to your internet access:&lt;br /&gt;1. If you're using wireless, type &lt;tt&gt;iwconfig&lt;/tt&gt; to see what wireless devices are active. Or, type &lt;tt&gt;ifconfig&lt;/tt&gt; to see what all your networking devices are doing.&lt;br /&gt;2. If you see warning messages, pay attention. If you see just a list of devices and no sign of activity, and you're not using NetworkManager, your network interfaces config file is wrong.&lt;br /&gt;3. &lt;tt&gt;sudo gedit /etc/network/interfaces&lt;/tt&gt; (replacing gedit with your text editor of choice if you care)&lt;br /&gt;4. Fix the config file. Use the Ubuntu wiki or an online search if necessary. (Obviously, from another computer.) If you're lost, you can erase whatever you don't understand and go through the Networking GUI to rebuild it.&lt;br /&gt;&lt;br /&gt;As usual, the next upgrade from Edgy to Feisty was clean and uneventful. The only complaint I heard later was that the wireless situation was about the same level of fussiness as before (I still don't know the details). However, NetworkManager seems to be alive by default in Feisty -- or else I didn't notice when it was installed -- so there are more options for fussing with the wireless connection right off the bat. Fuss, fuss, dhclient, fuss, restart, gold. 
So, I think we're in a good place now.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/266234734515043410-5343940066420548863?l=etalog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://etalog.blogspot.com/feeds/5343940066420548863/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=266234734515043410&amp;postID=5343940066420548863' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/5343940066420548863'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/266234734515043410/posts/default/5343940066420548863'/><link rel='alternate' type='text/html' href='http://etalog.blogspot.com/2007/08/upgrading-ubuntu-between-major-versions.html' title='Upgrading Ubuntu'/><author><name>Eric Talevich</name><uri>http://www.blogger.com/profile/10168388850793209768</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/-HrScY__e4Nk/TiCn_6ISC-I/AAAAAAAAAMY/zaiWEWFEvbg/s220/selfpic-turtle.png'/></author><thr:total>0</thr:total></entry></feed>
