Removing duplicates using itertools.groupby
Eliminating duplicates
Previously I discussed merging sorted streams using Python; a call to a standard library function for those who’ve updated to 2.6.
>>> from heapq import merge >>> from itertools import count, imap, islice >>> m2, m3, m5 = [imap(n.__mul__, count(1)) for n in (2, 3, 5)] >>> m235 = merge(m2, m3, m5) >>> list(islice(m235, 10)) [2, 3, 4, 5, 6, 6, 8, 9, 10, 10]
Here, we merge positive multiples of 2, 3 and 5 into a single stream. 6 appears twice in the output, being a multiple of both 2 and 3. 10 is similarly duplicated and 30 would feature three times.
Itertools.groupby can remove the repeated entries.
>>> from itertools import groupby >>> help(groupby) Help on class groupby in module itertools: class groupby(__builtin__.object) | groupby(iterable[, keyfunc]) -> create an iterator which returns | (key, sub-iterator) grouped by each value of key(value).
If the key function is not specified or is None it defaults to an identity function, and groupby partitions an iterable into subiterators over equal valued elements.
>>> from operator import itemgetter >>> first = itemgetter(0) >>> m235 = merge(*(imap(n.__mul__, count(1)) for n in (2, 3, 5))) >>> u235 = imap(first, groupby(m235)) >>> list(islice(u235, 10)) [2, 3, 4, 5, 6, 8, 9, 10, 12, 14]
Merging sorted streams in Python
Problem
In a post on his Code Monk blog David Jones offers up some Python tidbits he’s only recently discovered and invites readers to share similar discoveries. I’d like to respond here by talking about a recipe from the Python Cookbook.
Problem
You have several sorted sequences (iterables) and need to iterate on the overall sorted sequence that results from “merging” these sequences.
For example, we could merge multiples of 2, 3 and 5.
>>> from itertools import count, imap, islice >>> m2, m3, m5 = [imap(n.__mul__, count(1)) for n in (2, 3, 5)] >>> m235 = merge(m2, m3, m5) >>> list(islice(m235, 10)) [2, 3, 4, 5, 6, 6, 8, 9, 10, 10]
Here, 6 appears twice in the merged stream since it’s a multiple of 2 and also of 3, and similarly 10 makes a double appearance.
This example merges three infinite streams. If we were sure all our inputs were finite, we might well simply chain them together and sort the whole thing.
>>> from itertools import chain >>> def merge(*seqs): >>> return sorted(chain(*seqs))
An algorithm which deals with a (potential) mix of finite and infinite sequences is a little more interesting. We might consider an approach which repeatedly peeks at the head element of each sequence, finds the smallest of these, then pops and yields it. The recipe in the Cookbook improves on this idea.
Launching missiles and other unhappy accidents
Team work
Only Iain really understood the application engine, a particularly subtle and complex part of our system. We depended on him alone for any modifications in this area. To reduce the risk, the project manager paired me with Iain to work on a new feature for the engine. After all,
“What would happen if Iain fell under a bus?”
Well, if something that terrible happened, I don’t suppose I’d have cared much about the app engine. As it turned out we never got started on that feature. A customer reported a critical bug in the graphics library — a more straightforward part of the system, but one which only I knew about — and by the time I’d turned in a fix Iain was heading off for the weekend. Celtic were at home to Rangers. He had a plane to catch.
Haskell makes the A-Z
Switching from buses and planes to Haskell, Computerworld’s great A-Z of Programming Languages series recently featured Simon Peyton Jones talking with characteristic exuberance about the language.
Haskell’s roots are in academia. Unlike, say, C++ or Erlang, it wasn’t created with a particular real world application in mind. Rather, the language aimed at rigorously investigating an approach to programming, that approach being strongly-typed lazily-evaluated functional programming. Pure functional programming, if you like. As the language continues to evolve its purity has not been watered down — what’s more, as computer architectures develop, the real world seems to be turning towards Haskell.
So much for the summary. It’s the terrible side effects I’d like to pick up on.
Life, user manuals, recursive pictures
Reading what to read
Recently, scouring proggit for something worth reading, I found myself directed to Don Knuth’s website and looking at his list of recommended nontechnical books. Topping this list, to my surprise and delight, I found:
Life A User’s Manual by Georges Perec (perhaps the greatest 20th century novel).
Knuth may be an expert on sorting but ranking novels by their greatness is an invidious task. Perec’s masterpiece isn’t that well-known, in English-speaking countries at least, hence my surprise: few who have read the book would dispute the position claimed for it.
The book visits the rooms of a Parisian apartment block, 11 Rue Simon-Crubellier, on June 23 1978, describing the inhabitants and their stories. There’s a fascination with detail and lists, with what exactly can be seen in each room, and I’ll admit this detail perplexed me the first time I read the book — what could be the significance of all of these things?
On re-reading, I savour the details and the strange patterns they form. Early on, we look at the framed posters hanging on the wall of Madame de Beaumont’s drawing room.
One of [the posters] depicts four greedy-looking monks sitting at a table around a Camembert cheese on the label of which four greedy-looking monks — the very same — are again at a table around, etc. The scene is repeated distinctly four times over.
What a wonderful example of visual recursion!
Looping forever and ever
On the subject of syntactic sugar, C’s for loop is a curious beast. Many years ago, on encountering code like:
for (i = 0; i < 10; i++)
{
....
}
I would have to pause and mentally expand it:
i = 0;
while (i < 10)
{
....
i++;
}
Now, after repeated use, these loops seem familiar and expressive. While the standard C++ algorithms offer a higher level abstraction with their iterator range operations, plain old loops often turn out to be easier to work with.
You can plug general expressions into the loop control construct.
for (<setup>; <proceed?>; <advance>)
{
....
}
Any or all of the control expressions can be omitted: drop the <setup> if nothing needs setting up; leave out the <advance> if nothing needs advancing; and kill the <proceed?> to keep the loop going. Thus one standard form of never-ending loop is:
for (;;)
{
....
}
Equivalents would be while (1) or while (true), but who wants to see a bald literal in a source file? Besides, for (;;) is shorter than while (1), and an empty while () is of course a syntax error. Even better: use a suitable font, squint, and the parenthesised semicolons resemble a mite of some sort.
Syntactic Sugar
The SGI STL documentation notes that std::map::operator[] is redundant:
Strictly speaking, this member function is unnecessary: it exists only for convenience.
This sentence pretty much nails what’s meant by “syntactic sugar”, defined more formally in Wikipedia as:
… a term coined by Peter J. Landin for additions to the syntax of a computer language that do not affect its functionality but make it “sweeter” for humans to use.
For example, Python’s @decorator syntax sweetens the practice of wrapping functions. In C++, operator overloading adds nothing which couldn’t be achieved using standard function call syntax, but it opens the door to some inventive and expressive techniques. Haskell has a nice syntax for custom infix operators, and so on.
Lisp values its spare syntax and certainly won’t allow infix operators, but even a minimalist dialect like Scheme allows lists to be represented like this (a b c d e) rather than (a . (b . (c . (d . (e . ()))))).
Perl comes laced and frosted with syntactic sugar. Larry Wall explains why he doesn’t heed Alan Perlis’s famous warning about cancer of the semicolon.
To me, one of the most agonizing aspects of language design is coming up with a useful system of operators. To other language designers, this may seem like a silly thing to agonize over. After all, you can view all operators as mere syntactic sugar — operators are just funny looking function calls. Some languages make a feature of leveling all function calls into one syntax. As a result, the so-called functional languages tend to wear out your parenthesis keys, while OO languages tend to wear out your dot key. — Larry Wall, Apocalypse 3, Operators
(Perl, of course, tends to wear out your shift key …)
Macros with halos
The C preprocessor is a notoriously primitive and slippery creature. Included and occasionally embraced by C++, it may eventually prove to be the downfall of the newer language. How much trouble has been caused by seemingly innocuous definitions?
#define min(a, b) (((a) < (b)) ? (a) : (b))
This macro hobbles any attempt to use the standard C++ algorithm, std::min(), and since it’s a preprocessor thing the compiler warnings may not alert you immediately to what’s going on. You don’t have to include <algorithm> directly to trigger the problem.
Try compiling this stripped down source file using GCC 3.4.6:
#include <string> #define min(a, b) (((a) < (b)) ? (a) : (b)) #include <sstream>
and you’ll see something like:
In file included from /usr/lib/gcc/i386-redhat-linux/3.4.6/../../../../include/c++/3.4.6/streambuf:781,
from /usr/lib/gcc/i386-redhat-linux/3.4.6/../../../../include/c++/3.4.6/ios:50,
from /usr/lib/gcc/i386-redhat-linux/3.4.6/../../../../include/c++/3.4.6/istream:45,
from /usr/lib/gcc/i386-redhat-linux/3.4.6/../../../../include/c++/3.4.6/sstream:45,
from macros.cpp:3:
/usr/lib/gcc/i386-redhat-linux/3.4.6/../../../../include/c++/3.4.6/bits/streambuf.tcc: In member function `virtual std::streamsize std::basic_streambuf<_CharT, _Traits>::xsgetn(_CharT*, std::streamsize)':
/usr/lib/gcc/i386-redhat-linux/3.4.6/../../../../include/c++/3.4.6/bits/streambuf.tcc:54: error: expected unqualified-id before '(' token
/usr/lib/gcc/i386-redhat-linux/3.4.6/../../../../include/c++/3.4.6/bits/streambuf.tcc: In member function `virtual std::streamsize std::basic_streambuf<_CharT, _Traits>::xsputn(const _CharT*, std::streamsize)':
/usr/lib/gcc/i386-redhat-linux/3.4.6/../../../../include/c++/3.4.6/bits/streambuf.tcc:88: error: expected unqualified-id before '(' token
I spent a while tracking down a similar problem today (needless to say, the offending macro was hiding several #includes away from the file which triggered the error[1]). Once I’d exposed the source, I naturally wanted to grumble to someone, so I pasted the code into an instant message to a colleague. Look, my messaging client sanctified the macro by adding halos to the a’s!
[1] One tip for tracking down such problems is to run the preprocessing phase of compilation on its own: with GCC, for example, supply the
-E flag.
Entertaining Documentation
A Programmer’s first language
A recent enquiry on the ACCU mailing list asked which programming language would be most suitable for a beginner. The general response favoured Python. This should come as no surprise: elsewhere, Python’s benevolent dictator for life explains:
- how his funky title came about, and
- how (somewhat to his surprise) Python has become increasingly popular for teaching and as a first language.
Hang on though! Back on the mailing list Mike Small voiced his dissent and spoke up for Perl.
… I also think the online Perl documentation blows away the online Python docs. One I can read for entertainment. The other has just the bare facts and is dull, although not as bad as the run of the mill doxygen-type tool generated, fill in the required fields docs you get for mainstream stuff like Java or .NET.
Compare…
http://perldoc.perl.org/perlbot.html
… with …
http://docs.python.org/ref/types.html
(if that’s an unfair comparison someone feel free to find me an excerpt from the standard python docs that isn’t a complete snore-fest).
iBlame Exchange
After a virtual wait outside the istore my boss’s iphone 2.0 arrived. Having unboxed and activated it, he handed his obsolete iphone 1.0 on to one of his sons, who, flouting the warning:
You must be at least 17 years old to download this game
downloaded ipint, an advert/game which uses the iphone’s on board accelerometer to slosh beer around.
This new iphone 2.0 is a company phone and other lucky company phone users where I work have switched to iphones. The must-have feature which precipitated this decision was not the sleek design, the clever interface, nor even the amusing accelerometer, but rather the more utilitarian microsoft exchange email integration — a feature my boss tried out on holiday in the Alps. Some bug in the system (or, more probably, in the integration of various systems) caused everyone@where-i-work.com to get spammed with ten copies of the same photograph.
![[mini Matterhorn]](/images/matterhorn-mini.jpg)
![[mini Matterhorn]](/images/matterhorn-mini.jpg)
![[mini Matterhorn]](/images/matterhorn-mini.jpg)
![[mini Matterhorn]](/images/matterhorn-mini.jpg)
![[mini Matterhorn]](/images/matterhorn-mini.jpg)
![[mini Matterhorn]](/images/matterhorn-mini.jpg)
![[mini Matterhorn]](/images/matterhorn-mini.jpg)
![[mini Matterhorn]](/images/matterhorn-mini.jpg)
![[mini Matterhorn]](/images/matterhorn-mini.jpg)
Distorted Software
Helicopter View
Recently I’ve been helping scope a project to develop a new Product. In an Agile world big design up front may be frowned-upon, but commercial reality requires some idea of costs from the outset. You need to plan, to size up the job. You must somehow answer the question: What will the software for this Product look like?
I don’t mean “What does the UI look like?” any more than I mean “Will it have a funky yellow chassis?” or even “What will source tree look like?”.
I’m talking about something less easy to visualise:
- what will the main components be?
- how do they connect?
- how big are they?
So, what exactly does a database look like? How about a web server? Or a message queue? A state machine?
tag.wordaligned.org
I’ve set up the subdomain tag.wordaligned.org for personal material. If you want to see pictures of Tyrannosauruses or Transporter Bridges, or read about my first visit to the fabulous Newport Velodrome, or celebrate Mark Cavendish’s amazing turn of speed in the Tour de France, or even find out who Pete T Day-Journay is, then take a look.
On the subject of the Newport Velodrome, the UK cycling squad will be using it as a base for an Olympic training camp from Thursday July 24th through until August 5th, and apparently you’re allowed in to watch them train.
Rewriting String.Left()
Over at Coding Horror, Jeff Atwood codes up String.Left() in C#.
For example, in C#, there is no String.Left() function. Fair enough; we can roll up our sleeves and write our own function lickety-split:
public static string Left(string s, int len) { if (len == 0 || s.Length == 0) return ""; else if (s.Length <= len) return s; else return s.Substring(0, len); }
He’s using this function to introduce C# extension-methods, but I’m more interested in making some sweeping generalisations about Java code.
In the article’s comments section PRMan points out this function “blows up” if the input string is null, and offers a simple fix:
public static string Left(string s, int len)
{
if (s == null)
return s;
else if (len == 0 || s.Length == 0)
return "";
else if (s.Length <= len)
return s;
else
return s.Substring(0, len);
}
Now, I’ve never written any C#. In fact I don’t even have a C# compiler to hand. But this code looks like everything which ever bugged me about Java. What’s the point of writing a string function which accepts a null input? All we’ve done is force clients of String.Left() to handle null inputs too. Just (implicitly) raise a null pointer exception (which is not the same as blowing up) and be done with it.
It’s unfair to suggest this bogus defensive coding has anything to do with Java — you can do the same thing in any language — but in my personal experience lots of Java code gets written this way, to the extent that I used to regard these nervy null checks as being idiomatic. So I was relieved when a Michael Feathers article put me straight:
Spurious null checks are a symptom of bad code.
Feathers’ examples are coded in Java, so maybe I wasn’t being unfair after all. If you’re really concerned about String.Left() handling whatever clients might throw at it, then it’s more important to deal with negative values of len.
So we’re back to Atwood’s original version which reads the way a programmer might think about this problem:
The leftmost
ncharacters is justs.Substring(0, len), but, careful!, what iflenis too big, and what about ifsis empty, or iflenis zero?
Hence we end up with a three way conditional with three separate and rather different looking return values, in which the most interesting case appears last. I think this code could be more readable. First (unless I’m missing something about C#) we can get rid of the special case of (len == 0 || s.Length == 0) since the subsequent clauses do the right thing. (By the way, I’m switching to Java now since I have no C# compiler to hand.)
public static String left(String s, int len)
{
if (s.length() <= len) {
return s;
} else {
return s.substring(0, len);
}
}
I personally would transform this to:
public static String left(String s, int len){
return s.substring(0, Math.min(len, s.length()));
}
It’s not that I’m particularly against multiple returns, and I don’t claim this tweak is a big improvement: I just prefer a more declarative style, which clearly says we’re returning a substring of s, starting at index 0, of length len or s.length() if smaller.
In fact, I don’t think many Java programmers would agree I’ve improved things. Java seems to favour explicit control flow over even this trivial functional composition. Which is why I don’t miss programming in Java, and why I’m in no hurry to try C#.
I also wonder if Jeff Atwood’s original version of String.Left() is, in theory, more efficient. Will a return value of "" be pooled? Does it make sense to return the original string whenever possible, to avoid creating unnecessary objects? I don’t know enough about C# or Java say if there’s any truth in these suggestions. I do know that such savings are extremely unlikely to make any real difference for real strings in real programs, and that sacrificing readability for theoretical efficiency is a mistake.
On the subject of readability, String.Left() is really just a special case of sequence slicing — a Python feature which I’d like to monkey-patch into other languages.
Me, Myself and OpenID
That’s not strictly Python
At the opening day of PyCon UK 2007 Simon Willison delivered a keynote presentation entitled “Building the Social Web with OpenID”. Willison is an assured and expert presenter, and it took him just an hour to work through 146 multimedia slides pitching OpenID as, in a sentence:
“single sign-on for the web.”
At the time I felt disappointed: I’d expected something more directly related to Python. Looking back from almost a year on, I realise that the presentation I got most from at that conference also had little directly to do with Python (Dr. Terry Jones, “Fluidinfo - towards the next everything”[1]). For me, conferences are a chance to step back and reflect, to learn about new things, which is why, this year, I chose to attend the more eclectic ACCU 2008 conference.
Nonce Sense
I tried to set up a personal OpenID server here at wordaligned. The process went smoothly enough and a couple of tests on my localhost suggested all was working so I deployed the changes. When I then tried logging in to a website using my shiny new identity I found that, not only had I cocked something up, but, as you can see from the query parameter in my browser’s address bar, the site thought I was a nonce.
Hey, I’m guilty of entering an invalid ID, but I’m no pervert!
In the UK, the term nonce (sometimes spelled “nonse”) is a slang word used to refer to a sex offender and/or child sexual abuser, and thus as an insult.
Once I’d got past the initial shock a quick web search exposed my mistake: in cryptography, a nonce is a number used once. Here’s a reference in the OpenID 2.0 specification.
11.3. Checking the Nonce
To prevent replay attacks, the agent checking the signature keeps track of the nonce values included in positive assertions and never accepts the same value more than once for the same OP Endpoint URL.
Digging deeper, I learn that outside the world of crime and cryptography:
It’s mainly a term of trade among lexicographers and linguists and turns up also in phrases like nonce compound, nonce borrowing and nonce formation.
In case you were wondering … this note shares its title, “Nonce Sense”, with a hoax anti-paedophile campaign shown on the controversial comedy show Brass Eye.
Fixing header file dependencies
DEPENDS
Without care C++ header files can deteriorate, so I was interested to find some sensible advice in the Google C++ Style Guide.
Names and Orders of Includes
Use standard order for readability and to avoid hidden dependencies: C library, C++ library, other libraries’
.h, your project’s.h.…
The preferred ordering reduces hidden dependencies. We want every header file to be compilable on its own. The easiest way to achieve this is to make sure that every one of them is the first
.hfile #included in some.cc.
I agree, hidden dependencies are bad, and I’m not about to quibble with the “standard order” defined by the guide, even if I’m used to a slightly different ordering. Certainly headers should be compilable on their own; but I suggest the easiest way to achieve this is, well, to compile them on their own.














![[PyCon UK]](http://www.pyconuk.org/images/pyconuk_logo_72x72.png)





