When you're getting into these really rather modest differences of a few percent, you have to watch your margin of error very carefully. In addition, in use, you have to exclude both placebo and impressionistic effects. Placebo here means whether the rider is a roadie who drinks the derailleur Kool-Aid from a CamelBak or an engineer who prefers the Rohloff for many perfectly sensible reasons; impressionistic effects are things like the noise around gears 7 to 8 making the bike feel less efficient. I find it heartening that both points have already been made; it demonstrates a proper skepticism.
Also, in these small margins, you have to start asking who is taking the test and what his track record is. This is difficult to explain to so many engineers, but statistical data always has an interpretative element, starting with how the test is set up (you can change the outcome by changing the hypothesis), and interpreting it can easily be turned into an art form (ask me, it's what I made my name on when I was in advertising).
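For anyone who wants the margin-of-error point in concrete form, here is a minimal sketch. It assumes the two measurement errors are independent, so they combine in quadrature, and it treats a difference as resolved only if it exceeds k times that combined error. The function name, the example numbers and the choice of k = 2 are my own illustrative assumptions, not figures from any of the tests under discussion.

```python
import math


def difference_is_resolved(eff_a, eff_b, err_a, err_b, k=2.0):
    """Return True if the gap between two efficiency readings is larger
    than k times their combined uncertainty.

    Assumption: the two errors are independent, so they add in quadrature.
    All inputs are fractions (0.96 means 96 percent).
    """
    combined_error = math.hypot(err_a, err_b)
    return abs(eff_a - eff_b) > k * combined_error


if __name__ == "__main__":
    # Hypothetical readings: a two-point gap with one-point error bars
    # versus the same gap with much tighter error bars.
    print(difference_is_resolved(0.96, 0.94, 0.01, 0.01))
    print(difference_is_resolved(0.96, 0.94, 0.002, 0.002))
```

The same two-point gap counts for nothing or for something depending purely on the error bars, which is why the margin of error has to be known before the headline number is worth arguing about.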
Mind you, like George, I bought the Rohloff for another reason than pure efficiency, though I was aware of the various tests and sets of numbers, which the roadies were still trying to make a thing of at the time I bought my Rohloff-equipped bike. At that time I remembered that I ran with De Villiers Lambrecht a few times (his regular partner at that time was Ivan Latsky) and that he shortly became the world's first barefoot 4-minute miler. He was never faster with the more efficient spiked shoes that by then had long been at a peak of development. The guys from the engineering department at our college could easily spend an hour explaining why he should be faster on spikes. There is a canyon a mile deep between "should" and "is".
What I find particularly disturbing about such close statistical comparisons, where the margin is a handful of percentage points or often much less, is that they relate very poorly to actual usage. This is for a whole passel of reasons, but two stand out for me. One is the underlying assumption that (in this case) the derailleur and Rohloff bikes will be used by robots tuned to peak response, or at the very least by highly experienced and fit athletes able at every single gear change to extract the last gramme of efficiency from the machines. The other disturbing underlying assumption is that the kinesthetic satisfactions and other psychological interplays between man and machine have zero value, zero input to the end result; it's BS and every statistician with real life experience knows it.
Let me give you an example from my cycling experience of how wrongheaded the first assumption, of peak operating efficiency, is, even though on the face of it it sounds perfectly reasonable. I have two bikes, a Gazelle Toulouse and a Trek Smover, which are both Dutch vakansiefietse, the sort of stadsfiets (an upright commuter) with all the "luxury" extras that the Dutch save for their holidays. Both have eight-speed Shimano Premium gearboxes. Both are tuned to precisely my preferred seating/operating position: the ergonomics are identical and the bikes weigh the same to within a few ounces; basically, you need to be an expert to tell them apart. But the Gazelle has a manual gearbox and the Trek has an automatic gearbox. Now, an engineer can test the two bikes and discover that the Trek consumes power (from the hub dynamo that drives all the electronics to change the gears and operate the active suspension) to operate itself, while the Gazelle does not. Conclusion: the Gazelle must be faster. Not so in the least. The Trek, over a circuit I rode at least five times a week, was always faster, simply because the human would never change gears as optimally as the computer. I don't want to brag, but I knew this in advance, having long since discovered that across Europe in a day, from London to Nardo in the boot of Italy, a big soft car with a huge engine and an automatic gearbox (Stirling Moss and I both had a 7-litre Ford Galaxy for journeys like that) was oodles faster than a nippy, noisy, uncomfortable little Porsche. By all means refer back to my question above about who is taking the test. I will instantly agree that a better cyclist would be faster on the manual Gazelle and so narrow the difference between it and the automatic Trek, but will he do it every time? I don't think so. Riding at racing peak all the time is something unpleasant obsessives may do, but I don't know any of those, and suspect that it is an impossible attitude to maintain.
The Galaxy and Porsche "convenience differential" making the big, wallowy American car (I tuned the suspension, of course, but you can't turn a pig's ear into a silk purse) faster describes a psychological effect and will do as an example of my second objection, to machine studies that neglect human considerations.
***
Turning now to an engineering objection to basing decisions on such marginal differences between machines of such fundamentally different approaches, I will say outright that the differences measured are irrelevant. Note that I'm not attacking Andreas Oehler's precision, nor his right to do these comparisons, something it is necessary to say since I earlier refused to take a lamp study from his hand because it was tainted by his employment by Schmidt Maschinenbau, the makers of the SON dynamo, who buy the components of their lamps from B&M, both of whose lamps were in the test. My point is analogous to George's above, where he says the Rohloff was new and would run in and (implied by George) then be more efficient.
The problem is in fact far larger than George's objection. It is that the derailleur when new may well test marginally better than the Rohloff, but from new the Rohloff can only improve, albeit slowly, as it runs in. Chalo Colina, the noted Boeing tool machinist -- he's the guy who designed the 48-spoke Rohloff rim that some of you may lust after -- said that a Rohloff runs in at just about the time other hub gearboxes lay themselves down to die. But the derailleur from new is on a permanent downhill slope, and as it wears it loses efficiency, whereas over the same distance the Rohloff picks up efficiency.
A much more relevant test of a touring gearbox would be at the halfway point in the life of the Rohloff transmission, when the derailleur setup would long since have laid itself down to die, and would therefore offer zero efficiency.
But hey, let's skew the test in favour of the derailleur transmission, as Oehler & Co do (involuntarily -- I'm not accusing them of dishonesty, but I once operated the second-largest -- after the US Census -- research budget in the world and know how easy it is to go wrong), and take the test with the derailleur at the halfway point of its life, when the Rohloff is still running in, and I'm betting that the Rohloff will be more efficient than the worn derailleur.
I haven't finished yet. The same applies to every ride. The derailleur at the midpoint of the ride will have picked up some dirt and be less efficient than at the beginning of the ride. The derailleur will offer even less efficiency at the end of the ride. The Rohloff, fully enclosed, will offer the same operating efficiency at the beginning, the middle and the end of the ride.
In short: In a fair test, extended over the life of the two transmission systems, with real users under real-life conditions, who thinks the Rohloff won't come out ahead on both efficiency and satisfaction?
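To make the shape of that argument concrete, here is a rough sketch of comparing the two transmissions by their average efficiency over a distance ridden rather than by a single reading when new. Every number in it is a placeholder I made up for illustration (the starting efficiencies, the wear rate, the run-in distance, the derailleur lifespan); none of it comes from Oehler's or anyone else's measurements. The curves simply encode the claims above: the derailleur starts ahead and declines until it is worn out, while the hub starts behind and improves as it runs in.

```python
def derailleur_efficiency(km, new=0.97, loss_per_1000km=0.004, life_km=10_000):
    """Hypothetical derailleur curve: best when new, linear decline with
    wear, and zero once the drivetrain is worn out and never replaced."""
    if km >= life_km:
        return 0.0
    return new - loss_per_1000km * km / 1000


def hub_efficiency(km, new=0.95, run_in_gain=0.02, run_in_km=5_000):
    """Hypothetical hub gear curve: lower when new, linear improvement
    during run-in, then flat."""
    return new + run_in_gain * min(km, run_in_km) / run_in_km


def mean_efficiency(curve, total_km, step_km=100):
    """Average a curve over the distance ridden, sampled every step_km."""
    samples = [curve(km) for km in range(0, total_km, step_km)]
    return sum(samples) / len(samples)


if __name__ == "__main__":
    for distance in (1_000, 10_000, 20_000):
        d = mean_efficiency(derailleur_efficiency, distance)
        h = mean_efficiency(hub_efficiency, distance)
        print(f"{distance:>6} km  derailleur {d:.3f}  hub {h:.3f}")
```

With these made-up curves the derailleur leads over a short test and falls behind over a long one. The real answer depends entirely on the true curves, which is exactly what a test on new equipment cannot tell you.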
***
For these rather good kinesthetic, psychological, statistical and engineering reasons, I think tests like these are interesting, but anyone inclined to take their results at face value as important inputs to purchasing decisions should wait a bit until he can get more experienced advice.