or, from the department of self-inflicted injuries
Recently I saw a bunch of errors in my server logs. People were asking for pages on my web site, but only if they were newer than Thu, 08 Ma 2012 22:44:59 GMT. "Ma"? What kind of a month is that? The internets have so many crazy things.
On further investigation, it seemed this was just a case of garbage in, garbage out; my intertube was busted. I was the one returning a Last-Modified with that date. It was invalid, but client software sent it back with the conditional request.
Thinking more on this, though, and on the well-known last-modified hack in which that field can be used as an unblockable cookie, I think the clients have to share some of the blame as well.
So, clients using at least Apple-PubSub/65.28, BottomFeeder/4.1, NetNewsWire, SimplePie, Vienna, and Windows-RSS-Platform/2.0 should ask the people that implement their RSS software to only pass a Last-Modified date if it's really a valid date. Implementors of the NetVibes and worio.com bots should also take a look at their HTTP stacks. I don't guess that there's much that you can do with an etag though, for better or for worse.
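The client-side fix is simple: validate the server's Last-Modified value before echoing it back. Here's a minimal sketch in Python (not any of the clients above, just an illustration), using only the standard library:

```python
# Only echo a server's Last-Modified back in If-Modified-Since if it
# actually parses as an RFC 1123 date; otherwise send no conditional
# header at all.
from email.utils import parsedate

def conditional_headers(last_modified):
    """Return headers for a conditional GET, dropping invalid dates."""
    if last_modified is not None and parsedate(last_modified) is not None:
        return {"If-Modified-Since": last_modified}
    return {}
```

With this, a valid date like `Thu, 08 Mar 2012 22:44:59 GMT` passes through, while the bogus `Ma` month yields no conditional header: garbage in, nothing out.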
Friends, I'm speaking at JSConf.us this year! Yee haw!
I would have mentioned this later, but events push me to say something now.
You see, I wrote the web server that runs this thing, together with the blog software. I've been hacking on it recently, too; more on that soon. But it seems that this is the first time I've noticed a link from a site that starts with a number. The URI parsers for the referer link were bombing out, because I left off a trailing $ on a regular expression.
So, for love and $, JSConf ho! We ride at dawn!
Just a brief meta-note: I've added per-tag feeds to tekuti, based on a patch kindly provided by Brian Gough, and unkindly neglected by me these last four months.
Anyway, you can now access http://wingolog.org/feed/atom?with=guile&without=web&with=scheme and other such URLs. Should be good for planets.
Let me know if there are any problems. Thanks!
Of course, it was not without a couple of problems, and indeed the important one points to something more fundamentally wrong with existing internet technologies.
in which mod_rewrite fails the author
Whoa, Mr. Wingo, calm down there! Blaming the internet for a bug in your software! Hubris is a virtue and all that, but perhaps this is taking it too far? I'm sure my readers will let me know in the comments, but first, some background.
The symptom of the problem is like this. To refer to a post in a URL, tekuti has an identifier, the post key. For example, the path to edit a post is /admin/posts/key.
Tekuti doesn't have post numbers though, so the easiest way to generate the post key is to serialize its location in the data store, like 2010/12/22/doing-it-wrong. As you can see the post key can include any character, and indeed typically does include slashes, so the actual text representation of the URL to edit a post has to be percent-encoded, like /admin/posts/2010%2f12%2f22%2fdoing-it-wrong.
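The encoding step looks like this in Python (a sketch for illustration, not tekuti's actual Scheme code); note that the key's slashes must be encoded so they aren't mistaken for path delimiters:

```python
# Percent-encode a slash-containing post key so that it occupies a
# single path component in the URL. safe="" forces '/' to be encoded
# as %2F rather than passed through (hex case is insignificant).
from urllib.parse import quote

def edit_url(post_key):
    return "/admin/posts/" + quote(post_key, safe="")
```

So `edit_url("2010/12/22/doing-it-wrong")` yields `/admin/posts/2010%2F12%2F22%2Fdoing-it-wrong`.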
This scheme works fine, and indeed when accessing the Tekuti web server directly, everything works. But if you put it behind Apache using a mod_rewrite proxying rule, as Aleix did:
RewriteRule ^/blog(/?.*)$ http://localhost:8080/$1 [P]
Then trying to edit posts doesn't work! What's the deal?
parsing it wrong
The deal is, mod_rewrite does a textual match on decoded paths. So the string that mod_rewrite sees isn't the string given to it -- in this case, it sees /admin/posts/2010/12/22/doing-it-wrong. Then it applies your substitutions to that decoded string and constructs the result from it.
But you cannot produce the desired result from the intermediate, decoded string!
There are various options to re-encode parts of the output string, or to not encode parts of it -- but you can't determine which slash in the intermediate string is to be re-encoded.
The problem is that mod_rewrite treats slashes specially, not re-encoding them. This problem is fundamental to any technology that processes URL paths using textual comparisons.
To parse a URL path properly, you must first split it according to the delimiters you are interested in (/, in this case). Then you percent-decode the path components into a list, and match on that list.
Query parameters need to be parsed similarly -- first you split on &, then split on the first = in each component, then percent-decode the resulting keys and values into an ordered list of key-value pairs.
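To make the split-then-decode order concrete, here's a sketch in Python (the document's server is in Scheme; this is just an illustration using the standard library):

```python
# Split on delimiters FIRST, then percent-decode each piece. Decoding
# first would turn an encoded %2F into a '/' and lose the distinction
# between data and delimiter.
from urllib.parse import unquote

def parse_path(path):
    """Split on '/', then percent-decode each component."""
    return [unquote(part) for part in path.split("/") if part != ""]

def parse_query(query):
    """Split on '&', then on the first '=', then decode keys and values."""
    pairs = []
    for component in query.split("&"):
        key, _, value = component.partition("=")
        pairs.append((unquote(key), unquote(value)))
    return pairs
```

Parsing `/admin/posts/2010%2f12%2f22%2fdoing-it-wrong` this way gives the three components `admin`, `posts`, and `2010/12/22/doing-it-wrong`, with the post key intact.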
Quoth the RFC:
When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, as otherwise the data may be mistaken for component delimiters.
This problem isn't specific to regular expressions -- it also occurs, for example, when dispatching an HTTP request to a URI handler. The dispatch should be based on path components, not on the path as a string.
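A dispatcher built on decoded components might look like this sketch (the handler names are hypothetical, for illustration only):

```python
# Dispatch on the list of decoded path components, never on the raw
# path string, so an encoded %2F in a component can't be mistaken for
# a path delimiter.
from urllib.parse import unquote

def dispatch(path):
    components = [unquote(p) for p in path.split("/") if p]
    if components[:2] == ["admin", "posts"] and len(components) == 3:
        return ("edit-post", components[2])  # the decoded post key
    return ("not-found", path)
```

Here `/admin/posts/2010%2f12%2f22%2fdoing-it-wrong` dispatches to the edit handler with the full key, whereas a string-based regex match would have seen four spurious components.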
it's a programming language problem
So why does this problem occur, even in technologies as venerable as mod_rewrite? Because it's one that can only be solved by programming languages. You need some sort of sequence data type to parse paths. You need some sort of map to parse query strings, conventionally at least. And then to reconstitute a URI, you need to do so from those data types (lists, maps, &c).
Most contexts in which you do dispatch, like in your .htaccess, aren't equipped with the expressivity to do it right. Regular expressions are only generally applicable on URI subcomponents like path components -- they can only be used in limited situations on full paths.
In Aleix's case, the workaround is to use mod_proxy directly:
ProxyPass /blog http://localhost:8080/
It's unfortunate, as you can't use mod_proxy from an .htaccess file -- only from the main configuration, and changing that requires a restart of the server. Oh well, though.
You could still use mod_rewrite, but with special rules for the specific URI paths that might include escaped slashes, like this:
RewriteRule ^/blog/admin/posts/(.+)$ http://localhost:8080/admin/posts/$1 [P,B]
But such a special solution is just that -- special, i.e., not general.
If you're hacking something for the web that is intended for general use, and if you parse or generate URIs, do your users a favor and give them an appropriate programming language. Your users will thank you, or more likely, continue in blissful ignorance, which is just as well.
Hey tubes! Long time no type, in this direction at least.
It seems that most of my writing energy these past few months has been directed towards Guile. For example, right now I should be writing documentation for new hacks, but instead I am typing at another part of the ether.
It's good and bad, this thing. The good thing is the hack-cadence in Guile is high. The bad thing is that not many learn about it, because, well, code doesn't blog about itself, does it?
Except in this case, perhaps. The tin can jiggling the electrons at the other end of this blogline has been my hack, of late. What you are reading is words about Scheme web servers, served by a Scheme web server.
That's right, I ported Tekuti to Guile 2.0. Delicious dogfood, yum!
In the process, I decided that mod-lisp, which I had been using, was stupid. There is already a simple, standard way of serving HTTP requests over a socket, and it is HTTP. So I wrote pieces of a web server, and put them in Guile. I'll probably write more about that later, so no more words about that for now, except to request that folks with spiders, bots, odd rss grabbers and such send me bug reports if things aren't legit.
ciao slicehost, ciao linode
If you're still on slicehost, I heartily recommend that you switch. (Heartily! Strange word. Like gravy and meatballs or something.) Linode feels faster to me, it's half the price, and otherwise the quality is about the same or perhaps a little better. And from what I hear, the linode offerings continue to improve, while slicehost hasn't changed for the 2+ years that I was with them.
Anyway, rap at yall soon, and keep your parentheses warm in this at-times cold Northern winter. Peace!
Images courtesy of the excellent Hyperbole and a Half.
Well, it's been a little while since I wrote about tekuti, my Scheme-powered, git-backed blog software. Time for an update.
I added a global tag cloud, in addition to the abridged cloud on the main page. Clicking on a tag takes you to a time-ordered list of posts having that tag, with a cloud of related tags. The post list could use some improvement, mostly because my titles are nonsensical.
I also implemented full-text search, which uses git grep under the hood. Amusing.
Also amusing is the "related posts" list, which shows up on individual post pages. It's calculated as the set of posts which share the most tags with the post in question. The indexes that are automatically rebuilt when the master ref changes make this a relatively cheap set to compute.
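The computation itself is simple; here's a sketch in Python (the data layout is hypothetical, not tekuti's actual index format):

```python
# Rank other posts by how many tags they share with the current post.
# Callers should exclude the current post itself from all_posts.
def related_posts(post_tags, all_posts, limit=5):
    """all_posts maps post key -> set of tags."""
    scored = []
    for key, tags in all_posts.items():
        shared = len(post_tags & tags)
        if shared > 0:
            scored.append((shared, key))
    scored.sort(reverse=True)  # most shared tags first
    return [key for _, key in scored[:limit]]
```

With a precomputed tag index, each lookup only touches posts that share at least one tag, which keeps the set cheap to compute.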
Also also: some artificial intelligence anti spam foo. Ha!
This stuff is actually fun to hack on, and is self-contained -- I'm almost spending more time writing about the features than I did implementing them.
I fleshed out tekuti's web page today, giving reasonably detailed install, deployment, and hacking instructions. It even includes a description on how to migrate from wordpress. I'd appreciate any comments that folks might have, probably better on this post than via email.
Stop the madness: uninstall PHP from your servers. We can do better than that!
I had a thought yesterday while biking into town. Git is a transcriptional database -- it writes and writes and writes, and what we are left with is its transcript. I wouldn't call it a transactional database, since it has no rollback operator. It doesn't need one. Ref updates either succeed or fail. If they fail, well, write, write again:
The Moving Ref writes; and, having commit,
Moves on: cosmic Rays nor Zero-blit
Shall untrue its blobs, its trees Unbind,
Nor all your Pushes flip a Single bit.
It is perhaps not as beautiful as Fitzgerald's translation, but the bar was set quite high. To compensate, here is what is, to my knowledge, the first translation of Khayyam into Scheme:
(define (git-update-ref refname proc count)
  (let* ((ref (git-rev-parse refname))
         (commit (proc ref)))
    (cond
     ((zero? count) #f) ; failure
     ((false-if-git-error (git "update-ref" refname commit ref))
      commit)
     (else (git-update-ref refname proc (1- count))))))
The rest of the text may be found in tekuti.
In a couple hours I'll go up to the polideportivo and help lay out 500 square meters of mat. It's seminar weekend, kiddos! The incoming instructors are top-notch, Yamada sensei out of New York and Shibata sensei out of Hombu dojo in Tokyo. As a bonus we'll have Peter Bernath, Harvey Konigsberg, and about 50 or so other people coming from abroad. Good times!
And, and, para colmo, tomorrow I test for black belt. Yay!
A lot of people ask what happens once you get your black belt. Its traditional meaning is that you are a serious student, and have an understanding of the basics of a martial art. It does not connote finality in any way; it's more like a milestone, or something like that.
One can see this in the first-degree tests, like mine tomorrow. They're usually fun to watch, but unnecessarily forced, lacking in grace. The difference between first- and second-degree tests is phenomenal, though -- it seems that in the few years after shodan, practitioners gain a sense of confidence and fluidity that they lacked before. That I lack now, I mean. So it's an important rite, for me, but one that points towards the future rather than the past.
The album "Less Talk, More Rock" by Propagandhi is a near-masterpiece. While I do like their other albums, "Less Talk, More Rock" has an infectious youthful brilliance that makes me twitch every time I hear it. I must have listened to Resisting Tyrannical Government 50 times and it is still a transformative experience. Rock on!
Since last week's missive, I've been able to relax a bit, hack-wise, fixing errors as I see them. Most errors have been related to the fact that displaying a blog entry first parses it as valid XML, throwing an exception if the input is invalid. Luckily wordpress is pretty good at ensuring that its text is valid XML, but it's not complete -- it allows bare ampersands, both in the text and in URLs, and sometimes lets angle-brackets pass through unfiltered. So I've had to fix up a few old posts.
Among the more curious things I have had to write for this blagware is a UTF-8 encoder, in order to parse character references like &#8217; and such, given that Guile only does byte strings, currently.
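For the curious, the encoder boils down to a few bit shifts. Here's the same idea sketched in Python rather than Scheme, for illustration:

```python
# Hand-rolled UTF-8 encoder: turn a numeric character reference's
# code point into its UTF-8 byte sequence.
def utf8_encode(codepoint):
    if codepoint < 0x80:        # 1 byte: plain ASCII
        return bytes([codepoint])
    elif codepoint < 0x800:     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    elif codepoint < 0x10000:   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    else:                       # 4 bytes, for astral code points
        return bytes([0xF0 | (codepoint >> 18),
                      0x80 | ((codepoint >> 12) & 0x3F),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
```

Code point 8217 (the right single quotation mark from that character reference) encodes to the three bytes e2 80 99.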
Shockingly, to me, I do get spam, on the order of about one or two comments per day. No one else uses this software. It seems that there are a couple bots out there that actually parse forms, looking for textareas, then manage to divine which fields require what syntax. Currently my field names are the same as wordpress', so I will vary them until my obscurity provides the necessary "security".
But in the meantime, since my persistent store is Git and not a database, I can easily revert any change, be it changes to posts or to comments or whatever. I fleshed out the admin interface sufficiently so that you can actually create and edit posts there, and gave it an interface for seeing recent changes and possibly reverting them. Of course, reversion is also a change which can be reverted, ad infinitum, so there is no need for scary warnings in the UI when deleting comments, because no change is irreversible. Neat.