wingolog

service update

2023-12-14T14:28:45Z

Late last year I switched blog entries and comments to be written in a dialect of markdown, but there was a bug that I never noticed: if a text consisted only of a single paragraph or block, it would trigger an error that got reported back to the user in a very strange way, and which would prevent the comment from being posted.

I had never seen the error myself because blog posts are generally more than a paragraph, but it must have been quite irritating when commenting. Sorry about that; it should be fixed now. Should you experience more strange errors, please do send me an email with the comment to wingo@igalia.com. Cheers.

colophonwards

2023-12-05T11:36:57Z

A brief meta-note this morning: for the first time in 20 years, I finally got around to updating the web design of wingolog.org recently and wanted to share a bit about that.

Back when I made the initial wingolog design, I was using the then-brand-new Wordpress, Internet Explorer 6 was the most common web browser, CSS wasn’t very good, the Safari browser had just made its first release, smartphones were yet to be invented, and everyone used low-resolution CRT screens. The original design did use CSS instead of tables, thankfully, but it was very narrow and left a lot up to the user agent (notably font choice and size).

These days you can do much better. Even HTML has moved on, with new semantic elements. CSS is powerful and interoperable, with grid layout and media queries and :has() and :is() and all kinds of fun selectors. And, we have web fonts.

I probably would have stuck with the old design if it were readable, but with pixel counts growing, the saturated red bands on the sides flooded the screen, leaving the reader feeling like they were driving into headlights in the rain.

Anyway, the new design is a bit more peaceful, I hope. Feedback welcome.

I’m using grid layout, but not in the way that I thought I would. From reading the documentation, I had the impression that the element with display: grid would be a kind of flexible corkboard which could be filled up by any child element. However, that’s not quite true: it only works for direct children, which means your HTML does have to match the needs of the grid. Grandchildren can take their rows and columns from grandparents via subgrid, but only really display inside themselves: you can’t pop a grandkid out to a grandparent grid area. (Or maybe you can! It’s a powerful facility and I don’t claim to fully understand it.)

Also, as far as I can tell there is no provision to fill up one grid area with multiple children. Whereas I thought that on the root page, each blog entry would end up in its own grid area, that’s just not the case: you put the containing element (another new element!) in a grid area and let it lay itself out. Fine enough.

I would love to have proper side-notes, and I thought the grid would do something for me there, but it seems that I have to wait for CSS anchor positioning. Until then you can use position: absolute tricks, but side-notes may overlap unless the source article is careful.

For fonts, I dearly wanted proper fonts, but I was always scared of the flash of invisible text. It turns out that with font-display: swap you can guarantee that the user can read text if for some reason your fonts fail to load, at the cost of a later layout shift when the fonts come in. At first I tried Bitstream Charter for the body typeface, but I was unable to nicely mix it with Fira Mono without line-heights getting all wonky: a monospace span on a line would make that line too high. I tried all kinds of adjustments in the @font-face but finally decided to follow my heart and buy a font. Or two. And then sheepishly admit it to my spouse the next morning. You are reading this in Valkyrie, and the headings are Hermes Maia. I’m pretty happy with the result and I hope you are too. They are loaded from my server, to which the browser already has a TCP and TLS connection, so it would seem that the performance impact is minimal.

Part of getting performance was to inline my CSS file into the web pages produced by the blog software, allowing the browser to go ahead and lay things out as they should be without waiting on a chained secondary request to get the layout.

Finally, I did finally break down and teach my blog software’s marxdown parser about “smart quotes” and em dashes and en dashes. I couldn’t make this post in good faith without it; “the guy yammers on about web design and not only is he not a designer, he uses ugly quotes”, &c, &c...

Finally finally, some recommendations: I really enjoyed reading Erik Spiekermann’s Stop Stealing Sheep, 4th ed. on typography and type, which led to a raft of book purchases. Eric Meyer and Estelle Weyl’s CSS: The Definitive Guide was very useful for figuring out what is possible with CSS and how to make it happen. It’s a guide, though, and is not very opinionated; you might find Matthew Butterick’s Practical Typography to be useful if you are looking for pretty-good opinions about what to make.

Onwards and upwards!



it's probably spam
2017-03-06T14:16:10Z
Greetings, peoples.  As you probably know, these words are served to you by Tekuti, a blog engine written in Scheme that uses Git as its database.
Part of the reason I wrote this blog software was that from the time when I was using Wordpress, I actually appreciated the comments that I would get.  Sometimes nice folks visit this blog and comment with information that I find really interesting, and I thought it would be a shame if I had to disable those entirely.
But allowing users to add things to your site is tricky.  There are all kinds of potential security vulnerabilities.  I thought about the ones that were important to me, back in 2008 when I wrote Tekuti, and I thought I did a pretty OK job on preventing XSS and designing-out code execution possibilities.  When it came to bogus comments though, things worked well enough for the time.  Tekuti uses Git as a log-structured database, and so to delete a comment, you just revert the change that added the comment.  I added a little security question ("what's your favorite number?"; any number worked) to prevent wordpress spammers from hitting me, and I was good to go.
Sadly, what was good enough in 2008 isn't good enough in 2017.  In 2017 alone, some 2000 bogus comments made it through.  So I took comments offline and painstakingly went through and separated the wheat from the chaff while pondering what to do next.
an aside
I really wondered why spammers bothered though.  I mean, I added the rel="external nofollow" attribute on links, which should prevent search engines from granting relevancy to the spammer's links, so what gives?  Could be that all the advice from the mid-2000s regarding nofollow is bogus.  But it was definitely the case that while I was adding the attribute to commenters' home page links, I wasn't adding it to links in the comment body.  Doh!  With this fixed, perhaps I will just have to deal with the spammers I have and not even more spammers in the future.
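One way to avoid that kind of omission is to route every untrusted link through a single rendering helper. The following is only a sketch in Python under that assumption; the name render_link is hypothetical and is not how Tekuti does it.

```python
from html import escape

def render_link(href: str, text: str) -> str:
    """Render an untrusted link, always marked rel="external nofollow"."""
    return '<a href="{}" rel="external nofollow">{}</a>'.format(
        escape(href, quote=True), escape(text)
    )

# The same helper would serve both the commenter's home page link
# and any link found inside the comment body.
print(render_link("http://example.com/?a=1&b=2", "my <site>"))
```

Because there is only one code path, a link cannot be emitted without the attribute by accident.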
i digress
I started by simply changing my security question to require a number in a certain range.  No dice; bogus comments still got through.  I changed the range; could it be the numbers they were using were already in range?  Again the bogosity continued undaunted.
So I decided to break down and write a bogus comment filter.  Luckily, Git gives me a handy corpus of legit and bogus comments: all the comments that remain live are legit, and all that were ever added but are no longer live are bogus.  I wrote a simple tokenizer across the comments, extracted feature counts, and fed that into a naive Bayesian classifier.  I finally turned it on this morning; fingers crossed!
My trials at home show that if I train the classifier on half the data set (around 5300 bogus comments and 1900 legit comments) and then run it against the other half, I get about 6% false negatives and 1% false positives.  The feature extractor interns sequences of 1, 2, and 3 tokens, and doesn't have a lower limit for number of features extracted -- a feature seen only once in bogus comments and never in legit comments is a fairly strong bogosity signal; as you have to make up the denominator in that case, I set it to indicate that such a feature is 99.9% bogus.  A corresponding single feature in the legit set without appearance in the bogus set is 99% legit.
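To make the feature extraction concrete, here is a minimal Python sketch. It is not Tekuti's code: the tokenizer regex, the frequency-ratio estimate for features seen in both sets, and the 0.5 score for unseen features are all assumptions. Only the 1-to-3-token sequences and the 99.9% / 99% figures come from the description above.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into word-like tokens (assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def features(text):
    """Return the set of 1-, 2- and 3-token sequences found in text."""
    tokens = tokenize(text)
    return {
        tuple(tokens[i:i + n])
        for n in (1, 2, 3)
        for i in range(len(tokens) - n + 1)
    }

def train(bogus_comments, legit_comments):
    """Count, per feature, how many comments of each class contain it."""
    bogus = Counter(f for c in bogus_comments for f in features(c))
    legit = Counter(f for c in legit_comments for f in features(c))
    return bogus, legit

def bogosity(feature, bogus, legit, n_bogus, n_legit):
    """Estimate P(bogus | feature), clamping features seen in one set only."""
    b = bogus[feature] / n_bogus
    l = legit[feature] / n_legit
    if b == 0 and l == 0:
        return 0.5    # never seen: no evidence either way (assumption)
    if l == 0:
        return 0.999  # only ever seen in bogus comments
    if b == 0:
        return 0.01   # only ever seen in legit comments: 99% legit
    return b / (b + l)

# Tiny made-up corpus, just to exercise the functions.
bogus_comments = ["buy cheap pills now", "cheap pills online"]
legit_comments = ["nice post about guile", "guile is nice"]
bogus, legit = train(bogus_comments, legit_comments)
print(bogosity(("cheap", "pills"), bogus, legit, 2, 2))
print(bogosity(("guile",), bogus, legit, 2, 2))
```

The clamps stand in for the made-up denominator: without them a one-sided feature would score exactly 0 or 1 and dominate any later combination.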
Of course with this strong of a bias towards precise features of the training set, if you run the classifier against its own training set, it produces no false positives and only 0.3% false negatives, some of which were simply reverted duplicate comments.
It wasn't straightforward to get these results out of a Bayesian classifier.  The "smoothing" factor that you add to both numerator and denominator was tricky, as I mentioned above.  Getting a useful tokenization was tricky.  And the final trick was even trickier:  limiting the significant-feature count when determining bogosity.  I hate to cite Paul Graham but I have to do so here -- choosing the N most significant features in the document made the classification much less sensitive to the varying lengths of legit and bogus comments, and less sensitive to inclusions of verbatim texts from other comments.
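As a sketch of that last trick, assuming per-feature probabilities strictly between 0 and 1 have already been computed: keep only the N values farthest from 0.5 and combine those. The function name, the default of n=15 (the figure from Graham's essay, not necessarily the one used here) and the log-space arithmetic are assumptions for illustration.

```python
import math

def combined_bogosity(probabilities, n=15):
    """Combine the n per-feature probabilities that are farthest from 0.5."""
    significant = sorted(
        probabilities, key=lambda p: abs(p - 0.5), reverse=True
    )[:n]
    if not significant:
        return 0.5  # nothing to go on
    # Work in log space so that many small factors do not underflow.
    log_bogus = sum(math.log(p) for p in significant)
    log_legit = sum(math.log(1.0 - p) for p in significant)
    # Equivalent to prod(p) / (prod(p) + prod(1 - p)).
    return 1.0 / (1.0 + math.exp(log_legit - log_bogus))

# A short comment with two strong bogus signals and one legit one.
print(combined_bogosity([0.999, 0.999, 0.01, 0.5, 0.5], n=3))
# A long legit comment that happens to quote one bogus phrase.
print(combined_bogosity([0.01] * 20 + [0.999]))
```

Capping the count means a long comment does not win or lose merely by containing more features than a short one.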
We'll see I guess.  If your comment gets caught by my filters, let me know -- over email or Twitter I guess, since you might not be able to comment!  I hope to be able to keep comments open; I've learned a lot from yall over the years.


the merry month of ma
2012-03-12T20:27:53Z
or, from the department of self-inflicted injuries
Recently I saw a bunch of errors in my server logs.  People were asking for pages on my web site, but only if they were newer than Thu, 08 Ma 2012 22:44:59 GMT.  "Ma"?  What kind of a month is that?  The internets have so many crazy things.
On further investigation, it seemed this was just a case of garbage in, garbage out; my intertube was busted.  I was the one returning a Last-Modified with that date.  It was invalid, but client software sent it back with the conditional request.
Thinking more on this, though, and on the well-known last-modified hack in which that field can be used as an unblockable cookie, I think I have to share some blame with the clients again.
So, clients using at least Apple-PubSub/65.28, BottomFeeder/4.1, NetNewsWire, SimplePie, Vienna, and Windows-RSS-Platform/2.0 should ask the people that implement their RSS software to only pass a Last-Modified date if it's really a valid date.  Implementors of the NetVibes and worio.com bots should also take a look at their HTTP stacks.  I don't guess that there's much that you can do with an etag though, for better or for worse.
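For client authors, the check can be quite small. The sketch below is an assumption-laden illustration in Python, not taken from any of the clients named above: it only echoes a stored Last-Modified value if it is syntactically an IMF-fixdate, the preferred HTTP date format, and it does not accept the older obsolete formats.

```python
import re

# IMF-fixdate, e.g. "Thu, 08 Mar 2012 22:44:59 GMT".
_DAYS = "Mon|Tue|Wed|Thu|Fri|Sat|Sun"
_MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
_IMF_FIXDATE = re.compile(
    rf"(?:{_DAYS}), \d{{2}} (?:{_MONTHS}) \d{{4}} \d{{2}}:\d{{2}}:\d{{2}} GMT"
)

def conditional_headers(last_modified):
    """Build conditional-request headers, skipping a malformed date.

    This is a syntax check only; it does not reject impossible dates
    such as day 99.
    """
    if last_modified and _IMF_FIXDATE.fullmatch(last_modified):
        return {"If-Modified-Since": last_modified}
    return {}

print(conditional_headers("Thu, 08 Mar 2012 22:44:59 GMT"))
print(conditional_headers("Thu, 08 Ma 2012 22:44:59 GMT"))
```

With the bad "Ma" date the client simply falls back to an unconditional request.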
Previously, in a related department.


for love and $
2012-02-21T16:48:19Z
Friends, I'm speaking at JSConf.us this year!  Yee haw!
I would have mentioned this later, but events push me to say something now.
You see, I wrote the web server that runs this thing, together with the blog software.  I've been hacking on it recently, too; more on that soon.  But it seems that this is the first time I've noticed a link from a site that starts with a number.  The URI parsers for the referer link were bombing out, because I left off a trailing $ on a regular expression.
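The actual expression isn't shown here, so the following Python sketch is purely illustrative: a hypothetical pattern meant to recognise hosts made only of digits and dots, and a made-up host name that starts with a number. It shows how a missing trailing $ lets a prefix match stand in for the whole string.

```python
import re

# Hypothetical patterns; the real one in the server may look quite different.
UNANCHORED = re.compile(r"^[0-9.]+")
ANCHORED = re.compile(r"^[0-9.]+$")

host = "2012.example.org"  # made-up referrer host that starts with a number

# Without the trailing $, the prefix "2012." is enough for a match.
print(bool(UNANCHORED.match(host)))
# With it, the whole host has to consist of digits and dots.
print(bool(ANCHORED.match(host)))
```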
So, for love and $, JSConf ho!  We ride at dawn!


meta: per-tag feeds
2011-04-25T18:13:27Z
Just a brief meta-note: I've added per-tag feeds to tekuti, based on a patch kindly provided by Brian Gough, and unkindly neglected by me these last four months.  
Anyway, you can now access http://wingolog.org/feed/atom?with=guile&without=web&with=scheme and other such URLs.  Should be good for planets.
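As a sketch of how such a URL might be interpreted: the Python below reads the repeated with and without parameters and filters a made-up list of posts. The matching rule (every with tag required, no without tag allowed) is an assumption, as are the function names and the sample posts.

```python
from urllib.parse import urlsplit, parse_qsl

def tag_filter(url):
    """Split a feed URL's query into wanted and unwanted tag sets."""
    pairs = parse_qsl(urlsplit(url).query)  # ordered list of (key, value)
    wanted = {v for k, v in pairs if k == "with"}
    unwanted = {v for k, v in pairs if k == "without"}
    return wanted, unwanted

def select_posts(posts, wanted, unwanted):
    """Keep posts carrying every wanted tag and no unwanted tag (assumed rule)."""
    return [
        title for title, tags in posts
        if wanted <= tags and not (unwanted & tags)
    ]

url = "http://wingolog.org/feed/atom?with=guile&without=web&with=scheme"
wanted, unwanted = tag_filter(url)

# Hypothetical posts, as (title, set of tags).
posts = [
    ("one", {"guile", "scheme"}),
    ("two", {"guile", "scheme", "web"}),
    ("three", {"guile"}),
]
print(select_posts(posts, wanted, unwanted))
```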
Let me know if there are any problems.  Thanks!


doing it wrong
2010-12-23T00:04:23Z
Intrepid hacker Aleix Conchillo Flaqué writes in to say that, against all odds, he actually managed to install the Tekuti blog software on his server. Rock on!
Of course, it was not without a couple of problems, and indeed the important one points to something more fundamentally wrong with existing internet technologies.
in which mod_rewrite fails the author
Whoa, Mr. Wingo, calm down there! Blaming the internet for a bug in your software! Hubris is a virtue and all that, but perhaps this is taking it too far? I'm sure my readers will let me know in the comments, but first, some background.
The symptom of the problem is like this. To refer to a post in a URL, tekuti has an identifier, the post key.  For example, the path to edit a post is /admin/posts/key.
Tekuti doesn't have post numbers though, so the easiest way to generate the post key is to serialize its location in the data store, like 2010/12/22/doing-it-wrong.  As you can see the post key can include any character, and indeed typically does include slashes, so the actual text representation of the URL to edit a post has to be percent-encoded, like /admin/posts/2010%2f12%2f22%2fdoing-it-wrong.
This scheme works fine, and indeed when accessing the Tekuti web server directly, everything works. But if you put it behind Apache using a mod_rewrite proxying rule, as Aleix did:
RewriteRule ^/blog(/?.*)$ http://localhost:8080/$1 [P]
Then trying to edit posts doesn't work! What's the deal?
parsing it wrong
The deal is, mod_rewrite does a textual match on decoded paths. So the string that mod_rewrite sees isn't the string given to it -- in this case, it sees /admin/posts/2010/12/22/doing-it-wrong. Then it does some arbitrary transformation on that decoded string, and somehow constructs the result.
But you cannot produce the desired result from the intermediate, decoded string!
There are various options to re-encode parts of the output string, or to not encode parts of it -- but you can't determine which slash in the intermediate string is to be re-encoded.
The problem is that mod_rewrite treats slashes specially, not re-encoding them. This problem is fundamental to any technology that processes URL paths using textual comparisons.
To parse a URL path properly, you must first split it according to the delimiters you are interested in (/, in this case). Then you percent-decode the path components into a list, and match on that list.
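Here is a minimal Python sketch of the difference, using the edit URL from above. The helper name split_path is made up, and dropping empty components (from leading or doubled slashes) is a simplification.

```python
from urllib.parse import unquote

def split_path(path):
    """Split an encoded URL path on "/", then percent-decode each component."""
    return [unquote(part) for part in path.split("/") if part]

encoded = "/admin/posts/2010%2f12%2f22%2fdoing-it-wrong"

# Decoding first, as a textual matcher effectively does, loses the distinction
# between delimiter slashes and slashes that belong to the post key.
print([part for part in unquote(encoded).split("/") if part])

# Splitting first keeps the post key intact as a single component.
print(split_path(encoded))
```

The first list has six components and no way to tell which of them formed the key; the second has three.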
Query parameters need to be parsed similarly -- first you split on &, then split on the first = in each component, then percent-decode the resulting keys and values into an ordered list of key-value pairs.
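The same order of operations for query strings, as a Python sketch. The function name is hypothetical, and the form-encoding convention of "+" meaning a space is deliberately left out; the standard library's urllib.parse.parse_qsl does a similar job with that convention included.

```python
from urllib.parse import unquote

def parse_query(query):
    """Parse a query string into an ordered list of decoded (key, value) pairs."""
    pairs = []
    for component in query.split("&"):           # 1. split on &
        if not component:
            continue
        key, _, value = component.partition("=")  # 2. split on the first =
        pairs.append((unquote(key), unquote(value)))  # 3. only then decode
    return pairs

# The encoded = and & inside the first value survive as data.
print(parse_query("title=a%3Db%26c&with=guile&with=scheme"))
```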
Quoth the RFC:
When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, as otherwise the data may be mistaken for component delimiters.
RFC 3986, section 2.4: When to encode or decode
This problem isn't specific to regular expressions -- it also occurs, for example, when dispatching an HTTP request to a URI handler. The dispatch should be based on path components, and not against the path as a string.
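A component-based dispatcher can be sketched in a few lines of Python. Everything here is hypothetical (the route table shape, the handler, the names); it only illustrates matching on a list of decoded components rather than on the path text.

```python
from urllib.parse import unquote

def dispatch(path, routes):
    """Pick a handler by comparing decoded path components, not raw text."""
    components = [unquote(part) for part in path.split("/") if part]
    for prefix, handler in routes:
        if components[:len(prefix)] == prefix:
            return handler(components[len(prefix):])
    return None

def edit_post(rest):
    """Handle /admin/posts/<key>, where the key may itself contain slashes."""
    if len(rest) != 1:
        return None
    return "editing " + rest[0]

# Routes are (list of leading components, handler) pairs.
routes = [(["admin", "posts"], edit_post)]
print(dispatch("/admin/posts/2010%2f12%2f22%2fdoing-it-wrong", routes))
```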
it's a programming language problem
So why does this problem occur, even in technologies as venerable as mod_rewrite?  Because it's one that can only be solved by programming languages. You need some sort of sequence data type to parse paths. You need some sort of map to parse query strings, conventionally at least. And then to reconstitute a URI, you need to do so from those data types (lists, maps, &c).
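Going the other way, a URI can be rebuilt from a list of path components and a list of key-value pairs by encoding each piece on its own and only then joining with delimiters. This Python sketch uses a made-up build_uri helper; note that urllib's quote emits upper-case hex digits (%2F rather than %2f), which are equivalent.

```python
from urllib.parse import quote

def build_uri(base, components, params=()):
    """Rebuild a URI from a list of path components and (key, value) pairs."""
    # safe="" makes quote() encode "/" too, so a slash inside a component
    # can never be mistaken for a delimiter.
    path = "/".join(quote(c, safe="") for c in components)
    query = "&".join(
        quote(k, safe="") + "=" + quote(v, safe="") for k, v in params
    )
    uri = base.rstrip("/") + "/" + path
    return uri + "?" + query if query else uri

print(build_uri("http://localhost:8080",
                ["admin", "posts", "2010/12/22/doing-it-wrong"]))
print(build_uri("http://localhost:8080", ["feed", "atom"],
                [("with", "guile"), ("without", "web")]))
```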
Most contexts in which you do dispatch, like in your .htaccess, aren't equipped with the expressivity to do it right. Regular expressions are only generally applicable on URI subcomponents like path components -- they can only be used in limited situations on full paths.
workarounds
In Aleix's case, the workaround is to use mod_proxy directly:
ProxyPass /blog http://localhost:8080/
It's unfortunate, as you can't use mod_proxy from an .htaccess file -- only from the main configuration, and requiring a restart of the server.  Oh well, though.
You could still use mod_rewrite, but with special rules for the specific URI paths that might include escaped slashes, like this:
RewriteRule ^/blog/admin/posts/(.+)$ http://localhost:8080/admin/posts/$1 [P,B]
But such a special solution is just that -- special, i.e., not general.
en brevis
If you're hacking something for the web that is intended for general use, and if you parse or generate URIs, do your users a favor and give them an appropriate programming language. Your users will thank you, or more likely, continue in blissful ignorance, which is just as well.


meta data
2010-12-13T21:08:36Z
Hey tubes! Long time no type, in this direction at least.
It seems that most of my writing energy these past few months has been directed towards Guile. For example, right now I should be writing documentation for new hacks, but instead I am typing at another part of the ether.
It's good and bad, this thing. The good thing is the hack-cadence in Guile is high. The bad thing is that not many learn about it, because, well, code doesn't blog about itself, does it?
Except in this case, perhaps. The tin can jiggling the electrons at the other end of this blogline has been my hack, of late. What you are reading is words about Scheme web servers, served by a Scheme web server.
That's right, I ported Tekuti to Guile 2.0. Delicious dogfood, yum!
In the process, I decided that mod-lisp, which I had been using, was stupid. There is already a simple, standard way of serving HTTP requests over a socket, and it is HTTP. So I wrote pieces of a web server, and put them in Guile. I'll probably write more about that later, so no more words about that for now, except to request that folks with spiders, bots, odd rss grabbers and such send me bug reports if things aren't legit.
ciao slicehost, ciao linode
About the same time, my bank decided to change my credit card, so all my old subscriptions stopped working. It was just the thing I needed to make me jump ship, finally, from slicehost to linode.
If you're still on slicehost, I heartily recommend that you switch. (Heartily! Strange word. Like gravy and meatballs or something.) Linode feels faster to me, it's half the price, and otherwise the quality is about the same or perhaps a little better. And from what I hear, the linode offerings continue to improve, while slicehost hasn't changed for the 2+ years that I was with them.
Anyway, rap at yall soon, and keep your parentheses warm in this at-times cold Northern winter. Peace!
Images courtesy of the excellent Hyperbole and a Half.


metablog
2008-04-23T12:54:13Z
Well, it's been a little while since I wrote about tekuti, my Scheme-powered, git-backed blog software. Time for an update.
features
I added a global tag cloud, in addition to the abridged cloud on the main page. Clicking on a tag takes you to a time-ordered list of posts having that tag, with a cloud of related tags. The post list could use some improvement, mostly because my titles are nonsensical.
I also implemented full-text search, which uses git grep under the hood. Amusing.
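For the curious, a search like that could be driven roughly as follows. This is a Python sketch rather than Tekuti's Scheme, and the helper names, the choice of flags and the sample path are assumptions; the flags themselves (-l, -i, -F, -e) and the "ref:path" output form are standard git grep behaviour when searching a tree.

```python
def git_grep_command(term, ref="master"):
    """Build an argv for a case-insensitive, fixed-string search of one ref."""
    # -l: list matching file names only; -i: ignore case;
    # -F: treat the term as a literal string; -e: the pattern follows.
    return ["git", "grep", "-l", "-i", "-F", "-e", term, ref]

def parse_matches(output, ref="master"):
    """Strip the "<ref>:" prefix that git grep adds when searching a tree."""
    prefix = ref + ":"
    return [
        line[len(prefix):] if line.startswith(prefix) else line
        for line in output.splitlines()
        if line
    ]

print(git_grep_command("guile"))
# Sample output with a made-up path, as git grep would print it for a ref.
print(parse_matches("master:2008/04/23/metablog/content\n"))
```

The argv would be handed to something like subprocess.run(..., capture_output=True, text=True) in the repository directory, and its stdout passed to parse_matches.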
Also amusing is the "related posts" list, which shows up on individual post pages. It's calculated as the set of posts which share the most tags with the post in question. The indexes that are automatically rebuilt when the master ref changes make this a relatively cheap set to compute.
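The computation is simple enough to sketch in Python, assuming an index that maps each post key to its set of tags. The names, the limit of five and the tie-breaking rule are made up for illustration.

```python
def related_posts(key, tags_by_post, limit=5):
    """Rank other posts by how many tags they share with the post at key."""
    mine = tags_by_post[key]
    scored = [
        (len(mine & tags), other)
        for other, tags in tags_by_post.items()
        if other != key and mine & tags
    ]
    # Most shared tags first; ties broken by post key for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:limit]]

# Hypothetical index of post key -> tags.
tags_by_post = {
    "a": {"guile", "scheme", "web"},
    "b": {"guile", "scheme"},
    "c": {"web"},
    "d": {"gnome"},
}
print(related_posts("a", tags_by_post))
```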
Also also: some artificial intelligence anti spam foo. Ha!
This stuff is actually fun to hack on, and is self-contained -- I'm almost spending more time writing about the features than I did implementing them.
documentation
I fleshed out tekuti's web page today, giving reasonably detailed install, deployment, and hacking instructions. It even includes a description of how to migrate from wordpress. I'd appreciate any comments that folks might have, probably better on this post than via email.
Stop the madness: uninstall PHP from your servers. We can do better than that!


git: a transcriptional database
2008-04-12T19:11:30Z
I had a thought yesterday while biking into town. Git is a transcriptional database -- it writes and writes and writes, and what we are left with is its transcript. I wouldn't call it a transactional database, since it has no rollback operator. It doesn't need one. Ref updates either succeed or fail. If they fail, well, write, write again:
The Moving Ref writes; and, having commit,
Moves on: cosmic Rays nor Zero-blit
Shall untrue its blobs, its trees Unbind,
Nor all your Pushes flip a Single bit.
It is perhaps not as beautiful as Fitzgerald's translation, but the bar was set quite high. To compensate, here is what is, to my knowledge, the first translation of Khayyam into Scheme:
(define (git-update-ref refname proc count)
  ;; Compare-and-swap: only move REFNAME if it still points at REF.
  (let* ((ref (git-rev-parse refname))
         (commit (proc ref)))
    (cond
     ((zero? count) #f) ; failure: out of retries
     ((false-if-git-error
         (git "update-ref" refname commit ref))
      commit)
     (else
      ;; The ref moved under us; re-read it and try again.
      (git-update-ref refname proc (1- count))))))
The rest of the text may be found in tekuti.