wingolog: a mostly dorky weblog by Andy Wingo

storage primitives for the distributed web
Andy Wingo, 2011-03-24
https://wingolog.org/2011/03/24/storage-primitives-for-the-distributed-web

An example-driven discussion of what seems to me to be the minimal set of persistent storage primitives needed in an autonomous web.

Lisa would like to keep a private journal, and store the journal entries in Marge's computer. Lisa is not worried about Marge reading her journal entries, but she wants to make sure that Bart, who also uses the computer, does not have access to her journal. Marge needs to give Lisa some storage primitives so that she can write her journal-entry program.

Marge calls up her operating system vendor, and gets the following procedure documentation in response:

make-cell init-string
    Create a new storage location, and store init-string in it. Returns a write-cap, which allows a user to change the string associated with this storage location.
write-cap? obj
    Return #t if obj is a write-cap, and #f for any other kind of object.
cell-set! write-cap string
    Set the string associated with a cell to a new value.
write-cap->read-cap write-cap
    Take a write-cap and return a read-cap, which allows a user to read the value associated with a cell.
read-cap? obj
    Return #t if obj is a read-cap, and #f for any other kind of object.
cell-ref read-cap
    Return the string associated with the cell.
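Stringing these together, a session at Lisa's terminal might look like this. (A sketch against the documented interface; the values shown in comments are what the documentation implies, not transcripts from a real system.)

```scheme
;; A sketch using only the procedures documented above.
(define w (make-cell "dear diary"))    ; w is a write-cap
(write-cap? w)                         ; => #t

(cell-set! w "dear diary: band was great today")

(define r (write-cap->read-cap w))     ; r is a read-cap
(read-cap? r)                          ; => #t
(cell-ref r)                           ; => "dear diary: band was great today"
```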

Marge makes all of these procedures available to Lisa. Lisa then starts to program. She does not want to allow herself to edit her old entries, so she makes a helper:

(define (put-text string)
  (write-cap->read-cap (make-cell string)))

Since put-text throws away the write-cap for this string, nothing else will be able to change an entry, once it is written.

Read-caps and write-caps are capabilities. They are unforgeable. Since Lisa did not give any of these capabilities to Bart, she feels safe typing her innermost thoughts into put-text.

persistent objects

In Scheme, capabilities are objects. A piece of code has capabilities, in the normal English sense of the term, to all of the objects that are in its scope, and to objects that are in the current module.

But since not everything lives in the warm world of Scheme, the storage primitives that the computer vendor provides allow read-caps and write-caps to be serialized to strings:

cap->string cap
    Return a string representation of the capability cap.
string->cap string
    Return a capability corresponding to string. Read capabilities have a different representation from write capabilities, so this procedure may return a read-cap or a write-cap. It returns #f if the string does not look like a capability.

When Lisa started writing, she wrote down the cap-strings for all of her entries in a book. Then when she wants to read them, she types the capability strings into the terminal:

(cell-ref (string->cap "read-cap:a6785kjiyv8c0..."))
=> "I was really happy with my solo today at band, but..."

But this got tiring after a while, so she decided to store the list of capabilities instead:

;; Build a list data type.
;;
(define (make-list-cell)
  (make-cell ""))

(define (list-cell-ref read-cap)
  (let ((str (cell-ref read-cap)))
    (if (equal? str "")
        '()
        (map string->cap (string-split str #\,)))))

(define (list-cell-set! write-cap caps)
  (cell-set! write-cap
             (string-join (map cap->string caps) ",")))

;; Helper.
;;
(define (->readable cap)
  (if (write-cap? cap)
      (write-cap->read-cap cap)
      cap))

;; Make a new cell, and print out its cap-string.
;; Note to self: write down this string!
;;
(display (cap->string (make-list-cell)))

(define (add-entry! entries cap)
  (list-cell-set!
   entries
   (cons cap (list-cell-ref (->readable entries)))))

Now she just has to write down the cap-string of the new list cell that she made, and she has a reference to all of her entries. Whenever she writes a new entry, she uses add-entry! to update the cell's value, adding on the new cap-string.
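The read side is symmetric. Assuming entries-cap is the capability she gets back from string->cap after typing in the cap-string she wrote down, a sketch of reading every entry, newest first:

```scheme
;; Sketch: read back all journal entries, newest first.
;; entries-cap may be a read-cap or a write-cap; ->readable
;; normalizes it either way.
(define (read-entries entries-cap)
  (map (lambda (cap) (cell-ref (->readable cap)))
       (list-cell-ref (->readable entries-cap))))
```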

distribution

Colin's father Bono has a computer just like Marge's, and Lisa would like Colin to be able to read some specific entries. So she asks Marge how to give other machines access to the cells that she is using for data storage.

Marge asks her vendor, and the vendor says that actually, the cells implementation that was provided to her stores its data in the cloud. So Lisa can just give Colin a cap-string -- read-only, presumably -- for the essays that she would like to share, and all is good.

Marge doesn't know what this means, but she tells Lisa, and Lisa freaks out. "You mean that I've been practicing Careless Computing this whole time, and you didn't tell me??? You mean that Chief Wiggum could have called up the cloud storage provider and gotten access to all of my data?"

careful computing

Lisa's concern is a good one. Marge puts her in contact with the vendor directly, who explains that actually, the cells implementation is designed to preserve users' autonomy.

Creating a cell creates an RSA key-pair. The write-cap contains the signing key (SK) and the public key (PK), and the read-cap contains just the PK. Before sending data to the cloud provider, the data is signed and encrypted using the SK, so only people with access to the read-cap can actually decrypt the strings.

The cells are stored in a standard key-value store. The key is computed as the secure hash (H) of the PK, so that even the cloud storage provider cannot decrypt the data. Furthermore, the cell-ref user does not rely on the provider to vouch for the data's integrity, as she can verify it directly against the PK. The only drawback is that Lisa cannot be sure that cell-ref is returning the latest value of a cell, whatever that means.
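The vendor's description can be sketched in pseudo-Scheme. Every lower-level name here (rsa-generate-keypair, sign-and-encrypt, decrypt-and-verify, sha256, kv-put!, kv-get) is invented for illustration; a real design such as Tahoe-LAFS differs in many details:

```scheme
;; Sketch of the careful cells.  All lower-level procedures here
;; (rsa-generate-keypair, sign-and-encrypt, decrypt-and-verify,
;; sha256, kv-put!, kv-get) are hypothetical.
(define (make-cell init-string)
  (call-with-values (lambda () (rsa-generate-keypair))
    (lambda (sk pk)
      (let ((write-cap (cons sk pk)))
        (cell-set! write-cap init-string)
        write-cap))))

(define (write-cap->read-cap write-cap)
  (cdr write-cap))                     ; the read-cap is just PK

(define (cell-set! write-cap string)
  (let ((sk (car write-cap))
        (pk (cdr write-cap)))
    ;; The store only ever sees H(PK) and ciphertext, so it can
    ;; neither read the data nor forge a valid new value.
    (kv-put! (sha256 pk) (sign-and-encrypt sk string))))

(define (cell-ref read-cap)
  (let ((pk read-cap))
    ;; Anyone holding PK can decrypt, and can verify integrity
    ;; against PK without trusting the store.
    (decrypt-and-verify pk (kv-get (sha256 pk)))))
```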

The vendor continues by noting that it doesn't actually matter much, from Lisa's perspective, what the key-value store is. It could be an Amazon S3-backed web service, a Tahoe-LAFS friendnet, a home-grown CouchDB thing, or GNUnet. This is an important property at a time when peer-to-peer key-value stores have not yet stabilized. The vendor also says that they don't have accounting figured out yet, so they don't know how to charge people for storage, but that they trust that the commercial folks will work that out.

context

These primitives are sufficient to build proper data structures on top of k-v stores -- tries, queues, and such -- all with fine-grained access controls, and without having to trust the store itself. Lisa can, if she ever grows up, publish all (or a set) of her diaries to the world, which could then form part of larger data structures, like the "wall" of whatever comes after Facebook.

It seems to me that this set of primitives is a minimal set. You could add in better support for immutable data, but since you can implement it in terms of mutable data, it seemed unnecessary.

This scheme was mostly inspired by Tahoe-LAFS. You can read a short and very interesting technical paper about it here.

future

Next up would be seeing if these primitives interact well with a capabilities-based security kernel for mobile code. Cross your fingers!

bart and lisa, hacker edition
Andy Wingo, 2011-03-19
https://wingolog.org/2011/03/19/bart-and-lisa-hacker-edition

Bart and Lisa are both hacking on a computer run by Marge. Bart has a program that sorts a list of integers. Being a generous person, he shares it with Lisa. Lisa would like to use Bart's program, but she doesn't trust Bart -- she wants to make sure that the program is safe to run before running it. She would like to sort a list of credit card numbers and would be quite vexed indeed if Bart's sort procedure posted them all to a web site.

(This example and much of the discussion is taken from the excellent Rees thesis, which I highly, highly recommend.)

One approach she can take is to examine the program, and only run it if it is obviously harmless. Due to fundamental concerns like undecidability, this will be a conservative evaluation. Then if the program is deemed safe, it can be compiled and invoked directly, without having to put it in a sandbox.

The question I wish to discuss this evening is how to do this safe compilation in Guile, in the case in which a Guile program provides environments for Bart and Lisa to both hack on, at the same time, along with means for Bart and Lisa to communicate.

Let us imagine that Bart gives Lisa a program, as a string. The first thing to do is to read it in as a Scheme data structure. Lisa creates a string port, and calls the system's read procedure on the port. This produces a Scheme value.

Lisa trusts that the read procedure and the open-input-string procedure are safe to call. Indeed she has to trust many things from the system; she doesn't really have much of a choice. She has to trust Marge. In this case, barring bugs, that trust is mostly warranted. The exception would be reader macros, which cause arbitrary code to run at read-time. Lisa must assume that Bart has not invoked read-hash-extend on her machine. Indeed, Marge cannot supply read-hash-extend to Bart or to Lisa, just as she would not give them access to the low-level foreign function interface.

This brief discussion indicates that there exist in Guile cross-cutting, non-modular facilities which cannot be given to users. If you want to create a secure environment in which programs may only use the capabilities they are provided, "user-level" environments must not include some routines, like read-hash-extend.

names come to have meanings

So, having proceeded on, Lisa reads a Scheme form which she can pretty-print to her console, and it looks like this:

(define sort
  (lambda (l <)
    (define insert
      (lambda (x sorted)
        (if (null? sorted)
            (list x)
            (if (< x (car sorted))
                (cons x sorted)
                (cons (car sorted) (insert x (cdr sorted)))))))
    (if (null? l)
        '()
        (insert (car l) (sort (cdr l) <)))))

How can Lisa know if this program is safe to run?

Actually, this question is something of a cart before the horse; what we should ask is, what does this program mean? To that question, we have the lambda calculus to answer for, as part of the Scheme language definition.

In this form we have binding forms, bound variable references, free variable references, and a few conditionals and constants. The compiler obtains this information during its expansion of the form. Here we are assuming that Lisa compiles the form in an environment with the conventional bindings for define, lambda, and if.

Of these, the constants, conditionals, lexical binding forms, and bound variable references are all safe.

The only thing we need be concerned about are the free variable references (null?, list, car, cdr, cons, and, interestingly, sort itself), and the effect of the toplevel definition (of sort).

In Scheme, forms are either definitions, expressions, or sequences of forms. Definitions bind names to values, and have no value themselves. Expressions can evaluate to any number of values. So the question here for Lisa is: is she interested in a definition of a sort procedure, or a value? If the latter, she must mark this form as unsafe, as it is a definition. If the former, she needs to decide on what it means for Bart to provide a definition to her.

In Guile, top-level definitions and variable references are given meaning by modules. Free variables are scoped relative to the module in which the code appears. If Lisa expands the sort definition in a fresh module, then all of the free variables will be resolved relative to that module -- including the sort reference within the lambda expression.

Lisa can probably give Bart access to all of the primitives that his program calls: car, cons, etc.

We start to see how the solution works, then: a program is safe if all of its free variables are safe. if is safe. null? is safe. And so on. Lisa freely provides these resources to Bart (through his program), because she knows that he can't break them, or use them to break other things. She provides those resources through a fresh module, in which she compiles his program. Once compiled, she can invoke Bart's sort directly, via ((module-ref bart-module 'sort) my-credit-card-numbers), without the need to run Bart's code in any kind of sandbox.
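A sketch of what Lisa's side might look like in Guile. The list of safe bindings is illustrative only, and real sandboxing needs much more care; this is untested sketch code, not a vetted API:

```scheme
;; Sketch: evaluate Bart's form in a module containing only
;; bindings Lisa considers safe.  Illustrative, not audited.
(define *safe-names*
  '(define lambda if quote null? list car cdr cons < equal?))

(define (make-safe-module)
  (let ((m (make-module)))
    ;; Copy each safe binding from Lisa's module into the
    ;; fresh, otherwise-empty module.
    (for-each (lambda (name)
                (module-define! m name
                                (module-ref (current-module) name)))
              *safe-names*)
    m))

(define (safe-eval form)
  (let ((m (make-safe-module)))
    (eval form m)
    m))

;; Lisa compiles Bart's program, then calls the result directly:
;;   (define bart-module (safe-eval bart-form))
;;   ((module-ref bart-module 'sort) my-credit-card-numbers)
```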

world enough, and time

However, there are two resources which Lisa transfers to Bart's program which are not passed as procedure arguments: time, and space.

What if Bart's program had a bug that made it fail to terminate? What if it simply took too much time? In this case, Marge may provide Lisa with a utility, safe-apply, which calls Bart's function, but which cancels it after some amount of time. Such a procedure is easy for Marge to implement with setitimer. setitimer is, however, one of those cross-cutting facilities which Marge would not want to provide to either Lisa or Bart directly.
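A sketch of safe-apply using Guile's setitimer and sigaction bindings (untested; a real version would also restore the previous SIGALRM handler, and escaping via a continuation leaves Bart's computation simply abandoned):

```scheme
;; Sketch: apply proc to args, but abort if it runs longer than
;; timeout seconds.  Returns 'timed-out on expiry.
(define (safe-apply proc args timeout)
  (call-with-current-continuation
   (lambda (return)
     (sigaction SIGALRM
       (lambda (sig) (return 'timed-out)))
     (setitimer ITIMER_REAL 0 0 timeout 0)  ; one-shot real timer
     (let ((result (apply proc args)))
       (setitimer ITIMER_REAL 0 0 0 0)      ; cancel the timer
       result))))
```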

The space question is much more difficult, however. Bart's algorithm might cons a lot of memory, but that's probably OK if it's all garbage. However it's tough for Marge to determine what is Lisa's garbage and what is Bart's garbage, and what non-garbage is Lisa's, and what non-garbage is Bart's, given that they may share objects.

In the Guile case, Marge has even fewer resources at her disposal, as the BDW-GC doesn't give you all that much information. If, however, she can assume accurate per-thread byte allocation counters, she can give Lisa the tools needed to abort Bart's program if it allocates too much. Similarly, if Marge restricts her program to one OS thread, she can guesstimate the active heap size. In that way she can give Lisa tools to restrict Bart's program to a certain amount of active memory.

related

Note that Marge's difficulties are not unique to her: until just a few weeks ago, the Linux kernel was still vulnerable to fork bombs, which are an instance of denial-of-service through resource allocation (processes, in that case).

Nor are Lisa's, for that matter, considering the number of Android applications that are given access to user data, and then proceed to give it to large corporations, and Android does have a security model for apps. Likewise the sort UNIX utility has access to your SSH private key, and it's only the hacker ethic that has kept free software systems in their current remarkably benign state; though, given the weekly Flash Player and Acrobat Reader vulnerabilities, even that is not good enough (for those of us that actually use those programs).

future

I'm going to work on creating a "safe evaluator" that Lisa can use to evaluate expressions from Bart, and later call them. The goal is to investigate the applicability of the ocap security model to the needs of an autonomy- and privacy-preserving distributed computing substrate. In such an environment, Bart should be able to share an "app" (in the current "app store" sense) with all of his friends. His friends could then run the app, but without needing to trust the app maker, Bart, or anyone else. (One precedent would be Racket's sandboxed evaluators; though I don't understand all of their decisions.)

The difficulty of this model seems to me to be more on the side of data storage than of local computation. It's tough to build a distributed queue on top of Tahoe-LAFS, much less GNUnet. Amusingly though, this sort of safe evaluation can be a remedy for that: if an app not only stores data but code, and data storage nodes allow apps to run safe mapreduce / indexing operations there, we may regain more proper data structures, built from the leaves on up. But hey, I get ahead of myself. Until next time, happy hacking.

towards a gnu autonomous cloud
Andy Wingo, 2010-04-10
https://wingolog.org/2010/04/10/towards-a-gnu-autonomous-cloud

My previous installments on November's GNU Hackers Meeting (hither, and thither) touched some topics that were important to me, but not as important as the one I'll mention tonight.

Tonight I want to talk about autonomy and the internet. I'll approach it from a roundabout direction.

the facebook problem

Many of you probably know someone who has had their Facebook account disabled. Here are a couple, and here are some thousands more. While I'm probably not the best person to speak of this, as I don't have a Facebook account, it's quite irritating to have this happen. It's like you've been unpersoned.

Beyond the individual indignation though, what is really important (and sometimes missing) is a more universal indignation: never mind me, what gives a corporation the right to unperson anyone?

Sure, I hear you arguing that it's their services, bla bla, but the end of it is that when you use Facebook, you lose autonomy -- communication and identity are needs just like any other.

I might be going out on a limb here, but consider Article 19 of the fine UN Declaration of Human Rights:

Everyone has the right to freedom of opinion and expression; this right includes freedom to hold opinions without interference and to seek, receive and impart information and ideas through any media and regardless of frontiers.

Emphasis mine, of course. What I'm saying is that you shouldn't depend on the government or a corporation or any other entity outside your actual community to be able to actualize these natural rights.

and further: article 12

Though I don't like the wording of this one as much as the previous article, nor the gendered pronouns, check it:

No one shall be subjected to arbitrary interference with his privacy, family, home or correspondence, nor to attacks upon his honour and reputation. Everyone has the right to the protection of the law against such interference or attacks.

As an American living in Europe, it has taken me some time to appreciate the European focus on privacy. I don't think people in the States understand the issues as well as people do here. OK, so your parents/grandparents lived through fascism: so what?

On a personal level, whether you're an industry insider, someone avoiding an abusive relationship, or an Earth Liberation Front activist, privacy is terribly important. It's not an exaggeration to say that to cede control over your privacy is to cede control over your identity.

But organizations that control your data on this level usually aren't stupid enough to let you know, or make you think about it. All that is left is a dull throb of database scandals and terms-of-service changes and wiretaps.

The problem is not the existence of malicious people: the problem is that your data is out there. All it takes is one nosy person, or one controlling governmental agency (cf. A, B), or one corporation wanting to monetize (cf. all of them).

There are simply no safeguards. There is nothing you can do if you want to be a part of the modern web to protect your privacy. Your data on servers is always available to wiretap, and subpoena if necessary. Your data is not your own.

origins

Both of these problems (unpersonage and privacy violations) stem from the fact that you rely on someone else's computer to fulfill your personal needs. RMS wrote about this in an article entitled Who does that server really serve?, and I agree with all of his points.

Stormy Peters' recent article, 10 free apps I wish were open source, illustrates many of these misunderstandings. Besides the misleading terminology, what if Gmail were AGPL-free? Would that protect users against the recent Buzz fiasco? No, because users are not in control of the software they use. Users should be able to modify the software they run; if they cannot, due to that software running on another machine, they should not run software on another machine.

I don't think Richard's article goes far enough. As I mentioned above, the problem is that your data is just "out there". Let's postulate an AGPL Gmail that also allows me to run my own Gmail software on Google's network. While this would meet the Free Software definition, it still harms me as a user, because anyone who has access to that server has access to my data.

Besides that, there is the practical difficulty, in that Facebook or Google would never allow you access to the programs that run on your data in that way.

What I'm building up to is the idea that the client-server paradigm is fundamentally incompatible with autonomy. Growing your own food is better than sharecropping, better than "web 2.0".

what shall we do, sir wingo

A fundamental problem requires a radical (adj.: to the root) solution. As is often the case, the seed of a solution has been with us for a long time: public-key cryptography.

Geeks have long enjoyed mailing each other signed and/or encrypted mails, allowing private communication over insecure networks, relying on webs of trust to ensure the identity of the sender. Asymmetric cryptography allows you to send and receive private messages over insecure channels, like the internet.

I won't belabor the point, as most of my readers have seen GPG; it is the Right Thing. But what would GPG-style interactions mean in the context of Facebook?

All of you are probably cringing at this point, imagining the complexity and security implications. But let's bask in that moment for a while, shall we: if it were the case that fellow facebooklicans sent you private messages via GPG, being able to view them sensibly over the web would imply that the Facebook server would have your private key.

Extrapolating this further, the very set of your "friends" is a kind of private data. If this data were properly encrypted and signed against your private key, to present the standard facebook view that most people know would again require your private key.

In the end, you can't have web services that access private data. Not if you want privacy, anyway.

an autonomous facebook?

To preserve the privacy of your identity, you should never send your private key over the wire. This is well-known. But if you are to do computation on your social network, as facebook.com does, then it follows that such computation must be done local to the user's machine.

But all of facebook on your local machine? Surely you're joking, Mr. Wingo! Well, yes and no. Obviously the answer is not "let's everyone download a program from facebook and run it locally with your private key as an argument". Not quite, anyway.

One good part of the so-called "web 2.0" is that I can code foo-anarchist-commune.org's web site in Scheme and no one is any the wiser. It's easy to deploy in today's environment; deploying e.g. a new facebook experience should not cause me to have to click something to install a new binary.

So, the constraints are:

  1. My key pair is my identity. My public key may be distributed, but my private key must be private.

  2. Since computation needs my private key, computation must happen locally. Viewing an "autonomous facebook" implies running a program on my local machine, with access to my private key.

  3. Since an "autonomous facebook" would be useless without other people, I need access to other people's information; I need a network too.

We can already draw a picture of what this looks like. Let's assume that the end-user experience is still via the web browser.

I think that my paranoid readers know where I'm going with this. My technically-minded readers will be flabbergasted, perhaps, at the enormity of the problem of implementing facebook under such constraints. How does my facebook know that it's participating in a network? How does it know about my friend Leif? How does it get updates? Where is the database?

autonomous data model

Well, one thing is clear: someone needs to hold all of that data. Who should do it? In the case of my data (my photos, my messages to others, etc.), I should be the one, as it makes me more autonomous. Everyone needs to seed their own data on the network.

I might choose to seed my data from multiple locations, for reliability. Beyond that, nodes might cache information that is routed through them.

One way to implement such a distributed store would be git-like, with content-based addressing and consistent hashing; or like bittorrent. It would have efficiency advantages. I thought for a while that this would be the solution, but GHM folk brought up the privacy argument, that your pattern of network access is too revealing.
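For concreteness, the content-addressed half of that idea is tiny to sketch, whatever its privacy failings (sha256, kv-put! and kv-get hypothetical as before):

```scheme
;; Sketch of content-based addressing: the key is the hash of the
;; content, so any node can verify what it caches or serves.
;; sha256, kv-put! and kv-get are hypothetical.
(define (store! content)
  (let ((key (sha256 content)))
    (kv-put! key content)
    key))                              ; the key doubles as a reference

(define (fetch key)
  (let ((content (kv-get key)))
    (if (equal? key (sha256 content))
        content
        (error "corrupt data from store" key))))
```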

So my current thought is to use GNUnet somehow. I'm not sure how this will go, but it's worth a try.

new operating system

Currently, to deploy a web application, you have to pay for servers and bandwidth, and this eventually causes your interests to diverge further from those of your users. With an autonomous cloud, you could instead deploy web applications using the compute power and bandwidth of leaf nodes -- the power of the people using your software.

This would drastically lower the hacktivation energy for a new project. The little green sandbox above starts to approach a new kind of operating system, even -- a new program to run your programs.

Obviously, I'm thinking Guile would be a fine runtime for the sandbox, to run programs written for Guile -- in Ecmascript or Lua or Scheme or Elisp or whatever other languages people implement for Guile. The user would receive the source code, and running it would automatically compile it on their machine. The application source would also be available to modify and redistribute.

Having a sandbox for mobile code also raises the possibility of interesting mapreduce-type operations, to index the distributed data store.

Firefox could be modified with a plugin to add a new addressing mode, which would go through the HTTP server running locally to your machine. You would be browsing the "autonomous web".

Of course, since the whole thing is based on protocols, one might substitute the Guile environment for something else; or write an alternate interface to Facebook that works over a console or presents you with a native (e.g., Clutter) interface.

related work

There have been loads of people thinking these ideas; none of them is new.

My ongoing use of the term "autonomous" is a nod to anarchists, and to the autonomo.us group. Autonomo.us would be a great organizing place for work around this, but their list server is a bit moribund; somewhat ironic. Perhaps we can return life to that group, though.

GNU Social is a project to make a free-as-in-freedom social network. I think it's a great initiative, and it's probably the place to go if you want to build an alternative to Facebook right now.

GNU Social has made the decision to just get something working. This is the right thing to do, IMO; but near-term solutions should not prevent concurrent research for the long-term. In the end if making an autonomous cloud turns out to be possible, perhaps we can rebase GNU Social on top of the autonomous infrastructure.

Is there something else I should really be looking at? Let me know! I don't think one can ever do a full survey of this field -- better to just start hacking -- but I'm interested in good ideas, especially to the data storage and access problem.

plan

All of this is a bit pie-in-the-sky, but I am going to see if I can work up a proof-of-concept for the upcoming GHM in July. If you are interested in helping this project, probably the best thing to do is to code up some little demo application using GNUnet or some other store. Once you have that, drop by to see me in #guile and we'll talk.

Comments welcome!