A modest proposal to fix the Web

As we observe the 25th anniversary of the first World Wide Web site in August 2016, I think it's time to revive this proposal for how to fix the Web. As the de facto repository of all human knowledge, the Web has several significant limitations:

Everything is rent. All domain names are registered on a yearly basis, and most hosting services are rented month-to-month. That means if everybody stopped paying their bills, most of the Web would disappear in less than a year, even if all the infrastructure stayed intact. (This is true for other Internet protocols as well, but we don't depend on them so heavily.)
Documents change. There is no guarantee that what is at a URL today will be there tomorrow. If you don't see that as a problem, ask a university student who's researching a paper, or the professor reading her paper. The Web was originally created for publishing research, and yet it's nearly useless for its intended purpose because -- unless that research is subsequently published on paper -- there's no way to assure readers that the references still say what they used to. If you (as a researcher) try to cover your assets by saving a local copy of all the sources you cite, your copies are suspect because you could edit them to say whatever you want; only the original source can be considered authoritative, even though it can change at any time! And this problem only gets worse when combined with the first problem (everything is rent): if a lapsed domain name gets registered by a cybersquatter, then what is at a URL now could completely contradict what used to be there. Moreover, if you link from your pages to pages on another site -- which is after all the whole point of hypertext -- and those pages are moved or deleted, your page is now broken, and you might not notice for months because the links are exclusively one-way: even if the site you linked to wanted to tell you to update your link, they have no way to do so.

By the way, I apologize to anyone who linked to the earlier revision of this article, from 2012. If it no longer says what it did when you read it before, then I guess I've just demonstrated my point!
Nothing is archived. Sure, a lot of stuff gets archived by Google and archive.org and the Library of Congress and so on, but not in the way that libraries used to archive magazines and newspapers. Time was, you could go into any major library and find hardbound volumes or microfilm that had been made out of magazines and newspapers to keep them readable for well over a century even if civilization collapsed. These archival volumes were not ordered from the publisher; they were made by the libraries on their own initiative for the convenience of their patrons, so that none of us had to contact the publisher to read an old article. Show me a library that makes hardbound books or microfilm or even an electronic copy of, say, the Huffington Post or Grist or Slate. I'm sure these online periodicals have their own archives, and it's certainly convenient to contact them now, but what if something happens to them? What if computers are no longer available? All that work will be lost forever.
Security is an afterthought. It's not an accident that the Internet is insecure and compromises our privacy at seemingly every opportunity. It was designed to do that. The people who designed it had philosophical reasons for wanting it that way. Any security or privacy we have online is a chain of weak links, and if any link fails, private data gets spilled onto the public Internet.

Here's an example: several years ago, all my Web sites were hosted at a certain major hosting company with a reptilian mascot. One day, I botched an upgrade and one of my scripts began misbehaving. They invoked a clause that was buried deep in the fine print of their user agreement and suspended all the scripts on all of my sites, instead serving up cached versions of the pages that they had captured, and did not tell me for over a day that they were doing this. One problem: I had been logging into the sites to edit confidential customer data, which was displayed to me at the same URLs as the public versions of the pages that did not have confidential data. The versions of the pages the hosting company cached were only supposed to be shown to me, but instead they served them to the whole Internet. By the time I found out about it, Google had indexed my customer data, and I had to manually remove it. (No harm was done, but the hosting company lost my business.)
Logins and passwords are a terrible way to identify yourself. We've become accustomed to logging into sites with a username and password, and we know we should never reuse a password, but we all know someone who, against all advice, uses the same password on multiple sites. Or gives out their password to friends or relatives or colleagues. It's just so much more convenient than what we're somehow supposed to do, and under the current system, convenience trumps security.
HTML is the wrong language for the job. In the Web's early days, it seemed like a good idea to allow people to code markup by hand, and to tolerate their inevitable errors. This led directly to the browser wars and billions of wasted hours on the part of Web developers. Even now that all browsers have adopted a standard Document Object Model (DOM), HTML remains an inefficient and error-prone way to generate the DOM. Many sites today serve up very little HTML and provide most of the content as JSON, building the DOM in JavaScript in the browser. This works great for end users, but search engines don't generally render JavaScript, so the JSON content of these pages doesn't get indexed for searching!

So here is my proposal to fix all these problems. It amounts to a non-technical specification for a new type of Web server software. I am not in a position to write a technical spec, let alone write the software, so I leave that to you.

Forget the Web server. We need an identity server. Given a personal identifier such as joe_schmo@example.com, my computer should be able to ask example.com for joe_schmo's public encryption key. Using this, I can verify (by the digital signature) that he wrote any of the pages on his Web site, and I can contribute new information to the site that only he can decrypt. If I want to log into his site, I need only provide my own identifier (no password); his server can then obtain my public key and send me files that only I can open. The only password I ever need to use is the one that unlocks my private key, on whatever device I'm currently using. This personal identity service would take the place of the existing certificate authorities we currently use for secure Web pages. The major difference would be that if I save a secure file on my computer or a flash drive, or if my computer caches it to display again later, the file would remain encrypted the same way it was sent to me, instead of being saved as plain text!
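
As a rough illustration of the identity lookup and signature check described above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `PUBLIC_KEYS` table stands in for whatever example.com would answer when asked for a key, the function names are hypothetical, and the key pair is textbook RSA built from two well-known Mersenne primes, which is far too weak for real use. A real identity server would rely on a vetted cryptography library.

```python
import hashlib

# TOY KEY PAIR, for illustration only: textbook RSA without padding is NOT secure.
# The primes are the well-known Mersenne primes 2**89 - 1 and 2**127 - 1.
P, Q = 2**89 - 1, 2**127 - 1
N = P * Q                          # public modulus
E = 65537                          # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (needs Python 3.8+)

# Hypothetical directory: what example.com might return for a key request.
PUBLIC_KEYS = {"joe_schmo@example.com": (N, E)}


def split_identifier(identifier):
    """Split 'user@domain' into (user, domain); the domain is the server to ask."""
    user, sep, domain = identifier.rpartition("@")
    if not sep or not user or not domain:
        raise ValueError(f"not a valid identifier: {identifier!r}")
    return user, domain


def digest(document, modulus):
    """Hash the document and reduce it into the range the toy key can sign."""
    return int.from_bytes(hashlib.sha256(document).digest(), "big") % modulus


def sign(document, private_exponent, modulus):
    """Author side: sign the document's digest with the private key."""
    return pow(digest(document, modulus), private_exponent, modulus)


def verify(identifier, document, signature):
    """Reader side: look up the author's public key and check the signature."""
    split_identifier(identifier)  # reject malformed identifiers early
    modulus, exponent = PUBLIC_KEYS[identifier]
    return pow(signature, exponent, modulus) == digest(document, modulus)
```

A reader holding a page and its signature would call `verify("joe_schmo@example.com", page_bytes, signature)`; any change to the page after signing makes the check fail.
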
A URL (including parameters) should represent a specific revision of a document, for all time. Once a file has been published at a given URL (including a timestamp, as specified below), that's what's there. If you log into a site (as specified above) to view a customized version of a page which is sent to you encrypted, that should also have a distinct URL or parameter to indicate that it is a different document from the unencrypted version the public sees. This requirement would need to apply to JSON data files and images as well as to the HTML, CSS, and JavaScript files that render the data.
All URLs should have timestamps and digital signatures. This could be as simple as a ?time= parameter in the URL; the format doesn't matter as long as it is consistent over time. The timestamp would be used to retrieve a specific version of a document (including all necessary files, such as JSON data) in the event that it changes. Whenever you visit a URL without a timestamp, you get the most recent version along with the timestamp of its last change; share that URL (which now includes the timestamp) in a research paper or on Twitter or wherever and everyone will be looking at the same version of the document you saw, even if it has changed since then. The key here is that a significant percentage of Web servers would have to put timestamps in all their URLs, in a recognizable format. People would need to be able to take for granted that any time they use a timestamped URL they will see a specific version of the document, not necessarily the most recent. The digital signature allows a reader (or ideally the browser software) to check whether the file has been altered by a third party since its publication.
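
The lookup rule in this item can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the timestamp is taken to be an integer (for example Unix seconds) in a `?time=` parameter, the `history` dictionary stands in for the server's store of revisions, and the function names are hypothetical.

```python
from bisect import bisect_right
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def split_timestamp(url):
    """Return (url_without_time, timestamp_or_None) for an optional ?time= parameter."""
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    values = query.pop("time", None)
    base = urlunsplit(parts._replace(query=urlencode(query, doseq=True)))
    return base, (int(values[0]) if values else None)


def pin(url, timestamp):
    """Return the URL with its ?time= parameter set to the given timestamp."""
    base, _ = split_timestamp(url)
    parts = urlsplit(base)
    query = parse_qs(parts.query, keep_blank_values=True)
    query["time"] = [str(timestamp)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


def resolve(history, url):
    """Look up a URL and return (pinned_url, content).

    history maps each base URL to a list of (timestamp, content) pairs sorted
    by timestamp. With no ?time= the newest revision is served; with one, the
    revision that was current at that moment is served.
    """
    base, requested = split_timestamp(url)
    revisions = history[base]
    if requested is None:
        stamp, content = revisions[-1]
    else:
        index = bisect_right([t for t, _ in revisions], requested) - 1
        if index < 0:
            raise LookupError(f"{base} did not exist at time {requested}")
        stamp, content = revisions[index]
    return pin(base, stamp), content
```

Requesting the bare URL returns the newest revision together with a pinned URL to share; requesting a pinned URL returns that same revision no matter how often the page changes afterwards.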

Note that in order to preserve the look of a page as well as its data, the versions of the HTML, CSS, JavaScript, JSON data, and images all need to be kept in sync with each other. For example, if I request the latest version of an HTML page (with no timestamp) and it renders with the latest version of a CSS file (again, no timestamp specified), and if I save a copy of the page to disk and reopen it a year later after the site's CSS has changed, it needs to load the old version of the CSS in order to display the same way I saw it a year before. This means the browser needs to save the HTML with the URL of the CSS file I received, not the one it originally requested. There's a precedent for doing this: if you save HTML in most browsers, any relative links will be saved as absolute.

Ideally, I'd like to see the timestamping accomplished by version control. That is, the timestamp would refer not to a revision of a single file (say, the HTML), but to a commit in the repository that might affect multiple files or even all of the files that make up the page. That means that the server doesn't need to store old versions of each file in perpetuity; it keeps the newest copy ready to serve and generates old ones upon request by consulting the record of changes. If servers and browsers across the Internet could standardize on a version control system, then large, cached files could be updated by just loading the changes instead of the entire file, cutting down on Internet traffic.
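
To make the "newest copy plus a record of changes" idea concrete, here is a small Python sketch. It is not how any particular version control system works; it handles a single file rather than a whole repository, and the `History` class and its method names are hypothetical. The server keeps only the newest lines in full and, for each earlier revision, a reverse delta that records what must change to step back one revision.

```python
from bisect import bisect_right
from difflib import SequenceMatcher


def reverse_delta(newer, older):
    """Record how to rebuild `older` from `newer`, storing only the lines that differ."""
    matcher = SequenceMatcher(a=newer, b=older, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))         # reuse lines of the newer copy
        else:
            delta.append(("lines", older[j1:j2]))  # lines only the older copy has
    return delta


def apply_delta(newer, delta):
    """Rebuild the older revision from the newer one and a reverse delta."""
    older = []
    for op in delta:
        if op[0] == "copy":
            older.extend(newer[op[1]:op[2]])
        else:
            older.extend(op[1])
    return older


class History:
    """Keeps the newest revision in full and older ones as reverse deltas."""

    def __init__(self):
        self.timestamps = []  # commit times, oldest first
        self.deltas = []      # deltas[i] rebuilds revision i from revision i + 1
        self.newest = []

    def commit(self, timestamp, lines):
        if self.timestamps:
            if timestamp <= self.timestamps[-1]:
                raise ValueError("timestamps must increase")
            self.deltas.append(reverse_delta(lines, self.newest))
        self.timestamps.append(timestamp)
        self.newest = list(lines)

    def checkout(self, timestamp):
        """Return the lines as they stood at the given time."""
        index = bisect_right(self.timestamps, timestamp) - 1
        if index < 0:
            raise LookupError("nothing was published that early")
        lines = self.newest
        # Walk backwards from the newest revision to the requested one.
        for i in range(len(self.timestamps) - 2, index - 1, -1):
            lines = apply_delta(lines, self.deltas[i])
        return list(lines)
```

The same deltas, computed in the forward direction, are what a browser with a cached older copy would need to download instead of the whole file.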
Any timestamped file should be cacheable. Suppose an article published in California becomes all the rage in New York -- everybody is passing this URL around to their New Yorker friends. Under the current system, every single time any one of them requests the article, that request goes to the publisher's server in California. The server gets "slashdotted" -- it can't handle the flood of requests. There are huge, elaborate systems to try to avoid this problem, but they all require the publisher to take the initiative (and usually invest a lot of money) to prepare for a spike in traffic that might occur at any time. With the timestamp system, a server in New York seeing multiple requests for the same version of a document (judged by the timestamp) would have every right and full permission to serve up a cached copy instead of sending the request to California, effectively cutting the traffic for everyone. The publisher in California benefits by not having to preemptively invest in infrastructure, the intermediate servers don't have any more traffic than they did before, the readers in New York get their article faster, and everyone in between sees less traffic; the only people who don't benefit are the ones selling the huge, elaborate systems required to address the current problem. (The caching servers would need to collect traffic statistics and send them on to the publisher once the traffic dies down; otherwise the publisher would have no knowledge of their own popularity!)

Now apply this caching to a viral video instead of an article and suddenly YouTube needs a tiny fraction of the servers it currently has, because any server can help serve popular videos. For that matter, anybody who wants to publish a video on their own site no longer needs YouTube, because the more popular the video becomes, the faster it will load instead of slower. If a site goes down because the publisher forgot to pay her bills, or if a hosting company has a technical glitch, the cached copies of the pages will still be accessible for as long as they are in demand.

Now suppose a publisher wants to remain the one and only authoritative source of some information. How do they prevent its being cached? Just leave off the timestamp from the URL when serving it. This would be necessary for any site where the content is constantly changing, such as the home page of a news outlet. It wouldn't prevent people making non-authoritative copies any more than the current system does, but it would create a clear distinction between an authoritative and non-authoritative copy, which would be useful for libel suits, etc.
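
The caching rules above (cache anything with a timestamp, pass everything else through, and report the traffic afterwards) could look roughly like this in Python. It is a sketch under assumptions: the `RelayCache` class is hypothetical, `fetch_from_origin` stands in for a real HTTP request to the publisher, and the timestamp is assumed to be a `?time=` parameter.

```python
from urllib.parse import parse_qs, urlsplit


class RelayCache:
    """An intermediate server that may keep copies of timestamped documents."""

    def __init__(self, fetch_from_origin):
        self.fetch_from_origin = fetch_from_origin  # callable: url -> content
        self.copies = {}  # timestamped URL -> cached content
        self.hits = {}    # timestamped URL -> requests served from the cache

    def get(self, url):
        if "time" not in parse_qs(urlsplit(url).query):
            # No timestamp: the publisher stays the only source, so never cache.
            return self.fetch_from_origin(url)
        if url not in self.copies:
            self.copies[url] = self.fetch_from_origin(url)
        else:
            self.hits[url] = self.hits.get(url, 0) + 1
        return self.copies[url]

    def report(self):
        """Hand the collected traffic statistics to the publisher and reset them."""
        stats, self.hits = self.hits, {}
        return stats
```

However many readers in New York ask for the same timestamped URL, the publisher's server is contacted once; the counts returned by `report()` are what would be forwarded so the publisher still learns how popular the article was.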
Bad links should fix themselves. Since the browser has identified the page's author (through the identity server described above), it can notify them immediately if a page contains a broken link or image. If the requested URL has a 301 redirect, the notification could include the new URL of the page, and the server receiving the notification could even update the link automatically. (This would, of course, result in a new revision with its own unique timestamp.) This self-correcting mechanism would facilitate pages changing ownership when a publisher dies or goes out of business, because 301 redirects would only need to remain in place long enough to fix all the links they break.
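
The notification flow for bad links might be sketched as follows in Python. The report format and both function names are hypothetical, and the string replacement is a deliberate simplification; a real server would parse the markup and publish the result as a new timestamped revision.

```python
def link_report(page_url, link_url, status, location=None):
    """Browser side: build a report for the page's author, or None if the link is fine."""
    if status == 301 and location:
        return {"page": page_url, "old": link_url, "new": location}
    if status in (404, 410):
        return {"page": page_url, "old": link_url, "new": None}
    return None


def apply_report(html, report):
    """Server side: rewrite the moved link when the report names a new URL."""
    if report is None or report["new"] is None:
        return html  # nothing to fix automatically; the author must decide
    old_attr = 'href="{}"'.format(report["old"])
    new_attr = 'href="{}"'.format(report["new"])
    return html.replace(old_attr, new_attr)
```

A 301 yields a report the server can act on by itself, while a 404 or 410 yields a report that only alerts the author.
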
Index the DOM, not the HTML. I'm sure search engines are working on this, but the current state of affairs in which they index only HTML and not JavaScript-generated content acts as a disincentive for site authors to abandon HTML. The sooner we can leave HTML behind, the better, in my opinion.

Are there problems with this proposal? I'm sure. But as I've pointed out, there are significant problems with the current system as well. Which is worse, losing a little income in the short run or losing the repository of all human knowledge because we stopped paying the rent?

Please comment below!
