Thoughts on Deploying and Maintaining SMW Applications

In September or October of last year, I received an email from someone who had come across CC Teamspace and was wondering if there was a demo site available they could use to evaluate it. I told them, “No, but I can probably throw one up for you.” A month later I had to email them and say, “Sorry, but I haven’t found the time to do this, and I don’t see that changing.” This is clearly not the message you want to send to possible adopters of your software — “Sorry, even I can’t install it quickly.”

Now, part of the issue was my own meta-perfectionism: I wanted to figure out a DVCS-driven upgrade and maintenance mechanism at the same time. But even when I faced the fact that I didn’t really need to solve both problems at once, I quickly became frustrated by the installation process. The XML file I needed to import seemed to contain extraneous pages, and things seemed to have changed between MediaWiki and/or extension versions since the export was created. I kept staring at cryptic errors, struggling to figure out if I had all the dependencies installed. This is not just a documentation problem.

If we think about the application life cycle, there are three stages a solution to this problem needs to address:[†]_

  1. Installation
  2. Customization
  3. Upgrade

If an extension is created using PHP, users can handle all three (and make life considerably easier for themselves if they’re a little VCS-savvy). But if we’re dealing with an “application” built using Semantic MediaWiki and other SMW extensions, it’s possible that there’s no PHP at all. If the application lives purely in the wiki, we’re left with XML export/import[‡]_ as the deployment mechanism. With this we get a frustrating release process, workable Customization support, and a sub-par Installation experience.

The basic problem is that we currently have two deployment mechanisms: full-fledged PHP extensions, and XML dumps. If you’re not writing PHP, you’re stuck with XML export-import, and that’s just not good enough.

A bit of history: when Steren created the initial release of CC Teamspace, he did so by exporting the pages and hand-tweaking the XML. That is not a straightforward, deterministic process, and it’s not one we want to go through every time a bug-fix release is needed.

For users of the application, once the import (Installation) is complete (assuming it goes better than my experience), Customization is fairly straightforward: you edit the pages. When an Upgrade comes along, though, you’re in something of a fix: how do you re-import the pages while retaining the changes you may have made? Until MediaWiki is backed by a DVCS with great merge handling, this is a question we’ll have to answer.

We brainstormed about these issues at the same time we were thinking about Actions. Our initial thoughts were about making the release and installation process easier: how does a developer[◊]_ indicate “these pages in my wiki make up my application, and here’s some metadata about it,” in a way that makes life easier?

We brainstormed a solution with the following features:

  1. An “Application” namespace: just as Forms, Filters, and Templates have their own namespace, an Application namespace would be used to define groups of pages that work together.
  2. Individual Application Pages, each one defining an Application in terms of Components. In our early thinking, a Component could be a Form, a Template, a Filter, or a Category; in the latter case, only the SMW-related aspects of the Category would be included in the Application (ie, not any pages in the Category, on the assumption that they contain instance-specific data).
  3. Application Metadata, such as the version[♦]_, creator, license, etc.

A nice side effect of using a wiki page to collect this information is that we now have a URL we can refer to for Installation. The idea was that a Special page (ie, Special:Install, or Special:Applications) would allow the user to enter the URL of an Application to install. Magical hand waving would happen, the extension dependencies would be checked, and the necessary pages would be installed.
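
To make the hand waving a little more concrete, here is a rough sketch of what the install step might do, written as a standalone script rather than a Special page. Everything here is hypothetical: the wiki URL and component names are invented, and a real implementation would also need authentication, dependency checks, and error handling. Special:Export is the only real MediaWiki feature the sketch leans on:

  # Hypothetical sketch of the Special:Install flow; the Application
  # manifest, wiki URL, and page names below are all invented.
  import urllib
  import urllib2
  from xml.etree import ElementTree

  SOURCE_WIKI = "http://example.org/wiki"   # wiki hosting the application

  def export_wikitext(title):
      """Fetch the current wikitext of a page via Special:Export."""
      url = "%s/Special:Export/%s" % (SOURCE_WIKI, urllib.quote(title))
      tree = ElementTree.parse(urllib2.urlopen(url))
      for element in tree.getroot().getiterator():
          if element.tag.endswith("}text"):   # skip the export XML namespace
              return element.text or ""
      return ""

  def install_application(components):
      for title in components:
          text = export_wikitext(title)
          # A real Special:Install would check extension dependencies here,
          # then write `text` into the local wiki (e.g. via the edit API).
          print "would install %s (%d characters)" % (title, len(text))

  # Components a hypothetical Application:Teamspace page might list.
  install_application(["Template:Task", "Form:Task", "Category:Tasks"])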

While we didn’t get too far with fleshing out the Upgrade scenario, I think that a good first step would be to simply show the edit diff if the page has changed since it was Installed, and let the user sort it out. It’s not perfect, but it’d be a start.

I’m not sure if this is exactly the right approach to take for packaging these applications. It does effectively invent a new packaging format, which I’m somewhat wary of. At the same time, I like that it seems to utilize the same technologies in use for building these applications; there’s a certain symmetry that seems reassuring. Maybe there are other, obvious solutions I haven’t thought of. If that’s the case, I hope to find them before I clear enough time from the schedule to start hacking on this idea.


date:2010-01-25 21:24:51
wordpress_id:1353
layout:post
slug:thoughts-on-deploying-and-maintaining-smw-applications
comments:
category:cc, development
tags:cc, mediawiki, semantic mediawiki, smw

“Actions” for SMW Applications (Hypothetically)

Talking about AcaWiki has me thinking some more about our experiences over the past couple years with Semantic MediaWiki, particularly about building “applications” with it. I suppose that something like AcaWiki could be considered an application of sorts — I certainly wrote about it as such earlier this week — but in this case I’m talking about applications as reusable, customizable pieces of software that do a little more than just CRUD data.

In 2008 we were using our internal wiki, Teamspace, for a variety of things: employee handbook, job descriptions, staff contact information, and grants. We decided we wanted to do a better job at tracking these grants, specifically the concrete tasks associated with each, things we had committed to do (and which potentially required some reporting). As we iterated on the design of a grant, task, and contact tracking system, we realized that a grant was basically another name for a project, and the Teamspace project tracking system was born.

As we began working with the system, it became obvious we needed to improve the user experience; requiring staff members to look at yet another place for information just wasn’t working. So Steren Giannini, one of our amazing interns, built Semantic Tasks.

Semantic Tasks is a MediaWiki extension, but it’s driven by semantic annotations on task pages. Semantic Tasks’ primary function is sending email reminders. One of the things I really like about Steren’s design is that it works with existing MediaWiki conventions: we annotate Tasks with the assigned (or cc’d) User page, and Semantic Tasks gets the email addresses from the User page.

There were two things we brainstormed but never developed in 2008. I think they’re both still areas of weakness that could be filled to make SMW even more useful as an application platform. The first is something we called Semantic Actions: actions you could take on a page that would change the information stored there.

Consider, for example, marking a task as completed. There are two things you’d like to do to “complete” a task: set the status to complete and record the date it was completed. The thought was that it’d be very convenient to have “close” available as a page action, one which would effect both changes at once without requiring the user to manually edit the page. Our curry-fueled brainstorm was that you could describe these changes using Semantic MediaWiki annotations[1]_. Turtles all the way down, so to speak.
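
Mechanically, whatever form the declarative description took, the effect of a “close” action on a Task page would be something like the sketch below. The property names (status, completed) are invented for illustration; only the [[property::value]] annotation syntax is real SMW:

  # Hypothetical: what a "close" action would do to a Task page's wikitext.
  # The property names are invented; [[property::value]] is real SMW syntax.
  import datetime
  import re

  def close_task(wikitext):
      today = datetime.date.today().isoformat()
      wikitext = re.sub(r"\[\[status::[^\]]*\]\]",
                        "[[status::complete]]", wikitext)
      return wikitext + "\n[[completed::%s]]" % today

  page = "Write the Q3 report.\n[[status::open]] [[assigned to::User:Steren]]"
  print close_task(page)

The interesting part, of course, is describing that transformation in the wiki itself rather than in code, which this sketch says nothing about.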

The amount of explaining this idea takes, along with some distance, makes me uncertain that it’s the right approach. I do think that being able to easily write extensions that implement something more than CRUD is important to the story of SMW as a “real” application platform. One thing that gives me pause is the fear that we are effectively rebuilding Zope 2’s ZClasses, only crappier. ZClasses, for those unfamiliar, were a way to create classes and views through a web-based interface. A user with administrative rights could author an “application” through the web, getting lots of functionality for “free”. The problem was that once you exhausted ZClasses’ capabilities, you pretty much had to start from scratch when you switched to on-disk development. Hence Zope 2’s notorious “Z-shaped learning curve”. It’s clear to me now that building actions through the web is, by necessity, going to expose a limited feature set. The question is whether that’s enough, or if we should encourage people to write [Semantic] MediaWiki extensions that implement the features they need.

Maybe the right approach is simply providing really excellent documentation so that developers can easily retrieve the values of the SMW annotations on the pages they care about. You can imagine a skin that exists as a minor patch to Monobook or Vector, which uses a hook to retrieve the installed SMW “actions” for a page and displays them in a consistent manner.

Regardless of the approach taken, if SMW is going to be a platform, there has to be an extensibility story. That story already exists in some form; just look at the extensions already available. Whether the existing story is sufficient is something I’m interested in looking at further.

Next time: Thoughts on Installation and Deployment.


[1]The difference between Semantic Tasks and our hypothetical Semantic Actions is that the latter was concerned solely with making some change to the relevant wiki page.
date:2010-01-07 22:13:41
wordpress_id:1346
layout:post
slug:actions-for-smw-applications-hypothetically
comments:
category:cc, development
tags:cc, mediawiki, semantic mediawiki, smw, teamspace

Caching WSGI Applications to Disk

This morning I pushed the first release of wsgi_cache to the PyPI, laying the groundwork for increasing sanity in our deployment story at CC. wsgi_cache is disk caching middleware for WSGI applications. It’s written with our needs specifically in mind, but it may be useful to others, as well.

At the core of Creative Commons’ technical responsibilities are the licenses: the metadata, the deeds, the legalcode, and the chooser. While the license deeds are mostly static and structured in a predictable way, there are some “dynamic” elements; we sometimes add more information to try and clarify the licenses, and volunteers are continuously updating the translations that let us present the deeds in dozens of languages. These are dynamic in a very gross sense: once generated, we can serve the same version of each deed to everyone. But there is an inherent need to generate the deeds dynamically at some point in the pipeline.

Our current toolset includes a script for [re-]generating all or some of the deeds. It does this by [ab]using the Zope test runner machinery to fire up the application and make lots of requests against it, saving the results in the proper directory structure. The result of this is then checked into Subversion for deployment on the web server. This works, but it has a few shortfalls and it’s a pretty blunt instrument. wsgi_cache, along with work Chris Webber is currently doing to make the license engine a better WSGI citizen, aims to streamline this process.

The idea behind wsgi_cache is that you create a disk cache for results, caching only the body of the response. We only cache the body for a simple reason — we want something else, something faster, like Apache or another web server, to serve the request when it’s a cache hit. We’ll use mod_rewrite to send the request to our WSGI application when the requested file doesn’t exist; otherwise it hits the on-disk version. And cache “invalidation” becomes as simple as rm (and as fine-grained as single resources).
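
As a very rough illustration of the idea (a sketch, not the actual wsgi_cache code), the middleware just has to tee each response body off to a file whose path mirrors the request path:

  # Sketch of disk-caching WSGI middleware; not the real wsgi_cache.
  import os

  class DiskCacheMiddleware(object):
      """Write each response body to disk, keyed by the request path."""

      def __init__(self, app, cache_dir):
          self.app = app
          self.cache_dir = cache_dir

      def __call__(self, environ, start_response):
          path = environ.get("PATH_INFO", "/").lstrip("/") or "index"
          cache_file = os.path.join(self.cache_dir, path)

          # Run the wrapped application and collect the body; only the body
          # is stored. On the next request mod_rewrite sees the file on disk
          # and Apache serves it without this code ever running.
          body = "".join(self.app(environ, start_response))

          directory = os.path.dirname(cache_file)
          if directory and not os.path.isdir(directory):
              os.makedirs(directory)
          cache = open(cache_file, "wb")
          cache.write(body)
          cache.close()

          return [body]

A real implementation would also want to check the response status and sanitize the requested path before writing anything to disk; the sketch happily ignores both.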

There are some limitations which might make this a poor choice for other applications. Because you’re only caching the response body, it’s impossible to store other header information. This can be a problem if you’re serving up different content types which can’t be inferred from the path (note that we use filenames that look like deed.fr and deed.de, so we tell Apache to override the content type for everything; this works for our particular scenario). Additionally, this approach only makes sense if you have another front-end server that can serve up the cached version faster; I doubt that wsgi_cache will win any speed challenges for serving cached versions.

We’re not quite ready to roll it out yet, and I expect we’ll find some things that need to be tweaked, but a test suite with 100% coverage makes that a challenge I’m up for. If you’re interested in taking a look (and adapting it for your own use), you can find the code in Creative Commons’ git repository.

date:2010-01-05 23:37:29
wordpress_id:1309
layout:post
slug:caching-wsgi-applications-to-disk
comments:
category:cc, development
tags:cache, cc, middleware, python, wsgi, wsgi_cache

AcaWiki: On Building Emerging Applications

I’m woefully late in noting the launch of AcaWiki. Mike does a good job exploring the sweet spot AcaWiki may fill between research blogging and open access journals, and where AcaWiki fits into the wiki landscape. AcaWiki is interesting to me for two reasons; first, I was the technical lead on the project, and second, it’s another recent example of building a site using MediaWiki as a platform. More specifically, we used MediaWiki along with Semantic MediaWiki, Semantic Forms, and several other related extensions as the platform for the site.

The idea of using a wiki for a community-oriented site is far from new. The difference here is that Neeru came to us talking about specific ways people could interact with the site — specific structured data she wanted to organize and capture about academic articles. For anyone familiar with MediaWiki and Wikipedia, the obvious answer is templates; Wikipedia uses them extensively to provide a consistent presentation for parts of an article (messages about the article, citations, etc). The catch is that for someone coming to a site for the first time, who perhaps has not edited a wiki previously, templates are a bit of inside baseball — you need to know which one to use, and you need to know how to format them in your article. Of course these are trainable skills, but I suspect for many users they’re non-obvious. Semantic Forms lets us provide a form for entering these fields, which is then translated to a template.

The question that comes up when discussing this approach with non-wiki-philes is, “why use a wiki at all? if all you need are CRUD forms, why not just whip it up in Rails, Django, etc?” The question is a good one — a specialized tool almost always has the potential to look fantastic compared to an off the shelf one. And who wants to learn that weird markup syntax, anyway? The thing is, at the end of the day, AcaWiki isn’t a software project, it’s a community project. There isn’t a team of engineers available to help move the toolset forward. There isn’t staff available to fix bugs and write migration scripts. So using off the shelf tools with active communities is essential to achieving any amount of scalability.

As Mike points out, there are some niches AcaWiki seems primed to fill. While working on the site, however, it was clear there are lots of unanswered questions about how that will actually happen. AcaWiki, like many sites that seek to serve a community of interest in a given area, is an emerging application. The data schema isn’t well defined, and we don’t necessarily know how users are going to interact with the site. The goal is to get something that users can use in place; something that provides just enough structure to encourage newcomers, while retaining the plasticity and flexibility needed to grow and evolve.

As I mentioned before, this is not the first “application” we’ve built using this tool chain; we use MediaWiki and Semantic MediaWiki at Creative Commons in many places. We use it to track Events our community puts together, and we use it to track things we’d like developers to work on (NB: the latter is woefully out of date and stagnated; perhaps a negative use case for this sort of tool). We even built a system for tracking grants and projects using it.

Using MediaWiki and Semantic MediaWiki as an application platform isn’t appropriate for every project and it isn’t a cure all; there are real limitations, like any off the shelf system. In some cases these issues are magnified due to the fact that it’s not explicitly designed as a platform. For applications that rely on community involvement and that are only partially defined, it usually either gets the job done, or brings us far enough along with minimal effort that we can see what the real problem we’re trying to solve is.

AcaWiki is an exciting experiment in building community around academic research and knowledge. It’s also another in a line of interesting experiments with building applications in a different, organic manner. There’s some interesting work in the pipeline for AcaWiki, including data dumps, a shiny Vector-based skin, and improvements to the forms and templates used. The most interesting work, however, will be the work done by the community.


AcaWiki’s founder, Neeru Paharia, was one of CC’s earliest employees, and she turned to the CC technology team for help with this project.

date:2010-01-04 22:44:17
wordpress_id:1209
layout:post
slug:acawiki-on-building-emerging-applications
comments:
category:development
tags:acawiki, cc, mediawiki, platforms, semantic mediawiki, smw, wiki

gsc Bug Fixes

I announced gsc earlier this week because it worked for me. If you were brave and cloned the repository to try it out, you undoubtedly found that, well, it didn’t work for you. Thanks to Rob for reporting the problem with setup.py, as well as a few other bugs.

I’ve pushed an update to the repository on gitorious which includes fixes for the setup.py issue, support for some [likely] common Subversion configurations, and a test suite. In addition to the installation issue, Rob also reported that he wasn’t able to clone his svn repository with gsc. Some investigation led me to realize the following cases weren’t supported:

  • svn:externals specified with nested local paths (ie, “vendor/product”)
  • empty directories in the Subversion repository with nothing but svn:externals set on them

Both now clone correctly.

One open question is what (if anything) gsc should do when you run it against an already cloned repository. I’ve envisioned it purely as a bootstrapping tool but received an email stating that it didn’t work when run a second time, so obviously it should do something, even if that’s just failing with an error message.

date:2009-07-25 18:41:27
wordpress_id:1087
layout:post
slug:gsc-bug-fixes
comments:
category:development
tags:cc, git, git-svn, gsc, svn, svn:externals

git-svn and svn:externals

UPDATE I’ve pushed a few bug fixes; see this entry for details.

At Creative Commons we’re a dual-[D]VCS shop. Since we started self-hosting our repositories last year we’ve been using both Subversion and git. The rationale was pragmatic more than anything else: we have lots of code spread across many small projects and don’t have the time (or desire) to halt everything and cut over from one system to the other. This approach hasn’t been without its pain, but I think that overall it’s been a good one. When we create projects we tend to create them in git, and when we do major refactoring we move things over. It’s also given [STRIKEOUT:recalcitrant staff] me time to adjust my thinking to git. Adjustments like this usually involve lots of swearing, fuming and muttering.

As I’ve become more comfortable with git and its collection of support tools, I’ve found myself wanting to use git svn to work on projects that remain in Subversion. One issue I’ve run into is our reliance on svn:externals. We use externals extensively in our repository which has generally made it easy to share large chunks of code and data, and still be able to check out the complete dependencies for a project and get to work[1]_. More than once I’ve thought “oh, I’ll just clone that using git-svn so I can work on it on the plane[2]_,” only to realize that there are half a dozen externals I’d need to handle as well.

Last week I decided that tools like magit make git too useful not to use when I’m coding and that I needed to address the “externals issues”. I didn’t want to deal with a mass conversion, I just wanted to get the code from Subversion into the same layout in git. I found git-me-up which was close, but which baked in what I assume are Rails conventions that our projects don’t conform to. Something like this may already exist, but the result of my work is a little tool, **gsc** — “git subversion clone”.

gsc works by cloning a Subversion repository using git svn and then recursively looking for externals to fetch. If it finds an external, it does a shallow clone of the target (only fetching the most recent revision instead of the full history). The result is a copy of your project you can immediately start working on. Of course, it also inherits some of the constraints associated with svn:externals. If you want to work on code contained in an external (and push it back to the Subversion repository) you may need to check out the code manually[3]_. Of course, the beauty of DVCS is that there’s nothing stopping you from committing to the read-only clone locally and then pushing the changes via email to a reviewer.
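
The core of the approach fits in a few lines. What follows is a sketch of the idea rather than gsc itself: it assumes old-style single-line “localdir url” externals definitions and only looks at externals set on the repository root, simplifications the real tool doesn’t get to make:

  # Sketch of the gsc idea: clone with git-svn, then shallow-clone each
  # svn:external (latest revision only). Externals parsing is simplified.
  import os
  import subprocess

  def run(cmd):
      process = subprocess.Popen(cmd, stdout=subprocess.PIPE)
      output, _ = process.communicate()
      return output

  def git_svn_clone(svn_url, target, shallow=False):
      cmd = ["git", "svn", "clone"]
      if shallow:
          cmd += ["-r", "HEAD"]    # fetch only the most recent revision
      run(cmd + [svn_url, target])

  def clone_with_externals(svn_url, target):
      git_svn_clone(svn_url, target)
      # Old-style externals definitions are "localdir url", one per line.
      externals = run(["svn", "propget", "svn:externals", svn_url])
      for line in externals.splitlines():
          parts = line.split()
          if len(parts) == 2:
              local_dir, external_url = parts
              git_svn_clone(external_url,
                            os.path.join(target, local_dir), shallow=True)

  clone_with_externals("http://example.org/svn/project/trunk", "project")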

You can grab gsc from gitorious. There are also installation instructions and basic usage information in the README.

[1]It’s also led to some sub-optimal software release practices, but that’s probably a different post.
[2]Yes, I’ve actually encountered the “airplane” scenario; this either means DVCS advocates are prescient or I’ve been traveling way too much lately.
[3]This is true because some repositories spell read-only and read-write access differently; both CC and Zope do this, so the svn:externals definitions are often written using the read-only syntax to make sure everyone can make a complete checkout.
date:2009-07-21 09:47:17
wordpress_id:1073
layout:post
slug:git-svn-and-svnexternals
comments:
category:development
tags:cc, git, git-svn, gsc, svn, svn:externals

Open Access and Linked Data

I traveled to the midwest late last month and made a few stops, including PyCon and a brief visit with my parents. In between those two bookends I spoke at University of Michigan’s Open Access Week and had a few meetings with various parties. My topic was pretty broad — CC and Open Access — but I was [personally] pleased with how the talk came together. I’d like to re-create it for the purpose of creating a slidecast; maybe sometime soon.

In putting together the content I realized that while I had this gut level, assumed knowledge about what Open Access is, I hadn’t ever read a definition or really delved into it. When I read the Budapest Open Access Initiative, one part stood out to me.

By “open access” to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself.

Well of course it stood out to me, it’s a core descriptive sentence. But in particular, “availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, [or] pass them as data to software.” Interestingly this sentence ties right into the other meetings I was having that week which all seemed to come back to linked data (in particular RDFa). If you think about it, this sentence has implications that make OA materials perfect for linked data integration. It implies:

  • you have a stable, unique URL for the work
  • there isn’t a paywall or login requirement in front of the actual work
  • there isn’t any user agent discrimination — text in a Flash viewer need not apply (I’m looking at you, Scribd)
  • they’re in a format that’s useful as data; maybe [X]HTML?

So we have a growing corpus of information that’s ripe for markup with structured data. We’re doing a lot with embedded, structured [,linked] data right now at CC (things we need to do a better job talking about). I find it reassuring that the principles other efforts value mesh so well with what we’re doing.

date:2009-04-20 18:06:50
wordpress_id:1029
layout:post
slug:open-access-and-linked-data
comments:
category:cc, geek
tags:cc, linked data, oa, open access, rdfa

Unicode output from Zope 3

The Creative Commons license engine has gone through several iterations, the most recent being a Zope 3 / Grok application. This has actually been a great implementation for us[1]_, but since the day it was deployed there’s been a warning in `README.txt <http://code.creativecommons.org/svnroot/cc.engine/trunk/README.txt>`_:

If you get a UnicodeDecodeError from the cc.engine (you’ll see this if it’s running in the foreground) when you try to access the http://host:9080/license/ then it’s likely that the install of python you are using is set to use ASCII as its default output.  You can change this to UTF-8 by creating the file /usr/lib/python<version>/sitecustomize.py and adding these lines:

  import sys
  sys.setdefaultencoding("utf-8")

This always struck me as a bit inelegant — having to muck with something outside my application directory. After all, this belief that the application should be self-contained is the reason I use zc.buildout and share Jim’s belief in the evil of the system Python. Like a lot of inelegant things, though, it never rose quite to the level of annoyance needed to motivate me to do it right.

Today I was working on moving the license engine to a different server[2]_ and ran into this problem again. I decided to dig in and see if I could track it down. In fact I did track down the initial problem — I was comparing an encoded string to a Unicode string without specifying an explicit codec to use for the decode. Unfortunately once I fixed that I found it was turtles all the way down.

Turns out the default Zope 3 page template machinery uses `StringIO <http://www.python.org/doc/lib/module-StringIO.html>`_ to collect the output. StringIO uses, uh, strings — strings with the default system encoding. Reading the module documentation, it would appear that mixing String and Unicode input in your StringIO will cause this sort of issue.
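
The failure is easy to reproduce on its own. Under Python 2 with the default ASCII codec, mixing a unicode string and a non-ASCII byte string in a StringIO blows up as soon as the buffer is joined:

  # Python 2: mixing unicode and non-ASCII byte strings in StringIO.
  # getvalue() joins the pieces, implicitly decoding the byte string with
  # the default (ASCII) codec, and raises UnicodeDecodeError.
  from StringIO import StringIO

  buf = StringIO()
  buf.write(u"Paternit\u00e9")           # unicode from the page template
  buf.write("Attribution \xc2\xa9")      # UTF-8 encoded byte string
  try:
      print buf.getvalue()
  except UnicodeDecodeError, error:
      print "boom:", error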

Andres suggested marking my templates as UTF-8 XML using something like:

<?xml version="1.0" encoding="UTF-8" ?>

but even after doing this and fixing the resulting entity errors, there are still obviously some 8-bit strings leaking into the output. In conversations on IRC the question was then asked: “is there a reason you don’t want a reasonable system-wide encoding if your locale can support it?”

I guess not[3]_.

UPDATE Martijn has a tangentially related post which sheds some light on why Python does/should ship with ascii as the default codec. At least people smarter than me have problems with this sort of thing, too.


[1]Yes, I may be a bit biased — I wrote the Zope3/Grok implementation. Of course, I wrote the previous implementation, too, and I can say without a doubt it was… “sub-optimal”.
[2]We’re doing a lot of shuffling lately to complete a 32 to 64 bit conversion; see the CC Labs blog post for the harrowing details.
[3]So the warning remains.
date:2008-07-19 12:57:33
wordpress_id:563
layout:post
slug:unicode-output-from-zope-3
comments:
category:cc, development
tags:cc, development, license engine, python, zope

Technology Summit

Yesterday was the first ever Creative Commons Technology Summit, hosted at Google. My photos and better ones taken by Joi.

I drove the Nerd Van (myself, Asheesh and the interns) to Google.

I’m still recovering (and inflicting pain — CC board meeting today) and collecting feedback, but I think it was a really successful day. We learned some things we’ll do differently next time (yes, there will be a next time). Anyway, special recognition to the CC interns for live blogging the event and for generally doing anything asked of them. I feel like I should write more about the event, but I’m feeling pretty brain dead at the moment.

date:2008-06-19 14:13:13
wordpress_id:555
layout:post
slug:technology-summit
comments:
category:cc, General
tags:cc, nerd van, techsummit

Avoiding git PTSD

In an attempt to prevent additional git (or maybe just git-svn?) induced PTSD, Asheesh kindly created a git phrasebook. If you, too, are a Subversion deserter and want to figure out how the whole branching thing works in git, this may be useful to you.

Someday I’ll write up my thoughts on distributed version control and “convention versus configuration”, which seem to overlap in this deployment. But not today.

date:2008-06-13 14:56:25
wordpress_id:554
layout:post
slug:avoiding_git_ptsd
comments:
category:cc, development
tags:cc, git, ptsd