Learning from the Web for Learning on the Web

Earlier this year Steven Stapleton from the University of Nottingham emailed me and asked if I’d like to be a keynote speaker at OpenNottingham. I accepted, and was very excited to be part of the day. More recently, an opportunity unexpectedly presented itself, and I decided that after seven years, it was time to move on from Creative Commons. As a result of the timing of my departure, I was unable to travel to the UK this past week. What follows are the remarks I delivered via Skype for the event.

Update (16 April 2011): Video of my presentation via Skype is up on YouTube.


This is actually my last presentation as CTO of Creative Commons, and as I was preparing for it this week, I spent some time thinking about what questions are on my mind about open education, and where I look for answers. CC is a little different from a lot of organizations working in this space: we develop legal and technical infrastructure as much as anything, and as such we wind up with visibility into many different domains. I hope this perspective can help us think about the future of open education, and what’s next.

Let me begin by stating what I believe to be true, and what I hope you agree with going into this. First, there has been an amazing explosion in activity surrounding education and learning on the web. In less than ten years we’ve seen words and acronyms like OER, OCW, metadata, and repository enter our collective consciousness, and seen myriad exciting projects launch to support open education.

Second, there is a feeling that the web, the internet, can help us deliver educational materials to audiences that are exponentially larger, with only incremental increases in cost. This broadening of delivery puts us in a position to reach and empower people in ways they have not been reached before: lifelong learners, remedial learners, and others who may be underserved by traditional models.

And third, we aren’t there yet. There are still challenging questions that we haven’t quite figured out how to answer, or that we’re just beginning to explore. For example: How do users discover open educational resources on the web? How do we determine what our impact and reach is? And what do we call success? I want to spend the next 10 to 15 minutes talking about some possible answers and things that I’ve been thinking about over the past year: what can a very selective history of the web tell us about where we are and what the future holds for online education?

The first two statements I made about learning on the web — that there has been a massive surge in interest and activity, and that there is a potential to reach vast audiences with only incremental additional cost — could very well have been made about the web itself in its early days. People were fascinated by the potential of this new technology, and rushed to stake their claim by publishing their own documents, sharing their knowledge. Now if I were you I’d be thinking, “Right, but I think we’re doing something a little more important than uploading scans of our favorite unicorn photos to GeoCities.” True enough, but the point is this: people were publishing, and they weren’t sure what came next.

As people began uploading and creating content and we saw this rise in the creative output of people on the web, there was an increasing need to capture this and organize it in an approachable form. You might have been able to draw a diagram of the early web on a sheet of A4 paper, but that rapidly became inadequate. The question of how you approach and understand this network of content became critical to answer. And the first answers were decidedly hands-on processes. Yahoo did not start as an index of text on the web: it started as a way to search a hand-curated set of resources, classified by human beings into categories and topics. DMoz, another directory of the web, took a similar approach: organizing resources into a hierarchy, bringing order where there was none. In both cases this was fundamentally a task of curation: what belongs in the list, and what does not. These were the web’s librarians, trying to provide an ontology that was flexible enough to handle the growing amount of content, and rigid enough that people could understand it.

Now there were definitely issues with this approach when applied to the early web, not the least of which was that they did a poor job of coping with different languages and cultures. Additionally, these directories didn’t leverage the fundamental relational nature of the web. Cross-referencing the list of resources categorized under different facets, such as language and subject, wasn’t an easy task.

As the web continued to grow, people began to realize that they could exploit the natural structure of the web — documents and links — to build a better index. Instead of searching the terms that a human used to label a resource, we could write software that followed links and created an index of the resources. So instead of hand-curating a list of documents, we could trust that things linking to a document probably had a similar topic, or described the topic of what they were linking to. And, eventually, that “good” resources would accumulate more links than those that did not.

It’s interesting to note that even as this transition from searching a curated list to searching a text index was taking place, the curated list still served an important purpose. Both the Yahoo index and DMoz were useful as the seeds for initial crawls. By starting with those pages, and following the links on them to other pages, software was able to begin building a graph of content on the web. Curation was an important activity on its own, but it also enabled bigger and better innovations that weren’t obvious at the beginning.

So we look at the evolution of the web and see the move from curation as the primary means of discovery to curation and links as the seeds for larger and more complex discovery. Learning on the web has done a lot of this basic curation, both on a de facto and an explicit basis. OCWC members publishing lists of open courseware, Connexions publishing modules and composite works, and OER Commons aggregating lists of resources from multiple sources are all acting as curators.

It’s this evolutionary question that we’re starting to face now: what is a link in online learning, and how do we compose larger works out of component pieces while giving credit and identifying what’s new or changed? Creative Commons licenses and our supporting technology provide a framework for marking what license a work is offered under, and how the creator wishes to be attributed. There seems to be widespread acknowledgement that linking as attribution is reasonable, but what about linking to create a larger work, or linking to cite a source work? Too often it is not obvious what components went into a work, or how to find them in a useful format for deriving your own work.

Last week I came across a website developing a free college curriculum for math, computer science, business, and liberal arts. At first I was really excited: the footer of the pages contained a link to the Creative Commons Attribution license, and a full curriculum for things like computer science under a very liberal license is the sort of thing that gets me excited. But as I dug deeper, I found that the curriculum was actually more like a reading list: links to PDFs and web pages with instructions to read specific sections, pages, or chapters. Now it’s really exciting that the web and educational publishing on the web have progressed to the point where someone can act as a curator and assemble such a reading list, where all the resources are accessible.

What’s frustrating — and illustrative of this question of what a link means in education and how we create larger, composite documents, I think — is the information that’s missing. Links to PDF files and other sites are a start, but they don’t capture the actual relationship that exists between the component pieces. By exploring what the graph of educational works looks like, we can enable applications and tools that help answer the question of discovery, like the search engines that grew out of exploring the graph of documents on the web at large.

So with a multitude of ways to discover content on the web, publishers began asking the question: who is finding me, how, and what are they actually looking for? Am I reaching the individuals I think I am, and how are they interacting with my site? In other words, what is success, and how do I measure it? The web as a whole answered this question through the development of tools like Google Analytics, Piwik, and others. For many publishers, success is defined as more visitors who spend more time on the site. I’m not actually sure that’s true for education on the web, or at least I don’t think it’s the entire story.

When I think about success for open education and education on the web, I think about both web metrics and education metrics. Web metrics look a lot like everyday web publishing metrics: visitors, time on site, bounce rate, etc. If you’re trying to drive visitors to a particular site or service, you might also measure conversions as part of your success metrics. Education metrics, however, are a lot tougher to work with on the open web. We may want to determine whether a particular resource helps people pass an assessment, but where does that assessment come from? And how do I even find alternative resources to compare results with?

As we continue to curate a pool of educational resources online, one of the facets that I’ve encountered frequently of late is how OER align to curricular standards or quality metrics. This is an example of curating for something other than the subject. That is, while early curation systems classified web pages based on their topic, there’s no reason they couldn’t classify based on what curricular standard they address instead. Embracing curation has the potential to enable new assessments and metrics that build on the nature of the web and are more broadly applicable. For example, if online education embraces a culture of linking and composition using links, it’s possible to imagine a measure of reach and impact based on links and referrers, instead of just visitors.

As we begin to explore these questions, there’s also the opportunity for this community to lead developments on the web instead of just following past trends. As this community of practice continues to develop, we can learn from the past and iterate to increase our impact and reach. While search engines initially just leveraged links to determine where a resource fits in the web, there is increasing recognition that structured data can help us develop tools that provide better results and user experiences. Web-scale search providers are beginning to leverage this information to improve search results with information like a restaurant’s star rating, or the price of a product you searched for. Creative Commons uses structured data to indicate that the link to our license isn’t just another link; it actually has some meaning. By annotating links with information about their meaning, we can enable tools which give weight to different relationships based on context.
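One of the simplest forms of such an annotation is a rel="license" attribute on the link itself. As an illustration (a minimal Python sketch using only the standard library; the sample HTML is invented), a tool can treat annotated links differently from plain ones:

  # Separate links that carry meaning (via a rel attribute, e.g. rel="license")
  # from plain, unannotated links. The sample HTML below is invented.
  from html.parser import HTMLParser

  class RelLinkFinder(HTMLParser):
      def __init__(self):
          super().__init__()
          self.annotated = []   # (rel, href) pairs for links with a stated meaning
          self.plain = []       # hrefs with no rel annotation

      def handle_starttag(self, tag, attrs):
          if tag != "a":
              return
          attrs = dict(attrs)
          href = attrs.get("href")
          if not href:
              return
          if attrs.get("rel"):
              self.annotated.append((attrs["rel"], href))
          else:
              self.plain.append(href)

  sample = """
  <p>This work is licensed under a
  <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC BY 3.0</a>
  license. See also <a href="http://example.org/related">a related page</a>.</p>
  """

  finder = RelLinkFinder()
  finder.feed(sample)
  print(finder.annotated)  # [('license', 'http://creativecommons.org/licenses/by/3.0/')]
  print(finder.plain)      # ['http://example.org/related']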

There is a great opportunity to develop rough consensus and working code around how structured data can be used to indicate the relationship between parts of a curriculum, alignment of resources to a curricular standard, or the sources a work uses. This is the next reasonable step for the use of structured data and curation on the web, and the open education community has a real opportunity to lead. As we publish resources online, we can develop a practice of linking, annotation, and curation.

This high level, incredibly vague, and very, very selective history of the web shows that there are many lessons education on the web can learn, and at least an equal number of areas where it can lead. There is excitement and passion, but we need to ask ourselves some hard questions as we move forward. What does success look like, how do we measure our impact and reach, and what can we learn from those who have gone before?

Thank you.

date:2011-04-08 15:45:10
wordpress_id:1895
layout:post
slug:learning-from-the-web-for-learning-on-the-web
comments:
category:cc
tags:openn11

CI at CC

I wrote about our roll-out of Hudson on the CC Labs blog. I wanted to note a few things about deploying that, primarily for my own reference. Hudson has some great documentation, but I found Joe Heck’s step-by-step instructions on using Hudson for Python projects particularly helpful. We’re using nose for most of our projects, and buildout creates a nosetests script wrapper that Hudson runs to generate pass/fail reports.

Setting up coverage is on the todo list, but it appears that our particular combination of libraries has at least one strange issue: when cc.license uses Jinja2 to load a template, coverage thinks it’s a Python source file (maybe it uses an import hook or something? haven’t looked) and tries to tokenize it when generating the xml report. Ka-boom. (This has apparently already been reported.)

Another item in the “maybe/someday” file is using Tox to run the tests using multiple versions of Python (example configuration for Tox + Hudson exists). I can see that this is a critical part of the process when releasing libraries for others to consume. We have slightly less surface area — all the servers run the same version of Python — but it’d be great to know exactly what our possible deployment parameters are.

Overall Hudson already feels like it’s adding to our sanity. I just received my copy of Continuous Delivery, so I think this is the start of something wonderful.

date:2010-08-20 10:37:43
wordpress_id:1734
layout:post
slug:ci-at-cc
comments:
category:cc, development
tags:cc, CI, coverage, Hudson, python, sanity

i18n HTML: Bring the Pain

I have to stay up a little later this evening than I’d planned, so as a result I’m finally going through all the tabs and browser windows I’ve had open on my personal laptop. I think some of these have been “open” for months (yes, there have been browser restarts, but they’re always there when the session restores). One that I’ve been meaning to blog about is Wil Clouser’s post on string substitution in .po files. It’s actually [at least] his second post on the subject, recanting his prior advice and coming around to what others told him previously: don’t use substitution strings in .po files.

I wasn’t aware of Wil’s previous advice, but had I read it when it was first published, I would have nodded my head vigorously; after all, that’s how we did it. Er, that’s how we, uh, do it. And we’re not really in a position to change that at the moment, although we’ve certainly looked pretty hard at the issue.

A bit of background: One of the core pieces of technology we’ve built at Creative Commons is the license chooser. It’s a relatively simple application, with a few wrinkles that make it interesting. It manages a lot of requests, a lot of languages, and has to spit out the right license (type, version, and jurisdiction) based on what the user provides. The really interesting thing it generates is some XHTML with RDFa that includes the license badge, name, and any additional information the user gives us; it’s this metadata that we use to generate the copy and paste attribution HTML on the deed. So what does this have to do with internationalization? The HTML is internationalized. And it contains substitutions. Yikes.

To follow the excellent example of AMO and Gnome, we’d start using English as our msgids, leaving behind our current symbolic keys. Unfortunately it’s not quite so easy. Every time we look at this issue (and for my first year as CTO we really looked; Asheesh can attest that we looked at it again and again) and think we’ve got it figured out, we realize there’s another corner case that doesn’t quite work.

The real issue with the HTML is the HTML: zope.i18n, our XSLT selectors, the ZPT parse tree: none of them really play all that well with HTML msgids. The obvious solution would be to get rid of the HTML in translation, and we’ve tried doing that, although we keep coming back to our current approach. I guess we’re always seduced by keeping all the substitution in one place, and traumatized by the time we tried assembling the sentences from smaller pieces.

So if we accept that we’re stuck with the symbolic identifiers, what do we do? Build tools, of course. This wasn’t actually an issue until we started using a “real” translation tool — Pootle, to be specific. Pootle is pretty powerful, but some of the features depend on having “English” msgids. Luckily it has no qualms about HTML in those msgids, it has decent VCS support, and we know how to write post-commit hooks.

To support Pootle and provide a better experience for our translators, we maintain two sets of PO files: the “CC style” symbolic msgid files, and the “normal” English msgid files. We keep a separate “master” PO file where the msgid is the “CC style” msgid, and the “translation” is the English msgid. It’s this file that we update when we need to make changes, and luckily using that format actually makes the extraction work the way it’s supposed to. Or close. And when a user commits their work from Pootle (to the “normal” PO file), a post-commit hook keeps the other version in sync.
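For the curious, the sync step can be sketched roughly like this with polib. The file names, and the assumption that the master file maps symbolic msgids to English “translations”, are mine for illustration; this is not our actual hook.

  # Rebuild the "CC style" PO file (symbolic msgids) from two inputs:
  #   master.po   -- msgid: symbolic key,   msgstr: English text
  #   english.po  -- msgid: English text,   msgstr: translator's work (from Pootle)
  # File names and layout are illustrative, not the real CC tooling.
  import polib

  def sync(master_path, english_path, output_path):
      master = polib.pofile(master_path)
      english = polib.pofile(english_path)
      # Map English source strings to the translations committed via Pootle.
      translations = {e.msgid: e.msgstr for e in english if e.msgstr}

      out = polib.POFile()
      out.metadata = dict(english.metadata)
      for entry in master:
          symbolic_key, english_text = entry.msgid, entry.msgstr
          out.append(polib.POEntry(
              msgid=symbolic_key,
              msgstr=translations.get(english_text, ""),
          ))
      out.save(output_path)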

While we’ve gotten a lot better at this and have learned to live with this system, it’s far from perfect. The biggest imperfection is its custom nature: I’m still the “expert”, so when things go wrong, I get called first. And when people want to work on the code, it takes some extra indoctrination before they’re productive. My goal is still to get to a single set of PO files, but for now, this is what we’ve got. Bring the pain.


For a while, at least. We’re working on a new version of the chooser driven by the license RDF. This will be better for re-use, but not really an improvement in this area.

This works great in English, but in languages where gender is more strongly expressed in the word forms, uh, not so much.

date:2010-03-01 23:21:20
wordpress_id:1501
layout:post
slug:i18n-html-bring-the-pain
comments:
category:cc, development
tags:cc, i18n, license engine, zope

Houston Connexions

I spent the first half of this week in Houston, Texas for the Connexions Consortium Meeting and Conference. What follows are my personal reflections.


Connexions (http://cnx.org) is an online repository of learning materials — open educational resources (OER). Unlike many other OER repositories, Connexions has a few characteristics that work together to expand its reach and utility.

While it was founded by (and continues to be supported by) Rice University, the content in Connexions is larger in scope than a single university, and isn’t tied to a particular course the way, say, MIT OCW is. Attendees of the conference came from as far away as the Netherlands and Vietnam.

In addition to acting as a repository, Connexions is an authoring platform: content is organized into modules, which can then be re-arranged, re-purposed, and re-assembled into larger collections and works. This enables people to take content from many sources and assemble it into a single work that suits their particular needs; that derivative is also available for further remixing. At the authors’ panel at the conference, we heard about how some authors have used this to update or customize a work for the class they were teaching. [UPDATE 5 Feb 2010: See the Creative Commons blog for information on this, and thoughts from the author “Dr. Chuck” (Charles Severance), who was on the authors’ panel.]

Finally, Connexions is an exemplar when it comes to licensing: if you want your material to be part of Connexions, the license is CC Attribution 3.0. While OER is enabled by CC licenses generally, this choice provides a lot of leverage to users. The remixing, re-organizing, and re-purposing enabled by the authoring platform is far simpler with no license compatibility to worry about. Certainly you can imagine a platform that handled some of the compatibility questions for you — and the idea of developing such a system based on linked data is intriguing to me personally — but the use of a single, extremely liberal license means that when it comes to being combined and re-purposed, all authors are equal, all content is equal.

This year was the second Connexions Conference, and from my perspective there were two themes: the consortium, and Rhaptos. The consortium is actually why I was in Houston. The Connexions Consortium is an, uh, consortium of organizations with a vested interest in Connexions: universities and colleges that are using it and companies that are using the content. And Creative Commons, who I was representing at the meeting. I’ve also been elected to the Technology Committee, a group of people representing consortium members who will provide guidance on technical issues to Connexions. During our meeting on Monday afternoon there was discussion of a variety of areas. One that we didn’t get to, but which is interesting to me, is how content in Rhaptos repositories can be made more discoverable, and how we can enable federated or aggregated search.

Rhaptos was the other prominent theme at the conference. Rhaptos is the code that runs Connexions: cnx.org without the specific look and feel/branding. While the source code behind Connexions has always been available, in the past year they’ve invested time and resources in making it easy (or at least straightforward) to deploy. Interestingly (to me), Rhaptos is a Plone (Zope 2) application, and the deployment process makes liberal use of buildout. It’s not clear to me exactly what the market is for Rhaptos. It’s definitely one of those “unsung” projects right now, with lots of potential and one really high profile user. I think it’ll be interesting to see how the Consortium and Rhaptos interact: right now all of the members are either using the flagship site to author content, or using content from the site to augment their commercial offerings. One signifier of Rhaptos adoption would be consortium members who are primarily users of the software, and interested in supporting its development.

Overall it was a great trip; I got to hear about interesting projects and see a lot of people I don’t get to see that often. I’m looking forward to seeing how both the consortium and Rhaptos develop over the next year.


If needed, and the evidence to date is that the staff is more than competent. I expect we’ll act more as a sounding board, at least initially.

This is an area that’s aligned with work we’re doing at CC right now, so it’s something I’ll be paying attention to.

date:2010-02-04 22:15:06
wordpress_id:1457
layout:post
slug:houston-connexions
comments:
category:cc
tags:cc, cnx, IAH, oer, travel

Thoughts on Deploying and Maintaining SMW Applications

In September or October of last year, I received an email from someone who had come across CC Teamspace and was wondering if there was a demo site available they could use to evaluate it. I told them, “No, but I can probably throw one up for you.” A month later I had to email them and say, “Sorry, but I haven’t found the time to do this, and I don’t see that changing.” This is clearly not the message you want to send to possible adopters of your software — “Sorry, even I can’t install it quickly.”

Now part of the issue was my own meta/perfectionism: I wanted to figure out a DVCS-driven upgrade and maintenance mechanism at the same time. But even when I faced the fact that I didn’t really need to solve both problems at the same time, I quickly became frustrated by the installation process. The XML file I needed to import seemed to contain extraneous pages, and things seemed to have changed between MediaWiki and/or extension versions since the export was created. I kept staring at cryptic errors, struggling to figure out if I had all the dependencies installed. This is not just a documentation problem.

If we think about the application life cycle, there are three stages a solution to this problem needs to address:

  1. Installation
  2. Customization
  3. Upgrade

If an extension is created using PHP, users can do all three (and make life considerably easier if they’re a little VCS savvy). But if we’re dealing with an “application” built using Semantic MediaWiki and other SMW extensions, it’s possible that there’s no PHP at all. If the application lives purely in the wiki, we’re left with XML export/import as the deployment mechanism. With this we get a frustrating release process, workable Customization support, and a sub-par Installation experience.

The basic problem is that we currently have two deployment mechanisms: full-fledged PHP extensions, and XML dumps. If you’re not writing PHP, you’re stuck with XML export-import, and that’s just not good enough.

A bit of history: when Steren created the initial release of CC Teamspace, he did so by exporting the pages and hand-tweaking the XML. That is not a straightforward, deterministic process, and not one we want to go through every time a bug-fix release is needed.

For users of the application, once the import (Installation) is complete (assuming it goes better than my experience did), Customization is fairly straightforward: you edit the pages. When an Upgrade comes along, though, you’re in something of a fix: how do you re-import the pages while retaining the changes you may have made? Until MediaWiki is backed by a DVCS with great merge handling, this is a question we’ll have to answer.

We brainstormed about these issues at the same time we were thinking about Actions. Our initial thoughts were about making the release and installation process easier: how does a developer indicate “these pages in my wiki make up my application, and here’s some metadata about it to make life easier”?

We brainstormed a solution with the following features:

  1. An “Application” namespace: just as Forms, Filters, and Templates have their own namespaces, an Application namespace would be used to define groups of pages that work together.
  2. Individual Application Pages, each one defining an Application in terms of Components. In our early thinking, a Component could be a Form, a Template, a Filter, or a Category; in the latter case, only the SMW-related aspects of the Category would be included in the Application (ie, not any pages in the Category, on the assumption that they contain instance-specific data).
  3. Application Metadata, such as the version, creator, license, etc.

A nice side effect of using a wiki page to collect this information is that we now have a URL we can refer to for Installation. The idea was that a Special page (ie, Special:Install, or Special:Applications) would allow the user to enter the URL of an Application to install. Magical hand waving would happen, the extension dependencies would be checked, and the necessary pages would be installed.
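A very rough sketch of what the magical hand waving might reduce to, using the standard MediaWiki API from Python. Everything here (the idea that the Application page yields a plain list of component titles, the skipped dependency and extension checks, the URLs and token handling) is an assumption for illustration, not a description of a real extension.

  # Sketch: copy each component page named by an Application from a source
  # wiki into the local wiki via the MediaWiki API. Dependency checking and
  # error handling are omitted.
  import requests

  def fetch_wikitext(api_url, title):
      """Get the current wikitext of a page through the MediaWiki API."""
      r = requests.get(api_url, params={
          "action": "query", "titles": title,
          "prop": "revisions", "rvprop": "content", "format": "json",
      })
      page = next(iter(r.json()["query"]["pages"].values()))
      return page["revisions"][0]["*"]

  def install(source_api, target_api, component_titles, edit_token):
      """Create (or overwrite) each component page on the target wiki."""
      session = requests.Session()
      for title in component_titles:
          text = fetch_wikitext(source_api, title)
          session.post(target_api, data={
              "action": "edit", "title": title, "text": text,
              "token": edit_token, "format": "json",
          })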

While we didn’t get too far with fleshing out the Upgrade scenario, I think that a good first step would be to simply show the edit diff if the page has changed since it was Installed, and let the user sort it out. It’s not perfect, but it’d be a start.
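That first step is nearly free with the standard library; here is a sketch, assuming we stashed a pristine copy of each page at install time (where that copy lives is itself an open question):

  # Show what changed on a page since the application version was installed,
  # and let the user merge by hand. This just produces the diff.
  import difflib

  def upgrade_diff(installed_text, current_text, page_title):
      """Return a unified diff between the installed and current page text."""
      return "\n".join(difflib.unified_diff(
          installed_text.splitlines(),
          current_text.splitlines(),
          fromfile="%s (as installed)" % page_title,
          tofile="%s (current)" % page_title,
          lineterm="",
      ))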

I’m not sure if this is exactly the right approach to take for packaging these applications. It does effectively invent a new packaging format, which I’m somewhat wary of. At the same time, I like that it seems to utilize the same technologies in use for building these applications; there’s a certain symmetry that seems reassuring. Maybe there are other, obvious solutions I haven’t thought of. If that’s the case, I hope to find them before I clear enough time from the schedule to start hacking on this idea.


date:2010-01-25 21:24:51
wordpress_id:1353
layout:post
slug:thoughts-on-deploying-and-maintaining-smw-applications
comments:
category:cc, development
tags:cc, mediawiki, semantic mediawiki, smw

“Actions” for SMW Applications (Hypothetically)

Talking about AcaWiki has me thinking some more about our experiences over the past couple years with Semantic MediaWiki, particularly about building “applications” with it. I suppose that something like AcaWiki could be considered an application of sorts — I certainly wrote about it as such earlier this week — but in this case I’m talking about applications as reusable, customizable pieces of software that do a little more than just CRUD data.

In 2008 we were using our internal wiki, Teamspace, for a variety of things: employee handbook, job descriptions, staff contact information, and grants. We decided we wanted to do a better job at tracking these grants, specifically the concrete tasks associated with each, things we had committed to do (and which potentially required some reporting). As we iterated on the design of a grant, task, and contact tracking system, we realized that a grant was basically another name for a project, and the Teamspace project tracking system was born.

As we began working with the system, it became obvious we needed to improve the user experience; requiring staff members to look at yet another place for information just wasn’t working. So Steren Giannini, one of our amazing interns, built Semantic Tasks.

Semantic Tasks is a MediaWiki extension, but it’s driven by semantic annotations on task pages. Semantic Tasks’ primary function is sending email reminders. One of the things I really like about Steren’s design is that it works with existing MediaWiki conventions: we annotate Tasks with the assigned (or cc’d) User page, and Semantic Tasks gets the email addresses from the User page.

There were two things we brainstormed but never developed in 2008. I think they’re both still areas of weakness that could be filled to make SMW even more useful as an application platform. The first is something we called Semantic Actions: actions you could take on a page that would change the information stored there.

Consider, for example, marking a task as completed. There are two things you’d like to do to “complete” a task: set the status to complete and record the date it was completed. The thought was that it’d be very convenient to have “close” available as a page action, one which would effect both changes at once without requiring the user to manually edit the page. Our curry-fueled brainstorm was that you could describe these changes using Semantic Mediawiki annotations[1]_. Turtles all the way down, so to speak.
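For a sense of what a “close” action boils down to, here is a sketch using mwparserfromhell against an invented {{Task}} template. The template and parameter names are made up, and the brainstormed design would describe these changes with SMW annotations rather than hard-coding them as this does.

  # Sketch of a "close task" action: set the status and record the completion
  # date in one step. The {{Task}} template and its parameters are invented.
  import datetime
  import mwparserfromhell

  def close_task(wikitext):
      """Return the page text with the task marked complete."""
      code = mwparserfromhell.parse(wikitext)
      for template in code.filter_templates(matches="Task"):
          template.add("status", "complete")
          template.add("completed", datetime.date.today().isoformat())
      return str(code)

  page = "{{Task|title=Ship the report|status=open|assigned to=User:Example}}"
  print(close_task(page))
  # {{Task|title=Ship the report|status=complete|assigned to=User:Example|completed=...}}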

The amount of explaining this idea takes, along with some distance, makes me uncertain that it’s the right approach. I do think that being able to easily write extensions that implement something more than CRUD is important to the story of SMW as a “real” application platform. One thing that makes me uncertain about this approach is the fear that we are effectively rebuilding Zope 2’s ZClasses, only crappier. ZClasses, for those unfamiliar, were a way to create classes and views through a web-based interface. A user with administrative rights could author an “application” through the web, getting lots of functionality for “free”. The problem was that once you exhausted ZClasses’ capabilities, you pretty much had to start from scratch when you switched to on-disk development. Hence Zope 2’s notorious “Z-shaped learning curve”. I think it’s clear to me now that building actions through the web will by necessity expose a limited feature set. The question is whether it’s enough, or if we should encourage people to write [Semantic] MediaWiki extensions that implement the features they need.

Maybe the right approach is simply providing really excellent documentation so that developers can easily retrieve the values of the SMW annotations on the pages they care about. You can imagine a skin that exists as a minor patch to Monobook or Vector, which uses a hook to retrieve the installed SMW “actions” for a page and displays them in a consistent manner.

Regardless of the approach taken, if SMW is going to be a platform, there has to be an extensibility story. That story already exists in some form; just look at the extensions already available. Whether the existing story is sufficient is something I’m interested in looking at further.

Next time: Thoughts on Installation and Deployment.


[1]The difference between Semantic Tasks and our hypothetical Semantic Actions is that the latter was concerned solely with making some change to the relevant wiki page.
date:2010-01-07 22:13:41
wordpress_id:1346
layout:post
slug:actions-for-smw-applications-hypothetically
comments:
category:cc, development
tags:cc, mediawiki, semantic mediawiki, smw, teamspace

Caching WSGI Applications to Disk

This morning I pushed the first release of wsgi_cache to the PyPI, laying the groundwork for increasing sanity in our deployment story at CC. wsgi_cache is disk caching middleware for WSGI applications. It’s written with our needs specifically in mind, but it may be useful to others, as well.

The core of Creative Commons’ technical responsibilities are the licenses: the metadata, the deeds, the legalcode, and the chooser. While the license deeds are mostly static and structured in a predictable way, there are some “dynamic” elements; we sometimes add more information to try and clarify the licenses, and volunteers are continuously updating the translations that let us present the deeds in dozens of languages. These are dynamic in a very gross sense: once generated, we can serve the same version of each deed to everyone. But there is an inherent need to generate the deeds dynamically at some point in the pipeline.

Our current toolset includes a script for [re-]generating all or some of the deeds. It does this by [ab]using the Zope test runner machinery to fire up the application and make lots of requests against it, saving the results in the proper directory structure. The result of this is then checked into Subversion for deployment on the web server. This works, but it has a few shortfalls and it’s a pretty blunt instrument. wsgi_cache, along with work Chris Webber is currently doing to make the license engine a better WSGI citizen, aims to streamline this process.

The idea behind wsgi_cache is that you create a disk cache for results, caching only the body of the response. We only cache the body for a simple reason — we want something else, something faster, like Apache or another web server, to serve the request when it’s a cache hit. We’ll use mod_rewrite to send the request to our WSGI application when the requested file doesn’t exist; otherwise it hits the on-disk version. And cache “invalidation” becomes as simple as rm (and as fine-grained as single resources).
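The middleware itself can be quite small. A stripped-down sketch of the idea follows; it is not the actual wsgi_cache code, and the constructor arguments, path handling, and lack of any status checking are simplifications.

  # Minimal disk-caching WSGI middleware: run the wrapped app, write the
  # response body to a file keyed by PATH_INFO, and let the front-end server
  # serve that file on the next request. Query strings, status codes, and
  # safe path joining are deliberately glossed over.
  import os

  class DiskCache(object):
      def __init__(self, app, cache_dir):
          self.app = app
          self.cache_dir = cache_dir

      def __call__(self, environ, start_response):
          path = environ.get("PATH_INFO", "/").lstrip("/") or "index"
          cache_path = os.path.join(self.cache_dir, path)

          # Run the wrapped application and collect the body.
          body = b"".join(self.app(environ, start_response))

          # Cache only the body; headers are intentionally not stored.
          directory = os.path.dirname(cache_path) or self.cache_dir
          if not os.path.isdir(directory):
              os.makedirs(directory)
          with open(cache_path, "wb") as f:
              f.write(body)

          return [body]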

There are some limitations which might make this a poor choice for other applications. Because you’re only caching the response body, it’s impossible to store other header information. This can be a problem if you’re serving up different content types which can’t be inferred from the path (note that we use filenames that look like deed.fr and deed.de, so we tell Apache to override the content type for everything; this works for our particular scenario). Additionally, this approach only makes sense if you have another front-end server that can serve up the cached version faster; I doubt that wsgi_cache will win any speed challenges for serving cached versions.

We’re not quite ready to roll it out yet, and I expect we’ll find some things that need to be tweaked, but a test suite with 100% coverage makes that a challenge I’m up for. If you’re interested in taking a look (and adapting it for your own use), you can find the code in Creative Commons’ git repository.

date:2010-01-05 23:37:29
wordpress_id:1309
layout:post
slug:caching-wsgi-applications-to-disk
comments:
category:cc, development
tags:cache, cc, middleware, python, wsgi, wsgi_cache

Open Access and Linked Data

I traveled to the midwest late last month and made a few stops, including PyCon and a brief visit with my parents. In between those two bookends I spoke at University of Michigan’s Open Access Week and had a few meetings with various parties. My topic was pretty broad — CC and Open Access — but I was [personally] pleased with how the talk came together. I’d like to re-create it for the purpose of creating a slidecast; maybe sometime soon.

In putting together the content I realized that while I had this gut level, assumed knowledge about what Open Access is, I hadn’t ever read a definition or really delved into it. When I read the Budapest Open Access Initiative, one part stood out to me.

By “open access” to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself.

Well of course it stood out to me, it’s a core descriptive sentence. But in particular, “availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, [or] pass them as data to software.” Interestingly this sentence ties right into the other meetings I was having that week which all seemed to come back to linked data (in particular RDFa). If you think about it, this sentence has implications that make OA materials perfect for linked data integration. It implies:

  • you have a stable, unique URL for the work
  • there isn’t a paywall or login requirement in front of the actual work
  • there isn’t any user agent discrimination — text in a Flash viewer need not apply (I’m looking at you, Scribd)
  • they’re in a format that’s useful as data; maybe [X]HTML?

So we have a growing corpus of information that’s ripe for markup with structured data. We’re doing a lot with embedded, structured [,linked] data right now at CC (things we need to do a better job talking about). I find it reassuring that the principles other efforts value mesh so well with what we’re doing.

date:2009-04-20 18:06:50
wordpress_id:1029
layout:post
slug:open-access-and-linked-data
comments:
category:cc, geek
tags:cc, linked data, oa, open access, rdfa

Unicode output from Zope 3

The Creative Commons license engine has gone through several iterations, the most recent being a Zope 3 / Grok application. This has actually been a great implementation for us[1]_, but since the day it was deployed there’s been a warning in `README.txt <http://code.creativecommons.org/svnroot/cc.engine/trunk/README.txt>`_:

If you get a UnicodeDecodeError from the cc.engine (you’ll see this if it’s running in the foreground) when you try to access http://host:9080/license/, then it’s likely that the install of Python you are using is set to use ASCII as its default output. You can change this to UTF-8 by creating the file /usr/lib/python<version>/sitecustomize.py and adding these lines:

  import sys
  sys.setdefaultencoding("utf-8")

This always struck me as a bit inelegant — having to muck with something outside my application directory. After all, this belief that the application should be self-contained is the reason I use zc.buildout and share Jim’s belief in the evil of the system Python. Like a lot of inelegant things, though, it never rose quite to the level of annoyance needed to motivate me to do it right.

Today I was working on moving the license engine to a different server[2]_ and ran into this problem again. I decided to dig in and see if I could track it down. In fact I did track down the initial problem — I was comparing an encoded byte string to a Unicode string without specifying an explicit codec to use for the decode. Unfortunately, once I fixed that, I found it was turtles all the way down.

Turns out the default Zope 3 page template machinery uses `StringIO <http://www.python.org/doc/lib/module-StringIO.html>`_ to collect the output. StringIO uses, uh, strings — strings with the default system encoding. Reading the module documentation, it would appear that mixing String and Unicode input in your StringIO will cause this sort of issue.
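The failure is easy to reproduce on Python 2, where all of this code lives; this snippet is just an illustration, not the actual cc.engine output path:

  # Python 2: mixing unicode and non-ASCII byte strings in a StringIO blows up
  # when getvalue() joins the pieces using the default (ascii) codec.
  from StringIO import StringIO

  buf = StringIO()
  buf.write(u"Creative Commons ")       # unicode
  buf.write("Paternit\xc3\xa9 2.0")     # UTF-8 encoded byte string
  print buf.getvalue()                  # UnicodeDecodeError: 'ascii' codec can't decode ...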

Andres suggested marking my templates as UTF-8 XML using something like:

<?xml version="1.0" encoding="UTF-8" ?>

but even after doing this and fixing the resulting entity errors, there are still obviously some 8-bit strings leaking into the output. In conversations on IRC the question was then asked: “is there a reason you don’t want a reasonable system-wide encoding if your locale can support it?”

I guess not[3]_.

UPDATE Martijn has a tangentially related post which sheds some light on why Python does/should ship with ascii as the default codec. At least people smarter than me have problems with this sort of thing, too.


[1]Yes, I may be a bit biased — I wrote the Zope3/Grok implementation. Of course, I wrote the previous implementation, too, and I can say without a doubt it was… “sub-optimal”.
[2]We’re doing a lot of shuffling lately to complete a 32 to 64 bit conversion; see the CC Labs blog post for the harrowing details.
[3]So the warning remains.
date:2008-07-19 12:57:33
wordpress_id:563
layout:post
slug:unicode-output-from-zope-3
comments:
category:cc, development
tags:cc, development, license engine, python, zope

Technology Summit

Yesterday was the first ever Creative Commons Technology Summit, hosted at Google. My photos are up, along with better ones taken by Joi.

I drove the Nerd Van (myself, Asheesh and the interns) to Google.

I’m still recovering (and inflicting pain — CC board meeting today) and collecting feedback, but I think it was a really successful day. We learned some things we’ll do differently next time (yes, there will be a next time). Anyway, special recognition to the CC interns for live blogging the event and for generally doing anything asked of them. I feel like I should write more about the event, but I’m feeling pretty brain dead at the moment.

date:2008-06-19 14:13:13
wordpress_id:555
layout:post
slug:technology-summit
comments:
category:cc, General
tags:cc, nerd van, techsummit