As I’ll be having a long drive/ride down to Code4Lib this year, I decided I’d try out the CastingWords service for transcribing podcasts. I tend to prefer reading over listening. One podcast I’d like to catch up on is the OpenURL talk by Jon Udell and Dan Chudnov. Both Jon and Dan have agreed to let me post it publicly, so here it is for any others who prefer to read. Thanks guys!

I can attest that the CastingWords process is pretty nice. From a glance through, they seem to have done a good job (I haven’t read it all yet, and I haven’t listened to the original). But if they managed to get “COinS” right, then they must be doing something good. So without further ado, here are some downloadable versions:

Jon Udell:

Hi, this is Jon Udell. My guest for this week’s podcast is Dan Chudnov. He is a programmer who has worked on library information systems at MIT and at Yale, and he will shortly be starting a new job at the Library of Congress. This conversation is kind of a sequel to an earlier one with Tony Hammond of Nature Publishing Group. Tony talked about how digital object identifiers are being used in the world of academic publishing to create a stable, long-term framework for persistent linking and reliable citation.

There is a related technology called OpenURL that plays a role in academic publishing, but it is also something that people like Dan Chudnov are approaching from the perspective of libraries, which need citations to be not only durable and persistent but also able to redirect into the holdings of a particular library. It all sounds kind of esoteric, and in a lot of ways it is, but these questions of digital archiving and persistent linking will matter more and more, because we are all creating the library of the future.

Dan Chudnov: Hi Jon, how are you?

  Why don't we kind of start there, from the perspective of the various library-related linking and resource-identification standards that you have been involved in. So if you could sort of back up and state the problem first, in a way that everyone could relate to, that would probably be a good place to start.
  You want to be able to follow the references, essentially, and see which giants' shoulders they have been standing on, and verify that these people are building on the right combination of work when they are proposing a new idea or a new solution to a problem.
  
  Part of that is simply verifying that their statement of the problem is accurate and they know what they are talking about; and probably the more important, progressive aspect of that is being able to verify scientific results and claims. So you need to be able to follow people's references if you want to do science.
  Basically, what you want to be able to do if you are reading an article online, much like you would want to do if you had a physical copy of the paper, is follow the references. You want to find which journals, which issues, which volumes, which pages the cited references point to, because they are relevant to understanding the problem from the author's perspective.
  
  So, you literally need to go into the book stacks, the journal stacks, whatever, and find that volume, that issue, that page of that journal, so you can find the article, and this is exactly the problem as it translates to the Web that led to the development of the OpenURL standard seven or eight years ago. That problem is, you are using an online article, you are reading an article in an online journal, and at the bottom of the article there are a bunch of references.
  
  And the references aren't homogeneous in any way, in the sense that, well, they are probably formatted the same way visually, and there are obviously styles people use to keep that consistent--short form of the author's name, and a title of the article, and a short form of the journal title, and reference to volume, issue, year, page--but that is really hard to translate mechanically online for a couple of reasons. One is that metadata, first of all, is very sloppy. Author names are ambiguous, journal names can be ambiguous, journal names change every time journals split. Book titles are mistyped, misquoted, that sort of thing. There is a lot of slop in the system.
  If we go back six or seven years, that wasn't necessarily so at all. It is actually very recent that you could assume that most recent things would be online. But they are all published by different publishers, and the different publishers maybe publish the journals themselves on their own sites, or they contract with third parties to publish them in their own systems, or there are gateway systems that essentially aggregate resources from a variety of publishers into whole new interfaces.
  
  And when you are working at a big library like Yale University's libraries, not only are you going to have an unholy combination of all these subscription agents and aggregators and source publishers themselves subscribed to in your library, but the resources themselves are going to overlap.
  
  So, when you follow a reference link, what you need to know is whether you have that journal at your library, and not just whether you have that journal, but which system you can use to get to it, and within that which of those systems actually allow you the specificity of naming a volume, an issue, a page, an author name, a unique identifier, and can get you right to that article itself.
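  (To make that concrete, here is a minimal sketch of the kind of link such a reference can carry, in OpenURL 1.0's key/encoded-value form. The resolver hostname and the citation values are made up; each library points links like this at its own resolver.)

<pre><code># A minimal sketch of an OpenURL 1.0 "key/encoded-value" link for a
# journal article. The resolver base URL and citation values are
# hypothetical; each library aims links like this at its own resolver.
from urllib.parse import urlencode

RESOLVER = "https://resolver.example.edu/openurl"  # hypothetical

citation = {
    "url_ver": "Z39.88-2004",                       # OpenURL 1.0
    "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",  # "this is a journal article"
    "rft.jtitle": "Journal of Examples",
    "rft.atitle": "An Example Article",
    "rft.aulast": "Smith",
    "rft.volume": "10",
    "rft.issue": "9",
    "rft.spage": "1",
    "rft.date": "2004",
}

print(RESOLVER + "?" + urlencode(citation))
</code></pre>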
  
  What you really want to be able to do is bring people right to the full text of the article with one click. Unfortunately, for all of these reasons, it is just not always that easy.
  The identifier aspect of this is one big step toward mitigating some of these issues, because if you have a known, unique identifier, you can recognize it as such and have a little more mechanical strength in trying to resolve it to a direct link. A lot of what you talked about with Tony Hammond is in that space.

Dan: Yes. Other obvious kinds of identifiers are URLs in general, and PubMed identifiers if you are in a medical space, like I have been working in most recently. These are well-known, they are well understood, they are easily recognized, and they are usually fairly easily resolved.
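(Here is a rough sketch of that recognition step; the regular expressions are deliberately simplified, and real DOI and PMID matching needs more care than this:)

<pre><code># A rough sketch of spotting well-known identifiers in a reference
# string and turning them into resolvable links. The patterns are
# simplified for illustration.
import re

DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"']+")
PMID_RE = re.compile(r"\bPMID:?\s*(\d+)", re.IGNORECASE)

def links_for(reference):
    links = []
    for doi in DOI_RE.findall(reference):
        links.append("https://doi.org/" + doi)      # DOI proxy resolver
    for pmid in PMID_RE.findall(reference):
        links.append("https://pubmed.ncbi.nlm.nih.gov/" + pmid)
    return links

print(links_for("Smith J. An example. PMID: 12345678. doi:10.1000/example123"))
</code></pre>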

  Now, one of the jobs I had in a previous life was working on the DSpace project at MIT Libraries about six years ago, and even just the "how to generate IDs and resolve identifiers" problem alone, on that project, was a source of endless debate, and not even... What we came up with isn't even necessarily something that pleases all parties all the time.
  With the DSpace project, one of our explicit goals was to solve this, and in many ways to support open access publishing by making it easier for authors and institutions to publish their own works by, as much as anything else, not only taking a copy of the content and offering to manage that over time, but also simply offering a stable URL.
  
  If you talked to a number of authors at MIT five years ago, they would line up right away and tell you that, sure, they had been publishing their own articles, but their graduate assistant system administrator had moved on, and the system had been compromised, and now the links don't work, and they have been publishing papers and now no one could get to their articles anymore. "Yes, please take a copy of this for me and give me a stable URL that I can cite."
  So in this particular sense I think there is an importance to the role of needing some sort of distributed identification resolution stuff, because if all the libraries put their eggs in one basket, with one centralized service, and then anything happened to that one centralized service, that is a traditional single point of failure problem: we simply can't provide the service we need to our communities. So negotiating this spectrum of what you need to centralize, what you need to distribute, is an ongoing and fairly difficult process.
  And that is certainly a problem we can't ignore in libraries, and on that level we have hierarchies of authority and responsibility, from individual responsibility to organizational affiliation to networks of organizations. In a sense, the thing that is missing from this tapestry building out on the 'Net is something approximating the complexity of "meatspace"; in one sense, maybe only then is it fully built, and in another sense, maybe the tapestry doesn't work unless there are clear overlaps and interconnections between its parts.
  So, in fact, what looks like it is happening here seems very similar to what Tony was describing as the query process, going into the DOI and Handle stuff, where you want to get back a concise identifier, but you need to ask the system, "What would that be for these values?" Is that a fair statement?
  You are going to be able to go to an open-access copy of it. You are going to be able to go to an aggregated version of it in something like LexisNexis Academic Universe, that kind of resource, or a gateway subscription service like EBSCO. So you might get several links that are all different, that are all going to take you to versions of the article; and below that you will get links to an interlibrary loan service if you aren't so lucky.
  
  Now, in another project I have been working on, some other people in the community and I are trying to improve that step. Basically, not only are you presenting users, who typically just want to get right to the full text, with this confusing screen offering all these options, in the middle of which is the full text link (so just take me there, darn it); you are also moving users through three different systems at once.
  
  For every reference you want to follow, you need to come up out of the silo of content you started in, which might or might not be the same one you have to go back to a month later to get to the same article; step through the library's user interface; remember how to follow the links in there; and end up at the ultimate site, which might be, again, different from one month to the next depending on library subscriptions, which come and go and change over time as well. And that is really confusing.
  
  Every time one of my colleagues has reported to me, informally or more publicly, on usability testing results from OpenURL screens, the result is almost invariably, "I think I know from this that if I click the first thing I will get to the full text, and I remember seeing it, so I do that." But they don't read it, they don't remember the different options, because these screens are typically just not that good yet. They need a lot of improvement.
  
  So, a project I am working on now: I have been consulting on this effort called LibraryFind, a metasearch project coming from the Oregon State University Libraries, funded by the State of Oregon. It is a metasearch tool which allows you to search multiple resources at once. We have basically embedded an OpenURL resolver into the result stream, so that when you do a search on someone's name, like yours, or a topic or the name of a book, hopefully what you are going to get are not just links to OpenURL resolver screens--we control the whole system, right? We have got these front-end-to-back-end searches of licensed resources; and we have got the OpenURL resolver; and we have the knowledge base of which resources we are subscribed to.
  
  So, what we are trying to do is reduce it to two clicks. You send your search query to find what you want, and if we can find a link directly to full text we are just going to give you that link right away in the results. So one click to find, one click to get. And when we have done usability testing on this, it is amazing. There is no question what it does. People instinctively want to just click on the result and get the full text of the article when they are doing article searching.
  
  And when we are able to provide that, which we can in many cases, people don't even recognize what is happening, because the number of systems they have moved through has been reduced and the number of screens we were confusing them with has gone down.
  
  Of course, this doesn't always work. If you are doing book searching, you are trying to find a book, you are not necessarily going to find an e-book copy. So then we have to move you through data from the online catalogue, and figure out if it is online, and figure out where it is on the shelf and whether it is checked out. These problems just don't go away; there is always a possibility for great confusion. But we are trying to find ways, both through this project at Oregon State and a lot of other initiatives going on, to shorten these steps.
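  (Here is a much-simplified sketch of that two-click idea; the names and the knowledge-base format are purely illustrative, not LibraryFind's actual API:)

<pre><code># "One click to find, one click to get": the search layer consults its
# own knowledge base of subscriptions and, when it can, returns a direct
# full-text link instead of a link to a resolver menu. All names here
# are hypothetical.
KNOWLEDGE_BASE = {
    # ISSN -> article-level URL template on a subscribed platform
    "1234-5678": "https://fulltext.example.org/{volume}/{issue}/{spage}",
}

def result_link(citation):
    template = KNOWLEDGE_BASE.get(citation.get("issn"))
    if template:
        # we know which subscribed system has it, and that it supports
        # article-level linking, so skip the menu and go straight there
        return template.format(**citation)
    # otherwise, fall back to the resolver screen and its menu of options
    return "https://resolver.example.edu/openurl?issn=" + citation.get("issn", "")

print(result_link({"issn": "1234-5678", "volume": "10", "issue": "9", "spage": "1"}))
</code></pre>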
  But on the other end of it, I want to cite this thing, and I want to cite it in a way that is canonical and durable. So, at that point, what does that part of the process look like? Where do I end up with the OpenURL that I want to point other people to?
  And essentially what needs to happen at that moment is we need to restore the metadata; we need a metadata-restoring system. In some cases, certainly when you have something like a reference object that you are passing by reference with an identifier, you can do that fairly reliably. If you have some combination of fields like title and author and pages and publication date, it's a bit more of a crapshoot. But if at all possible you do want to be able to restore that full record, and in addition to providing a service which would lead people to full text, you do want to provide a canonical citation form.
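  (A rough sketch of that restoring step, assuming the key/encoded-value journal fields from the earlier sketch; real records are much messier than this:)

<pre><code># "Restore" the metadata carried in an OpenURL into a human-readable,
# citable form. Field names follow the KEV journal format; the example
# link is made up.
from urllib.parse import parse_qs, urlparse

def citation_from_openurl(link):
    query = {k: v[0] for k, v in parse_qs(urlparse(link).query).items()}
    def field(name):
        return query.get("rft." + name, "?")
    return "%s. %s. %s %s(%s):%s, %s." % (
        field("aulast"), field("atitle"), field("jtitle"),
        field("volume"), field("issue"), field("spage"), field("date"),
    )

print(citation_from_openurl(
    "https://resolver.example.edu/openurl?rft.aulast=Smith"
    "&rft.atitle=An+Example+Article&rft.jtitle=Journal+of+Examples"
    "&rft.volume=10&rft.issue=9&rft.spage=1&rft.date=2004"))
</code></pre>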
  We thought about what it would look like if you could just track a community's OpenURL resolutions, and just have a stream that would show up, and a feed you could subscribe to, of all the references people are going to over time. And in a lot of ways this is what social bookmarking is: you are reading over people's shoulders, you are seeing what your friends are reading.
  
  What I think hasn't happened yet is for those two areas of functionality to really be bridged. I don't think there is an OpenURL resolver on the market that not only provides full text links and a citation form but also automatically lets you save a resulting citation into a community bookmarking service.
  
  But thinking about those issues led us to this other generic problem, which is, "Well, I'm working at Yale, which is lucky for me; I'm lucky to be there and they have great resources. But if I'm writing a paper and want to publish it on the 'Net, and I want to blog about it, how do I put up OpenURL links that other people can follow?" This is a huge problem.
  
  We have come up with a temporary solution to that. Again, it is another community initiative, and this one is called COinS.
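  (COinS embeds an OpenURL ContextObject in ordinary HTML, in the title attribute of an empty span with class "Z3988", where resolver-aware tools can discover it. Here is a minimal sketch of generating one; the citation values are made up:)

<pre><code># A minimal sketch of generating a COinS: the ContextObject rides along
# in the title attribute of an empty span with class "Z3988".
from html import escape
from urllib.parse import urlencode

def coins_span(citation):
    return '<span class="Z3988" title="%s"></span>' % escape(urlencode(citation))

print(coins_span({
    "ctx_ver": "Z39.88-2004",
    "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
    "rft.jtitle": "Journal of Examples",
    "rft.aulast": "Smith",
}))
</code></pre>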
  There are also blog plug-ins. I think there is one for WordPress.
  The problem, then, goes back to, how do you give people this thing they have to install? I mean, what we are learning is if you make people install something, they won't do it unless it comes with... you know, if you get Microsoft Office Suite or Firefox. If it is not something that is right there and one click away, it is not something people are going to find on their own.
  And what we have done is, I have worked with them to produce a simple database of Greasemonkey scripts and bookmarklets for all the institutions in that registry, so that librarians can go and get one of these things and distribute it on their campus.
  If you're a blogger, you're going to want to rely on one of these things like the WordPress plug-in, or the Structured Blogging plug-ins for Movable Type or Drupal.
  You might have varying amounts of detail encoded in the queries, so some resolvers could go all the way down to the page, and others could only get you down to the article. So it seems you're set up to generate lots of different variants of that query. On the other hand, if that query is a URL that you want to function as an attention attractor, then you have to canonicalize it somehow. What is the canonicalized version of it?
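  (One naive way to do that, since the OpenURL standard itself doesn't define a canonical form, is to keep only the ContextObject fields, sort them, and re-encode them, so the same citation always yields the same key no matter which resolver produced the link. A hypothetical sketch:)

<pre><code># Canonicalize an OpenURL by keeping only ContextObject fields,
# sorting them, and re-encoding. Purely illustrative.
from urllib.parse import parse_qsl, urlencode, urlparse

def canonical_key(link):
    pairs = parse_qsl(urlparse(link).query)
    kept = [(k, v) for k, v in pairs if k.startswith(("rft", "ctx_ver", "url_ver"))]
    return urlencode(sorted(kept))

a = ("https://resolver-a.example.edu/openurl?rft.jtitle=Journal+of+Examples"
     "&rft.volume=10&url_ver=Z39.88-2004")
b = ("https://resolver-b.example.org/menu?url_ver=Z39.88-2004"
     "&rft.volume=10&rft.jtitle=Journal+of+Examples")
print(canonical_key(a) == canonical_key(b))  # True: same citation, same key
</code></pre>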
  So, when you think about an object as some platonic form, the surface area of the expressions of that object is pretty diverse. And the more popular something is, the more thorny shapes and little dropouts you're going to have over that topology, one of which is a metadata representation of it, which is essentially a surrogate for the in-hand copy you don't have when you're looking at it.
  
  Sort of by definition, there is loss between representing an object through metadata and the object itself, particularly when it's hard to say what the object itself really is, when we have PDF copies and HTML copies, AAC files and MP3 files and so on, and transcripts of them.
  There are a lot of norms to be developed. On the one hand, what you're describing seems likely: a gradient will begin to develop between things that have more of a permanent nature and things that don't, and systems will be refined and improved to support the things that have more permanence. But on the other hand, there's this great phrase that showed up on a Book Arts Press annual mailing a few years ago: "Mess is lore."
  But this is in some ways simply just what happens over time. To a large extent you really expect and hope that the best stuff gets kept reliably, but on the other hand we know that it doesn't.
  I guess I would just like to hope that there's a business model for someone in enabling somebody like me, who looks at blogging as really quite a professional activity, to make a similar investment. It's my intellectual life, and I take it very seriously; I constantly refer to things in the past that I have written, and just the other day I had to try to go back to something in the <a href="http://byte.com/">byte.com</a> archive from ten years ago, and already that's problematic.
  Certainly for academics, the role of the institution in many ways is to promote and somewhat protect the reputation of a researcher or a professor, by simply standing behind them and employing them, on the one hand, if they are tenured and whatnot. But archives also exist to take on the papers of important scholars, members of the public sphere, authors, artists, organizations, governments, and big companies.
  
  It is the ongoing, imperative mission of archives and other libraries to do this work. So, until there is an organization that chooses as its mission to preserve my bits as an individual, I can't assume that Amazon will, or the Internet Archive will, because I haven't made that arrangement with them.
  If I have any influence at Microsoft on product development, which I probably don't, that would be one thing I would love to nudge its online properties in the direction of doing, because I think it's something that a lot of people will ultimately want.
  Or, if you enter into an arrangement like that, what do you end up doing when you recognize that, literally, publishing and republishing don't necessarily have to be disconnected? Should there be a permanent <a href="http://wordpress.com/">wordpress.com</a>, or a long-term <a href="http://typepad.net/">typepad.net</a>, or a long-term <a href="http://blogger.com/">blogger.com</a>, where it's not free, you pay? You pay maybe a lot more than you normally would, and the presumption is that the extra payment is in exchange for a long-term agreement to keep your work alive.
  This was a lot of fun.