A Position on LSIDs

Clover

I recently took part in a very long discussion on LSIDs on the TDWG-TAG mailing list. This seems to have been a perpetual discussion over the past four years. On reflection I realised that over two posts I had produced a kind of personal position paper on LSIDs and that it would be worth capturing the text in a blog post so it didn’t disappear into the mailing list archives. People often ask about LSIDs and it would be useful to have somewhere to point them to. Note that this text comes from a technical discussion list and is not newbie friendly. It assumes you know about LSIDs as a technology.

One issue that repeatedly comes up with LSIDs is that they may be more permanent than ordinary HTTP URIs. They offer a sociological advantage in that they are separate from the ephemeral HTTP URLs that are used for everything on the web. The act of minting an LSID indicates that you intend to try to make it permanent, or at least never re-use it for another resource.

The barrier to everyone hosting LSIDs is that not everyone has access to DNS servers, so not everyone can host the relevant SRV records. There are other barriers to do with binding LSIDs to particular institutional domains that may change. One proposed solution is a central service that hosts the DNS records, and it is implied that this would help with persistence. But just hosting SRV records or supplying a redirect service does not actually provide any persistence at all for the data or metadata. A GUID that persistently resolves to a 500 error rather than a 404 Not Found is not helpful.
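To make the DNS dependency concrete, here is a minimal sketch of the first step of LSID resolution: deriving the SRV name a client would look up from the authority part of the identifier. The `_lsid._tcp.` label follows the LSID specification as I understand it and should be checked against the spec before relying on it. The function name `srv_query_name` is made up for this illustration, and no network lookup is performed.

```python
def srv_query_name(lsid: str) -> str:
    """Return the DNS SRV name a resolver would query for this LSID.

    Assumption: the service label is `_lsid._tcp.`, prepended to the
    authority component of the LSID.
    """
    parts = lsid.split(":")
    # Expected shape: urn:lsid:<authority>:<namespace>:<object>[:<revision>]
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {lsid!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {lsid!r}")
    authority = parts[2].lower()  # DNS names are case-insensitive
    if not authority:
        raise ValueError(f"LSID has no authority: {lsid!r}")
    return f"_lsid._tcp.{authority}"


print(srv_query_name("urn:lsid:biocol.org:col:15670"))
```

Whoever controls the DNS zone for the authority controls that record, which is why a publisher without access to their institution's DNS cannot host LSIDs on their own.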

I have said in the past “If persistence is important to you then keep your own copy.” This is how it has worked for hundreds of years in the library community. If the reason for having a centralised resolution mechanism is to support persistence then the centralised service should actually cache metadata (not data). I would imagine a scalable infrastructure would be quite simple to implement. The metadata could be stored in a Lucene index, a Hadoop cluster or something similar. It would only be a very large hash table and would only keep the latest version of the RDF.
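The "very large hash table" idea can be sketched in a few lines. This is an in-memory illustration only, assuming a plain dictionary keyed by LSID in place of Lucene or Hadoop. The class name `MetadataCache`, its methods and the RDF strings are all hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple


class MetadataCache:
    """Keeps only the most recently fetched RDF metadata for each LSID."""

    def __init__(self) -> None:
        # Maps LSID -> (time the metadata was fetched, RDF document as text)
        self._entries: Dict[str, Tuple[datetime, str]] = {}

    def put(self, lsid: str, rdf: str, fetched_at: datetime) -> None:
        """Store the metadata unless a newer copy is already held."""
        current = self._entries.get(lsid)
        if current is None or fetched_at >= current[0]:
            self._entries[lsid] = (fetched_at, rdf)

    def get(self, lsid: str) -> Optional[str]:
        """Return the latest cached RDF, or None if the LSID is unknown."""
        entry = self._entries.get(lsid)
        return entry[1] if entry is not None else None


cache = MetadataCache()
older = datetime(2009, 1, 1, tzinfo=timezone.utc)
newer = datetime(2009, 3, 1, tzinfo=timezone.utc)
cache.put("urn:lsid:biocol.org:col:15670", "<rdf:RDF>v2</rdf:RDF>", newer)
cache.put("urn:lsid:biocol.org:col:15670", "<rdf:RDF>v1</rdf:RDF>", older)  # ignored
print(cache.get("urn:lsid:biocol.org:col:15670"))
```

A real service would persist the table and answer from it when the original publisher's resolver is unreachable, which is the only point at which it adds persistence rather than just redirection.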

Without some kind of persistence mechanism the only advantage of LSIDs is that they look like they are supposed to be persistent. Unfortunately, because many people are using UUIDs as their object identifiers, LSIDs actually look like something you wouldn’t want to look at, let alone expose to a user! CoL (the Catalogue of Life) actually hide them because they look like this:

urn:lsid:catalogueoflife.org:taxon:d755ba3e-29c1-102b-9a4a-00304854f820:ac2009

No normal person is going to read this or type it in. I am afraid that when people started using UUIDs in LSIDs it blew the sociological argument for LSIDs out of the water for me. I had carefully designed BCI identifiers to be human readable and writable like this:

urn:lsid:biocol.org:col:15670

This would work as a footnote in a paper. The only way a UUID can work in that context is if it is hyperlinked, and to be hyperlinked it will have to be an HTTP URL underneath, which raises the question of why we are displaying a non-human-readable string as the human-readable part of a hyperlink! So we hide the LSID completely and have no sociological advantage.
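The difference between the two identifiers quoted above can be shown mechanically. The sketch below splits an LSID into its components and checks whether the object identifier parses as a UUID, using Python's standard `uuid` module. The names `Lsid`, `parse_lsid` and `looks_like_uuid` are hypothetical helpers for this illustration, not part of any LSID library.

```python
import uuid
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> Lsid:
    """Split urn:lsid:<authority>:<namespace>:<object>[:<revision>]."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)


def looks_like_uuid(object_id: str) -> bool:
    """True if the object identifier parses as a UUID."""
    try:
        uuid.UUID(object_id)
    except ValueError:
        return False
    return True


col = parse_lsid(
    "urn:lsid:catalogueoflife.org:taxon:d755ba3e-29c1-102b-9a4a-00304854f820:ac2009"
)
bci = parse_lsid("urn:lsid:biocol.org:col:15670")
print(col.object_id, looks_like_uuid(col.object_id))
print(bci.object_id, looks_like_uuid(bci.object_id))
```

Both strings are valid LSIDs with the same structure. Only the choice of object identifier decides whether a person could copy one out of a footnote.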

I understand why people used UUIDs. There are good technical reasons especially in distributed systems.

If LSIDs are a brand then they need a “unique selling proposition” and that implies something behind them beyond what can be had for free from other brands. You must use LSIDs because…. “We recommend them” is not an adequate answer.

Another point that worries me is that all discussion of LSIDs is about how to publish them not how to consume them. LSIDs are better than HTTP URIs for the client because… (I still can’t answer this question)

Currently the reason for me tagging my data with GUIDs has to be that it enables users to access and exploit my data in cost-effective ways they couldn’t before whilst crediting me with producing it, so that I can attract funding to my organisation to curate and collect more data.

The reason for clients using GUIDs is that it enables them to mix and match data in ways they couldn’t before so as to produce more, higher quality scientific publications and so attract funding and kudos.

These are the selling points for GUIDs. How well do LSIDs enable them?

If LSIDs are to succeed for the biodiversity community they need a service with long term support from large organisations and projects.

DOIs have a business model. LSIDs currently do not. Without a business model (read funding) we should stick to something that doesn’t have the implementation/adoption impediment of LSIDs and make the best of it (i.e. just have a usage policy for HTTP URIs).

The upcoming e-Biosphere conference (June) is billed as an opportunity for the heads of the bigger projects to get together and decide what will happen for the next 10 years. If we are going to be using LSIDs in the future those heads need to agree to fund a DOI-like infrastructure for LSIDs or come out and say they are not prepared to do it.

TDWG can act as a forum for these projects/organisations to coordinate their actions but doesn’t have its own resources.

I believe this is largely a political problem not a technical one. It needs to be resolved quickly.

If it is decided to develop a central service, it has to be one that adds real value to LSID usage (of the order of magnitude that CrossRef adds to DOIs). Without this we are better off sticking with today’s standard web technologies.

To use a fine English idiom, “you cut your cloth to suit your purse”. Currently we don’t have a shared purse so arguments about how to cut our cloth will never be resolved. X is undefined and I am not sure Y is that well defined. This is “chicken and egg” in that we need to come up with requirements to request a purse, but we really need to have some indication (from a political point of view) that someone will be willing to commit long term resources to the common good before we can present a menu of choices for what the money could be spent on.

1.0 developers/year (in perpetuity) gets us a server or two managed to support some kind of DNS based SRV hosting or redirect services or Handle system, with support for some library development and help desk stuff. (Note that I am not talking about servers or meetings or reports or technology. I am talking about a commitment to pay people to have it as their responsibility to maintain the system, both socially and technically, for the long term!!!)

Without the indication that someone (a consortium perhaps) is likely to formally commit to a minimum of this level of resources we are wasting our time talking about resolution mechanisms that are not DNS based, i.e. ‘just’ variations on the PURL model.

Even if we (as the technical community) have an offer of funding we have to be very careful that it is the best way forward. Will it actually provide something useful or will it just add another level of complexity with few if any benefits?

If we do not have the money to build a walled garden we have to graze on the common with everyone else.

If we do have the money to build the garden we have to look at the benefits of doing so not just waste our time arguing about the features of the garden.

Addendum

A very interesting post came to my attention (courtesy of Rod Page) about CrossRef going down. This is an interesting read when you consider the whole of DNS never goes down – although the odd domain may disappear for a brief while. After reading this post you may be of the opinion that using a central resolver is probably a bad idea and I would tend to agree with you.

3 thoughts on “A Position on LSIDs”

  1. Chuck Koscher

    The post regarding CrossRef going down conflates two issues, 1) CrossRef’s OpenURL resolver and 2) the DOI resolver.

    Many applications are using CrossRef’s resolver to translate from metadata to the link target (e.g. get the DOI). Alternatively they are using CrossRef’s OpenURL resolver to convert a DOI back into metadata for local processing. This activity is not the same as using the DOI to resolve to targets via the DOI resolver at dx.doi.org.

    An apples-to-apples comparison to DNS must be done to the DOI resolvers, not CrossRef’s OpenURL resolver. What’s the difference? Well, CrossRef’s OpenURL resolver is a singular instance while the DOI dx.doi.org resolver is not. Yes, dx.doi.org is not as ubiquitous as DNS, but its architecture and implementation are not those of just a central resolver.

  2. Roger Hyam

    @Chuck Koscher
    Thanks for your contribution but I am afraid I don’t follow. dx.doi.org is a domain name. Doing a “dig any” I get the output pasted in below, with a single A record pointing to a single IP address, i.e. a name resolving to a physical location (that isn’t a physical location really but just another name that resolves to a MAC address, but that’s splitting hairs). So a DOI is a name for which most people use a single resolver, which uses a DNS name that resolves to an IP address, which resolves to a MAC address of a networked machine (or virtual machine or router). It all seems like a singular instance to me.

    It actually has to be a singular instance to be authoritative. There has to be somewhere I go that I know I can trust to give me the normative information associated with the identifier – even if that place just forwards me to a list of alternative places that are trusted. Names have to be associated with locations or they are strings like any other.

    When you get married in church in the UK they pre-announce the marriage for a few Sundays and they use a phrase like “Betty Smith of this parish will marry Gordon Brown of the parish of St John over the hill”. By giving the parish they disambiguate the name and allow people to resolve it back to a place where they can find authoritative information about that person. The name and the resolution mechanism are inextricably linked, i.e. the URL/URN debate is really a false one. This has been going on for hundreds of years. It isn’t an IT issue but a social one and I guess it is covered in my other blog post on GUIDs http://www.hyam.net/blog/archives/346

    ; <<>> DiG 9.4.3-P1 <<>> any dx.doi.org
    ;; global options: printcmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6697
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 4, ADDITIONAL: 4

    ;; QUESTION SECTION:
    ;dx.doi.org. IN ANY

    ;; ANSWER SECTION:
    dx.doi.org. 242 IN A 38.100.138.149

    ;; AUTHORITY SECTION:
    dx.doi.org. 242 IN NS proxy2ns.doi.org.
    dx.doi.org. 242 IN NS proxy1ns.doi.org.
    dx.doi.org. 242 IN NS crossrefns1.doi.org.
    dx.doi.org. 242 IN NS proxy3ns.doi.org.

    ;; ADDITIONAL SECTION:
    proxy2ns.doi.org. 1403 IN A 38.100.138.162
    proxy1ns.doi.org. 1403 IN A 38.100.138.161
    crossrefns1.doi.org. 1403 IN A 208.254.38.90
    proxy3ns.doi.org. 1403 IN A 38.100.138.163

  3. Pingback: KIM-TWR » Blog Archive » Persistent identifiers – an overview
