
UUIDs may be Dangerous

There is no doubt that Globally Unique Identifiers (GUIDs) are important and not just because I have been hammering on about them for the last few years. It is hard to imagine a world where biodiversity data can flow from one application to another without some kind of tagging of that data to show its provenance. The question we can’t get beyond is how to tag the data.

TDWG recommends the use of LSIDs, which are pretty much restricted to the life science community (the clue is in the name). Publishers see books, journals and papers as “assets” and therefore legitimately use DOIs. Neither of these technologies plays nicely with Semantic Web technologies à la W3C, and so some people prefer PURLs or even Plain Old URLs.

All these types of GUID are resolvable. Each can act as a form of address (by a variety of mechanisms). An implication of this is that each must be associated with some form of issuing/resolving authority. If you or your PC comes across one of these things it can look it up and get an authoritative answer as to what data it represents and who “owns” or curates that data.

Running some form of authority implies having access to some technical resources such as a web server and possibly a DNS server. In the case of DOIs it implies paying someone else to run the infrastructure for you. This is not easy for many data suppliers.

UUIDs to the rescue! UUIDs are great. Just get the computer to make up a string of random digits that is long enough that a clash with anyone else’s is vanishingly unlikely, so it can be treated as globally unique. Anyone should be able to tag anything with one. This solves the multiple routes problem, where the same data reaches a consumer by more than one path. If a consumer receives two pieces of data that bear the same UUID then they can assume that the two pieces of data are actually the same thing and one can be ignored.
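
To make that concrete, here is a minimal sketch in Python of minting a random UUID and discarding duplicates that arrive by different routes. The record layout (a dict with an `id` key) and the helper names `mint_id` and `dedupe` are assumptions made up for this illustration, not part of any standard.

```python
import uuid


def mint_id() -> str:
    """Return a random (version 4) UUID as a string."""
    return str(uuid.uuid4())


def dedupe(records):
    """Keep the first record seen for each UUID and drop later ones."""
    seen = {}
    for record in records:
        seen.setdefault(record["id"], record)
    return list(seen.values())


# The same specimen record reaches the consumer by two routes.
specimen_id = mint_id()
via_search = {"id": specimen_id, "name": "Quercus robur"}
via_feed = {"id": specimen_id, "name": "Quercus robur"}
other = {"id": mint_id(), "name": "Fagus sylvatica"}

unique = dedupe([via_search, via_feed, other])
print(len(unique))  # two distinct things remain
```

Note that `dedupe` simply trusts the identifier: it never looks at the content of the record it throws away.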

But what happens if the two pieces of data tagged with the same UUID differ? Perhaps one has an extra field or two. How do we know which is correct? Perhaps one was originally created as the result of a search and the other was harvested from an RSS feed, and the two mechanisms give different versions or views of the object identified by the UUID. Or perhaps one has been changed by some intermediary. How does the consumer of the information resolve the situation?
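
A consumer can at least detect this situation, even if it cannot resolve it. The sketch below, again assuming records are plain dicts with an `id` key (my invention for the example), fingerprints each record and reports the UUIDs that turn up with more than one distinct content.

```python
import hashlib
import json
from collections import defaultdict


def fingerprint(record: dict) -> str:
    """Hash a record's content independently of key order."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_conflicts(records):
    """Return the UUIDs that appear with more than one distinct content."""
    versions = defaultdict(set)
    for record in records:
        versions[record["id"]].add(fingerprint(record))
    return sorted(uid for uid, prints in versions.items() if len(prints) > 1)


# Hypothetical identifiers and records, for illustration only.
shared_id = "0b6f2a1e-7c7e-4d55-9a3f-2f1f7c0a9b11"
records = [
    {"id": shared_id, "name": "Quercus robur"},
    {"id": shared_id, "name": "Quercus robur", "rank": "species"},
    {"id": "5d3c9a70-1f0b-4e0a-8c55-6a1f3e2b7d42", "name": "Fagus sylvatica"},
]

conflicts = find_conflicts(records)
print(conflicts)
```

The code can say that the two versions disagree, but nothing in the UUID itself says which one to believe.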

One way would be to approach an indexing service that has recorded the occurrence of all the UUIDs. Google’s pretty good at this kind of thing but it would not be possible to differentiate on Google between pages that mention the UUID and those that contain the authoritative data associated with the UUID. To do this we would need a real register that maps each UUID to its data source. At this point we are close to recreating the Handle system that drives DOIs and the advantages of UUIDs are slipping away. You can’t just use UUIDs; you must use registered UUIDs!
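
As a rough sketch of what such a register has to do, the toy class below maps each UUID to the first source that claims it and refuses later claims from anyone else. The class name `Register`, the first-come policy and the example URLs are all assumptions for illustration; a real register would also need governance, persistence and authentication.

```python
import uuid


class Register:
    """Toy first-come register mapping a UUID to its claimed data source."""

    def __init__(self):
        self._sources = {}

    def register(self, uid: str, source_url: str) -> bool:
        """Claim uid for source_url; return False if another source holds it."""
        key = str(uuid.UUID(uid))  # normalise case; raises ValueError if malformed
        current = self._sources.setdefault(key, source_url)
        return current == source_url

    def resolve(self, uid: str):
        """Return the registered source for uid, or None if unknown."""
        return self._sources.get(str(uuid.UUID(uid)))


register = Register()
uid = "550e8400-e29b-41d4-a716-446655441234"

first = register.register(uid, "https://example.org/taxa/1")
second = register.register(uid.upper(), "https://example.net/copy/1")

print(first, second, register.resolve(uid))
```

The second claim is rejected only because this one register says so. Nothing stops a rival from running a second `Register` with the opposite answer.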

It gets worse. Suppose I have created the preferred description of a taxon. I tag it with a UUID to uniquely identify it. Other people refer to it using the UUID. All is well in the world until an awkward person disagrees with me. Perhaps I haven’t added new data to my description or corrected errors. So the awkward person puts up their own description of the taxon and tags it with the same UUID – they redefine the meaning of that UUID by cutting and pasting it to a new authoritative description. They want to say “this is the correct description”. Nobody owns the UUID so there is no arbiter. If there is a register of UUIDs then its maintainers could perhaps sort it out, but they couldn’t stop other people from setting up their own registers that favour different interpretations of particular UUIDs. This situation is very similar to the taxon name problem we have. No one owns a name, so who has the definitive description of the taxon that name should be used for? Nobody!

So UUIDs are dangerous. They appear to give a simple way to uniquely tag things but it is likely humans will mess this up. The fact that machines will not generate the same UUID twice is irrelevant if humans can cut and paste them with impunity.

5 Comments

  1. Markus Döring

    Good thoughts, Roger. UUIDs are useful for integrating your own disparate datasets without worrying about ID clashes, but GUIDs need ownership to be useful when they are shared. So back to PURLs or LSIDs – maybe paired with UUIDs…

  2. Kevin Richards

    Yes, that was my instinctive response to the suggestion to use UUIDs at TDWG. They are easy and may help, but they have distinctive disadvantages.

    From a recent discussion with Tim, and having just been to the sem web conference (http://iswc2008.semanticweb.org), I am also rather unsure about LSIDs in their pure form, in that they are not at all semantic web friendly, i.e. they cannot be resolved using default HTTP resolution. The idea of using the HTTP proxy version of the LSIDs is a good way to get around this, but this does have some drawbacks:
    – 1st you really need everyone to agree to use it everywhere, which is a bit difficult considering it is not at all part of the LSID standard, and we struggle to get “everyone” to do anything
    – 2nd, it seems very much like a hack – you might as well just use permanent HTTP URLs, i.e. the main advantage of LSIDs in this case is “encouraging a degree of thought before making URIs publicly available”. But we don’t really need to pick up the whole LSID overhead just to achieve this.

    So it seems to me like good old Plain Old URLs are just great! :-)
    Or at least the suggestion of REST-styled, permanent HTTP URLs as GUIDs?

  3. Dmitry Mozzherin

    There are 2 different and separate things — identifiers and locators.

    A UUID is an identifier, and a good one. It is not a locator at all. The issues you mention appear when people try to use an identifier without a proper locator mechanism. In the case of UUIDs people would need to agree on what locator to use.

    The great thing about UUIDs is that you do not mix two separate tasks together: they are ONLY identifiers, and nothing else. To be useful they have to be resolved by a locator, whatever that happens to be at the time or in the circumstances: LSID, PURL, a local database API… However, when a locator becomes outdated it is easy to migrate to the next great one.

    UUID5 also creates an opportunity to have a distributed ‘minting’ service for UUIDs where a piece of text needs a one-to-one unique identifier (which works well, for example, for name strings).
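
For readers unfamiliar with version 5 UUIDs, the following sketch shows the idea using Python's standard `uuid` module: the identifier is derived from a namespace UUID plus a text string, so anyone using the same namespace and string gets the same result. The namespace seed `names.example.org` and the helper `name_uuid` are hypothetical, chosen only for this example.

```python
import uuid

# Hypothetical namespace for name strings; every participant in a real
# scheme would have to agree on this one value in advance.
NAME_STRINGS_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "names.example.org")


def name_uuid(name_string: str) -> uuid.UUID:
    """Derive a version 5 UUID from a name string within the shared namespace."""
    return uuid.uuid5(NAME_STRINGS_NS, name_string)


# Two independent parties minting an identifier for the same string.
mine = name_uuid("Quercus robur L.")
yours = name_uuid("Quercus robur L.")

# A string that differs even slightly yields a different identifier.
without_author = name_uuid("Quercus robur")

print(mine == yours, mine == without_author)
```

Because the mapping depends on the exact characters, any such scheme would also need an agreed way of normalising the strings before hashing them.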

  4. Roger

    Hi Dmitry,

    Thanks for your comment.

    Not sure what you mean by ‘identify’ in this context. If I say 550e8400-e29b-41d4-a716-446655441234 stands for spicy pizza and you say it is chocolate gateau then nobody else in the world can know what that particular UUID stands for. It stands for at least two things and possibly more. They don’t know whether to order it for first course or second course.

    For me identity only makes sense within a particular context – with a particular issuing authority. An identity card I create myself doesn’t count for much but one issued by the state does. On the internet we have a perfectly good hierarchically managed system of authority for issuing identifiers called the DNS. It is the singularity of the authority system that makes the identifiers valuable. As soon as you say there are multiple ways to ‘get’ to an object then the identifier no longer identifies a single thing. It’s just namespaces really.

    Anyhow this is making me hungry!

    Roger

  5. Dmitry Mozzherin

    Hi Roger,

    I am with you on the notion that without a proper resolver/locator using UUIDs is dangerous (as with any other identifier).

    An identifier to me is a unique ‘handle’ to one ‘thing’. A global identifier is a globally unique handle to one thing. Your example describes a usage of a global identifier where two agents, by ignorance or malevolence, assign one handle to 2 different things. Such an identifier stops being a global identifier by definition. It still works fine as a local one. Situations like these can be resolved, as you say, by the appropriate resolver/locator.

    What makes one thing ‘one thing’ is another story and is probably included in the wisdom of the food resolver.

    UUIDs in my understanding allow someone to mint an identifier for their chocolate cake and then submit it to a locator, being pretty sure that there is no such UUID assigned to anything. If someone picks a UUID created for spicy pizza and submits it attached to a chocolate cake to a global food resolver, they won’t succeed. If they make their own UUID for spicy pizza, they can be quite sure their registration will go through without consulting the food resolver beforehand.

    UUID5s allow creation of exactly the same UUID in the same namespace for the same text string by anyone using any popular modern programming language, as long as they know the algorithm. This opens interesting opportunities and makes the pizza/cake situation impossible.

    For my projects I am nervous about using locators as identifiers. I remember a time when there were no URLs. There is no guarantee that URLs’ ability to resolve will persist in the future. Numbers, on the other hand, have been used throughout human existence and don’t carry the overhead/ambiguity/inconsistency/semantic load of URLs. A UUID is nothing but a 128-bit number so I like to use it for identifying things. I also think it is silly to stop there and not use adequate resolvers of some kind, as you point out in the post.

    Having said all that, if something still escapes my understanding, sorry for my limitations; I’d like to try to wrap my head around it.

    Dima
