Mar 16 2013
 

I have been on core staff at the Royal Botanic Garden Edinburgh for a year now and, as part of my role there, have established a WordPress blog for the institution. This will act as a combined blogging platform for everyone associated with the organisation as well as a more general tool for gathering information about points of interest within the gardens. The site is called Botanics Stories. If you are interested in biodiversity or horticulture I urge you to check it out and add its feed to your RSS reader.

From now on, any biodiversity-related blogging I do will be at Botanics Stories. You can follow my latest posts there or add the feed to your reader.

This hyam.net/blog will now focus more on my personal stuff – mainly Mindfulness, Photography and occasionally “truly pathetic verbiage”.

Nov 26 2012
 

Phytotaxa 73: 17–30 (2012)

Well, maybe it won’t rock science to its foundations, but it is nice to have our paper finally published after delays with the proofs.

Hyam, R.D., Drinkwater, R.E. & Harris, D.J. (2012). Stable citations for herbarium specimens on the internet: an illustration from a taxonomic revision of Duboscia (Malvaceae). Phytotaxa 73: 17–30.

A taxonomic revision of Duboscia (Malvaceae) with two species, D. macrocarpa and D. viridiflora, is presented and used to demonstrate a mechanism for linking from revisions to specimens held in herbaria using HTTP URIs. The implementation of this mechanism at the Royal Botanic Garden Edinburgh (E) is used as an example. Advantages of this approach include near universal support amongst web-connected devices. Hindrances to widespread adoption of such an approach are also discussed.

It is open access so you can download the full PDF and read it on the loo if you like.

Sep 14 2012
 

I am busy writing a summary of our information resources at RBGE and, as part of this, I am asking people to list the high-level ‘uses’ of their databases. I am just after a PowerPoint slide or two’s worth of information. Most of these databases are catalogues of physical objects – they are collections management systems varying from spreadsheets to complex multi-user systems. While waiting for the results, I found myself thinking up a generic list of functionality for a collections management system. This is my list:

  • Preserve – Key function is to make sure the collection persists. If it doesn’t persist it will not be available for future use.
    • Track loans in and out of the system so we don’t lose stuff when people borrow it.
    • Control destructive sampling so when we do lose stuff it is done with purpose.
    • Control deaccessioning of material so we only throw out the dead weight – not the good stuff.
    • Manage prophylactic preservation treatments such as cleaning and insect control.
    • Manage restoration tasks. When damaged objects are discovered arrange for them to be stabilized/fixed.
  • Publish – Key function is to make the contents of the collection available for research and enjoyment.
    • Search/Query interface to discover content.
      • Index of what the collection contains – who/what/when/where
      • Track physical location of the object within the collection so it can be retrieved.
    • Share with aggregators of collections information.
    • Control access – copyright/moratoriums/sensitive info.
    • Generate Sales leads – potentially raise funds by selling reproduction rights.
    • Profile/Brand/Marketing – wider awareness to support public funding of collection.
    • Provide electronic citation mechanism e.g. persistent URLs and DOIs to support researchers and publishers.
  • Enrich – Key function is to increase the value of the collections.
    • Accept donations/purchase of new materials
    • Extract/augment information not inherent in the objects e.g. geocoding of collecting location
    • Collect annotations to object by scholars and others.
    • Link to related external sources and encourage linking back to individual objects.

This is just a brain dump and may be a little too long – four PowerPoint slides in total. What do you think?

Jul 20 2012
 

I have been seeing DataCite.org mentioned quite a lot so I thought I’d take a look at what they were up to. They have an OAI-PMH provider so you can simply go to the List Records page and see how many records they have. Try it for yourself now. At the bottom of the page today it says that there are 654,748 records in their repository.

Then I read on the DataCite.org home page (my emphasis):

We think it is very important that the two largest DOI Registration Agencies work together in order to provide metadata services to DOI names.

This seemed an amazing claim as I had it in my head that CrossRef had 50+ million references. So if the second biggest only had just over half a million that implies there is only really one registration agency. I wonder who claims to be numbers three, four and five in the DOI Registration listings?

So I tweeted:

DataCite 654,748 records – I think this inflated. CrossRef have 50 million. CrossRef is about 100x bigger. Who is the 3rd largest?

And @epentz replied:

@rogerhyam #datacite has registered about 1.3 million DOIs, #CrossRef 54.6 million

Where did those numbers come from? Now I was curious so I went back to the DataCite.org OAI-PMH provider but this time I used a script to have a look. This is the deposition history by month for the entire registry:

The vast majority were created in December last year – a single provider I presume. Then there was another large batch in June this year.
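For anyone who wants to repeat this kind of count, here is a minimal sketch of the approach in Python (my own scripts, linked at the end of this post, are PHP and may differ in detail). It walks an OAI-PMH `ListIdentifiers` response, reads each header's `datestamp`, and tallies records per month, following `resumptionToken`s from page to page. The names `count_by_month`, `harvest_months` and `SAMPLE_PAGE` are made up for illustration, and the sample XML is toy data, not DataCite's. One caveat: an OAI-PMH datestamp records when a record was created, modified or deleted, so it is only an approximation of the deposition date.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import quote

# Namespace used by every element in an OAI-PMH 2.0 response.
OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Toy response for illustration only; not real DataCite data.
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example:1</identifier><datestamp>2011-12-01T10:00:00Z</datestamp></header>
    <header><identifier>oai:example:2</identifier><datestamp>2011-12-15T09:30:00Z</datestamp></header>
    <header><identifier>oai:example:3</identifier><datestamp>2012-06-03T14:45:00Z</datestamp></header>
  </ListIdentifiers>
</OAI-PMH>"""


def count_by_month(oai_xml):
    """Count record headers per 'YYYY-MM' in one page of OAI-PMH XML."""
    root = ET.fromstring(oai_xml)
    months = Counter()
    for header in root.iter(OAI + "header"):
        datestamp = header.findtext(OAI + "datestamp")
        if datestamp:
            months[datestamp[:7]] += 1  # keep only the YYYY-MM prefix
    return months


def resumption_token(oai_xml):
    """Return the token for the next page, or None on the last page."""
    root = ET.fromstring(oai_xml)
    token = root.findtext(".//" + OAI + "resumptionToken")
    return token.strip() if token and token.strip() else None


def harvest_months(fetch, base_url):
    """Tally months across all pages.

    `fetch` is any callable that takes a URL and returns the response
    body as text; `base_url` is the provider's OAI-PMH endpoint.
    """
    totals = Counter()
    url = base_url + "?verb=ListIdentifiers&metadataPrefix=oai_dc"
    while url:
        page = fetch(url)
        totals += count_by_month(page)
        token = resumption_token(page)
        # A resumptionToken must be sent on its own, without metadataPrefix.
        url = (base_url + "?verb=ListIdentifiers&resumptionToken=" + quote(token)
               if token else None)
    return totals


if __name__ == "__main__":
    print(count_by_month(SAMPLE_PAGE))
```

To run it against a live provider you would pass in a real fetch function, for example `lambda url: urllib.request.urlopen(url).read().decode("utf-8")`, together with the provider's endpoint URL.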

Next I looked at the Sets in the registry. DataCite seem to create a set and then subsets for each organisation they create DOIs for. Here is the pie chart showing the number of records per organisation:

The vast majority are from TIB (German National Library of Science and Technology) and CDL (California Digital Library), which together have contributed about 90%. There are 15 organisations in total; 11 of those have contributed fewer than 2,500 records each.
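The per-organisation breakdown can be sketched in the same way. This is a rough Python illustration, not my actual script: it assumes each record header carries one or more `setSpec` values and that a subset is named by appending to its parent set's name after a separator (I use `.` here as an assumption; the OAI-PMH specification itself uses `:` for set hierarchies, so check what the provider really returns). The function names and the toy records are hypothetical.

```python
from collections import Counter


def top_level_set(set_spec, separator="."):
    """Return the organisation part of a setSpec, e.g. 'TIB.X' -> 'TIB'."""
    return set_spec.split(separator, 1)[0]


def records_per_organisation(records, separator="."):
    """Count records per top-level set.

    `records` is an iterable of lists of setSpec strings, one list per record.
    """
    counts = Counter()
    for set_specs in records:
        # A record listed under both a set and its subset counts once.
        for org in {top_level_set(s, separator) for s in set_specs}:
            counts[org] += 1
    return counts


def shares(counts):
    """Convert counts to fractions of the total, ready for a pie chart."""
    total = sum(counts.values())
    return {org: n / total for org, n in counts.items()} if total else {}


# Toy data for illustration only; these are not DataCite's figures.
TOY_RECORDS = [
    ["TIB", "TIB.EXAMPLE"],
    ["TIB", "TIB.OTHER"],
    ["CDL", "CDL.EXAMPLE"],
    ["OTHER"],
]

if __name__ == "__main__":
    counts = records_per_organisation(TOY_RECORDS)
    for org, share in sorted(shares(counts).items(), key=lambda kv: -kv[1]):
        print(f"{org}: {counts[org]} records ({share:.0%})")
```

Collapsing subsets into their parent is a design choice: it answers "which organisation?" rather than "which data centre?", which is what the pie chart is about.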

I have no intention of knocking the project, but from looking at the DataCite.org website and reading their promotional material (which talks of providing access to ‘data sets’) you do not get the impression that they are, in fact, mainly libraries providing citations to publications rather than data. The data citations I have seen do not look like they will give permanent access to data. Look at this example, doi:10.5520/SAGECITE-1, which resolves to a website on a free Google service. How long is that going to last? If you read the Google terms and conditions, they give no warranty and may remove the service when they like. The site is called sagecitedemorepository. The clue is in the name. The data is hardly longer than the DOI that is used to cite it, so why does it have a DOI? I’m confused. What value have DataCite.org added to this process, other than to indicate that something will persist that clearly was never intended to? Where is the quality control? What does it mean for a piece of data to have a DOI?

I will watch with interest to see how this develops and whether it makes the leap to linking to significant quantities of raw scientific data that is being properly curated.

Here is the code and data from my analysis – you can run it again as command line PHP scripts if you like: datacite

Your comments and corrections are always welcome!