Interoperability in sharing species data

The class is currently reading “The Green Internet”, a chapter in Conservation in the Internet Age, edited by James N. Levitt (2002, Island Press). It discusses the problem posed by the plethora of biodiversity data collected by museums and others that remains isolated in separate institutions:

After 300 years of species inventory, the biodiversity science community lacked the means–an information architecture and a set of common practices–for the discovery, retrieval, and integration of data. From one collection to the next–often within the same institution–underlying specimen data are heterogeneous and incompatible. The data are recorded and stored in thousands of idiosyncratic, independently developed information systems and are dispersed worldwide across academia, government agencies, conservation organizations, research institutions, and private museums (p. 146).

The solution is an architecture called Species Analyst that creates a standard for storing and sharing information, as well as an interface and tools for analyzing data. Species Analyst was developed by a consortium of biodiversity researchers and computer scientists at the University of Kansas’s Biodiversity Research Center and the Natural History Museum.

A report from the Cover Pages covers some of the technical details:

The Species Analyst relies heavily upon the fusion of the ANSI/NISO Z39.50 standard for information retrieval (ISO 23950) and XML. Z39.50 provides an excellent framework for distributed query and retrieval of information both within and across information domains. However, its use is restrictive because of the somewhat obscure nature of its implementation. All of the tools used by the Species Analyst transform Z39.50 result sets into an XML format that is convenient to process further, either for viewing or data extraction. This fusion of Z39.50 and XML brings standards based information retrieval to the desktop by extending the capabilities of existing tools that users are familiar with such as Microsoft’s Internet Explorer and Excel and ESRI’s ArcView.
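To make the "XML that is convenient to process further" step concrete, here is a minimal Python sketch of the kind of data extraction the report describes: flattening an XML result set into rows that a spreadsheet could open. The element names, the sample records, and the coordinates are invented for illustration; they are not Species Analyst's actual schema or data, and the Z39.50 query itself is not shown.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical result set. Element names and values are made up for this
# sketch and do not reflect the real Species Analyst XML format.
SAMPLE_XML = """\
<records>
  <record>
    <institution>KU</institution>
    <scientificName>Percina tanasi</scientificName>
    <latitude>35.62</latitude>
    <longitude>-84.22</longitude>
  </record>
  <record>
    <institution>KU</institution>
    <scientificName>Etheostoma nigrum</scientificName>
  </record>
</records>
"""

# Columns we want in the flattened output (an assumption of this sketch).
FIELDS = ["institution", "scientificName", "latitude", "longitude"]


def records_to_rows(xml_text):
    """Turn each <record> element into a dict, using '' for missing fields."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.findall("record"):
        # findtext returns None when the child element is absent.
        rows.append({name: (record.findtext(name) or "").strip() for name in FIELDS})
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(records_to_rows(SAMPLE_XML)))
```

Note that the second sample record has no coordinates at all, so its latitude and longitude columns come out empty. The parsing is the easy part; the gaps in the underlying records are not something a format conversion can fix.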

The Inter-American Biodiversity Information Network (IABIN) has a slide show that demonstrates the structure and features of Species Analyst.

What’s interesting about the chapter is not its report on the technical challenges of broad system diffusion, which are considerable, but its discussion of the social barriers to interoperability. First, the chapter points out that “too many museums have not grasped the first principle of the information age–namely, that access to their authoritative biotic information for knowledge creation and decision making is as valuable as the information itself” (p. 155). What the authors do not acknowledge is that transforming data into a format compatible with the information age (e.g., using the Darwin Core standard) takes a lot of time and resources. Who in academia and elsewhere has the time to adapt their datasets to a particular standard, and what’s in it for them? This is not a cynical review of university practices but a pragmatic reflection on the paradigm in which academics operate. The focus is on doing what’s necessary to get published and therefore advance in one’s career. Fail to comply with the paradigm and you get fired or aren’t promoted. This paradigm fails to recognize academia’s prosaic need to diffuse its source data broadly after the articles are published. Not recognizing that need means not funding the work through grants and not acknowledging the effort when promotion time comes around.

Second, the authors indicate that many institutions have policies that discourage and even prohibit sharing of biodiversity data. The authors don’t mention that many of these policies protect the intellectual property of the individuals as well as the intellectual capital of the institutions. Institutions may be constrained by liability concerns over potential misuse of the data or by copyright laws over which they have no control. For example, Canada operates under Crown Copyright Law (i.e., all the benefits of government activity must financially benefit the Queen), which makes it nearly impossible for government agencies to share spatial data.

Third, the authors report that the successful integration and publication of all of the species collections will convince decision makers in institutions and government that sufficient amounts of data already have been collected to analyze biodiversity. On that reasoning, no further funds are necessary. This is, of course, the irony of developing a system such as Species Analyst, which has as its raison d’être the idea that if only we could integrate all the species information out there, we could conduct phenomenal analyses of the world’s biodiversity. Why collect any more data, or why not wait until the analyses are done before we collect more? Promoting the system for broad diffusion inevitably undercuts the case for further basic data collection. This speaks to the low regard in which basic research is held, on both the left (“Who needs basic research on an insignificant species such as snail darters when there’s so much poverty in the world?”) and the right (“Basic research on an insignificant species such as snail darters impedes economic development, which is more important to the well-being of individuals”). It also speaks to the myth that technology can automatically create knowledge out of data.

Last, the authors mention that lots of data is still neither in digital form nor tied to geography. For example, what do you do about the legions of archival records that exist in museums? Who’s paid to adapt collections data that can stretch back to the 1800s? Also, most data has no locational information (location is a prime method used to integrate data in Species Analyst) or has only vague spatial information (e.g., a species may be recorded along a river reach instead of at a specific point). I’ve discovered instances in which the geographic data collected by biologists is irrelevant to their studies. The lat/long point at which data is collected may really represent an entire region (even though the actual point has been GPSd) or an idealized landscape in which species are modeled. Datasets that contain abundant temporal and species diversity may be represented by a single data point.
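As a rough illustration of the location problem, the sketch below sorts records into "missing", "vague", and "usable" before any attempt to integrate them by place. It assumes records are plain dictionaries keyed by the Darwin Core terms decimalLatitude, decimalLongitude, and coordinateUncertaintyInMeters. The function name and the 1,000 m threshold are my own arbitrary choices for the example, not anything drawn from Species Analyst.

```python
def classify_location(record, max_uncertainty_m=1000.0):
    """Return 'missing', 'vague', or 'usable' for one record (a dict).

    The threshold is an assumption of this sketch; a real project would
    pick it based on the scale of the analysis.
    """
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    if lat in (None, "") or lon in (None, ""):
        return "missing"
    try:
        lat, lon = float(lat), float(lon)
    except (TypeError, ValueError):
        return "missing"  # unparseable coordinates are treated as absent
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return "missing"  # out-of-range coordinates are treated as absent

    uncertainty = record.get("coordinateUncertaintyInMeters")
    if uncertainty in (None, ""):
        return "vague"  # a point whose precision nobody recorded
    try:
        uncertainty = float(uncertainty)
    except (TypeError, ValueError):
        return "vague"
    return "usable" if uncertainty <= max_uncertainty_m else "vague"


if __name__ == "__main__":
    # Invented example records, for illustration only.
    samples = [
        {"scientificName": "Percina tanasi"},
        {"decimalLatitude": "35.6", "decimalLongitude": "-84.2"},
        {"decimalLatitude": "35.6", "decimalLongitude": "-84.2",
         "coordinateUncertaintyInMeters": "30"},
    ]
    for sample in samples:
        print(classify_location(sample))
```

A check like this only catches what the record admits about itself. It cannot detect the cases described above, where a precisely recorded GPS point actually stands in for a whole region or a modeled landscape.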

I don’t want to detract from the research achievement of Species Analyst. Many people propose architectures to increase interoperability for biodiversity data, but few engage with the technical difficulties of actual implementation. Still, interoperability can be limited more by social hurdles than by technical obstacles.

  1. Hannah says:

    I wonder if they’ll eventually have compulsory data sharing, so that instead of volunteering one’s computer to
    Species Analyst, you’ll be required to, and in return you can use the internet. You know how they are slapping
    extra taxes on technology to recover damage to the environment? This could also be a nonmonetary cost of using the technology.

  2. Liam says:

    It strikes me that this is somewhere a semi-intelligent agent (in software, hardware, or some combination thereof) that helps convert and input data from one medium or format to another would be very useful. It seems that no one is willing to finance the leg work that would be required to manually transform the data. Financing a tool that anyone could run to transform and submit local data to a central database seems like it would be an effective means of overcoming some of the difficulties.

    It would seem logical that when trying to convert large amounts of data into a new format, convenient and automated tools to convert the old formats to the new would be among the first things released. Even the physical cataloguing seems like a very repetitive, fairly basic task that would be an excellent candidate for automation.

    Of course, I’ll bet some huge fraction of the world’s electronic data is stored in Excel spreadsheets rather than in an actual database, so I suppose I am putting the cart before the horse.

  3. Bruce Miller says:

    Species Analyst is certainly interesting, but the ephemeral nature of web sites precluded linking to it, as the site was often down and not available for searching key museum data. When the system was first posted by KU it was great, but now either the links are dead or the functionality no longer works.
    Short-lived but useful concept.