Beyond ‘Download Science’: Or How to Not Drown in Data at the AGU

Topic-specific conferences are no longer focused solely on research; data-sharing initiatives are now a major part of the discussions. With Mendeley Data and other data-management initiatives, we are always looking to learn more, directly from the researchers. Anita de Waard, Vice President of Data Research and Collaboration at Elsevier, shares how she learned that data repositories often struggle with similar issues, and how collaboration can help address them.

I attended my first AGU meeting in New Orleans last fall, intending to learn more about informatics, metadata, and research data in the Earth and planetary sciences. For a newbie, this meeting is an intimidating affair: over 25,000 scientists gather to discuss topics ranging from plate tectonics to probabilistic flood mapping, and from solar prominences to paleoclimatology.

Informatics and metadata played a huge role in the program. The Earth and Space Science Informatics Section alone accounted for a staggering 1,200 talks and posters. And that by no means covers the full extent of sessions on informatics and metadata: the Hydrology Section, for instance, had not one but two sessions (with 10–20 papers each) on ‘Advances in Data Integration, Inverse Methods, and Data Valuation across a Range of Scales in Hydrogeophysics’, and the Public Affairs Section hosted ‘Care and Feeding of Physical Samples and Their Metadata’.

It is easy to feel overwhelmed. Yet once I stopped watching the endless streams of people moving up and down escalators to more and more rooms full of posters and talks (and once I finally retrieved my Flat White after the seemingly endless fleece-clad line at the Starbucks!), I learned that if you just jump in the stream and go with the flow, the AGU is really a great ride.

I was involved in three events: a session on data discovery, one on unique identifiers, and the International Data Rescue Award, which Elsevier helped organize together with IEDA, the Interdisciplinary Earth Data Alliance (http://www.iedadata.org/).

Data Repository issues; Or, how to come up with a means of survival

In the data discovery session, we had eight papers on searching for earth science data. Siri Jodha Khalsa and I are co-chairing a nascent Research Data Alliance group on this same topic, which is quite relevant to us as we develop our DataSearch platform. It struck me how comfortable with, and aware of, the various aspects of data retrieval the earth science community seems to be, compared to repositories in other domains, which are just starting to talk about this.

The data repositories that presented were struggling with similar issues: how to scale to the masses of content that need to be uploaded; how to build tools that provide good relevancy ranking over heterogeneous and often distributed data collections; how to keep track of usage, provide useful recommendations, and offer personalisation services when most search engines do not ask for login details; and how to do all of this with a barebones staff, in an organisation that is more often than not asked to come up with the means for its own survival.

The end of download science

At the poster session that evening, it was exciting to see the multitude of work being done on data discoverability. One of the most interesting concepts for me was a poster by Viktor Pankratius from MIT, who developed a ‘computer-aided discovery system’ for detecting patterns, generating hypotheses, and even driving further data collection, from a set of tools running in the cloud.

Pankratius predicted the ‘end of download science’: whereas in the past (earth) scientists did most of their data-intensive work by downloading datasets from various locations, writing tools to parse, analyze, and combine them, and publishing (only) their outcomes, Pankratius and many others are now developing analysis tools that are native to the cloud and are shared and made available together with the datasets for reuse.

Persistent Identifiers

On Thursday, I spoke at a session entitled “Persistent Identification, Publication, and Trustworthy Management of Research Resources”: two separate but related topics. The first three talks focused on trustworthiness. Persistent identifiers may seem a boring topic, but they just got their own very groovy conference, PIDapalooza (leave it to Geoff Bilder to groovify even the nerdiest of topics!).

One of the papers in that session (https://agu.confex.com/agu/fm16/meetingapp.cgi/Paper/173684) discussed a new RDA initiative, Scholix, which uses DOIs for papers and datasets to enable a fully open linked-data repository connecting researchers with their publications and published datasets. Scholix represents a very productive collaboration, spearheaded by the RDA Data Publishing Group and involving many parties: publishers (including Thomson Reuters, IEEE, Europe PMC and Elsevier), data centres (the Australian National Data Service, IEDA, ICPSR, CCDC, 3TU DataCenter, Pangaea and others), and aggregators and integrators (including Crossref, DataCite and OpenAIRE).
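To make the idea concrete, here is a minimal sketch of what a Scholix-style link record looks like conceptually: each link connects a source object (say, a paper identified by a DOI) to a target object (say, a dataset) with a relationship type. The DOIs, field names, and relationship label below are invented for illustration and simplify the actual Scholix information package.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScholixLink:
    source_doi: str    # e.g. the article's DOI (hypothetical)
    target_doi: str    # e.g. the DOI of a dataset it cites (hypothetical)
    relationship: str  # e.g. "References" (illustrative label)

def datasets_for_paper(links, paper_doi):
    """Return all dataset DOIs linked from a given paper DOI."""
    return [l.target_doi for l in links
            if l.source_doi == paper_doi and l.relationship == "References"]

# Two made-up links from one article to two datasets.
links = [
    ScholixLink("10.1000/example-article", "10.5000/example-dataset-a", "References"),
    ScholixLink("10.1000/example-article", "10.5000/example-dataset-b", "References"),
]

print(datasets_for_paper(links, "10.1000/example-article"))
# → ['10.5000/example-dataset-a', '10.5000/example-dataset-b']
```

The point of an open exchange like this is that any party — a publisher, a data centre, or an aggregator — can contribute and traverse the same paper-to-dataset graph.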

Persistent identifiers combined with semantic technologies enable a whole that is much more than the sum of its parts, and surely point the way forward in science publishing: they allow, for instance, Mendeley Data users to directly address and compare different versions of a dataset (for some other examples, see my slides here).
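As an illustration of how versioned identifiers make this possible: many repositories distinguish dataset versions by appending a version number to the DOI suffix. The DOI and the parsing scheme below are an assumption for illustration, not a guaranteed repository convention.

```python
# Hypothetical sketch: a versioned dataset DOI such as "10.17632/abc123.2"
# is split into a base identifier and a version number, so two versions of
# the same dataset can be addressed and compared directly.

def split_versioned_doi(doi: str):
    """Split a DOI into (base, version); version is None if absent."""
    prefix, _, suffix = doi.partition("/")
    base, dot, version = suffix.rpartition(".")
    if dot and version.isdigit():
        return f"{prefix}/{base}", int(version)
    return doi, None

print(split_versioned_doi("10.17632/abc123.2"))  # → ('10.17632/abc123', 2)
print(split_versioned_doi("10.1000/plain-doi"))  # → ('10.1000/plain-doi', None)
```

Because each version keeps its own resolvable identifier, a citation can pin the exact version used in an analysis while still pointing back to the dataset as a whole.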

Celebrating the restoration of lost datasets

A further highlight was the third International Data Rescue Award, which is intended to reward and celebrate the usually thankless task of restoring datasets that would otherwise disappear or be unavailable. The award brings together, and aims to support the creation of, a community of very diverse researchers who all share a passion for restoring data.

This year’s winners were from the University of Colorado Boulder. Over a period of more than fifteen years, they rescued and made accessible the Roger G. Barry Archive at the National Snow and Ice Data Center: a vast repository of materials including over 20,000 prints, over 100,000 images on microfilm, 1,400 glass plates, 1,600 slides, over 100 cubic feet of manuscript material, and over 8,000 ice charts. The material is incredibly diverse, dating from 1850 to the present day and including, for instance, hand-written 19th-century exploration diaries and observational data. Projects such as these remind us of the incredible importance of data, especially in times like these, when so much is changing so quickly. The pictures in the Glacier Photograph Collection show the incredible extent of glacial retreat between, for instance, 1941 and today: in some cases an entire glacier has simply vanished, a grim reminder of the extent to which global warming is affecting our world.

In short, there is a lot out there for all of us to learn from going to the AGU. Earth science is abuzz with data sharing initiatives: there are exciting new frontiers to explore, important lessons to be learned, and invaluable data to be saved.
