A Mendeley data mashup wins at Data In Sight hacker competition.

Two weekends ago, a group of developers and designers gathered at the Adobe offices in downtown San Francisco to work on data visualization projects, taking open data sets and fusing them in creative ways to yield new insights. swissnex San Francisco and Creative Commons organized the event, Infochimps and Factual provided the datasets, and the judges came from some of the top design firms and startups in SF and Europe, such as Stamen, LUST, Color, and Square. About a hundred developers and designers showed up, and 20 teams competed. Given such strong competition and high standards, I was thrilled when my team was chosen as the best data mashup! Here’s what we did…

About the team

I joined up with Giorgio Caviglia, a visiting researcher from DensityDesign Lab (Politecnico di Milano), and Pino Trogu, a professor of Information Design at San Francisco State. I had some data on research trends from Mendeley that I wanted to use in a mashup, and Giorgio was also interested in readership data, having worked on the fascinating “Mapping the Republic of Letters” project. Since LinkedIn was also promoting its API at the event, we decided it might be interesting to compare Mendeley readership with LinkedIn social graph size.

What we did

We first took the data I had on popular papers on Mendeley and used the API to assemble a list of authors from the top 500 life science papers, along with the readership of the top papers they appeared in. We then looked for those authors on LinkedIn. The first hurdle was that, while Mendeley’s data is open and the API makes it easy to get statistics and trends on papers and authors, the LinkedIn API wasn’t designed for this sort of use and blocked us while we were still testing our author lookup script. Unable to get the data we wanted this way, we would have been stuck but for Giorgio’s awesome coding skills: he built a script to scrape the data from public LinkedIn profile pages. The disadvantage of this method is that we couldn’t get all the data we needed to make sure we had the right Li or Smith, but this being a weekend coding challenge, we just took that on the chin and incorporated it into our results. We limited our name matches to those within life science-related industry categories, which narrowed the number of matching names to 2-3 in most cases.

What we found was quite fascinating. Industry researchers tended to be better represented than academic ones, and the vast majority of the activity and relationships of the academic world are hidden from LinkedIn. Even big names in academia like Eric Lander and David Altshuler don’t appear on LinkedIn in the categories we looked at. In fact, the most popular author in our data sample was Richard Gibbs, head of an academic sequencing group at Baylor College of Medicine, yet he had no connections on LinkedIn. The most widely connected author on LinkedIn was Michael Egholm, the former VP of R&D at 454, a next-gen sequencing company, and he had far fewer readers.
One notable exception to the trend, which I can’t believe we missed at the time, was Felisa Wolfe-Simon, recently internet-infamous for claiming, to huge publicity, to have found evidence of arsenic-based life in work that turned out to be rather shoddy.
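
The lookup described above can be sketched in a few lines of Python. This is not the script we wrote that weekend, only a minimal illustration of the idea: total up readership per author, then keep only the same-named profiles that fall in a life science industry. The `Profile` record, the category names in `LIFE_SCIENCE_INDUSTRIES`, the input dictionaries, and the rule of summing readers across papers are all assumptions made for the example, and the sample people are invented.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Assumption: stand-ins for the "life science-related industry categories".
LIFE_SCIENCE_INDUSTRIES = {"Biotechnology", "Pharmaceuticals", "Research"}


@dataclass(frozen=True)
class Profile:
    """Hypothetical record for one scraped public profile."""
    name: str
    industry: str
    connections: int


def normalize(name: str) -> str:
    """Lower-case a name and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def readers_per_author(papers: Iterable[dict]) -> Dict[str, int]:
    """Sum readership over every paper an author appears on (an assumed rule)."""
    totals: Dict[str, int] = defaultdict(int)
    for paper in papers:
        for author in paper["authors"]:
            totals[normalize(author)] += paper["readers"]
    return dict(totals)


def candidate_profiles(
    authors: Iterable[str], profiles: Iterable[Profile]
) -> Dict[str, List[Profile]]:
    """Map each author to the same-named profiles in a life science industry."""
    by_name: Dict[str, List[Profile]] = defaultdict(list)
    for profile in profiles:
        if profile.industry in LIFE_SCIENCE_INDUSTRIES:
            by_name[normalize(profile.name)].append(profile)
    # An empty list means no presence in the categories we searched.
    return {normalize(a): list(by_name.get(normalize(a), [])) for a in authors}


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    papers = [
        {"authors": ["Jane Smith", "Wei Li"], "readers": 100},
        {"authors": ["Jane Smith"], "readers": 50},
    ]
    profiles = [
        Profile("Jane Smith", "Biotechnology", 120),
        Profile("Jane Smith", "Banking", 300),
        Profile("Jane Smith", "Research", 40),
    ]
    readers = readers_per_author(papers)
    for author, found in candidate_profiles(readers, profiles).items():
        print(author, readers[author], [p.connections for p in found])
```

Note that an author can still come back with more than one candidate profile, which is exactly the "right Li or Smith" ambiguity: filtering by industry shrinks the list but does not resolve it.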

Based on this information, we decided that the best presentation would be a visualization that revealed this academia/industry split. Inspired by Pino, we developed a new type of plot we called an “iceberg plot”, which shows two dimensions of data on a standard bar plot, mirrored at the point of relationship between the two sets.

In our case, the point of relationship was the name, and the resulting graph reveals strikingly how few published authors have any presence on LinkedIn. What you can see via LinkedIn really is just the tip of the iceberg. You can see the prototype we developed here and see the slides from the brief presentation here.
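
To make the mirroring concrete, here is a rough text-only sketch of an iceberg plot in Python. It is not the prototype's code: the function names, the `#` and `~` characters, the made-up numbers, and the choice to scale each side to its own maximum are all assumptions for illustration. LinkedIn connections rise above the waterline, Mendeley readers hang below it, and each column is one author.

```python
from typing import List, Tuple

# (label, LinkedIn connections, Mendeley readers); all values are made up.
Record = Tuple[str, int, int]


def scale(value: int, max_value: int, height: int) -> int:
    """Map a value onto 0..height rows; any non-zero value gets at least one row."""
    if value <= 0 or max_value <= 0:
        return 0
    return max(1, round(value / max_value * height))


def bar_heights(records: List[Record], height: int) -> List[Tuple[str, int, int]]:
    """Return (label, rows above the waterline, rows below it) for each author."""
    # Assumption: each side is scaled to its own maximum.
    max_conn = max(conn for _, conn, _ in records)
    max_read = max(read for _, _, read in records)
    return [
        (label, scale(conn, max_conn, height), scale(read, max_read, height))
        for label, conn, read in records
    ]


def iceberg_plot(records: List[Record], height: int = 4, width: int = 3) -> str:
    """Draw mirrored bars: connections above the waterline, readers below."""
    bars = bar_heights(records, height)
    lines = []
    # The tip: drawn from the top row down to the waterline.
    for row in range(height, 0, -1):
        lines.append(
            "".join(("#" if tip >= row else " ").center(width) for _, tip, _ in bars)
        )
    # The waterline is the shared axis where the two data sets meet.
    lines.append("~" * (width * len(bars)))
    # The submerged part: drawn from the waterline downward.
    for row in range(1, height + 1):
        lines.append(
            "".join(("#" if keel >= row else " ").center(width) for _, _, keel in bars)
        )
    lines.append("".join(label[:width].center(width) for label, _, _ in bars))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [("A", 0, 900), ("B", 500, 100), ("C", 50, 450)]
    print(iceberg_plot(sample))
```

With data like ours, most columns would have nothing at all above the waterline, which is the pattern the plot was meant to expose.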

So what’s next?

While there’s certainly much that still needs to be done to refine the work and expand the study, we think the data hints at an opportunity for academics to expand their circles by establishing a presence on LinkedIn. It also points to an opportunity for LinkedIn to examine how it can make its offering more interesting and useful for academics. Perhaps now that profile information is available through the Mendeley API, this information could be used to enrich LinkedIn profile pages, making them more useful and appealing to academics. Giorgio, Pino, and I think this would be an interesting direction to take this initial work, and we are open to suggestions on how it could be improved.

The other teams did great work and deserve recognition as well, so check out some of the other winning entries at the Data In Sight blog. swissnex San Francisco also did a fantastic job organizing and running the event. You can see other blog posts about the event, as well as pictures and presentations here.