Mendeley at JCDL 2014

Image by Patrick Hochstenbach @hochstenbach

The Mendeley Data Science team have been busy attending some important events around the world. One of them has been JCDL 2014, the most prominent conference in the Digital Libraries arena. The conference looks at many of the problems we’re tackling at the moment, such as article recommendations and the best ways of automatically extracting information from research articles.

Maya Hristakeva, Senior Data Scientist at Mendeley, was particularly excited about the various approaches to topic modelling that were discussed at the event. “Topics were used as features for a diverse range of tasks, such as prediction of an author’s future citation counts, making personalised recommendations, search, author disambiguation, and creating more relevant citation networks, all features that have a direct impact on the research workflow on Mendeley.”

“We saw some really thought-provoking output come out of the JCDL14 proceedings, such as Characterizing Scholar Popularity: A Case Study in the Computer Science Research Community,” explains Kris Jack, Chief Data Scientist at Mendeley. “Some of the interesting research questions raised included one by Gonçalves, G. D., Figueiredo, F., Almeida, J. M., & Gonçalves, M. A. (2014), which asked whether it is possible to represent the popularity of a researcher using the number of readers that they have.”

“It was also nice to see evidence in some of the papers presented that Mendeley readership is highly correlated with various measures of academic impact, such as h-index and publication venue importance,” says Mendeley Senior Data Scientist Phil Gooch.

Overall, this was a really valuable opportunity to connect with researchers who are working on problems similar to those we tackle at Mendeley, such as metadata extraction, recommendations, and citation/author/venue disambiguation. We’re now thinking about running an open challenge to focus this research into concrete output that could be of use in features for our users. If you have any ideas around that, do get in touch on Twitter with @_krisjack, @mayahhf and @Phil_Gooch.

Note: At Mendeley, we believe in dogfooding (it’s not as disgusting as it sounds, merely techy slang for using your own product to validate the qualities of that product…) so Maya, Kris and Phil took notes using Mendeley Desktop 🙂

 

Discussing the Future of Recommender Systems at RecSys2014


Maya and Kris from the Mendeley Data Science team have just returned from RecSys2014, the most important conference in the recommender systems world. RecSys is remarkable in that it attracts an equal number of participants from industry and academia, many of whom are at the forefront of innovation in their fields.

The team had a chance to exchange perspectives and experiences with various researchers, scholars and practitioners.

“To me, it was encouraging to see how top companies across the world are investing in recommenders, as they are shown to enhance customer satisfaction and bring real value to both users and companies,” says Mendeley Senior Data Scientist Maya Hristakeva. “LinkedIn reported that 50% of the connections made in their social network come from their follower recommender, while Netflix says that if they can stop 1% of users from cancelling their subscription then that’s worth $500M a year, which of course justifies the fact they are investing $150M/year in their content recommendation team, consisting of 300 people.”

But one of the advantages of such a hybrid event is that it did not shy away from addressing the broader issues, such as how to guard against creating a “filter bubble” effect, how to preserve users’ privacy, and how to optimise systems for what really matters (and how this can be effectively defined). Daniel Tunkelang of LinkedIn and Xavier Amatriain of Netflix moderated a panel on “Controversial Questions About Personalization”, tackling some of these topics head on. Hector Garcia-Molina from Stanford University also put forward the view that we’ll increasingly see a convergence of recommendations, search and advertising, despite noticeable scepticism from the attendees.

Kris Jack, Chief Data Scientist at Mendeley, says one of the main messages that he took away from the conference was the importance of winning a user’s trust in the early stages of using a recommender system.

“The best systems have been shown to start off by providing recommendations that can quickly be evaluated by users as being useful before gradually introducing more novel recommendations. So in the case of helping researchers to find relevant articles to read, it’s probably best to start by recommending well known but important articles in their field, before recommending some less well known but very pertinent articles to their specific problem domain,” explains Kris. “Other important factors include reranking (the order in which recommendations should be shown), the UI design that can best support interaction with the recommender system, and the ways in which we can build context-aware recommendations.”
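The idea of earning trust with familiar recommendations before introducing more novel ones can be sketched as a simple reranking step. This is only an illustration under stated assumptions: the `popularity` and `relevance` scores, the `warmup` threshold and the linear blend are all hypothetical, and none of it describes how Mendeley's recommender actually works.

```python
def rerank(candidates, interactions_so_far, warmup=10):
    """Order candidates, shifting weight from popularity to relevance.

    candidates: list of dicts with hypothetical 'title', 'popularity'
    and 'relevance' fields, each score assumed to lie in [0, 1].
    interactions_so_far: how many recommendations the user has already
    engaged with; used here as a crude stand-in for trust.
    """
    # Weight on relevance grows linearly from 0 to 1 over the warm-up period.
    novelty_weight = min(interactions_so_far / warmup, 1.0)

    def score(candidate):
        return ((1.0 - novelty_weight) * candidate["popularity"]
                + novelty_weight * candidate["relevance"])

    return sorted(candidates, key=score, reverse=True)


candidates = [
    {"title": "Well-known survey", "popularity": 0.9, "relevance": 0.4},
    {"title": "Niche but pertinent paper", "popularity": 0.2, "relevance": 0.8},
]

# A brand-new user sees the well-known article first.
new_user_order = [c["title"] for c in rerank(candidates, interactions_so_far=0)]

# A user with a history of engagement sees the more specific article first.
regular_user_order = [c["title"] for c in rerank(candidates, interactions_so_far=10)]
```

A real system would more likely learn this trade-off from user feedback than fix it with a hand-picked schedule.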

What do you think of the current recommendation features on Mendeley? Are there any particular ones that you’d like to see implemented? Would you like to join the team and work on making them even better? Let us know in the comments below, or tweet the team directly @_krisjack, @mayahhf and @Phil_Gooch. If you’re interested in finding out more about what the Data Science Team is developing in that arena, you can also watch their Mendeley Open Day presentation here.

 

 

Finding Better Ways of Mining Scientific Publications


Mendeley is supporting the 3rd edition of the International Workshop on Mining Scientific Publications, which will take place on the 12th September 2014 in London. The event will bring together researchers and practitioners from across industry, government, digital libraries and academia to address the latest challenges in the field of mining data from scientific publications.

Kris Jack, Chief Data Scientist at Mendeley, is part of the organizing committee, which also includes The Open University and The European Library. Following a very successful call for papers, he is now looking forward to a very busy and productive day of presentations and discussions:

“We’ve had a record number of high-quality submissions this year, so we were really spoiled for choice in putting together the agenda, which combines long papers, short papers, demonstrations and various presentations. We also worked with Elsevier to engage directly with the research community, which is really fantastic.”

As part of that ongoing outreach, Gemma Hersh, Policy Director at Elsevier, will be giving a brief presentation and answering questions from the participants regarding the company’s recently updated Text and Data Mining policy, and how it can best support the evolving needs of the research community.

As in previous years, this workshop is run in conjunction with the Digital Libraries conference – DL 2014 – and participants can register on the City University London website to attend the entire conference or just the workshops/tutorials.

See the full programme below, and for the latest updates be sure to follow @WOSP2014 or send any questions to @_krisjack or @alicebonasio on Twitter.

 

PROGRAM

| Time | Session | Title | Speakers / Authors |
| --- | --- | --- | --- |
| 09:00-09:10 | Introduction | | |
| 09:10-09:45 | Keynote talk | Information Extraction and Data Mining for Scholarly Big Data | Dr. C. Lee Giles |
| 09:45-10:10 | Long paper | A Comparison of two Unsupervised Table Recognition Methods from Digital Scientific Articles | Stefan Klampfl, Kris Jack and Roman Kern |
| 10:10-10:30 | Short paper | A Keyquery-Based Classification System for CORE | Michael Völske, Tim Gollub, Matthias Hagen and Benno Stein |
| 10:30-10:50 | Short paper | Discovering and visualizing interdisciplinary content classes in scientific publications | Theodoros Giannakopoulos, Ioannis Foufoulas, Eleftherios Stamatogiannakis, Harry Dimitropoulos, Natalia Manola and Yannis Ioannidis |
| 10:50-11:10 | Break | | |
| 11:10-11:35 | Long paper | Efficient blocking method for a large scale citation matching | Mateusz Fedoryszak and Łukasz Bolikowski |
| 11:35-12:00 | Long paper | Extracting Textual Descriptions of Mathematical Expressions in Scientific Papers | Giovanni Yoko Kristianto, Goran Topic and Akiko Aizawa |
| 12:00-12:20 | Short paper | Towards a Marketplace for the Scientific Community: Accessing Knowledge from the Computer Science Domain | Mark Kröll, Stefan Klampfl and Roman Kern |
| 12:20-12:40 | Short paper | Experiments on Rating Conferences with CORE and DBLP | Irvan Jahja, Suhendry Effendy and Roland Yap |
| 12:40-13:00 | Short paper | A new semantic similarity based measure for assessing research contribution | Petr Knoth and Drahomira Herrmannova |
| 13:00-13:10 | Presentation | Elsevier’s Text and Data Mining Policy | Gemma Hersh |
| 13:10-14:00 | Lunch | | |
| 14:00-14:35 | Keynote talk | Developing benchmark datasets of scholarly documents and investigating the use of anchor text physics retrieval | Birger Larsen |
| 14:35-14:50 | Demo paper | AMI-diagram: Mining Facts from Images | Peter Murray-Rust, Richard Smith-Unna and Ross Mounce |
| 14:50-15:05 | Demo paper | Annota: Towards Enriching Scientific Publications with Semantics and User Annotations | Michal Holub, Róbert Móro, Jakub Ševcech, Martin Lipták and Maria Bielikova |
| 15:05-15:20 | Demo paper | The ContentMine scraping stack: literature-scale content mining with community maintained collections of declarative scrapers | Richard Smith-Unna and Peter Murray-Rust |
| 15:20-15:35 | Break | | |
| 15:35-16:00 | Long paper | GROTOAP2 – The methodology of creating a large ground truth dataset of scientific articles | Dominika Tkaczyk, Pawel Szostek and Lukasz Bolikowski |
| 16:00-16:25 | Long paper | The Architecture and Datasets of Docear’s Research Paper Recommender System | Joeran Beel, Stefan Langer, Bela Gipp, and Andreas Nürnberger |
| 16:25-16:50 | Long paper | Social, Political and Legal Aspects of Text and Data Mining | Michelle Brook, Peter Murray-Rust and Charles Oppenheim |
| 16:50-17:00 | Closing | | |

Submit your paper for Mining Scientific Publications Workshop!


The 3rd International Workshop on Mining Scientific Publications will take place from the 8th to the 12th September in London, and is a cross-disciplinary workshop for researchers, industry practitioners, digital library developers, and open access enthusiasts. Kris Jack, Chief Data Scientist here at Mendeley, is co-organizing the event along with CORE, the Open University, Athena Research and Innovation Center, and the European Library/Europeana.

The aim is to bring together people from different backgrounds to explore the possibilities around data mining tools, and how they can be used to save researchers’ time by finding and processing huge amounts of information quickly and easily.

We’re asking for submissions before the 13th July 2014 from those interested in analysing and mining databases of scientific publications, developing systems to enable such analysis, or designing new technologies to improve research and the free availability of research data. Researchers should submit their papers online, for inclusion in the programme. Both long papers (up to eight pages in the ACM style) and short papers (not exceeding four pages) are welcome, as are practical demonstrations and presentation of systems and methods (demonstration submissions should consist of a two-page description of the system, method or tool).

“We’re looking to attract researchers from across academia and industry to work through the amazing possibilities and challenges around mining scientific content. The collaborations that come from these initiatives always yield really interesting results, so I’m looking forward to seeing what submissions we get through this year,” says Kris.

The workshop will be structured around three main themes:

  1. The whole ecosystem of infrastructures, including repositories, aggregators, text- and data-mining facilities, impact monitoring tools, datasets, services and APIs that enable analysis of large volumes of scientific publications.
  2. Semantic enrichment of scientific publications by means of text-mining, crowdsourcing or other methods.
  3. Analysis of large databases of scientific publications to identify research trends, high impact, cross-fertilisation between disciplines, research excellence etc.

This year, we also put together a CORE publications dataset containing a large array of publications from various research areas. This includes full-text as well as enriched versions of metadata, with the aim of providing workshop participants with a framework for developing and testing methods and tools around the workshop topics. You can access this data through the CORE portal.

If you have any questions or comments, leave them below or tweet @WOSP2014

Mendeley at ACM Recommender Systems 2013

 


By Mark Levy, Senior Data Scientist at Mendeley

Last week I had the pleasure of travelling to Hong Kong to give two workshop presentations at the ACM Recommender Systems conference. The art and science of recommender systems have come some way since the first time that “users who like X also like Y” appeared on an e-commerce site on the internet, and this year’s conference attracted several hundred delegates from both industry and academia.

Despite its close association with customer satisfaction and the commercial bottom line, as a research topic Recommender Systems occupies a tiny and somewhat recherché niche within the computer science discipline of Machine Learning, which centres on the idea that if you present a computer program with enough examples of past events, it will be able to come up with a formula to make predictions about similar events in the future. For a recommender system these events record the interaction of a user with an item, for example Alice watched Shaun of the Dead, or Kris read Thinking, Fast and Slow, and the program’s predictions consist of suggested new books that Alice or Kris might like, or of other movies similar to Shaun of the Dead, and so on. In our products these scenarios correspond to Mendeley Suggest, currently available only if you subscribe to a Pro, Plus or Max plan, and to the Related Research feature which we recently rolled out to all users in Mendeley Desktop.
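To make the “users who like X also like Y” idea concrete, here is a minimal sketch that counts how often two items appear in the same user's history and uses those counts to suggest similar items. The events, user names and function names are invented for illustration; this is not how Mendeley Suggest or Related Research are implemented.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical interaction events: each one records a user and an item.
events = [
    ("alice", "Shaun of the Dead"),
    ("alice", "Hot Fuzz"),
    ("bob", "Shaun of the Dead"),
    ("bob", "Hot Fuzz"),
    ("bob", "Zombieland"),
    ("carol", "Shaun of the Dead"),
    ("carol", "Zombieland"),
    ("dave", "Hot Fuzz"),
]


def build_cooccurrence(events):
    """Count, for every pair of items, how many users interacted with both."""
    items_by_user = defaultdict(set)
    for user, item in events:
        items_by_user[user].add(item)

    counts = defaultdict(lambda: defaultdict(int))
    for items in items_by_user.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def similar_items(counts, item, k=2):
    """Return up to k items most often seen alongside `item`."""
    neighbours = counts.get(item, {})
    # Sort by descending count, breaking ties alphabetically for stability.
    ranked = sorted(neighbours.items(), key=lambda pair: (-pair[1], pair[0]))
    return [name for name, _ in ranked[:k]]


counts = build_cooccurrence(events)
print(similar_items(counts, "Zombieland"))
```

Raw co-occurrence counts favour popular items, which is one reason practical systems normalise them or use learned models instead.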

One challenge for anyone trying to build a recommender system is that it’s hard to tell whether or not your predictions are going to be accurate, at least until you start making them and can see how often your users actually accept your suggestions.  As there is a huge space of possible methods to choose from – far too many to test every possibility on unsuspecting users – ideally we’d like to be able to figure how well each prediction formula (technically each mathematical model) matches reality before we get to that stage.  If and how that might be possible was a recurring theme of this year’s conference, and the subject of my first talk in Hong Kong.
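One common way to estimate how well a model matches reality before showing it to users is to hide part of each user's history, ask the model for recommendations, and check how many hidden items it recovers. The sketch below does this for a deliberately naive popularity model; the data, the model and the choice of precision at k as the metric are illustrative assumptions, not a description of the method presented in the talk.

```python
from collections import Counter


def precision_at_k(recommended, held_out, k):
    """Fraction of the top-k recommendations that appear in the held-out set."""
    hits = sum(1 for item in recommended[:k] if item in held_out)
    return hits / k


def popularity_model(train):
    """Build a recommender that suggests the most-read items a user has not seen."""
    counts = Counter(item for items in train.values() for item in items)
    ranking = [item for item, _ in
               sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))]

    def recommend(user):
        seen = train.get(user, set())
        return [item for item in ranking if item not in seen]

    return recommend


def evaluate(recommend, test, k):
    """Average precision at k over all users that have held-out items."""
    scores = [precision_at_k(recommend(user), held_out, k)
              for user, held_out in test.items()]
    return sum(scores) / len(scores)


# Hypothetical reading histories, split into training and held-out parts.
train = {
    "alice": {"paper_a", "paper_b"},
    "bob": {"paper_a", "paper_c"},
    "carol": {"paper_b", "paper_c"},
}
test = {
    "alice": {"paper_c"},
    "bob": {"paper_b"},
    "carol": {"paper_d"},
}

recommend = popularity_model(train)
score = evaluate(recommend, test, k=1)
print(f"precision@1 = {score:.2f}")
```

A good offline score is only suggestive: the held-out items are things users found without the recommender's help, so it cannot show whether they would accept genuinely new suggestions.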

Surprisingly for a field that has now seen several years of quite intense research interest and hundreds of peer-reviewed publications, most practitioners remain highly sceptical of the results reported even in their own research. This made it particularly interesting to hear conference presentations from large tech companies such as Google, Microsoft, LinkedIn and eBay, not to mention Chinese counterparts such as Douban, Tencent and Alibaba, which were new names to me but which also operate at colossal scale. These organisations have both the scientific expertise to develop cutting edge methods and the opportunity to test the results on significant numbers of real users. You might be surprised to learn quite how much sophisticated research has gone into recommending which game to play next on your Xbox.

At Mendeley we use a great deal of wonderful open source software, so we’re very happy that the work we did in the Data Science team for my other presentation at the conference also gave us a chance to give something back to the developer community. That contribution is mrec, a library written in the very popular Python programming language and intended to make it easier to do reproducible research on recommender systems, even if you’ll still need to test your new algorithm on real people to convince most of us that it actually works.