Above is an image from a talk that I gave earlier this year. As you can see, if I lived decades ago, I could somewhat keep up with all new research that pertained to me. Today though? Forget about it. There is just way too much going on. Even if I consider myself to be in a niche research field, I should still be keeping up with cross-disciplinary material that is relevant to my research. There is just no way to keep up with all of that information. It is information overload.
Ask yourself how you find out what is relevant to you in your research field. Got it? OK, we’ll get back to that, but before we do, ask yourself what percentage of all relevant information you are actually consuming. Let’s look at that figure above in the form of a pie chart to help us answer that question. As you can see below, the blue wedge represents the share of knowledge you are actually obtaining out of the total body of potentially relevant knowledge. We’ll call this active knowledge versus potential knowledge.
Alright, getting back to that first question, how do you find out what is relevant to you? In other words, where does that blue wedge of active knowledge come from? In no particular order, your list probably looks somewhat like this:
-informal chatting with colleagues
-daily/weekly literature review
Even if you happen to be the most voracious consumer of knowledge, that would still leave you far short of consuming that entire pie of relevant knowledge. So, this raises the question of how one could ever consume that entire pie. To answer that, let’s start by asking how Google might go about it and see how far that gets us.
If you aren’t already familiar with Google PageRank then here’s a detailed link. Just to summarize though, Google would enlarge our blue wedge of relevant knowledge by trying to find links between information. These links would be in the form of citations if we were talking about academic literature.
Google would tell us that the most relevant bit of information is the one with the most links or citations. Is that true though? The answer, of course, is no. Links and citations do not necessarily tell me what is most relevant to ME. They are more of a popularity estimate. What we are missing is the long-tail of relevance.
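To make the popularity point concrete, here is a minimal sketch of a PageRank-style score on a toy citation graph. This is not Google’s actual implementation: the paper names, the damping factor of 0.85, and the iteration count are all assumptions chosen for illustration.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {paper: [papers it cites]}."""
    papers = set(citations)
    for cited in citations.values():
        papers.update(cited)
    papers = sorted(papers)
    n = len(papers)

    # Start with every paper equally important.
    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in papers}
        for p in papers:
            cited = citations.get(p, [])
            if cited:
                # A paper passes its score on to the papers it cites.
                share = damping * rank[p] / len(cited)
                for q in cited:
                    new_rank[q] += share
            else:
                # A paper that cites nothing spreads its score evenly.
                share = damping * rank[p] / n
                for q in papers:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical toy data: one heavily cited review, one barely cited niche study.
toy_citations = {
    "popular_review": [],
    "paper_a": ["popular_review"],
    "paper_b": ["popular_review"],
    "paper_c": ["popular_review", "niche_wombat_study"],
    "niche_wombat_study": [],
}

scores = pagerank(toy_citations)
for paper, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{paper}: {score:.3f}")
```

The heavily cited review comes out on top, whether or not it has anything to do with my own research. The niche study, which might be exactly what I need, sits well below it.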
Hidden somewhere within that long-tail is what actually matters to me and what I need to know. So, Google, Google Scholar, or similar search engines for citation links such as Scopus or Web of Science enlarge that slice of knowledge, but only part of the way. We need something else.
What if you could somehow poll every single person in your research field (and even other fields) about what they are reading? You would then find the researchers most similar to you: those whose reading lists overlap the most with your own. If there is anything on those similar reading lists that is not already on your own, then those items would be recommended to you.
What was just described is called collaborative filtering and if you’ve ever used Amazon to find books then you are already familiar with it. Amazon recommends new books based on what other items have been purchased in addition to the one you are currently looking at.
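The reading-list idea can be sketched in a few lines. This is a minimal illustration, assuming reading lists are plain sets of paper IDs and that overlap is measured with Jaccard similarity; the names and data are made up, and real systems (Amazon’s included) are far more elaborate.

```python
def jaccard(a, b):
    """Overlap between two reading lists: shared items / all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(me, reading_lists, top_neighbors=2):
    """Recommend papers read by the most similar researchers but not by me."""
    mine = reading_lists[me]

    # Rank every other researcher by how much their list overlaps mine.
    neighbors = sorted(
        (
            (jaccard(mine, papers), other)
            for other, papers in reading_lists.items()
            if other != me
        ),
        reverse=True,
    )[:top_neighbors]

    # Score each unseen paper by the similarity of the people who read it.
    scores = {}
    for similarity, other in neighbors:
        if similarity == 0:
            continue  # no overlap at all, so no basis for a recommendation
        for paper in reading_lists[other] - mine:
            scores[paper] = scores.get(paper, 0.0) + similarity
    return sorted(scores, key=lambda p: (-scores[p], p))


# Hypothetical reading lists.
reading_lists = {
    "me": {"p1", "p2", "p3"},
    "alice": {"p1", "p2", "p3", "p4"},
    "bob": {"p2", "p3", "p5"},
    "carol": {"p9"},
}

print(recommend("me", reading_lists))
```

Here alice overlaps most with me, so her unread paper ranks first, followed by bob’s. Carol shares nothing with me, so her paper is never suggested.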
Such systems for academic literature are, in fact, starting to appear. So, how far does that get us in consuming the entire pie? It definitely gets us further, but we are still not there and this is for at least two major reasons.
First, to date, collaborative filtering for academic literature has been somewhat limited to a few fields of research. Second, it is based on the assumption that what is relevant to Person A is going to be relevant to Person B. For something such as general books in Amazon, this is good enough. However, for academic research, this assumption can be quite limiting, or it can bury the relevant results among too many irrelevant ones.
OK, so here’s what our pie of knowledge now looks like:
What we haven’t yet discussed is text-based recommendations. In the biological sciences, at least, these are also popping up everywhere. If you have ever used PubMed, then you may have noticed the short list of recommendations on the right side of the screen when viewing an abstract of an article. These are based on a combination of keywords or MeSH terms and even some human curation.
The theory here is that articles with similar keywords or key phrases will be articles that might interest you as well. And this is usually true. Another example is the new use of ontologies to provide recommendations. Ontologies are a way to classify terms and create structure where none seemed to exist.
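As a rough illustration of keyword-based similarity, here is a minimal sketch that compares articles by the cosine similarity of their keyword counts. This is an assumed stand-in, not PubMed’s actual method, and the article titles and keywords are invented.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two keyword-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def most_similar(target, articles):
    """Rank all other articles by keyword similarity to the target."""
    target_vec = Counter(articles[target])
    scored = [
        (cosine(target_vec, Counter(keywords)), title)
        for title, keywords in articles.items()
        if title != target
    ]
    return sorted(scored, reverse=True)


# Hypothetical articles and keywords.
articles = {
    "wombat_migration": ["wombat", "migration", "tasmania", "tracking"],
    "kangaroo_migration": ["kangaroo", "migration", "tracking", "australia"],
    "ribosome_design": ["ribosome", "synthetic", "protein"],
}

ranked = most_similar("wombat_migration", articles)
for score, title in ranked:
    print(f"{title}: {score:.2f}")
```

Articles that share keywords score high, and articles with no keywords in common score zero, even if they happen to share an underlying principle.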
What PubMed and similar sites give then are the most similar research articles. This is fantastic! Now we are getting somewhere, but we are still not there yet. Remember, we want to dive into that long-tail of relevant knowledge. And if the long-tail represents knowledge from any possible discipline, then niche sites are not finding all relevant bits.
To get a better idea of what exactly might be in that long-tail, here’s another diagram. It looks very similar to an evolutionary tree and in fact, that’s what it is, but for academic disciplines. Every discipline today is the descendant of another discipline. This makes every discipline related to each other and because of that, there are similar basic fundamentals that they each share. This is a key point.
The theory is this. If we can take every discipline and every research paper within those disciplines and break it down to their most basic elements, we can then compare those elements to each other and build them back up to provide recommendations.
At first glance, this theory may seem like a bit of a stretch. For example, what does computer science (CS) have to do with my niche field of the study of wombats in Tasmania? (Note: although wombats sound really cool and fun to play with, I have never personally studied them.)
If you had asked computer scientists 30 years ago if they would be working on the human genome you would have gotten some blank stares. Today, we call such a union bioinformatics and it is possible only because of applying CS principles to biology (also note that biology principles are helping to shape CS today as well).
Getting back to those wombats then, if I am studying migration patterns and there’s a new computer algorithm that, in principle, would help me model that migration then I should know about it. That might be near impossible or take years to discover though if I stick to the usual sources of information (those blue slices in our pie).
Those leaps of knowledge and cross-pollinating influences ARE possible though if we break down each piece of knowledge into its basic constituents. These are the most basic entities and principles within each discipline, research paper, paragraph, sentence and word.
Here’s one last figure to capture what this entails. It borrows from comparative genomics, a technique used to discover important features in the human genome. In comparative genomics, the DNA of several different species is aligned. If a stretch of DNA has the same sequence across those species, then it’s a signal that this is a relevant area to look at in more depth.
In this figure, instead of DNA, we have textual elements (principles, entities, etc) that are aligned from different subject areas and research papers. In this way then, we are getting at that long-tail of relevant information. We have vastly increased the size of the blue wedge in our pie.
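A very rough sketch of the alignment idea: assume each paper has already been broken down into a set of basic elements (that extraction step is the hard part and is not shown), then look for elements shared by papers from different disciplines. The data structure, the element labels, and the function name are all hypothetical.

```python
from itertools import combinations


def cross_discipline_matches(papers):
    """Find basic elements shared by papers from different disciplines."""
    matches = []
    for title_a, title_b in combinations(sorted(papers), 2):
        a, b = papers[title_a], papers[title_b]
        if a["discipline"] == b["discipline"]:
            continue  # only cross-discipline overlaps are of interest here
        shared = a["elements"] & b["elements"]
        if shared:
            matches.append((title_a, title_b, sorted(shared)))
    return matches


# Hypothetical papers, already reduced to their basic elements.
papers = {
    "wombat_migration": {
        "discipline": "ecology",
        "elements": {"random walk", "spatial graph", "seasonal cycle"},
    },
    "network_routing": {
        "discipline": "computer science",
        "elements": {"random walk", "spatial graph", "packet loss"},
    },
    "wombat_diet": {
        "discipline": "ecology",
        "elements": {"seasonal cycle", "gut flora"},
    },
}

for title_a, title_b, shared in cross_discipline_matches(papers):
    print(f"{title_a} <-> {title_b}: {shared}")
```

In this toy example the migration paper lines up with a CS paper on routing through two shared elements, which is the kind of link a keyword search within a single discipline would likely miss.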
We are almost there. We almost have the entire pie of relevant knowledge filtered and at our disposal. What’s missing is time. Both the time lag in information retrieval and the lack of a history of knowledge over time prevent us from discovering what is relevant-to-us-right-now. Let’s briefly explore this.
One of the greatest problems in using citations to judge academic importance and relevance is that citations take time to accumulate. Once an important AND relevant research paper comes out, it will take anywhere from six months to several years for other researchers to cite it in their own publications.
Now, to the experienced researcher, the relevancy of a paper might be immediately obvious and there is no need to wait for citations to accumulate. What isn’t immediately available though is what the community of researchers thinks. This is an important bit of information, because it will determine how intensely this new knowledge will be explored. And that is a relevant bit of information that is important to me.
What we need is a system that allows us to immediately see when people are reading a new research paper. We could then aggregate that information amongst all people and study it. We wouldn’t have to wait 6-36 months for that citation data to accumulate.
That takes care of the time lag problem, but I also mentioned a lack of historical data. If we take a snapshot of citation data at a single moment in time, which is what we normally do, then we are missing out on valuable trending data. If we could see how interest in a research topic builds up and drops back down over time, then we could, in theory, use that information in several ways.
If, for example, we see that interest in artificial ribosomes is picking up, then we can start to dig deeper and look at which other topics led to artificial ribosomes, who started the trend, and possibly even where it might lead to next. If I see that my research topic has remained stagnant for some time, then again, I can dig deeper to find out why. This could even help determine where my research dollars should be spent in the future. And that is very relevant information to me.
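Here is a minimal sketch of how readership trends might be measured, assuming we have a log of (month, topic) read events. The event format, the topic names, and the use of a simple least-squares slope as the trend measure are assumptions for illustration only.

```python
from collections import Counter


def monthly_counts(read_events, topic, months):
    """Count reads of a topic in each month (zero if there were none)."""
    counts = Counter(month for month, t in read_events if t == topic)
    return [counts[m] for m in months]


def trend_slope(counts):
    """Least-squares slope of counts over time: readers gained per month."""
    n = len(counts)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical read events: (month, topic).
months = ["2009-01", "2009-02", "2009-03"]
read_events = (
    [("2009-01", "artificial ribosomes")] * 1
    + [("2009-02", "artificial ribosomes")] * 2
    + [("2009-03", "artificial ribosomes")] * 4
    + [("2009-01", "wombat migration")] * 2
    + [("2009-02", "wombat migration")] * 2
    + [("2009-03", "wombat migration")] * 2
)

for topic in ("artificial ribosomes", "wombat migration"):
    counts = monthly_counts(read_events, topic, months)
    print(f"{topic}: {counts}, slope {trend_slope(counts):+.2f}")
```

A positive slope flags a topic that is picking up, while a slope near zero flags one that has gone stagnant. Neither signal requires waiting for citations to accumulate.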
With time added to our recommendation engine, we have now consumed our entire pie of relevant knowledge. We are now reaching the long-tail of information and putting it into the context of what is relevant to us right now, not yesterday.
The last two obstacles to gaining all relevant knowledge, time and the basic elements of information, are what we are trying to solve right now. It is certainly an ambitious challenge, but we are well on our way. Every time you add a new research paper to Mendeley, we are one step closer. It is also why we don’t just focus on one major discipline, such as the biological sciences. If the basic elements of every discipline are needed to provide us with the best possible recommendations, then we have to include everything.
Going back then to the very first figure that we looked at, what we have concluded is the following: Yes, there is a lot of information out there. What concerns us though is the part that is actually relevant to us. And even that portion is impossible to obtain through traditional means, so we must use several different methods to derive time-relevant recommendations and find the long-tail. This includes citation tracking, collaborative filtering, text analysis, cross-discipline comparative analysis, and real-time trending. No one has the perfect recommendation engine yet, but we are getting close.