With the Mendeley for Life Scientists webinar coming up on Thursday, I thought I would take a look at the readership stats for Biological Sciences. Biological Sciences has long been our biggest discipline, and having done my doctoral work in the Life Sciences, I knew this would be interesting. Overall, researchers in bioinformatics contributed most strongly to the most read papers, along with the older disciplines of micro- and molecular biology. Regardless of discipline, however, it’s clear that the days of toiling away in isolation to thoroughly study one gene are over. Today, it’s all about huge consortia and massive data. Here’s what I found:
It turned out to be a fascinating list of topics, with expected entries like Venter's synthetic-genome paper and lots of next-gen sequencing reports, but also an interesting view of how these techniques, especially as they apply to metagenomics, are starting to support related fields like conservation biology.
The top graph summarizes the overall results of the analysis. It shows the top 10 papers among readers who have listed biological sciences as their discipline and have chosen a subdiscipline. The bars are colored by subdiscipline, and the number of readers is shown on the x-axis. The bar graph for each individual paper shows the distribution of readership levels among subdisciplines. 24 of the 34 biological sciences subdisciplines are represented, and the axis scales and color schemes remain constant throughout. Data analysis was done in R and graphs were prepared using ggplot2. (NB: Only a minority of biological scientists have listed a subdiscipline. I would encourage everyone to do so, so you'll show up in the stats.)
1. RNA-Seq: a revolutionary tool for transcriptomics.
Given the volume of published research featuring some form of microarray analysis over the past decade, it's no surprise that this technique, a major advance over the existing technology, is as widely read as it is. After much initial enthusiasm, it became clear that microarrays had serious problems holding the technique back from wider application: sequences have to already be known to be probed for, the sample prep required to get enough material for detection skews the relative proportions of the individual sequences, and background noise prevents the detection of rare species. RNA-Seq avoids these problems and makes it possible to survey transcripts with greater diversity and breadth. This particular paper comes from Michael Snyder, whose lab was the first to do the sort of large-scale functional genomics studies for which this technique is well suited.
2a. Next-generation DNA sequencing.
Given the popularity of Snyder's RNA-Seq paper among Mendeley readers, you'd expect this review of the various types of next-generation sequencing to be widely read as well. It comes from Jay Shendure's lab at UW. Shendure worked with George Church, a hero of the personal genomics and DIY biology movements, on the technology underlying the open sequencing platform called the Polonator, and in this paper he discusses the current state of the art in sequencing technology.
2b. Creation of a bacterial cell controlled by a chemically synthesized genome.
Tied for second place, this paper reports the synthesis of a bacterial genome entirely from scratch. Starting from nothing but a digital copy of the sequence and a bottle of reagents on the shelf, they made the entire sequence of the DNA which, when transplanted into a cell, gave rise to an entirely different bacterial strain, controlled entirely by the originally-synthesized DNA. This work was done in the lab of Craig Venter, who you may remember as the guy who took on the National Institutes of Health in a race to be the first to sequence the human genome. All you need to know about him can be learned from the unique watermark he added to the genome of the synthetic cell: he actually spelled out his own name, among other things, in the DNA of the organism. And in case you're not entirely clear on the significance of the work reported here, consider that reaching this result required an engineering feat of approximately the same magnitude as putting a man in space.
3. 2011: the immune hallmarks of cancer
This paper is an answer and an update to the classic 2000 paper by Hanahan & Weinberg that laid out the blueprint for pretty much all the anti-cancer strategies of the past decade. In "Immune Hallmarks", Cavallo et al. discuss a strategy for attacking cancer via common immunological characteristics.
4. Sequencing technologies – the next generation.
Here's yet another next-gen sequencing paper, a testament to how transformative this technology has been for the life sciences. This is a review of practical considerations for the setup and application of the various next-gen sequencing systems available. The author, Dr. Metzker, was one of the contributors to the human genome sequencing effort and has developed a novel type of next-gen sequencing which should further drive down the cost and time required for sequencing projects.
5. Why Most Published Research Findings Are False: Author’s Reply to Goodman and Greenland
This isn't actually a research paper at all. It's a paper about some of the pitfalls encountered when doing scientific research, and it helps illustrate how surprisingly hard it is to avoid bias and make a true statement. Its wide popularity across disciplines is a testament to the advice and guidance given therein. The paper, which is actually a follow-up response to criticism of Ioannidis's 2005 paper that started this debate, explains part of the reason for the whiplash-inducing "X CURES CANCER!!" / "um, sorry, X actually doesn't cure cancer" cycle that's so common in mainstream health reporting. Specifically, it shows how scientific studies that seem to give a positive result for a certain effect, such as the effect of Vitamin E on heart disease, are often not conclusive when looked at on a larger scale. In most cases, this isn't due to any deliberate attempt by the researchers to deceive others about the validity of their work, but rather arises because researchers try to find the broadest applicability for their results, leading them to overstate their case. As larger studies are done and more data come to light, exceptions arise, and details that the smaller study simply didn't pick up change the story. Of course, even studies that are later shown to be not entirely true add to our store of knowledge. In fact, one useful way of looking at it is that nothing is ever proven true. Rather, the alternative theories are shown to be less and less likely, and that happens through the processes that Ioannidis describes.
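Ioannidis's core argument reduces to a little arithmetic. Here's a rough sketch of the positive-predictive-value calculation from the 2005 paper, with illustrative numbers of my own choosing (and in Python rather than the R used for the rest of this post):

```python
def positive_predictive_value(r, alpha=0.05, power=0.8):
    """Probability that a statistically significant finding is true.

    r     -- prior odds that a tested relationship is actually real
    alpha -- significance threshold (the false-positive rate)
    power -- 1 - beta, the chance of detecting a true effect

    Ioannidis (2005): PPV = (1 - beta) * R / ((1 - beta) * R + alpha)
    """
    return (power * r) / (power * r + alpha)

# In a field where 1 in 10 tested hypotheses is real, a typical
# "significant" result is true only about 62% of the time...
print(positive_predictive_value(0.1))
# ...and in an exploratory field where only 1 in 100 is real,
# most published positive findings are false.
print(positive_predictive_value(0.01))
```

The fewer true relationships there are to find (and the lower the power of the studies), the more the alpha term dominates, which is exactly the dynamic behind the cure/no-cure whiplash described above.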
6. Quantifying biodiversity: procedures and pitfalls in the measurement and comparison of species richness
Another common theme among the top papers is biodiversity, with 3 of the top 10 papers dealing with the subject. This paper is one of those, and perhaps one explanation for its popularity is that it's something like the Ioannidis paper for this niche. It describes sampling and estimation techniques for determining, for example, how many species of insects there are in the rain forest. Since it's impossible to know when you've found every species, sampling techniques are used to assess how well a conservation strategy is working, or how much impact an environmental change is having on an ecosystem. In other words, the techniques described by Gotelli & Colwell help you make true statements, not about human health interventions, but about ecological ones.
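To make the idea concrete, here's one classic estimator of the kind the paper surveys: the Chao1 estimator, which extrapolates total species richness from how many species were seen exactly once and exactly twice in a sample. (A minimal Python sketch for illustration only, not a reimplementation of the paper's methods.)

```python
def chao1(abundances):
    """Chao1 lower-bound estimate of total species richness.

    abundances -- per-species counts observed in a sample
    """
    counts = [c for c in abundances if c > 0]
    s_obs = len(counts)                    # species actually observed
    f1 = sum(1 for c in counts if c == 1)  # singletons
    f2 = sum(1 for c in counts if c == 2)  # doubletons
    if f2 == 0:
        # bias-corrected form for samples with no doubletons
        return s_obs + f1 * (f1 - 1) / 2.0
    return s_obs + f1 ** 2 / (2.0 * f2)

# 6 species observed, but 3 of them only once: the estimator infers
# that several more species were probably missed entirely.
print(chao1([5, 3, 1, 1, 2, 1]))  # 10.5
```

The intuition: a sample full of singletons is far from saturated, so the true richness must sit well above the raw count.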
7. A map of human genome variation from population-scale sequencing.
The 1000 Genomes Project reported in this paper is an ambitious attempt to take the data acquired in the Human Genome Project and figure out what it means. By comparing many sequenced genomes to one another, variants can be found in the genes that explain the differences seen in the outward physical traits of individuals. By making this genotype-phenotype link, it's hoped that clues to disease origins and treatments can be found. This paper is popular in part due to the sheer size and ambition of the undertaking: for example, they expect to capture all the genetic variation that's present in more than 1% of the population. An interesting method was developed here as well: they sequenced a great number of people at much lower resolution, then took advantage of the fact that most people are similar enough that calls could be compared across genomes to correct the errors. I'm not actually certain how far they pushed this technique, or to what extent they weren't able to do the error correction, but it's an interesting approach. I would expect that as the cost of sequencing drops further, this technique will lose its utility, but I also expect that they'll be proven mostly right in their sequence estimations.
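The cross-genome error-correction idea can be illustrated with a toy example. This is my own drastic simplification in Python, not the 1000 Genomes pipeline (which uses far more sophisticated statistical imputation): at a given position, a base call that almost no one else shares is more likely a low-coverage sequencing error than a real rare variant.

```python
from collections import Counter

def consensus_correct(calls, min_support=2):
    """Toy cross-sample error correction for one genomic position.

    calls -- one base call per individual at the same position
    A call seen fewer than `min_support` times is replaced with the
    most common call, on the assumption it's a sequencing error
    rather than a genuinely rare variant.
    """
    tally = Counter(calls)
    consensus = tally.most_common(1)[0][0]
    return [c if tally[c] >= min_support else consensus for c in calls]

# A lone 'G' among 'A's gets corrected...
print(consensus_correct(["A", "A", "A", "G", "A"]))
# ...but a variant shared by several individuals is kept.
print(consensus_correct(["A", "A", "G", "G", "A"]))
```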
8. Mapping and quantifying mammalian transcriptomes by RNA-Seq.
If it's good enough for people, it's good enough for the scientist's favorite research mammal, the mouse. This paper is one of the older ones in the sample and the first to really show the power of the RNA-Seq approach. Using this technique they uncovered nearly 600 new genes, despite 90% of the sequences they obtained already being known. The cool thing is that not only did they publish the paper, but they also made the data and the code used for the analysis available for download and re-use. This is a great example of how this sort of study should be reported: not just published in a paper, but made available as a living website.
9. Nucleic Acids Research annual Database Issue and the NAR online Molecular Biology Database Collection in 2009
The journal Nucleic Acids Research publishes an annual issue focusing on the various databases and web services available to researchers. This paper is an interesting anomaly, because it essentially serves as a placeholder, something to hang citation information on, because the current system of scholarly communication requires citing a volume and page number rather than a web address or database fingerprint. Here's the list of databases in this issue.
10. Conservation: Biodiversity as a bonus prize.
In this paper, Myers et al. argue that conservation efforts should be directed towards the areas of the world with the highest levels of biodiversity, in order to get the most out of the all-too-limited resources applied to conservation projects. They identify 25 hotspots which contain the only habitat for 44% of the world's plants and 35% of the world's vertebrates, despite making up only about 1.4% of the earth's land surface. This work helped focus worldwide conservation funding on the areas that need it most. With this paper, Myers made the word "hotspot" part of the global conservation vocabulary and greatly increased the amount of funding made available for conservation projects in these areas. You can keep up to date on how the project is going at the Conservation International site.
Again, the Mendeley data shows itself to be a good reflection of overall trends in the research community. Compared to the top papers in computer science, these papers are read by a broader sample of subdisciplines, highlighting how DNA serves as a common thread uniting all of biology. The new discipline pages in the Mendeley research catalog, like this one, show a useful snapshot of research as well as related groups and researchers. This highlights the importance of having your profile as complete as possible: we can't include you on one of these pages unless you specify a discipline and subdiscipline on your profile, and a good profile pic also increases your chances of appearing there. If you'd like to know more, sign up for the Mendeley for Life Scientists webinar on Thursday the 7th. It's a free one-hour tutorial on how to get the most out of Mendeley, conducted by yours truly.
The data also suggest interesting directions for new research. For example, the techniques described in the Gotelli & Colwell paper, combined with the newer, cheaper sequencing techniques, could really help put some hard numbers on the estimates used in the biodiversity hotspot study.
To do this analysis, I took a sample of widely-read papers from a range of subdisciplines and calculated the aggregate readership of each, so this is a static picture of the data; a similar analysis could be done dynamically using the Mendeley API, for which Carl Boettiger has written an R wrapper. The analysis and plots were done using R and ggplot2, and code is available upon request. As I did the analysis interactively, I don't have a full script, but I'm happy to share bits of code. The raw data can be obtained here.
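For anyone who wants to reproduce the gist of the aggregation without the original R session, the core step is just summing per-subdiscipline reader counts for each paper and ranking by the total. A minimal sketch in Python (paper titles abbreviated, reader counts entirely made up for illustration):

```python
from collections import defaultdict

# Hypothetical (paper, subdiscipline, readers) rows, standing in for
# the per-subdiscipline readership counts pulled from Mendeley.
rows = [
    ("RNA-Seq: a revolutionary tool", "Bioinformatics", 120),
    ("RNA-Seq: a revolutionary tool", "Molecular Biology", 80),
    ("Next-generation DNA sequencing", "Bioinformatics", 90),
    ("Next-generation DNA sequencing", "Microbiology", 40),
]

def top_papers(rows, n=10):
    """Sum readers across subdisciplines and rank papers by total."""
    totals = defaultdict(int)
    for paper, _subdiscipline, readers in rows:
        totals[paper] += readers
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

print(top_papers(rows))
```

The per-paper subdiscipline breakdowns shown in the graphs above then come straight from the ungrouped rows, plotted as stacked bars.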
If you do something fun with this, please let me know; I'd love to do a follow-up.