Thursday, October 2, 2014

Data Driven Discovery

As I mentioned yesterday, one of the reasons I started this blog is that I thought it might help kick-start a new project that I'm very excited about. This new project is ultimately going to require some community involvement and buy-in if it is to be successful, and a blog post seems like one good way to kick things off. So here, as promised, are more details.

As you may or may not have heard, the Gordon and Betty Moore Foundation today announced the selection of 14 new investigators in Data Driven Discovery (DDD), and I was fortunate enough to be one of them. I'm really excited about this initiative, and particularly about its potential to foster the sharing of information and ideas across disciplines. To give you an idea of the variety of disciplines represented, and the potential for new links to be made, I had met only one of the 13 other selectees in person before the selection process started - and yet I believe that I could have useful and interesting interactions with all of them.

My proposal had a couple of aims, but the one I want to talk about today is "Dynamic Statistical Comparisons". This aim came out of my belief that the way we (the community) currently compare statistical methods is horribly ineffective and inefficient. And I think we can do better by making comparisons more "dynamic" - by which I mean easily updatable by adding new methods or new datasets. Here's an extract from my grant proposal:

Many new [statistical] methods are published without software implementation, and without comparisons with existing methods. Even when comparisons are made, they are usually performed by the research group that developed one of the methods, which almost inevitably favors that method. Furthermore, performing these kinds of comparisons is incredibly time-consuming, requiring careful familiarization with software implementing the methods, and the creation of pipelines and scripts for running and comparing them. And in fast-moving fields new methods or software updates appear so frequently that comparisons are out of date before they even appear. In summary, the current system results in a large amount of wasted effort, with multiple groups performing redundant and sub-optimal comparisons.

And here's a bit more of a personal story that I wanted to tell in the proposal, but didn't have space for.

In 2009 I was working with my then-postdoc Yongtao Guan on a Bayesian approach to variable selection in large-scale regression (published here in the Annals of Applied Statistics). It seemed natural to compare our method with the regularized regression method LASSO, which also attacks this problem. Surprisingly, even though both Bayesian sparse regression and LASSO have been around for 15-20 years, we could not find a single empirical comparison of these approaches! (In the meantime some comparisons, including ours, have been published.) When we applied LASSO to our problem we were surprised at how poorly it performed in our hands. Since a lot has been written about LASSO, with few if any mentions of poor performance, we were concerned that we might be mis-applying it, or at least not making the best use of it. After all, this was the first time we had actually used LASSO in practice ourselves. So we talked to a few people with more experience than us, who gave us a variety of suggestions. However, none of these seemed to help much, and some made things considerably worse. Since we never intended our research focus to be "how to get LASSO to work on this problem", we published our best attempt at a fair comparison.

But I remained concerned that our comparisons might be unfair to LASSO, or at least could be done better. So I tried a different strategy, with a Masters student (Weiling Tang). We took the paper on the "Elastic Net" (EN) by Zou and Hastie, who are experts in regularized regression and who compared EN with LASSO under several simple simulation scenarios. After brief correspondence with the authors we were able to largely reproduce their results (boosting our confidence that we were using the methods reasonably). Then we added our Bayesian method to the comparisons, and found that it generally outperformed both EN and LASSO on these simulations. However, on reflection I had another concern: the EN paper used 2-fold cross-validation (CV) to select tuning parameters for both LASSO and EN, which, being the same for both methods, could be regarded as "fair", even if 2-fold CV performed less well than, say, 10-fold CV. Our Bayesian method, however, does not use CV. With a different CV strategy, might EN or LASSO outperform our method? So I'm left wondering what the best CV strategy is. I feel I can't do the comparison properly without knowing that, but at the same time that almost seems like a research question in itself, lying outside the main thrust of my own research program.
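(To make the concern concrete, here is a minimal sketch - not the code from our paper, and not the simulations from the EN paper - of how the choice of CV strategy can shift a LASSO versus EN comparison. It uses Python with scikit-learn, and the simulation scenario is invented purely for illustration.)

# A minimal sketch (invented scenario, not the EN paper's simulations):
# tune LASSO and Elastic Net by 2-fold and by 10-fold CV and compare test error.
import numpy as np
from sklearn.linear_model import LassoCV, ElasticNetCV
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Simulate a sparse regression: 50 training and 200 test observations,
# 40 predictors, only the first 5 coefficients nonzero.
n_train, n_test, p = 50, 200, 40
beta = np.zeros(p)
beta[:5] = 3.0
X = rng.normal(size=(n_train + n_test, p))
y = X @ beta + rng.normal(scale=3.0, size=n_train + n_test)
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

# The same comparison under two different CV strategies for the tuning parameters.
for folds in (2, 10):
    lasso = LassoCV(cv=folds).fit(X_train, y_train)
    enet = ElasticNetCV(cv=folds, l1_ratio=[0.1, 0.5, 0.9]).fit(X_train, y_train)
    print(f"{folds}-fold CV: "
          f"LASSO test MSE = {mean_squared_error(y_test, lasso.predict(X_test)):.2f}, "
          f"EN test MSE = {mean_squared_error(y_test, enet.predict(X_test)):.2f}")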

The Solution

OK, so the obvious solution is to work together with other people, right? For example, we could have contacted Zou and Hastie and invited them to work with us on this and perhaps publish a joint paper. But although I'm certainly open to this idea, I think we can do even better. We should start by acknowledging that these types of comparisons are better done as a dynamic exercise, with the aim of steadily improving methods - not as a static exercise to determine a single best method once and for all. Indeed, I really believe these types of benchmarks and comparisons are more important for what they can help us learn about the methods - what does and does not work well in various situations - than for declaring some kind of "winner". And I also believe this gets lost in most published comparisons, partly of course because you have to show that your method is the best to get it published...

 So here's the proposal (again, excerpted from my grant text):

To address this we will create public Internet repositories that compare methods with one another in a reproducible and easily-extensible way. The repositories will provide “push-button” reproducibility of all comparisons: running a single script will run all the methods in the repository on all the data sets, and produce tables and graphs comparing performances. It will be simple to add code for a new method, or a new data set, and re-run the comparison script. These repositories will help establish which methods perform well on which data, and allow users to easily investigate how software settings or types of input data affect performance. Having code for each method available will substantially reduce the bar for others to build on and improve methods. The repositories will be under version control (git), so that the full history will be available, progress can be tracked, and citations can reference a specific version. A wiki-style forum will allow discussion or comments on results and comparisons - for example, allowing users to note why some comparisons favor certain types of methods. We envisage the repository as complementing, rather than replacing, the standard publication model: papers introducing a new method will “deposit” the method in the comparison repository, in the same way that scientific data are deposited in data repositories. 
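(To give a flavour of what I have in mind, here is a toy sketch of such a "push-button" comparison script, in Python. Everything here - the method and data set registries, the simulated data sets, the scoring by test-set mean squared error - is hypothetical and just for illustration; it is not an existing system.)

# Toy sketch of a "push-button" comparison script: run every method in the
# repository on every data set and collect the results into one table.
# The registries, function names, and scoring rule below are all hypothetical.
import numpy as np
from sklearn.linear_model import LassoCV, ElasticNetCV
from sklearn.metrics import mean_squared_error

def simulate_sparse(seed, n=100, p=50, k=5):
    """One 'data set': a simulated sparse regression with a train/test split."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(p)
    beta[:k] = 2.0
    X = rng.normal(size=(2 * n, p))
    y = X @ beta + rng.normal(size=2 * n)
    return dict(X_train=X[:n], y_train=y[:n], X_test=X[n:], y_test=y[n:])

# Every method is a function with the same signature: take a data set,
# fit on the training part, and return predictions for the test part.
# Adding a new method to the comparison is one new entry here.
METHODS = {
    "lasso_10foldCV": lambda d: LassoCV(cv=10).fit(d["X_train"], d["y_train"]).predict(d["X_test"]),
    "enet_10foldCV": lambda d: ElasticNetCV(cv=10).fit(d["X_train"], d["y_train"]).predict(d["X_test"]),
}

# Every data set is a function returning the data; adding a data set is also one entry.
DATASETS = {f"sim_seed{s}": (lambda s=s: simulate_sparse(s)) for s in range(3)}

def run_all():
    """The single script: run all methods on all data sets and print a table of test MSEs."""
    print("dataset".ljust(12) + "".join(name.ljust(18) for name in METHODS))
    for dname, make_data in DATASETS.items():
        data = make_data()
        scores = [mean_squared_error(data["y_test"], run(data)) for run in METHODS.values()]
        print(dname.ljust(12) + "".join(f"{s:<18.3f}" for s in scores))

if __name__ == "__main__":
    run_all()

(In a real repository each method and data set would presumably live in its own version-controlled directory rather than one file, but this captures the basic structure.)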

[At this point I should probably acknowledge that there are a whole bunch of examples of this kind of community-based comparison out there, some of which I mentioned elsewhere in my grant proposal, and which serve as inspiration: CASP, the Netflix prize, DREAM, GAW, and Sage Bionetworks, just to start a list.]

So will it work? Obviously there are some non-trivial barriers to making this work in complete generality. In some cases there will be cross-platform compatibility issues and data privacy issues, not to mention that coming up with good benchmark criteria can be really hard for many problems. But I think that getting some simple examples up and running should be doable quite quickly - even by a single research group working alone, if we can be relieved of the burden of ensuring that we have done the best possible job for every method. At that point anyone can join in to refine their favorite method, or add a method or data set to the repository.

Although it is probably already clear that I'm super-enthusiastic about this idea, I can't help pointing out some of the reasons why.

First, I think it will actually make comparisons not only better, but also easier and less work! This is because each group will be responsible only for getting its own method to work, not all the others.
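(Concretely, in the toy sketch above, "getting your own method to work" might just mean contributing one wrapper with the agreed interface and registering it - the ridge regression wrapper below is purely a placeholder - while the runner, data sets and scoring code are reused as-is.)

# Continuing the hypothetical sketch above: a contributor adds a single wrapper
# with the agreed fit-and-predict interface and registers it; nothing else changes.
from sklearn.linear_model import RidgeCV

def ridge_method(d):
    """Fit the contributed method on the training data and return test-set predictions."""
    return RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(d["X_train"], d["y_train"]).predict(d["X_test"])

METHODS["ridge_CV"] = ridge_method  # re-running run_all() now includes the new method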

Second, this provides what I'm going to call a carrot for reproducibility. Usually, making your work reproducible is a lot of work, and perhaps the gain is not immediately obvious to everyone. Here, by adding your method to the repository in a reproducible way, you get to take advantage of all the infrastructure that other people have helped build to run all the other methods reproducibly. So everyone reaps the benefits of everyone else's careful work to make things reproducible!

Finally, I am hopeful that this approach might help promote collaboration through competition. It may seem like collaboration and competition do not naturally go together... but I believe that once people have pitted their methods against one another, they may naturally move towards collaborating. (Effectively this is what happened with the Netflix competition.) And the availability of code etc. will of course facilitate this whole process.

There's a lot more to say, but this post is already a bit long, and I think it is time to post it.
I do of course welcome, and indeed actively encourage, comments and suggestions.




7 comments:

  1. Matthew,

    First of all, congrats on the award, that's great! I wonder how you plan to perform such comparisons in research areas where there is a lack of ground truth. One could argue that this is the case for most biological problems. This is especially true for novel biological problems based on new technology/data, i.e. where we need new methods. To some extent this applies to your variable selection problem too.

    The problem with method comparisons in publications (especially statistical ones) is that one often has to use (unrealistic) simulations, or hand-selected datasets where the ground truth is questionable.

    Most of the competitions mentioned above face this problem to some extent. This is particularly relevant for the field of flow cytometry; see for example:
    Aghaeepour, N. et al. Critical assessment of automated flow cytometry data analysis techniques. Nat. Methods 10, 228–238 (2013).

    Do you have a solution to this problem? I'd love to hear your thoughts.

    Cheers,

    Raphael

    Replies
    1. Thanks Raph,

      I guess the short answer is that I don't have a solution!
      Also, I should clarify that I don't intend to do all the comparisons myself! I just propose to develop a framework for doing comparisons in a more automated, collaborative, and dynamic way, and to put together an example or two. Then I hope other people will follow the example and start a new set of comparisons for a problem that they care about - and it will be up to them to solve the problem of coming up with good benchmarks.

      That said, I agree that finding good benchmarks is often hard. My hope is that centralizing the benchmarks might help us come to some consensus on which ones are best. (Maybe we could have a voting system to allow users to vote the best benchmarks to the top? That might be particularly helpful if we're successful enough to attract a large number of benchmarks for some problems...)
      And maybe if we can reduce the amount of collective time spent on the nuts and bolts of just running the methods then we can create some time for thinking about better benchmarks.

      I would also add that I think even simple simulated benchmarks have an important role to play in methods development in general - if only as a test to make sure that a method is implemented correctly (or at least correctly enough). Of course we do this all the time in model-based methods development - simulate under the model we are assuming and check that we can at least get sensible results in that case (e.g. recover approximately the correct parameters). If you have two different model-based methods, then it could be quite useful to run each on data simulated from the other model, to see how robust both are to model mis-specification.
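      (For what it's worth, here is a minimal sketch of that sanity check, with an invented linear model and ordinary least squares standing in for the method under test.)

# Minimal sketch of the sanity check: simulate data under the model a method
# assumes, fit the method, and verify the true parameters are roughly recovered.
# The model and "method" (ordinary least squares) are placeholders for illustration.
import numpy as np

rng = np.random.default_rng(1)

# Simulate under the assumed model: y = X beta + noise.
n, p = 500, 5
beta_true = np.array([2.0, -1.0, 0.0, 0.5, 3.0])
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# "Method" = least-squares fit; for a real method this would be the code under test.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# The check: estimates should be close to the truth when the model is correct.
print("true:     ", np.round(beta_true, 2))
print("estimated:", np.round(beta_hat, 2))
assert np.allclose(beta_hat, beta_true, atol=0.15), "sanity check failed"
# (For two model-based methods, one would also swap: simulate from each model
# and run both, to probe robustness to mis-specification.)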

      But I think you know all this already - it doesn't solve your overall problem of the difficulty of identifying good gold (or even silver or bronze) standards.

      Matthew

  2. Great post. Concerning the infrastructure for such a system, especially cross-platform compatibility, you might want to check out how http://nucleotid.es is using Docker (https://hub.docker.com/u/nucleotides/) and GitHub (https://github.com/nucleotides) to achieve a community-driven, continuously-updated evaluation platform for genome assemblers.

    Also, if you haven't read Michael Nielsen's "Reinventing Discovery: The New Era of Networked Science", it is worth checking out for its discussion of the structure and requirements of successful online collaborative science projects like the one you propose.

  3. Thanks Casey,
    nucleotid.es looks very cool - I will look into it, as well as the Nielsen book
    (neither of which I knew about before this blog... paying off already :))

    Matthew

  4. Matthew,

    This is a very cool idea, and I'm glad you've gotten the funding to pursue it in earnest. I will definitely be curious to see how the project unfolds.

    My guess is that the hardest and most important part will be getting the research community to "buy in". It seems to me that a common denominator in the success or failure of any "open science" initiative (which is my umbrella term for projects that emphasize code sharing, reproducibility, public comment sections, etc.) is whether prominent scientists actively promote it. You need respected people to go out on a limb and become early adopters, and you need reviewers telling authors that they want to see new methods evaluated through the project. Without a strong push, it's very easy for inertia to keep things the way they've always been.

    Social media can be a powerful tool in this context. Haldane's Sieve is a great example: it started as a good, if possibly quixotic, idea, but it quickly gained traction through Graham and Joe's strong presence on Twitter. It wasn't long before leading scientists were posting preprints and discussing them in the comments section, and now the site is becoming a hub for modern population genetics. I don't think this would have happened without the community-building and direct engagement that happened on Twitter. (Example: Whenever someone tweets about a relevant preprint, they are sure to receive a request from @Haldanessieve to write a blurb about it for the site; I can imagine something similar happening with your project and preprints on new statistical methods.)

    It looks like several of the recipients of the Moore Foundation grant are veterans of the Twitterati, so I'd say that's a good start.

    Anyway, those are my initial thoughts. I'd be happy to help if there's anything I can contribute (after I finish writing the paper we're preparing to submit, of course).

    --Bryan

  5. I'm sure you already know this, but I like this idea very much!

    Tim

  6. Apparently I'm late here. I saved this post for future reading when I saw it, guessing it would be a good read - and having read it in detail today, it really is. I absolutely love the ideas you mention, in particular "collaboration through competition". If the authors of methods care about comparisons, they should join your repository and try to improve their own methods. A nice consequence is that they are very likely to start improving their software implementations as well, so you will not need to worry about whether you have used LASSO/EN properly.

    For small-scale computations, I guess GitHub + Travis CI virtual machines should be enough. Reproducibility is guaranteed because clean virtual machines are used for each build. For larger-scale problems, Docker/Digital Ocean might be a good choice.

    I'm very much looking forward to such a platform - something like the Netflix competition, but for comparing statistical methods!
