Astro Data Hack Week was one of the most valuable weeks of my PhD so far. This wasn’t a week of watching; it was a week of doing.
Jake VanderPlas organised the five-day workshop, which began on September 15th, 2014, at the University of Washington in Seattle. Around 40 participants (students, post-docs, astronomers, ex-astronomers and one surgeon!), all with an interest in data science, occupied an ‘active learning room’ in the UDub library. Each morning, one or two experts in data science, statistics or machine learning delivered a lecture to the group. In the afternoons, unreserved hacking began.
First of all, what is hacking? Hacking can mean deconstructing something that already exists, ‘hacking’ into it and modifying it to fit your purpose. Or it can mean building a tool as quickly as possible, not fussing over details, just getting something working, exploring and experimenting as you go. We mostly did the latter. Our language was Python and our data sets ranged from galaxy images to stellar light curves.
The lectures: ipython notebooks, Gaussian processes, random forests, FML
The first morning’s lecture was delivered by Fernando Perez (UC Berkeley), creator of ipython. ipython notebooks are excellent tools for teaching, demonstrating and exploring ideas. After Perez’s talk, I coded in a notebook for the rest of the week. Especially cool notebook features: slider widgets! You can control a parameter with a widget and watch a function change in real time as you manipulate it by hand. After a coffee break, VanderPlas delivered a great lecture on effective computing with numpy (check out the notebook here). If you’re a python user interested in machine learning there are loads of useful tools available in the Scikit-Learn python module (another VanderPlas notebook here).
Stripped-down stats was Tuesday’s theme. We went back to basics – Daniela Huppenkothen (Amsterdam/NYU) gave a great lecture on classical statistics. On Wednesday morning David Hogg (NYU) got extremely philosophical, discussing Bayesian stats.
“The scientific community is Bayesian, individual people aren’t” – Hogg discussed the idea that each scientist lies somewhere on the Bayesian-frequentist spectrum and you can think of each person as being like a sample from the giant posterior PDF of Bayesianism.
We talked about whether you should ever compute the evidence, or Fully Marginalised Likelihood (FML). Hogg says no – you should use cross validation instead. The evidence is a normalisation constant that appears in Bayes’ theorem. For most inference problems you don’t ever actually need to calculate it, because you don’t care about proper normalisation. You only care about normalisation when you have to compare different models: e.g. are your data best described by a universe with or without inflation? In practice, the evidence calculation is extremely computationally expensive and can be sensitive to the choices you make (what kind of sampler you use, what kinds of priors you apply, etc). Cross validation is also used for model comparison but is way easier and may have fewer hidden pitfalls. It involves simply training your model on one subset of data and testing it on another.
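To make the cross validation idea concrete, here is a minimal sketch of my own (not something shown in Hogg's lecture). It compares polynomial models of different degree by k-fold cross validation using only numpy. The toy data, the helper name `kfold_mse` and the choice of degrees are all assumptions made up for illustration.

```python
import numpy as np


def kfold_mse(x, y, degree, n_folds=5, seed=0):
    """Mean squared prediction error of a polynomial fit, averaged over k folds.

    Hypothetical helper for illustration: each fold is held out once,
    the model is fit to the remaining data and scored on the held-out fold.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))          # shuffle before splitting
    folds = np.array_split(indices, n_folds)   # k roughly equal folds
    errors = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        coeffs = np.polyfit(x[train], y[train], degree)   # train on k-1 folds
        prediction = np.polyval(coeffs, x[test])          # predict held-out fold
        errors.append(np.mean((y[test] - prediction) ** 2))
    return float(np.mean(errors))


# Toy data (made up): a straight line plus Gaussian noise.
rng = np.random.default_rng(42)
x = np.linspace(0.0, 1.0, 60)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.size)

# Compare a constant, a straight line and a degree-5 polynomial.
scores = {degree: kfold_mse(x, y, degree) for degree in (0, 1, 5)}
for degree, mse in scores.items():
    print(f"degree {degree}: cross-validated MSE = {mse:.4f}")
```

The model with the lowest held-out error is the one the data prefer. No evidence integral is needed, although the answer can still depend on how you split the data.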
Joshua Bloom (UC Berkeley) talked about classification problems on Thursday morning. In particular, he recommended random forest. Random forest is a type of classification algorithm that uses a collection of decision trees to assign labels to data. Given a set of data (instances) with characteristics (features), you can train your algorithm to classify them. This is supervised learning: you have a set of training data for which you know the correct label. You train the classifier on a subset of the data, whilst holding some back. Then you test the classifier on the held-back data to see how it did. There are a few different types of classifying algorithms out there, but it turns out that random forest performs well on most problems. Random forest fever gripped the group and a load of (very successful) random forest hacks were implemented on Thursday afternoon, including an awesome hack by James Davenport (Washington) who used random forest to produce a periodogram! On Friday morning, Zeljko Ivezic took us through super-useful examples from the AstroML book, the book that a lot of the week was based upon.
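For anyone who wants to try the train-then-test workflow described above, here is a rough sketch using Scikit-Learn's `RandomForestClassifier`. It is not one of the hacks from the week. The two-feature toy data set, the class separation and the 25% hold-out fraction are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data (made up): two classes of instances, each with two features,
# drawn from Gaussian blobs centred at (0, 0) and (3, 3).
rng = np.random.default_rng(0)
n_per_class = 200
class_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n_per_class, 2))
class_b = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(n_per_class, 2))
features = np.vstack([class_a, class_b])
labels = np.concatenate([np.zeros(n_per_class, dtype=int),
                         np.ones(n_per_class, dtype=int)])

# Hold back a quarter of the labelled data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels
)

# Train a forest of 100 decision trees on the training subset only.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Score on the held-back data to see how the classifier did.
accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

With real data you would swap the toy blobs for your own feature table and known labels. The hold-out step is what tells you whether the classifier generalises beyond the instances it was trained on.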
The rules of hacking are simple:
1) Work with someone (whilst you can hack alone, it’s far more productive to work in small groups or pairs).
2) Be experimental (“fail fast”!).
3) Take advantage of the expertise in the room (there’s always someone who knows more than you in certain areas).
4) Offer help when needed (there’s always someone who knows less than you in certain areas).
Some participants arrived with clear hack ideas already laid out and some were happy to learn something new by contributing to other people’s hacks. Small groups and pairs formed immediately and everyone got to work. Michael Gully-Santiago (UT Austin) pitched a great idea: he’s taking all the figures that appear in the astroML book (for which the source code is available here) and making them interactive. A large group hacked on this for an entire afternoon. See some of the results here. Other hacks included using Gaussian processes to model time-series, multi-pixel hierarchical SED modelling, adding custom cell macros to the ipython notebook, and more. Check out the hack pad for an (incomplete!) list. Data science ethnographer Brittany Fiore (Washington), who observed us across the week, gave us an eerie outsider’s view of the hacking process. Her field notes offer a fascinating insight into the world of hacking and the behaviour of astronomers from a different perspective. Her most shocking observation was the air of imposter syndrome that seemed to hover over the group…
The ‘break-outs’ were without a doubt my FAVOURITE part of the week, and they were encouraged throughout the afternoons. These were short sessions (between half an hour and an hour) where an expert in the room delivered a class on a topic requested by participants. They were informal yet informative, and often completely raw, since the teachers were given barely an hour’s warning. Sometimes the ‘experts’ were just the people in the room with the most expertise, trying to teach topics that they were also learning. I’ve actually found that these can often be the best learning environments. When there is no ‘expert in the room’, there are no stupid questions and everyone tries to learn together. Dan Foreman-Mackey (NYU) delivered at least four break-outs, including his extremely popular Gaussian process tutorial (slides available here). Other break-outs covered hierarchical inference, nested sampling, probabilistic graphical models, Cython, Julia, neural networks and more.
I really think we should “hack” more as a community. When you sit down with a collaborator, don’t just talk: test out ideas together and see if you can get something working in an afternoon. ipython notebooks are super useful for this. Why not organise a hack day in your department? It’s a really good way to encourage cross-disciplinary communication. You might find out that a cosmologist uses the exact same methods as you, and you’d have never found out otherwise. Also, why don’t more conferences have hack sessions? We’re starting to see it at conferences like AAS, SPIE, .astronomy and NAM, but it should be the norm, not the novelty. And whilst the conferences mentioned above usually concentrate on outreach and website-building hacks, that doesn’t need to be the only focus of a hack day. Seriously useful scientific tools were built at the astro data workshop. Why stop there?
Each day ended with wrap-up and presentations (always very entertaining!), before we were kicked out of the ‘active learning space’. Hacking didn’t end, however! We just relocated to a pub and continued hacking well into the night, with beer for lubrication.
What’s the future of Astro Data Hack Week?
I really hope we see another of these hack weeks soon. Hack days at AAS and .astronomy are wonderful, but short. Having an entire week makes a huge difference to the amount you can achieve. Follow the Astro Data blog for updates and to see hacks as they get posted. Fingers crossed for a repeat next year!
And we're done! Thanks everyone for a great #AstroHackWeek 2014. Watch the website for updates, and (tentatively) see you at NYU next year!
— Astro Hack Week (@AstroHackWeek) September 20, 2014
mpld3 – matplotlib figures in the browser.
Seaborn – beautiful statistical plotting package.
Pep8 – make sure your python code is properly formatted.