As part of my “re-learn some statistics” goal for the year, I enrolled myself in the Data Analysis course taught through Coursera by Jeff Leek. It looked like a good survey of statistical and analytical methods and combined that with some R programming skills. So with that in mind, I signed myself up for the course and then spent eight semi-grueling weeks over February and March drinking from a firehose of singular value decompositions,
tapply statements, tree graphs, vectors, and every other thing that was packed in there. In the end I scored much better than I thought (“With Distinction”), but I wanted to jot down some reflections on the experience. In no particular order…
- 3-5 hours/week, my ass. The intro page for the course on the Coursera site asserted that the estimated workload was 3-5 hours each week. For someone working full-time (and with small children) I thought to myself: This seems totally doable. But I don’t know anyone who took the course and felt like the lectures, quizzes, written assignments, and optional reading could all comfortably fit into that 3-5 hour period. I took the course alongside a couple of co-workers and everyone pretty much said the same thing: the time required each week was underestimated. Most people spent six hours/week, minimum.
- No prior math knowledge required, my ass. Again: the intro page for the course made no mention that you would need any kind of advanced mathematical or statistical knowledge going into the class; the implication was that you’d be able to get by without it. As my “With Distinction” grade shows, this is apparently true. However, there were many times when I felt quite lost. A lecture would introduce some topic (singular value decompositions, I’m looking in your direction) and suddenly you need four semesters of college-level maths to get by. To me this was summed up by a comment left in a forum thread by one of the TAs:
“However, if you don’t have a background in Linear Algebra it may not help.”
Great! Thanks! Again: if you watched a given lecture three or four times, you could pick up enough to get by with that particular technique, but you were not going to understand its inner workings, and so any ability to make informed decisions based on the data was ultimately crippled. (Or at least hobbled.) This was a pretty big disappointment for me.
- I learned a lot, but not as much as I would have liked considering the effort. As I mentioned above, the level of effort that I wound up putting in exceeded what I expected to be necessary. When you start something and think “3-5 hours/week” and then it winds up being 10… You start to feel pretty invested. And when you are not feeling confident that you have learned–really learned–the material after all that time… It’s a little discouraging. The exposure to R was excellent; the exposure to the vocabulary was good; the exposure to the techniques was adequate… But at the end of it all I’m not sure that I am significantly better off than when I started. (On the other hand, I have this useful pile of notes.)
- It needed some kind of focus? This is a conclusion I arrived at about mid-way through the course, and in talking with peers, it sounds like they arrived at similar conclusions. Learning statistics on its own is useful but without a lens to guide your studies it winds up seeming very scattered. You need a common thread to pull all the techniques together, to weave them into a story. I have a feeling that this is why–outside of mathematics at least–statistics is taught attached to some discipline. It’s “bioinformatics” or “statistics for the social sciences” or “econometrics” or “computational physics” or “business intelligence” or some other such thing. And the type of data you expect to have is going to guide the type of analyses you do. If you just set out to study everything (as we did) then you’re studying a technique for the sake of topical breadth with no consideration for how useful it might actually be to your circumstances. And because it was all crammed in and we couldn’t really get to that level of depth on any one technique, I’m not sure that I know exactly when or why to use all the ones that we discussed. (As an aside to this, a slight self-criticism: I only had a vague idea of what I wanted to get out of the class as I went into it, so some of this “lack of focus” is my own fault. But only some of it.)
- I needed a textbook. When I followed along in my outside reading, and that outside reading was germane to what we were learning about in the lectures, then I felt like I did much better. I think I needed both. But I couldn’t keep up with both, and the reading wound up going on hold while I scrambled to keep up with the lectures.
- I needed more math. I already mentioned this, but it’s worth calling out again. The course description specifically said that you didn’t need a detailed math/stats background to “get” the data analysis course. However, I’m of the opinion that to get maximum value out of the techniques presented in the lectures, you really needed to understand the math behind them. And I couldn’t help but shake my head every time Jeff Leek would say “Don’t worry about the math behind this…” He’d get to the other side of some summation or square root or some bit of linear algebra and… “Just look for that.” And I’d think: “How did we get here? What does that mean? What am I supposed to do with it?”
- The R vs. stats chicken-and-egg question. I skipped Roger Peng’s “Computing for Data Analysis” course because (at the time) I thought that if I knew what statistical tools to use, then searching for things like “principal component analysis in R” would yield the results I needed. To an extent, I maintain that I made the right call — that if I needed to pick just one of the two that it would be Jeff Leek’s class (though take both for best results) — but I also think that I over-estimated the usefulness of “just knowing the names of the stats to use”. Indeed, if you know what statistical tool you need, then StackOverflow or CrossValidated is going to get you the rest of the way there. However, knowing which statistical technique to apply and when and which combinations of them to use… I think this takes more experience than you can get in just eight weeks of “3-5 hours/week”.
- I had a harder time with the format than I expected. My hands-on, small-class-size liberal arts education spoiled me. I really don’t know what to do in an anonymous lecture where I can’t get into arguments with the instructor and go to office hours and get personalized feedback on assignments. Online forums are terrible for that sort of thing.
Would I do it all again? Yes, I would. (Though I’m not doing another Coursera course for a while; that burnt me out.) Getting the high-level exposure to the statistics was fulfilling, and I think that as my readings progress, I’ll be in a better position to know where I should go deeper — and which techniques I’ll likely never visit again. I’ll get where I want to be with my statistical knowledge, but the Data Analysis course turned out to be the bus to the airport, and not the non-stop flight I was hoping for.
About Rob Friesel: Software engineer by day, science fiction writer by night; weekend homebrewer. Author of The PhantomJS Cookbook and a short story in Please Do Not Remove.