While researching data visualization, I came across an explosion of articles on "data scientists". The term seems to be the latest craze in social networking: all these web 2.0 companies are sitting on millions of users and their data, but little income, so why not find a way to turn that data into profit?
The trouble is that all of this is very confusing, to everybody. So let's start at the beginning.
What Is a Data Scientist?
According to Ryan Kim of GigaOm: "It's someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning. It's a set of skills that go beyond many existing job titles and it's increasingly in demand."
That definition is pulled from a taxonomy laid out by Bit.ly's data scientist, which breaks the work into five steps:
- Obtain: pointing and clicking does not scale.
- Getting a list of numbers from a paper via PDF or from within your web browser via copy and paste rarely yields sufficient data to learn something "new" by exploratory or predictive analytics. Part of the skillset of a data scientist is knowing how to obtain a sufficient corpus of usable data, possibly from multiple sources, and possibly from sites which require specific query syntax. At a minimum, a data scientist should know how to do this from the command line, e.g., in a UN*X environment. Shell scripting does suffice for many tasks, but we recommend learning a programming or scripting language which can support automating the retrieval of data and add the ability to make calls asynchronously and manage the resulting data. Python is a current favorite at time of writing (Fall 2010).
- Scrub: the world is a messy place.
- Whether provided by an experimentalist with missing data and inconsistent labels, or via a website with an awkward choice of data formatting, there will almost always be some amount of data cleaning (or scrubbing) necessary before analysis of these data is possible. As with Obtaining data, herein a little command line fu and simple scripting can be of great utility. Scrubbing data is the least sexy part of the analysis process, but often one that yields the greatest benefits. A simple analysis of clean data can be more productive than a complex analysis of noisy and irregular data.
- Explore: You can see a lot by looking.
- Visualizing, clustering, performing dimensionality reduction: these are all part of "looking at data." These tasks are sometimes described as "exploratory" in that no hypothesis is being tested, no predictions are attempted. Wolfgang Pauli would call these techniques "not even wrong," though they are hugely useful for getting to know your data. Often such methods inspire predictive analysis methods used later.
- Models: always bad, sometimes ugly.
- Whether in the natural sciences, in engineering, or in data-rich startups, often the "best" model is the most predictive model. E.g., is it "better" to fit one's data to a straight line or a fifth-order polynomial? Should one combine a weighted sum of 10 rules or 10,000? One way of framing such questions of model selection is to remember why we build models in the first place: to predict and to interpret…
- iNterpret: “The purpose of computing is insight, not numbers.”
- Consider the task of automated digit recognition. The value of an algorithm which can predict '4' and distinguish it from '5' is assessed by its predictive power, not on theoretical elegance; the goal of machine learning for digit recognition is not to build a theory of '3.' However, in the natural sciences, the ability to predict complex phenomena is different from what most mean by 'understanding' or 'interpreting'…
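To make the Obtain step concrete, here is a minimal Python sketch of the kind of scripted retrieval the taxonomy describes: building a query URL and parsing a structured response. The endpoint, parameters, and JSON payload below are made up for illustration; a real script would fetch the body over HTTP (e.g. with `urllib.request.urlopen`) rather than use a canned string.

```python
import json
from urllib.parse import urlencode

# Hypothetical API endpoint -- a stand-in for whatever site you are querying.
BASE_URL = "https://api.example.com/counts"

def build_query(metric, start, end):
    """Build a query URL with the site's required query syntax."""
    return BASE_URL + "?" + urlencode({"metric": metric, "start": start, "end": end})

def parse_response(body):
    """Turn a raw JSON response body into a list of (date, value) pairs."""
    records = json.loads(body)
    return [(r["date"], r["value"]) for r in records]

url = build_query("clicks", "2010-09-01", "2010-09-30")

# Canned response body standing in for the real HTTP call.
body = '[{"date": "2010-09-01", "value": 42}, {"date": "2010-09-02", "value": 51}]'
rows = parse_response(body)
```

The point is less the specific libraries than the habit: once retrieval is a function call, looping over many dates or many endpoints is trivial, which is exactly where pointing and clicking stops scaling.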
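As a taste of the Explore step, here is a toy one-dimensional k-means clusterer, one of the "looking at data" techniques the taxonomy mentions. This is a teaching sketch, not a production implementation (real work would reach for a library), and the data points are made up.

```python
def kmeans_1d(xs, k=2, iters=20):
    """Toy 1-D k-means: group points around k centers, no hypothesis needed."""
    # Crude initialization: pick k spread-out points from the sorted data.
    centers = sorted(xs)[:: max(1, len(xs) // k)][:k]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for x in xs:
            # Assign each point to its nearest center.
            i = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            groups[i].append(x)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.9, 10.1, 10.0]
centers = kmeans_1d(data)
```

No prediction is attempted here; the output simply reveals that the data fall into two clumps, which is the sort of "not even wrong" observation that often motivates the modeling that follows.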
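And for the Model step, a minimal sketch of fitting the simpler of the two candidates posed above, a straight line, by closed-form least squares, with an error measure one could use to compare it against a fancier model on held-out data. The data points are fabricated to lie exactly on y = 1 + 2x.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ~ a + b*x, in closed form."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

def mse(xs, ys, a, b):
    """Mean squared error: the yardstick for 'most predictive'."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
```

A fifth-order polynomial would fit these five training points at least as well, which is exactly why model selection is judged on prediction error for new data, not on how closely a model hugs the data it was fit to.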
The ability and the skills to perform all of these steps make a person highly valuable and, in my opinion, very interesting. The career pages of Facebook, Google, Twitter, Foursquare, LinkedIn and a host of other social sites are boiling over with openings for these folks; it is a full-on hiring race to land the best.
It appears that LinkedIn is leading in innovation right now with their Data Science Team, though I suspect that Facebook, Amazon, and Google are not far behind. Some of the features that the LinkedIn team has built are:
- “People You May Know”
- “Jobs You May be Interested In”
- "Groups You May Like"
- “Talent Match” (Post a job on LinkedIn and see the magic)
- “Signal” (news feed feature)
It's funny, because all of these sites seem to have these features. I can just imagine an arms race of sorts to build the next great "data science" feature, so they can keep bringing us back to their sites (and, hopefully for them, earn some ad money along the way).