How Do I Become A Data Scientist?

In the last post we tried to explain what a Data Scientist is. Now, with so many job openings for them, we look at how to become one.

A Quora thread offers some advice:

Strictly speaking, there is no such thing as “data science”. With that in mind, here are some resources I’ve collected about working with data. I hope you find them useful (note: I’m an undergrad student, so this is not an expert opinion in any way).

1) Learn about matrix computations:

Take a Computational Linear Algebra course. It is sometimes called Applied Linear Algebra, Matrix Computations, Numerical Analysis or Matrix Analysis, and it can be either a CS or an Applied Math course.

2) Start learning statistics

3) Learn about distributed systems and databases:

  • Note: this topic is not part of a standard Machine Learning track, but you can probably find courses such as Distributed Systems or Parallel Programming in your CS/EE catalog. I believe it is important to learn how to work with a Linux cluster and how to design scalable distributed algorithms if you want to work with big data. It is also becoming increasingly important to be able to use the full power of multicore processors.
  • Download Hadoop and run some MapReduce jobs on your laptop in pseudo-distributed mode.
  • Learn about the Google technology stack (MapReduce, BigTable, Dremel, Pregel, GFS, Chubby, Protobuf etc.).
  • Set up an account with Amazon AWS/EC2/S3/EBS and experiment with running Hadoop on a cluster with large data sets. You can use Cloudera or YDN images, but in my opinion you will understand the system better if you set it up from scratch, using the original distribution. Watch the costs.
  • Try out Hadoop alternatives, specifically minimalist frameworks such as BashReduce.
  • Run Brian Cooper’s Yahoo! Cloud Serving Benchmark (YCSB) on AWS and compare HBase vs Cassandra performance on a small cluster (6-8 nodes).
  • Run the LINPACK benchmark.
  • Run some experiments with MPI: try to implement a simple clustering algorithm with MPI and with Hadoop/MapReduce, then compare the performance, fault tolerance, ease of use etc. Learn the differences between the two approaches, and when it makes sense to use each one.
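
Before running real Hadoop jobs, it can help to see the MapReduce programming model in miniature. The sketch below is not Hadoop code and uses no Hadoop API; it is a single-process Python illustration of the three phases (map, shuffle, reduce) applied to the classic word-count problem. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as a framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence inside one process."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is big", "data about data"]
    print(word_count(sample))
```

In a real cluster the mappers and reducers run on different machines and the shuffle moves data over the network, which is where most of the performance and fault-tolerance questions mentioned above come from.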

4) Learn about machine learning

5) Learn about least-squares estimation and Kalman filters:

  • This is a classic topic and “data science” par excellence in my opinion. It is also a good introduction to optimization and control.
  • Start with Bierman’s LLS tutorial, given to his colleagues at JPL. It is clearly written and inspiring (the Apollo trajectory was estimated using these methods).
  • See Steven Kay’s series on statistical signal estimation.
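
To make the link between least squares and Kalman filtering concrete, here is a minimal sketch of the simplest possible case: estimating a single constant from noisy measurements. It is not taken from Bierman's or Kay's material. The function name `kalman_constant`, its parameters, and the sample numbers are assumptions chosen for illustration. With no process noise and a very uncertain starting guess, the recursive filter should land very close to the batch least-squares answer, which for a constant model is just the sample mean.

```python
def kalman_constant(measurements, meas_var, init_est=0.0, init_var=1e6):
    """Scalar Kalman filter for a constant state (no process noise).

    meas_var is the assumed measurement noise variance; init_var is a large
    prior variance that expresses an almost uninformative starting guess.
    """
    est, var = init_est, init_var
    for z in measurements:
        gain = var / (var + meas_var)   # Kalman gain: how much to trust z
        est = est + gain * (z - est)    # correct the estimate by the innovation
        var = (1.0 - gain) * var        # uncertainty shrinks after each update
    return est, var


def least_squares_constant(measurements):
    """Batch least-squares estimate of a constant: the sample mean."""
    return sum(measurements) / len(measurements)


if __name__ == "__main__":
    data = [1.1, 0.9, 1.2, 0.8]  # made-up noisy readings of one quantity
    recursive_est, recursive_var = kalman_constant(data, meas_var=0.04)
    batch_est = least_squares_constant(data)
    print(recursive_est, recursive_var, batch_est)
```

The recursive form processes one measurement at a time and never stores the full data set, which is the property that makes this family of methods attractive for tracking problems.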
