However, it seems to me that Data Science is commonly interpreted as a hybrid of two subjects: Data Analysis and Computer Science, with the latter carrying more weight. I will share why I think so.
You see, much of what has to be accomplished in data science is done by a computer.
Probability and statistical tools required in data analysis, such as the Student's t-test and the chi-square test, can be easily computed using software such as R - no more computing a t- (or z-) score and then referring to a table.
Data visualisation is also made easy by using R's ggplot2 or Python's matplotlib or seaborn - there is no need to draw and scale graphs by hand.
Then there is data manipulation and transformation, which can also be done easily in Python or R.
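To make the point about the t-test concrete, here is a minimal sketch of the arithmetic that software takes off your hands. It uses only Python's standard library; the function name `one_sample_t` and the numbers are purely illustrative and do not come from any of the resources mentioned in this post.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return the one-sample t statistic and its degrees of freedom."""
    n = len(sample)
    mean = statistics.mean(sample)
    # statistics.stdev uses the n - 1 (sample) denominator
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return (mean - mu0) / std_err, n - 1


# Hypothetical measurements, tested against a hypothesised mean of 2
t_stat, dof = one_sample_t([1, 2, 3, 4, 5], mu0=2)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

A full statistics package (for example `t.test` in R or `scipy.stats.ttest_1samp` in Python) would also report the p-value, which is the step that used to require the table.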
However, let's not forget the ultimate aim of data science: to gather insights from the data. Hence a data scientist has to be good at both programming and statistics. I opine that he or she does not need to be overly good at programming, just good enough to gather, clean, manipulate and analyse the data. But I would place more emphasis on the statistics part, because it is statistics that allows the data scientist to quantify the findings.
Here I list the resources I learnt from.
Statistics
As an engineer, I learnt statistics from JC till University. Also, my Master's degree in Industrial and Systems Engineering is heavy on statistics, although I do not think my statistics is very strong.
Nevertheless, here are the resources:
- Statistics for Experimenters: My lecturer swore by it because it is written by legends. This book is such a classic that it has had only two editions, yet it is easy to read because the concepts are delivered through real and relatable experiments.
- Applied Statistics and Probability for Engineers. This is another textbook, more like a supplement to my course.
- Seeing Theory - A visual introduction to probability and statistics is a cool site for learning the topic graphically.
In short: basically any probability and statistics textbook or resource that you can get hold of. The ones I listed are those I used.
Programming
I had exposure to R because my master's course was very statistics-orientated. R is free and open source, has a big community supporting it, and does the work. SPSS and SAS are simply too expensive to be accessible, in my opinion. One of my lecturers used R to do bootstrapping, which got me interested in learning more, and I had been wanting to learn a programming language anyway. R seemed to fit the bill. Hence my inclination is towards R when it comes to data science.
I learnt R mainly from DataCamp, and through tons of trial and error. You have to pay for a subscription, which I did, if you wish to access the more in-depth courses. Otherwise, the free way is to learn R in R using swirl. R also has a resource for learning probability and statistics, aptly called Introduction to Probability and Statistics Using R (IPSUR). RStudio is a powerful IDE for R, and it works well with Shiny and R Markdown.
I got to know Python along the way while researching R and Data Science. I started learning the basic syntax from Codecademy. Thereafter I learnt from courses on Udemy. The Python for Data Science and Machine Learning course on Udemy is an especially good primer for learning data science in Python. I discovered that Python is a very easy language to learn. Compared to R, its syntax is more elegant. It is also more versatile when it comes to big data, and it can be integrated with other languages and applications.
To learn Python well enough to write working scripts, I would recommend the book Automate the Boring Stuff with Python. I have also read other books on Python, such as Python for Data Science for Dummies and Python Programming for the Absolute Beginner.
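For readers who have not met bootstrapping: it estimates the uncertainty of a statistic by resampling the observed data with replacement many times. Below is a minimal sketch of a percentile bootstrap for the mean. It is written in Python with only the standard library, purely for illustration; my lecturer's R version is not reproduced here, and the function name `bootstrap_ci`, the data and the parameter defaults are all assumptions.

```python
import random
import statistics


def bootstrap_ci(data, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = sorted(
        # each resample draws len(data) values with replacement
        statistics.mean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical sample
data = [12.1, 9.8, 11.4, 10.9, 13.2, 10.1, 11.7, 12.5, 9.5, 11.0]
low, high = bootstrap_ci(data)
print(f"95% bootstrap CI for the mean: ({low:.2f}, {high:.2f})")
```

The appeal is that no formula for the standard error is needed: the same loop works for a median or any other statistic by swapping out `statistics.mean`.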
I also got to know about Kaggle and KDnuggets. Both are very good sites to browse for data science related information.
Machine Learning
I will regard machine learning as a computer science subject, although its statistical counterpart is called statistical learning. Anyway, statistical learning is normally implemented using a computer, and therefore machine learning is the more familiar term.
Machine learning is basically using a computer to identify patterns and then make predictions, without hard-coding the logic. In a way it seems like the machine learns about the data. It is a broad but interesting topic.
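A tiny sketch of what "without hard-coding the logic" means, assuming the simplest possible model: a straight line fitted by ordinary least squares. The function name `fit_line` and the data are hypothetical, and real projects would normally reach for a library rather than write this by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data; the underlying rule is not coded anywhere,
# it has to be recovered from the examples
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
prediction = slope * 10 + intercept  # predict for an input not in the data
print(f"learned y = {slope:.1f}x + {intercept:.1f}; prediction at x = 10: {prediction:.1f}")
```

Nothing in `fit_line` knows the relationship in advance; give it different examples and it will produce a different line, which is the "learning" part.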
I would strongly recommend these two resources:
- The video lectures on the book An Introduction to Statistical Learning with Applications in R, given by the authors of the book, who are Stanford University professors. Sit through the 15 hours' worth of lectures. It is worth it.
- Andrew Ng's Machine Learning course on Coursera. Coursera, co-founded by Ng, has a wealth of courses. However, his own course is the go-to for machine learning. I paid for the certification and it is worth it.
Projects
I think the fastest way to learn is to really do it hands-on. Sure, the courses will have exercises and assignments, but I do not think they are enough. I pick data sets that I think would be interesting to analyse, or personal projects that I wish to implement as a program.
I have:
- Attempted to analyse Melbourne's land price data from Kaggle and made predictions using Keras in Python
- Visualised land value in different areas in Singapore
- Written a Python script to scrape stock data and select stocks based on the Formula Investing methodology
Continuous Learning
Well, my reading never stops. These are the books that are in the queue:
- Data Analysis and Graphics Using R
- Machine Learning for Predictive Data Analytics
- Modelling Techniques in Predictive Analytics with Python and R
- Data-ism
- Algorithms to Live By
But I am not a Data Scientist, yet.
~Huat