30 December 2017

[Data Science] Machine Learning - Part 1, An Introduction

Sometimes I get questions from friends who are struggling with the term machine learning. How does a machine actually learn?

Well, to answer that question, we need to understand what data analytics, or data science, is.

I can go on and on about data analytics, but simply put, my definition is this: data analytics is about describing a set of data in the most general way possible so that we can make decisions or predictions.

We can use statistics to describe a data set - what is its mode? What is the mean? How about the variance and standard deviation?

We can also use statistics to test new data to see if it belongs with the data we already have.

We can draw a best-fit line (linear regression) and use statistics (the mean squared error) to check that it is indeed the 'best' fit. We can try to group data together by features (colour, size, etc.).

So we can see that data analytics consists of two parts - the statistics itself, and the means of deploying those statistical methods.

How can we describe the data?



Machine learning is one of the tools for the latter part - deploying statistical methods or other means (grouping, for instance) to describe data. Of course, we can deploy the methods manually - we can try to draw lines and derive their equations by hand. However, with the rapid growth in the size of data, doing so is becoming humanly impossible. Furthermore, computers are faster and less prone to mistakes. (And hence the term data science was born.)

In short, machine learning is the use of computers and algorithms to describe data. However, this is done not by explicitly coding the logic, but through iterative methods that find the set of model parameters that minimises inaccuracy (or maximises accuracy - I will explain this clumsy wording soon).

In statistics, the mean squared error (MSE) is one way to quantify the amount of error. The objective when fitting a line is to minimise the MSE, so that the fitted line can be used to make predictions.

A fitted line, but does it have minimum MSE?


In computer-science speak, inaccuracies are costs, described by a cost function that, curiously, often looks just like the MSE. All machine learning algorithms (or methods) aim to minimise this cost function.
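As a small concrete sketch (the data points below are made up purely for illustration), fitting a line and computing the MSE it tries to minimise can be done in a few lines of Python:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

m, c = np.polyfit(x, y, deg=1)          # least-squares fit of y = m*x + c

predictions = m * x + c
mse = np.mean((y - predictions) ** 2)   # the 'cost' that the fit tries to minimise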

I will elaborate in the next post.

In the meantime, these are some of the resources (free!) that helped me learn machine learning:
Video Lectures on Statistical Learning
This is a series of video lectures based on the book An Introduction to Statistical Learning (or ISLR for short). You can download a free PDF copy of the book. Statistical learning is statistician-speak for machine learning (which is computer-scientist-speak). It covers most of machine learning from a statistician's point of view, and I found it beneficial to go through. You can also find this course on Stanford's Lagunita website. It is free!
Machine Learning by Andrew Ng on Coursera
This is the de facto go-to course for learning about machine learning. Ng goes through the intuition behind the common machine learning algorithms. You will learn about matrix/vector multiplication (a lot of it!), and you will learn to use Matlab or Octave. From this angle, machine learning is nothing but a chunk of matrix multiplication. Nevertheless, IMO, it is a course worth the buck to get the certificate from Coursera. Ng also has a deep learning course that I am currently learning from.
Feel free to air your comments!

~ZF

[Investing] The Little Book of Common Sense Investing by John C. Bogle

I recently bought a copy of The Little Book of Common Sense Investing by John C. Bogle. A 10th anniversary edition, updated and revised, may I add.

At the top of the cover is a review by the Oracle, Warren Buffett:
"Rather than listen to the siren songs from investment managers, investors - large and small - should instead read Jack Bogle's The Little Book of Common Sense Investing."
That is it. Any investment book with the blessing of the Oracle himself must be a good book. You can also check out my reading list for the books I have read or am reading.

John C. Bogle is the founder and former chairman of the Vanguard Group. You may find the group a little familiar because it is indeed the Vanguard Group that manages the Vanguard 500 Index Fund. Warren Buffett has consistently backed investing in low-cost index funds for everyday investors; this is easily validated by a simple Google search. Also, Warren Buffett recently won a 10-year wager with a hedge fund manager; more about that story here.

Back to the book. It is a little book, but it is also a thick one, a good 270 pages. It contains the rationale behind investing in low-cost and diversified index funds, and why it is a common-sense thing to do. My key takeaways from the book are:
Participating in stock investing is actually a loser's game. Whether you profit or lose, the investment managers or brokers will always get a cut, so investors' earnings are continually eroded by transaction and management costs.
Only a low-cost and diversified index fund will allow an investor to get his fair share of returns in the stock market, through capital appreciation and dividends.
There is also a chapter on Exchange Traded Funds (ETFs). Bogle warns against buying ETFs that are sector- or industry-specific, and discourages trading ETFs.

So the investing principle is rather simple and straightforward (or common sense):

  1. Invest in a low-cost and diversified index fund.
  2. An ETF that tracks a broad index (such as the Straits Times Index) is a good proxy. However, do not trade the ETF.
  3. Hold a long position. Forever, if possible.

If you are starting out in investing, this may be a good book to read. If you are a seasoned investor, maybe this book can give you some ideas too.

To me, index investing is a simple, auto-pilot way of investing. There is no need to pick stocks or read the financial statements of the companies picked (unless you enjoy doing so, like I do). It is low cost, and risks are minimised.

Remember to check out my reading lists. There are some reading recommendations too!

Feel free to air your comments!

~ZF

29 December 2017

[Investing] Applied Dollar-Cost Averaging

In the previous post, I simulated the principle of dollar-cost averaging. Today I shall apply it to real market data.

The data of interest is the Nikko AM STI ETF (G3B), which tracks the Straits Times Index (STI). An ETF that tracks a broadly diversified index like the STI is similar to an index fund. Personally, I have invested in G3B via a regular savings plan, so this is also a little exercise for myself.

I have obtained five years' worth of monthly closing prices from Yahoo Finance for this exercise, and I will be investing $100 per month in G3B. Transaction fees are assumed to be negligible.


Price Vs Average Price

The average price is calculated by dividing the total amount invested by the units accumulated. The result is presented in the chart below.
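In code, a minimal sketch of that calculation might look like this (the variable names and prices below are mine, for illustration only; the actual workings use the Yahoo Finance data for G3B):

import numpy as np
import pandas as pd

monthly_amount = 100.0
prices = pd.Series([3.20, 3.15, 3.30, 3.05, 3.40])   # hypothetical monthly closing prices

units_bought = monthly_amount / prices                            # units purchased each month
total_units = units_bought.cumsum()                               # units accumulated to date
total_invested = monthly_amount * np.arange(1, len(prices) + 1)   # $100, $200, $300, ...

average_price = total_invested / total_units    # total amount invested / units accumulated
holding_value = total_units * prices            # worth of the units at the prevailing price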

As expected, the average price is smoothed out and almost constant. Consistent price = consistent cost. You can see that even with the very high prices between Periods 50 and 60, the average price is still maintained.


Amount Invested Vs Holding Value

The following chart tracks the changes in holding value (the worth of the units that are bought and held) against the total amount invested.


You might have noticed that the portfolio was started while prices were climbing towards a peak; the holding value exceeded the amount invested from about Period 15 onwards because of the rising prices, until about Period 30, when prices began to dip.

Between Periods 30 and 45 (slightly more than a year, by the way), prices were suppressed and the overall holding value declined too, but it was a good time for consolidating more units. As you can see, once everything recovered, the increase in holding value was much greater, because of that consolidation period.

This is why we need to stay invested.


What if you started late?

It is never too late to start. I have also prepared similar charts for a 2-year investment period.


The trough period helps to consolidate units while the prices are suppressed.


From the charts, we see that thanks to the trough period, the holding value exceeded the total amount invested quickly. Should an investor have started a year later, in the midst of bullish prices, he might not get that kind of performance. Instead, he might want to reduce (NOT stop) his contribution since his buying power has reduced, and increase it only when prices are low again.

See, dollar-cost averaging works if one stays invested for a long, long time.

Feel free to comment!


~ZF


28 December 2017

[Investing] What is Dollar-Cost Averaging? A Simulated Example

Suppose you have $100 each period, and you consistently buy as many units of an item as you can afford with that $100.

If the price of the item does not vary much, the number of units that you can buy will also not vary much.

However, if the price of the item varies from period to period, then you get to buy more when prices are low, and conversely, fewer when prices are high.

This is the principle of dollar-cost averaging, or DCA, the topic for today.

Let me illustrate with a simulation and some graphs. For the scenario, I am going to deploy DCA over 20 periods; each period I will spend $100 buying an item with a varying price. Prices are hypothetical and for illustration purposes only.
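Here is a minimal sketch of how such a simulation can be set up in Python with NumPy (the price series is randomly generated, so the exact numbers will differ from the charts below):

import numpy as np

np.random.seed(0)                                   # for reproducibility
periods = 20
budget = 100.0                                      # spend $100 each period

prices = 10 + 2 * np.random.randn(periods)          # hypothetical prices around $10
units = budget / prices                             # more units when prices are low
total_units = np.cumsum(units)
total_invested = budget * np.arange(1, periods + 1)

average_price = total_invested / total_units        # the smoothed cost per unit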


Buy More When Prices Are Low

Well, as expected, the number of units bought each period is a mirror image of the prices.



What is the overall effect?


A Boring But Stable Cost

A cost that can be potentially lower too. The graph below illustrates.



Over a long period, the price paid for the units will be smoothed out, at a level that is closer to the lows. Let's extend the simulation to 100 periods and see the overall effect.


Do you see that the average price (green line) seems to be trending down, despite the volatility of prices?
Just for completeness, over a period of 200, the average price does decrease.


Simulated Long Term Performance (Perceived)

This is an update. I reckon it might be useful to see the total value of the investment too.

The graph below shows the overall performance for period = 100.
Over a 100-period timeline, the total amount invested is $10,000, as shown by the green line. As time proceeds, the number of units increases, and the overall portfolio value becomes more sensitive to the price. Nevertheless, we can see that there is potential to exceed the invested value.

In this simulation, prices are made volatile intentionally. If an investor invests in an index fund, the prices should be more stable; he should do better than the simulated case.

One last point: the simulation also shows that one should not apply this to a stock whose price is this volatile; the intent of DCA is not to make the portfolio value move like a roller coaster.


Benefits of DCA

Of course, the simulation is based on random number generation. In the market, while this kind of price movement is not uncommon, prices are generally more cyclical. Nevertheless, I am confident that this little simulation of mine illustrates some very important features of DCA.

  1. The DCA is a great tool for long (very long, might I add) positions - it reduces the capital outlay, smooths out risks, and it is investing on auto-pilot.
  2. It is especially good for investing in index funds, or ETFs that track an index (such as the STI ETF, which tracks the STI).
  3. It is good for disciplined investing. Many options are available in the form of regular savings plans (RSPs), such as the POSB Invest-Saver.*
*This is not an endorsement or recommendation for any investment products


But...

The DCA is not without flaws. In sustained bullish periods, the number of units that can be bought diminishes until prices drop, and the drop makes the performance of the earlier purchases look extraordinarily bad. You can see this in the graph above: the first few periods, where there is a spike, are akin to this situation. This should not be too much of a concern, though, if the investor is going very long. One could also increase the principal invested when prices are lower to avert this situation, but the question remains - how low is low?

One of the golden rules in investing is to stay invested. DCA is one of the tools that can help the investor do so.

Feel free to comment!

~ZF

P/S: Simulations are performed and plotted using Python.


Afternote: I shall attempt to do a simulation with real data soon.


05 November 2017

[Data Science] So I Have Just Completed the Applied Data Science with Python Specialisation by University of Michigan in Coursera

This specialisation comprises 5 courses:
Course 1 - Introduction to Data Science in Python
Course 2 - Applied Plotting,  Charting & Data Representation in Python
Course 3 - Applied Machine Learning in Python
Course 4 - Applied Text Mining in Python
Course 5 - Applied Social Network Analysis in Python
I shall summarise what I think about the course:

The Good:
1) The course is structured very well and the pace is manageable.
2) All exercises are done and submitted in Jupyter notebooks and graded rather quickly.
3) Exercises are challenging. Learners have to do much research (through the Stack Overflow forums and the toolkits' documentation pages).
4) Personally, I learned more about Pandas and Numpy, and also about building neural networks using Scikit-Learn's MLP (multi-layer perceptron) module.

The Bad:
1) While the course is well structured, the materials are delivered quickly. It is reiterated many times that it is an applied course, and hence much of the intuition behind the methods is only introduced briefly.
2) Stack Overflow, Stack Overflow....
3) The assignment grading can be buggy, and with no expected answers provided, the learner just has to submit and see if they passed. I suppose such is life as a data scientist - we never know if we have the correct answers - so perhaps that is the intent? But much time is wasted just waiting.
4) Nothing to do with the course content, but I do not quite like Coursera charging a monthly fee instead of a lump sum. This makes the course very costly in the long run. For example, since this specialisation has 5 courses and each is presumably a month's worth of content, it will cost well above US$200 to get the certificate. Of course, there is the option to just audit the courses.

Who Should Take This Course:

I think this is a specialisation worth taking - after all, the University of Michigan is not a trivial name - but perhaps not one to linger on for too long. I think it is best to first learn the basics of Python, data science and machine learning through other, more cost-effective (by which I mean cheaper) means.

Here is what I would recommend:

For Python/Data Science
1) I find the book Automate the Boring Stuff with Python a good start, although I discovered it rather late. It introduces all the basics of programming in Python, such as creating functions and iteration, and also more advanced stuff like reading and automating spreadsheets, web crawling, and generally automating work using Python. It is easy to read and follow, and the digital copy of the book is free to browse at the link provided.
2) Jose Portilla's courses on Python and Data Science on Udemy are an excellent introduction to the subject. And with the constant barrage of sales and offers at Udemy (recently all courses were going for $10, for example), it is good to buy them and learn at your own pace. I have personally done the Data Science and Machine Learning Bootcamp with Python (and also the SQL and R ones). He also has a Python bootcamp.
3) Other resources such as DataCamp and Codecademy are good places to consider too. Personally, I started Python and R at Codecademy and DataCamp respectively.

For Machine Learning

1) The videos based on the book An Introduction to Statistical Learning with Applications in R are a must-watch if you wish to have a deeper understanding of the intuition behind machine learning algorithms, from a statistician's point of view. They also cover more advanced methods such as Support Vector Machines, but not neural networks. Statistical learning is the statistician's way of saying machine learning.

2) Andrew Ng has become a household name for machine learning. His Machine Learning course on Coursera is becoming a classic. This course approaches machine learning from the computer science POV, and Andrew is able to make difficult concepts easy to understand in his soft-spoken manner. The assignments are challenging and are in Matlab or Octave. It is through this course that I finally appreciated the wonders of matrix algebra (and yes, I am not shy to say I am an engineer). I am currently auditing his Deep Learning course, also on Coursera.


Currently I am still looking for opportunities to use what I have learnt at work (or new opportunities). I am using Python to scrape stock data and am devising a project to scrape financial report data for REIT analysis. More on this in future blogs.

~Zhifa

16 August 2017

[Investment] Stamford Tyres (SGX: S29)

This stock caught my eye and made me go further into research because it is trading at about 60% of its book value (BV). The price at the time of writing is S$0.34; BV is S$0.53.


Now for some key statistics

EPS = $0.03
P/E = 9.88


Financial health looks OK, but the company has a lot of inventory, although it appears to be decreasing. Now, buying the stock seems to be a good deal from the BV point of view: buy a piece of the assets at a discount of about 40%, and the assets have a return of 2.3% (5-year average). The company is constantly buying plant, property and equipment too, based on the cash flow statement. However, remember that a sizeable chunk of the assets is actually inventory - a good 80M out of 390M.

Profitability is actually not so OK. The company is recovering (somewhat) from a massive 80% decrease in profit in 2015. (http://www.straitstimes.com/business/companies-markets/stamford-tyres-full-year-profit-sinks-83) I have not figured out what happened, but it was claimed to be some forex and business climate issue. (TBD later)

Cash-flow-wise, the company seems to be quite profitable organically. But I notice it has been alternating between long-term loans, trust receipts and revolving loans in 2014, 2015 and 2016, and I wonder why that is. A huge proportion of the cash balance is actually made up of loans. For example, in 2015 it had about 15.8M of cash, but in the background there was about 18M of loans going around. In 2014 it was about 14M of loans versus 18.6M of net cash (in all fairness, there was a huge capital outlay of 22M to purchase PPE).

Its joint ventures in HK and India seem to be picking up; in particular, the JV in India has become profitable recently. This could make buying the stock at S$0.34 sensible, on top of the BV consideration mentioned earlier.

There is a double-edged sword worth mentioning. I found that the company's receivables are also very high, at around 70M. This is recorded as an asset because it is potential cash. However, as the Keppel saga shows (http://www.businesstimes.com.sg/stocks/hot-stocks-keppel-and-sembcorp-marine-fall-more-than-6), this 70M could be greatly affected if the economy turns sour and the companies that owe money start to fold.

There are also numbers that I cannot make sense of yet. For example, despite a decrease in inventory from 2015 to 2016, there is no corresponding increase in revenue; instead, revenue fell.

There is potential in this company if:
  1. Its revenue growth continues to be sustained
  2. Its cost-control methods continue to be effective
  3. It controls its debt well
  4. Its JVs remain profitable, or become more so

~Huat





21 July 2017

[Data Science] Whatever I Know About Data Science

Data Science is the subject of gathering insights from data. Yes, it is a fancy name for Statistics.

However, it seems to me that Data Science is commonly interpreted as a hybrid of two subjects - Data Analysis and Computer Science - with the weight on the latter seeming heavier. I will share why I think so.

You see, much of what has to be accomplished in data science is done by a computer.

Probability and statistical tools required in data analysis, such as the Student's t-test and the chi-square test, can be easily computed using software such as R - no more computing a t- (or z-) score and then referring to a table.
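For instance, a two-sample t-test is a one-liner in Python's SciPy (the R equivalent is t.test); the two samples below are made up purely for illustration:

import numpy as np
from scipy import stats

a = np.array([5.1, 4.9, 5.3, 5.0, 5.2])
b = np.array([4.6, 4.8, 4.5, 4.9, 4.7])

t_stat, p_value = stats.ttest_ind(a, b)   # no manual t-score or table lookup needed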

Data visualisation is also made easy by using R's ggplot2 or Python's matplotlib or seaborn - there is no need to draw and scale graphs by hand.

Not forgetting data manipulation and transformation. These can be done easily by using Python or R too.

However, let's not forget the ultimate aim of data science - to gather insights from the data. Hence a data scientist has to be good at both programming and statistics. I opine that he does not need to be overly good with programming - just good enough to gather, clean, manipulate and analyse the data. But I would place more emphasis on the statistics, because it is statistics that allows the data scientist to quantify the findings.

Here I list the resources I learnt from.


Statistics

As an engineer, I learnt statistics from JC through to university. Also, my Master's degree in Industrial and Systems Engineering was heavy on statistics, although I do not think my statistics is very strong.

Nevertheless here are the resources:


In short: basically any probability and statistics textbook or resource that you can get hold of. The ones listed are those I used.


Programming

I had exposure to R because my master's course was very statistics-oriented. R is free and open source, has a big community supporting it, and it does the work; SPSS and SAS are simply too expensive to be accessible, in my opinion. One of my lecturers used R to do bootstrapping, which got me interested in learning more, and I had been wanting to learn a programming language anyway - R seemed to fit the bill. Hence my inclination towards R when it comes to data science.

I learnt R mainly from DataCamp, and tons of trial and error. You have to pay for a subscription, which I did, if you wish to access the more in-depth courses. Otherwise, the free way is to learn R within R using swirl. R also has a resource for learning probability and statistics, aptly called Introduction to Probability and Statistics Using R (IPSUR). RStudio - along with Shiny and R Markdown - is a powerful IDE for R.

I got to know Python along the way while researching R and data science. I started learning the basic syntax from Codecademy, and thereafter learnt from courses on Udemy. The Python for Data Science and Machine Learning Bootcamp on Udemy is an especially good primer for learning data science in Python. I discovered that Python is a very easy language to learn. Compared to R, its syntax is more elegant; it is also more versatile when it comes to big data, and it can be integrated with other languages and applications.

To really learn Python to a functional level - that is, to be able to write working scripts - I would recommend the book Automate the Boring Stuff with Python. I have also read many books on Python, such as Python for Data Science For Dummies and Python Programming for the Absolute Beginner.

I also got to know about Kaggle and KDnuggets. Both are very good sites to browse for data science related information.


Machine Learning

I will regard machine learning as a computer science subject, although its statistical counterpart is called statistical learning. Anyway, statistical learning methods are normally implemented using a computer, and therefore machine learning is the more familiar term.

Machine learning is basically using a computer to identify patterns and then make predictions, without hard-coding the logic. In a way, it seems as if the machine learns about the data. It is a broad but interesting topic.

I would strongly recommend these two resources:


Projects

I think the fastest way to learn is to really get hands-on. Sure, the courses will have exercises and assignments, but I do not think that is enough. I would pick up data sets that I think would be interesting to analyse, or personal projects that I wish to implement as a programme.

I have:



Continuous Learning

Well, my reading never stops. These are the books that are in the queue:





But I am not a Data Scientist, yet.

~Huat

19 July 2017

[Investing] Netlink NBN Trust Debuts Today - Did You Huat?

So, the very hyped Netlink NBN Trust debuts today. The ticker is CJLU.
"NetLink opened trading at S$0.815 per unit before edging down to S$0.810 on its first day of trade." - From CNA (link above)
All in all, it went up half a cent, or about 0.6%. By the way, this 0.6% would be wiped out by transaction costs on selling.

Sparked by curiosity, I found that ShareInvestor has a wealth of information, including data on historical IPOs and their performance. There are impressive winners, such as those whose stock prices increased by more than 100% (e.g. UnUsUal, Samurai); of course, there are impressive losers too. It seems, by eyeballing (i.e. not validated), that there are more losers than winners.

I regard IPOs as a form of speculation. Firstly, there are market participants aiming to flip a profit from the fact that IPOs generally gain a decent amount during the first week of trading. This usually works if one has a lot of capital; it does not make sense for small-timers like me. For example, if I could only afford $1,000, a 45% increase would just be $450. On the other hand, it would be $45,000 for a person with $100,000. Sometimes absolute numbers are important.

Secondly, prospectuses are always very optimistic, but remember that a prospectus is a way to gather capital; it is no different from a sales brochure. There will be much hoo-ha about how the business is going to do well, how good the leadership is, and so on. Fortunately, the one good thing that comes out of a prospectus is the numbers. Numbers do not lie. Analyse them carefully to see if the investment is worthwhile.

Finally, there is no certainty that an IPO will move, say, 20% or 40% or even -10% when it opens. We can hope, but there is no certainty. I wonder how it will do over the next few days, especially since there could be a massive sell-off by opportunists hoping to profit from an opening-day spike.

Since we are talking about certainty, there is a line in the CNA article that I quote:
"As a business trust, the future cashflow is predictable, so there is a lack of imagination on this kind of IPO,"
Just to note: as a long-term value investor, I appreciate certainty and predictability; I do not need or want to be imaginative. I look for stocks that are profitable, financially strong and sustainable, with a healthy cash flow.

For CJLU, my view still holds: I am not going to consider it for some time, even if it means a missed opportunity. You can read my opinion in the previous blog post.

~Huat

10 July 2017

[Investing] IPO Review - Netlink NBN Trust

There is much hype about this IPO because, firstly, it is going to raise $2.3 billion, the biggest since Hutchison Port Holdings (IPO in 2011: US$1; current price as of writing: $0.45), and it is dubbed one of the 'blockbuster' IPOs of 2017.

Netlink is not a stranger to households. They were formerly known as OpenNet; our broadband is set up by them, and they (probably) own the optical fibre network here.

A good resource for the 'boring' points can be found here.

The salient points are:

  • Each unit is going for S$0.82
  • About 2.9 billion shares are going to be issued
  • Total Market Capitalisation is about S$3B
  • Average EPS is 1.5 cents
  • Cash is S$92M, Debt is at S$1.6B
  • Dividend Yield is about 5%
  • Use of proceeds: 1) purchase of Singtel's assets and 2) repayment of a S$1.1B loan to Singtel
Normally, my analysis would cover Financial Strength, Profitability and Sustainability. But I already have many questions, and hence red flags, just from listing the points above, even before reading the prospectus.

A Very High P/E

Just from the points above, the P/E of the trust is going to be around 54x. One conventional way to look at P/E is as the number of years for our investment to break even - in this case, 54 years. I also look at its reciprocal, E/P, the earnings yield, which comes to about 1.8%. That is, every dollar invested in this trust earns about 1.8 cents.
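The quick arithmetic behind those two figures, using the unit price and average EPS listed above:

price = 0.82                    # S$ per unit
eps = 0.015                     # S$ per unit (average EPS of 1.5 cents)

pe = price / eps                # about 54.7x
earnings_yield = eps / price    # about 1.8%, i.e. 1.8 cents per dollar invested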

The high P/E could also mean much hype or expectations for the trust.

To me, anything greater than 30 is too high, although there is really no hard-and-fast rule for P/E.


Too Little Cash, Too Much Debt

As stated, there is only S$92M of cash but S$1.6B, or S$1,600M, of debt - that's almost 20x the cash. Even if the proceeds from the IPO are used to pay off S$1.1B of it, S$0.5B, or S$500M, remains. That is still over 5x the cash!

From the EPS, I infer that there will be challenges repaying that debt from business operations alone, unless they grow the business, which brings me to my next point.


Lack of Growth Plans

It seems that there are no growth plans (or maybe there are). It is stated that 'fixed residential wired broadband household penetration stood at 88% as of Dec 2016', so I suppose there is still 12% of potential. I did not see any other plans, like R&D or overseas expansion.


Sustainability of Dividend 

I highly doubt the sustainability of the dividend at 5%. A 5% dividend at S$0.80 translates to about S$0.04 per unit. Recall that EPS is about 1.5 cents, that is S$0.015. I do not know where the trust is going to find the money to top up the difference.


A Cash Bump for Singtel

It seems that most, if not all, of the proceeds are going to Singtel. Not only does Singtel receive repayment of its loan to Netlink, but it is also able to 'dispose' of some of its assets for cash. If you ask me, Singtel seems to be the big winner here.


Conclusion

I believe the red flags I have listed are enough to deter me from participating in this IPO. I could still take part and pocket whatever gains come after it is launched, but I would not hold it for too long - not with a P/E that high, and not when I cannot figure out how it is going to sustain the dividends. What I am saying is that there are other opportunities around. Remember that after six years, Hutchison Port Holdings is now at 45% of its IPO value.

Other References:
  1. http://www.theedgemarkets.com/article/blockbuster-listings-pipeline-singapore-2h17-deloitte
  2. http://www.straitstimes.com/business/netlink-nbn-trust-set-to-be-biggest-ipo-in-singapore-in-six-years-with-pricing-at-81-cents
  3. https://www.shareinvestor.com/fundamental/factsheet.html?counter=NS8U.SI

~Huat

01 July 2017

[PSA] Battle of the Milk

In this post, I explore the costs of various milk.

My son was weaned off his mother's milk about three months ago and we have been feeding him formula milk. I have been wondering why formula milk is so expensive. Recently, there was much controversy over the price of formula milk powder in Singapore. Older folks will remember KLIM; even older ones will remember drinking condensed milk. Some people think all milk is the same, but if that is so, why are the prices different? Is it solely because of marketing?

My wife and I have been contemplating a switch from formula milk to fresh milk. Our concerns include whether our boy would accept it and whether he is sensitive or allergic to fresh milk. Besides the formula milk, we tried giving him fresh milk, and recently we have been trying a "less-branded" formula milk. Luckily, he seems receptive to all of them.

Left to Right: Pura Fresh Milk (1L), Dumex Dugro (700g pack), Enfagrow (900g tin)


Mead Johnson's Enfagrow 3
900g
S$45.30
Serving size: 22
Cost per serving: S$2.06

My son has been drinking this as a supplement to his mother's milk. There are actually two flavours available - original and vanilla; I only learnt this much later, after I accidentally bought the vanilla one without knowing.

Pros: Well-known brand (best-selling internationally somemore). Nutrition is superior to the other two.
Cons: Not sure if the superior nutrition is absorbed by the boy. And pricey.

Pura Fresh Milk
1L
S$3.60
Serving size: 4
Cost per serving: S$0.90

There are many brands of fresh milk available. Milk of Australian and/or New Zealand origin is preferred. Other brands that can be considered are Farmhouse, Paul's and Marigold. Meiji, by the way, is from Thailand, although it is a Japanese brand.

Pros: Fresh. And cheap (cheapest).
Cons: Need to replenish regularly, but not buy too much at once because of the shorter shelf life. Need to refrigerate it, THEN warm it before feeding.


Dumex Dugro Stage 3
700g
S$18.90
Serving size: 18
Cost per serving: S$1.05

We only started trying this recently because of the 'hassle' of fresh milk. The nutrition level is not as high as Enfagrow's, but that's OK because our son is eating solids too, and there is always fish, pork and other goodies in those. By the way, I grew up drinking KLIM.

Pros: Seems like a balance of the other two options - cheap and convenient.
Cons: Can't think of any yet, except that its cheapness has some psychological effect.


To me, there is a lot of psychological warfare in this formula milk powder thing. 'You mean you can't part with $2 per feed, to provide the best for your child?'.

As parents, we want the best for our children, but we have to be realistic about the price we pay and the benefits it can bring to our children.

There is the price factor and then there is the nutrition factor. Enfagrow has almost double the nutrients, such as DHA and choline, compared to Dumex Dugro. (Double those things, double the price; seems legit.) I am not an expert, but I doubt that my child is able to absorb all those nutrients in one feed of formula milk (educate me if I am wrong, please!).

We are lucky that our boy is eating well, is not too picky yet, and is receptive to the milk 'experiments' we subject him to. And he is not allergic to either fresh milk or formula milk. So, fortunately for us, a cheaper alternative turns out to be a good middle ground.

~Huat








24 June 2017

[Data Science] Predicting Melbourne's Land Value Using Deep Learning

This series of blog posts documents my application of Keras to predicting land values in Melbourne.

The Keras library in Python allows neural network/deep learning models to be implemented easily. For more information, refer to its documentation page. Most resources can be found on DataCamp (where I learnt most of my data science) and Stack Overflow.

The data set is from Kaggle.

This implementation in Keras is by no means comprehensive; indeed, it is intended for my own learning and practice.

For my code and plots, you can refer to my GitHub page here.


The Melbourne Housing Data

The data consists of 19 columns with over 14,000 rows of data. The following screenshot (from Kaggle) summarises the data details.


Obviously, the Price will be the subject of interest. Intuitively, questions about how the other parameters relate to the price will follow. I have explored some of them in this exercise, which I am going to share in the next section.


Cleaning the Data

There are missing data and outliers. Hence, before fully exploring and discovering relationships between the parameters, the data needs to be cleaned.

For the treatment of missing data, I opted to impute the median, because it is a more robust measure given that there are outliers in the data. I did this by first defining a function, then using groupby and transform in Pandas:
def impute_median(series):
    return series.fillna(series.median())

df['Landsize'] = df.groupby('Suburb')['Landsize'].transform(impute_median)

For the treatment of outliers, I omitted a value if it is more than num standard deviations from the mean:
df = df[np.abs((df['Land Price'] - df['Land Price'].mean())) < (num * np.std(df['Land Price']))]
I defined num as 2 in my code. The same treatment is applied to Building Area and Land Size too.

Some rows with values still missing after imputation are dropped. This amounts to about 1,000 rows. That sounds sizeable, but I am left with 13,000+ rows, which I think is still enough for a meaningful study.

I have also converted the prices to thousands.
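For completeness, that conversion is a one-liner; I am assuming here that the price column carries the name used in the outlier snippet above:

df['Land Price'] = df['Land Price'] / 1000   # express prices in thousands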


Exploring the Data

The best way to explore the data is through visualisation. Plots grouped by suburb are produced; however, there are 140+ suburbs and it is not meaningful to have them all in one plot, so I am just going to share a few here. All plots can be found at my GitHub link.

(All plots are made using ggplot in Python. As you might know, ggplot2 is a popular visualisation package in R.)


Effect of Rooms In Different Suburbs



Observations:
1. Some suburbs command higher prices, even for fewer rooms. These are generally closer to the CBD.
2. Prices seem to peak at 4 or 5 rooms, and taper off thereafter.

(I notice there are data points for 2.5, 3.5 rooms, etc. I have to look at the code again to see if I coded anything wrongly.)

Effect of Landsize and Property Type






Observations:
1. Prices for houses (h) are generally higher. There seems to be no relationship between house prices and land size.
2. The prices for development sites (t) and units (u) are generally insensitive to land size too.

Price per unit Landsize Vs Building Area per Landsize

This seems quite similar to the previous plots, but it explores whether bigger buildings (relative to their land size) command higher prices. Prices are also normalised to the land size. The property type is shown by colour.





Observations:
1) Here, it seems that as long as the building area is large relative to the land size, the property is likely to command a higher price, regardless of type.


Implementing the Model

There are four main steps to implementing the deep neural network using Keras:
1. Define the model architecture
2. Compile the model
3. Fit the model with training dataset
4. Make predictions



Step 1: Define the Model Architecture

The simplest architecture in Keras is the Sequential model. We can define the model by using the following steps:
import numpy as np
from keras.models import Sequential

model = Sequential()

Now, layers can be added to the model using the .add() method. The simplest layer type is the Dense layer, where all the nodes in adjacent layers are connected.

For the first layer of the model, the number of columns of the input data needs to be specified via the input_shape argument.

We also need to define the activation function of the layer.
from keras.layers import Dense

n_cols = predictors.shape[1]
model.add(Dense(100, activation = 'relu', input_shape = (n_cols,)))  # input_shape expects a tuple
model.add(Dense(100, activation = 'relu'))
model.add(Dense(1))

For this model, there is one input layer, one hidden layer and one output layer. Here we use the Rectified Linear Unit, ReLU, as the activation function. This function returns the input value if it is positive, and 0 otherwise.
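In code, ReLU is nothing more than the following (a toy definition for illustration; Keras provides it built in as 'relu'):

import numpy as np

def relu(x):
    # returns x when x is positive, and 0 otherwise
    return np.maximum(0, x)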


Step 2: Compile the model

Once the model architecture is defined, the model needs to be compiled. The aims of compiling the model are to:
1. specify the optimiser for backpropagation
2. specify the loss function

The optimiser can be customised to set the learning rate, which is an important component of a neural network.
model.compile(optimizer = 'adam', loss = 'mean_squared_error')  # note: the keyword is 'optimizer'

adam is one of the optimisers built into Keras. There are many other optimisers available; do check out the documentation for more information.

For a classification problem, the loss function would be 'categorical_crossentropy', and an additional argument metrics = ['accuracy'] is added to make the model easier to assess.
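That classification variant would look something like this (not needed for the regression problem in this exercise):

model.compile(optimizer = 'adam',
              loss = 'categorical_crossentropy',
              metrics = ['accuracy'])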

Now the model is ready for fitting.


Step 3: Fit the model with training dataset

Fitting the model is simply:
model.fit(predictors, target)

Validation can be performed during the fitting process by defining the validation_split argument, which holds out a fraction of the data for validation.

The training can also be stopped early if there is no improvement over additional runs (epochs). This can be done by defining an early-stopping monitor using the EarlyStopping callback, then passing it via the callbacks argument of the fit method.
from keras.callbacks import EarlyStopping
esm = EarlyStopping(patience = 3)

model.fit(predictors, target,
          validation_split = 0.3,
          epochs = 20,
          callbacks = [esm])

By defining patience = 3, the fitting will stop if there are 3 consecutive epochs with no improvement. Also, the default number of epochs is 10.

Additional point to note: the inputs, namely predictors and target, must be Numpy arrays, otherwise there will be errors. If the data is in a pandas DataFrame, it can be converted using the as_matrix() method or the .values attribute.
predictors.as_matrix()
predictors.values



Step 3.5: Saving and Loading the Model

Use the .save() method to save the model. Note that the h5py package is required because models are saved in the HDF5 (.h5) format.
import h5py
model.save('model.h5')

To load the model, simply:
from keras.models import load_model
my_model = load_model('model.h5')

Now the model can be used to make predictions.



Step 4: Make predictions

To make predictions, simply use the .predict() method.
pred = model.predict(data_to_predict_with)
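As a rough sketch, a mean absolute error can then be computed by comparing the predictions against the actual prices on held-out data; the names X_test and y_test below are mine, and this is not necessarily how the percentages in the next section were derived:

from sklearn.metrics import mean_absolute_error

pred = model.predict(X_test)               # X_test: held-out predictors (hypothetical split)
mae = mean_absolute_error(y_test, pred)    # y_test: the corresponding actual prices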


Evaluating the Model

In this exercise, I tried various configurations and used the mean absolute error (MAE) to assess the models. The results are:

2 Hidden Layers
50 nodes, MAE = 33%
100 nodes, MAE = 24%
200 nodes, MAE = 23%

6 Hidden Layers
200 nodes, MAE = 22%

It appears that the best model (so far) is the 2-hidden-layer configuration with 200 nodes. Other configurations tend to be less accurate, or computationally expensive (especially the bigger architectures).

The following can be tweaked to improve the model:
1. Change the activation function,
2. Change the optimiser,
3. Determine the best learning rate.

These are left for future exploration.

Learning Points

1. The Keras library is indeed a useful tool for building and prototyping a neural network quickly. However, it helps tremendously if one has some background in neural networks. One course I would recommend is Machine Learning by Andrew Ng.

2. While ambitiously trying to build a large neural network, I ran into errors. These had to do with the data structures and the computation in the neural network (especially the backpropagation part). I should spend more time studying the basics again.

3. Since machine learning uses a lot of linear algebra and matrix notation (as in Point 1), knowing how to use the Numpy library is important. Indeed, the inputs to the model have to be Numpy arrays, as mentioned previously.

Conclusion

This was a good exercise in predicting land prices using a neural network. More importantly, I have gained a little more understanding of the Keras library. Just as importantly, I have learnt how to embed code in this Blogger site :)

I hope you enjoyed reading this as much as I enjoyed working with Keras and putting this together.

~Huat