24 June 2017

[Data Science] Predicting Melbourne's Land Value Using Deep Learning

This series of blog posts documents my application of Keras to predicting land values in Melbourne.

The Keras library in Python makes it easy to implement neural network/deep learning models. For more information, refer to its documentation page. Most of the resources I used can be found on DataCamp (where I learnt most of my data science) and Stack Overflow.

The data set is from Kaggle.

This implementation of Keras is by no means comprehensive; it was intended for my own learning and practice.

For my code and plots, you can refer to my GitHub page here.


The Melbourne Housing Data

The data consists of 19 columns and over 14,000 rows. The following screenshot (from Kaggle) summarises the data details.


Naturally, Price is the variable of interest, and questions about how the other parameters relate to it follow. I explored some of these questions in this exercise and will share them in the next section.


Cleaning the Data

The data contains missing values and outliers, so before exploring the relationships between the parameters, it needs to be cleaned.

For the treatment of missing data, I opted to impute the median of each suburb, since the median is more robust to the outliers in the data than the mean. I did this by first defining a function, then using pandas' groupby and transform:
def impute_median(series):
    return series.fillna(series.median())

df['Landsize'] = df.groupby('Suburb')['Landsize'].transform(impute_median)

For the treatment of outliers, I dropped a row if its value lies more than num standard deviations from the mean:
df = df[np.abs(df['Land Price'] - df['Land Price'].mean()) < (num * np.std(df['Land Price']))]
I set num = 2 in my code. The same treatment is applied to Building Area and Land Size too, as in the sketch below.
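Since the same filter is applied to several columns, it can be wrapped in a loop. A minimal sketch (the column names are assumed from the code above and the data set):
import numpy as np

num = 2  # number of standard deviations to keep
for col in ['Land Price', 'BuildingArea', 'Landsize']:
    df = df[np.abs(df[col] - df[col].mean()) < num * np.std(df[col])]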

Some rows still had missing values after imputation and were dropped. This amounts to about 1,000 rows; that sounds sizeable, but I am left with 13,000+ rows, which I think is still enough for a meaningful study.

I have also converted the prices to thousands of dollars.
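In pandas this is a one-liner (assuming the price column named in the filter above):
df['Land Price'] = df['Land Price'] / 1000  # express prices in thousands of dollars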


Exploring the Data

The best way to explore the data is through visualisation. I plotted the data grouped by suburb; however, there are 140+ suburbs, so it is not meaningful to show them all in one plot. I will just share a few here. All plots can be found at my GitHub link.

(All plots are produced using the ggplot port for Python. As you might know, ggplot2 is a popular visualisation package in R.)


Effect of Rooms In Different Suburbs



Observations:
1. Some suburbs command higher prices, even for fewer rooms. These are generally closer to the CBD.
2. Prices seem to peak at 4 or 5 rooms and taper off thereafter.

(I notice there are data points at 2.5, 3.5 rooms, etc. I need to revisit the code to check whether I coded something wrongly.)

Effect of Landsize and Property Type






Observations:
1. Prices for houses (h) are generally higher. There seems to be no relationship between house prices and land size.
2. Prices for development sites (t) and units (u) are generally insensitive to land size too.

Price per unit Landsize Vs Building Area per Landsize

This is quite similar to the previous plots, but here I explore whether bigger buildings (relative to their land size) command higher prices. Prices are also normalised by land size, and the property type is shown by colour.





Observations:
1. It seems that as long as the building area is large relative to the land size, the property is likely to command a higher price, regardless of the property type.


Implementing the Model

There are four main steps to implementing the deep neural network using Keras:
1. Define the model architecture
2. Compile the model
3. Fit the model with training dataset
4. Make predictions



Step 1: Define the Model Architecture

The simplest architecture in Keras is the Sequential model, which can be initialised as follows:
import numpy as np
from keras.models import Sequential

model = Sequential()

Now, layers can be added to the model using the .add() method. The simplest layer type is Dense, where every node is connected to all nodes in the adjacent layers.

For the first layer of the model, the number of columns in the data needs to be specified via the input_shape argument, which takes a tuple.

We also need to define the activation function of the layer.
from keras.layers import Dense
n_cols = predictors.shape[1]
model.add(Dense(100, activation = 'relu', input_shape = (n_cols,)))  # input_shape must be a tuple
model.add(Dense(100, activation = 'relu'))
model.add(Dense(1))

This model has two hidden layers of 100 nodes each and a single-node output layer (the input layer is specified implicitly via input_shape). Here we use the rectified linear unit, ReLU, as the activation function. This function returns the input value if it is positive, and 0 otherwise, i.e. max(0, x).


Step 2: Compile the model

Once the model architecture is defined, the model needs to be compiled. The aims of compiling the model are to:
1. specify the optimiser for backpropagation
2. specify the loss function

The optimiser can be customised, for example to set the learning rate, which is an important component of a neural network.
model.compile(optimizer = 'adam', loss = 'mean_squared_error')

adam is one of the optimisers built into Keras (note that the argument name itself uses the American spelling, optimizer). There are many other optimisers available; do check out the documentation for more information.

For a classification problem, the loss function is set to 'categorical_crossentropy', and the additional argument metrics = ['accuracy'] is added to make it easier to assess the model.
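For example, a classification model might be compiled along these lines (a sketch; not used in this exercise):
model.compile(optimizer = 'adam',
              loss = 'categorical_crossentropy',
              metrics = ['accuracy'])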

Now the model is ready for fitting.


Step 3: Fit the model with training dataset

Fitting the model is simply:
model.fit(predictors, target)

Validation can be performed during the fitting process by setting the validation_split argument, which holds out a fraction of the data for validation.

The model can also be stopped early if additional runs (epochs) bring no improvement. This is done by defining an early-stopping monitor with the EarlyStopping function, then passing it via the callbacks argument of the .fit() method:
from keras.callbacks import EarlyStopping
esm = EarlyStopping(patience = 3)

model.fit(predictors, target,
          validation_split = 0.3,
          epochs = 20,
          callbacks = [esm])

By setting patience = 3, fitting will stop if there are 3 consecutive epochs with no improvement. Also, the default number of epochs is 10.

Additional points to note: the inputs, namely predictors and target, must be NumPy arrays, otherwise there will be errors. If the data is in a pandas DataFrame, it can be converted using the .as_matrix() method or the .values attribute:
predictors.as_matrix()
predictors.values



Step 3.5: Saving and Loading the Model

Use the .save() method to save the model. Note that the h5py package is required, because models are saved in the HDF5 (.h5) format.
import h5py
model.save('model.h5')

To load the model, simply:
from keras.models import load_model
my_model = load_model('model.h5')

Now the model can be used to make predictions.



Step 4: Make predictions

To make predictions, simply use the .predict() method.
pred = model.predict(data_to_predict_with)
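
Putting the four steps together, here is a minimal end-to-end sketch for this regression problem. (This is for illustration only: predictors and target are the NumPy arrays from the cleaned data, and new_data is a hypothetical array of unseen inputs.)
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping

# Step 1: define the architecture
n_cols = predictors.shape[1]
model = Sequential()
model.add(Dense(200, activation = 'relu', input_shape = (n_cols,)))
model.add(Dense(200, activation = 'relu'))
model.add(Dense(1))

# Step 2: compile
model.compile(optimizer = 'adam', loss = 'mean_squared_error')

# Step 3: fit with a validation split and early stopping
esm = EarlyStopping(patience = 3)
model.fit(predictors, target,
          validation_split = 0.3, epochs = 20, callbacks = [esm])

# Step 4: predict
pred = model.predict(new_data)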


Evaluating the Model

In this exercise, I tried various configurations and used the mean absolute error (MAE) to assess the models. The results are:

2 Hidden Layers
50 nodes, MAE = 33%
100 nodes, MAE = 24%
200 nodes, MAE = 23%

6 Hidden Layers
200 nodes, MAE = 22%

It appears that the best trade-off (so far) is the 2-hidden-layer configuration with 200 nodes: the 6-layer model is only marginally more accurate while being considerably more computationally expensive, and the smaller configurations are noticeably less accurate.
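For reference, an MAE expressed as a percentage of the mean price can be computed along these lines (a sketch; the test-set variable names are hypothetical):
import numpy as np

# Sketch: MAE as a percentage of the mean price
pred = model.predict(predictors_test).ravel()
mae_pct = np.mean(np.abs(pred - target_test)) / np.mean(target_test) * 100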

The following can be tweaked to improve the model:
1. Change the activation function,
2. Change the optimiser,
3. Determine the best learning rate (a sketch follows).
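For instance, the learning rate can be set by passing an optimiser object instead of a string; a minimal sketch (the rate 0.001 is just a placeholder):
from keras.optimizers import Adam

# Sketch: tune the learning rate via the optimiser object
model.compile(optimizer = Adam(lr = 0.001), loss = 'mean_squared_error')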

These are left for future exploration.

Learning Points

1. The Keras library is indeed a useful tool for building and prototyping a neural network quickly. However, it helps tremendously to have a background in neural networks. One course I would recommend is Machine Learning by Andrew Ng.

2. While ambitiously trying to build a large neural network, I ran into errors. These had to do with the data structures and the computation in the neural network (especially the backpropagation part). I should spend more time studying the basics again.

3. Since machine learning uses a lot of linear algebra and matrix notation (as covered in the course in Point 1), knowing how to use the NumPy library is important. Indeed, the inputs to the model must be NumPy arrays, as mentioned previously.

Conclusion

This was a good exercise in predicting land prices using a neural network. More importantly, I have gained a little more understanding of the Keras library. Just as importantly, I have learnt how to embed code on this Blogger site :)

I hope you enjoyed reading this as much as I enjoyed working things out with Keras and compiling this post.

~Huat

[Data Science] Embedding Python Codes in Blogger Using SyntaxHighlighter

You know, I had been rendering my R code to HTML files in R Markdown (Rmd), then copying the whole HTML over to paste into Blogger.

What I needed was to embed code snippets.

I came across this awesome post that does just that using SyntaxHighlighter; credit goes to that blogger and the creator of SyntaxHighlighter. And not forgetting Stack Overflow.

Here's an example:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import h5py

Now it's a success! I can share my data science work here with proper code snippets!

Yay~

Here are the details of setting it up, as some of the steps in the posts linked above are outdated:

In Blogger:

1. Go to Template > Edit HTML (it is a small icon below the blog preview).

2. Copy and paste this code immediately after the <head> tag:
<script src='http://alexgorbatchev.com/pub/sh/2.1.364/scripts/shCore.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/2.1.364/scripts/shBrushPython.js' type='text/javascript'></script>
<link href='http://alexgorbatchev.com/pub/sh/2.1.364/styles/shCore.css' rel='stylesheet' type='text/css' />
<link href='http://alexgorbatchev.com/pub/sh/2.1.364/styles/shThemeDefault.css' rel='stylesheet' type='text/css' />
(In the linked post, the quotation marks in these URLs are rendered as &#x27;, which Blogger does not accept.)

3. Save the template.

4. Go to Layout.

5. Add an HTML/JavaScript gadget. Any location will do; I chose one in the sidebar.

6. Add a title for the gadget. Remember this title, because you will need to locate it later in the HTML code in order to hide the gadget. Use SyntaxHighlighter if you can't think of one.

7. In the content box, include this code:
<script class='javascript'>//<![CDATA[
SyntaxHighlighter.config.clipboardSwf = 'http://alexgorbatchev.com/pub/sh/2.1.364/scripts/clipboard.swf';
SyntaxHighlighter.all();
//]]></script>
8. Save the gadget.

9. Now, head back to Template > Edit HTML. Locate the widget id of the gadget you just added; it can be found within the b:widget tag. (It is HTML1 for me.)

10. Locate the <b:skin> ... </b:skin> section. Expand it and, just before ]]></b:skin>, include this code to hide the gadget:
#HTML1 {display:none;}
11. Save the template and all is good to go!

Note: Somehow it does not work in Preview.


~Huat

17 June 2017

[Data Science] An Approach to Predict Land Prices

An Approach to Model and Predict Land Values

Introduction

This write-up presents my thought process for approaching the problem of modelling and predicting land value. It consists of the following parts.

Firstly, I will share what I learnt about common land valuation methods while researching the topic. Next, I will describe the workflow. Thereafter, a simple linear model will be discussed conceptually. Finally, a case study on the historical sale prices of landed residential sites in Singapore will be shared.

This write-up is written in R Markdown.

Common Land Valuation Methods

There are 3 common land valuation methods:
  • Income Method. This method estimates the value of the land by assessing its net operating income and dividing it by the capitalisation rate. This is essentially a discounted cash-flow (DCF) model. (A small worked example follows this list.)
  • Cost Method. This method estimates the value of the land by estimating the cost of developing and maintaining it.
  • Comparison Method. This is a subjective approach where the land of interest is benchmarked against properties with similar features. These features are:
    • Population and demographics. This indicates whether the land is crowded (if it is heavily populated) and the quality of life (vibrant, quiet, etc.).
    • Security, e.g. crime rates.
    • Utility or planning (zoning) of the land, i.e. what the land is planned to be used for, e.g. residential, industrial, recreational, etc.
    • Inventory status of land with similar features, i.e. how unique the developed land will be, and whether other properties with similar features are already available.
    • Holding costs associated with the property, such as taxes and fees, should the land not be rented or sold within the expected time frame.
    • Accessibility pertaining to transport: whether the property is well connected and easily accessible, including access to airports and seaports.
    • Accessibility pertaining to amenities: whether the property is well served by facilities and local conveniences.
    • Opinion of real estate agents, i.e. how professionals would value the land. Computer-aided Mass Appraisal (CAMA) and Geographic Information Systems are the two most common tools used by real estate professionals.
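
As a quick worked illustration of the income method (the figures are hypothetical): a site generating a net operating income of \(NOI = \$50{,}000\) per year, at a capitalisation rate of \(r = 5\%\), would be valued at \(V = NOI / r = 50{,}000 / 0.05 = \$1{,}000{,}000\).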

The Work Flow

The general work flow is summarised in the figure below:
[Figure: workflow diagram]

The Land Value Model

The approach is to combine the three methods (income, cost and comparison) described in the previous section. Data on the features listed under the Comparison Method above, for various plots of land, will be used to build a model, together with tax, rent and land value data.

The main idea is to model the land value using the features (zoning, population, demographics, taxes, rents, transport, amenities, etc.). Cross-validation is used: the data collected is separated into training and test sets.
The linear architecture (least-squares regression) will be considered first. Mathematically,
\(LV = x_1 \theta_1 + x_2 \theta_2 + \dots + x_m \theta_m\)
or in vector form
\(LV = X^T \Theta\)
where \(LV\) is the land value, \(x_m\) is the value of the \(m\)-th feature and \(\theta_m\) is the corresponding weight from the linear model. The merit of training a linear architecture first is model simplicity. Furthermore, using appropriate statistical methods (such as the t-test), the effect of each feature can be determined, and features that are not significant can be omitted.
The model is then validated using the test data. If the model achieves the desired level of accuracy, it is used to predict the value of land plots whose values are unknown but whose feature data is available. Otherwise, the modelling method is revisited: parameters are tweaked or other modelling architectures are considered.
Some considerations:
  • A simple model is always preferred, because it involves fewer features and is hence less costly in terms of data collection.
    • However, model simplicity may come at a trade-off with predictive power.
  • Land values may not be readily available. Hence, the model may instead be used to predict the tax payable or the rent that could be yielded; a discounted cash-flow (DCF) model may then be used to estimate the land value.
  • Additional study of the effects of some features, such as the effect of properties in the vicinity, may be required. Sensitivity analysis may be performed here to determine whether there are interactions, including confounding and correlation, between the features.
    • For example, if the land is crowded but its infrastructure is poor, I would suspect that the land value will not be high. There is an interaction between population and infrastructure in this case.

A Case Study

In this example, I will use historical sales data of landed residential sites from www.ura.gov.sg to perform data analysis. The data records the prices at which different sites were successfully tendered between 1993 and 2013.
Required Libraries
library(xlsx)
library(ggplot2)
library(rpart)
library(rpart.plot)
Importing and Cleaning Data.
# Import Data
df <- read.xlsx("ura-landed-housing-sites.xlsx", 1)

# Clean Data
df <- df[, -2]  #omit second column
df <- na.omit(df)  #omit additional cells that are imported as NA's
Data Summary
summary(df)
##  Date.of.Launch       Date.of.Award       
##  Min.   :1993-06-24   Min.   :1993-11-06  
##  1st Qu.:1993-10-07   1st Qu.:1994-01-22  
##  Median :1994-11-09   Median :1995-01-12  
##  Mean   :1996-11-11   Mean   :1997-02-15  
##  3rd Qu.:1996-12-18   3rd Qu.:1997-04-07  
##  Max.   :2013-03-28   Max.   :2013-06-24  
##                                           
##                             Location    Type.of.Development.Allowed
##  Jalan Chempaka Kuning          :  2   Semi-detached  :138         
##  Ang Mo Kio Avenue 2            :  1   Terrace        : 73         
##  Ang Mo Kio Avenue 2 / Avenue 5 :  1   Bungalow       : 50         
##  Bedok Walk                     :  1   Mixed Landed   : 20         
##  Bunga Rampai Place             :  1   2 Semi-Detached:  7         
##  Chestnut Avenue / Almond Avenue:  1   1 Bungalow     :  3         
##  (Other)                        :327   (Other)        : 43         
##   Lease..yrs. Site.Area..m2.     No..of.Bids 
##  Min.   :99   Min.   :  394.1   NA     :103  
##  1st Qu.:99   1st Qu.:  465.6   3      : 40  
##  Median :99   Median :  600.7   2      : 36  
##  Mean   :99   Mean   : 2423.4   5      : 29  
##  3rd Qu.:99   3rd Qu.: 1399.3   4      : 27  
##  Max.   :99   Max.   :41883.0   6      : 19  
##                                 (Other): 80  
##                                            Name.of.Successful.Tenderer
##  City Developments Limited                               : 32         
##  Erishi Holdings Pte Ltd                                 : 26         
##  Orchard Parade Land Pte Ltd                             : 22         
##  Winspeed Investment Pte Ltd & Winwave Investment Pte Ltd: 14         
##  Bullion Properties Pte Ltd                              : 13         
##  Glamouray Development Pte Ltd                           : 11         
##  (Other)                                                 :216         
##  Successful.Tender.Price. X.psm.per.Site.Area     Planning.Area
##  Min.   :   680000        Min.   : 839.5      Bedok      :220  
##  1st Qu.:  1684395        1st Qu.:2809.6      Serangoon  : 56  
##  Median :  2024000        Median :3390.9      Sembawang  : 38  
##  Mean   :  7852921        Mean   :3398.5      Bukit Timah:  6  
##  3rd Qu.:  4595722        3rd Qu.:3936.4      Ang Mo Kio :  5  
##  Max.   :366000000        Max.   :9775.5      Hougang    :  3  
##                                               (Other)    :  6
From the summary, Bedok has had the most landed residential sites for sale (220 out of 334), followed by Serangoon and Sembawang. Most of the developments allowed were semi-detached.

Data Visualisation

div <- 1000  # scale the site area to thousands of sqm
g1 <- ggplot(df, aes(x = Site.Area..m2./div, y = X.psm.per.Site.Area)) + 
  geom_point(pch = 21, size=2, aes(fill = Planning.Area)) + guides(fill=FALSE) +#add points
  facet_wrap(~Planning.Area,ncol =5) + #separate into grids
  geom_hline(aes(yintercept = mean(X.psm.per.Site.Area)), linetype = "dashed", color = "red") + #add overall average
  geom_smooth(se=FALSE, method = "lm", formula = y~1, color = "blue") + #add group averages
  theme_bw() 

g1 + labs(title = 'Landed Housing Sites', subtitle = 'Data Source: www.ura.gov.sg', x = 'Area of Site (sqm) [in Thousands]', y = 'Price per sqm', caption = "The red dashed line represents the mean price of the data set, while the blue line represents the mean price of the location.") #add axis labels

Observations

A number of observations can be made from the plot:
  1. The effect of location is evident in the plot.
  2. It is expected that the average price in Bedok is very close to the overall mean since it has the most sites on offer.
  3. There is a price disparity for similar site areas in the top three locations (Bedok, Sembawang and Serangoon). This is most obvious in Sembawang, which has the widest spread. This could be due to the time span of the data, with prices increasing over the years due to inflation, etc.
  4. In terms of average price per unit area, Sembawang is the most expensive. This is interesting, because one would have expected Serangoon to be pricier given its closer proximity to the city centre. However, scrutiny of the period in which the Serangoon sites were sold is required.

Next Steps

The next steps to this project are:
  1. Clean the data further to account for the time value of money, bringing prices to current levels. That is, determine the period (e.g. year) in which each site was sold and calculate its current value. (A sketch of this adjustment follows this list.)
  2. Once the current values are established, investigate whether a site's price is affected by nearby sites, and gather data about them. For example:
  • Serangoon is located near Ang Mo Kio and Hougang, yet those sites were sold at noticeably lower prices.
  • The factors that could have led sites in Sembawang to command higher prices.
  • Likewise, the factors that could have let sites in Bedok command prices that are somewhat competitive.
  3. With the additional data from 1 and 2, a location-specific model can be built to predict the land value (or the tender price), based on the approach discussed in the previous section. This can be considered a future project.
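
For step 1, the adjustment could take the form of a compounding factor (a sketch, assuming a constant annual rate \(r\)): a site sold at price \(P_t\) in year \(t\) has an estimated current value of \(P_T = P_t (1 + r)^{T - t}\), where \(T\) is the reference year. For example, with \(r = 3\%\), a site sold for \(\$1{,}000{,}000\) in 2003 would be valued at about \(1{,}000{,}000 \times 1.03^{10} \approx \$1{,}344{,}000\) in 2013.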

Conclusion

I hope I have presented my thought process for approaching a problem concisely. I enjoyed analysing the landed residential site data because the outcome was rather interesting. There is more work that could be done on this data set; however, in the interest of time, it is better left as a future project.

Update - Additional Analyses

Continuing from the previous section, I decided to analyse the data further to see whether anything fruitful could be derived, in particular how the Planning Area, Site Area and Type of Development Allowed affect the price per sqm of a property. I achieved this using recursive partitioning (rpart), a tree-based model.

To study the Type of Development Allowed, however, the data requires further cleaning before the analysis can proceed, in particular reducing the development types to a few main categories, namely Terrace, Bungalow, Semi-Detached, Mixed and Landed.
The code below cleans the data:
# Extract data
dev_type <- as.vector(df$Type.of.Development.Allowed)  #this creates a vector of characters.
# dev_type <- c(t(dev_type)) #converts into a vector, for easier
# implementation of regex

# Remove leading digits (e.g. '2 Semi-Detached')
dev_type <- stringr::str_replace_all(dev_type, "^\\d+", "")

# Replace all with appropriate key words

for (i in 1:length(dev_type)) {
    if (grepl("or", dev_type[i])) {
        dev_type[i] = "Mixed"
    } else if (grepl("&", dev_type[i])) {
        dev_type[i] = "Mixed"
    } else if (grepl("Mixed", dev_type[i])) {
        dev_type[i] = "Mixed"
    } else if (grepl("Landed", dev_type[i])) {
        dev_type[i] = "Landed"
    } else if (grepl("Terrace", dev_type[i])) {
        dev_type[i] = "Terrace"
    } else if (grepl("Bungalow", dev_type[i])) {
        dev_type[i] = "Bungalow"
    } else if (grepl("Semi-detached", dev_type[i], ignore.case = TRUE)) {
        dev_type[i] = "Semi-Detached"
    }
    
}

# Converts data type to factor
df$Development_Type <- factor(dev_type)

Modelling

With the data cleaned, we can proceed with the modelling, followed by the plot.
rpart_mod <- rpart(X.psm.per.Site.Area ~ Development_Type + Planning.Area + 
    Site.Area..m2., data = df)
prp(rpart_mod, type = 2)

The model splits on Planning Area first, followed by the Site Area and finally the Type of Development. As discussed earlier, the two most lucrative areas appear to be Bedok and Sembawang. Larger site areas are worth more, as expected.

End