A Little Here, A Little There: [Data Science] An Approach to Predict Land Prices

An Approach to Model and Predict Land Values

Introduction

This write-up is to present my thought processes to approach the problem of modelling and predicting land value. It consists of the following parts.
Firstly, I will share what I learnt about the common land valuation methods while researching about the topic. Next, I will share about the workflow. Thereafter, a simple linear model will be discussed conceptually. Finally, a case study on the historical sale prices of landed residential sites in Singapore will be shared.
This write-up is written in R-markdown.

Common Land Valuation Methods

There are 3 common land valuation methods:

Income Method. This method estimates the value of the land by assessing its net operating income and dividing it by the capitalization rate. This is essentially a discounted cash-flow (DCF) model.
Cost Method. This method estimates the value of the land by estimating the cost of developing and maintaining the land.
Comparison Method. This is a subjective approach where the land in interest is benchmarked with a property with similar features. These features are:
- Population and demographics. this will determine if the land is crowded (if it is heavily populated) or not, and if the quality of life (vibrant, quiet, etc).
- Security e.g. crime rates.
- Utility or Planning (or Zoning) of the land, i.e. how is the land being planned to use, for e.g. residential, industrial, recreational, etc.
- Inventory status of the land with similar features, i.e. how unique will this land developed to be, or is there other properties with similar features already available.
- Holding costs associated with the property, such as taxes and fees, should the land could not be rented or sold out in the expected time frame.
- Accessibility pertaining to transport: whether the property is well-connected and easily accessible, and the applicability of airports and seaports.
- Accessibility pertaining to amenities: whether the property is well stocked up with facilities and local conveniences.
- Opinion of real estate agents, i.e. how would the professionals value the land. Computer-aided Mass Appraisal (CAMA) and Geographic Information Systems are the two most common tool used by real estate professionals.

The Work Flow

The general work flow is summarised in the figure below:
image:

The Land Value Model

The approach shall be to combine the three methods (income, cost and comparison) as described in the previous section. The data of the features, described in the Common Land Valuation Methods paragraph 3, for various plots of land shall be used to build a model, together with the tax, rent and land value.
The main idea is to model the land value with the features (zoning, population, demographics, taxes, rents, transport, amenities, etc). Cross-validation is used, that is the data collected is separated into training and testing data set.
The linear architecture (regression, least square methods) shall be considered first. Mathematically,

\(LV = x_1 \theta_1 + x_2 \theta_2 + ... +x_m \theta_m\)

or in vector form

\(LV = X^T \Theta\)

\(LV\) is the land value, \(x_m\) is the value of \(m\) feature and \(\theta_m\) is the resulting weight from the linear model for the \(m\) feature. The merit of training with a linear architecture first is for model simplicity. Furthermore, using appropriate statistical methods (such as the t-test), the effects of the features can be determined and features can be omitted if they are not deemed significant.
The model is then validated using the test data. If the model achieved the desired level of accuracy, it is then used to predict lands that have unknown values, but with the data of the features available. Otherwise, the modelling method is revisited, parameters are tweaked or other modelling architectures shall be considered.
Some considerations:

A simple model is always preferred, because it involves fewer features and hence less costly to collect data.
- However model simplicity may come with a trade-off with predictive power of the model.
Land Values may not be readily available. Hence, the model may be used to predict tax payable or rents that could be yield instead. Then a discounted cash-flow (DCF) model may be used to estimate the land value.
Additional study on the effects of some features such as the effect of properties in the vicinity may be required. Sensitivity analysis may be performed here. This is to determine if there is any interaction, including cofounding and correlation, between the features.
- For example, if the land is to be crowded but its infrastructure is not good, I would suspect that the land value will not be high. There is interaction between population and infrastructure in this case.

A Case Study

In this example, I will use historical sales data of residential site from and www.ura.gov.sg to perform data analysis. The data records the price at which different sites were successfully tendered between 1993 and 2013.

Required Libraries

library(xlsx)
library(ggplot2)
library(rpart)
library(rpart.plot)

Importing and Cleaning Data.

# Import Data
df <- read.xlsx("ura-landed-housing-sites.xlsx", 1)

# Clean Data
df <- df[, -2]  #omit second column
df <- na.omit(df)  #omit additional cells that are imported as NA's

Data Summary

summary(df)

##  Date.of.Launch       Date.of.Award       
##  Min.   :1993-06-24   Min.   :1993-11-06  
##  1st Qu.:1993-10-07   1st Qu.:1994-01-22  
##  Median :1994-11-09   Median :1995-01-12  
##  Mean   :1996-11-11   Mean   :1997-02-15  
##  3rd Qu.:1996-12-18   3rd Qu.:1997-04-07  
##  Max.   :2013-03-28   Max.   :2013-06-24  
##                                           
##                             Location    Type.of.Development.Allowed
##  Jalan Chempaka Kuning          :  2   Semi-detached  :138         
##  Ang Mo Kio Avenue 2            :  1   Terrace        : 73         
##  Ang Mo Kio Avenue 2 / Avenue 5 :  1   Bungalow       : 50         
##  Bedok Walk                     :  1   Mixed Landed   : 20         
##  Bunga Rampai Place             :  1   2 Semi-Detached:  7         
##  Chestnut Avenue / Almond Avenue:  1   1 Bungalow     :  3         
##  (Other)                        :327   (Other)        : 43         
##   Lease..yrs. Site.Area..m2.     No..of.Bids 
##  Min.   :99   Min.   :  394.1   NA     :103  
##  1st Qu.:99   1st Qu.:  465.6   3      : 40  
##  Median :99   Median :  600.7   2      : 36  
##  Mean   :99   Mean   : 2423.4   5      : 29  
##  3rd Qu.:99   3rd Qu.: 1399.3   4      : 27  
##  Max.   :99   Max.   :41883.0   6      : 19  
##                                 (Other): 80  
##                                            Name.of.Successful.Tenderer
##  City Developments Limited                               : 32         
##  Erishi Holdings Pte Ltd                                 : 26         
##  Orchard Parade Land Pte Ltd                             : 22         
##  Winspeed Investment Pte Ltd & Winwave Investment Pte Ltd: 14         
##  Bullion Properties Pte Ltd                              : 13         
##  Glamouray Development Pte Ltd                           : 11         
##  (Other)                                                 :216         
##  Successful.Tender.Price. X.psm.per.Site.Area     Planning.Area
##  Min.   :   680000        Min.   : 839.5      Bedok      :220  
##  1st Qu.:  1684395        1st Qu.:2809.6      Serangoon  : 56  
##  Median :  2024000        Median :3390.9      Sembawang  : 38  
##  Mean   :  7852921        Mean   :3398.5      Bukit Timah:  6  
##  3rd Qu.:  4595722        3rd Qu.:3936.4      Ang Mo Kio :  5  
##  Max.   :366000000        Max.   :9775.5      Hougang    :  3  
##                                               (Other)    :  6

From the summary, Bedok has had the most landed residential (220 out of 334) sites for sale, followed by Serangoon and Sembawang. Most of the developments that were allowed were mostly semi-detached developments.

Data Visualisation

g1 <- ggplot(df, aes(x = Site.Area..m2./div, y = X.psm.per.Site.Area)) + 
  geom_point(pch = 21, size=2, aes(fill = Planning.Area)) + guides(fill=FALSE) +#add points
  facet_wrap(~Planning.Area,ncol =5) + #separate into grids
  geom_hline(aes(yintercept = mean(X.psm.per.Site.Area)), linetype = "dashed", color = "red") + #add overall average
  geom_smooth(se=FALSE, method = "lm", formula = y~1, color = "blue") + #add group averages
  theme_bw() 

g1 + labs(title = 'Landed Housing Sites', subtitle = 'Data Source: www.ura.gov.sg', x = 'Area of Site (sqm) [in Thousands]', y = 'Price per sqm', caption = "The red dashed line represents the mean price of the data set, while the blue line represents the mean price of the location.") #add axis labels

Observations

A number of observations can be made from the plot:

The effect of location is evident in the plot.
It is expected that the average price in Bedok is very close to the overall mean since it has the most sites on offer.
There is a price disparity for similar site area in the top three sites - Bedok, Sembawang and Serangoon. This is most obvious in Sembawang, which has the widest spread. This could be because of the time period of the data and prices increase due to inflation, etc.
By considering the average price per unit area, Sembawang is the most expensive. This is interesting because one would have expect Serangoon to be pricer given its closer proximity to the city centre. However, scrutiny is required on the period at which the sites in Serangoon were sold.

Next Steps

The next steps to this project are:

Clean the data further to account for the time value of money to current level. That is, determine the period (e.g. year) at which the site was sold and calculate the current value of the site.
Once the current values can established, we can investigate if there are effects of the price on a site from its proximity sites, and gather data about them. For example,

Serangoon is located near Ang Mo Kio and Hougang, yet these sites where sold at a noticeably lower price.
The factors that could have let sites in Sembawang command a higher price.
Likewise, the factors that could have let sites in Bedok command prices that are somewhat competitive.

With the additional data from 1 and 2, a location-specific model can be built to predict the land value (or the tender price), based on the approach discussed in the previous section. This can be considered as a future project.

Conclusion

I hope I have presented my thought processes to approach a problem concisely. I enjoyed analysing The landed residential site data because the outcome was rather interesting. There are more work that could be done on this data set, however due to the interest of time and intent, this is better left as a future project.

Update - Additional Analyses

Continuing from the previous section, I decided to analyse the data further to see if anything fruitful could be derived, in particular, how the Planning Area, Site Area and Type of Development Allowed affects the Price Per SQM of the property. I achieve this using Recursive Partitioning, or rpart, which is a tree-based model.
However, if we study the Type of Development Allowed, the data requires further cleaning before analyses can proceed, in particular, reducing the type of development to the few main types, namely terrace, bungalow, mixed, landed, etc.
The code below cleans the data:

# Extract data
dev_type <- as.vector(df$Type.of.Development.Allowed)  #this creates a vector of characters.
# dev_type <- c(t(dev_type)) #converts into a vector, for easier
# implementation of regex

# Remove all numbers
dev_type <- stringr::str_replace_all(dev_type, "^\\d+", "")

# Replace all with appropriate key words

for (i in 1:length(dev_type)) {
    if (grepl("or", dev_type[i])) {
        dev_type[i] = "Mixed"
    } else if (grepl("&", dev_type[i])) {
        dev_type[i] = "Mixed"
    } else if (grepl("Mixed", dev_type[i])) {
        dev_type[i] = "Mixed"
    } else if (grepl("Landed", dev_type[i])) {
        dev_type[i] = "Landed"
    } else if (grepl("Terrace", dev_type[i])) {
        dev_type[i] = "Terrace"
    } else if (grepl("Bungalow", dev_type[i])) {
        dev_type[i] = "Bungalow"
    } else if (grepl("Semi-detached", dev_type[i], ignore.case = TRUE)) {
        dev_type[i] = "Semi-Detached"
    }
    
}

# Converts data type to factor
df$Development_Type <- factor(dev_type)

Modelling

With the data cleaned, we can proceed with the modelling, followed by the plot.

rpart_mod <- rpart(X.psm.per.Site.Area ~ Development_Type + Planning.Area + 
    Site.Area..m2., data = df)
prp(rpart_mod, type = 2)

The model favours Planning Area first, followed by the Site Area and finally the Type of Development. As discussed earlier, the two most lucrative properties appear to be those in Bedok and Sembawang. Larger site areas worth more, as expected.

A Little Here, A Little There

17 June 2017

[Data Science] An Approach to Predict Land Prices