Real estate valuation is at the core of buying and selling properties. Setting the right asking price is paramount to closing a deal successfully. Conversely, getting that initial number wrong can lead either to drastically underselling one’s property or to failing to find a buyer.
Problems to solve
How can we predict the value of a house or an apartment? Which method can be used for a quick and objective appraisal?
Can machine learning help in these matters, and how accurately can predictive models estimate real prices?
Like humans, machines can learn to make predictions by analyzing past information (the ‘historical data’), identifying patterns in this data and finding a mathematical formula (an algorithm or model) that uses the variables from this historical data.
To demonstrate the performance of MyDataModels’ solution on this type of problem, we chose a specific case study.
Assessing the market value of real estate is a daunting task. The real estate market is exposed to many fluctuations in prices because of existing correlations with many variables, some of which cannot be controlled or might even be unknown. Housing prices can increase rapidly (or in some cases, also drop very fast).
Appraisers still manually evaluate the value of assets that are sometimes worth billions of dollars by comparing an asset to a small set of previously transacted reference buildings that are somehow comparable.
Machine-Learning holds great promise for real estate pricing models.
This case study is based on real data from a public dataset originally found in the UCI data repository (https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set).
The objective is to estimate the price of a house per square meter in Taiwan dollars. (In the original dataset the surface unit is the ‘Ping’, corresponding to 3.3 square meters. However, we kept the original unit.)
The figure below shows an extract of this public dataset. Each line is a house and each column (or ‘feature’) is a variable which can be used in the model.
Historical values are shown in the last column of the table (“Y_house_price of unit area”).
The model uses the features as follows:
X1=the transaction date (for example, 2013.250=2013 March, 2013.500=2013 June, etc.)
X2=the house age (unit: year)
X3=the distance to the nearest MRT station (unit: meter)
X4=the number of convenience stores in the living circle on foot (integer)
X5=the geographic coordinate, latitude (unit: degree)
X6=the geographic coordinate, longitude (unit: degree)
The target variable is:
Y=the house price of unit area (per square meter)
The values were modified so that the units and the calculations refer to the square meter.
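As a rough illustration of the unit handling and of the X1 date encoding described above, the sketch below converts a price per Ping into a price per square meter using the 3.3 square meter figure quoted earlier, and decodes a fractional transaction date into a year and a month. The function names are hypothetical, and the treatment of a ".000" fraction as December of the previous year is an assumption inferred from the two examples given for X1; none of this is part of TADA.

```python
from typing import Tuple

# Conversion factor quoted in the text (1 Ping is about 3.3 square meters).
PING_IN_SQM = 3.3


def price_per_sqm(price_per_ping: float) -> float:
    """Convert a price per Ping into a price per square meter."""
    return price_per_ping / PING_IN_SQM


def decode_transaction_date(x1: float) -> Tuple[int, int]:
    """Decode a fractional date such as 2013.250 into (year, month).

    Follows the convention given for X1: 2013.250 -> March 2013,
    2013.500 -> June 2013, i.e. the fraction is month / 12.
    """
    year = int(x1)
    month = round((x1 - year) * 12)
    if month == 0:
        # Assumption: a fraction of .000 denotes December of the previous year.
        return year - 1, 12
    return year, month


if __name__ == "__main__":
    print(decode_transaction_date(2013.250))
    print(decode_transaction_date(2013.500))
    print(price_per_sqm(33.0))
```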
To create predictive models, MyDataModels has a performant solution called TADA. To build a model, the user needs to:
1) upload the historical dataset into TADA,
2) set the goal: select the desired variable to predict (here “Y= House price”)
3) select the other variables to use (e.g. all the other columns, or a subset of those), by clicking on each of these.
4) generate the model (by clicking ‘New’ on 'model', naming the model and clicking 'generate')
How good is this model?
The metrics yielded by TADA under the Metrics heading are shown in the table below and refer to a run of one minute.
MAPE = Mean Absolute Percentage Error
MAE = Mean Absolute Error
RMSE = Root Mean Square Error
R2 = Coefficient of determination (R-squared)
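For readers who want to recompute these metrics outside TADA, here is a minimal sketch using their standard textbook definitions, with R2 taken as the coefficient of determination. It is not TADA's implementation, and the function names are chosen for illustration only.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Absolute Error: average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Absolute Percentage Error, in percent.

    Undefined when an actual value is zero (division by zero).
    """
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root Mean Square Error: square root of the mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Each function takes the actual prices and the predicted prices as two sequences of equal length, for example the "Y" column of a test file and the corresponding predictions.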
We can make a few observations.
- The maximum error, defined as the difference between the actual value and the predicted one, can be negative.
- For every regression task, i.e. where a numeric value is predicted or ‘fitted’, we may judge the error of the model with respect to the standard deviation 𝜎 (the ‘spread’) of the data used to construct the model. A prediction error of the same order of magnitude as 𝜎 indicates a good prediction.
- For our starting data, the spread of the price per unit surface is 𝜎 = 4.26 dollars. This is a considerable spread around the mean value of 11.5 dollars/sqm (or the close median value of 11.65). The regression error is of the same order of magnitude, with an RMSE of around 3.6 dollars per square meter. Thus, the model can be judged as acceptable, within the limits of the initial data quality.
The last point means that the model produces a prediction which is not more uncertain than the original data. Thus, it can only be as good as the data used to generate it. Clearly, no model can do better than that, since doing so would amount to enhancing the quality of the initial data, and nobody can do that!
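The comparison between the error and the spread can be made explicit with a little arithmetic. The sketch below uses only the two figures quoted above (RMSE of about 3.6 and 𝜎 = 4.26). The "implied R2" line is an approximation that holds only under the assumption that both figures were computed on the same samples and that 𝜎 is the population standard deviation; the text does not confirm either, so it should be read as indicative.

```python
def error_to_spread_ratio(rmse: float, sigma: float) -> float:
    """Ratio of the model error to the spread of the target."""
    return rmse / sigma


def implied_r2(rmse: float, sigma: float) -> float:
    """Approximate R2 as 1 - (RMSE / sigma)^2.

    Assumption: RMSE and sigma come from the same samples and sigma is
    the population standard deviation; otherwise this is only indicative.
    """
    return 1.0 - (rmse / sigma) ** 2


if __name__ == "__main__":
    RMSE = 3.6    # value quoted in the text
    SIGMA = 4.26  # value quoted in the text
    print(round(error_to_spread_ratio(RMSE, SIGMA), 3))
    print(round(implied_r2(RMSE, SIGMA), 3))
```

With the quoted figures, the ratio comes out at roughly 0.85, i.e. the error is somewhat smaller than the spread of the data.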
How can we use this model?
Once a model has been built, professionals can easily make predictions by feeding it with current house information. To use the model, a new dataset needs to be created in which each row is a house to be evaluated and the columns are the same variables that were used during model generation. This dataset file is then imported into TADA (see screen below). By clicking on “Generate score”, the new data are evaluated by the model.
A new file will be generated with a new column (“prediction”) giving the predicted house prices.
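Since the scoring file must contain the same variables as the ones used to build the model, it can be worth checking its header before importing it. The sketch below is one possible way to do this with Python's standard csv module; the column names X1 to X6 are hypothetical placeholders for whatever headers the real file uses, and the check is independent of TADA.

```python
import csv
import io
from typing import List, Sequence

# Hypothetical column names; replace with the headers of the training file.
TRAINING_FEATURES = ("X1", "X2", "X3", "X4", "X5", "X6")


def missing_features(csv_text: str,
                     required: Sequence[str] = TRAINING_FEATURES) -> List[str]:
    """Return the required columns that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty header if the text is empty
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


if __name__ == "__main__":
    sample = "X1,X2,X4\n2013.25,10,3\n"
    print(missing_features(sample))
```

An empty list means every variable used during model generation is present in the new file.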
Benefits of TADA
Real estate is the largest asset class in the world. It makes up, on average, 5.1% of any institutional portfolio (Andonov, Eichholtz, and Kok).
Finding the true market value of a property is an essential skill for appraisers and it ensures a fair negotiation. Real estate professionals and investors can use predictive models to get realistic market values.
However, they are not data scientists and may have neither the machine learning skills nor the coding experience to build models. Moreover, they mostly handle Small Data, where historical data contains a few hundred or a few thousand properties in the same area, and only rarely millions (aka Big Data). The machine learning tools that work well with Big Data may not perform as well with Small Data.
By using an automated machine learning solution like TADA, real estate professionals can now evaluate the price of their properties more quickly and accurately. Machine learning holds great promise for real estate.
MyDataModels allows real estate professionals to build predictive models from Small Data automatically and without training. They can use their collected data directly, without normalization, outlier management or feature engineering. Thanks to this limited data preparation, the above results, from this specific dataset, were obtained with a few clicks in less than a minute on a regular laptop. The results obtained from MyDataModels’ predictive model are satisfying, with an RMSE of 3.6.
The historical market dataset for real estate valuation was collected from Sindian Dist., New Taipei City, Taiwan.
Number of features: 7
Size of data: 290 samples
Target: class: Y= house price of unit area
Original Owner and Donor
Name: Prof. I-Cheng Yeh
Institutions: Department of Civil Engineering, Tamkang University, Taiwan.
Email: 140910 '@' mail.tku.edu.tw
TEL: 886-2-26215656 ext. 3181
Date Donated: Aug. 18, 2018
Data changed in June 2019