Real Estate

Real Estate Valuation

Real estate valuation is at the core of buying and selling properties. Getting the asking price right is paramount to closing a deal successfully. Conversely, getting that initial number wrong can lead either to drastically underselling one’s property or to failing to find a buyer.

Problems to solve

How can we predict the value of a house or an apartment? Which method can be used for a quick and objective appraisal?

Can machine learning help with these questions, and how accurately can predictive models estimate real prices?

Solution

Like humans, machines can learn to make predictions by analyzing past information (the ‘historical data’), identifying patterns in this data and finding a mathematical formula (an algorithm or model) that uses the variables from this historical data.

To demonstrate the performance of MyDataModels’ solution on this type of problem, we chose a specific case study.

Case Study

Assessing the market value of real estate is a daunting task. The real estate market is exposed to many fluctuations in prices because of existing correlations with many variables, some of which cannot be controlled or might even be unknown. Housing prices can increase rapidly (or in some cases, also drop very fast).

Appraisers still manually evaluate the value of assets that are sometimes worth billions of dollars by comparing an asset to a small set of previously transacted reference buildings that are somehow comparable.

Machine-Learning holds great promise for real estate pricing models.

This case study is based on real data from a public dataset originally found in the UCI data repository (https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set).

The objective is to estimate the price of a house per square meter in Taiwan dollars. (In the original dataset, the unit of area is the ‘Ping’, which corresponds to 3.3 square meters. However, we kept the original unit.)

The figure below shows an extract of this public dataset. Each line is a house and each column (or ‘feature’) is a variable which can be used in the model. 

Historical values are shown in the last column of the table (“Y_house_price of unit area”).

dataset.png

The model uses the features as follows:

X1=the transaction date (for example, 2013.250=2013 March, 2013.500=2013 June, etc.) 
X2=the house age (unit: year) 
X3=the distance to the nearest MRT station (unit: meter) 
X4=the number of convenience stores in the living circle on foot (integer) 
X5=the geographic coordinate, latitude (unit: degree) 
X6=the geographic coordinate, longitude (unit: degree) 

The target variable:
Y=the house price per unit area (unit: square meter)

These values were modified so that the units and calculations refer to the square meter.
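The unit change mentioned above amounts to a single division. The sketch below is only an illustration: it uses the 3.3 square meters per Ping quoted in this article, and the function name is hypothetical.

```python
# Conversion factor quoted in the text (1 Ping = 3.3 square meters).
SQM_PER_PING = 3.3

def per_ping_to_per_sqm(price_per_ping):
    """Convert a price per Ping into a price per square meter."""
    return price_per_ping / SQM_PER_PING

# Made-up value, for illustration only: 33 per Ping is 10 per square meter.
print(round(per_ping_to_per_sqm(33.0), 6))
```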

Results

To create predictive models, MyDataModels has a performant solution called TADA. To build a model, the user needs to:

1) upload the historical dataset into TADA,
2) set the goal: select the desired variable to predict (here “Y= House price”),
3) select the other variables to use (e.g. all the other columns, or a subset of those) by clicking on each of them,
4) generate the model (by clicking ‘New’ on 'model', naming the model and clicking 'generate').


How good is this model?

The metrics yielded by TADA under the Metrics heading are shown in the table below and refer to a run of one minute. 
 metrics.png

Legend

MAPE = Mean Absolute Percentage Error
MAE = Mean Absolute Error
RMSE = Root Mean Square Error
R2 = R Squared (Coefficient of Determination)

We can make a few observations.

  • The Maximum error, defined as the difference between the actual value and the predicted one, can be negative.
  • Now, for every regression task, i.e. where a numeric value is predicted or ‘fitted’, we may judge the error of the model with respect to the standard deviation 𝜎 (the ‘spread’) of the data used to construct the model. A prediction error of the same order of magnitude as 𝜎 indicates a good prediction.
  • For our starting data, the spread of the price per unit area is 𝜎 = 4.26 dollars. This is a considerable spread around the mean value of 11.5 dollars/sqm (or the close median value of 11.65). The regression error is of the same order of magnitude (the RMSE is around 3.6 dollars per square meter). Thus, the model can be judged as acceptable, within the limits of the initial data quality.

The last point means that the model produces a prediction which is no more uncertain than the original data. Thus, it can only be as good as the data used to generate it. Clearly, we cannot do better than that and enhance the quality of the initial data; nobody can!
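The comparison made in these observations can be reproduced with a few lines of arithmetic. The sketch below only uses the three figures quoted in the text (𝜎, mean and RMSE); the variable names are ours, and the ratios are one possible way of reading "same order of magnitude", not a metric reported by TADA.

```python
# Figures quoted in the text, in dollars per square meter.
sigma = 4.26       # spread (standard deviation) of the price
mean_price = 11.5  # mean price
rmse = 3.6         # approximate RMSE of the model

# Error relative to the spread of the data.
rmse_to_sigma = rmse / sigma
# Spread and error relative to the mean price.
sigma_to_mean = sigma / mean_price
rmse_to_mean = rmse / mean_price

print(f"RMSE / sigma = {rmse_to_sigma:.3f}")
print(f"sigma / mean = {sigma_to_mean:.3f}")
print(f"RMSE / mean  = {rmse_to_mean:.3f}")
```

With these figures the RMSE is roughly 85% of the spread and roughly 31% of the mean price.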


How can we use this model?

Once a model has been built, professionals can easily make predictions by feeding it with current house information. Hence, to use the model, a new dataset needs to be created, where all the rows are the houses whose evaluation is desired; and the columns, all the (same) variables used during the model generation. This data set file is then imported into TADA (see screen below). By clicking on “Generate score” the new data are evaluated by the model. 

new_score.png

A new file will be generated with a new column (“prediction”) giving the predicted price of each house.
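As a rough illustration of how such a scoring file could be prepared outside of TADA, the sketch below writes one row per house with the same feature columns as the training data. The column labels X1 to X6 are the short names from the feature list above and are placeholders: the actual headers in a dataset file may differ. The function name and the example values are hypothetical.

```python
import csv
import io

# Placeholder column labels taken from the feature list in this article.
FEATURES = ["X1", "X2", "X3", "X4", "X5", "X6"]

def build_scoring_csv(houses):
    """Return CSV text with one row per house and one column per feature."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FEATURES, lineterminator="\n")
    writer.writeheader()
    for house in houses:
        # Every feature used to build the model must be present.
        missing = [name for name in FEATURES if name not in house]
        if missing:
            raise ValueError(f"missing features: {missing}")
        writer.writerow({name: house[name] for name in FEATURES})
    return buffer.getvalue()

# Made-up values, for illustration only.
new_houses = [
    {"X1": 2013.25, "X2": 12.0, "X3": 450.0, "X4": 5, "X5": 24.97, "X6": 121.54},
]
scoring_csv = build_scoring_csv(new_houses)
print(scoring_csv)
```

The target column is deliberately absent: it is the value the model is asked to predict.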

Benefits of TADA

Real estate is the largest asset class in the world. It makes up, on average, 5.1% of any institutional portfolio (Andonov, Eichholtz, and Kok [2013]).

Finding the true market value of a property is an essential skill for appraisers and it ensures a fair negotiation. Real estate professionals and investors can use predictive models to get realistic market values. 

However, they are not data scientists and may have neither the machine learning skills nor the coding experience to build models. Moreover, they mostly handle Small Data, where the historical data for a given area contain a few hundred or a few thousand properties, and only rarely millions (aka Big Data). The machine learning tools that work well with Big Data may not perform as well with Small Data.

By using an automated machine learning solution like TADA, real estate professionals can now evaluate the price of their properties more quickly and accurately. Machine learning holds great promise for real estate.

Conclusion

MyDataModels allows real estate professionals to build predictive models from Small Data automatically and without training. They can use their collected data directly, without normalization, outlier management or feature engineering. Thanks to this limited data preparation, the above results, from this specific dataset, were obtained with a few clicks in less than a minute on a regular laptop. The results obtained from MyDataModels’ predictive model are satisfactory, with an RMSE of 3.6.

Dataset information

The historical market data set for real estate valuation was collected from Sindian Dist., New Taipei City, Taiwan.

Task: Regression
Number of features: 7
Size of data: 290 samples
Target: class: Y= house price of unit area
Score: RMSE

 

Acknowledgements

Original Owner and Donor
Name: Prof. I-Cheng Yeh 
Institutions: Department of Civil Engineering, Tamkang University, Taiwan. 
Email: 140910 '@' mail.tku.edu.tw 
TEL: 886-2-26215656 ext. 3181 

Date Donated: Aug. 18, 2018 
Data changed in June 2019

Detailed information

General

Artificial intelligence: Theories and techniques aiming to simulate intelligence (human, animal or other).

Binary Classification: It is the problem type when you are trying to predict one of two states, e.g. yes/no, true/ false, A/B, 0/1, red/green, etc. This kind of analysis requires that the Goal variable type is of type CLASS. Binary Classification analysis also requires that there be only 2 different values in the Goal column. Otherwise, it is not a binary problem (two choices and no more).

Convolutional Neural Network: This type of network is dedicated to object recognition. It is generally composed of several layers of convolutions + pooling followed by one or more FC layers. A convolutional layer can be seen as a filter. Thus, the first layer of a CNN makes it possible to filter corners, curves and segments, and the following layers filter more and more complex shapes.

Data Mining: Field of data science aimed at extracting knowledge and / or information from a body of data.

Deep Learning: Deep Learning is a category of so-called "layered" machine learning algorithms. A deep learning algorithm is a neural network with a large number of layers. The main interest of these networks is their ability to learn models from raw data, thus reducing pre-processing (often important in the case of classical algorithms).

Fully Convolutional Networks: An FCN is a CNN with the last FC layers removed. This type of network is currently not used much but can be very useful if it is succeeded by an RNN network allowing integration of the time dimension in a visual recognition analysis.

GRU (Gated Recurrent Unit): A GRU network is a simplified LSTM invented recently (2014) and allowing better predictions and easier parameterization.

LSTM (Long Short-Term Memory): An LSTM is an RNN to which a system has been added to control access to memory cells. We speak of "Gated Activation Function". LSTMs perform better than conventional RNNs.

Machine learning: Subfield of Artificial Intelligence (AI), Machine Learning is the scientific study of algorithms and statistical models that give systems the ability to learn and improve at specific tasks without explicit programming.

Multi Classification: Classification when there are more than two classes in the goal variable, e.g. A/B/C/D, red/orange/green, etc.

Multilayer perceptron: This is a classic neural network. Generally, all the neurons of a layer are connected to all the neurons of the next layer. We are talking about Fully Connected (FC) layers.

RCNN (Regional CNN): This type of network compensates for the shortcomings of a classic CNN and answers the question: what to do when an image contains several objects to recognize? An RCNN makes it possible to extract several labels (each associated with a bounding box) of an image.

Regression: Set of statistical processes to predict a specific number or value. Regression analysis requires the type of Goal variable to be numeric (INTEGER or DOUBLE).

Reinforcement learning: Learning paradigm, distinct from supervised learning, in which an agent interacts with an environment and receives rewards or penalties for its actions. This feedback is used to progressively improve the model's decisions.

RNN (Recurrent Neural Networks): Recurrent networks are a set of networks integrating the temporal dimension. Thus, from one prediction to another, information is shared. These networks are mainly used for the recognition of activities or actions via video or other sensors.

Semi supervised learning: Semi-supervised learning sits between supervised and unsupervised learning. It applies when the training data is only partially labeled. The interest is to learn a model with little labeled data.

Stratified sampling: It is a method of sampling such that the distribution of goal observations in each stratum of the sample is the same as the distribution of goal observations in the population. TADA uses this method to shuffle the data set from binary and multi classification projects.

Simple random sampling: It is a method of sampling in which each observation is equally likely to be chosen randomly. TADA uses this method to shuffle the data set from regression projects.
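The difference between the two sampling methods defined above can be sketched with the Python standard library. This is only an illustration of the two definitions, not TADA's implementation; the function names, the 80/20 class mix and the test fraction are assumptions chosen for the example.

```python
import random
from collections import defaultdict

def simple_random_split(rows, test_fraction, seed=0):
    """Shuffle all rows together, then cut off a test part."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def stratified_split(rows, labels, test_fraction, seed=0):
    """Shuffle and cut each class separately so class proportions are kept."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append((row, label))
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Made-up data: 100 samples, 80% "yes" and 20% "no".
rows = list(range(100))
labels = ["yes"] * 80 + ["no"] * 20

train, test = stratified_split(rows, labels, test_fraction=0.2)
n_no_in_test = sum(1 for _, label in test if label == "no")
print(len(train), len(test), n_no_in_test)

reg_train, reg_test = simple_random_split(rows, test_fraction=0.2)
print(len(reg_train), len(reg_test))
```

In the stratified split the test part contains exactly 20% of each class, whereas the simple random split only guarantees the overall size of each part.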

Supervised learning: Sub-domain of machine learning, supervised learning aims to generalize and extract rules from labeled data. All this in order to make predictions (to predict the label associated with a data without label).

Transfer learning: Brought up to date by deep learning, transfer learning consists of reusing pre-trained models in order not to reinvent the wheel for each new learning task.

Unsupervised learning: Sub-domain of machine learning, unsupervised learning aims to group data that are similar and divide/separate different data. We talk about minimizing intra-class variance and maximizing inter-class variance.


Metrics

Binary

ACC (Accuracy): Percentage of samples in the test set correctly classified by the model.

Actual Negative: Number of samples of negative case in the raw source data subset.

Actual Positive: Number of samples of positive case in the raw source data subset.

AUC: Area Under the Curve (AUC) of the Receiver Operating characteristic (ROC) curve. It is in the interval [0;1]. A perfect predictive model gives an AUC score of 1. A predictive model which makes random guesses has an AUC score of 0.5.

F1 score: Single value metric that gives an indication of a Binary Classification model's efficiency at predicting the positive class. It is computed as the harmonic mean of PPV and TPR.

False Negative: Number of positive class samples in the source data subset that were incorrectly predicted as negative.

False Positive: Number of negative class samples in the source data subset that were incorrectly predicted as positive.

MCC (Matthews Correlation Coefficient): Single value metric that gives an indication of a Binary Classification model's efficacy at predicting both classes. This value ranges from -1 to +1, with +1 being a perfect classifier.

PPV (Positive Predictive Value/Precision): Number of a model's True Positive predictions divided by the number of (True Positives + False Positives) in the test set.

Predicted Positive: Number of samples in the source data subset predicted as the positive case by the model.

Predicted Negative: Number of samples in the source data subset predicted as the negative case by the model.

True Positive: Number of positive class samples in the source data subset accurately predicted by the model.

True Negative: Number of negative class samples in the source data subset accurately predicted by the model.

TPR (True Positive Rate/Sensitivity/Recall): Ratio of True Positive predictions to actual positives with respect to the test set. It is calculated by dividing the true positive count by the actual positive count.

TNR (True Negative Rate/Specificity): Ratio of True Negative predictions to actual negatives with respect to the test set. It is calculated by dividing the True Negative count by the actual negative count.
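The binary metrics defined above all derive from the four confusion counts. The following sketch computes them from the standard formulas; it is not TADA's code, the function name and the counts are made up for the example, and no guard against empty classes (division by zero) is included.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion counts."""
    total = tp + fp + fn + tn
    acc = (tp + tn) / total
    ppv = tp / (tp + fp)   # precision
    tpr = tp / (tp + fn)   # recall / sensitivity
    tnr = tn / (tn + fp)   # specificity
    f1 = 2 * ppv * tpr / (ppv + tpr)  # harmonic mean of PPV and TPR
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"ACC": acc, "PPV": ppv, "TPR": tpr, "TNR": tnr, "F1": f1, "MCC": mcc}

# Made-up counts, for illustration only.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```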

 

Multi classification

ACC (Accuracy): Ratio of the correctly classified samples over all the samples.

Actual Total: Total number of samples in the source data subset that were of the given class.

Cohen’s Kappa (K): Coefficient that measures inter-rater agreement for categorical items. It tells how much better a classifier performs than a classifier that simply guesses at random according to the frequency of each class. It is in the interval [-1;1]. A coefficient of +1 represents a perfect prediction, 0 a prediction no better than random, and −1 total disagreement.

False Negative: Number of positive class samples in the source data subset that were incorrectly predicted as negative.

False Positive: Number of negative class samples in the source data subset that were incorrectly predicted as positive.

Macro-PPV (Positive Predictive Value/Precision): The mean of the computed PPV within each class (independently of the other classes). Each PPV is the number of True Positive (TP) predictions divided by the total number of positive predictions (TP+FP, with FP for False Positive) within each class. PPV is in the interval [0;1]. The higher this value, the better the confidence that positive results are true.

Macro-TPR (True Positive Rate/Recall): The mean of the computed TPR within each class (independently of the other classes). Each TPR is the proportion of samples predicted Truly Positive (TP) out of all the samples that actually are positive (TP+FN, with FN for False Negative). TPR is in the interval [0;1]. The higher this value, the fewer actual samples of positive class are labeled as negative.

Macro F1 score: Harmonic mean of macro-average PPV and TPR. F1 Score is in the interval [0;1]. The F1 Score can be interpreted as a weighted average of the PPV and TPR values. It reaches its best value at 1 and worst value at 0.

MCC (Matthews Correlation Coefficient): Represents the multi class confusion matrix with a single value. Precision and recall for all the classes are computed and averaged into a single real number within the interval [-1;1]. However, in the multiclass case, its minimum value lies between -1 (total disagreement between prediction and truth) and 0 (no better than random) depending on the data distribution.

Predicted Total: Total number of samples in the source data subset that were predicted of the given class.

True Positive: Number of positive class samples in the source data subset accurately predicted by the model.

True Negative: Number of negative class samples in the source data subset accurately predicted by the model.
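For the multi-class case, the accuracy, the macro-averaged PPV and TPR, the macro F1 score as defined above (harmonic mean of macro PPV and macro TPR) and Cohen's Kappa can all be read off a confusion matrix. The sketch below follows those definitions under the assumption that rows are actual classes and columns are predicted classes; the function name and the matrix are made up, and this is not TADA's implementation.

```python
def multiclass_metrics(matrix):
    """matrix[i][j] = number of samples of actual class i predicted as class j."""
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    actual_totals = [sum(row) for row in matrix]
    predicted_totals = [
        sum(matrix[i][j] for i in range(n_classes)) for j in range(n_classes)
    ]
    true_positives = [matrix[k][k] for k in range(n_classes)]

    acc = sum(true_positives) / total
    # Per-class PPV and TPR, then their unweighted mean over the classes.
    macro_ppv = sum(tp / p for tp, p in zip(true_positives, predicted_totals)) / n_classes
    macro_tpr = sum(tp / a for tp, a in zip(true_positives, actual_totals)) / n_classes
    macro_f1 = 2 * macro_ppv * macro_tpr / (macro_ppv + macro_tpr)
    # Agreement expected from guessing according to class frequencies.
    expected = sum(a * p for a, p in zip(actual_totals, predicted_totals)) / total ** 2
    kappa = (acc - expected) / (1 - expected)
    return {
        "ACC": acc,
        "Macro-PPV": macro_ppv,
        "Macro-TPR": macro_tpr,
        "Macro-F1": macro_f1,
        "Kappa": kappa,
    }

# Made-up 3-class confusion matrix, for illustration only.
confusion = [
    [5, 1, 0],
    [1, 3, 1],
    [0, 1, 3],
]
multi = multiclass_metrics(confusion)
for name, value in multi.items():
    print(f"{name}: {value:.3f}")
```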

 

Regression

MAE (Mean Absolute Error): represents the average magnitude of the errors in a set of predictions, without considering their direction. It is the average over the test sample of the absolute differences between prediction and actual observation, where all individual differences have equal weight. MAE is in the interval [0;+∞[. A value of 0 represents a perfect prediction; the higher this value, the larger the model's error.

MAPE (Mean Absolute Percentage Error): MAPE is computed as the average of the absolute values of the deviations of the predicted from the actual values, each divided by the actual value and expressed as a percentage.

Max-Error: Maximum Error. The application considers here the magnitude (absolute error) when identifying the maximum error. Thus -1.5 would be considered the maximum error over +1.3. The sign of the error, however, is still reported in this column in case it has domain significance for the user.

R2 (R Squared): also known as the Coefficient of Determination. The application computes the R2 statistic as 1 - (SSres / SStot) where SSres is the residual sum of squares and SStot is the total sum of squares.

RMSE: Root Mean Square Error against the Dataset partition selected. RMSE is computed as the square root of the mean of the squared deviations of the predicted from actual values.

SD-ERROR (Standard Deviation Error): Standard statistical measure used to quantify the amount of variation of a set of data values.
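The regression metrics above can be computed in a few lines. The sketch below follows the definitions given in this glossary, with the error taken as actual minus predicted and the Max-Error reported with its sign. It is not TADA's code: the function name and the sample values are made up, and the MAPE line assumes no actual value is zero.

```python
import math

def regression_metrics(actual, predicted):
    """Compute MAE, MAPE, RMSE, signed Max-Error and R2 for paired values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    # Percentage error relative to each actual value (actual must be non-zero).
    mape = 100 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # Largest error by magnitude, keeping its sign.
    max_error = max(errors, key=abs)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MAPE": mape, "RMSE": rmse, "Max-Error": max_error, "R2": r2}

# Made-up prices, for illustration only.
actual = [10.0, 12.0, 8.0, 14.0]
predicted = [11.0, 11.0, 8.5, 12.0]
reg = regression_metrics(actual, predicted)
for name, value in reg.items():
    print(f"{name}: {value:.4f}")
```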