Date: 13 July 2020
In part 2, we talked about splitting the data into a training set and a testing set. In part 3, I would like to share some findings from the first data exploration.
Big Problem: Missing Data
I did not expect so much missing data in some of the key fields. I used the missingno package to examine the missing fields.
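As a numeric complement to the missingno plots, the share of missing values per column can also be computed directly with pandas. The sketch below is only an illustration: the DataFrame holds made-up values (only the column names come from the dataset), and `missing_share` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names come from the dataset.
df = pd.DataFrame({
    "Price$(M)": [5.2, 7.8, 3.1, 12.0],
    "Build Size(feet)": [np.nan, 650.0, np.nan, np.nan],
    "Building age(year)": [30.0, np.nan, np.nan, 12.0],
})


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)


shares = missing_share(df)
print(shares)

# With missingno installed, the same frame can be inspected visually:
# import missingno as msno
# msno.matrix(df)  # one row per record, gaps mark missing cells
# msno.bar(df)     # non-null count per column
```

The numeric view is handy for sorting and thresholding, while the missingno plots make it easier to see whether the gaps in different columns occur in the same rows.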
Build Size (feet), Actual Size (feet), and Building age (year) each have more than 60% missing values. Intuitively, they should be highly correlated with the property price. This intuition is further confirmed by the correlation plots.
Sometimes, when a feature has more than 50% missing values, we may drop the feature. But I don’t want to do that for important features. This creates a dilemma for me.
I am still working out the solution to this problem. One option is to look for other datasets. But I first need to understand how the data was collected and why there are so many missing values. Are those required fields? Another option is to find some way to fill in the missing values. I think I will try the first option first.
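For the second option, one common baseline is median imputation combined with a flag that records which values were originally missing. This is only a sketch of one possible approach, not a decision I have made; the values are made up, and the name of the flag column is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "Building age(year)": [10.0, np.nan, 30.0, np.nan, 50.0],
})

col = "Building age(year)"

# Keep a flag so a model can still tell which values were originally missing.
df[col + " missing"] = df[col].isna()

# Fill the gaps with the median of the observed values.
df[col] = df[col].fillna(df[col].median())

print(df)
```

With more than 60% of a column missing, most of the filled column would end up being the same median value, so this is at best a starting point. The median should also be computed on the training set only and then reused for the testing set.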
Problem: Limited Features
If we look at the dataset, there are actually very few features.
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Reg. Date           5952 non-null   object
 1   Property Name       5318 non-null   object
 2   Street              5952 non-null   object
 3   Block               2732 non-null   object
 4   Floor               5918 non-null   object
 5   Flat                5343 non-null   object
 6   Price$(M)           5952 non-null   float64
 7   Build Size(feet)    559 non-null    object
 8   Actual Size(feet)   1589 non-null   object
 9   Price/sq.ft         559 non-null    float64
 10  Building age(year)  2063 non-null   float64
 11  Type                5952 non-null   object
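The non-null counts above can be turned into missing percentages with a few lines of plain Python. The sketch assumes the frame has 5952 rows in total, which is the largest non-null count in the listing; the row count itself is not shown in the output.

```python
# Non-null counts copied from the column listing above.
# Assumption: the frame has 5952 rows (the largest non-null count).
TOTAL_ROWS = 5952
non_null = {
    "Reg. Date": 5952,
    "Property Name": 5318,
    "Street": 5952,
    "Block": 2732,
    "Floor": 5918,
    "Flat": 5343,
    "Price$(M)": 5952,
    "Build Size(feet)": 559,
    "Actual Size(feet)": 1589,
    "Price/sq.ft": 559,
    "Building age(year)": 2063,
    "Type": 5952,
}

# Percentage of missing rows per column, rounded to one decimal.
missing_pct = {
    name: round(100 * (TOTAL_ROWS - count) / TOTAL_ROWS, 1)
    for name, count in non_null.items()
}

# Columns that cross the 50% rule of thumb for dropping a feature.
over_half = [name for name, pct in missing_pct.items() if pct > 50]

for name in over_half:
    print(f"{name}: {missing_pct[name]}% missing")
```

Under that assumption, Block also crosses the 50% mark, in addition to the size, price per square foot, and building age columns.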
If we do not do any feature engineering, we may be left with only 6 features, Type among them. Also, Build Size (feet) and Actual Size (feet) are highly correlated with each other. For Block, we may want to do geocoding, that is, converting address information to coordinates. Can we do one-hot encoding on the address (maybe Street)? I am not sure.
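On the one-hot encoding question, a rough sketch with `pandas.get_dummies` is shown below. One dummy column per street could produce a very wide table, so the sketch keeps only the most frequent streets and folds the rest into an "OTHER" bucket. The street names, the `TOP_N` cut-off, and the "OTHER" label are all made up for illustration.

```python
import pandas as pd

# Made-up street names for illustration only.
df = pd.DataFrame(
    {"Street": ["A ROAD", "B STREET", "A ROAD", "C AVENUE", "A ROAD"]}
)

# Keep the most frequent streets and fold the rest into "OTHER" so the
# number of dummy columns stays small.
TOP_N = 1  # hypothetical cut-off
top = df["Street"].value_counts().nlargest(TOP_N).index
street = df["Street"].where(df["Street"].isin(top), "OTHER")

# One indicator column per remaining category.
dummies = pd.get_dummies(street, prefix="Street")
print(dummies.columns.tolist())
```

Whether this helps depends on how many distinct streets there are and how many sales each one has; geocoding the address into coordinates may turn out to be a more compact representation.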
A lot of outliers in the target variable
Positive skewness and a multimodal distribution in the target variable
Price$(M) is positively correlated with Build Size and Actual Size, and negatively correlated with Building age
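The outlier and skewness observations can be quantified along the following lines. This is a sketch on made-up prices, not the real data; the 1.5 × IQR fence and the log transform are common conventions that I am assuming here, not something settled for this project.

```python
import numpy as np
import pandas as pd

# Made-up prices (in millions) with a long right tail; not real data.
price = pd.Series(
    [2.1, 2.5, 3.0, 3.2, 3.8, 4.5, 5.0, 6.5, 12.0, 45.0],
    name="Price$(M)",
)

# Sample skewness before and after a log transform.
raw_skew = price.skew()
log_skew = np.log1p(price).skew()

# Flag outliers with the 1.5 * IQR rule.
q1, q3 = price.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = price[(price < q1 - 1.5 * iqr) | (price > q3 + 1.5 * iqr)]

print(f"skew raw={raw_skew:.2f}, log={log_skew:.2f}")
print(outliers)
```

A positive skew that shrinks after the log transform suggests that modelling the log of the price may be worth trying. The multimodal shape is easier to judge from a histogram than from a single number.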
These observations are preliminary. I should fix the problem of missing data first.