Prediction of House Prices in Lagos-Nigeria Using Machine Learning Models

: This paper considers the relationship between the price of houses and the features namely the number of bedrooms, parking space, and different house types. In this study, a machine learning approach was used to develop prediction models that predicted house prices in Lagos. Different machine learning techniques were used, train-test split to split the data into training sets for training and building the model and test data to test the accuracy of the model, performance metric mean absolute error to set the baseline for the model, Variance Inflation Factor (VIF) to help remove multicollinearity between features and Streamlit interactive dashboards to communicate with the model. Correlation and regression methods were used to examine the relationship and build the model. It is observed that there is a strong positive correlation between the number of bedrooms and the number of toilets, likewise the number of bedrooms and the number of bathrooms. It also shows that there is a moderate positive correlation between the number of bedrooms and price. The model shows that the number of bedrooms, parking spaces, and house types play an important role in determining the price of houses.


Introduction
Different factors affect the price of houses in Nigeria, states, towns, and localities depending on choice.An increasing factor in a particular state might be a decreasing factor in another state.Lagos known as the commercial capital of Nigeria is known for its expensive house price, according to forbes.com it is ranked the 55 th most expensive city to live in the world.Ajah a suburb in Lagos state, is considered one of the best places to start a family.Unlike other areas such as Lekki, Ikoyi, and Victoria Island, Ajah is Machine learning (ML) is the subset of artificial intelligence (AI) that focuses on building systems that learn or improve performance based on the data they consume.There are basically three types of machine learning: supervised, unsupervised, and reinforcement learning.Supervised learning is effective for a variety of business purposes, including sales forecasting, inventory optimization, and fraud detection.Some examples of use cases include: Regression is a supervised learning technique that aims to find the relationships between the dependent and independent variables.Ridge regression is a method of estimating the coefficients of multiple regression models in scenarios where the independent variables are highly correlated (Hilt, & Seegrist, 1977).It has been used in many fields including econometrics, chemistry, and engineering (Gruber, & Schucany, 2020).Uzoma, & Jeremiah, (2016) developed outlier detection and optimal variable selection techniques in regression analysis and other fascinating papers by the authors include (Anabike et al., 2023;Innocent et al., 2023;Abuh, Onyeagu, & Obulezi, 2023a;Abuh, Onyeagu, & Obulezi, 2023b;Obulezi et al., 2022;Onyekwere, & Obulezi, 2022;Onyekwere et al., 2022).

Suggested Citation
The ridge estimator is given as where y is the regressand, X is the design matrix, I is the identity matrix and λ ≥ 0 serves as the constant shifting diagonals of the moment matrix.The rest of this paper is organized as follows; In section 3, we present the material and method.

Literature review
Here, the big data is cleaned for subsequent use.
In section 4, we analyze the data and discussed the results in section 5. We then conclude the paper in section 6.

Material and Method
The dataset used for this analysis is the Nigerian housing dataset retrieved from Kaggle containing 25 unique states, 189 unique towns, and 24326 rows with 8 columns.After cleaning the data Ajah-Lagos was chosen as the town of interest.

Model
In the course of the analysis, certain libraries in Python were employed to model and analyze the data.Panda was used to read, clean, and manipulate the data, Scikit-learn (Sklearn) is an important library in Python that provide tools for machine learning algorithm for regression problem, classification problem, and clustering.Ridge regression was used for prediction to help reduce overfitting and multicollinearity, (Akinwande, Dikko & Gulumbe 2015).Multicollinearity arises when there is a correlation between the independent variables' percentages to help reduce this VIF was employed.The computational formula for VIF is given as Correlation analysis is primarily concerned with finding out whether a relationship exists between variables and then determining the magnitude and action of that relationship. (3) Matplotlib and plotly express were used for data visualization.

Data cleaning
Figure 1 Shows the prices of houses in different states in Nigeria, It shows that Lagos state has the most expensive house followed by Abuja.
We can also see that in different towns in Nigeria, Ikoyi houses seem more expensive than in any other town, not only that but the gap between Ikoyi and any other house was too much, houses in Ifako-Ijaiye can't be in the league of houses in Maitama-District, which could be a result of outliers.
A low standard deviation shows that the data points tend to be close to their mean and viceversa.However, from this it is noticed that the value of std is high are far away from the mean.This is to say we have outliers that contributed to this.Ajah ranks 24th from the chart with Ikoyi leading the chart followed by Lagos island, VI this is true because we know that these towns have houses where the elites, celebrity lives.The diagram below shows the relationship between price and number of bedrooms,the correlation coefficient is 0.48 which shows that they are moderately correlated.From this plot and correlation matrix, we can see that some variables are highly correlated they can affect our analysis.itcan be seen that bedrooms, bathrooms, toilets are highly correlated, this is possible as the number of bedroom increases, the number of bathrooms and toilets tend to increase.We can use VIF to test multicollinearity.(see Etaga et al [19]).As predicted from the correlation matrix, it can be observed that in Variance Inflation Factors Results (VIFs), 5 of the independent variable's VIF's exceeded 10, which indicate very strong presence of multicollinearity, dropping some of these columns reduce the value of the vif.A boxplot for the distribution of Ajah house price was plotted and it was discovered that there were outliers and it was trimmed using quantiles.The research model is specified thus price = f (number of bedrooms, number of parking space, house-type).

Figure 16. Boxplot of Ajah House Price
After cleaning this data, I built the model, to build a model for prediction there are three stages which include: 1.
Splitting the data: This involves splitting the data into features and target.In python we call the predictors (independent variable) Features while the dependent variable is referred to as Target.The predictor involved in this analysis is bathrooms, parking-space, house types.The dependent variable is price.These data are further splitted into TRAIN-TEST SPLIT Fitting the model: This involves making a pipeline, in pipeline we can have many transformers but one regressor which comes at the end of the pipeline and then fitting it with our data.

Figure 17. Fitting a Model
The pipeline has a:

•
Transformer which takes a dataset as an input and creates an augmented dataset as output.In this analysis OneHotEncoder is used to encode categorical data (House-types) to numerical data

•
Estimator: An estimator in machine learning is an algorithm that fits on the input dataset to generate a model, which is a transformer.Regression is an estimator that trains on a dataset with labels and features and produces a ridge regression model.

2.
Baselines: A baseline model is essentially a simple model that acts as a reference in a machine learning project used for comparison purpose.for regression problems, the common rule is to create baseline models that predict the mean or median of the training data output.In the model the mean-apt-price which is the mean of y train is 58429682.66and Baseline MAE which is the mae of y-pred-baseline and y train is 12644596.9.from the baseline model: it shows that if we predicted that the price of an apartment is N58429682.66then our prediction we would be off by an average N12644596.9.
Mean absolute error(MAE) is a performance metric in the context of machine learning, absolute error refers to the magnitude of difference between the prediction of an observation and the true value of that observation given as: The Training MAE 8986866.53,thisshows that training mae beats the baseline mae by N3657730 it is useful to predict house price.The next step is to use it on set of data that it has not seen before which is our test set.Test MAE: 8525486.28The test mae also beats our baseline mae.

3.
Evaluating and communicating results: Ability to communicate results to the stake holders.Feature refers to technique that calculates a score for all the input features for a given model-the scores simply represent the "importance" of each feature.A higher score means that the specific feature will have a larger effect on the model that is being used to predict a certain variable.This shows that when the house is a detached duplex the price increases by N12.5 million but when it is a semidetached duplex the price decreases by N7.5million, it also shows that parking space add little amount to the price of an apartment in Ajah and the number of bedrooms also increases the house price by N2.5 million.

Discussion
The outcome of this study was reached utilizing a regression model with three stages, including data splitting, baseline modeling for comparison, and result evaluation, as seen in figures 18.0 and 19.0 above.The price of the house increases by N12.5 million when it is a detached duplex, but decreases by N1 million when it is a semidetached duplex.

Conclusion
In this article, machine learning models have been deployed to predict house prices in Lagos Nigeria.Ikoyi Lagos state has the most expensive house in Nigeria followed by Maitama district Abuja.From the feature importance, it shows that house type plays an important role in predicting the price of a building.This shows that when the house is a detached duplex the price increases by N12.5 million but when it is a semidetached duplex the price decreases by a million, it also shows that parking space adds a little amount to the price of an apartment in Ajah and the number of bedrooms also increases the house price by 2.5 million.

Figure 7 .
Figure 7. Stabilizing the Data with RecordsGreater than 100

Figure
Figure 8.Average Price of Houses by State

Figure
Figure 12.Correlation Between Bedroom and Price

Figure
Figure 14.Heat Map Showing the Correlation Between Variables split: A train test split is when you split your data into a training set and a testing set.The training set is used for training the model, and the testing set is used to test your model.This allows you to train your models on the training set, and then test their accuracy on the unseen testing set.The data was splitted into two sets.80%for training and 20% for testing.Having (1292, 3) for training and (323, 3) for testing.This ensures that both sets are representative of the entire dataset, and gives a way to measure the accuracy of the models.(b)

Fig18:
Fig18 shows the predicted value of price yˆ against y test price.The regression equation is given as: