You can predict the price of Ice?

rob siwicki
5 min read · Aug 27, 2020

The following analysis was created to investigate three questions about the diamond business:

  1. People unfamiliar with diamonds often assume that the larger a diamond (in carats), the more valuable it is. This is not always true; can that be demonstrated in a simple visual that a layman can interpret?
  2. Are there any other surprising patterns?
  3. Can a predictive model of price be created with a good level of accuracy that utilises both the numerical and the categorical qualities of diamonds?

Note: The underlying data was obtained from Kaggle in order to prove the potential of such a model; however, it’s likely the prices stated do not reflect the market conditions of 2020.

Preparation

First, it’s useful to get an overview of the data. The following code was used to load it.

# Fix up the imports and load the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns
%matplotlib inline

df = pd.read_csv('./diamonds.csv')

# It looks like the data contains an unnamed column which only appears to be an identifier of the row. Let's remove this
df = df.drop(['Unnamed: 0'], axis=1)
df.head()

After inspecting for missing values (none found), we can see that we have a straightforward data set to begin our analysis with; the only change made was removing a row identifier column.
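The missing-value inspection itself is not shown in the post. A minimal sketch of how such a check might look with pandas follows; the small `toy` frame is made up for illustration and deliberately contains gaps so the counts are non-zero, whereas on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for diamonds.csv (illustrative values only, with gaps on purpose)
toy = pd.DataFrame({
    "carat": [0.23, 0.31, 1.01, np.nan],
    "cut": ["Ideal", "Premium", None, "Good"],
    "price": [326, 335, 4500, 2800],
})

# Number of missing entries per column; all zeros means nothing to impute or drop
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame
any_missing = bool(toy.isnull().values.any())
print("Any missing values:", any_missing)
```

If every count comes back as zero, as reported for this data set, no imputation or row dropping is needed before modelling.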

Seaborn pair plots can be used to get an overview of the relationships in the data. In this case we examine pairwise plots categorised by clarity.

sns.pairplot(df, hue='clarity');

So there are clearly interesting relationships here between size indicators and price that, at first glance, appear to be influenced by other factors such as clarity.

It’s not all about size

The following plot was generated to visualise some of the qualities of diamonds (carat, clarity and colour) and to demonstrate that much smaller diamonds of higher quality can outprice larger stones. It shows that a number of higher quality stones between 1 and 1.5 carats achieve the price of 4 to 5 carat stones by virtue of their qualities.

It’s interesting to note that there appear to be strata in the data at 1, 1.5, 2 and 3 carats.
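The code behind this plot is not reproduced in the post. As a rough sketch of one way to draw a comparable chart, the block below uses seaborn's `scatterplot` with price against carat and clarity as the hue. The `toy` frame and its values are invented purely so the snippet runs on its own, and the choice of chart type is an assumption rather than the method used for the original figure.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Invented stand-in for the diamonds data (illustrative values only)
toy = pd.DataFrame({
    "carat":   [0.5, 1.0, 1.2, 1.5, 2.0, 3.0, 4.0, 4.5],
    "clarity": ["VVS1", "IF", "VVS1", "VS1", "SI2", "I1", "I1", "I1"],
    "price":   [2500, 12000, 15000, 14000, 13000, 12500, 15500, 16000],
})

# One point per stone: size on the x axis, price on the y axis, clarity as colour
fig, ax = plt.subplots(figsize=(8, 5))
sns.scatterplot(data=toy, x="carat", y="price", hue="clarity", alpha=0.8, ax=ax)
ax.set_xlabel("Carat")
ax.set_ylabel("Price")
ax.set_title("Price against carat, coloured by clarity")
```

On the real data, `toy` would be replaced by `df`; small high-clarity stones sitting at the same height as much larger low-clarity stones are what the paragraph above describes.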

Surprising Distribution of Carat

The stratification of carats appears to be an interesting and unforeseen relationship.

To see these strata more clearly, let’s try a density plot:

sns.distplot(df['carat'], rug=False)

Perhaps we can infer from this that stones are cut, where possible, to match popular market weights, whilst the smaller and larger stones are left closer to their natural size to make the most of their uncut carat. It’s also interesting to note that the larger stones tend to be of a lower quality. This may indicate that the probability of a stone being free of inclusions falls as its size increases, or that larger stones can be cut in such a way as to remove inclusions and thereby, in some cases, fetch a higher price at a smaller carat.
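The clustering can also be checked numerically rather than by eye. The sketch below counts how many stones fall just below a round carat value and how many sit at or just above it. The helper `counts_around`, the 0.10 carat window and the `toy_carat` values are all illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Invented carat values standing in for df['carat'] (illustrative only)
toy_carat = pd.Series(
    [0.30, 0.31, 0.92, 0.95, 1.00, 1.00, 1.01, 1.01, 1.02, 1.49, 1.50, 1.51, 1.52],
    name="carat",
)

def counts_around(carat, threshold, width=0.10):
    """Count stones in [threshold - width, threshold) and in [threshold, threshold + width)."""
    below = ((carat >= threshold - width) & (carat < threshold)).sum()
    at_or_above = ((carat >= threshold) & (carat < threshold + width)).sum()
    return int(below), int(at_or_above)

for threshold in (1.0, 1.5):
    below, at_or_above = counts_around(toy_carat, threshold)
    print(f"{threshold} carat: {below} just below, {at_or_above} at or just above")
```

A much larger count at or just above each threshold than just below it would be consistent with stones being cut to reach the round weight.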

The strata are clear. Let’s see if we can break the carat density distribution down by a category, clarity for example.

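# Collect the clarity grades present in the data
clarities = df['clarity'].unique()

# Iterate through the clarities
for clarity in clarities:
    # Subset to the clarity
    subset = df[df['clarity'] == clarity]

    # Draw the density plot
    sns.distplot(subset['carat'], hist=False, kde=True, label=clarity)

# Show which curve belongs to which clarity
plt.legend(title='Clarity')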

What’s this showing us?

There’s a higher density of higher clarity diamonds at smaller carats. Interestingly, it looks like the lower quality diamonds (I1 and SI2) are cut to 1 carat, perhaps to yield a higher price.

Price Prediction Modelling

Now we can see there is clearly more to understanding the price of the stones than just the carat (size). Let’s see if we can build a good price prediction model.

First we use get_dummies() from pandas to encode the categorical variables.
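The encoding step itself is not shown in the post. A minimal sketch of how `df_new` might be produced follows. The `toy` frame, the way categorical columns are detected, and the `drop_first=True` and `dtype=float` settings are assumptions made for illustration, not necessarily what was used originally.

```python
import pandas as pd

# Invented stand-in for the diamonds data (illustrative values only)
toy = pd.DataFrame({
    "carat": [0.23, 0.90, 1.01],
    "cut": ["Ideal", "Premium", "Good"],
    "clarity": ["SI2", "VS1", "I1"],
    "price": [326, 3500, 4200],
})

# Treat every non-numeric column as categorical
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

# One 0/1 column per category level; drop_first removes one level per variable
# so the dummy columns are not perfectly collinear with the intercept
df_new = pd.get_dummies(toy, columns=cat_cols, drop_first=True, dtype=float)
print(df_new.columns.tolist())
```

The numeric columns pass through unchanged, so `price` is still available as the response variable after encoding.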

# Split into explanatory and response variables
X = df_new.drop('price', axis=1)
y = df_new['price']

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=30)

# Instantiate (the original call passed normalize=True, an option removed in
# scikit-learn 1.2; feature scaling does not change ordinary least squares predictions)
lm_model = LinearRegression()

# Fit
lm_model.fit(X_train, y_train)

# Predict and score the model
y_test_preds = lm_model.predict(X_test)
print("The r-squared score for the model using only quantitative variables was {} on {} values.".format(r2_score(y_test, y_test_preds), len(y_test)))

We can then run the code and examine the output.

The r-squared score for the model using only quantitative variables was 0.918952496774 on 16182 values.

An r-squared of roughly 0.92 on the held-out test set means the model explains about 92% of the variance in price, which suggests we have created a good predictor for the pricing of diamonds based on both numerical and categorical data.
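R-squared on its own does not show how much the categorical features contribute, nor how large the errors are in price units. The sketch below, run on synthetic data generated inside the snippet (not the Kaggle data), compares a carat-only model with one that also uses a dummy-encoded clarity column, and reports the root mean squared error alongside r-squared. The helper `fit_and_score` and all the numbers used to generate the data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: price depends on carat AND clarity (made-up coefficients)
rng = np.random.default_rng(0)
n = 500
carat = rng.uniform(0.2, 3.0, n)
clarity = rng.choice(["I1", "VS1", "IF"], n)
bonus = pd.Series(clarity).map({"I1": 0.0, "VS1": 2000.0, "IF": 4000.0}).to_numpy()
price = 4000.0 * carat + bonus + rng.normal(0.0, 300.0, n)
toy = pd.DataFrame({"carat": carat, "clarity": clarity, "price": price})

def fit_and_score(X, y):
    """Fit a linear model on a 70/30 split and return (r-squared, RMSE) on the test part."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=30
    )
    model = LinearRegression().fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, preds)))
    return r2_score(y_test, preds), rmse

# Baseline: carat only
r2_num, rmse_num = fit_and_score(toy[["carat"]], toy["price"])

# Carat plus dummy-encoded clarity
X_all = pd.get_dummies(
    toy.drop("price", axis=1), columns=["clarity"], drop_first=True, dtype=float
)
r2_all, rmse_all = fit_and_score(X_all, toy["price"])

print("carat only:      r2={:.3f}, rmse={:.0f}".format(r2_num, rmse_num))
print("carat + clarity: r2={:.3f}, rmse={:.0f}".format(r2_all, rmse_all))
```

Running the same comparison on `df_new` would show how much of the reported score comes from the categorical qualities rather than from size alone.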

Conclusion

This brief study has shown that we can build a visualisation that demonstrates the relationship between the qualities of diamonds and their prices. Price is not related only to the size of the stone: superb qualities can allow much smaller diamonds to attain the prices of their larger counterparts.

We have found an interesting relationship whereby certain carat sizes have a higher representation than others; our assumption is that this is a man-made effect to produce stones of more marketable sizes.

We have also shown that it is possible to build an accurate price predictor that utilises both numerical and categorical qualities of the stones, which suggests that with a more contemporary data set we could build an accurate pricing model.
