- Data Problem Definition
- Data Understanding Steps
- Histogram and Distribution
- Data Preparation
- Selecting the model
- Executive Summary
Identify the key predictive factors that influence the sale price of used cars using a dataset of 426,000 vehicle records.
Specifically, develop a regression model that explains the variation in sale price based on various attributes such as manufacturer, model, year, odometer, condition, and other relevant features.
The goal is to extract insights on the relationships between these attributes and sale price, ultimately informing feature importance and driving business recommendations for the used car dealership.
- Identify significant predictors of used car sale price.
- Quantify the relationships between predictors and sale price.
- Rank features by importance to inform business decisions.
- Data preprocessing and feature engineering
- Regression modeling (e.g. linear)
- Feature selection and importance analysis
- Model evaluation and validation
This reframed definition provides a clear direction for the data analysis task, focusing on identifying key drivers of used car prices using regression modeling and feature importance analysis.
I will use the following steps to get familiar with the dataset and identify quality issues:
Step 1: Initial Data Review
- Check the dataset's dimensions (number of rows, columns)
- Review the data types for each column (numeric, categorical, date, etc.)
Step 2: Summary Statistics
- Calculate summary statistics for numeric columns (mean, median, mode, min, max, std dev)
- Generate frequency distributions for categorical columns
Step 3: Data Visualization
- Plot histograms or density plots for numeric columns
- Create bar charts for categorical columns
- Use scatter plots or correlation matrices to explore relationships between columns
Step 4: Missing Value Analysis
- Identify columns with missing values
- Calculate the percentage of missing values for each column
Step 5: Data Quality Checks
- Check for invalid or inconsistent values
- Identify duplicates or redundant records
- Verify data formats (e.g., date, time, categorical)
Step 6: Data Profiling
- Create a data profile report summarizing the findings
Document data quality issues, inconsistencies, and areas for improvement
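The steps above can be sketched in pandas. The following is a minimal illustration on a toy DataFrame standing in for the full dataset (the column names match the real data; the values are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 426k-row dataset (values are illustrative only).
df = pd.DataFrame({
    "price": [6000, 13500, 0, 22000],
    "year": [2010.0, 2015.0, np.nan, 2018.0],
    "manufacturer": ["ford", None, "chevrolet", "volvo"],
})

# Step 1: dimensions and data types
print(df.shape)   # (4, 3)
print(df.dtypes)

# Step 2: summary statistics
print(df.describe())                      # numeric columns
print(df["manufacturer"].value_counts())  # categorical frequencies

# Step 4: percentage of missing values per column
null_pct = df.isnull().mean() * 100
print(null_pct)
```

The same calls scale unchanged to the real dataset; `df.isnull().mean()` is what produces the null-percentage column in the table below.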
Here is the information of the raw data:
| Column | Non-Null Count | Null Count | Percent | Dtype |
|---|---|---|---|---|
| id | 426880 | 0 | 0.000000 | int64 |
| region | 426880 | 0 | 0.000000 | object |
| price | 426880 | 0 | 0.000000 | int64 |
| year | 425675 | 1205 | 0.282281 | float64 |
| manufacturer | 409234 | 17646 | 4.133714 | object |
| model | 421603 | 5277 | 1.236179 | object |
| condition | 252776 | 174104 | 40.785232 | object |
| cylinders | 249202 | 177678 | 41.622470 | object |
| fuel | 423867 | 3013 | 0.705819 | object |
| odometer | 422480 | 4400 | 1.030735 | float64 |
| title_status | 418638 | 8242 | 1.930753 | object |
| transmission | 424324 | 2556 | 0.598763 | object |
| VIN | 265838 | 161042 | 37.725356 | object |
| drive | 296313 | 130567 | 30.586347 | object |
| size | 120519 | 306361 | 71.767476 | object |
| type | 334022 | 92858 | 21.752717 | object |
| paint_color | 296677 | 130203 | 30.501078 | object |
| state | 426880 | 0 | 0.000000 | object |

Based on the table above, it is clear that some features contain null values. Notably, in some cases the null count exceeds the non-null count, as with the "size" feature. Such heavily missing features are not worth retaining, as they would skew the data.
As a general guideline, we will retain features with fewer than 50% null values. Later, we will decide whether to fill the null values with the most frequent value for each feature, or with a placeholder such as "missing" or "unknown".
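A minimal sketch of that 50% rule, again on toy data mimicking the real null patterns (the actual thresholding runs on the full DataFrame):

```python
import pandas as pd

# Toy columns: "size" is mostly missing, like the real column.
df = pd.DataFrame({
    "size": [None, None, None, "mid-size"],
    "condition": ["good", None, "fair", "new"],
    "price": [6000, 13500, 8000, 22000],
})

# Keep only columns where fewer than 50% of the values are null.
null_pct = df.isnull().mean()
df = df.loc[:, null_pct < 0.5]
print(list(df.columns))  # 'size' is dropped; 'condition' and 'price' survive
```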
The following graphs are presented in two separate categories:
These graphs illustrate the relationship between categorical features and the target variable (price) by showing the grouped distribution of the feature versus the price. This allows for an examination of how different categories impact the price, enabling insights into potential outliers and patterns.
By visualizing the grouped distribution, we can identify:
- Categories with exceptionally high or low prices
- Categories with limited or excessive price variability
Upon examining the grouped distributions, it is readily apparent that the data contains:
- Outliers and erroneous entries (such as the 10-digit Chevrolet or the 1234567890 Volvo), which can significantly compromise the accuracy and reliability of any subsequent analysis
- A significantly uneven distribution between manufacturers, with some having a disproportionately large or small number of data points
To address these issues, the data preparation stage will:
- Identify and remove outliers that may skew the results
- Correct or eliminate erroneous or invalid data points
- Handle missing values or inconsistencies
- Eliminate duplicates
Following an in-depth analysis of the data, I identified several initiatives to cleanse and preprocess the data for further analysis. The steps I took are outlined below:
I dropped the following features to minimize redundancy and optimize the dataset:
- id
- VIN
- size
- region
- year
- manufacturer
I filled the missing values in the remaining features as follows:
- fuel: replaced with "gas"
- title_status: replaced with "missing"
- drive: replaced with "unknown"
- paint_color: replaced with "unknown"
- condition: replaced with "unknown"
- type: replaced with "unknown"
- model: replaced with "missing"
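These replacements can all be applied in one call by passing a column-to-value mapping to `fillna`; a sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "fuel": ["gas", None, "diesel"],
    "drive": [None, "4wd", None],
    "condition": ["good", None, None],
})

# Per-column fill values: the dominant category for fuel, placeholders elsewhere.
fill_values = {"fuel": "gas", "drive": "unknown", "condition": "unknown"}
df = df.fillna(fill_values)
print(df["drive"].tolist())  # ['unknown', '4wd', 'unknown']
```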
To prevent skewness and ensure robust analysis, I removed the bottom 10% and top 5% of values in the price feature, effectively eliminating outliers.
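The trimming can be expressed with quantiles; a sketch on a synthetic price column standing in for the real one:

```python
import pandas as pd

# Synthetic prices 1..100 standing in for the real price column.
df = pd.DataFrame({"price": range(1, 101)})

# Drop the bottom 10% and top 5% of prices.
lo, hi = df["price"].quantile([0.10, 0.95])
df = df[(df["price"] >= lo) & (df["price"] <= hi)]
print(df["price"].min(), df["price"].max())  # 11 95
```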
Analyses revealed that separate manufacturer and model features can introduce errors due to shared models across manufacturers.
To resolve this issue, I concatenated these features, creating unique combined values and ultimately enhancing the model's performance.
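The combined feature is a simple string concatenation of the two columns:

```python
import pandas as pd

df = pd.DataFrame({
    "manufacturer": ["ford", "chevrolet"],
    "model": ["f-150", "silverado"],
})

# Combine into one unambiguous feature, then drop the originals.
df["manufacturer_model"] = df["manufacturer"] + "_" + df["model"]
df = df.drop(columns=["manufacturer", "model"])
print(df["manufacturer_model"].tolist())  # ['ford_f-150', 'chevrolet_silverado']
```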
These data preparation steps enabled me to create a cleaner, more reliable dataset for subsequent analysis and modeling. The results can be seen in the following graphs:
As evident from the graphs, our data now exhibits a more even distribution. This is consistently observed across all graphs, with the manufacturer vs. price graph serving as a prime example: the prices for each manufacturer are now more evenly distributed, indicating the removal of the most prominent outliers.

The price graph now displays different information. Previously, it showed a simple price histogram; now, it shows a price vs. year distribution, revealing the average prices across all years. This new representation uncovers a cyclical pattern and a discernible trend, offering valuable insights into price fluctuations over time.

The graph below provides a direct comparison between the original data's price vs. year and the cleansed data's price vs. year. At a glance, we can appreciate the significant improvement in data distribution. The cleansed data, now more evenly distributed, will serve as the foundation for further analysis, enabling more accurate and reliable conclusions.

Now that we have thoroughly cleansed the data, we can proceed to prepare it for further analysis, focusing particularly on the categorical features. To facilitate analysis, I applied the following encoding techniques:
One Hot Encoding: For the features with multiple categories, I used One Hot Encoding to create binary vectors:
- model
- transmission
- paint_color
- state
- fuel
- title_status
- drive
- type
- condition
The data preparation and encoding process was successfully executed utilizing the make_column_transformer function, capable of handling multiple encoding schemes concurrently.
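A sketch of that transformer on two toy columns (the real pipeline lists all of the categorical features named above):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "fuel": ["gas", "diesel", "gas"],
    "odometer": [45000.0, 120000.0, 80000.0],
})

# One-hot encode the categorical column; pass numeric columns through untouched.
ct = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["fuel"]),
    remainder="passthrough",
)
X = ct.fit_transform(df)
print(X.shape)  # (3, 3): two fuel dummies plus odometer
```

`handle_unknown="ignore"` keeps the transformer from failing on categories that appear only in unseen data.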
With the transformed data, we can now proceed to select a suitable algorithm and forecast the price with increased accuracy.
To identify the most accurate model, I evaluated four distinct regression models, each with its strengths and weaknesses. The selected models were:
- Ridge Regression: A linear regression model that uses L2 regularization to reduce overfitting by minimizing the magnitude of model coefficients.
- Lasso Regression: Linear regression with L1 regularization for feature selection and reduced multicollinearity.
- Linear Regression: Basic linear regression for simple relationships and initial analysis.
- Elastic Net: A hybrid linear regression combining L1 and L2 regularization for balanced regularization.
The objective of this evaluation was twofold:
- Minimize the Mean Squared Error (MSE), ensuring accurate predictions while avoiding overfitting.
- Identify the coefficients of each feature, enabling the analysis of their impact on car prices.
By comparing the performance of these models, we can determine which one provides the most insightful and accurate predictions, ultimately informing data-driven decisions.
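The comparison loop looks roughly like this, shown here on synthetic data (the real run used the encoded vehicle features and, judging by the elapsed times, a cross-validated search):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the encoded vehicle features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.5]) + rng.normal(scale=0.5, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Ridge": Ridge(alpha=1.0),
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(alpha=0.1),
    "Elastic Net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
mse = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    mse[name] = mean_squared_error(y_te, model.predict(X_te))
print(mse)  # held-out MSE per model
```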
The result of the evaluation can be found in the following table:
| Model | Best Score | Mean Squared Error (MSE) | Elapsed Time |
|---|---|---|---|
| Ridge | 0.561262 | 40,246,517.02 | 30 seconds |
| Linear Regression | 0.552364 | 41,062,787.54 | 50 seconds |
| Lasso | 0.541976 | 42,015,664.01 | 2 hours 56 minutes |
| Elastic Net | 0.382332 | 56,660,232.62 | 15 minutes |
- Ridge Regression outperforms Linear Regression, Lasso and Elastic Net in terms of Best Score (0.561262) and Mean Squared Error (MSE) (40,246,517.02).
- Linear Regression comes second, with a score of 0.552364, an MSE of 41,062,787.54, and a training time of 50 seconds.
- Lasso Regression has the longest training time (2 hours 56 minutes), despite its relatively poor performance.
- Elastic Net Regression has the worst performance among the four models, with the lowest Best Score (0.382332) and highest MSE (56,660,232.62).
- Ridge Regression: Best performance and fastest training time (30 seconds).
- Linear Regression: Fast training time (50 seconds) and strong performance, though still behind Ridge Regression.
- Elastic Net Regression: Relatively slow training time (15 minutes) considering its poor performance.
- Lasso Regression: Longest training time (2 hours 56 minutes), making it the least efficient.
All models struggle with large errors, as indicated by Mean Squared Error (MSE) values.
Ridge Regression has the lowest MSE (40,246,517.02).
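To put these MSE values on the dollar scale, take the square root: Ridge's MSE of about 40.2 million corresponds to a typical prediction error of roughly $6,300 per car.

```python
import math

mse = 40_246_517.02  # Ridge's MSE from the table above
rmse = math.sqrt(mse)
print(round(rmse))  # 6344 -- typical error, in dollars
```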
This report presents the findings of a data analysis aimed at identifying key factors influencing used car prices. Using the Ridge model, we determined the most relevant features and their coefficients. Our results provide actionable insights for used car dealers to fine-tune their inventory and pricing strategies.
This predictive modeling project employed a structured approach to analyze the vehicle dataset and estimate car prices.
The dataset utilized in this study consisted of 426,000 vehicle records, containing a mix of numerical and categorical features:
- Numerical features: price, year, odometer
- Categorical features: region, manufacturer, model, condition, cylinders, fuel, transmission, drive, size, type, paint_color, state
- Identifiers: id, VIN
The dataset underwent rigorous data cleansing and preparation:
- Handling missing values and outliers
- Data normalization and feature scaling
- Encoding categorical variables
- Transforming numerical features
- Ridge Regression: Ridge Regression is a regularized linear regression technique that minimizes overfitting by adding a penalty term to the cost function.
- Lasso (Least Absolute Shrinkage and Selection Operator): A regularized linear regression technique whose L1 penalty can shrink coefficients to exactly zero, effectively performing feature selection.
- Linear Regression: Ordinary least squares without regularization, serving as a baseline.
- Elastic Net: Elastic Net is a hybrid regularized linear regression technique that combines Ridge and Lasso.
Model performance was assessed using:
- Mean Squared Error (MSE): measures prediction accuracy
- Score: The best cross-validation score, evaluating the model's ability to make accurate predictions.
- Time/Performance: Measures the computational efficiency and speed of the model.
Following a thorough evaluation, we selected the Ridge model as the most suitable algorithm due to its accuracy and performance. To glean actionable insights, we examined two key characteristics: permutation importance and coefficients. These metrics enabled us to determine vital information from the model, which informed the subsequent benefits for car dealerships, categorised into low-end, mid-range, and high-end.
Permutation importance measures the contribution of each feature to a model's predictions. It calculates the decrease in model performance when a feature is randomly shuffled, indicating its relative importance.
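A sketch of that computation with scikit-learn's `permutation_importance`, on synthetic data where one feature dominates the target:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge

# Synthetic data: feature 0 drives the target, the others barely matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + 0.1 * X[:, 2] + rng.normal(scale=0.1, size=300)

model = Ridge().fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
# Shuffling feature 0 destroys the score; shuffling the others barely hurts it.
print(result.importances_mean)
```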
Permutation Importance: Which features matter most?
- The number of cylinders (5 cylinders) and the car's condition (excellent) are the most important features, influencing 89.5% of the model's predictions.
- The condition of the car (fair, unknown, new, like new, good) has a significant impact on the prediction, indicating that the model is sensitive to the car's state.
- The number of cylinders (10, 4, 12, 3) has a relatively lower importance score, suggesting that the model is less sensitive to these features.
Coefficients represent the change in the predicted outcome for a one-unit change in the feature, while holding all other features constant. They indicate:
Coefficients: How much do individual features impact the prediction?
- Luxury brands: Negative coefficients (e.g., Ferrari -11165.953304, Audi -11055.727044) signify lower prices or reduced demand, indicating a niche market.
- Classic Ford models: High positive coefficients (e.g., Ford_Roadster 43548.884763) suggest elevated demand and prices, presenting lucrative opportunities.
- Mid-range models: Moderate coefficients (e.g., Chevrolet_Coupe 32694.947709) indicate stable demand and prices, ensuring a consistent revenue stream.
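The coefficients come straight off the fitted model; here is a sketch with two hypothetical one-hot columns (the names are illustrative, not the real encoded features):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Two hypothetical encoded features with opposite effects on price.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(100, 2)), columns=["ford_roadster", "audi_q8"])
y = 40000 * X["ford_roadster"] - 11000 * X["audi_q8"] + rng.normal(scale=100, size=100)

model = Ridge(alpha=1.0).fit(X, y)
coefs = pd.Series(model.coef_, index=X.columns).sort_values(ascending=False)
# A positive coefficient raises the predicted price; a negative one lowers it.
print(coefs)
```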
- Stock high-performance vehicles: Focus on 5-cylinder models from luxury brands.
- Emphasise condition: Highlight excellent condition to command premium prices.
- Target luxury buyers: Focus marketing on affluent customers seeking high-end features.
- Potential profit margins increase by $32,000-$43,000 per luxury unit sold.
- Focus on 5-cylinder models, which have high importance scores (cylinders_5 cylinders: 0.51755).
- Focus on selling Chevrolet Coupe and Ford Convertible, highlighting value-for-money.
- Emphasize the value of mid-range models like Ford F-350 SD Lariat and GMC Sierra SLT 6.2L Crew Cab, considering their negative coefficients (-11575.422905 and -11442.310872).
- Highlight excellent condition and maintenance to appeal to practical buyers (condition_excellent importance score: 0.37687).
- Target budget-conscious buyers: Emphasise affordability, fuel efficiency, and reliability.
- Focus on affordable models with fewer cylinders (3 or 4), which have lower importance scores (cylinders_3 cylinders: 0.00614, cylinders_4 cylinders: 0.02032).
- Highlight good condition and maintenance to appeal to budget-conscious buyers (condition_good importance score: 0.00598).
- Price competitively: Set prices considering low-end coefficients and market demand.
- Data-driven pricing: Utilise coefficients to inform pricing strategies.
- Inventory optimisation: Stock vehicles with in-demand features.
- Targeted marketing: Focus on specific customer segments.
Ford's iconic models (Roadster, Coupe, T, T-Bucket, Tudor, Model A) increase prices by $32,000-$43,000. Chevrolet's Coupe increases prices by $32,694.
Models like Audi Q8, Ferrari California T, and GMC Sierra SLT 6.2L Crew Cab reduce prices by $11,000-$15,000.
By leveraging these insights, car dealerships can refine their business strategies, enhance customer satisfaction, and boost profitability.
- Review inventory and pricing strategies.
- Develop targeted marketing campaigns.
- Adjust sales training to emphasize value propositions.
This analysis provides valuable insights for used car dealers to refine their pricing strategies and inventory management. By understanding the impact of key features on car prices, dealers can make informed decisions to optimize their business.