- Data Problem Definition
- Data Understanding Steps
- Histogram and Distribution
- Data Preparation
- Selecting the model
- Executive Summary
Identify the key predictive factors that influence the sale price of used cars using a dataset of 426,000 vehicle records.
Specifically, develop a regression model that explains the variation in sale price based on various attributes such as manufacturer, model, year, odometer, condition, and other relevant features.
The goal is to extract insights on the relationships between these attributes and sale price, ultimately informing feature importance and driving business recommendations for the used car dealership.
- Identify significant predictors of used car sale price.
- Quantify the relationships between predictors and sale price.
- Rank features by importance to inform business decisions.
- Data preprocessing and feature engineering
- Regression modeling (e.g. linear)
- Feature selection and importance analysis
- Model evaluation and validation
This reframed definition provides a clear direction for the data analysis task, focusing on identifying key drivers of used car prices using regression modeling and feature importance analysis.
I will use the following steps to get familiar with the dataset and identify quality issues:
Step 1: Initial Data Review
- Check the dataset's dimensions (number of rows, columns)
- Review the data types for each column (numeric, categorical, date, etc.)
Step 2: Summary Statistics
- Calculate summary statistics for numeric columns (mean, median, mode, min, max, std dev)
- Generate frequency distributions for categorical columns
Step 3: Data Visualization
- Plot histograms or density plots for numeric columns
- Create bar charts for categorical columns
- Use scatter plots or correlation matrices to explore relationships between columns
Step 4: Missing Value Analysis
- Identify columns with missing values
- Calculate the percentage of missing values for each column
Step 5: Data Quality Checks
- Check for invalid or inconsistent values
- Identify duplicates or redundant records
- Verify data formats (e.g., date, time, categorical)
Step 6: Data Profiling
- Create a data profile report summarizing the findings
Document data quality issues, inconsistencies, and areas for improvement
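The steps above can be sketched in pandas. The following is a minimal illustration on a toy DataFrame standing in for the full dataset (the column names match the real data; the values are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 426k-row dataset (values are illustrative only).
df = pd.DataFrame({
    "price": [6000, 13500, 0, 22000],
    "year": [2010.0, 2015.0, np.nan, 2018.0],
    "manufacturer": ["ford", None, "chevrolet", "volvo"],
})

# Step 1: dimensions and data types
print(df.shape)   # (4, 3)
print(df.dtypes)

# Step 2: summary statistics
print(df.describe())                      # numeric columns
print(df["manufacturer"].value_counts())  # categorical frequencies

# Step 4: percentage of missing values per column
null_pct = df.isnull().mean() * 100
print(null_pct)
```

The same calls scale unchanged to the real dataset; `df.isnull().mean()` is what produces the null-percentage column in the table below.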
Here is the information of the raw data:
| Column | Non-Null Count | Null Count | Percent | Dtype |
|---|---|---|---|---|
| id | 426880 | 0 | 0.000000 | int64 |
| region | 426880 | 0 | 0.000000 | object |
| price | 426880 | 0 | 0.000000 | int64 |
| year | 425675 | 1205 | 0.282281 | float64 |
| manufacturer | 409234 | 17646 | 4.133714 | object |
| model | 421603 | 5277 | 1.236179 | object |
| condition | 252776 | 174104 | 40.785232 | object |
| cylinders | 249202 | 177678 | 41.622470 | object |
| fuel | 423867 | 3013 | 0.705819 | object |
| odometer | 422480 | 4400 | 1.030735 | float64 |
| title_status | 418638 | 8242 | 1.930753 | object |
| transmission | 424324 | 2556 | 0.598763 | object |
| VIN | 265838 | 161042 | 37.725356 | object |
| drive | 296313 | 130567 | 30.586347 | object |
| size | 120519 | 306361 | 71.767476 | object |
| type | 334022 | 92858 | 21.752717 | object |
| paint_color | 296677 | 130203 | 30.501078 | object |
| state | 426880 | 0 | 0.000000 | object |

Based on the table above, it is clear that some features contain null values. Notably, in some cases the null count exceeds the non-null count, as with the "size" feature. Such heavily missing features are not worth retaining, as they would skew the data.
As a general guideline, we will retain features with fewer than 50% null values. Later, we will decide whether to fill the null values with the most frequent value for each feature, or with a placeholder such as "missing" or "unknown".
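A minimal sketch of that 50% rule, again on toy data mimicking the real null patterns (the actual thresholding runs on the full DataFrame):

```python
import pandas as pd

# Toy columns: "size" is mostly missing, like the real column.
df = pd.DataFrame({
    "size": [None, None, None, "mid-size"],
    "condition": ["good", None, "fair", "new"],
    "price": [6000, 13500, 8000, 22000],
})

# Keep only columns where fewer than 50% of the values are null.
null_pct = df.isnull().mean()
df = df.loc[:, null_pct < 0.5]
print(list(df.columns))  # 'size' is dropped; 'condition' and 'price' survive
```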
The following graphs are presented in two separate categories:
These graphs illustrate the relationship between categorical features and the target variable (price) by showing the grouped distribution of the feature versus the price. This allows for an examination of how different categories impact the price, enabling insights into potential outliers and patterns.
By visualizing the grouped distribution, we can identify:
- Categories with exceptionally high or low prices
- Categories with limited or excessive price variability
Upon examining the grouped distributions, it is readily apparent that the data contains:
- Outliers and erroneous entries (such as the 10-digit Chevrolet or the 1234567890 Volvo), which can significantly compromise the accuracy and reliability of any subsequent analysis
- A significantly uneven distribution between manufacturers, with some having a disproportionately large or small number of data points
To address these issues, the data preparation stage will:
- Identify and remove outliers that may skew the results
- Correct or eliminate erroneous or invalid data points
- Handle missing values or inconsistencies
- Eliminate duplicates
Following an in-depth analysis of the data, I identified several initiatives to cleanse and preprocess the data for further analysis. The steps I took are outlined below:
I dropped the following features to minimize redundancy and optimize the dataset:
- id
- VIN
- size
- region
- year
- manufacturer
I filled the missing values in the remaining features as follows:
- fuel: replaced with "gas"
- title_status: replaced with "missing"
- drive: replaced with "unknown"
- paint_color: replaced with "unknown"
- condition: replaced with "unknown"
- type: replaced with "unknown"
- model: replaced with "missing"
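These replacements can all be applied in one call by passing a column-to-value mapping to `fillna`; a sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "fuel": ["gas", None, "diesel"],
    "drive": [None, "4wd", None],
    "condition": ["good", None, None],
})

# Per-column fill values: the dominant category for fuel, placeholders elsewhere.
fill_values = {"fuel": "gas", "drive": "unknown", "condition": "unknown"}
df = df.fillna(fill_values)
print(df["drive"].tolist())  # ['unknown', '4wd', 'unknown']
```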
To prevent skewness and ensure robust analysis, I removed the bottom 10% and top 5% of values in the price feature, effectively eliminating outliers.
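The trimming can be expressed with quantiles; a sketch on a synthetic price column standing in for the real one:

```python
import pandas as pd

# Synthetic prices 1..100 standing in for the real price column.
df = pd.DataFrame({"price": range(1, 101)})

# Drop the bottom 10% and top 5% of prices.
lo, hi = df["price"].quantile([0.10, 0.95])
df = df[(df["price"] >= lo) & (df["price"] <= hi)]
print(df["price"].min(), df["price"].max())  # 11 95
```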
Analyses revealed that separate manufacturer and model features can introduce errors due to shared models across manufacturers.
To resolve this issue, I concatenated these features, creating unique combined values and ultimately enhancing the model's performance.
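The combined feature is a simple string concatenation of the two columns:

```python
import pandas as pd

df = pd.DataFrame({
    "manufacturer": ["ford", "chevrolet"],
    "model": ["f-150", "silverado"],
})

# Combine into one unambiguous feature, then drop the originals.
df["manufacturer_model"] = df["manufacturer"] + "_" + df["model"]
df = df.drop(columns=["manufacturer", "model"])
print(df["manufacturer_model"].tolist())  # ['ford_f-150', 'chevrolet_silverado']
```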
These data preparation steps enabled me to create a cleaner, more reliable dataset for subsequent analysis and modeling. The results can be seen in the following graphs:
As evident from the graphs, our data now exhibits a more even distribution. This is consistently observed across all graphs, with the manufacturer vs. price graph serving as a prime example: the prices for each manufacturer are now more evenly distributed, indicating the removal of the most prominent outliers.

The price graph now displays different information. Previously, it showed a simple price histogram; now, it shows a price vs. year distribution, revealing the average prices across all years. This new representation uncovers a cyclical pattern and a discernible trend, offering valuable insights into price fluctuations over time.

The graph below provides a direct comparison between the original data's price vs. year and the cleansed data's price vs. year. At a glance, we can appreciate the significant improvement in data distribution. The cleansed data, now more evenly distributed, will serve as the foundation for further analysis, enabling more accurate and reliable conclusions.

Now that we have thoroughly cleansed the data, we can proceed to prepare it for further analysis, focusing particularly on the categorical features. To facilitate analysis, I applied the following encoding techniques:
One Hot Encoding: For the features with multiple categories, I used One Hot Encoding to create binary vectors:
- model
- transmission
- paint_color
- state
- fuel
- title_status
- drive
- type
- condition
The data preparation and encoding process was successfully executed utilizing the make_column_transformer function, capable of handling multiple encoding schemes concurrently.
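A sketch of that transformer on two toy columns (the real pipeline lists all of the categorical features named above):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "fuel": ["gas", "diesel", "gas"],
    "odometer": [45000.0, 120000.0, 80000.0],
})

# One-hot encode the categorical column; pass numeric columns through untouched.
ct = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["fuel"]),
    remainder="passthrough",
)
X = ct.fit_transform(df)
print(X.shape)  # (3, 3): two fuel dummies plus odometer
```

`handle_unknown="ignore"` keeps the transformer from failing on categories that appear only in unseen data.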
With the transformed data, we can now proceed to select a suitable algorithm and forecast the price with increased accuracy.
To identify the most accurate model, I evaluated four distinct regression models, each with its strengths and weaknesses. The selected models were:
- Ridge Regression: A linear regression model that uses L2 regularization to reduce overfitting by minimizing the magnitude of model coefficients.
- Lasso Regression: Linear regression with L1 regularization for feature selection and reduced multicollinearity.
- Linear Regression: Basic linear regression for simple relationships and initial analysis.
- Elastic Net: A hybrid linear regression combining L1 and L2 regularization for balanced regularization.
The objective of this evaluation was twofold:
- Minimize the Mean Squared Error (MSE), ensuring accurate predictions while avoiding overfitting.
- Identify the coefficients of each feature, enabling the analysis of their impact on car prices.
By comparing the performance of these models, we can determine which one provides the most insightful and accurate predictions, ultimately informing data-driven decisions.
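The comparison loop looks roughly like this, shown here on synthetic data (the real run used the encoded vehicle features and, judging by the elapsed times, a cross-validated search):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the encoded vehicle features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.5]) + rng.normal(scale=0.5, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Ridge": Ridge(alpha=1.0),
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(alpha=0.1),
    "Elastic Net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
mse = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    mse[name] = mean_squared_error(y_te, model.predict(X_te))
print(mse)  # held-out MSE per model
```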
The result of the evaluation can be found in the following table:
| Model | Best Score | Mean Squared Error (MSE) | Elapsed Time |
|---|---|---|---|
| Ridge | 0.561262 | 40,246,517.02 | 30 seconds |
| Linear Regression | 0.552364 | 41,062,787.54 | 50 seconds |
| Lasso | 0.541976 | 42,015,664.01 | 2 hours 56 minutes |
| Elastic Net | 0.382332 | 56,660,232.62 | 15 minutes |
- Ridge Regression outperforms Linear Regression, Lasso and Elastic Net in terms of Best Score (0.561262) and Mean Squared Error (MSE) (40,246,517.02).
- Linear Regression comes second, with a score of 0.552364, an MSE of 41,062,787.54, and a training time of 50 seconds.
- Lasso Regression has the longest training time (2 hours 56 minutes), despite its relatively poor performance.
- Elastic Net Regression has the worst performance among the four models, with the lowest Best Score (0.382332) and highest MSE (56,660,232.62).
- Ridge Regression: Best performance and fastest training time (30 seconds).
- Linear Regression: Fast training time (50 seconds) and strong performance, though still behind Ridge Regression.
- Elastic Net Regression: Relatively slow training time (15 minutes) considering its poor performance.
- Lasso Regression: Longest training time (2 hours 56 minutes), making it the least efficient.
All models struggle with large errors, as indicated by Mean Squared Error (MSE) values.
Ridge Regression has the lowest MSE (40,246,517.02).
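To put these MSE values on the dollar scale, take the square root: Ridge's MSE of about 40.2 million corresponds to a typical prediction error of roughly $6,300 per car.

```python
import math

mse = 40_246_517.02  # Ridge's MSE from the table above
rmse = math.sqrt(mse)
print(round(rmse))  # 6344 -- typical error, in dollars
```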
This report presents the findings of a data analysis aimed at identifying key factors influencing used car prices. Using the Ridge model, we determined the most relevant features and their coefficients. Our results provide actionable insights for used car dealers to fine-tune their inventory and pricing strategies.
This predictive modeling project employed a structured approach to analyze the vehicle dataset and estimate car prices.
The dataset utilized in this study consisted of 426,000 vehicle records, containing a mix of numerical and categorical features:
- Numerical features: price, year, odometer
- Categorical features: region, manufacturer, model, condition, cylinders, fuel, transmission, drive, size, type, paint_color, state
- Identifiers: id, VIN
The dataset underwent rigorous data cleansing and preparation:
- Handling missing values and outliers
- Data normalization and feature scaling
- Encoding categorical variables
- Transforming numerical features
- Ridge Regression: Ridge Regression is a regularized linear regression technique that minimizes overfitting by adding a penalty term to the cost function.
- Lasso (Least Absolute Shrinkage and Selection Operator): A regularized linear regression technique whose L1 penalty can shrink coefficients to exactly zero, effectively performing feature selection.
- Linear Regression: Ordinary least squares without regularization, serving as a baseline.
- Elastic Net: Elastic Net is a hybrid regularized linear regression technique that combines Ridge and Lasso.
Model performance was assessed using:
- Mean Squared Error (MSE): measures prediction accuracy
- Score: The best cross-validation score, evaluating the model's ability to make accurate predictions.
- Time/Performance: Measures the computational efficiency and speed of the model.
Following a thorough evaluation, we selected the Ridge model as the most suitable algorithm due to its accuracy and performance. To glean actionable insights, we examined two key characteristics: permutation importance and coefficients. These metrics enabled us to determine vital information from the model, which informed the subsequent benefits for car dealerships, categorised into low-end, mid-range, and high-end.
Permutation importance measures the contribution of each feature to a model's predictions. It calculates the decrease in model performance when a feature is randomly shuffled, indicating its relative importance.
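A sketch of that computation with scikit-learn's `permutation_importance`, on synthetic data where one feature dominates the target:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge

# Synthetic data: feature 0 drives the target, the others barely matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + 0.1 * X[:, 2] + rng.normal(scale=0.1, size=300)

model = Ridge().fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
# Shuffling feature 0 destroys the score; shuffling the others barely hurts it.
print(result.importances_mean)
```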
Permutation Importance: Which features matter most?
- The number of cylinders (5 cylinders) and the car's condition (excellent) are the most important features, influencing 89.5% of the model's predictions.
- The condition of the car (fair, unknown, new, like new, good) has a significant impact on the prediction, indicating that the model is sensitive to the car's state.
- The number of cylinders (10, 4, 12, 3) has a relatively lower importance score, suggesting that the model is less sensitive to these features.
Coefficients represent the change in the predicted outcome for a one-unit change in the feature, while holding all other features constant. They indicate:
Coefficients: How much do individual features impact the prediction?
- Luxury brands: Negative coefficients (e.g., Ferrari -11165.953304, Audi -11055.727044) signify lower prices or reduced demand, indicating a niche market.
- Classic Ford models: High positive coefficients (e.g., Ford_Roadster 43548.884763) suggest elevated demand and prices, presenting lucrative opportunities.
- Mid-range models: Moderate coefficients (e.g., Chevrolet_Coupe 32694.947709) indicate stable demand and prices, ensuring a consistent revenue stream.
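The coefficients come straight off the fitted model; here is a sketch with two hypothetical one-hot columns (the names are illustrative, not the real encoded features):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Two hypothetical encoded features with opposite effects on price.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(100, 2)), columns=["ford_roadster", "audi_q8"])
y = 40000 * X["ford_roadster"] - 11000 * X["audi_q8"] + rng.normal(scale=100, size=100)

model = Ridge(alpha=1.0).fit(X, y)
coefs = pd.Series(model.coef_, index=X.columns).sort_values(ascending=False)
# A positive coefficient raises the predicted price; a negative one lowers it.
print(coefs)
```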
- Stock high-performance vehicles: Focus on 5-cylinder models from luxury brands.
- Emphasise condition: Highlight excellent condition to command premium prices.
- Target luxury buyers: Focus marketing on affluent customers seeking high-end features.
- Potential profit margins increase by $32,000-$43,000 per luxury unit sold.
- Focus on 5-cylinder models, which have high importance scores (cylinders_5 cylinders: 0.51755).
- Focus on selling Chevrolet Coupe and Ford Convertible, highlighting value-for-money.
- Emphasize the value of mid-range models like Ford F-350 SD Lariat and GMC Sierra SLT 6.2L Crew Cab, considering their negative coefficients (-11575.422905 and -11442.310872).
- Highlight excellent condition and maintenance to appeal to practical buyers (condition_excellent importance score: 0.37687).
- Target budget-conscious buyers: Emphasise affordability, fuel efficiency, and reliability.
- Focus on affordable models with fewer cylinders (3 or 4), which have lower importance scores (cylinders_3 cylinders: 0.00614, cylinders_4 cylinders: 0.02032).
- Highlight good condition and maintenance to appeal to budget-conscious buyers (condition_good importance score: 0.00598).
- Price competitively: Set prices considering low-end coefficients and market demand.
- Data-driven pricing: Utilise coefficients to inform pricing strategies.
- Inventory optimisation: Stock vehicles with in-demand features.
- Targeted marketing: Focus on specific customer segments.
Ford's iconic models (Roadster, Coupe, T, T-Bucket, Tudor, Model A) increase prices by $32,000-$43,000. Chevrolet's Coupe increases prices by $32,694.
Models like Audi Q8, Ferrari California T, and GMC Sierra SLT 6.2L Crew Cab reduce prices by $11,000-$15,000.
By leveraging these insights, car dealerships can refine their business strategies, enhance customer satisfaction, and boost profitability.
- Review inventory and pricing strategies.
- Develop targeted marketing campaigns.
- Adjust sales training to emphasize value propositions.
This analysis provides valuable insights for used car dealers to refine their pricing strategies and inventory management. By understanding the impact of key features on car prices, dealers can make informed decisions to optimize their business.