Skip to content

you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

License

Notifications You must be signed in to change notification settings

rocharodo/What-Drives-the-Price-of-a-Car-

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

46 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Table of content

Data Problem Definition:

Identify the key predictive factors that influence the sale price of used cars using a dataset of 426,000 vehicle records. Specifically, develop a regression model that explains the variation in sale price based on various attributes such as manufacturer, model, year, odometer, condition, and other relevant features.
The goal is to extract insights on the relationships between these attributes and sale price, ultimately informing feature importance and driving business recommendations for the used car dealership.

Key Objectives:

  • Identify significant predictors of used car sale price.
  • Quantify the relationships between predictors and sale price.
  • Rank features by importance to inform business decisions.

Technical Requirements:

  • Data preprocessing and feature engineering
  • Regression modeling (e.g. linear)
  • Feature selection and importance analysis
  • Model evaluation and validation

This reframed definition provides a clear direction for the data analysis task, focusing on identifying key drivers of used car prices using regression modeling and feature importance analysis.

Data Understanding Steps

I will use the following steps to get familiar with the dataset and identify quality issues:

  • Step 1: Initial Data Review
    • Check the dataset's dimensions (number of rows, columns)
    • Review the data types for each column (numeric, categorical, date, etc.)
  • Step 2: Summary Statistics
    • Calculate summary statistics for numeric columns (mean, median, mode, min, max, std dev)
    • Generate frequency distributions for categorical columns
  • Step 3: Data Visualization
    • Plot histograms or density plots for numeric columns
    • Create bar charts for categorical columns
    • Use scatter plots or correlation matrices to explore relationships between columns
  • Step 4: Missing Value Analysis
    • Identify columns with missing values
    • Calculate the percentage of missing values for each column
  • Step 5: Data Quality Checks
    • Check for invalid or inconsistent values
    • Identify duplicates or redundant records
    • Verify data formats (e.g., date, time, categorical)
  • Step 6: Data Profiling
    • Create a data profile report summarizing the findings
    • Document data quality issues, inconsistencies, and areas for improvement

    Here is the information of the raw data:


    Column Non-Null Count Null Counts Percent Dtype
    id 426880 0 0.000000 int64
    region 426880 0 0.000000 object
    price 426880 0 0.000000 int64
    year 425675 1205 0.282281 float64
    manufacturer 409234 17646 4.133714 object
    model 421603 5277 1.236179 object
    condition 252776 174104 40.785232 object
    cylinders 249202 177678 41.622470 object
    fuel 423867 3013 0.705819 object
    odometer 422480 4400 1.030735 float64
    title_status 418638 8242 1.930753 object
    transmission 424324 2556 0.598763 object
    VIN 265838 161042 37.725356 object
    drive 296313 130567 30.586347 object
    size 120519 306361 71.767476 object
    type 334022 92858 21.752717 object
    paint_color 296677 130203 30.501078 object
    state 426880 0 0.000000 object

    Based on the table above, it is clear that some features contain null values. Notably, some of these null values exceed the actual non-null values, such as in the case of the "size" feature. As a result, these null values are not worth retaining, as they will skew the data.
    As a general guideline, we will retain features with fewer than 50% null values. Later, we will decide whether to fill in the null values with the highest value for each feature, or with a placeholder such as "missing" or "unknown".

Histogram and Distribution

The following graphs are presented in two separate categories:

Numerical Features: Histograms

These graphs display the distribution of numerical features through histograms, providing a visual representation of the data's central tendency, dispersion, and shape.

Categorical Features: Grouped Distributions

These graphs illustrate the relationship between categorical features and the target variable (price) by showing the grouped distribution of the feature versus the price. This allows for an examination of how different categories impact the price, enabling insights into potential outliers and patterns.
By visualizing the grouped distribution, we can identify:
  1. Categories with exceptionally high or low prices
  2. Categories with limited or excessive price variability
This analysis facilitates a deeper understanding of the complex relationships between categorical features and the target variable, ultimately supporting more informed decision-making.

Initial Observations and Data Quality Concerns

Upon examining the grouped distributions, it is readily apparent that the data contains:

  1. Outliers and erroneous information - like the 10 digit chevrolet or the 1234567890 volvo -, which can significantly compromise the accuracy and reliability of any subsequent analysis
  2. A significantly uneven distribution between manufacturers, with some having a disproportionately large or small number of data points

Need for Data Cleansing

To ensure the integrity of our analysis, it is essential to perform a thorough data cleansing exercise to:
  1. Identify and remove outliers that may skew the results
  2. Correct or eliminate erroneous or invalid data points
  3. Handle missing values or inconsistencies
  4. Eliminate duplicates
By doing so, we can enhance the quality and reliability of the data, thereby supporting more robust and meaningful insights. This crucial step will enable us to build a stronger foundation for our analysis and mitigate potential risks associated with flawed data.

Data Preparation

Following an in-depth analysis of the data, I identified several initiatives to cleanse and preprocess the data for further analysis. The steps I took are outlined below:

Feature Selection and Removal

I dropped the following features to minimize redundancy and optimize the dataset:
  1. id
  2. VIN
  3. size
  4. region

Handling Missing Values

To ensure data quality, I removed rows with null values in the following critical features:
  1. year
  2. manufacturer

Imputation of Missing Values

For the remaining features with null values, I applied the following imputation strategies:
  1. Fuel: replaced with "gas"
  2. title_status: replaced with "missing"
  3. drive: replaced with "unknown"
  4. paint_color: replaced with "unknown"
  5. condition: replaced with "unknown"
  6. type: replaced with "unknown"
  7. model: replaced with "missing"

Outlier Removal

To prevent skewness and ensure robust analysis, I removed the bottom 10% and top 5% of values in the price feature, effectively eliminating outliers.

Concatenation

Analyses revealed that separate manufacturer and model features can introduce errors due to shared models across manufacturers.
To resolve this issue, I concatenated these features, creating unique combined values and ultimately enhancing the model's performance.


These data preparation steps enabled me to create a cleaner, more reliable dataset for subsequent analysis and modeling. The results can be seen in the following graphs:


Result Analysis

Improved Data Distribution

As evident from the graphs, our data now exhibits a more even distribution. This is consistently observed across all graphs, with the manufacturer vs. price graph serving as a prime example. Here, we can see that the prices for each manufacturer are now more evenly distributed, indicating the removal of the most prominent outliers.

Enhanced Price Analysis

The price graph is now displaying different information. Previously, it displayed a simple price histogram. Now, it showcases a price vs. year distribution, revealing the average prices across all years. This new representation uncovers a cyclical pattern and a discernible trend, offering valuable insights into price fluctuations over time.

Comparative Analysis

The graph below provides a direct comparison between the original data price vs. year and the cleansed data price vs. year. At a glance, we can appreciate the significant improvement in data distribution. The cleansed data, now more evenly distributed, will serve as the foundation for further analysis, enabling more accurate and reliable conclusions.

Data Preparation

Now that we have thoroughly cleansed the data, we can proceed to prepare it for further analysis, focusing particularly on the categorical features.

Encoding Categorical Features

To facilitate analysis, I applied the following encoding techniques:
  1. One Hot Encoding: For the features with multiple categories, I used One Hot Encoding to create binary vectors:
    1. model
    2. Transmission
    3. paint_color
    4. state
    5. fuel
    6. title_status
    7. drive
    8. type
    9. condition

Data Preparation and Encoding

The data preparation and encoding process was successfully executed utilizing the make_column_transformer function, capable of handling multiple encoding schemes concurrently.

Next Steps

With the transformed data, we can now proceed to select a suitable algorithm and forecast the price with increased accuracy.

Selecting the model

Model Selection and Evaluation

To identify the most accurate model, I evaluated three distinct regression models, each with its strengths and weaknesses. The selected models were:

  1. Ridge Regression: A linear regression model that uses L2 regularization to reduce overfitting by minimizing the magnitude of model coefficients.
  2. Lasso Linear regression with L1 regularization for feature selection and reduced multicollinearity.
  3. Linear Regression Basic linear regression for simple relationships and initial analysis.
  4. Elastic Net Hybrid linear regression combining L1 and L2 regularization for balanced regularization.

The objective of this evaluation was twofold:

  • Achieve a balance between Mean Squared Error (MSE), ensuring accurate predictions while minimizing overfitting.
  • Identify the coefficients of each feature, enabling the analysis of their impact on car prices.

By comparing the performance of these models, we can determine which one provides the most insightful and accurate predictions, ultimately informing data-driven decisions.

The result of the evaluation can be found in the following table:

Model Best Score Mean Squared Error (MSE) Elapsed Time
Ridge 0.561262 40,246,517.02 30 seconds
Linear Regression 0.552364 41,062,787.54 50 seconds
Lasso 0.541976 42,015,664.01 2 hours 56 minutes
Elastic Net 0.382332 56,660,232.62 15 minutes

Conclusions

Model Performance Comparison

  • Ridge Regression outperforms Linear Regression, Lasso and Elastic Net in terms of Best Score (0.561262) and Mean Squared Error (MSE) (40,246,517.02).
  • Linear regression comes second with a score of 0.552364 an MSE of 41062787.54 and 50 seconds
  • Lasso Regression has the longest training time (2 hours 56 minutes), despite its relatively poor performance.
  • Elastic Net Regression has the worst performance among the three models, with the lowest Best Score (0.382332) and highest MSE (56,660,232.62).

Time-Efficiency Tradeoff

  • Ridge Regression: Best performance and fastest training time (30 seconds).
  • Linear Regression has a very good performance (50 seconds) but is not better than Ridge Regression
  • Elastic Net Regression: Relatively slow training time (15 minutes) considering its poor performance.
  • Lasso Regression: Longest training time (2 hours 56 minutes), making it the least efficient.

Error Analysis

All models struggle with large errors, as indicated by Mean Squared Error (MSE) values.

Ridge Regression has the lowest MSE (40,246,517.02).

Used Car Price Prediction Report

Executive Summary

This report presents the findings of a data analysis aimed at identifying key factors influencing used car prices. Using Ridge model, we determined the most relevant features and their coefficients. Our results provide actionable insights for used car dealers to fine-tune their inventory and pricing strategies.

Methodology

This predictive modeling project employed a structured approach to analyze the vehicle dataset and estimate car prices.

Dataset

The dataset utilized in this study consisted of 426,000 vehicle records, containing a mix of numerical and categorical features:

  • Numerical features: price, year, odometer
  • Categorical features: region, manufacturer, model, condition, cylinders, fuel, transmission, drive, size, type, paint_color, state
  • Identifier: id, VIN

Data Preprocessing

The dataset underwent rigorous data cleansing and preparation:

  • Handling missing values and outliers
  • Data normalization and feature scaling
  • Encoding categorical variables
  • Transforming numerical features

Model

To predict car prices, the following models were evaluated:
  • Ridge Regression: Ridge Regression is a regularized linear regression technique that minimizes overfitting by adding a penalty term to the cost function.
  • Lasso (Least Absolute Shrinkage and Selection Operator): Lasso is another regularized linear regression technique.
  • Linear Regression: Linear Regression is a basic linear regression model.
  • Elastic Net: Elastic Net is a hybrid regularized linear regression technique that combines Ridge and Lasso.

Evaluation Metrics

Model performance was assessed using:

  • Mean Squared Error (MSE): measures prediction accuracy
  • Score: Evaluates the model's ability to make accurate predictions.
  • Time/Performance: Measures the computational efficiency and speed of the model.

Model Selection and Insights

Following a thorough evaluation, we selected the Ridge model as the most suitable algorithm due to its accuracy and performance. To glean actionable insights, we examined two key characteristics: permutation importance and coefficients. These metrics enabled us to determine vital information from the model, which informed the subsequent benefits for car dealerships, categorised into low-end, mid-range, and high-end.

Permutation Importance Insights

Permutation importance measures the contribution of each feature to a model's predictions. It calculates the decrease in model performance when a feature is randomly shuffled, indicating its relative importance.

Permutation Importance: Which features matter most?

  • The number of cylinders (5 cylinders) and the car's condition (excellent) are the most important features, influencing 89.5% of the model's predictions.
  • The condition of the car (fair, unknown, new, like new, good) has a significant impact on the prediction, indicating that the model is sensitive to the car's state.
  • The number of cylinders (10, 4, 12, 3) has a relatively lower importance score, suggesting that the model is less sensitive to these features.

Coefficients Insights

Coefficients represent the change in the predicted outcome for a one-unit change in the feature, while holding all other features constant. They indicate:

Coefficients: How much do individual features impact the prediction?

  • Luxury brands: Negative coefficients (e.g., Ferrari -11165.953304, Audi -11055.727044) signify lower prices or reduced demand, indicating a niche market.
  • Classic Ford models: High positive coefficients (e.g., Ford_Roadster 43548.884763) suggest elevated demand and prices, presenting lucrative opportunities.
  • Mid-range models: Moderate coefficients (e.g., Chevrolet_Coupe 32694.947709) indicate stable demand and prices, ensuring a consistent revenue stream.

Benefits for Car Dealerships

High-End Car Dealers

  • Stock high-performance vehicles: Focus on 5-cylinder models from luxury brands.
  • Emphasise condition: Highlight excellent condition to command premium prices.
  • Target luxury buyers: Focus marketing on affluent customers seeking high-end features.
  • Potential profit margins increase by $32,000-$43,000 per luxury unit sold.

Mid-Range Car Dealers

  • Focus on 5-cylinder models, which have high importance scores (cylinders_5 cylinders: 0.51755).
  • Focus on selling Chevrolet Coupe and Ford Convertible, highlighting value-for-money.
  • Emphasize the value of mid-range models like Ford F-350 SD Lariat and GMC Sierra SLT 6.2L Crew Cab, considering their negative coefficients (-11575.422905 and -11442.310872).
  • Highlight excellent condition and maintenance to appeal to practical buyers (condition_excellent importance score: 0.37687).

Low-End Car Dealers

  • Target budget-conscious buyers: Emphasise affordability, fuel efficiency, and reliability.
  • Focus on affordable models with fewer cylinders (3 or 4), which have lower importance scores (cylinders_3 cylinders: 0.00614, cylinders_4 cylinders: 0.02032).
  • Highlight good condition and maintenance to appeal to budget-conscious buyers (condition_good importance score: 0.00598).
  • Price competitively: Set prices considering low-end coefficients and market demand.

Common Benefits

  • Data-driven pricing: Utilise coefficients to inform pricing strategies.
  • Inventory optimisation: Stock vehicles with in-demand features.
  • Targeted marketing: Focus on specific customer segments.

Luxury and Classic Models

Ford's iconic models (Roadster, Coupe, T, T-Bucket, Tudor, Model A) increase prices by $32,000-$43,000. Chevrolet's Coupe increases prices by $32,694.

Commercial and Heavy-Duty Vehicles

Models like Audi Q8, Ferrari California T, and GMC Sierra SLT 6.2L Crew Cab reduce prices by $11,000-$15,000.

By leveraging these insights, car dealerships can refine their business strategies, enhance customer satisfaction, and boost profitability.

Action Items:

  • Review inventory and pricing strategies.
  • Develop targeted marketing campaigns.
  • Adjust sales training to emphasize value propositions.

Conclusion

This analysis provides valuable insights for used car dealers to refine their pricing strategies and inventory management. By understanding the impact of key features on car prices, dealers can make informed decisions to optimize their business.

About

you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published