This project explores what makes red wine "good" by analyzing the publicly available Red Wine Quality DataSet. The dataset, originally sourced from Cortez et al. (2008) at the University of Minho, contains physicochemical measurements (e.g., acidity, alcohol content) of various red wines, along with quality scores assigned by wine tasters.
Using statistical modeling and regression techniques, this project identifies key factors contributing to wine quality and builds predictive models to estimate wine ratings based on chemical properties.
- Identify the most influential chemical properties affecting red wine quality.
- Develop a multi-linear regression model to predict wine quality.
- Evaluate the performance of the regression model on unseen data.
- Address assumptions of linear regression, including normality and homoscedasticity.
- Provide actionable insights for winemakers to enhance wine quality.
The dataset contains 1,599 observations with 12 variables:
- Fixed Acidity: Non-volatile acids contributing to wine's total acidity (g(tartaric acid)/dm³).
- Volatile Acidity: Acetic acid levels (g(acetic acid)/dm³).
- Citric Acid: A natural preservative and flavor enhancer (g/dm³).
- Residual Sugar: Sugar remaining after fermentation (g/dm³).
- Chlorides: Salt content (g(sodium chloride)/dm³).
- Free Sulfur Dioxide: Free form of SO2, a preservative (mg/dm³).
- Total Sulfur Dioxide: Bound and free forms of SO2 (mg/dm³).
- Density: Density relative to water (g/cm³).
- pH: Acidity level.
- Sulphates: Adds to wine’s bitterness and acts as an antioxidant (mg/dm³).
- Alcohol: Percentage of alcohol by volume (% vol).
- Quality: Quality score (integer values between 0 and 10).
- Inspected the distribution of each variable and its relationship to wine quality.
- Identified potential multicollinearity and interaction effects among predictors.
- Divided the dataset into training (80%) and testing (20%) subsets for model building and evaluation.
- Built a linear regression model using all independent variables.
- Used stepwise selection (via the
olsrr
package) to identify the most significant predictors.
- Included interaction terms and polynomial terms for key predictors (e.g., sulphates and alcohol).
- Evaluated models using adjusted R-squared and residual standard error.
- Checked for linearity, homoscedasticity, and normality of residuals.
- Addressed violations using transformations but retained the original scale due to better interpretability.
- Assessed model performance on the test set using:
- R-squared
- Mean Absolute Percentage Error (MAPE)
-
The top three factors influencing wine quality are:
- Alcohol: A 1% increase in alcohol content raises the quality score by 0.292.
- Volatile Acidity: A decrease of 1 g(acetic acid)/dm³ increases the quality score by 0.962.
- Sulphates: A 1 mg/dm³ increase raises the quality score by 0.547.
-
Interaction terms involving volatile acidity, sulphates, and total sulfur dioxide also significantly affect wine quality.
-
The final model explains 38.13% of the variability in wine quality, with a mean absolute percentage error (MAPE) of 8.87% on the test set. While not highly predictive, the model provides valuable insights into the key drivers of wine quality.
- Discrete Dependent Variable: Wine quality is an integer, whereas the model predicts continuous values, leading to some mismatch.
- Dataset Imbalance: Most wines have quality ratings of 5 or 6, limiting the model's ability to predict higher-quality wines (e.g., quality = 8).
- Unexplained Variance: The model explains only 38.13% of the variability, suggesting other unmeasured factors (e.g., taste preferences, vineyard practices) contribute to wine quality.
- R Programming: Data manipulation, modeling, and visualization.
- Libraries:
ggplot2
andGGally
: Data visualization.olsrr
: Stepwise regression.lmtest
: Assumption testing.plotly
: Interactive plots.
This project highlights the chemical properties that most influence red wine quality, providing a foundation for further research and refinement. While the model has limitations in predictive power, it offers actionable insights for winemakers to optimize wine quality through targeted adjustments to alcohol content, volatile acidity, and sulphates.
- Explore non-linear models (e.g., Random Forest, Gradient Boosting) for better predictive accuracy.
- Investigate additional features, such as grape variety, fermentation duration, or sensory attributes.
- Address dataset imbalance to improve model performance for high-quality wines.
- Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547-553.