Exploring Red Wine Quality by Allan Reyes

About

In this exercise, I will explore a data set on wine quality and physicochemical properties. The objective is to explore which chemical properties influence the quality of red wines. I’ll start by exploring the data using the statistical program, R. As interesting relationships in the data are discovered, I’ll produce and refine plots to illustrate them. The data is available for download here and background information is available at this link.

Summary Statistics

Let’s run some basic functions to examine the structure and schema of the data set.

str(df)
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
summary(df)
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Since we’re primarily interested in quality, it is worth looking at its basic statistics on their own.

summary(df$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Some initial observations here:

  • There are 1599 observations of 13 numeric variables.
  • X appears to be the unique identifier.
  • quality is an ordered, categorical, discrete variable. From the literature, this was on a 0-10 scale, and was rated by at least 3 wine experts. The values ranged only from 3 to 8, with a mean of 5.6 and median of 6.
  • All other variables seem to be continuous quantities (with the exception of the two .sulfur.dioxide variables).
  • From the variable descriptions, it appears that fixed.acidity ~ volatile.acidity and free.sulfur.dioxide ~ total.sulfur.dioxide may possibly be dependent, with one a subset of the other.

Univariate Plots

To first explore this data visually, I’ll draw up quick histograms of all 12 variables. The intention here is to see a quick distribution of the values.

## Warning: position_stack requires constant width: output may be incorrect
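As a rough sketch of how such a grid of histograms might be drawn: the snippet below uses base R graphics to stay dependency-free, which is an assumption on my part (the warnings in this report suggest the original plots were made with ggplot2). The small data frame holds only the first ten values of four variables, copied from the str(df) output above, so the shapes it produces are illustrative only.

```r
# First ten rows of four variables, copied from the str(df) output above.
# The real analysis would use the full 1599-row data frame instead.
df <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Arrange the plots in a 2 x 2 grid and draw one histogram per variable.
op <- par(mfrow = c(2, 2))
hists <- lapply(names(df), function(v) {
  hist(df[[v]], main = v, xlab = v)  # returns the histogram object invisibly
})
par(op)  # restore the previous plotting layout
```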

Univariate Analysis

Wine Quality

I first looked at wine quality. Although it takes only discrete values from 3 to 8, its distribution is roughly normal in shape. A large majority of the wines examined received ratings of 5 or 6, and very few received 3, 4, or 8. There’s not much more we can do with this histogram, as either decreasing or increasing the bin size would distort the data.

Given the ratings and distribution of wine quality, I’ll instantiate another categorical variable, classifying the wines as ‘bad’ (rating 0 to 4), ‘average’ (rating 5 or 6), and ‘good’ (rating 7 to 10).

##     bad average    good 
##      63    1319     217
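One way such a rating variable could be built is with cut(), sketched below. This is not necessarily how the counts above were produced. The data frame here contains only the first ten quality scores from the str(df) output, so its counts will differ from the full-data table.

```r
# First ten quality scores, copied from the str(df) output above.
df <- data.frame(quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))

# Break points follow the text: 0-4 is 'bad', 5-6 is 'average', 7-10 is 'good'.
# cut() uses right-closed intervals by default: (-Inf, 4], (4, 6], (6, Inf].
df$rating <- cut(df$quality,
                 breaks = c(-Inf, 4, 6, Inf),
                 labels = c("bad", "average", "good"),
                 ordered_result = TRUE)

# Tabulate how many wines fall into each class.
table(df$rating)
```

Setting ordered_result = TRUE makes rating an ordered factor, so comparisons such as df$rating >= "average" behave as expected.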

Distributions and Outliers

  • It appears that density and pH are normally distributed, with few outliers.
  • Fixed and volatile acidity, sulfur dioxides, sulphates, and alcohol seem to be long-tailed.
  • Qualitatively, residual sugar and chlorides have extreme outliers.
  • Citric acid appeared to have a large number of zero values. I’m curious whether this is truly zero, or if it is a case of non-reporting.

When plotted on a base-10 logarithmic scale, fixed.acidity and volatile.acidity appear to be normally distributed. This makes sense, considering that pH is normally distributed and that pH, by definition, is a measure of acidity on a logarithmic scale. Curiously, however, citric.acid did not appear to be normally distributed on a logarithmic scale. Upon further investigation:

length(subset(df, citric.acid == 0)$citric.acid)
## [1] 132

Evidently, 132 observations have a value of zero. This raises the concern that these 132 values may simply not have been reported, considering that the next ‘bin’ higher contains only 32 observations.
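One plausible contributor to the odd log-scale shape is that log10(0) is -Inf, so zero-valued observations cannot be placed on a logarithmic axis at all. The sketch below illustrates this with the first ten citric.acid values from the str(df) output; the variable names n_zero and log_citric are mine, not from the original analysis.

```r
# First ten citric.acid values, copied from the str(df) output above.
citric.acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Count the exact zeros (a shorter equivalent of the subset() call above).
n_zero <- sum(citric.acid == 0)

# On a log10 scale every zero becomes -Inf and drops out of a histogram.
log_citric <- log10(citric.acid)
n_infinite <- sum(is.infinite(log_citric))

c(zeros = n_zero, infinite_logs = n_infinite)
```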

Short questions

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

While exploring the univariate histogram distributions, there did not appear to be any bimodal or multimodal distributions that would warrant sub-classification into categorical variables. I considered potentially splitting residual.sugar into ‘sweet wine’ and ‘dry wine’, but the Wikipedia source cited a residual sugar of greater than 45 g/L (equivalently, g/dm^3) to classify as a sweet wine, and the maximum residual.sugar in this data set is only 15.5.

Did you create any new variables from existing variables in the dataset?

I instantiated an ordered factor, rating, classifying each wine sample as ‘bad’, ‘average’, or ‘good’.

Upon further examination of the data set documentation, it appears that fixed.acidity and volatile.acidity are different types of acids: tartaric acid and acetic acid, respectively. I decided to create a combined variable, TAC.acidity, containing the sum of tartaric, acetic, and citric acid.
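A minimal sketch of that combined variable, assuming it is a plain row-wise sum of the three acid columns. The five rows are the first five from the str(df) output above.

```r
# First five rows of the three acid columns, copied from the str(df) output.
df <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4),    # tartaric acid
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70), # acetic acid
  citric.acid      = c(0, 0, 0.04, 0.56, 0)
)

# TAC.acidity = tartaric + acetic + citric, computed row by row.
df$TAC.acidity <- df$fixed.acidity + df$volatile.acidity + df$citric.acid

df
```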

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I addressed the distributions in the ‘Distributions and Outliers’ section. Boxplots are better suited for visualizing the outliers.

In univariate analysis, I chose not to tidy or adjust any data, short of plotting a select few on logarithmic scales. Bivariate boxplots, with rating or quality on the x-axis, will be more interesting in showing trends with wine quality.

Bivariate Plots and Analysis

To get a quick snapshot of how the variables affect quality, I generated box plots for each.
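The sketch below shows one way to produce per-variable box plots grouped by quality. It uses base R's boxplot() to avoid extra dependencies, which is an assumption rather than the method used for the plots in this report, and it runs on only the first ten rows from the str(df) output, so the boxes are illustrative.

```r
# First ten rows of three variables plus quality, from the str(df) output above.
df <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

predictors <- setdiff(names(df), "quality")

# One panel per predictor; each panel has one box per quality score.
op <- par(mfrow = c(1, length(predictors)))
boxes <- lapply(predictors, function(v) {
  boxplot(split(df[[v]], df$quality),
          main = v, xlab = "quality", ylab = v)
})
par(op)  # restore the previous plotting layout
```

Swapping df$quality for the rating factor would group the boxes by ‘bad’, ‘average’, and ‘good’ instead.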

From exploring these plots, it seems that a ‘good’ wine generally has these trends:

  • higher fixed acidity (tartaric acid) and citric acid, lower volatile acidity (acetic acid)
  • lower pH (i.e. more acidic)
  • higher sulphates
  • higher alcohol
  • to a lesser extent, lower chlorides and lower density

Residual sugar and sulfur dioxides did not seem to have a dramatic impact on the quality or rating of the wines. Interestingly, it appears that different types of acid affect wine quality differently; as such, TAC.acidity saw an attenuated trend, as the presence of volatile (acetic) acid accompanied decreased quality.

By utilizing cor.test, I calculated the correlation for each of these variables against quality:

##        fixed.acidity     volatile.acidity          citric.acid 
##           0.12405165          -0.39055778           0.22637251 
##          TAC.acidity log10.residual.sugar      log10.chlorides 
##           0.10375373           0.02353331          -0.17613996 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##          -0.05065606          -0.18510029          -0.17491923 
##                   pH      log10.sulphates              alcohol 
##          -0.05773139           0.30864193           0.47616632

Quantitatively, it appears that the following variables are the most strongly correlated with wine quality (volatile acidity negatively, the others positively):

  • alcohol
  • sulphates (log10)
  • volatile acidity
  • citric acid

Let’s see how these variables compare, plotted against each other and faceted by wine rating: