
1 Analysis

This work analyzes how several physicochemical parameters of red wine affect its quality. The dataset is curated by Udacity, comes from the UCI repository https://archive.ics.uci.edu/ml/datasets/Wine+Quality, and consists of 1599 red wine samples https://docs.google.com/document/d/1qEcwltBMlRYZT-l699-71TzInWfk4W9q5rTCSvDVMpc/pub?embedded=true.

In [1] it is shown that the most important features for assessing red wine quality are sulphates, pH and total sulfur dioxide.

This project will try to verify whether these or other features have the greatest influence on red wine quality.

2 Variable summary

Variable attributes

The full description of the dataset can be found here https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt [2].

The description of the variable attributes (extracted from [2]) is the following:

1 - fixed acidity (tartaric acid - g / dm^3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity (acetic acid - g / dm^3): the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

3 - citric acid (g / dm^3): found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar (g / dm^3): the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

5 - chlorides (sodium chloride - g / dm^3): the amount of salt in the wine

6 - free sulfur dioxide (mg / dm^3) : the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide (mg / dm^3): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density (g / cm^3): the density of wine is close to that of water depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates (potassium sulphate - g / dm^3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol (% by volume): the percent alcohol content of the wine

Output variable (based on sensory data):

12 - quality (score between 0 and 10)

Variable summary

The summary of the dataset is the following:

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality      quality.cat  
##  Min.   : 8.40   Min.   :3.000   bad   :  63  
##  1st Qu.: 9.50   1st Qu.:5.000   medium:1319  
##  Median :10.20   Median :6.000   good  : 217  
##  Mean   :10.42   Mean   :5.636                
##  3rd Qu.:11.10   3rd Qu.:6.000                
##  Max.   :14.90   Max.   :8.000
  • X: the row id.
  • quality: the observed range is between 3 and 8.
  • quality.cat: generated to group quality into ranges: [0,4] -> BAD, (4,7) -> MEDIUM, [7,10] -> GOOD. More information in Section 3.2 “Analysis”.
  • All variables except total.sulfur.dioxide and free.sulfur.dioxide (integers) are continuous.
  • total.sulfur.dioxide is the sum of free.sulfur.dioxide and bound forms: hence the two sulfur variables are related.
  • volatile.acidity is acetic acid, different from tartaric acid (fixed.acidity) and citric.acid. Acetic acid gives wine a vinegar taste, while fixed acids do not easily evaporate. Citric acid is added to some wines for freshness or to increase acidity [3].
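
The grouping behind quality.cat can be sketched with base R's cut(). This is only a minimal sketch, not necessarily the code used for this report: the scores below are made-up values, and the break points c(0, 4, 6, 10) are an assumption chosen so that integer scores fall into the ranges listed above.

```r
# Toy quality scores (illustrative values, not taken from the dataset)
quality <- c(3, 4, 5, 6, 7, 8)

# Breaks at 0, 4, 6 and 10 give (0,4], (4,6] and (6,10]; with
# include.lowest = TRUE the first interval becomes [0,4].
# For integer scores this matches [0,4], (4,7) and [7,10].
quality.cat <- cut(quality,
                   breaks = c(0, 4, 6, 10),
                   labels = c("bad", "medium", "good"),
                   include.lowest = TRUE)

print(table(quality.cat))
```

Using cut() rather than nested ifelse() calls keeps the levels ordered (bad, medium, good), which is convenient when the factor is later used to group plots.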

3 Univariate analysis

This section analyzes each of the variables describing the wines.

3.1 Plots Section

Quality

The distribution of quality shows that most wines have a quality of 5 or 6 points. The mean value is 5.64 and the median is 6.

Figure 3.1: Quality distribution

Fixed and volatile acidity, citric acidity

Figure 3.2 shows the distributions of fixed acidity, volatile acidity and citric acid.

  • Fixed acidity is positively skewed (the mean, 8.32, lies above the median, 7.9).

  • The volatile acidity is positively skewed with mean 0.53 and median 0.52.

  • The citric acid appears to have a bimodal or even trimodal distribution with overall mean 0.27 and median 0.26.

Figure 3.2: Fixed acidity, volatile acidity and citric acid distribution

Residual sugar

  • The distribution of residual sugar (fig. 3.3) is positively skewed with mean 2.54 and median 2.2.
  • There was a noticeable presence of outliers in the upper part of the plot. These outliers were removed by applying a filter that keeps only the residual sugar values within three standard deviations of the mean, μ−3σ ≤ x ≤ μ+3σ.

Figure 3.3: Residual sugar distribution
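
The three-sigma filter described above can be sketched in base R as follows. The values are made up for illustration, and the variable names are assumptions rather than the ones used to produce the figures.

```r
# Toy residual sugar values with one extreme observation (made-up numbers)
residual.sugar <- c(rep(2.0, 20), 2.5, 1.8, 15.0)

mu    <- mean(residual.sugar)
sigma <- sd(residual.sugar)

# Keep only observations within three standard deviations of the mean
keep     <- abs(residual.sugar - mu) <= 3 * sigma
filtered <- residual.sugar[keep]

cat("removed", sum(!keep), "of", length(residual.sugar), "values\n")
```

Note that the mean and standard deviation are themselves affected by the outliers, so this rule is conservative; a quantile-based cut-off would be an alternative design choice.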

Chlorides, free sulfur dioxide and total sulfur dioxide

Figure 3.4 shows the distributions of chlorides, free sulfur dioxide and total sulfur dioxide.

  • The distribution of chlorides is positively skewed with mean 0.087 and median 0.079. Outliers more than 3 standard deviations away from the mean were removed, as in the case of residual sugar.

  • The free sulfur dioxide is positively skewed with mean 16 and median 14.

  • The total sulfur dioxide is positively skewed with mean 46 and median 38.

Figure 3.4: Chlorides and sulfur dioxide distributions

Density, pH, sulphates and alcohol

Figure 3.5 shows the density, pH, sulphates and alcohol distributions.

  • The distribution of density appears symmetric and approximately normal, with mean 0.9967 and median 0.9968.

  • The pH distribution appears approximately symmetric, with mean and median both 3.31.

  • The sulphates distribution is positively skewed with mean 0.66 and median 0.62.

  • The alcohol distribution is positively skewed with mean 10.4 and median 10.2.

Figure 3.5: Density, pH, sulphates and alcohol distributions
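
The direction of skew reported in this section can be checked numerically by comparing mean and median, or with the moment coefficient of skewness. The sketch below is an illustration only: skewness() is a helper defined here (base R does not provide one), and the sample is made up.

```r
# Moment coefficient of skewness: m3 / m2^(3/2), where m_k is the
# k-th central moment. Positive values indicate a longer right tail.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

# Toy right-skewed sample (made-up values): the mean exceeds the median
x <- c(1, 1, 2, 2, 2, 3, 3, 4, 6, 12)

cat("mean:", mean(x), " median:", median(x), " skewness:", skewness(x), "\n")
```

A mean above the median together with a positive coefficient points to positive skew; values close to zero suggest a roughly symmetric distribution.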

3.2 Analysis

What is the structure of your dataset?

There are 1599 red wines with 12 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality).

Other observations:

Table 3.1: Summary of redwines.quality
                   Min.   1st Qu.   Median   Mean    3rd Qu.   Max.
redwines.quality   3      5         6        5.636   6         8

Table 3.1 shows that the mean quality of the red wines is 5.636 and the median is 6. The first quartile (Q1) is 5 and the third quartile (Q3) is 6; hence, at least 50% of the data lies within the 5-6 quality range, which is the MEDIUM level.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in this dataset is the quality of the wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

In [1] it is shown that the most important features for assessing red wine quality are sulphates, pH and total sulfur dioxide. An analysis will be performed to check whether these or other features are the most important ones.

Did you create any new variables from existing variables in the dataset?

I did not create any new numerical variables to support the analysis, since the amount of information available is enough to assess the quality of the wine.

However, the variable quality was converted into a factor variable (adding a new variable named quality.cat) with the following levels:

Quality range table.
              BAD     MEDIUM   GOOD
quality.cat   [0,4]   (4,7)    [7,10]

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Some of the features had an approximately normal distribution. Others had quite skewed distributions and many outliers.

In the histograms, some of the variables, like residual sugar or chlorides, show long tails, which indicates the presence of outliers. For a better view, I removed the outliers located more than 3 standard deviations away from the mean (|x−μ| > 3σ).

Figure 3.6: Residual sugar and Chlorides distributions with original and log10() transformed axis

It can also be useful to apply a log10 transformation in order to get a better view. The effect of this transformation can be observed in figure 3.7, where the characteristics of the plots on the left side are seen better after the x axis is transformed.

Figure 3.7: Residual sugar and Chlorides distributions with original and log10() transformed axis
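
To illustrate why the log10 axis helps, the sketch below compares how far the upper tail extends relative to the lower tail before and after the transformation. The values are made up, and tail.ratio() is a helper introduced here for illustration only.

```r
# Toy long-tailed values (made-up numbers, not from the dataset)
chlorides <- c(0.05, 0.07, 0.08, 0.08, 0.09, 0.10, 0.12, 0.20, 0.40, 0.60)

# Ratio of the upper spread (max - median) to the lower spread (median - min):
# values far above 1 mean the right tail dominates the plot
tail.ratio <- function(x) (max(x) - median(x)) / (median(x) - min(x))

raw.ratio <- tail.ratio(chlorides)
log.ratio <- tail.ratio(log10(chlorides))

cat("tail ratio, original scale:", raw.ratio, "\n")
cat("tail ratio, log10 scale:   ", log.ratio, "\n")
```

On the log10 scale the ratio is much closer to 1, so the bulk of the data is no longer squeezed against the left edge; calling hist(log10(chlorides)) would draw the corresponding histogram.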

4 Bivariate Analysis

4.1 Correlation

A good way to see how two variables are related is to calculate the correlation between them. In this section the correlations between the features of red wine are calculated and the strongest ones are analyzed.

Correlation between variables

The next figure (fig. 4.1) shows the correlation coefficients between the different variables in the Red Wines dataset. Significant correlation values (|ρ|>0.3) are highlighted, with stronger color density the higher the absolute correlation value is.

Figure 4.1: Correlation between Red Wine’s dataset features
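
A correlation matrix with the |ρ|>0.3 threshold can be sketched with base R's cor(). The data frame below contains made-up values (only the column names follow the dataset), and masking with NA is an assumption standing in for the color highlighting of the figure.

```r
# Toy data frame (made-up values; column names follow the dataset)
wines <- data.frame(
  alcohol          = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  volatile.acidity = c(0.70, 0.66, 0.50, 0.45, 0.36, 0.30),
  pH               = c(3.51, 3.20, 3.26, 3.39, 3.16, 3.35),
  quality          = c(5, 5, 6, 6, 7, 7)
)

# Pearson correlation matrix, rounded for display
rho <- round(cor(wines), 2)

# Blank out coefficients at or below the 0.3 threshold in absolute value
significant <- rho
significant[abs(rho) <= 0.3] <- NA

print(significant)
```

cor() uses the Pearson coefficient by default; passing method = "spearman" would be an alternative for an ordinal variable such as quality.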

For the purpose of this analysis, the focus will be placed on the relationship between quality and the other variables.

Correlation between variables by quality range

Let us now analyze the correlation between the variables for each quality range in figure 4.2. It can be seen that some features are more correlated with each other for lower quality wine. Some strong relationships between variables are kept for all quality ranges, such as pH-density, pH-citric acid and pH-fixed acidity, or fixed acidity-volatile acidity and citric acid-fixed acidity.

Figure 4.2: Correlation vs Quality
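
Computing a correlation separately for each quality range can be sketched by splitting the data on quality.cat. The values below are made up, and the choice of the pH-fixed acidity pair is only an example.

```r
# Toy data (made-up values): two features plus the quality category
wines <- data.frame(
  pH            = c(3.6, 3.4, 3.2, 3.0, 3.5, 3.4, 3.2, 3.1, 3.4, 3.3, 3.2, 3.0),
  fixed.acidity = c(6.0, 7.1, 8.4, 9.8, 6.5, 7.8, 7.9, 9.0, 7.0, 8.1, 8.0, 9.9),
  quality.cat   = factor(rep(c("bad", "medium", "good"), each = 4),
                         levels = c("bad", "medium", "good"))
)

# Split the rows by quality range and compute one coefficient per group
rho.by.quality <- sapply(split(wines, wines$quality.cat),
                         function(d) cor(d$pH, d$fixed.acidity))

print(round(rho.by.quality, 2))
```

With few samples in a group (as for the bad and good ranges in the real data), per-group coefficients are noisier than the overall one, which is worth keeping in mind when comparing panels.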

From the general correlation matrix (fig. 4.1), three main variables can be selected due to their high correlation with quality: alcohol (ρ = 0.48), volatile.acidity (ρ = -0.39) and sulphates (ρ = 0.25). In order to see if they are suitable for the analysis, let us explore the correlation between them and also the distribution of samples in the bivariate plots in figure 4.3.

Figure 4.3: Correlation between alcohol, volatile acidity and sulphates

From figure 4.3 it can be observed that there is a strong relationship between sulphates and volatile acidity, and a weaker one between alcohol and volatile acidity.

Let us now explore the three variables suggested in [1] as the most important features for assessing red wine quality: sulphates, pH and total sulfur dioxide.

Figure 4.4: Correlation between total sulfur dioxide, pH and sulphates

Figure 4.4 shows that the strongest relationship is between sulphates and pH (ρ = -0.2), followed by pH-total sulfur dioxide.

Comparing figures 4.3 and 4.4, it can be concluded that the variables in the second figure, sulphates, pH and total sulfur dioxide (the ones suggested by [1]), are less correlated with each other and should therefore lead to a more accurate analysis.

4.2 Plots Section

This section analyzes the features against the quality ranges. For each variable, a histogram per quality range is plotted, together with a boxplot showing the interquartile range for each quality range and the outliers as dots. In the boxplot the mean value per quality range is marked with a red asterisk.
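
The boxplot with per-range means can be sketched in base R graphics as shown below. This is an illustration under assumptions: the values are made up, and the original figures may well have been produced with a different plotting package.

```r
# Toy data (made-up values): alcohol content and quality category
alcohol     <- c(9.0, 9.4, 9.8, 9.5, 10.0, 10.4, 11.0, 11.8, 12.5)
quality.cat <- factor(rep(c("bad", "medium", "good"), each = 3),
                      levels = c("bad", "medium", "good"))

# Mean per quality range (the value drawn as a red asterisk)
group.means <- tapply(alcohol, quality.cat, mean)
print(group.means)

# Boxplot per quality range with the group means overlaid
boxplot(alcohol ~ quality.cat, xlab = "quality.cat", ylab = "alcohol")
points(seq_along(group.means), group.means, pch = 8, col = "red")
```

Showing the mean next to the median line of each box makes skew within a quality range visible at a glance.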

Fixed and volatile acidity, citric acid

From the next plot (fig. 4.5) it can be deduced that:

  • the quality of the wine tends to increase with the fixed acidity and citric acid levels.

  • the quality of the wine tends to decrease as the volatile acidity increases.

  • Better wines tend to have an acid hint but with little reminiscence of vinegar. They tend to have a fresher taste, related to citric notes.

  • Worse wines have a dominant presence of vinegar taste.

Figure 4.5: Fixed and volatile acidity distributions per quality segment

Residual sugar

The plot of residual sugar (fig. 4.6) shows that:

  • the better the wine, the higher the residual sugar levels.

Outliers further than ±3σ from the mean have been removed.

Figure 4.6: Residual sugar distribution per quality segment

Chlorides, free sulfur dioxide and total sulfur dioxide

  • In the plot of the chlorides, it can be observed that the better the wine, the lower the chloride levels.
  • For sulfur dioxide, whether free or total, high levels are an indicator of medium quality, whereas bad and good wines have similarly low amounts of sulfur dioxide. There is no linear relationship between sulfur dioxide (SO2) and wine quality.

Outliers further than ±3σ from the mean have been removed for chlorides.

  • Better wines are not salty.

  • Worse wines have a salty taste.

Figure 4.7: Chlorides and sulfur dioxide distributions per quality segment

Density, pH, sulphates and alcohol

The next plot shows that:

  • generally, the lower the density, the better the quality of the wine.

  • low pH levels are a sign of better quality.

  • the higher the sulphates level, the better the quality of the wine.

  • the higher the amount of alcohol, the better the wine.

  • Better wines tend to be less dense (density is related to sugar and alcohol), more acid (lower pH, as already noted), with more sulphates (related to SO2) and a higher alcohol content.

  • Worse wines are denser and have a lower alcohol percentage by volume. They also tend to be more basic (higher pH, less acid) and have lower levels of sulphates.

Figure 4.8: Density, pH, sulphates and alcohol distributions per quality segment

4.3 Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The correlations between the three main variables (fig. 4.9) and the quality of wine are:

  • sulphates vs quality: ρ = 0.25.
  • pH vs quality: ρ = -0.06.
  • total sulfur dioxide vs quality: ρ = -0.19.

Figure 4.9: Sulphates, pH and total sulfur dioxide vs quality

From the plot it can be concluded that:

  • Better wines tend to be more acid (lower pH, as already noted) and to have more sulphates (related to SO2). Total sulfur dioxide does not follow the same pattern: its correlation with quality is negative (ρ = -0.19) and, as noted in the plots section, good wines show low levels similar to those of bad wines.

  • Worse wines tend to be more basic (higher pH, less acid) and have lower levels of sulphates and total sulfur dioxide.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)? What was the strongest relationship you found?

Figure 4.1 shows the correlation between the variables in the dataset. The strongest pairs are: citric acid - fixed acidity, total sulfur dioxide - free sulfur dioxide, density - fixed acidity and pH - fixed acidity.

It is logical that citric acid influences fixed acidity, as free sulfur dioxide does total sulfur dioxide. pH is known to be related to the acid levels, so there is no surprise here. I did not know there was such a strong relationship between density and fixed acidity.

Figure 4.10 shows the relationship between fixed acidity and density, and between fixed acidity and pH. Fixed acidity and density have a strong positive correlation (ρ = 0.67) and fixed acidity and pH show a strong negative correlation (ρ = -0.68).

Figure 4.10: Density and pH vs fixed acidity

5 Multivariate Plots Section

Figure 5.1: Sulphates, pH and total sulfur dioxide relationship vs quality

Figure 5.2: Fixed acidity vs pH, fixed acidity vs citric acid and volatile acidity vs citric acid per quality range

6 Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Figure 5.1 shows the 2D density distribution of the three variables (compared in pairs) that most influence the quality of the wine. It is interesting how the centers seem to shift in a linear manner.

  • For sulphates-pH there is a negative dependence, which agrees with the correlation between these two variables, ρ = -0.2.

  • For sulphates-total sulfur dioxide there seems to be no dependence, which agrees with the correlation between these two variables, ρ = 0.04.

  • For pH-total sulfur dioxide there seems to be no dependence, which agrees with the correlation between these two variables, ρ = -0.07.

Figure 5.2 shows the relationships between the variable pairs with the highest correlation values in the correlation matrix of figure 4.1. The pairs selected were: fixed acidity vs pH, fixed acidity vs citric acid and volatile acidity vs citric acid, all with different colors per quality range.

  • For fixed acidity vs pH there is a negative dependence, which agrees with the correlation between these two variables, ρ = -0.68.

  • For fixed acidity vs citric acid there seems to be a strong positive dependence, which agrees with the correlation between these two variables, ρ = 0.67.

  • For volatile acidity vs citric acid there is a moderate negative dependence, which agrees with the correlation between these two variables, ρ = -0.55.

Were there any interesting or surprising interactions between features?

I found that there is no significant correlation between pH and total sulfur dioxide. I also found that higher quality wine is usually related to higher levels of citric acid.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I created a multiple linear model with the three main features: sulphates, pH and total sulfur dioxide.

The linear model takes the log10 of each variable. Figure 6.1 shows how the overall error is reduced. However, when looking at the error per quality range (fig. 6.2), the error is higher for extreme values of quality. This makes sense, since the most abundant group is the medium-quality one and hence the model is more accurate for this segment.

Figure 6.1: Error by linear model


Figure 6.2: Error by linear model
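
A model of this kind can be sketched with lm(). The sketch below is not the model fitted for this report: the data are synthetic (generated with runif() and rnorm()), applying log10 to the predictors only is an assumption, and the mean absolute error is just one possible way of measuring the error per quality range.

```r
set.seed(1)  # reproducible toy data

# Synthetic data (NOT the wine dataset): positive predictors and an
# integer quality score between 3 and 8
n <- 200
wines <- data.frame(
  sulphates            = runif(n, 0.3, 2.0),
  pH                   = runif(n, 2.7, 4.0),
  total.sulfur.dioxide = runif(n, 6, 289)
)
score <- 5.5 + 2 * log10(wines$sulphates) + rnorm(n, sd = 0.8)
wines$quality <- pmin(pmax(round(score), 3), 8)
wines$quality.cat <- cut(wines$quality, breaks = c(0, 4, 6, 10),
                         labels = c("bad", "medium", "good"),
                         include.lowest = TRUE)

# Multiple linear model on log10-transformed predictors
model <- lm(quality ~ log10(sulphates) + log10(pH) + log10(total.sulfur.dioxide),
            data = wines)

# Overall and per-range mean absolute error
abs.error <- abs(residuals(model))
cat("overall MAE:", mean(abs.error), "\n")
print(tapply(abs.error, wines$quality.cat, mean))
```

Grouping the absolute residuals by quality.cat is what makes the higher error for the extreme quality ranges visible, since an overall figure is dominated by the medium group.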

The model has the following strengths and limitations:

  • strengths: simple and limited to most influencing variables.

  • limitations: not all variables are included, the combination of variables is linear and the quality is set by humans with different criteria.

7 Final Plots and Summary

7.1 Plot One


Figure 7.1: Sulphates vs quality

Description One

Figure 7.1 shows the correlation between sulphates and quality in the Red Wines dataset. Higher amounts of sulphates are related to higher quality.

7.2 Plot Two


Figure 7.2: Alcohol vs quality

Description Two

Figure 7.2 shows the correlation between alcohol and quality in the Red Wines dataset. Higher levels of alcohol are related to higher quality.

7.3 Plot Three


Figure 7.3: Citric acidity vs quality

Description Three

Figure 7.3 shows the distribution of citric acid vs quality. It can be observed that higher quality wines have higher levels of citric acid than lower quality wines. I really liked this conclusion, and that is why it is included here: I always thought it would be the opposite, that lower quality wine would be the more acid.

8 Reflection

With this work I have realized how important it is to investigate a dataset before starting to build a model or to draw conclusions. Exploratory Data Analysis (EDA) is an iterative process of finding the best way to explain the relationships between variables and the distribution of the data, and then communicating the result in a clear and concise way.

The main struggle I faced was formatting the plots so that conclusions could be drawn properly and data could be compared from one plot to another. It is also important to select the most relevant plots in order to perform the analysis.

I found that the correlation matrix is crucial for continuous variables, pointing to where a closer look is needed. It is also important to know the data and each variable, and for that, plotting the distributions is important.

Sulphates, pH and total sulfur dioxide are the most important features for a model predicting red wine quality, as suggested by [1]. However, the correlation matrix shows that alcohol, sulphates and volatile acidity are the variables most correlated with quality. After exploring the correlation coefficients between them, I could see that the variables in the “alcohol, sulphates and volatile acidity” set are more correlated with each other than those in the “sulphates, pH and total sulfur dioxide” set. From the quality exploration it is noticeable that there are no records for wines with quality below 3 or above 8.

Another interesting conclusion is that acidity is closely related to the quality of red wine. It is desirable for wine to have an acid taste, but not of any kind: citric acid and other fixed acids are preferred to volatile acids (more vinegar taste).

This EDA of red wine quality has given me the opportunity to approach a dataset using R and draw conclusions from data. This experience will be helpful in the future when approaching similar problems.

References

[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis, “Modeling wine preferences by data mining from physicochemical properties,” Decision Support Systems, vol. 47, no. 4, pp. 547–553, 2009.

[2] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis, “Red Wine Quality dataset description- Modeling wine preferences by data mining from physicochemical properties.” [Online]. Available: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt; http://www3.dsi.uminho.pt/pcortez/winequality09.pdf. [Accessed: –]

[3] J. Medina, “P4 Explore and Summarize Data: Red Wine EDA.” [Online]. Available: https://rpubs.com/jasonmedina/220283. [Accessed: –]

[4] “Wine Quality Analysis.” [Online]. Available: https://rpubs.com/sanmen/24803. [Accessed: –]

[5] I. Thomas, “EDA of Red Wine Quality Dataset.” [Online]. Available: https://github.com/IwanThomas/Udacity-Data-Analysis-Nanodegree/blob/master/Project-4/RedWineRMD.Rmd