# Pairwise correlation between variables

May 10, 2021 by Lin Quan and Lily Wang

#### Correlation

This blog post examines the correlation between the survey variables. Correlation measures the strength and direction of the relationship between two variables, whether quantitative or categorical. It’s a common tool for describing simple relationships without making a statement about cause and effect. Correlations range from -1 to 1. A value of -1 implies a perfect negative linear relationship between the two variables: as variable x increases, variable y decreases by a proportional amount. A value of 1 implies a perfect positive linear relationship: as variable x increases, variable y increases by a proportional amount. A value close to 0 means there is little or no linear relationship between the two variables.

Figure: Pairwise correlation. ‘b’ represents binary variable, ‘c’ represents continuous variable.

#### Results

This is a pairwise correlation graph. The plot shows the correlation between each pair of variables included in the analysis. Red represents a positive correlation, and blue represents a negative correlation. The darker the color, the stronger the correlation. If the correlation is close to 0, the cell appears white. Based on the results shown in the graph, most correlation values between these variables range from -0.3 to 0.3, which means that most pairs of variables have only a weak positive or weak negative relationship.

The highest positive correlation is 0.541, between "Missed rent" and "Offered payment plan", which indicates a moderate positive relationship. This is to be expected. Property owners are more likely to offer payment plans when their tenants are struggling to make rent payments. It's likely that property owners offer payment plans as a strategy to help tenants who have missed rental payments become current.

The strongest negative correlation is -0.284, between "Missed rent" and "Average portfolio rent". This negative relationship is weak. It indicates that property owners whose portfolios include relatively low-cost units are more likely to have tenants who missed rental payments. The economic impact of the pandemic has fallen disproportionately on low-income families. Since low-income families are more likely to be renters and more likely to live in low-cost units, we should expect owners of low-rent units to be more likely to have tenants who have missed rent payments.

The correlation coefficient between "Anticipated decline in operating income" and "Vacancy rate" is +0.34. This means that landlords who anticipate a decline in operating income also tend to report higher vacancy rates in their residential rental portfolios.

#### Variables

The variables included in the analysis are listed below:

• Missed rent: Have any of your residential tenants missed rent payments since the start of the pandemic? (yes = 1). It’s a categorical variable.
• Program participation: Answered yes to either a federal or state and local program. (yes = 1). It’s a categorical variable.
• Missed mortgage: Are you currently past-due on any of your mortgage payments? (yes = 1). It’s a categorical variable.
• Deferred maintenance: Have you postponed any planned maintenance since the start of the pandemic? (yes = 1). It’s a categorical variable.
• Expects lower operating income: How do you anticipate this year’s operating income (2020) will compare with last year’s (2019)? (yes = 1). It’s a categorical variable.
• % negative cash flow: Proportion of portfolio with negative cash flows. It’s a numeric variable.
• Higher vacancy: How does the current vacancy in your portfolio right now compare to the same time last year? (slightly or much higher = 1). It’s a categorical variable.
• Male: Male = 1. It’s a categorical variable.
• Black: Black = 1. It’s a categorical variable.
• Has other job: Has other job = 1. It’s a categorical variable.
• College educated: College educated = 1. It’s a categorical variable.
• Owner income over 125K: Respondent’s total income > 125k = 1. It’s a categorical variable.
• Total units: Total units in landlord’s portfolio (self-reported). It’s a numeric variable.
• Average portfolio rent: Average rent charged across portfolio. It’s a numeric variable.
• Offered free rent: Do you give tenants a free rent or a partial reduction in rent? (yes = 1). It’s a categorical variable.
• Offered payment plan: Have you worked out a payment plan to bring tenants current? (yes = 1). It’s a categorical variable.
• Started eviction: Since the start of the pandemic, have you started the eviction process for any of your tenants? (yes = 1). It’s a categorical variable.
• Allowed broken leases: Have you allowed tenants to break leases without penalty? (yes = 1). It’s a categorical variable.

We summarize each variable and list brief information about it here: The summaries of each variable.

#### Correlation process

##### Pearson Correlation coefficient calculation

We used the Pearson correlation coefficient to calculate the correlation between two continuous variables. It is the covariance of the two variables divided by the product of their standard deviations. Given a pair of random variables $(X, Y)$, the formula for the Pearson correlation coefficient $\rho_{X,Y}$ is:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[XY]-E[X]E[Y]}{\sigma_X \sigma_Y}, \qquad \rho_{X,Y} \in [-1,1],$$

where $\operatorname{cov}(\cdot)$ is the covariance, $E[\cdot]$ is the expectation, and $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$. If $\rho_{X,Y}$ is close to 0, there is little or no linear relationship between the two variables. If $\rho_{X,Y}$ is positive, then as one variable gets larger, the other tends to get larger. If $\rho_{X,Y}$ is negative, then as one gets larger, the other tends to get smaller (often called an “inverse” correlation).
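To make the formula concrete, here is a minimal Python sketch that computes the coefficient straight from the definition (covariance divided by the product of the standard deviations). It is an illustration only, not the code used for the analysis, which relied on R. The function name `pearson` and the toy numbers are assumptions made up for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation from the definition: cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance as the mean of centered products, equal to E[XY] - E[X]E[Y]
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Population standard deviations (divide by n); the 1/n factors cancel in the ratio
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (sd_x * sd_y)


# Toy data invented for illustration (not survey data)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(pearson(x, [2.0, 4.0, 6.0, 8.0, 10.0]))  # increasing line: close to 1
print(pearson(x, [10.0, 8.0, 6.0, 4.0, 2.0]))  # decreasing line: close to -1
```

Because the same divisor appears in the covariance and in both standard deviations, dividing by n or by n - 1 gives the same coefficient.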

##### Point-biserial Correlation coefficient calculation

To calculate the correlation between a dichotomous variable and a continuous variable, we used the point-biserial correlation coefficient, which is a special case of the Pearson correlation. Assume that the dichotomous variable $Y$ takes the two values 0 and 1, and that $X$ is the continuous variable. If we divide the data set into two groups, group 1 with $Y = 1$ and group 0 with $Y = 0$, then the point-biserial correlation coefficient is calculated as follows:

$$\rho_{X,Y} = \frac{M_1-M_0}{s_n} \sqrt{\frac{n_1 n_0}{n^2}},$$

where $s_n$ is the standard deviation of $X$ over the whole sample:

$$s_n = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_i-\overline{X})^2},$$

$M_1$ is the mean value of the continuous variable $X$ for all data points in group 1, and $M_0$ is the mean value of $X$ for all data points in group 0. Further, $n_1$ is the number of data points in group 1, $n_0$ is the number of data points in group 0, and $n$ is the total sample size.
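The sketch below evaluates the point-biserial formula directly and compares it with a plain Pearson calculation on the same data, to illustrate that the first is a special case of the second. It is a hedged illustration rather than the analysis code. The function names and the toy values in `rent` and `missed` are invented for this example.

```python
import math


def point_biserial(x, y):
    """Point-biserial correlation between continuous x and binary y coded 0/1."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    group1 = [a for a, b in zip(x, y) if b == 1]  # observations with Y = 1
    group0 = [a for a, b in zip(x, y) if b == 0]  # observations with Y = 0
    n1, n0, n = len(group1), len(group0), len(x)
    if n1 == 0 or n0 == 0 or n1 + n0 != n:
        raise ValueError("y must contain only 0 and 1, with both values present")
    m1 = sum(group1) / n1
    m0 = sum(group0) / n0
    mean_x = sum(x) / n
    # s_n: standard deviation of x over the whole sample, dividing by n
    s_n = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    if s_n == 0:
        raise ValueError("correlation is undefined when x is constant")
    return (m1 - m0) / s_n * math.sqrt(n1 * n0 / n ** 2)


def pearson(x, y):
    """Plain Pearson correlation, used here only as a cross-check."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Toy data invented for illustration (not survey data)
rent = [900.0, 1100.0, 1500.0, 1200.0, 800.0, 1600.0]
missed = [1, 1, 0, 0, 1, 0]

r_pb = point_biserial(rent, missed)
r_pearson = pearson(rent, missed)
print(round(r_pb, 4), round(r_pearson, 4))  # the two values agree
```

In practice this means that a single Pearson routine applied to a 0/1-coded column yields the point-biserial coefficient without a separate formula.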

##### Phi coefficient calculation

A phi coefficient (sometimes called a mean square contingency coefficient) is a measure of the association between two binary variables. Suppose we have a 2×2 table for two random variables $X$ and $Y$:

| | y = 1 | y = 0 | Total |
|---|---|---|---|
| x = 1 | $n_{11}$ | $n_{10}$ | $n_{1.}$ |
| x = 0 | $n_{01}$ | $n_{00}$ | $n_{0.}$ |
| Total | $n_{.1}$ | $n_{.0}$ | $n$ |

where $n_{11}$, $n_{10}$, $n_{01}$ and $n_{00}$ are the observation counts in each cell, which sum to $n$, the total number of observations. The phi coefficient that describes the association of $X$ and $Y$ is

$$\phi_{X,Y} = \frac{n_{11}n_{00}-n_{10}n_{01}}{\sqrt{n_{1.}n_{0.}n_{.1}n_{.0}}}.$$
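As a quick check of the formula, the following Python sketch computes phi from the four cell counts of a 2×2 table. It is an illustration under assumed names, not the routine used in the analysis. `phi_coefficient` and `table_counts` are hypothetical helpers, and the 0/1 vectors are made up.

```python
import math


def table_counts(x, y):
    """Count the four cells (n11, n10, n01, n00) of the 2x2 table for 0/1 data."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return n11, n10, n01, n00


def phi_coefficient(n11, n10, n01, n00):
    """Phi coefficient from the cell counts of a 2x2 contingency table."""
    row1, row0 = n11 + n10, n01 + n00  # row totals for x = 1 and x = 0
    col1, col0 = n11 + n01, n10 + n00  # column totals for y = 1 and y = 0
    denom = math.sqrt(row1 * row0 * col1 * col0)
    if denom == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (n11 * n00 - n10 * n01) / denom


# Toy binary data invented for illustration (not survey data)
x = [1, 1, 1, 0, 0, 0, 1, 0]
y = [1, 1, 0, 0, 0, 1, 1, 0]

counts = table_counts(x, y)
print(counts)                    # (3, 1, 1, 3)
print(phi_coefficient(*counts))  # (3*3 - 1*1) / sqrt(4*4*4*4) = 0.5
```

For two variables coded 0/1, phi takes the same value as the Pearson correlation computed on the raw 0/1 columns.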

To calculate the correlation between two variables when at least one of them is continuous, we used the cor function in the R package stats (version 3.6.2). The correlation between two binary variables can be calculated with the phi function in the R package psych (version 2.1.3).
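For readers who do not work in R, the sketch below shows one way a pairwise correlation table could be assembled in Python for a mix of binary and continuous columns. It is not the code behind the figure. It relies on the fact that the point-biserial and phi coefficients coincide with the Pearson coefficient when binary variables are coded 0/1, so one function covers every pair. The column names and values are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(ss_x * ss_y)


def pairwise_correlations(columns):
    """Return {(name_a, name_b): r} for every unordered pair of columns."""
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }


# Hypothetical columns with made-up values (not survey data)
columns = {
    "missed_rent": [1, 1, 0, 0, 1, 0],                           # binary
    "payment_plan": [1, 0, 0, 0, 1, 1],                          # binary
    "avg_rent": [900.0, 1100.0, 1500.0, 1200.0, 800.0, 1600.0],  # continuous
}

result = pairwise_correlations(columns)
for pair, r in result.items():
    print(pair, round(r, 3))
```

Each entry of `result` corresponds to one cell of a correlation plot like the one shown earlier: binary-binary pairs give phi, binary-continuous pairs give the point-biserial coefficient, and continuous-continuous pairs give the ordinary Pearson coefficient.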