Look back to the raw data you collected in week 1. There are 7 variables listed:

Vehicle type/class

Year

Make

Model

Price

MPG (city)

MPG (highway)

Choose TWO variables that you feel are correlated and explain why you feel that they are correlated. Do you suspect the relation is positive or negative? Why? Which would be considered the independent variable, which the dependent variable? Why?

Run a regression analysis in Excel and provide the results in your post along with your raw data. Looking at the *R²* value, explain what this indicates about the strength of the relation. Then write out your Regression Equation, and state your p-value and conclusion.

I encourage you to review the *Week 7 Regression PDF* at the bottom of the discussions. This will give you a step-by-step example of how to calculate a correlation and run a Regression using Excel. I DO NOT recommend doing this by hand. Let Excel do the heavy lifting for you. You can also use this PDF in the Quizzes section.

There are additional PDFs that were created to help you with the Homework, Lessons and Tests in the Quizzes section. I encourage you to review these ASAP! These PDFs are also located at the bottom of the discussion.

Once you have posted your initial discussion, you must reply to at least two other learners' posts. Each post must be on a different topic. So, you will have your initial post from one topic, your first follow-up post from a different topic, and your second follow-up post from one of the other topics. Of course, you are more than welcome to respond to more than two learners.

**Instructions:** Make sure you include your data set in your initial post as well.

Recall our Car Price data:

| Observation | Car Price | Year | Years Old |
|---|---|---|---|
| 1 | $20,000 | 2015 | 4 |
| 2 | $25,000 | 2016 | 3 |
| 3 | $30,000 | 2018 | 1 |
| 4 | $31,000 | 2018 | 1 |
| 5 | $22,500 | 2016 | 3 |
| 6 | $25,000 | 2016 | 3 |
| 7 | $29,500 | 2018 | 1 |
| 8 | $24,000 | 2015 | 4 |
| 9 | $24,500 | 2017 | 2 |
| 10 | $25,000 | 2017 | 2 |

With the Regression output (reproduced below), we next want to test the hypothesis and see if the results are significant. The hypothesis scenario looks like:

H₀: ρ = 0

Hₐ: ρ ≠ 0

If we look at the p-value, or the Significance F, we see the p-value = .000673. Since .000673 < .05, yes, this is significant. This means Years Old is a significant predictor of the Price of a Car.

SUMMARY OUTPUT

Regression Statistics

| Statistic | Value |
|---|---|
| Multiple R | 0.884606501 |
| R Square | 0.782528661 |
| Adjusted R Square | 0.755344744 |
| Standard Error | 1725.490814 |
| Observations | 10 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 85706451.61 | 85706451.61 | 28.78646 | 0.000673381 |
| Residual | 8 | 23818548.39 | 2977318.548 | | |
| Total | 9 | 109525000 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% |
|---|---|---|---|---|---|---|---|---|
| Intercept | 31959.67742 | 1296.435244 | 24.65196589 | 7.83E-09 | 28970.09239 | 34949.26245 | 28970.09239 | 34949.26245 |
| Years Old | -2629.032258 | 490.0064638 | -5.365301179 | 0.000673 | -3758.98919 | -1499.075326 | -3758.98919 | -1499.075326 |

We can also compare r, the correlation, to a critical value. If r < the negative critical value or r > the positive critical value, then r is significant. If r is significant, the line may be used for prediction. We know r = -.8846. There is a correlation critical value table in RealizeIT under Testing the Hypothesis in Week 7, but here is a link to a more detailed table: Correlation CV Table. We know alpha = .05, this is a two-tailed test from the hypothesis scenario, and n = 10. The critical value that corresponds to this in the table is r CV = 0.632. Our correlation is negative, so we will use the negative of this value. Since -.8846 < -0.632, r is significant and you can use the line for prediction. This is the same conclusion we got with the p-value from above.

Lastly, we can run a t-test to see if the data is significant. From the regression output, the t Stat for the slope is -5.3653. If we didn't have the regression output, we could calculate this value using this equation:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

Plugging in our correlation and sample size we get:

$$t = \frac{-0.8846\sqrt{10-2}}{\sqrt{1-(-0.8846)^2}} = \frac{-2.50202}{0.4663505} = -5.3651$$

The t test statistic we calculated by hand is very close to the t Stat in the output. It is a little off because I rounded some of my values. Then we can use the =T.DIST.2T function to find the p-value. This Excel function should look familiar.

=T.DIST.2T(ABS(-5.3651),8)

Remember, if you have a negative value you will need to use the ABS function to take its absolute value.

p-value = 0.000673544 < .05. Yes, this is significant. This is the same conclusion as we got above, and (up to rounding) the same p-value as in the Regression Output. It does not matter which way you use to test the hypothesis of a Simple Linear Regression example; if done correctly, you will get the same conclusion every time.
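For readers who want to check the hand calculation outside Excel, here is a minimal Python sketch of the same two steps: the t statistic from r and n, and the two-tailed p-value. It is not part of the Excel workflow in this handout. The function names (`t_from_r`, `t_pdf`, `two_tailed_p`) are made up for illustration, and the p-value is approximated by numerically integrating the t density with Simpson's rule rather than by calling a statistics library.

```python
import math

def t_from_r(r, n):
    """t statistic for testing rho = 0, from sample correlation r and sample size n."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value: 1 minus the area between -|t| and |t| (Simpson's rule)."""
    upper = abs(t)               # same role as ABS() in the Excel formula
    h = upper / steps            # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3
    return 1 - central_area

r, n = -0.8846, 10               # rounded correlation and sample size from the example
t_stat = t_from_r(r, n)
p_value = two_tailed_p(t_stat, n - 2)
print(round(t_stat, 4), round(p_value, 6))
```

The result should agree with the hand calculation and with =T.DIST.2T to several decimal places; small differences come from rounding r to four decimals.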


Recall our Car Price data:

| Observation | Car Price | Year | Years Old |
|---|---|---|---|
| 1 | $20,000 | 2015 | 4 |
| 2 | $25,000 | 2016 | 3 |
| 3 | $30,000 | 2018 | 1 |
| 4 | $31,000 | 2018 | 1 |
| 5 | $22,500 | 2016 | 3 |
| 6 | $25,000 | 2016 | 3 |
| 7 | $29,500 | 2018 | 1 |
| 8 | $24,000 | 2015 | 4 |
| 9 | $24,500 | 2017 | 2 |
| 10 | $25,000 | 2017 | 2 |

With the Regression output:

SUMMARY OUTPUT

Regression Statistics

| Statistic | Value |
|---|---|
| Multiple R | 0.884606501 |
| R Square | 0.782528661 |
| Adjusted R Square | 0.755344744 |
| Standard Error | 1725.490814 |
| Observations | 10 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 85706451.61 | 85706451.61 | 28.78646 | 0.000673381 |
| Residual | 8 | 23818548.39 | 2977318.548 | | |
| Total | 9 | 109525000 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% |
|---|---|---|---|---|---|---|---|---|
| Intercept | 31959.67742 | 1296.435244 | 24.65196589 | 7.83E-09 | 28970.09239 | 34949.26245 | 28970.09239 | 34949.26245 |
| Years Old | -2629.032258 | 490.0064638 | -5.365301179 | 0.000673 | -3758.98919 | -1499.075326 | -3758.98919 | -1499.075326 |

Lastly, I want to use my Regression Equation to predict prices, and then find a 95% prediction interval for that predicted price. What would I expect to pay for a car that was manufactured in 2014? Remember 2019 – 2014 = 5. This means the car is 5 Years Old. This is the value you want to substitute into the Regression Equation. DO NOT put the calendar year into the equation.

$$
\begin{aligned}
\widehat{\text{Price}} &= -2{,}629.03\,(\text{Years Old}) + 31{,}959.68 \\
&= -2{,}629.03\,(5) + 31{,}959.68 \\
&= -13{,}145.16 + 31{,}959.68 \\
&= \$18{,}814.52
\end{aligned}
$$

For a car manufactured in 2014, which is 5 Years Old, we would expect to pay $18,814.52. Now that we know what we are expected to pay for a 5-Year-Old car, let's calculate a 95% prediction interval for 5 Years. We need to use this equation:

$$\hat{y} \pm t^{*}(SE)\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{(n-1)\,SD_x^2}}$$

We will use the =T.INV.2T function to find the t critical value. This value should look familiar. DF = n – 2 = 10 – 2 = 8, which is the same DF as for the Residual in the Regression output.

=T.INV.2T(0.05,8) returns 2.306004135.

Next, we will need to calculate the mean and SD for the x-variable. You should recall how to calculate descriptive statistics from Week 2.

Mean = 2.4

SD = 1.1737878

SE is the Standard Error from the Regression Output, which is 1725.4908. Now we can plug in what we know:

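$$
\begin{aligned}
&18814.52 \pm 2.306\,(1725.4908)\sqrt{1 + \frac{1}{10} + \frac{(5 - 2.4)^2}{(10 - 1)(1.1737878)^2}} \\
&18814.52 \pm 3978.9817\sqrt{1 + 0.1 + \frac{6.76}{12.4}} \\
&18814.52 \pm 3978.9817\sqrt{1.64516129} \\
&18814.52 \pm 3978.9817\,(1.28263841) \\
&18814.52 \pm 5103.59476 \\
&(\$13{,}710.93,\ \$23{,}918.11)
\end{aligned}
$$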

The 95% prediction interval for a 5-Year-Old car will go from $13,710.93 to $23,918.11.
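The same prediction interval can be reproduced outside Excel. The Python sketch below is only an illustration, not the method used in this handout: the slope, intercept, Standard Error and t critical value are copied from the Regression Output and from =T.INV.2T(0.05,8), and the name `prediction_interval` is a hypothetical helper.

```python
import math
import statistics

# Years Old column from the Car Price data.
years_old = [4, 3, 1, 1, 3, 3, 1, 4, 2, 2]

# Values copied from the handout (Regression Output and =T.INV.2T(0.05,8)).
slope, intercept = -2629.032258, 31959.67742
std_error = 1725.490814
t_critical = 2.306004135

def prediction_interval(x0):
    """Return (lower, upper) of the 95% prediction interval at x = x0."""
    n = len(years_old)
    x_bar = statistics.mean(years_old)
    sd_x = statistics.stdev(years_old)      # sample SD, like Excel's STDEV.S
    y_hat = slope * x0 + intercept          # predicted price
    spread = math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / ((n - 1) * sd_x ** 2))
    margin = t_critical * std_error * spread
    return y_hat - margin, y_hat + margin

lower, upper = prediction_interval(5)
print(f"${lower:,.2f} to ${upper:,.2f}")
```

Because this sketch keeps more decimal places than the hand calculation, its endpoints can differ from $13,710.93 and $23,918.11 by a cent or two.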


MLR

Recall: Linear Regression is a data analysis technique that tries to find a linear pattern in the data. We use all the data to calculate a straight line which may be used to predict the values. The equation of the line for a Simple Linear Regression (SLR) is:

$$\hat{y} = \beta_1 x + \beta_0$$

where β₁ is the slope coefficient, β₀ is the y-intercept, and ŷ is the predicted y value.

For Multiple Linear Regression (MLR) the equation will look like:

$$\hat{y} = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \dots + \beta_0$$

where β₁, β₂, β₃, β₄, β₅, … are the slope coefficients, β₀ is the y-intercept, and ŷ is the predicted y value.

For Multiple Linear Regression you will have more than 1 slope, but you will still only have 1 y-intercept. The number of slopes depends on the number of x-variables you have. In the example equation above, I only wrote out 5 slopes, but you can have any number as long as it is 2 or more. If there is only 1 slope, then this is an SLR, NOT an MLR.

Example: Let's move away from our car price data and look at home prices. Is the MLR model a good predictor of home prices? If so, what variables help predict the price of a home?

Data:

| Price | Area (Sq Ft) | Floor | Bedrooms | Bathrooms |
|---|---|---|---|---|
| $69,000 | 600 | 1 | 2 | 1 |
| $118,500 | 1000 | 2 | 2 | 2 |
| $125,000 | 1100 | 1 | 3 | 2 |
| $139,300 | 1300 | 2 | 3 | 2 |
| $147,900 | 1700 | 2 | 3 | 2 |
| $169,900 | 1800 | 1 | 3 | 2.5 |
| $134,900 | 1300 | 1 | 4 | 2.5 |
| $169,900 | 1700 | 2 | 4 | 3 |
| $194,500 | 2000 | 2 | 5 | 3.5 |
| $209,900 | 2100 | 3 | 5 | 4 |

Looking at our example, we want to use Area, Floor, Bedrooms, and Bathrooms to try to predict home prices. This means the x-variables are Area, Floor, Bedrooms, and Bathrooms, and the y-variable is Price. Because there are 4 x-variables, there are 4 slopes in the MLR model.

Next, we will run a Regression using Excel. We will use the Data Analysis ToolPak to run the Regression. Go to Data –> Data Analysis. When the new window pops up, scroll to where it says “Regression”, highlight it, and click “OK”.

Then it will say “Input”.

Input Y Range: Click in the box and highlight the y values, that is, the Price column.

Input X Range: Click in the box and highlight the x values, that is, the Area, Floor, Bedrooms, and Bathrooms columns.

Check the box that says “Labels”; this tells Excel that the first row has labels in it.

Output Options: Make sure the second bubble, “New Worksheet Ply”, is selected.

Make sure you check the boxes for Residuals and Standardized Residuals.

Then click “OK”.

(Remember, the x-values predict the y-value. Area, Floor, Bedrooms, and Bathrooms will predict the Price of a Home. This is very important to understand and remember.) It should look like this:

Once you click OK, here is the Regression Output:

SUMMARY OUTPUT

Regression Statistics

| Statistic | Value |
|---|---|
| Multiple R | 0.994699234 |
| R Square | 0.989426567 |
| Adjusted R Square | 0.98096782 |
| Standard Error | 5.602053083 |
| Observations | 10 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 4 | 14683.58101 | 3670.895252 | 116.9708249 | 3.99315E-05 |
| Residual | 5 | 156.9149937 | 31.38299874 | | |
| Total | 9 | 14840.496 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 27.96762856 | 7.053708269 | 3.964953964 | 0.010689849 | 9.835494214 | 46.09976291 |
| Area (Sq Ft) | 0.053181736 | 0.008231449 | 6.460798716 | 0.001322498 | 0.032022122 | 0.074341349 |
| Floor | -0.11176635 | 3.801943471 | -0.029397162 | 0.977685138 | -9.884973176 | 9.661440476 |
| Bedrooms | -5.382699832 | 4.706373686 | -1.143704302 | 0.304530968 | -17.48081854 | 6.715418878 |
| Bathrooms | 24.79927334 | 7.643770423 | 3.24437705 | 0.022837965 | 5.150335932 | 44.44821074 |

Looking at the output, we see the estimated regression line is:

$$\widehat{\text{Price}} = 0.053181(\text{Area}) - 0.111766(\text{Floor}) - 5.382699(\text{Bed}) + 24.79927(\text{Bath}) + 27.9676$$

Note that Price in this output is measured in thousands of dollars, so a coefficient of 24.79927 corresponds to $24,799.

To interpret the Bathroom slope we state: holding all other variables constant, if the number of Bathrooms in a house increases by 1, then the price of the home will increase by $24,799. This makes sense because bathrooms are something everyone uses and are a good selling point for the home, which we will discuss some more below.

The y-intercept is $27,968. This means if you have 0 Sq. Ft., 0 Floors, 0 Bedrooms, and 0 Bathrooms, then the price of a home will be $27,968. This doesn't make sense in the context of our problem, because if you don't have anything in a home, then you have a plot of land. Plots of land sell differently than homes that are already built, and you would use different x-variables as predictors for plots of land.
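As a quick check on how the estimated regression line is used, the Python sketch below plugs one home into the equation. It is an illustration only: the coefficients are copied from the Regression Output above (Price in thousands of dollars), and `predict_price` is a hypothetical helper name.

```python
# Coefficients copied from the Regression Output; Price is in thousands of dollars.
INTERCEPT = 27.96762856
COEFFICIENTS = {
    "area": 0.053181736,
    "floor": -0.11176635,
    "bedrooms": -5.382699832,
    "bathrooms": 24.79927334,
}

def predict_price(area, floor, bedrooms, bathrooms):
    """Predicted price (in $1,000s): intercept plus each slope times its x-value."""
    features = {"area": area, "floor": floor, "bedrooms": bedrooms, "bathrooms": bathrooms}
    return INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in features.items())

# First home in the data: 600 Sq Ft, 1 floor, 2 bedrooms, 1 bathroom.
first_home = predict_price(area=600, floor=1, bedrooms=2, bathrooms=1)
print(round(first_home, 2))
```

For the first home this gives roughly 73.8, that is, about $73,800, against an actual price of $69,000.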

Is this model a good predictor for Price of a Home? Looking at the overall Significance F value, we get 3.99315E-05. The “E” means scientific notation in Excel. This can be rewritten as 3.99315 × 10⁻⁵; moving the decimal point 5 places to the LEFT, the overall p-value is .0000399315. Since .0000399315 < .05, yes, this model is a good predictor for Price of a Home.

Now that we know the overall model is significant, which variables, or slopes, are good predictors for Home Price? We want to look at the individual p-values for each slope.

– Area: .001322 < .05. Yes, this is significant and a good predictor for Home Price.

– Floor: .97769 > .05. No, this is not significant and not a good predictor for Home Price.

– Beds: .30453 > .05. No, this is not significant and not a good predictor for Home Price.

– Baths: .02284 < .05. Yes, this is significant and a good predictor for Home Price.

We see that Area and Bathrooms are significant predictors for Home Price because those p-values are less than .05.
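The comparison of each p-value to alpha can be sketched in a few lines of Python. This is only an illustration of the decision rule; the p-values are copied from the Regression Output, and the variable names are made up for the example.

```python
ALPHA = 0.05

# Overall model: Python reads Excel's "E" notation directly.
significance_f = float("3.99315E-05")        # 0.0000399315
model_is_significant = significance_f < ALPHA

# Individual p-values copied from the Coefficients table.
p_values = {
    "Area (Sq Ft)": 0.001322498,
    "Floor": 0.977685138,
    "Bedrooms": 0.304530968,
    "Bathrooms": 0.022837965,
}

# A slope is a significant predictor when its p-value is below alpha.
significant = [name for name, p in p_values.items() if p < ALPHA]
print(model_is_significant, significant)
```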

From the Regression output we notice that the Standard Error is 5.602 and the Adjusted R-squared is 98.1%.

The last thing we want to look at is the standardized residuals, to see if there are any outliers.

RESIDUAL OUTPUT

| Observation | Predicted Price | Residuals | Standard Residuals | Price |
|---|---|---|---|---|
| 1 | 73.79877725 | -4.79877725 | -1.149263527 | $69,000 |
| 2 | 119.7589785 | -1.25897848 | -0.301513901 | $118,500 |
| 3 | 119.8062186 | 5.193781442 | 1.24386344 | $125,000 |
| 4 | 130.3307993 | 8.969200671 | 2.148042024 | $139,300 |
| 5 | 151.6034936 | -3.703493572 | -0.886953043 | $147,900 |
| 6 | 169.4330702 | 0.466929849 | 0.111825454 | $169,900 |
| 7 | 137.4595025 | -2.559502515 | -0.612977585 | $134,900 |
| 8 | 171.0200671 | -1.120067077 | -0.268245883 | $169,900 |
| 9 | 193.9915246 | 0.508475405 | 0.121775237 | $194,500 |
| 10 | 211.5975685 | -1.697568474 | -0.406552217 | $209,900 |

Any Standardized Residual outside of the range -2 to 2 can be considered an outlier. Anything outside of -3 to 3 is also an outlier, and you might want to do additional research to investigate that data point.

When the Home Price is $139,300, the Predicted Price was $130,331. The Standardized Residual is 2.148. This data point can be considered an outlier. This means that the model underpredicted this data point for the house. If you are buying this home, then this is good news for you, because you can make a lower offer. If you are selling this home, that might not be such good news for you, because you want to get as much money for your home as possible.

When the Home Price is $209,900, the Predicted Price was $211,598. The Standardized Residual is -.406. This data point is not an outlier, but it means that the model overpredicted this data point for the house. If you are selling this home, this might be good news for you, because then you can list the home for higher than you originally thought. You don't want to list it too high, because you do want offers, but it is something to consider.

Make sure you look at the context of the problem to see if the outliers play in your favor or not. Regardless, you will want to look at a range from -2 to 2.
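The -2 to 2 rule can be applied to the whole Standard Residuals column at once. The short Python sketch below is one possible way to do it; the values are copied from the RESIDUAL OUTPUT table, and `classify` is a hypothetical helper name.

```python
# Standard Residuals copied from the RESIDUAL OUTPUT table, in observation order.
standard_residuals = [
    -1.149263527, -0.301513901, 1.24386344, 2.148042024, -0.886953043,
    0.111825454, -0.612977585, -0.268245883, 0.121775237, -0.406552217,
]

def classify(z, cutoff=2.0):
    """Label one standardized residual using the -2 to 2 rule."""
    if z > cutoff:
        return "outlier (model underpredicted)"   # actual price above predicted
    if z < -cutoff:
        return "outlier (model overpredicted)"    # actual price below predicted
    return "not an outlier"

labels = {obs: classify(z) for obs, z in enumerate(standard_residuals, start=1)}
outliers = [obs for obs, label in labels.items() if label.startswith("outlier")]
print(outliers)
```

Only observation 4 (the $139,300 home) falls outside the -2 to 2 range.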


The direction and strength of association between 2 variables is often expressed in a single number called the correlation coefficient. This is denoted by the variable r.

• r can only be between -1 and 1: -1 ≤ r ≤ 1.

• If r = 0, then there is no linear relationship at all.

• If r = -1, then there is a perfect linear relationship that slopes down.

• If r = 1, then there is a perfect linear relationship that slopes up.

The Coefficient of Determination is the percentage of the variation in the dependent variable (y) that is accounted for by the model. This is denoted by R². Note: in Simple Linear Regression, if you have one you can find the other, because R² = r².

Simple Linear Regression is a data analysis technique that tries to find a linear pattern in the data. In linear regression, we use all the data to calculate a straight line which may be used to predict the values. We will also discuss if the linear regression is significant and if the independent variable (x) is a significant predictor of the dependent variable (y).

The equation of the line for a Simple Linear Regression (SLR) is:

$$\hat{y} = \beta_1 x + \beta_0$$

where β₁ is the slope coefficient, β₀ is the y-intercept, and ŷ is the predicted y value.

Let's review our car price example. From the car price data, we also found out what year these cars were manufactured in.

| Observation | Car Price | Year |
|---|---|---|
| 1 | $20,000 | 2015 |
| 2 | $25,000 | 2016 |
| 3 | $30,000 | 2018 |
| 4 | $31,000 | 2018 |
| 5 | $22,500 | 2016 |
| 6 | $25,000 | 2016 |
| 7 | $29,500 | 2018 |
| 8 | $24,000 | 2015 |
| 9 | $24,500 | 2017 |
| 10 | $25,000 | 2017 |

Having this information, we first want to see if there is a correlation between Year and Car Price. Usually the older the car, the cheaper the car is. As the age goes up, the price will go down. The Price of the car depends on what Year it was manufactured. This describes a negative correlation, but I want to see if my assumption is correct and what the actual correlation value is.

Before we can do any calculations on the data, we will need to convert the Year to an age in years. Keeping the calendar Year would make the results awkward to analyze and interpret. If the car was made in 2018, then this means the car will be 1 year old: 2019 – 2018 = 1. I am rounding all these to full years for ease of the example. Converting all these Years will look like:

| Observation | Car Price | Year | Years Old |
|---|---|---|---|
| 1 | $20,000 | 2015 | 4 |
| 2 | $25,000 | 2016 | 3 |
| 3 | $30,000 | 2018 | 1 |
| 4 | $31,000 | 2018 | 1 |
| 5 | $22,500 | 2016 | 3 |
| 6 | $25,000 | 2016 | 3 |
| 7 | $29,500 | 2018 | 1 |
| 8 | $24,000 | 2015 | 4 |
| 9 | $24,500 | 2017 | 2 |
| 10 | $25,000 | 2017 | 2 |

Now that we have our data we can start analyzing it. To find the correlation we will use the =CORREL( ) function in Excel. In Excel, type “=CORREL(” including the left parenthesis, highlight the first column, type a comma, highlight the second column, close the parentheses, and hit Enter. Note: it does not matter which column you highlight first.

Here we see that the Correlation = -.8846. This is in fact a negative correlation and agrees with our assumption. As the Age of the Car goes up, the Price of the Car will go down. Now that we have the Correlation we can find R²: (-.8846) × (-.8846) = .7825 = 78.25%.
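For comparison with =CORREL( ), here is a minimal Python sketch that converts Year to Years Old and computes the correlation from its definition. It is an illustration only, not part of the Excel steps; `CURRENT_YEAR` and `correlation` are names chosen for this example.

```python
import math

CURRENT_YEAR = 2019  # reference year used in this example

years = [2015, 2016, 2018, 2018, 2016, 2016, 2018, 2015, 2017, 2017]
prices = [20000, 25000, 30000, 31000, 22500, 25000, 29500, 24000, 24500, 25000]

# Convert the calendar Year to Years Old (e.g. 2019 - 2018 = 1).
years_old = [CURRENT_YEAR - year for year in years]

def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = correlation(years_old, prices)
r_squared = r ** 2
print(round(r, 4), round(r_squared, 4))
```

Swapping the two arguments gives the same r, which mirrors the note that it does not matter which column you highlight first.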

Next, we will run a Regression using Excel. We will use the Data Analysis ToolPak to run the Regression. Go to Data –> Data Analysis. When the new window pops up, scroll to where it says “Regression”, highlight it, and click “OK”.

Once you click “OK”, a new window pops up where it will say “Input”.

Input Y Range: Click in the box and highlight the y values.

Input X Range: Click in the box and highlight the x values.

Check the box that says “Labels”; this tells Excel that the first row has labels in it.

Output Options: Make sure the second bubble, “New Worksheet Ply”, is selected.

Residuals: Make sure you check the boxes for Residuals and Standardized Residuals.

Then click “OK”.

(Remember, the x-value predicts the y-value. The age of the Car will predict what the Price of the Car is. This tells us that Years Old is the x-value and Price is the y-value. This is very important to understand and remember.) It should look like this:

Once you click OK, here is the Regression Output:

SUMMARY OUTPUT

Regression Statistics

| Statistic | Value |
|---|---|
| Multiple R | 0.884606501 |
| R Square | 0.782528661 |
| Adjusted R Square | 0.755344744 |
| Standard Error | 1725.490814 |
| Observations | 10 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 85706451.61 | 85706451.61 | 28.78646 | 0.000673381 |
| Residual | 8 | 23818548.39 | 2977318.548 | | |
| Total | 9 | 109525000 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% |
|---|---|---|---|---|---|---|---|---|
| Intercept | 31959.67742 | 1296.435244 | 24.65196589 | 7.83E-09 | 28970.09239 | 34949.26245 | 28970.09239 | 34949.26245 |
| Years Old | -2629.032258 | 490.0064638 | -5.365301179 | 0.000673 | -3758.98919 | -1499.075326 | -3758.98919 | -1499.075326 |

Looking at the output, we see that the Multiple R is the correlation. We know the Correlation is negative, but the regression output gives us the positive value. Make sure you look at the sign of the slope coefficient for validation. We also see that the R-squared is 78.25%, which is what we calculated before. R-squared tells us that 78.25% of the variation in Price can be accounted for by this model, that is, by Age. The best R-squared value is 100%. Our value is less than that, but it is still high enough to give us a good indication of what the data will look like, and it tells us that we want to interpret the model further.
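As a cross-check on the Regression Output, the slope, intercept, and R-squared can be reproduced from the raw data with the usual least-squares formulas. The Python sketch below is only an illustration of that calculation; `fit_line` and `r_squared` are hypothetical helper names, and Excel remains the tool used in this handout.

```python
years_old = [4, 3, 1, 1, 3, 3, 1, 4, 2, 2]
prices = [20000, 25000, 30000, 31000, 22500, 25000, 29500, 24000, 24500, 25000]

def fit_line(xs, ys):
    """Least-squares slope and intercept for y-hat = slope * x + intercept."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Share of the variation in y accounted for by the fitted line."""
    y_bar = sum(ys) / len(ys)
    ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_residual / ss_total

slope, intercept = fit_line(years_old, prices)
r2 = r_squared(years_old, prices, slope, intercept)
print(round(slope, 2), round(intercept, 2), round(r2, 4))
```

The negative slope confirms the direction of the relationship, even though Multiple R is reported as a positive number.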

Next, we want to see if this model is significant and if Years Old is a significant predictor for Price. We will look at the Significance F value. Recall: if the p-value is < alpha, then yes, this is significant.

The p-value associated with this model is .00067338.

.00067338 < .05. The p-value is in fact less than alpha. We can state that yes, Years Old is in fact a significant predictor for Price. This can be valuable information to know if we are going out to buy a new car. Now that we know the model is significant, let's write out the Regression Equation and interpret the values.

In the Regression Output, if we look under Coefficients, this is where we will find the values to write out the Regression Equation: 31959.67742 and -2629.032258.

Next to those values we see the word “Intercept”; this corresponds to the y-intercept value. And we see the words “Years Old”; this corresponds to the slope coefficient value. Using the equation ŷ = β₁x + β₀, we will write out the regression equation and replace “x” and “y” with the actual variable names.

$$\widehat{\text{Price}} = -2{,}629.03\,(\text{Years Old}) + 31{,}959.68$$

We see that the y-intercept is $31,959.68. This means when Years Old equals 0, the Price of a Car should be $31,959.68.

$$\widehat{\text{Price}} = -2{,}629.03\,(0) + 31{,}959.68 = 31{,}959.68$$

This makes sense because Year 0 is 2019. So, if you bought this type of car in the Year 2019, you would expect to pay $31,959.68. Please note: the y-intercept, while it does make sense in this case, does not always have a practical meaning. The y-intercept WILL NOT make sense in every scenario. It is OK for the y-intercept not to make sense in certain problems. For example, if you wanted to use the Weight of a Car to predict the Price, the Weight of a Car will NEVER be 0 pounds, so the y-intercept would not have a practical meaning in that problem.

Next, we want to interpret the Slope. As Years Old increases by 1 year, the predicted Price of the Car goes down by $2,629.03. In other words, as the car gets older, the price keeps decreasing by $2,629.03 every year.

Lastly, I want to use my Regression Equation to predict prices. What would I expect to pay for a car that was manufactured in 2014? Remember 2019 – 2014 = 5. This means the car is 5 Years Old. This is the value you want to substitute into the Regression Equation. DO NOT put the calendar year into the equation.

$$
\begin{aligned}
\widehat{\text{Price}} &= -2{,}629.03\,(\text{Years Old}) + 31{,}959.68 \\
&= -2{,}629.03\,(5) + 31{,}959.68 \\
&= -13{,}145.16 + 31{,}959.68 \\
&= \$18{,}814.52
\end{aligned}
$$

For a car manufactured in 2014, which is 5 Years Old, we would expect to pay $18,814.52.

This is a good analysis for an SLR. But what if we wanted to analyze the data further? We could run a Multiple Linear Regression (MLR). Multiple Linear Regression is just like it sounds. Instead of having only 1 x-variable, we have multiple x-variables. In our example the x-variable was Years Old, and it did a good job at predicting Price. But what other values could you use to predict the Price of a Car?

One thing that comes to mind is Total Miles. When you are looking to buy a car, you also want to look at Total Miles. Usually you want a car with fewer miles on it. The fewer miles on the car, the higher the price; the more miles on the car, the lower the price. This appears to be another negative correlation, or relationship.

Another variable that comes to mind is a 5-star safety rating. The safer the car, the more people are willing to pay for it. If a car has 5 stars, it will be more expensive than if it only had 2 or 3 stars. This appears to be a positive correlation, or relationship.

You would want to run an MLR to justify and verify your claims, but these are just a few variables you could include to turn this SLR into an MLR.