# Statistic 7

Open Posted By: ahmad8858 Date: 14/12/2020 High School Homework Writing

## W7: Regression and Correlation

Look back to the raw data you collected in week 1. There are 7 variables listed:

Vehicle type/class

Year

Make

Model

Price

MPG (city)

MPG (highway)

Choose TWO variables that you feel are correlated and explain why you feel that they are correlated. Do you suspect the relation is positive or negative? Why? Which would be considered the independent variable, which the dependent variable? Why?

Run a regression analysis in Excel and provide the results in your post along with your raw data. Looking at the R2 value, explain what this indicates about the strength of the relation.  Then write out your Regression Equation, state if your p-value and conclusion.

I encourage you to review the Week 7 Regression PDF  at the bottom of the discussions.  This will give you a step by step example on how to calculate a correlation and run a Regression using Excel. I DO NOT recommend doing this by hand.  Let Excel do the heavy lifting for you.  You can also use this PDF in Quizzes section.

There are additional PDFs that were created to help you with the Homework, Lessons and Tests in Quizzes section.  I encourage you to review these ASAP!  These PDFs are also located at the bottom of the discussion.

Once you have posted your initial discussion, you must reply to at least two other learner's post. Each post must be a different topic. So, you will have your initial post from one topic, your first follow-up post from a different topic, and your second follow-up post from one of the other topics. Of course, you are more than welcome to respond to more than two learners.”

Instructions: Make sure you include your data set in your initial post as well.  You must also respond to at least 2 other students. Responses may include direct questions.

Peer response #1 - Looking at your peer's Excel output, and the Regression Equation they wrote out, interpret the slope of their Regression Equation.  Use their Regression Equation to make a prediction and show the work for your predicted value based on your expression.   For Example, if your peer used Year to predict Price, plug in a Year value into the regression equation and use it to predict the Price of a vehicle.  Does this predicted Price value make sense with their data?

Peer response #2 - It is important to remember that typically a two-factor regression model cannot accurately describe the entire situation. Look at the dependent variable that your peer chose. Name at least 2 independent factors you would use to run a Multiple Linear Regression (MLR) and explain why you feel they are related.  Then use those factors to run a Multiple Linear Regression (MLR) on your peer's data and see if the variables you chose are related to the dependent variable they chose.  What is your MLR equation?  Is your MLR significant?  Are any of the Independent factors significant?  What is the R2 value?  Explain and interpret this value and how it relates to the MLR.  Make sure you include your MLR Excel output as an attachment in your response post.

Category: Mathematics & Physics Subjects: Algebra Deadline: 12 Hours Budget: \$100 - \$150 Pages: 2-3 Pages (Short Assignment)

## Attachment 1

The correlation is the direction and strength of association between 2 variables is

often expressed in a single number called the correlation coefficient. This is

denoted by the variable r.

• r can only be between -1 and 1, -1 ≤ r ≤ 1. • If r = 0, then there is no linear relationship at all. • If r = -1, then there is a perfect linear relationship that slopes down. • If r = 1, then there is a perfect linear relationship that slopes up.

The Coefficient of Determination refers to how much percent Variation is around the model. This is denoted by R². Note: If you have one you can find the other. Simple Linear Regression is a data analysis technique that tries to find a linear pattern in the data. In linear regression, we use all the data to calculate a straight line which may be used to predict the values. We will also discuss if the linear regression is significant and if the independent variable (x) is a significant predictor of the dependent variable (y). The equation of line for a Simple Linear Regression (SLR) is: �̂� = 𝛽1𝑥 + 𝛽0 Where 𝛽1 is the slope coefficient or the coefficient, 𝛽0 is the y-intercept and �̂� is the predicted y value. Let review our car price example. From the car price data, we also found out what year these cars where manufactured in.

Car Price: Year

Observation 1 \$ 20,000 2015 Observation 2 \$ 25,000 2016

Observation 3 \$ 30,000 2018 Observation 4 \$ 31,000 2018

Observation 5 \$ 22,500 2016 Observation 6 \$ 25,000 2016

Observation 7 \$ 29,500 2018

Observation 8 \$ 24,000 2015 Observation 9 \$ 24,500 2017

Observation 10 \$ 25,000 2017

Having this information, we first want to see if there is a correlation between Year and Car Price. Usually the older the car, cheaper the car is. As the age goes up, the price will go down. The Price of the car depends on what Year it was manufactured. This describes a negative correlation, but I want to see if my assumption is correct and what the actual correlation value is. Before we can do any calculations on the data, we will need to convert the Year to a numeric value. Keeping the physical Year is going to skew the data and it doesn’t make sense when we will get into analyzing and interpreting the data. If the car was made in 2018, then this means the car will be 1 year old. 2019 – 2018 = 1. I am rounding all these to full years for ease of the example. Converting all these Years will look like:

Car Price: Year Years Old

Observation 1 \$ 20,000 2015 4

Observation 2 \$ 25,000 2016 3 Observation 3 \$ 30,000 2018 1

Observation 4 \$ 31,000 2018 1 Observation 5 \$ 22,500 2016 3

Observation 6 \$ 25,000 2016 3

Observation 7 \$ 29,500 2018 1 Observation 8 \$ 24,000 2015 4

Observation 9 \$ 24,500 2017 2 Observation 10 \$ 25,000 2017 2

Now that we have our data we can start analyzing it. To find the correlation we will use the =CORREL( ) function in Excel. In Excel type in the “=” and the CORREL(;put a left parentheses, then highlight the first column; type in a comma, highlight the second column; close the parentheses ) and hit Enter. Note: it does not matter which column you highlight first.

Here we see that the Correlation = -.8846. This is in fact negative correlation and agrees with our assumption. As the Age of the Car goes up, the Price of the Car will go down. Now that we have the Correlation we can find R2. -.8846 * - .8846 = 78.25%

Next, we will run a Regression using Excel. We will use the Data Analysis ToolPak to run the Regression. Go to Data - > Data Analysis When the new window pops ups, scroll to where it says “Regression”, highlight it and Click “OK”

Once you Click “OK”, a new window pops up

Where it will say “Input” Input Y Range: Click in the box and highlight the y values. Input x Range: Click in the box and highlight the x values. Check the box that say “Labels” this will tell you that the first row has labels in it. Output Options Make sure the second bubble is highlighted. “New Worksheet Ply” Residuals Make sure you check the box for Residuals and Standardized Results Then Click “OK” (Remember, the x-value predicts the y-value. The Year of the Car will predict what the Price of the Car is. This tells us that Years Old is the x-value and Price is the y-value. This is very important to understand and remember) It should look like this:

Once you click OK, here is the Regression Output:

Looking at the output we see the Multiple R is the correlation. We know the

Correlation is negative, but the regression will give us the positive value. Make

sure you look at the coefficients for validation. We also see that the R-squared is

78.25%. Which is what we calculated before. R-squared tells us that:

78.25% of variation in the data between Age and Price, can we have accounted

for by this model. The best R-squared value is 100%, our value is less than that,

SUMMARY OUTPUT

Regression Statistics

Multiple R 0.884606501

R Square 0.782528661

Standard Error 1725.490814

Observations 10

ANOVA

df SS MS F Significance F

Regression 1 85706451.61 85706451.61 28.78646 0.000673381

Residual 8 23818548.39 2977318.548

Total 9 109525000

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%

Intercept 31959.67742 1296.435244 24.65196589 7.83E-09 28970.09239 34949.26245 28970.09239 34949.26245

Years Old -2629.032258 490.0064638 -5.365301179 0.000673 -3758.98919 -1499.075326 -3758.98919 -1499.075326

but it is still high enough to give us a good indication of what the data will look

like and it tells us that we want to interpret the model further.

Next, we want to see if this model is significant and if Years is a significant

predictor for Price. We will look at the Significance F value. Recall: if the p-value

is < alpha, then Yes this is significant.

The p-value associated with this model is .00067338.

.00067338 < .05. The p-value is in fact less than alpha. We can state that Yes,

Years is in fact a significant predictor for Price. This can be valuable information

to know if we are going out to buy a new car. Now that we know the model is

significant let’s write out the Regression Equation and interpret the values.

In the Regression Output if we look like the under Coefficients, this is where we

will find the values to write out the Regression Equation. I highlighted them in

Yellow below.

Next to those value we see the word “Intercept”, this corresponds to the y- intercept value. And we see the words “Years Old”, this corresponds to the slope coefficient value. Using this equation �̂� = 𝛽1𝑥 + 𝛽0, we will write out the regression equation and replace “x” and “y” with the actual variable names.

𝑃𝑟𝑖𝑐�̂� = −2,629.03 (𝑌𝑒𝑎𝑟𝑠 𝑂𝑙𝑑) + 31,959.68 We see that the y-intercept is \$31,959.68. This means when Years Old equals 0, the Price of a Car should be \$31,959.68.

𝑃𝑟𝑖𝑐�̂� = −2,629.03 (0) + 31,959.68 𝑃𝑟𝑖𝑐�̂� = 31,959.68 This makes sense because Year 0 is 2019. So, if you bought this type of car in the Year 2019, you will expect to pay \$31,959.68. Please note: The y-intercept while in this case does make sense does not always have a practical meaning. The y-intercept WILL NOT make sense in every scenario. It is OK for the y-intercept not make sense with certain problems. For example, if you wanted to use the Weight of a Car to predict the Price, the Weight

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%

Intercept 31959.67742 1296.435244 24.65196589 7.83E-09 28970.09239 34949.26245 28970.09239 34949.26245

Years Old -2629.032258 490.0064638 -5.365301179 0.000673 -3758.98919 -1499.075326 -3758.98919 -1499.075326

of a Car will NEVER be 0 pounds, so the y-intercept is not meaningful and would not have a practical meaning in the problem. Next, we want to interpret the Slope. As the Years Old increases by 1 year, then the Price of the Car will go down by \$2,629.03. Or as the car gets older and older the price will keep decreasing by \$2,629.03 every year. Lastly, I want to use my Regression Equation to predict prices. What would I expect to pay for a car that was manufactured in 2014? Remember 2019 – 2014 = 5. This means the car is 5 Years Old. This is the value you want to substitute into the Regression Equation. DO NOT put 2019 into the equation.

𝑃𝑟𝑖𝑐�̂� = −2,629.03 (𝑌𝑒𝑎𝑟𝑠 𝑂𝑙𝑑) + 31,959.68 𝑃𝑟𝑖𝑐�̂� = −2,629.03 (5) + 31,959.68 𝑃𝑟𝑖𝑐�̂� = −13,145.16 + 31,959.68 𝑃𝑟𝑖𝑐�̂� = \$18,814.52 In the Year 2014, when the car is 5 Years Old, we will expect to pay \$18,814.52 for a car. This is a good analysis for a SLR. But if we wanted to analysis the data further? We could run a Multiple Linear Regression (MLR). Multiple Linear Regression is just like it sounds. Instead of having only 1 x-variable, we have multiple x- variables. In our example the x-variable was Years Old, and it did a good job at predicting Price. But what other values could you use to predict the Price of a Car? One this that comes to mind is Total Miles. When you are looking to buy a car, you also want to look at Total Miles. Usually you want a car with fewer miles on it. The fewer miles on the car, the higher the price. Or the more miles you have a car, the lower the price. This appears to be another negative correlation, or relationship. Another variable that comes to mind is a 5-star safety rating. The safer the car the more people are willing to pay for safety. If a car has 5 stars it will be more expensive than if a car only had 2 or 3 stars. This appears to be a positive correlation or relationship.

You would want to run a MLR to justify and verify your claims, but these are just a few variables you could include to turn this SLR to a MLR.

## Sheet1

 Vehicle type/class Year Make Model Price MPG (City) MPG (hwy) Max Seating SUV 2020 BMW X3 \$42,945 25 29 5 Sedan 2019 Chevrolet Cruze \$18,870 28 38 5 SUV 2021 Chevrolet Tahoe \$55,095 16 20 9 Coupe 2020 Chevrolet Camaro \$26,495 20 30 4 SUV 2020 Ford Edge \$32,345 21 29 5 SUV 2020 Ford Explorer \$34,010 21 28 7 Luxury 2020 Porsche 911 \$111,550 18 24 4 Van/Minivan 2020 Toyota Sienna \$32,760 19 26 7 Sedan 2020 Dodge Charger \$31,390 19 30 5 Van/Minivan 2019 Dodge Grand Caravan \$27,595 17 25 7 Qualitative Qualitative Qualitative Qualitative Quantitative Quantitative Quantitative Quantitative