Simple linear regression in R

How do I do a simple linear regression in R? Today's post, the first in the “Statistics” category, answers exactly this question. Since the R programming language was originally created for statistical analysis, we're in luck: many of the statistical functions are very easy to use and easy to remember.

Linear regression in R

Both simple and multiple linear regressions can easily be calculated in R with the lm function. The result is a statistical model from which we can read all sorts of information, e.g. coefficients, residuals, and predicted values. Let's start with the basics of linear regression and then look at how to calculate a regression in R.

Basics of linear regression

Linear regression is a statistical method in which a target variable (also: dependent variable, explained variable, regressand) is explained by one or more predictors (also: independent variables, explanatory variables, regressors). In the linear regression model the variables are metric; categorical variables can be made suitable using dummy coding. It is called a linear regression because the relationship between the dependent variable and the predictors is modeled by a linear function (a linear combination of the coefficients). Each predictor is weighted by its regression coefficient so that the error (the difference between the predicted value and the real value of the dependent variable) is as small as possible; more precisely, the sum of the squared deviations is minimized. Fortunately, we don't have to worry about the details, as the respective functions in R do all of this automatically. Let's turn to practice ...
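Put as a formula, this is the standard textbook form for one predictor:

    y = b0 + b1 * x + e

where the intercept b0 and the slope b1 are chosen so that the sum of the squared deviations Σ(y − ŷ)² is minimal, and e is the error term.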

Calculating a regression in R

Two variables are sufficient for a simple linear regression: our predictor and the dependent variable. Let's build a data frame with the appropriate vectors:
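A minimal sketch with made-up example values; the vector names x and y come from the post, while the data frame name df is an assumption carried through the remaining snippets:

    # Ten made-up observations (any two numeric vectors of equal length work)
    x <- 1:10
    y <- c(2.1, 3.9, 6.0, 7.8, 10.2, 12.1, 13.9, 16.2, 18.0, 19.8)

    # Combine the vectors into a data frame
    df <- data.frame(x, y)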

We have two numeric variables; we want to use x to predict the dependent variable y.

Create a linear model

In the first step we create a linear model (see the sketch below). The notation y ~ x is R's formula notation; in this case it says that y depends on x. Once we run the code, mdl is our regression object with all the important information.
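A sketch of the model call, assuming the data frame df from above:

    # Fit the linear model: y explained by x
    mdl <- lm(y ~ x, data = df)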

View model information

We can now display a summary of the regression (sketch below). At the top of the output we see the line with the formula again. Next we get a short five-number summary of the residuals (min, first quartile, median, third quartile, max). Below that are the coefficients: our predictors including the constant (intercept), their Bs, standard errors, t-values, and p-values. If there are asterisks at the end of a line, the corresponding predictor is significant at a certain alpha level. At the bottom of the summary there are overall statistics for the model, such as R², the F-statistic, and the p-value.
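A sketch, assuming the mdl object from above:

    # Show coefficients, residual summary, R², F-statistic and p-value
    summary(mdl)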

Extract model information

Particularly when a workflow needs to be automated, it can be helpful to extract model information from the lm object. Here is just a small example: we want to get information about the coefficient of x, namely the B and the associated p-value. The coefficient matrix can be called up from the model summary; then we can extract specific fields from this matrix:
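A sketch; summary(mdl)$coefficients is a standard way to get the coefficient matrix, with one row per term and columns for the estimate, standard error, t-value, and p-value:

    # The coefficient matrix of the model
    coefs <- summary(mdl)$coefficients

    # Extract B (the estimate) and the p-value for the predictor x
    B <- coefs["x", "Estimate"]
    p <- coefs["x", "Pr(>|t|)"]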

So we extract B and p and use them to assemble a small statement that we output with print:
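A sketch continuing the snippet above:

    # Assemble a small statement and print it
    print(paste0("B = ", round(B, 2), ", p = ", round(p, 3)))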

Plotting the regression analysis in R

R lets us illustrate the information we have obtained very nicely. We have various plots in our repertoire, which at the same time allow us to check the model assumptions.

Scatter plot between two numeric variables

First we create a typical scatter plot to get a first impression of the data (ideally this step is part of the exploratory analysis before performing the regression). Obviously we plot x on the x-axis and y on the y-axis, label the axes, give the plot a title, and use dark blue as the color for the points. With the parameter pch we select the type of point: the default is an open circle; I have chosen a filled dot here because it is easier to see. With another command we can then display our regression line. R makes it particularly easy for us here: we only have to pass the lm object to the function abline and the corresponding regression line is automatically plotted. A sketch of both steps:
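Assuming df and mdl from above (pch = 20 is the filled dot):

    # Scatter plot with axis labels, title, and dark blue points
    plot(df$x, df$y,
         xlab = "x", ylab = "y",
         main = "Simple linear regression",
         col = "darkblue", pch = 20)

    # Add the regression line from the lm object
    abline(mdl)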

An ornate regression plot

As a little extra, we can decorate our regression plot a bit so that the difference between model prediction and real value becomes more apparent. To do this, we write a loop with one iteration per point, which draws the corresponding line into the plot:
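A sketch, drawing a vertical segment from each observed point to the regression line:

    # For each observation, connect the observed y-value
    # with the fitted value from the model
    for (i in 1:nrow(df)) {
      lines(c(df$x[i], df$x[i]),
            c(df$y[i], fitted(mdl)[i]))
    }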

The function lines draws a line; it takes a vector with the x-coordinates of the points as the first argument and a vector with the y-coordinates as the second. This creates, for each point, a line between the observed y-value and the fitted value.

Plot of residuals against independent variables

Whenever we do a regression analysis, it is a good idea to look at the residuals. In this way we can identify outliers and generally check whether the assumptions for a regression hold. One possibility is to plot the residuals against the values of x. We can easily get the residuals of a regression model, so the plot looks like this:
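A sketch; resid() returns the model's residuals:

    # Residuals of the model
    res <- resid(mdl)

    # Plot the residuals against the predictor, with a reference line at 0
    plot(df$x, res, xlab = "x", ylab = "Residuals")
    abline(h = 0)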

If the residuals are unsystematically distributed around 0, "like a cloud of equal width", then the assumptions of linearity and homoscedasticity are met (i.e. the residuals do not spread more widely for, say, higher x-values than for lower ones).

Distribution of the residuals

We now check whether the residuals are normally distributed. We can do this simply with a histogram of the residuals (sketch below). Confirming the assumption of a normal distribution is obviously difficult here; a sample size of N = 10 should be viewed with caution anyway. Nevertheless, a look at the distribution of the residuals with the help of a histogram is always useful.
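A sketch:

    # Histogram of the residuals
    hist(resid(mdl), xlab = "Residuals", main = "Histogram of residuals")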

QQ plot of the residuals

Alternatively (or in addition), you can display a QQ plot in R. A look at it shows whether the residuals are normally distributed (or approximately so):
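A sketch; qqnorm draws the QQ plot and qqline adds the reference line:

    # QQ plot of the residuals with reference line
    qqnorm(resid(mdl))
    qqline(resid(mdl))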

If the points lie (approximately) on the line, the assumption of normally distributed residuals is met.

Summary

Knowing this, you should be able to do a simple linear regression in R. This essentially involves the lm function, the plots for the regression analysis, and the analysis of the residuals. In a future post I will cover multiple regression and further statistics, such as those that identify influential points. Until then, good luck!

So much for the basics of a regression in R. Do you have any further questions, or questions about other regression types? Just write me an email: [email protected]
Also stay up to date with the r-coding newsletter. You will receive information on new blog entries, as well as little tips and tricks about R. Register now: http://r-coding.de/newsletter.