
Logistic regression with Python and exploratory data analysis

A similar concept was presented in the 2nd blog post as “Linear Model for Classification” and is expanded in this blog post.

Logistic regression is a model for regression analysis where the dependent variable is categorical. That is, we can use this model for classification. Another advantage of logistic regression is that it also provides the probability that a sample belongs to the selected class.

Useful examples

A typical application of logistic regression can be found in the financial world, where age, salary, occupation and many other parameters are used to calculate the probability of whether a customer is creditworthy or not.

Formal definition

As with linear regression, one looks for suitable coefficients (weights) w so that the model can be described with the following formula:

z = w_0 + w_1·x_1 + w_2·x_2 + … + w_n·x_n

This provides a number that can be arbitrarily large or small and has nothing to do with probabilities. To get a probability we need a number between 0 and 1, which we calculate with the logistic function.

Logistic function

import numpy as np
import matplotlib.pyplot as plt

def logistic(x):
    return 1 / (1 + np.exp(-x))

# plot the logistic function
t = np.arange(-6, 6, 0.05)
plt.plot(t, logistic(t))
plt.yticks([0, 0.2, 0.5, 0.8, 1])
plt.grid(linewidth=1)
plt.title("Logistic function")
plt.xlabel("Variable 'z'")
plt.ylabel("Variable 'y'")
plt.show()

The logistic function is defined as y = 1 / (1 + e^(-z)), where e is Euler's number (≈ 2.71828…). This number is the base of the natural logarithm.

It can be seen that the output of the logistic function is constrained between 0 and 1. It is also interesting that with z = 0 the output value is exactly 0.5, since 1 / (1 + e^0) = 1/2.

Now you can replace z in the logistic function with the linear model w_0 + w_1·x_1 + … + w_n·x_n and you get the logistic regression model, which outputs the probability that a sample x belongs to class 1.
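
A minimal sketch that combines the two steps by hand, reusing the logistic function defined above; the weights and the sample are arbitrary values chosen purely for illustration, not fitted coefficients:

# arbitrary illustrative weights (w0 is the intercept) and one sample x
w0 = 0.3
w = np.array([0.5, -1.2, 0.8])
x = np.array([1.0, 0.5, 2.0])

z = w0 + np.dot(w, x)   # linear part of the model
p = logistic(z)         # probability that x belongs to class 1
print("z = %.3f, P(y=1|x) = %.3f" % (z, p))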

Adjusting the weights w is mathematically complex and beyond the scope of this blog post. Instead, the implementation and application of logistic regression are shown here.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

data = load_iris()

# generate training and test data
train_x, test_x, train_y, test_y = train_test_split(data.data, data.target)

# plot the data set (columns 0/1 are the sepal measurements, 2/3 the petal measurements)
fig, (ax1, ax2) = plt.subplots(ncols=2, sharex=True, sharey=True)
ax1.scatter(data.data[:, 0], data.data[:, 1], c=data.target)
ax1.set(title="Iris", ylabel='Sepal width', xlabel='Sepal length')
ax2.scatter(data.data[:, 2], data.data[:, 3], c=data.target)
ax2.set(title="Iris", ylabel='Petal width', xlabel='Petal length')
plt.setp(ax2.get_yticklabels(), visible=False)
plt.show()

Implementation

You can create a very simple model with scikit-learn's LogisticRegression class. After it has been fitted, you can classify the test data and determine the accuracy.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

clf = LogisticRegression()
clf.fit(train_x, train_y)

# calculate the classification quality
print("Accuracy: %f" % accuracy_score(test_y, clf.predict(test_x)))

Accuracy: 0.947368

However, this method gives us little insight into how logistic regression works and how it is used.

The simplest form of this model is binary logistic regression, which can only distinguish between two classes. If there are several classes, a classifier must be created for each class that only checks whether a sample belongs to that class or not. This is also called one-vs.-all (one-vs.-rest) classification. LogisticRegression does this implicitly when multi_class='ovr' is passed as a constructor parameter.
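
For the Iris model fitted above this can be seen directly: predict_proba returns one probability per class for each sample. A minimal sketch:

# probabilities of the first three test samples for each of the three Iris classes
probs = clf.predict_proba(test_x[:3])
print(probs)               # one row per sample, one column per class
print(probs.sum(axis=1))   # each row sums to 1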

Scikit-learn contains many optimizations that can provide better accuracy and performance in certain cases. But they are more suitable for advanced users. You should check out the documentation to learn more about it.
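
As a small example, such options are simply passed to the constructor; the concrete values below (regularization strength C, solver, iteration limit) are arbitrary illustrations, not recommendations:

# example constructor parameters; the values are chosen arbitrarily for illustration
clf_tuned = LogisticRegression(C=0.5, solver='liblinear', max_iter=200)
clf_tuned.fit(train_x, train_y)
print("Accuracy: %f" % accuracy_score(test_y, clf_tuned.predict(test_x)))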

Logistic regression for data analysis

One of the uses of the regression model is to analyze data and examine the relationships between the independent and dependent variables. The pandas library is required for the example shown here; it facilitates both manipulating tables and displaying them.

The dataset comes from the General Assembly Data Science course in San Francisco. The independent variables describe the performance of students from the United States.

Link to the dataset

import pandas as pd

df = pd.read_csv("https://github.com/ga-students/sf-dat-21/raw/master/unit-projects/dataset/admissions.csv")
df.head()
   admit    gre   gpa  prestige
0      0  380.0  3.61       3.0
1      1  660.0  3.67       3.0
2      1  800.0  4.00       1.0
3      1  640.0  3.19       4.0
4      0  520.0  2.93       4.0

Description of the data set

  • gre — numeric, between 220 and 800 — the result on the standardized Graduate Record Examinations (GRE) test; 800 is the highest score
  • gpa — numeric, between 0.0 and 4.0 — corresponds roughly to a grade point average; 4.0 is the highest
  • prestige — ordinal-categorical, 1, 2, 3 or 4 — the prestige of the applicant's school; 1 is the highest
  • admit — the dependent variable; 0 means the student was not admitted, 1 means the student was admitted

Another useful function is describe(), which provides statistical information about each column in the table.

df.describe()

            admit         gre        gpa    prestige
count  400.000000  398.000000  398.00000  399.000000
mean     0.317500  588.040201    3.39093    2.486216
std      0.466087  115.628513    0.38063    0.945333
min      0.000000  220.000000    2.26000    1.000000
25%      0.000000  520.000000    3.13000    2.000000
50%      0.000000  580.000000    3.39500    2.000000
75%      1.000000  660.000000    3.67000    3.000000
max      1.000000  800.000000    4.00000    4.000000

Row names

  • count — the number of samples for which the feature is present (i.e. not missing).
  • mean — the mean of the values of each feature.
  • std — the standard deviation; it indicates how strongly the values are spread around the mean.
  • min — the lowest value of each feature.
  • 25% and 75% — the first and third quartiles: 25% and 75% of the values lie below them, respectively.
  • 50% — the median.
  • max — the highest value of each feature.

Another interesting function is crosstab from pandas. It offers the possibility of carrying out a multidimensional frequency analysis. The following code shows the number of admitted and rejected students for each prestige class:

pd.crosstab(df['admit'], df['prestige'], rownames=['admit'])
prestige  1.0  2.0  3.0  4.0
admit
0          28   97   93   55
1          33   53   28   12

The histogram function hist() serves a similar purpose. Each bar in the chart shows the number of samples that have a particular value (e.g. how many students have a gpa of around 3.5).

df.hist()
plt.show()

The feature prestige is ordinal-categorical. That means the values can be compared: 1 is better than 2, 2 is better than 3, and so on. Even so, the difference between the values cannot be quantified. In order to better determine the relationships to the dependent variable, it makes more sense to encode the feature as several substitute variables (dummy variables). The pandas function get_dummies facilitates this process.

dummies = pd.get_dummies(df['prestige'], prefix='prestige')
dummies.head()
   prestige_1.0  prestige_2.0  prestige_3.0  prestige_4.0
0             0             0             1             0
1             0             0             1             0
2             1             0             0             0
3             0             0             0             1
4             0             0             0             1
# create a clean table with only the required features
data = df[['admit', 'gpa', 'gre']].join(dummies.loc[:, 'prestige_2.0':])

# show rows with missing data
data[data.isnull().any(axis=1)]
     admit   gpa    gre  prestige_2.0  prestige_3.0  prestige_4.0
187      0   NaN    NaN             1             0             0
212      0  2.87    NaN             1             0             0
236      1   NaN  660.0             0             0             0

Missing data is a common problem in data science and there are different ways of dealing with it. The easiest way is simply to delete the rows with missing features. And that is exactly what we do here, because there are only 3 problematic rows, less than 1% of the total data set.

Other ways of dealing with missing data depend on the type of data. Missing numeric characteristics can be replaced by the mean or median of the other samples, and missing categorical data can be replaced by the class most frequently present in the data set.

You can also fit a kNN or regression model to predict the missing values, as sketched below.
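
A minimal sketch of these alternatives applied to the admissions table; we do not use it in this post and simply drop the rows instead:

# replace missing numeric values with the mean or median of the column
df_filled = df.copy()
df_filled['gre'] = df_filled['gre'].fillna(df_filled['gre'].mean())
df_filled['gpa'] = df_filled['gpa'].fillna(df_filled['gpa'].median())

# replace missing categorical values with the most frequent class (mode)
df_filled['prestige'] = df_filled['prestige'].fillna(df_filled['prestige'].mode()[0])

# model-based imputation, e.g. with the kNN imputer from newer scikit-learn versions
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
df_knn = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)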

Overall, dealing with missing data is a complex topic, so the following article is recommended for more information. (Working with Missing Data - Towards Data Science)

# remove all rows that contain a NaN
data = data.dropna(axis=0, how='any')
data.head()
   admit   gpa    gre  prestige_2.0  prestige_3.0  prestige_4.0
0      0  3.61  380.0             0             1             0
1      1  3.67  660.0             0             1             0
2      1  4.00  800.0             0             0             0
3      1  3.19  640.0             0             0             1
4      0  2.93  520.0             0             0             1

It should be noted that prestige_1 is not in the table. We want to use it as a baseline for the analysis and not take it into account when fitting the model. The reason for this is multicollinearity, which can arise when several strongly correlated variables are used together; the consequence is that the analysis of the regression coefficients becomes imprecise.
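
The same effect can also be achieved directly when creating the dummy variables: get_dummies accepts a drop_first parameter that removes the first category, which then serves as the implicit baseline. A short sketch:

# drop_first=True removes prestige_1.0, which becomes the implicit baseline
pd.get_dummies(df['prestige'], prefix='prestige', drop_first=True).head()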

clf = LogisticRegression()

# generate training and test data; np.ravel converts the admit column
# from a column vector into a 1d array
train_x, test_x, train_y, test_y = train_test_split(
    data.loc[:, 'gpa':], np.ravel(data[['admit']]))

clf.fit(train_x, train_y)

# analysis of the coefficients and odds
coef = pd.DataFrame({'features': train_x.columns,
                     'coef': clf.coef_[0],
                     'odds_ratio': np.exp(clf.coef_[0])})
coef[['features', 'coef', 'odds_ratio']]
       features      coef  odds_ratio
0           gpa -0.050584    0.950674
1           gre  0.001942    1.001944
2  prestige_2.0 -0.482162    0.617447
3  prestige_3.0 -1.084279    0.338145
4  prestige_4.0 -1.463294    0.231472

We fitted the model and analyzed the coefficients. The column coef contains the weights of the model. It can be seen that there is a strong negative dependency if the student comes from a prestige group 4 school.

The column odds_ratio shows by which factor the odds of admission change when the feature increases by one unit. If the applicant's school has a prestige of 2 instead of 1, the odds of admission are roughly 40% lower (odds ratio ≈ 0.62).
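
To make the interpretation concrete, the following sketch compares two otherwise identical applicants (gpa 3.5, gre 600 — values chosen arbitrarily) who differ only in the prestige of their school; the exact numbers depend on the random train/test split:

# two identical applicants, one from a prestige-1 school (the baseline),
# one from a prestige-2 school
applicants = pd.DataFrame({'gpa': [3.5, 3.5], 'gre': [600.0, 600.0],
                           'prestige_2.0': [0, 1],
                           'prestige_3.0': [0, 0],
                           'prestige_4.0': [0, 0]})
p1, p2 = clf.predict_proba(applicants)[:, 1]

# ratio of the two odds; corresponds to exp(coef) of prestige_2.0
print("Odds ratio prestige 2 vs. 1: %.3f" % ((p2 / (1 - p2)) / (p1 / (1 - p1))))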

Finally, we calculate the quality of classification:

# calculate the classification quality on the test data
from sklearn.metrics import accuracy_score
accuracy_score(test_y, clf.predict(test_x))

Conclusion

Logistic regression is an excellent classification algorithm that can also be used for detailed analysis. The algorithm is not as powerful as, for example, support vector machines, but it already provides important insights into relationships in the data set.

Resources

  • Git repository with the complete code link