Iris Classification using Logistic Regression

Logistic regression is the appropriate regression approach when the dependent variable is categorical, most commonly dichotomous (binary). Like other regression analyses, it is a predictive analysis: it describes and explains the relationship between one dependent variable and one or more independent variables that are nominal, ordinal, interval, or ratio-level. For a target with more than two classes, such as the three Iris species used here, the same idea extends to multinomial logistic regression.
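To make the idea concrete, here is a minimal sketch of how a binary logistic regression turns a weighted sum of features into a probability. The weights, bias, and feature values are hypothetical and chosen only for illustration; a fitted model would learn them from data.

```python
import math

def sigmoid(z):
    """Map any real-valued score z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Probability of the positive class for one sample (binary case)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights, bias, and feature values (illustration only)
p = predict_probability([1.5, 0.2], weights=[2.0, -1.0], bias=-2.0)
print(round(p, 3))
```

A score of zero maps to a probability of exactly 0.5, which is the usual decision boundary between the two classes.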

The Iris flower data set is a multivariate data set introduced in 1936 by the British statistician and biologist Ronald Fisher in his paper "The use of multiple measurements in taxonomic problems". It is also commonly known as Anderson’s Iris data set, because Edgar Anderson collected the data to quantify the morphologic variation of three related species of Iris flowers. The data set contains 50 samples from each of the three species (Iris setosa, Iris virginica, and Iris versicolor). For each sample, the length and width of the sepals and petals were measured in centimeters.
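The structure described above can be checked quickly in code. This sketch assumes the copy of the data set bundled with scikit-learn (load_iris), not the Kaggle CSV used in the rest of this exercise; the two may differ in a few values, but both should have the same shape.

```python
from collections import Counter
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy stands in for the Kaggle file
iris = load_iris()

# 150 samples in total, each with 4 measurements
print(iris.data.shape)

# Count how many samples belong to each species
counts = Counter(str(iris.target_names[label]) for label in iris.target)
print(counts)
```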

This dataset serves as a common test case for many statistical classification techniques in machine learning.

Download Dataset: https://www.kaggle.com/datasets/arshid/iris-flower-dataset?select=IRIS.csv

Program Used:

Google Colab (Python), or alternatively Jupyter Notebook

Procedures

1. Open Google Colaboratory (Colab)

2. Click “New notebook”

3. After creating the new notebook, enter the code provided below for reference:

STEPS OF CODING:

a. Import the important packages

This exercise requires the pandas package for loading the data, the matplotlib package for plotting, and scikit-learn for creating the logistic regression model. Import all of the required packages and relevant modules for these tasks.

import pandas as pd 
import matplotlib.pyplot as plt 
from sklearn.linear_model import LogisticRegression

b. Load the data

Note: Upload your dataset in the Files section first. Make sure the file name of your dataset matches the name used in the code.

data = pd.read_csv('IRIS.csv')
data.head()
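Before going further, it can help to confirm that the loaded file has the columns used later in this exercise and no missing values. The helper check_iris_frame below is a hypothetical name introduced for this sketch, and the inline CSV text is only a three-row stand-in for IRIS.csv so that the example runs on its own.

```python
import io
import pandas as pd

# Column names used later in this exercise
EXPECTED_COLUMNS = ['sepal_length', 'sepal_width', 'petal_length',
                    'petal_width', 'species']

def check_iris_frame(df):
    """Return a list of problems found in the DataFrame (empty if none)."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f'missing columns: {missing}')
    if df.isnull().values.any():
        problems.append('missing values present')
    return problems

# Stand-in for IRIS.csv: one row per species, for illustration only
sample_csv = io.StringIO(
    'sepal_length,sepal_width,petal_length,petal_width,species\n'
    '5.1,3.5,1.4,0.2,Iris-setosa\n'
    '7.0,3.2,4.7,1.4,Iris-versicolor\n'
    '6.3,3.3,6.0,2.5,Iris-virginica\n'
)
sample = pd.read_csv(sample_csv)
print(check_iris_frame(sample))
```

In the notebook, the same helper could be called as check_iris_frame(data) right after loading the real file.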

c. Feature Engineering

We must choose the features that will produce the most powerful classification model. To do so, plot pairs of features against the assigned species, for example sepal width vs. petal length, and examine the charts visually for patterns that suggest separation between the species.

markers = {
    'Iris-setosa' : {'marker' : 'x'}, 
    'Iris-versicolor' : {'marker' : '*'}, 
    'Iris-virginica' : {'marker' : 'o'},
}
plt.figure(figsize=(10, 7))

# Draw one scatter series per species, each with its own marker
for name, group in data.groupby('species'):
    plt.scatter(group['sepal_width'], group['petal_length'],
                label=name, marker=markers[name]['marker'])

# Title, axis labels, and legend only need to be set once, after the loop
plt.title('Species Classification: Sepal Width vs. Petal Length')
plt.xlabel('Sepal Width (cm)')
plt.ylabel('Petal Length (cm)')
plt.legend();

Output: scatter plot of sepal width against petal length, with one marker style per species.

Select the features by writing the column names in the list below:

selected_features = ['sepal_width', 'petal_length'] 
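A visual check can be backed up with a simple numeric summary of each feature per species. The sketch below assumes scikit-learn's bundled copy of the data set, relabelled to match the column and species names of the CSV; with the notebook's own data DataFrame, only the groupby line is needed.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: rebuild a DataFrame shaped like IRIS.csv from the bundled copy
iris = load_iris()
frame = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width',
                                         'petal_length', 'petal_width'])
frame['species'] = [f'Iris-{iris.target_names[t]}' for t in iris.target]

# Smallest and largest value of each selected feature within every species
summary = frame.groupby('species')[['sepal_width', 'petal_length']].agg(['min', 'max'])
print(summary)
```

If the range of a feature for one species does not overlap the ranges for the others, that feature alone separates the species, which is what the plot suggests for the petal length of Iris-setosa.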

d. Constructing Logistic Regression

Before we can construct the model we must first convert the species values into labels that can be used within the model. Replace:

  1. The species string Iris-setosa with the value 0
  2. The species string Iris-versicolor with the value 1
  3. The species string Iris-virginica with the value 2

species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
# The position of each name in the list is used as its numeric label
output = [species.index(spec) for spec in data.species]
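The same replacement can also be written with an explicit name-to-label dictionary and pandas' Series.map, which some readers find easier to follow. The short sample Series here is a stand-in for data.species, and the name label_of is introduced only for this sketch.

```python
import pandas as pd

species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

# Build {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
label_of = {name: index for index, name in enumerate(species)}

# Stand-in for data.species, for illustration only
sample = pd.Series(['Iris-virginica', 'Iris-setosa',
                    'Iris-versicolor', 'Iris-setosa'])

output = sample.map(label_of).tolist()
print(output)  # [2, 0, 1, 0]
```

Both approaches produce the same labels; a name missing from the dictionary becomes NaN with map, whereas list.index raises an error.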

Create the model using the selected_features and the assigned species labels:

# multi_class is omitted: 'auto' was already its default, and recent
# scikit-learn releases deprecate the parameter
model = LogisticRegression(solver='lbfgs')
model.fit(data[selected_features], output)
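Once fitted, the model can classify a new measurement. This is a self-contained sketch that assumes scikit-learn's bundled copy of the data in place of IRIS.csv; the new_flower values and the max_iter=200 setting are illustrative choices, not part of the original exercise.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
# Columns 1 and 2 of the bundled array are sepal width and petal length (cm)
X = iris.data[:, [1, 2]]
y = iris.target  # 0 = setosa, 1 = versicolor, 2 = virginica

model = LogisticRegression(solver='lbfgs', max_iter=200)
model.fit(X, y)

# Hypothetical new flower: sepal width 3.5 cm, petal length 1.4 cm
new_flower = [[3.5, 1.4]]
predicted_label = int(model.predict(new_flower)[0])
probabilities = model.predict_proba(new_flower)[0]
print(iris.target_names[predicted_label], probabilities.round(3))
```

predict returns the most likely label, while predict_proba returns one probability per species, in label order.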

Compute the accuracy of the model against the training set:

model.score(data[selected_features], output) 

Output: 0.9533333333333334
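Accuracy on the training set tends to be optimistic, because the model has already seen those samples. A common alternative is to hold out part of the data for testing. The sketch below assumes scikit-learn's bundled copy of the data; the 30% test size, random_state=0, and max_iter=200 are arbitrary illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:, [1, 2]]  # sepal width, petal length
y = iris.target

# Keep 30% of the samples aside, preserving the species proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression(solver='lbfgs', max_iter=200)
model.fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f'train: {train_accuracy:.3f}  test: {test_accuracy:.3f}')
```

The test accuracy is the better estimate of how the model would perform on flowers it has not seen.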

Construct another model using your second choice of selected_features and compare the performance:

selected_features = ['sepal_length', 'petal_width']
model.fit(data[selected_features], output)
model.score(data[selected_features], output)

Output: 0.96

Construct another model using all available information and compare the performance:

selected_features = ['sepal_width', 'sepal_length', 'petal_width', 'petal_length']
model.fit(data[selected_features], output)
model.score(data[selected_features], output)

Output: 0.9733333333333334

Finally, construct a model using only the petal measurements and compare the performance:

selected_features = ['petal_width', 'petal_length']
model.fit(data[selected_features], output)
model.score(data[selected_features], output)

Output: 0.9666666666666667
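The four comparisons above can also be run in a single loop. This sketch again assumes scikit-learn's bundled copy of the data, and it swaps the training-set score for 5-fold cross-validation so that each feature set is scored on samples the model was not fitted on. The column_index mapping, cv=5, and max_iter=500 are choices made for this example, so the numbers will not match the training accuracies listed above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
# Position of each named feature in the bundled array
column_index = {'sepal_length': 0, 'sepal_width': 1,
                'petal_length': 2, 'petal_width': 3}

feature_sets = [
    ['sepal_width', 'petal_length'],
    ['sepal_length', 'petal_width'],
    ['sepal_width', 'sepal_length', 'petal_width', 'petal_length'],
    ['petal_width', 'petal_length'],
]

results = {}
for features in feature_sets:
    X = iris.data[:, [column_index[f] for f in features]]
    model = LogisticRegression(solver='lbfgs', max_iter=500)
    # Mean accuracy over 5 stratified folds
    scores = cross_val_score(model, X, iris.target, cv=5)
    results[', '.join(features)] = scores.mean()

for name, mean_accuracy in results.items():
    print(f'{mean_accuracy:.3f}  {name}')
```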

References:

https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/what-is-logistic-regression/
https://www.kaggle.com/datasets/arshid/iris-flower-dataset
