Tuesday 4 July 2023

 

What decides profit? A decision tree in Python

 

Steps we will follow

· The code starts by importing the necessary libraries: pandas for data manipulation and scikit-learn (sklearn) for the machine learning algorithms.

· The dataset is read from an Excel file with pandas and stored in the variable "data". It contains information on turnover, return on capital employed (ROCE), liquidity ratio, number of employees, and profit after tax (pat).

· Some preprocessing is performed: the dataset is transposed, column names are set from the first row, and the "pat" column is converted to numeric values.

· The dataset is divided into input features (X) and the target variable (y). X contains the columns "turnover", "roce", "liqudityratio" (as spelled in the spreadsheet), and "noemployees", while y contains the "pat" column.

· The dataset is split into training and testing sets with the train_test_split function from sklearn. The test set is 20% of the data, and a fixed random seed (random_state) is used for reproducibility.

· An instance of DecisionTreeClassifier is created and assigned to the variable "classifier". This decision tree model makes predictions from the input features; note that, as a classifier, it treats each distinct value of "pat" as a separate class label.

· The classifier is fitted to the training data, learning patterns that relate the input features (X_train) to the target variable (y_train).

· The code then predicts the profit values for the test set (X_test) with the trained classifier and stores the predictions in the variable "y_pred".

· The classification_report function from sklearn.metrics generates a report evaluating the classifier's performance, including precision, recall, and F1-score for each class (profit label) in the test set.

· Finally, the report is printed to the console, showing precision, recall, F1-score, and support for each profit label, as well as the overall accuracy.

· In summary, the code uses a decision tree algorithm to predict profit from the given input features: it trains the model on a training set, makes predictions on a test set, and evaluates performance with classification metrics.

 

Python Code for the Work

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load the dataset from the local Excel file
data = pd.read_excel(r"C:\Users\Dell\Desktop\Book1.xlsx")

# The spreadsheet stores variables in rows, so transpose it,
# promote the first row to column headers, and drop that row
data = data.T
data.columns = data.iloc[0]
data = data[1:]

# Make the target numeric; non-numeric entries become NaN
data['pat'] = pd.to_numeric(data['pat'], errors='coerce')

# Quick sanity check on the target before modelling
print(data['pat'].unique())
print(data['pat'].dtype)

# Input features and target ('liqudityratio' matches the spelling in the data)
X = data[['turnover', 'roce', 'liqudityratio', 'noemployees']]
y = data['pat']

# 80/20 train/test split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the decision tree classifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)

# Imports for visualizing the fitted tree
from sklearn import tree
import graphviz
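These two imports go unused in the original notebook. As a minimal sketch (not part of the original code), the fitted tree could be exported and rendered as follows, assuming "classifier" has been fitted as above; the output name "profit_tree" is purely illustrative:

# Sketch: export the fitted tree to Graphviz format and render it to PDF
dot_data = tree.export_graphviz(classifier, out_file=None,
                                feature_names=list(X.columns), filled=True)
graph = graphviz.Source(dot_data)
graph.render("profit_tree")  # writes profit_tree.pdf next to the notebook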

# Predict profit labels for the held-out test set
y_pred = classifier.predict(X_test)

# Evaluate; zero_division=0 reports 0 instead of emitting warnings
# for labels that were never predicted
report = classification_report(y_test, y_pred, zero_division=0)
print(report)

              precision    recall  f1-score   support

 

  -2490380.0       0.00      0.00      0.00       0.0

  -2448564.0       0.00      0.00      0.00       1.0

  -1906132.0       0.00      0.00      0.00       0.0

  -1571463.0       0.00      0.00      0.00       0.0

     60751.0       0.00      0.00      0.00       1.0

    373589.0       0.00      0.00      0.00       1.0

   3000785.0       0.00      0.00      0.00       1.0

 

    accuracy                           0.00       4.0

   macro avg       0.00      0.00      0.00       4.0

weighted avg       0.00      0.00      0.00       4.0

 

Interpretation

The classification report assesses the performance of the decision tree classifier. Let's interpret each metric:

Precision: Precision evaluates the accuracy of positive predictions. In this case, all the listed labels have a precision of 0.00, indicating that there were no correct positive predictions for these labels.

Recall: Recall measures the proportion of actual positives that were correctly identified. Similar to precision, the recall for all the listed labels is 0.00, indicating that there were no true positive predictions for these labels.

F1-score: The F1-score combines precision and recall into a single measure. Since both precision and recall are 0.00, the F1-score for all the listed labels is also 0.00.
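For reference, these metrics are computed from true positives (TP), false positives (FP), and false negatives (FN):

precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 * precision * recall / (precision + recall)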

Support: The support is the number of occurrences of each label in the test data. Labels such as -2490380.0, -1906132.0, and -1571463.0 have a support of 0: they do not appear in the test data at all, and are listed only because the classifier predicted them (classification_report covers the union of true and predicted labels). Labels -2448564.0, 60751.0, 373589.0, and 3000785.0 each occur once (support of 1).

Accuracy: The overall accuracy of the classifier is reported as 0.00, indicating that none of the predictions were correct.

Macro avg: This row presents the average precision, recall, and F1-score across all labels. Since all the individual metrics are 0.00, the macro average is also 0.00.

Weighted avg: The weighted average computes the same metrics but weights each label by its support. Since every per-label metric is 0.00, the weighted average is 0.00 as well.

Overall, the classification report shows that the decision tree classifier made no correct predictions for the given labels. This is largely a consequence of the problem setup: "pat" is a continuous variable, so the classifier treats every distinct profit figure as its own class, and predicting an exact figure for unseen companies is extremely unlikely. Reframing the task as regression, or binning "pat" into a small number of profit bands, would give the tree a realistic target (see the sketch below); more data would also help, as the test set holds only four observations.
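As one possible fix, here is a minimal sketch (not part of the original notebook) that reuses the X_train/X_test split from above but treats "pat" as a continuous target with a regression tree; mean_absolute_error is an illustrative choice of metric:

# Sketch: regression instead of exact-value classification
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

regressor = DecisionTreeRegressor(random_state=42)  # fixed seed for reproducibility
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
print(mean_absolute_error(y_test, y_pred))  # average error, in the units of pat

Alternatively, pd.cut could bin "pat" into a few loss/profit bands so that the existing classification setup becomes meaningful.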