What decides the profit? A decision tree in Python
Steps we are going to follow
· The code starts by importing the necessary libraries: pandas, which handles data manipulation, and scikit-learn (sklearn), which provides the machine learning algorithms.
· The dataset is read from an Excel file with pandas and stored in the variable "data". It contains information on turnover, return on capital employed (roce), liquidity ratio, number of employees, and profit after tax (pat).
· Some preprocessing is performed: the dataset is transposed, the column names are set from the first row, and the "pat" column is converted to numeric values.
· The dataset is divided into input features (X) and the target variable (y). X contains the columns "turnover", "roce", "liqudityratio" (as spelled in the data), and "noemployees", while y contains the "pat" column.
· The dataset is split into training and testing sets with the train_test_split function from sklearn. The test set is 20% of the data, and a fixed random seed (random_state) makes the split reproducible.
· An instance of DecisionTreeClassifier is created and assigned to the variable "classifier". This decision tree model makes predictions from the input features.
· The classifier is fitted to the training data, meaning it learns the patterns and relationships between the input features (X_train) and the target variable (y_train).
· The trained classifier then predicts the profit values for the test set (X_test), and the predictions are stored in the variable "y_pred".
· The classification_report function from sklearn.metrics generates a report evaluating the classifier's performance, with precision, recall, and F1-score for each class (profit label) in the test set.
· Finally, the report is printed to the console, showing precision, recall, F1-score, and support for each profit label, as well as the overall accuracy.
In summary, the code uses a decision tree algorithm to predict profit from the given input features. It trains the model on a training dataset, makes predictions on a test dataset, and evaluates its performance with classification metrics.
Python Code for the Work
In [1]:
import pandas as pd
In [2]:
from sklearn.tree import DecisionTreeClassifier
In [3]:
from sklearn.model_selection import train_test_split
In [4]:
from sklearn.metrics import classification_report
In [5]:
data = pd.read_excel(r"C:\Users\Dell\Desktop\Book1.xlsx")
In [8]:
data = data.T                  # transpose: companies become rows, indicators become columns
data.columns = data.iloc[0]    # the first row holds the column names
data = data[1:]                # drop the header row from the data itself
In [16]:
data['pat'] = pd.to_numeric(data['pat'], errors='coerce')  # non-numeric entries become NaN
In [17]:
X = data[['turnover', 'roce', 'liqudityratio', 'noemployees']]
In [18]:
y = data['pat']
In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
In [20]:
classifier = DecisionTreeClassifier()
In [ ]:
print(data['pat'].unique())   # sanity check: the distinct target values
print(data['pat'].dtype)      # and their dtype after the numeric conversion
In [26]:
classifier.fit(X_train, y_train)
Out[26]: DecisionTreeClassifier()
In [35]:
from sklearn import tree
import graphviz
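# Sketch (not part of the original run): the tree/graphviz imports above can
# render the classifier fitted in In [26]. Feature names are taken from the
# columns of X; "profit_tree" is a hypothetical output filename.
dot_data = tree.export_graphviz(classifier, out_file=None,
                                feature_names=list(X.columns), filled=True)
graphviz.Source(dot_data).render("profit_tree")  # writes profit_tree.pdf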
In [27]:
y_pred = classifier.predict(X_test)
In [29]:
# zero_division=0 reports 0.0 (instead of raising a warning) when a metric is undefined
report = classification_report(y_test, y_pred, zero_division=0)
In [30]:
print(report)
              precision    recall  f1-score   support

  -2490380.0       0.00      0.00      0.00       0.0
  -2448564.0       0.00      0.00      0.00       1.0
  -1906132.0       0.00      0.00      0.00       0.0
  -1571463.0       0.00      0.00      0.00       0.0
     60751.0       0.00      0.00      0.00       1.0
    373589.0       0.00      0.00      0.00       1.0
   3000785.0       0.00      0.00      0.00       1.0

    accuracy                           0.00       4.0
   macro avg       0.00      0.00      0.00       4.0
weighted avg       0.00      0.00      0.00       4.0
Interpretation
The classification report provides an assessment of the
performance of the decision tree classifier. Let's interpret the different
metrics:
Precision: Precision evaluates the accuracy of positive
predictions. In this case, all the listed labels have a precision of 0.00,
indicating that there were no correct positive predictions for these labels.
Recall: Recall measures the proportion of actual positives
that were correctly identified. Similar to precision, the recall for all the
listed labels is 0.00, indicating that there were no true positive predictions
for these labels.
F1-score: The F1-score is the harmonic mean of precision
and recall: F1 = 2 × (precision × recall) / (precision + recall). Since both
precision and recall are 0.00, the F1-score for every listed label is also 0.00.
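As a quick illustration of how the three metrics relate, they can be computed
directly with sklearn on a small made-up example (the arrays below are
illustrative only, not the chapter's data):

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1]   # toy ground-truth labels
y_pred = [1, 0, 0, 1]   # toy predictions
print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 2/2 = 1.00
print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 2/3 ≈ 0.67
print(f1_score(y_true, y_pred))         # 2PR/(P+R) = 0.80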
Support: The support is the number of occurrences of each
label in the test data. Labels such as -2490380.0, -1906132.0, and -1571463.0
have a support of 0: they never occur in the test set and appear in the report
only because the classifier predicted them. The labels -2448564.0, 60751.0,
373589.0, and 3000785.0 each occur once (support of 1), accounting for the four
test rows.
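If the zero-support rows are distracting, the report can be limited to the
labels that actually occur in the test set; a minimal sketch, assuming the
y_test and y_pred variables from the notebook above:

import numpy as np
from sklearn.metrics import classification_report

# restrict the report to labels present in the ground truth
print(classification_report(y_test, y_pred,
                            labels=np.unique(y_test), zero_division=0))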
Accuracy: The overall accuracy is the fraction of correct
predictions. Here it is 0.00: none of the four test rows was predicted
correctly (0/4 = 0.00).
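The same number can be reproduced directly; this again assumes the notebook's
y_test and y_pred:

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))  # 0 correct out of 4 rows -> 0.0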
Macro avg: This row gives the unweighted average of
precision, recall, and F1-score across all labels. Since every individual
metric is 0.00, the macro average is also 0.00.
Weighted avg: Similar to the macro average, but each label's
metrics are weighted by its support. Because every per-label metric is 0.00,
the weighted averages are 0.00 as well.
Overall, the classification report shows that the decision
tree classifier did not make a single correct prediction, and this is not
surprising: "pat" is a continuous monetary amount, so nearly every row carries
its own unique label, and a classifier can only ever predict label values it
saw during training. With a test set of just four rows, an exact match is
extremely unlikely. Framing the problem as regression, or binning "pat" into a
small number of classes, together with more data, would give the tree a
realistic chance of performing well.
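Both alternatives are sketched below. This code is not part of the original
notebook; it assumes the same "data" DataFrame and column names (including the
'liqudityratio' spelling used there):

from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.model_selection import train_test_split

data = data.dropna(subset=['pat'])  # drop rows where 'pat' could not be parsed
X = data[['turnover', 'roce', 'liqudityratio', 'noemployees']]

# Option 1: predict the continuous amount with a regression tree
y = data['pat']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)
print(regressor.score(X_test, y_test))   # R^2 on the held-out rows

# Option 2: bin "pat" into loss vs. profit and classify the two classes
y2 = (data['pat'] > 0).astype(int)       # 1 = profit, 0 = loss
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y2, test_size=0.2,
                                                        random_state=42)
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train2, y_train2)
print(clf.score(X_test2, y_test2))       # accuracy on the binary labels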