This project was developed by Amanda Tartarotti Cardozo da Silva for the Advanced Machine Learning course in the 2025/2026 academic year of the Master in Computational Engineering and Intelligent Systems at the University of the Basque Country (EHU/UPV).
This project is structured into 5 key topics:
This dataset is designed to explore and predict Myers-Briggs Type Indicator (MBTI) personality types based on a combination of demographic factors, interest areas, and personality scores. It includes 100K+ samples, each representing an individual with various features that contribute to determining their MBTI type. The dataset uses pre-processed numerical scores, making it a tabular supervised classification problem.
📌 Want to take this personality test yourself before diving in? https://www.16personalities.com/free-personality-test
Let’s go back to the dataset description:
Age: A continuous variable representing the age of the individual.
Gender: A categorical variable indicating the gender of the individual. Possible values are ‘Male’ and ‘Female’.
Education: A binary variable. A value of 1 indicates the individual has at least a graduate-level education, and 0 indicates undergraduate, high-school, or no formal education.
Interest: A categorical variable representing the individual’s primary area of interest.
Introversion Score: A continuous variable ranging from 0 to 10, representing the individual’s tendency toward introversion versus extraversion. Higher scores indicate a greater tendency toward extraversion.
Sensing Score: A continuous variable ranging from 0 to 10, representing the individual’s preference for sensing versus intuition. Higher scores indicate a preference for sensing.
Thinking Score: A continuous variable ranging from 0 to 10, indicating the individual’s preference for thinking versus feeling. Higher scores indicate a preference for thinking.
Judging Score: A continuous variable ranging from 0 to 10, representing the individual’s preference for judging versus perceiving. Higher scores indicate a preference for judging.
Personality: The target variable, containing the individual’s MBTI personality type.
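To make the score semantics concrete, here is an illustrative midpoint rule mapping the four 0-10 scores to a four-letter code. This rule is purely hypothetical; the dataset’s actual labels need not follow it.

```r
# Illustrative only: a naive midpoint rule turning the four 0-10 scores into a
# four-letter MBTI-style code. This is NOT how the dataset's labels were built.
mbti_from_scores <- function(intro, sens, think, judge, mid = 5) {
  paste0(
    ifelse(intro > mid, "E", "I"),  # higher Introversion.Score = more extraverted
    ifelse(sens  > mid, "S", "N"),  # higher Sensing.Score = sensing over intuition
    ifelse(think > mid, "T", "F"),  # higher Thinking.Score = thinking over feeling
    ifelse(judge > mid, "J", "P")   # higher Judging.Score = judging over perceiving
  )
}

mbti_from_scores(7.2, 3.1, 8.0, 2.5)  # "ENTP"
```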
df <- read.csv("data.csv", stringsAsFactors = TRUE)
val_personality <- unique(df$Personality)
total_personality <- length(val_personality)
cat("There are:", total_personality, " possible personalities:\n")
## There are: 16 possible personalities:
print(val_personality)
## [1] ENTP INTP ESFP ENFJ ISFP ISFJ ESTJ INFP ESTP ENFP INTJ ESFJ ISTJ INFJ ISTP
## [16] ENTJ
## 16 Levels: ENFJ ENFP ENTJ ENTP ESFJ ESFP ESTJ ESTP INFJ INFP INTJ INTP ... ISTP
Evaluation metrics that will be used for this project:
Nature of the Problem: this is a multiclass supervised classification scenario, since there are more than two possible categories and no intrinsic order among them. The main challenge of this analysis lies in the potential overlap of traits between similar personality types.
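As a minimal sketch of how these metrics are computed in R, caret’s `confusionMatrix()` reports both Accuracy and Cohen’s Kappa. The factors below are toy data, not the real dataset:

```r
library(caret)

# Toy predictions vs. ground truth (hypothetical data)
lv <- c("ENFP", "INTJ", "ISTP")
truth <- factor(c("INTJ", "ENFP", "INTJ", "ISTP", "ENFP", "INTJ"), levels = lv)
preds <- factor(c("INTJ", "ENFP", "ISTP", "ISTP", "ENFP", "INTJ"), levels = lv)

cm <- confusionMatrix(preds, truth)
cm$overall["Accuracy"]  # share of correct predictions (5/6 here)
cm$overall["Kappa"]     # agreement corrected for chance
```

Kappa is useful alongside Accuracy in a 16-class problem because it discounts agreement that would occur by chance.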
# Gender Data
gender_data <- df %>%
  count(Gender) %>%
  mutate(perc = n / sum(n) * 100)

ggplot(gender_data, aes(x = Gender, y = n, fill = Gender)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(perc, 1), "%")), vjust = -0.5) +
  theme_minimal() +
  labs(title = "Distribution of Gender",
       y = "Count", x = "Gender") +
  scale_fill_brewer(palette = "Set2") +
  theme(legend.position = "none")
# Education Distribution
edu_data <- df %>%
  count(Education) %>%
  mutate(perc = n / sum(n) * 100,
         Education = as.factor(Education))

ggplot(edu_data, aes(x = Education, y = n, fill = Education)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(perc, 1), "%")), vjust = -0.5) +
  theme_minimal() +
  labs(title = "Distribution of Education Level",
       subtitle = "0: Undergraduate/HS | 1: Graduate+",
       y = "Count", x = "Education Category") +
  scale_fill_manual(values = c("0" = "#999999", "1" = "#E69F00")) +
  theme(legend.position = "none")
# Distribution of Interests
interest_data <- df %>%
  count(Interest) %>%
  mutate(perc = n / sum(n) * 100)

ggplot(interest_data, aes(x = reorder(Interest, -n), y = n, fill = Interest)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(perc, 1), "%")), vjust = -0.5) +
  theme_minimal() +
  labs(title = "Primary Areas of Interest",
       y = "Count", x = "Interest Category") +
  theme(legend.position = "none")
Some highlights we can get from this visualization:
edu_gender_data <- df %>%
  mutate(Education_Label = ifelse(Education == "1", "Graduate+", "Undergrad/HS")) %>%
  group_by(Gender, Education_Label) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(Gender) %>%
  mutate(percentage = count / sum(count) * 100)

ggplot(edu_gender_data, aes(x = Gender, y = percentage, fill = Education_Label)) +
  geom_bar(stat = "identity", position = "stack") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")),
            position = position_stack(vjust = 0.5), color = "white", fontface = "bold") +
  theme_minimal() +
  labs(title = "Proportion of Education Level per Gender",
       subtitle = "Comparing educational background across gender groups",
       x = "Gender",
       y = "Percentage (%)",
       fill = "Education Level") +
  scale_fill_manual(values = c("Graduate+" = "#2C3E50", "Undergrad/HS" = "#2980B9")) +
  theme(legend.position = "bottom")
Key Findings:
Males and females with less than graduate-level education make up the largest proportions.
There is a slightly higher proportion of males with a graduate-level education compared to females, but the difference is not drastic.
The proportions of individuals with at least a graduate-level education (both males and females) are lower compared to those with less than graduate-level education.
interest_gender_data <- df %>%
  group_by(Gender, Interest) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(Gender) %>%
  mutate(percentage = count / sum(count) * 100)

ggplot(interest_gender_data, aes(x = reorder(Interest, -percentage), y = percentage, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge", color = "white") +
  # Add percentage labels on top of each bar
  geom_text(aes(label = paste0(round(percentage, 1), "%")),
            position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  theme_minimal() +
  labs(title = "Primary Interests by Gender",
       subtitle = "Side-by-side comparison of interest profiles",
       x = "Interest Category",
       y = "Percentage within Gender (%)",
       fill = "Gender") +
  scale_fill_manual(values = c("Female" = "#D35400", "Male" = "#2980B9")) +
  theme(legend.position = "top")
Key Findings:
p_density <- ggplot(df, aes(x = Introversion.Score, fill = Gender)) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
  labs(title = "Introversion Score Distribution by Gender",
       subtitle = "Comparing the 'Introversion vs Extraversion' axis",
       x = "Introversion Score (Higher = More Extraverted)",
       y = "Density") +
  scale_fill_manual(values = c("Female" = "#8E44AD", "Male" = "#2ECC71")) +
  theme(legend.position = "top")

p_boxplot <- ggplot(df, aes(x = Gender, y = Introversion.Score, fill = Gender)) +
  geom_boxplot(alpha = 0.7) +
  theme_minimal() +
  labs(title = "Introversion Score Summary by Gender",
       x = "Gender",
       y = "Introversion Score") +
  scale_fill_manual(values = c("Female" = "#8E44AD", "Male" = "#2ECC71")) +
  theme(legend.position = "none")
# Display the density plot
print(p_density)
Key Findings:
df_scores_long <- df %>%
  select(Personality, Introversion.Score, Sensing.Score, Thinking.Score, Judging.Score) %>%
  pivot_longer(cols = -Personality,
               names_to = "Trait",
               values_to = "Score")

ggplot(df_scores_long, aes(x = Personality, y = Score, fill = Trait)) +
  geom_boxplot(outlier.size = 0.5, alpha = 0.8) +
  facet_wrap(~Trait, scales = "free_x", ncol = 2) +
  coord_flip() +
  theme_minimal() +
  labs(title = "Psychometric Scores by Personality Type",
       subtitle = "Analysis of Introversion, Sensing, Thinking, and Judging dimensions",
       x = "MBTI Personality Type",
       y = "Score (0-10)") +
  theme(legend.position = "none",
        strip.text = element_text(face = "bold", size = 14),
        axis.text.y = element_text(size = 6),
        axis.text.x = element_text(size = 6),
        panel.spacing = unit(1, "lines")) +
  scale_fill_brewer(palette = "Set1")
Key Findings:
Introversion Score Analysis:
Sensing Score Analysis:
ES and IS types have the highest sensing scores, which suggests they are more likely to rely on concrete, sensory information.
Personality types with “N” (intuition) as their second letter, such as INTJ, ENTP, and INFJ, have lower sensing scores, indicating a preference for intuition over sensing.
Correlation Between Introversion and Sensing Scores:
Extroverted sensing types like ESTP and ESFP show high sensing scores but lower introversion, demonstrating the connection between extroversion and sensory processing.
Conversely, introverted intuitive types like INFJ and INTJ tend to have lower sensing scores but higher introversion scores.
Thinking Score:
INTP, ENTJ, and INTJ have the highest thinking scores, showing a strong preference for logical decision-making, while ESFJ, ISFJ, and INFJ have lower thinking scores, indicating a tendency towards emotional-based decision-making.
This highlights how a “T” as the third letter corresponds to higher thinking scores, while an “F” marks personalities with feeling-based decision-making.
Judging Score:
ESTJ and ISTJ have the highest judging scores, highlighting their preference for structure and planning.
Perceiving types like ENFP and ESFP score lower, showing a preference for flexibility and spontaneity.
Patterns Across Personality Types:
Sensing types, such as ESTJ, ISTJ, ESFJ, generally display lower introversion scores and higher sensing scores, as sensing is typically associated with a more hands-on, practical approach, often aligned with extroversion or a balanced personality.
Intuitive types, particularly those who are introverted, such as INTJ, INFJ, seem to prefer internal thinking processes and abstract thinking over sensory data.
num_cols <- df[, sapply(df, is.numeric)]
cor_matrix <- cor(num_cols)

corrplot(cor_matrix,
         method = "color",
         type = "upper",
         addCoef.col = "black",
         tl.col = "black",
         tl.srt = 45,
         diag = FALSE)
Analysing the correlation between predictors, we can highlight that the psychometric scores - the personality traits - do not strongly depend on age or education. Almost all correlations are very close to 0, meaning there are weak or no linear relationships between most of these variables.
Our dataset contains categorical variables: Gender, Education, and Interest. We must convert them into factors so that caret can handle them internally.
df$Gender <- as.factor(df$Gender)
df$Education <- as.factor(df$Education)
df$Interest <- as.factor(df$Interest)
df$Personality <- as.factor(df$Personality)
# Verify the changes
str(df)
## 'data.frame': 43744 obs. of 9 variables:
## $ Age : num 21 24 26 30 31 33 32 27 30 26 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 1 2 1 1 1 2 2 1 ...
## $ Education : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 2 1 2 ...
## $ Introversion.Score: num 5.89 2.48 7.03 5.47 3.6 ...
## $ Sensing.Score : num 2.14 3.21 6.47 4.18 6.19 ...
## $ Thinking.Score : num 7.32 8.07 4.16 2.82 5.31 ...
## $ Judging.Score : num 5.46 3.77 5.45 5.08 3.68 ...
## $ Interest : Factor w/ 5 levels "Arts","Others",..: 1 5 2 3 2 3 1 4 5 1 ...
## $ Personality : Factor w/ 16 levels "ENFJ","ENFP",..: 4 12 6 1 14 13 7 10 13 8 ...
Before training our models, we need to ensure the data is in the correct format. First, we check whether there are any missing values (NA) in the dataset.
# Check for missing values
na_count <- sum(is.na(df))
cat("Total missing values found:", na_count)
## Total missing values found: 0
For this dataset, an outlier can be defined as an individual with an extreme combination of features. We will use the Mahalanobis distance, a technique that accounts for the correlations among the predictor variables (the 4 personality scores) to determine whether an individual has an “anomalous” profile.
# Selecting the psychometric scores
predict_num <- df[, c("Introversion.Score", "Sensing.Score", "Thinking.Score", "Judging.Score")]
dist <- mahalanobis(predict_num, center = colMeans(predict_num), cov = cov(predict_num))
df$outlier_score <- dist
# Plotting outliers
ggplot(df, aes(x = outlier_score)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  geom_vline(xintercept = qchisq(0.99, df = 4), color = "red", linetype = "dashed") +
  labs(title = "Outlierness Distribution",
       subtitle = "The red line indicates the critical threshold (p < 0.01)",
       x = "Mahalanobis Distance", y = "Frequency") +
  theme_minimal()
n_initial <- nrow(df)
# We identify the outliers (threshold based on the Chi-square distribution)
threshold <- qchisq(0.99, df = 4)
outliers_identified <- which(df$outlier_score > threshold)
# We remove the outliers to improve the stability of the LDA model
df_clean <- df[-outliers_identified, ]
n_final <- nrow(df_clean)
# Display summary
cat("--- Outlier Cleaning Summary ---\n")
## --- Outlier Cleaning Summary ---
cat("Initial observations: ", n_initial, "\n")
## Initial observations: 43744
cat("Outliers detected: ", length(outliers_identified), "\n")
## Outliers detected: 399
cat("Final observations: ", n_final, "\n")
## Final observations: 43345
cat("Percentage removed: ", round((length(outliers_identified) / n_initial) * 100, 2), "%\n")
## Percentage removed: 0.91 %
# Remove the auxiliary column to keep the dataset tidy for modeling
df_clean$outlier_score <- NULL
The primary algorithm that will be used in this pipeline, Linear Discriminant Analysis (LDA), is highly sensitive to outliers. LDA works by calculating the mean - centroid - for each of the 16 personality classes and estimating a shared covariance matrix. By removing these outliers, we ensure that the centroids represent the typical behavior of each personality type, leading to more stable and accurate decision boundaries. The percentage of removed data is low (0.91%), but removal can still contribute to better results.
The final result is a very well-balanced data frame:
kable(table(df_clean$Personality), col.names = c("Personality", "Frequency"))
| Personality | Frequency |
|---|---|
| ENFJ | 2709 |
| ENFP | 2676 |
| ENTJ | 2719 |
| ENTP | 2679 |
| ESFJ | 2734 |
| ESFP | 2699 |
| ESTJ | 2734 |
| ESTP | 2714 |
| INFJ | 2716 |
| INFP | 2686 |
| INTJ | 2717 |
| INTP | 2684 |
| ISFJ | 2734 |
| ISFP | 2705 |
| ISTJ | 2734 |
| ISTP | 2705 |
Since class imbalance is not a problem for this dataset, resampling techniques such as SMOTE or oversampling will not be applied, as they would not provide any statistical benefit in this case.
We will divide the data into a training set (75%) and a test set (25%). We will use a fixed seed to ensure the experiment is reproducible.
set.seed(107)
inTrain <- createDataPartition(y = df_clean$Personality, p = 0.75, list = FALSE)
training <- df_clean[inTrain, ]
testing <- df_clean[-inTrain, ]
Principal Component Analysis (PCA) allows us to reduce the dimensions of our psychometric scores to visualize how personalities are grouped on a 2D plane.
# Perform PCA on numeric variables only
pca_data <- training %>% select(where(is.numeric))
pca_res <- prcomp(pca_data, scale. = TRUE)
# Calculate cumulative variance
prop_var <- pca_res$sdev^2 / sum(pca_res$sdev^2)
cum_var <- cumsum(prop_var)

pca_summary <- data.frame(
  Component = paste0("PC", 1:length(cum_var)),
  Variance_Explained = round(prop_var, 4),
  Cumulative_Variance = round(cum_var, 4)
)

knitr::kable(pca_summary, caption = "PCA Explained Variance")
| Component | Variance_Explained | Cumulative_Variance |
|---|---|---|
| PC1 | 0.2343 | 0.2343 |
| PC2 | 0.2014 | 0.4357 |
| PC3 | 0.2007 | 0.6365 |
| PC4 | 0.1978 | 0.8343 |
| PC5 | 0.1657 | 1.0000 |
# Plotting
pca_plot_df <- as.data.frame(pca_res$x)
pca_plot_df$Personality <- training$Personality

ggplot(pca_plot_df, aes(x = PC1, y = PC2, color = Personality)) +
  geom_point(alpha = 0.4, size = 1.5) +
  theme_minimal() +
  labs(title = "PCA: 2D Projection of Personality Types",
       subtitle = paste0("Total Variance Explained (PC1+PC2): ",
                         round(cum_var[2] * 100, 2), "%"),
       x = paste0("PC1 (", round(prop_var[1] * 100, 1), "%)"),
       y = paste0("PC2 (", round(prop_var[2] * 100, 1), "%)")) +
  theme(legend.position = "right")
While the projection captures nearly 44% of the overall structure in the data, more than half of the variance remains in higher dimensions. The plot reveals substantial overlap among all 16 personality types, with no clearly separated clusters. This suggests that MBTI types are not linearly separable in a low-dimensional space. It is possible that PC1 captures introversion-extraversion tendencies, while PC2 may reflect differences related to decision-making preferences.
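One way to probe such interpretations is to inspect the PCA loadings (`pca_res$rotation`), which show how strongly each original variable contributes to a component. A self-contained sketch on simulated scores (hypothetical data, since the real data frame is not reproduced here):

```r
# Sketch: reading PCA loadings to interpret components (simulated data).
set.seed(1)
toy <- data.frame(
  Introversion.Score = runif(100, 0, 10),
  Sensing.Score      = runif(100, 0, 10),
  Thinking.Score     = runif(100, 0, 10),
  Judging.Score      = runif(100, 0, 10)
)

pca_toy <- prcomp(toy, scale. = TRUE)

# Each column of 'rotation' is a unit-length loading vector; a large absolute
# loading means that variable dominates the component.
round(pca_toy$rotation[, 1:2], 3)
```

If PC1 really tracks introversion-extraversion in the real data, `Introversion.Score` should carry the largest absolute loading in the first column.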
The following preprocessing steps will be applied inside the model training function via the preProc argument, to prevent data leakage:
Center and Scale: The psychometric scores and Age have different ranges. Centering (subtracting the mean) and scaling (dividing by the standard deviation) put all features on a comparable scale.
Near Zero Variance: We will check for predictors that have a single unique value or very few unique values relative to the number of samples. These “zero-variance” predictors provide no information to the model and can cause numerical instability in algorithms like LDA.
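The near-zero-variance check can also be run by hand before training. A small sketch using caret’s `nearZeroVar()` on a hypothetical data frame with one constant column:

```r
library(caret)

# Hypothetical predictors: 'constant' carries no information
toy <- data.frame(
  score    = c(1.2, 3.4, 2.2, 5.1, 4.4),
  constant = rep(1, 5)
)

nzv_idx <- nearZeroVar(toy)  # indices of (near-)zero-variance columns
names(toy)[nzv_idx]          # the constant column is flagged
```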
# Cross-validation settings
ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 3)
LDA is a classic statistical model that seeks to find a linear combination of features that best separates the 16 personality classes. It works by modeling the distribution of the predictors, in this case the psychometric scores, for each class and then using Bayes’ theorem to estimate the probability of an individual belonging to a specific group. It assumes that the data follows a normal distribution and that all classes share the same covariance matrix.
# LDA model training
ldaModel <- train(Personality ~ .,
                  data = training,
                  method = "lda",
                  preProc = c("center", "scale"),
                  trControl = ctrl)
print(ldaModel)
## Linear Discriminant Analysis
##
## 32515 samples
## 8 predictor
## 16 classes: 'ENFJ', 'ENFP', 'ENTJ', 'ENTP', 'ESFJ', 'ESFP', 'ESTJ', 'ESTP', 'INFJ', 'INFP', 'INTJ', 'INTP', 'ISFJ', 'ISFP', 'ISTJ', 'ISTP'
##
## Pre-processing: centered (11), scaled (11)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 29266, 29265, 29263, 29261, 29262, 29263, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7504221 0.7337538
k-Nearest Neighbors is a lazy, non-parametric algorithm. Instead of building a mathematical formula, it classifies an individual by looking at the k most similar people in the training set - the so-called neighbors. The individual is assigned to the personality type that is most common among those neighbors. It relies on distance metrics, which is why scaling our data is mandatory here.
# We use tuneLength = 10 to let caret try 10 different values of 'k'
knnModel <- train(Personality ~ .,
                  data = training,
                  method = "knn",
                  preProc = c("center", "scale", "nzv"),
                  trControl = ctrl,
                  tuneLength = 10)
print(knnModel)
## k-Nearest Neighbors
##
## 32515 samples
## 8 predictor
## 16 classes: 'ENFJ', 'ENFP', 'ENTJ', 'ENTP', 'ESFJ', 'ESFP', 'ESTJ', 'ESTP', 'INFJ', 'INFP', 'INTJ', 'INTP', 'ISFJ', 'ISFP', 'ISTJ', 'ISTP'
##
## Pre-processing: centered (11), scaled (11)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 29262, 29267, 29263, 29265, 29264, 29261, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.7473883 0.7305311
## 7 0.7507607 0.7341268
## 9 0.7538363 0.7374064
## 11 0.7565639 0.7403149
## 13 0.7573529 0.7411557
## 15 0.7557228 0.7394162
## 17 0.7561329 0.7398532
## 19 0.7549437 0.7385840
## 21 0.7536213 0.7371728
## 23 0.7521550 0.7356082
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 13.
A Random Forest is an ensemble learning method that builds a forest of many decision trees to achieve a more accurate and stable prediction. It is based on the principle of the “wisdom of the crowd”: the collective vote of many trees is almost always superior to the judgment of an individual one.
rfModel <- train(Personality ~ .,
                 data = training,
                 method = "rf",
                 trControl = ctrl,
                 preProc = c("center", "scale", "nzv"),
                 tuneLength = 3,
                 ntree = 200)
print(rfModel)
## Random Forest
##
## 32515 samples
## 8 predictor
## 16 classes: 'ENFJ', 'ENFP', 'ENTJ', 'ENTP', 'ESFJ', 'ESFP', 'ESTJ', 'ESTP', 'INFJ', 'INFP', 'INTJ', 'INTP', 'ISFJ', 'ISFP', 'ISTJ', 'ISTP'
##
## Pre-processing: centered (11), scaled (11)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 29264, 29262, 29263, 29266, 29263, 29265, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8908605 0.8835827
## 6 0.8913221 0.8840754
## 11 0.8861753 0.8785855
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
# Collect all models into a list
model_results <- resamples(list(
  LDA = ldaModel,
  KNN = knnModel,
  RF = rfModel
))
summary(model_results)
##
## Call:
## summary.resamples(object = model_results)
##
## Models: LDA, KNN, RF
## Number of resamples: 30
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LDA 0.7298462 0.7476930 0.7515375 0.7504221 0.7554001 0.7612170 0
## KNN 0.7431560 0.7534035 0.7578785 0.7573529 0.7607593 0.7669127 0
## RF 0.8804181 0.8894615 0.8914347 0.8913221 0.8934862 0.8993846 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LDA 0.7117991 0.7308410 0.7349453 0.7337538 0.7390653 0.7452730 0
## KNN 0.7260117 0.7369429 0.7417168 0.7411557 0.7447896 0.7513493 0
## RF 0.8724445 0.8820907 0.8841955 0.8840754 0.8863837 0.8926759 0
bwplot(model_results, metric = "Accuracy",
       main = "Model Comparison: Accuracy Distribution")
The Random Forest model clearly outperforms LDA and KNN in terms of both Accuracy and Cohen’s Kappa: while LDA and KNN achieve moderate and stable performance (mean accuracy around 0.75–0.76), Random Forest reaches a mean accuracy of approximately 0.89.
These results indicate that Random Forest is able to capture complex, non-linear relationships present in the data, making it the most suitable model for further analysis.
Global importance tells us which variables have the most “predictive power” across the entire dataset. We use the Permutation Importance method: we randomly shuffle a variable; if the model’s accuracy drops significantly, that variable is important.
predictor_rf <- Predictor$new(
  model = rfModel,
  data = training[, setdiff(names(training), "Personality")],
  y = training$Personality,
  type = "prob"
)

feat_imp <- FeatureImp$new(
  predictor_rf,
  loss = "ce"  # cross-entropy for classification
)

plot(feat_imp)
Variables associated with personality dimensions (e.g., Introversion, Thinking, Sensing, Judging scores) exhibit the highest importance, indicating that shuffling these features leads to a substantial decrease in predictive performance. In contrast, demographic variables such as Age or Gender show comparatively lower importance, suggesting less influence on the model’s decisions.
While importance tells us which variable matters, Feature Effects tell us how it matters. A Partial Dependence Plot shows how the probability of a specific personality type (e.g., ‘INTJ’) changes as a score (e.g., Thinking Score) increases.
pdp_thinking <- FeatureEffect$new(
  predictor_rf,
  feature = "Thinking.Score",
  method = "pdp"
)

# Filter the results for the INTJ class
pdp_thinking$results <- pdp_thinking$results[pdp_thinking$results$.class == "INTJ", ]

plot(pdp_thinking) + ggtitle("Feature Effect of Thinking Score for INTJ")
The partial dependence plot for Thinking.Score shows how the predicted probability of the INTJ personality type evolves as this score increases, while averaging out the effects of other variables. The observed increase indicates that higher Thinking scores are strongly associated with a higher likelihood of being classified as INTJ, confirming the expected relationship between this psychological dimension and the model’s predictions.
SHAP (SHapley Additive exPlanations) values provide consistent, local explanations by quantifying the contribution of each variable to individual predictions. Unlike feature importance and partial dependence, SHAP values allow us to explain specific predictions at the observation level.
set.seed(123)
idx <- sample(1:nrow(training), 50)

shap_global <- Shapley$new(
  predictor_rf,
  x.interest = training[idx, setdiff(names(training), "Personality")]
)

plot(shap_global) +
  theme(
    axis.text.x = element_text(size = 6),
    axis.text.y = element_text(size = 7)
  )
The aggregated SHAP values across multiple observations provide a global perspective on feature contributions. Positive SHAP values indicate features that increase the predicted probability of the assigned personality type, while negative values decrease it. This explanation allows us to understand why the model assigns a specific personality to a given individual.
A counterfactual explanation answers the question: “What is the smallest change to this individual’s features that would change the model’s prediction?”
# Choose an instance (ENTP)
x_orig <- training[1, ]
predict(rfModel, x_orig, type = "raw")
## [1] ENTP
## 16 Levels: ENFJ ENFP ENTJ ENTP ESFJ ESFP ESTJ ESTP INFJ INFP INTJ INTP ... ISTP
# Create a counterfactual by changing the Thinking and Judging scores
x_cf <- x_orig
x_cf$Judging.Score <- x_cf$Judging.Score + 15
x_cf$Thinking.Score <- x_cf$Thinking.Score - 5
predict(rfModel, x_cf, type = "raw")
## [1] ENFP
## 16 Levels: ENFJ ENFP ENTJ ENTP ESFJ ESFP ESTJ ESTP INFJ INFP INTJ INTP ... ISTP
predict(rfModel, x_orig, type = "prob")
## ENFJ ENFP ENTJ ENTP ESFJ ESFP ESTJ ESTP INFJ INFP INTJ INTP ISFJ ISFP ISTJ
## 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## ISTP
## 1 0
predict(rfModel, x_cf, type = "prob")
## ENFJ ENFP ENTJ ENTP ESFJ ESFP ESTJ ESTP INFJ INFP INTJ INTP ISFJ ISFP ISTJ
## 1 0.035 0.87 0.01 0.01 0 0 0 0 0 0.075 0 0 0 0 0
## ISTP
## 1 0
The counterfactual simulation showed that modifications to the Thinking and Judging scores shifted the prediction from ENTP to ENFP. This behavior suggests that these dimensions have a strong influence on the model, highlighting how the classifier responds to changes in personality traits.
Let’s now visualize the data with t-Distributed Stochastic Neighbor Embedding (t-SNE):
set.seed(123)
sample_idx <- sample(1:nrow(training), 2000)
tsne_sample <- training[sample_idx, ]

# Keep only numeric columns
numeric_cols <- tsne_sample[, sapply(tsne_sample, is.numeric)]

tsne_res <- Rtsne(as.matrix(numeric_cols),
                  check_duplicates = FALSE,
                  perplexity = 30,
                  theta = 0.5,
                  dims = 2)

tsne_df <- as.data.frame(tsne_res$Y)
colnames(tsne_df) <- c("Dim1", "Dim2")
tsne_df$Personality <- tsne_sample$Personality

# Plotting
ggplot(tsne_df, aes(x = Dim1, y = Dim2, color = Personality)) +
  geom_point(alpha = 0.7, size = 1.5) +
  theme_minimal() +
  labs(title = "t-SNE Visualization of Personality Types",
       x = "t-SNE dimension 1", y = "t-SNE dimension 2")
Compared with our earlier PCA analysis, t-SNE provides a much clearer separation. Since the two-component PCA fails to capture the complex variance of this dataset, we can conclude that t-SNE is the better tool for visualizing high-dimensional clusters in a 2D space here.
In real-world scenarios, obtaining labeled data is expensive. We will simulate this by hiding the labels for 90% of our training set, leaving only 10% as labeled data. We will then compare three models: one trained on the 10% labeled subset, a semi-supervised model augmented with confident pseudo-labels, and the fully supervised model from Topic 4.
set.seed(123)
# 1. Create a copy of the training data
df_semi <- training
# 2. Hide 90% of the labels
unlabeled_idx <- sample(1:nrow(df_semi), 0.9 * nrow(df_semi))
df_semi$Personality_Hidden <- df_semi$Personality
df_semi$Personality_Hidden[unlabeled_idx] <- NA
# Summary of the scenario
labeled_data <- df_semi[!is.na(df_semi$Personality_Hidden), ]
unlabeled_data <- df_semi[is.na(df_semi$Personality_Hidden), ]
cat("Labeled instances for training:", nrow(labeled_data), "\n")
## Labeled instances for training: 3252
cat("Unlabeled instances available:", nrow(unlabeled_data))
## Unlabeled instances available: 29263
We will implement a basic self-training loop: train on the labeled subset, predict on the unlabeled pool, and add the most confident predictions (probability > 0.90) back to the training set as pseudo-labels.
# Baseline model (10% labeled data)
set.seed(123)
ctrl_semi <- trainControl(method = "cv", number = 5)

rf_10_percent <- caret::train(Personality ~ .,
                              data = labeled_data[, setdiff(names(labeled_data), "Personality_Hidden")],
                              method = "rf",
                              trControl = ctrl_semi,
                              tuneGrid = data.frame(mtry = 6),
                              ntree = 100)
# Generate pseudo-labels
probs <- predict(rf_10_percent, newdata = unlabeled_data, type = "prob")
max_probs <- apply(probs, 1, max)
predicted_classes <- predict(rf_10_percent, newdata = unlabeled_data)

# Keep only predictions above the confidence threshold
threshold <- 0.90
confident_idx <- which(max_probs > threshold)

pseudo_labeled_data <- unlabeled_data[confident_idx, ]
pseudo_labeled_data$Personality <- predicted_classes[confident_idx]

# Final semi-supervised model:
# combine original labels with confident pseudo-labels
augmented_training <- rbind(
  labeled_data[, setdiff(names(labeled_data), "Personality_Hidden")],
  pseudo_labeled_data[, setdiff(names(pseudo_labeled_data), "Personality_Hidden")]
)
rf_semi <- caret::train(Personality ~ .,
                        data = augmented_training,
                        method = "rf",
                        trControl = ctrl_semi,
                        tuneGrid = data.frame(mtry = 6),
                        ntree = 100)
# Evaluate 10% Supervised
pred_10 <- predict(rf_10_percent, testing)
acc_10 <- postResample(pred_10, testing$Personality)[1]
# Evaluate Semi-Supervised
pred_semi <- predict(rf_semi, testing)
acc_semi <- postResample(pred_semi, testing$Personality)[1]
# Full Supervised Model (from Topic 4)
acc_full <- postResample(predict(rfModel, testing), testing$Personality)[1]
# Comparison table
semi_comparison <- data.frame(
  Model = c("Random Forest (10% Labels)", "Random Forest (Semi-Supervised)", "Random Forest (Full 100% Labels)"),
  Accuracy = c(acc_10, acc_semi, acc_full)
)

knitr::kable(semi_comparison, caption = "Semi-Supervised Experiment Results")
| Model | Accuracy |
|---|---|
| Random Forest (10% Labels) | 0.8508772 |
| Random Forest (Semi-Supervised) | 0.8487535 |
| Random Forest (Full 100% Labels) | 0.8908587 |
While semi-supervised learning aims to leverage unlabeled data, the pseudo-labeling strategy introduced a slight decrease in performance in our case. This highlights the risk of self-confirmation bias: the model became confident in some incorrect classifications and reinforced those errors during retraining. In scenarios where accuracy is critical, investing in more labeled data provides a significant gain (about 4 percentage points in this case).
The comparative analysis between models revealed a clear hierarchy in predictive power: the Random Forest model was the undisputed winner, achieving an accuracy of 89.1% and a Kappa of 0.88. This significant lead over LDA (75.0%) and k-NN (75.7%) suggests that the boundaries between personality types are complex and non-linear.
This tutorial has demonstrated that a successful data mining project is not just about the final algorithm, but also about the pipeline. Beginning with Exploratory Data Analysis to assess data distributions, the project examined the underlying structure of the data using PCA and t-SNE, implemented and tuned three supervised learning models, and applied interpretable machine learning techniques—such as SHAP values and counterfactual explanations—to uncover the rationale behind individual predictions. Finally, the model’s robustness was evaluated through a semi-supervised learning simulation, proving the value of expert labeling. The final model is not only accurate but also interpretable, making it a practical and reliable tool for supporting personality classification.