In this project, we developed and evaluated predictive models to estimate the probability of loan default from borrower-level financial data. The analysis involved exploratory data visualization, correlation analysis, and model benchmarking with logistic regression and random forest classifiers. We selected predictor variables for statistical significance, interpretability, and low multicollinearity, incorporating features such as FICO score, interest rate, credit policy adherence, and revolving utilization. ROC curves, AUC scores, and false negative rates served as the key performance indicators, with an emphasis on minimizing missed defaults (false negatives) while maximizing profitability.
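To make the two headline metrics concrete, here is a minimal sketch of how AUC and the false negative rate can be computed from predicted default probabilities. It is illustrative only, not the project's code: the function names and the toy labels and scores are hypothetical, and the false negative rate is assumed to follow the standard definition (missed defaults divided by all actual defaults).

```python
# Hypothetical toy data: 1 = loan defaulted, 0 = loan repaid.
y_true = [0, 0, 1, 0, 1, 1, 0, 0]
# Hypothetical predicted default probabilities from some fitted model.
y_prob = [0.05, 0.10, 0.12, 0.20, 0.30, 0.55, 0.08, 0.15]


def auc_score(labels, scores):
    """AUC as the probability that a randomly chosen defaulter is scored
    higher than a randomly chosen non-defaulter (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def false_negative_rate(labels, scores, threshold):
    """Share of actual defaults whose score falls below the threshold,
    i.e. defaults the model would have classified as safe."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    if not pos:
        raise ValueError("no defaults in the data")
    missed = sum(1 for s in pos if s < threshold)
    return missed / len(pos)


if __name__ == "__main__":
    print("AUC:", auc_score(y_true, y_prob))
    print("FNR at 0.16:", false_negative_rate(y_true, y_prob, 0.16))
```

AUC does not depend on any cutoff, while the false negative rate changes with the threshold, which is why the two are reported side by side when comparing models.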

Ultimately, our refined logistic regression model outperformed the random forest in both interpretability and predictive performance, achieving a higher AUC (68.2%) and, at the selected threshold, a lower false negative rate (9.5%). We extended the analysis by building an optimization framework to identify the cutoff probability (16%) that maximized expected return on investment while maintaining market realism. The final recommendation was to deploy the revised logistic model for credit underwriting decisions, allowing financial institutions to better balance risk and return across their lending portfolios.
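The cutoff search can be sketched as a simple sweep over candidate thresholds. This is an assumption-laden illustration rather than the framework used in the project: the payoff figures (`gain`, `loss`), the function names, and the toy data are all hypothetical, and the market-realism constraint is not modeled. A loan is approved when its predicted default probability falls below the cutoff.

```python
# Hypothetical toy data: 1 = loan defaulted, 0 = loan repaid.
y_true = [0, 0, 1, 0, 1, 1, 0, 0]
# Hypothetical predicted default probabilities.
y_prob = [0.05, 0.10, 0.12, 0.20, 0.30, 0.55, 0.08, 0.15]


def portfolio_profit(labels, scores, cutoff, gain=0.15, loss=0.60):
    """Total profit per unit lent when approving every loan whose predicted
    default probability is below `cutoff`.

    `gain` (return on a repaid loan) and `loss` (share of principal lost on
    a default) are assumed values for illustration only.
    """
    profit = 0.0
    for y, s in zip(labels, scores):
        if s < cutoff:  # loan is approved
            profit += -loss if y == 1 else gain
    return profit


def best_cutoff(labels, scores, cutoffs, gain=0.15, loss=0.60):
    """Return (cutoff, profit) for the candidate cutoff with the highest
    profit; ties go to the lowest (most conservative) cutoff."""
    best = max(
        cutoffs,
        key=lambda c: portfolio_profit(labels, scores, c, gain, loss),
    )
    return best, portfolio_profit(labels, scores, best, gain, loss)


if __name__ == "__main__":
    candidates = [i / 100 for i in range(1, 100)]  # 1% to 99%
    cutoff, profit = best_cutoff(y_true, y_prob, candidates)
    print(f"best cutoff: {cutoff:.2f}, profit: {profit:.2f}")
```

The toy numbers are not meant to reproduce the 16% cutoff reported above. They only show the mechanism: raising the cutoff approves more loans, which adds interest income until the extra defaults it admits cost more than they earn.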
