Data Validation for LendingClub Loan Data Using Python and TensorFlow Data Validation

SUMMARY: The project aims to construct a data validation flow using TensorFlow Data Validation (TFDV) and document the end-to-end steps using a template. The Kaggle LendingClub Loan Data dataset is a binary classification situation where we attempt to predict one of the two possible outcomes.

INTRODUCTION: The Kaggle dataset owner derived this dataset from the publicly available data of Lending Club connects people who need money (borrowers) with people who have money (investors). An investor naturally would want to invest in people who showed a profile of having a high probability of paying back the loan. The dataset uses the lending data from 2007 to 2010, and we will try to predict whether the borrower paid back their loan in full.

Additional Notes: I adapted this workflow from the TensorFlow Data Validation tutorial on The plan is to build a robust TFDV script for validating datasets in building machine learning models.

CONCLUSION: In this iteration, the data validation workflow helped to validate the features and structures of the training, validation, and test datasets. The workflow also generated statistics over different slices of data which can help track model and anomaly metrics.

Dataset Used: Kaggle LendingClub Loan Data

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference:

The HTML formatted report can be found here on GitHub.