SUMMARY: This project builds a data validation workflow using TensorFlow Data Validation (TFDV) and documents the end-to-end steps using a template. The Diabetes 130-US Hospitals dataset poses a binary classification problem, in which we attempt to predict one of two possible outcomes.
INTRODUCTION: The dataset is the Diabetes 130-US Hospitals for years 1999-2008 dataset, donated to the University of California, Irvine (UCI) Machine Learning Repository. It represents ten years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks, and it includes over 50 features representing patient and hospital outcomes.
Additional Notes: I adapted this workflow from the TensorFlow Data Validation tutorial on TensorFlow.org (https://www.tensorflow.org/tfx/tutorials/data_validation/tfdv_basic). I also plan to build a TFDV script for validating future datasets and building machine learning models.
CONCLUSION: In this iteration, the data validation workflow helped validate the features and structure of the training, validation, and test datasets. The workflow also generated statistics over different slices of the data, which can help track model and anomaly metrics.
Dataset Used: Diabetes 130-US Hospitals for years 1999-2008 Dataset
Dataset ML Model: Binary classification with numerical and categorical attributes
Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008
The HTML-formatted report can be found on GitHub.
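The core of the validation workflow described in the conclusion can be sketched as follows. This is a minimal illustration rather than the project's actual script: the function name `check_against_training_schema` and its two CSV path arguments are hypothetical, and it assumes the dataset splits have already been written to separate CSV files. It follows the pattern from the TFDV tutorial: compute statistics on the training split, infer a schema from them, then check another split against that schema.

```python
def check_against_training_schema(train_csv, other_csv):
    """Infer a schema from the training CSV and validate another split against it.

    Both arguments are hypothetical file paths. Returns the inferred schema
    and the anomalies found in the second split.
    """
    # Imported inside the function so that defining it does not
    # require TFDV to be installed.
    import tensorflow_data_validation as tfdv

    # Summary statistics for the training split.
    train_stats = tfdv.generate_statistics_from_csv(data_location=train_csv)

    # Schema inferred from the training statistics.
    schema = tfdv.infer_schema(statistics=train_stats)

    # Statistics for the split being checked (for example, validation or test).
    other_stats = tfdv.generate_statistics_from_csv(data_location=other_csv)

    # Compare the second split's statistics with the training schema.
    anomalies = tfdv.validate_statistics(statistics=other_stats, schema=schema)
    return schema, anomalies
```

In a notebook, the returned objects can be inspected with `tfdv.display_schema(schema)` and `tfdv.display_anomalies(anomalies)`. Running the function once per held-out split keeps the training data as the single reference for what a well-formed dataset looks like.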