This project was completed as part of a data storage course at Northeastern (DA5020).  My team built an ETL data pipeline to analyze hospital bed capacity across over 22,000 medical institutions in the United States, producing an analytical report for a simulated healthcare consulting client.
The pipeline was split into three stages. First, we designed and created a SQLite relational database schema in R, defining tables for facilities, bed facts, and bed categories with appropriate primary and foreign keys. Second, we parsed a large XML file from a hospital clinical management system, extracted and transformed the relevant data, and loaded it into the SQLite database using R, mapping nested XML structures to relational tables. Third, we wrote an R Notebook that queried the database exclusively via SQL to generate the full report, using dplyr for any additional calculations, ggplot2 for visualizations, and kableExtra for formatted tables.
The final report surfaces key insights about U.S. hospital capacity, including total licensed vs. staffed bed counts, a breakdown by bed category across 20+ types, the top 10 facilities by staffed beds, and the leading institution in each bed category. It is formatted as a professional document readable by both technical and non-technical audiences. The button below will show the code and the final pdf report.

You may also like

Back to Top