Skip to content

AndreaHobby/black-in-data-2023-presentation

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 

Repository files navigation

Black in Data Week 2023: Data Science Project Template & Best Practices

Overview

Presentation and workshop materials from Black in Data Week 2023, where I presented a comprehensive framework for structuring data science projects with reproducibility, transparency, and best practices in mind.

This repository contains a data science project template designed to guide analysts and researchers through the complete analytics workflow—from problem formulation through model evaluation and deployment considerations.

Event

Black in Data Week 2023
November 2023
Virtual Conference

Presentation Abstract

Many data science projects fail not due to lack of technical skills, but from poor project structure, unclear documentation, and lack of reproducible workflows. This presentation introduced a comprehensive project template addressing common pitfalls in data science work.

The template emphasizes:

  • Clear problem formulation - Defining goals and success metrics upfront
  • Transparent data processes - Documenting data sources, cleaning decisions, and transformations
  • Reproducible analysis - Version control, dependency management, and code organization
  • Rigorous evaluation - Moving beyond accuracy to understand model limitations
  • Ethical considerations - Addressing bias, fairness, and interpretability

This framework is particularly valuable for health equity and social impact data science, where transparency and interpretability are critical.

Template Structure

This repository provides a structured approach to data science projects organized into seven key phases:

1. Background & Problem Formulation

  • Define the problem and its importance
  • Identify stakeholders and end users
  • Establish success criteria

2. Goals

  • Articulate 2-3 specific, measurable objectives
  • Distinguish between business goals and technical metrics
  • Consider equity and fairness objectives

3. Data Collection & Cleaning

  • Document data sources and provenance
  • Describe data shape, size, and structure
  • Detail cleaning processes and transformations
  • Address missing data transparently
  • Create reproducible data pipeline

4. Feature Engineering & Selection

  • Document feature creation logic
  • Explain feature selection rationale
  • Consider fairness implications of feature choices
  • Balance predictive power with interpretability

5. Modeling

  • Justify model selection
  • Document hyperparameter tuning process
  • Consider computational constraints
  • Address model assumptions

6. Results & Evaluation

Comprehensive evaluation beyond accuracy:

  • Multiple performance metrics (precision, recall, F1, AUC)
  • Confusion matrices and error analysis
  • Cross-validation for generalization
  • Bias-variance tradeoff analysis
  • Assessment of overfitting/underfitting
  • Feature importance and interpretability
  • Fairness metrics (if working with demographic data)
  • Uncertainty quantification

7. Reflection & Next Steps

  • Document obstacles and solutions
  • Analyze tradeoffs in methodological decisions
  • Propose improvements and extensions
  • Consider deployment and maintenance

Key Takeaways

  1. Structure drives success - A clear project template prevents common pitfalls and improves collaboration

  2. Documentation is not optional - Future you (and your collaborators) will thank you for clear README files and commented code

  3. Model evaluation is more than accuracy - Understanding when and why models fail is critical, especially in high-stakes domains like healthcare

  4. Reproducibility builds trust - Version control, dependency management, and transparent processes are essential for credible data science

  5. Equity considerations should be upstream - Addressing fairness starts with problem formulation and data collection, not just model evaluation

Skills Demonstrated

  • Data science workflow design
  • Best practices in reproducible research
  • Model evaluation and validation
  • Science communication and teaching
  • Community engagement in data science
  • Equity-centered data science practices

Recommended Repository Structure

When using this template for your own projects:

project-name/
├── README.md                 # Project overview and documentation
├── requirements.txt          # Python dependencies
├── environment.yml           # Conda environment (if using conda)
├── data/                     # Data files (add to .gitignore if sensitive)
│   ├── raw/                  # Original, immutable data
│   ├── processed/            # Cleaned, transformed data
│   └── README.md             # Data documentation and sources
├── notebooks/                # Jupyter notebooks for exploration
│   ├── 01_data_exploration.ipynb
│   ├── 02_data_cleaning.ipynb
│   ├── 03_modeling.ipynb
│   └── 04_evaluation.ipynb
├── src/                      # Source code for production
│   ├── __init__.py
│   ├── data_cleaning.py
│   ├── feature_engineering.py
│   ├── modeling.py
│   └── evaluation.py
├── outputs/                  # Model outputs, figures, reports
│   ├── figures/
│   ├── models/
│   └── reports/
├── tests/                    # Unit tests for code
└── .gitignore               # Files to exclude from version control

Application to Health Equity Research

This template is particularly valuable for health equity and disparities research, where:

  • Transparency is critical - Stakeholders need to understand how conclusions were reached
  • Bias detection is essential - Models must be evaluated for fairness across demographic groups
  • Reproducibility builds trust - Findings may inform policy, requiring rigorous documentation
  • Interpretability matters - Black-box models are often insufficient for healthcare applications

UPDATES - Related Work

GitHub Repositories:

Writing:

Citation

Hobby, A. (2023). Data Science Project Template & Best Practices. Presented at Black in Data Week 2023.

Additional Resources

For Reproducible Data Science:

For Health Equity Data Science:

Author

Andrea Hobby, DrPH Student
Johns Hopkins Bloomberg School of Public Health

Focus Areas: Algorithmic bias in healthcare AI, health equity analytics, patient safety, federal health equity reporting standards

Connect:


This presentation reflects my commitment to advancing diversity, equity, and excellence in data science and healthcare analytics. Special thanks to BlackTIDES for creating space for Black professionals to share knowledge, build connections, and elevate our collective impact.


Using This Template

Feel free to use this template for your own data science projects. When adapting it:

  1. Replace the template sections with your actual project content
  2. Document your decisions - Explain why you made specific choices
  3. Be transparent about limitations - Acknowledge what your analysis can and cannot tell you
  4. Make it reproducible - Include requirements.txt, clear instructions, and version information
  5. Consider equity - If working with demographic data, evaluate fairness metrics

Questions or suggestions for improving this template? Open an issue or reach out—I welcome feedback from the data science community.

About

Presentation

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors