
Installation Guide for pyflow-ChIPseq (Modernized 2025)

This guide will help you set up the modernized pyflow-ChIPseq pipeline on your system.

Table of Contents

  • Prerequisites
  • Installation
  • Reference Genome Setup
  • Configuration
  • Testing the Installation
  • Workflow Execution Options
  • Environment Management
  • Troubleshooting
  • Next Steps
  • Support
  • Key Improvements in Modernized Version

Prerequisites

Required Software

  1. Conda or Mamba (for environment management)

  2. Snakemake >= 7.0

    conda create -n snakemake -c bioconda -c conda-forge "snakemake>=7.0"
    conda activate snakemake
  3. Git (for cloning the repository)

    # Usually pre-installed on Linux/Mac
    git --version

System Requirements

  • Memory: Minimum 32 GB RAM recommended (alignment and peak calling are memory-intensive)
  • Storage: Depends on dataset size
    • Raw FASTQ: ~10-50 GB per sample
    • BAM files: ~5-20 GB per sample
    • Final outputs: ~2-10 GB per sample
  • CPU: Multi-core processor (8+ cores recommended for parallel processing)
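To see how your storage budget compares with the estimates above, a look at the raw input size is enough. A minimal sketch (the `fastq_total` helper is illustrative, not part of the pipeline):

```shell
# fastq_total <fastq_dir>: total size of raw FASTQ files in a directory.
# Per the estimates above, budget roughly 2-3x this for BAMs and outputs.
fastq_total() {
    du -ch "$1"/*.fastq.gz | tail -n 1
}

# Example: fastq_total /path/to/fastq
```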

Installation

Step 1: Clone the Repository

# Clone from GitHub
git clone https://github.com/crazyhottommy/pyflow-ChIPseq.git
cd pyflow-ChIPseq

# Checkout the modernized branch
git checkout modernize-2025

Step 2: Install Snakemake (if not already installed)

# Create a conda environment for Snakemake
conda create -n snakemake python=3.11 -y
conda activate snakemake

# Install Snakemake and mamba
conda install -c bioconda -c conda-forge snakemake mamba -y

Step 3: Verify Installation

# Check Snakemake version (should be >= 7.0)
snakemake --version

# Check conda/mamba
conda --version
mamba --version

Reference Genome Setup

The pipeline requires a reference genome in FASTA format and BWA indices.

Option 1: Download Pre-built Genomes

Mouse (mm10)

# Create genome directory
mkdir -p genomes/mm10
cd genomes/mm10

# Download from UCSC
wget http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz
# or curl -O http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz
gunzip mm10.fa.gz

# Build BWA index
conda activate snakemake
mamba install -c bioconda bwa
bwa index mm10.fa

cd ../..

Human (hg38)

# Create genome directory
mkdir -p genomes/hg38
cd genomes/hg38

# Download from UCSC
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz

# Build BWA index
mamba install -c bioconda bwa
bwa index hg38.fa

cd ../..

Option 2: Use Existing Reference

If you already have a reference genome:

# Just create a symbolic link
ln -s /path/to/existing/genome.fa genomes/genome.fa

Configuration

Step 1: Prepare Your Metadata File

Create a tab-delimited file (meta.txt) describing your samples:

sample_name	fastq_name	factor	reads
Sample1	Sample1_Input_R1.fastq.gz	Input	R1
Sample1	Sample1_Input_R2.fastq.gz	Input	R2
Sample1	Sample1_H3K27ac_R1.fastq.gz	H3K27ac	R1
Sample1	Sample1_H3K27ac_R2.fastq.gz	H3K27ac	R2
Sample2	Sample2_Input_R1.fastq.gz	Input	R1
Sample2	Sample2_Input_R2.fastq.gz	Input	R2
Sample2	Sample2_H3K4me3_R1.fastq.gz	H3K4me3	R1
Sample2	Sample2_H3K4me3_R2.fastq.gz	H3K4me3	R2

Column descriptions:

  • sample_name: Biological sample identifier
  • fastq_name: FASTQ file name (must match actual files)
  • factor: Chromatin mark or transcription factor (use "Input" for controls)
  • reads: R1 for forward reads, R2 for reverse reads (paired-end)

For single-end data, use only R1 rows.
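Because fastq_name must match files on disk exactly, it can save a failed run later to verify the metadata up front. A minimal sketch (the `check_fastqs` helper is hypothetical, not part of the pipeline):

```shell
# check_fastqs <fastq_dir> <meta_file>
# Prints a line for every fastq_name in the metadata missing on disk.
check_fastqs() {
    awk -F'\t' 'NR > 1 { print $2 }' "$2" | while read -r fq; do
        [ -f "$1/$fq" ] || echo "MISSING: $fq"
    done
}

# Example: check_fastqs /path/to/fastq meta.txt
```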

Step 2: Generate samples.json

# Run the metadata converter
python sample2json.py /path/to/fastq/directory meta.txt

# This creates samples.json which the pipeline uses
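Before moving on, it is worth confirming the generated file is well-formed. `json.tool` ships with the Python standard library; the exact structure of samples.json is pipeline-specific, so this only checks that it parses:

```shell
# Fails with a parse error if samples.json is not valid JSON
python -m json.tool samples.json > /dev/null && echo "samples.json OK"
```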

Step 3: Edit config.yaml

Update the configuration file with your settings:

# Adjust these settings for your experiment
from_fastq: True
paired_end: True
long_reads: True

# IMPORTANT: Update this path
ref_fa: genomes/mm10/mm10.fa

# Genome size for MACS3
macs_g: mm
macs2_g: mm

# Control sample name (must match your metadata)
control: 'Input'

# Peak calling stringency
macs_pvalue: 0.05
macs2_pvalue: 0.05
macs2_pvalue_broad: 0.1

# Downsampling
downsample: True
target_reads: 50000000

# Optional analyses
chromHMM: False
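A common misconfiguration is a stale ref_fa path. A minimal sketch that catches it early (the `check_ref` helper is illustrative and assumes ref_fa sits on a single line, as in the example above):

```shell
# check_ref <config.yaml>: warn if the ref_fa path in the config is missing
check_ref() {
    ref=$(awk '$1 == "ref_fa:" { print $2 }' "$1")
    if [ -f "$ref" ]; then
        echo "ref_fa OK: $ref"
    else
        echo "WARNING: ref_fa not found: $ref"
    fi
}

# Example: check_ref config.yaml
```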

Testing the Installation

Step 1: Create Log Directories

mkdir -p logs/slurm
mkdir -p 00log

Step 2: Dry Run (Local)

Test the workflow without actually running jobs:

# Activate snakemake environment
conda activate snakemake

# Dry run to check for errors
snakemake -n --use-conda

Step 3: Small Test Run (Local)

Run on a small subset of data locally:

# Run with 4 cores
snakemake --use-conda --cores 4 --keep-going

# Or limit to specific targets
snakemake --use-conda --cores 4 02fqc/Sample1_Input_R1_fastqc.zip

Step 4: Cluster Execution (SLURM)

If you have access to a SLURM cluster:

# Dry run with SLURM profile
snakemake --profile profiles/slurm -n

# Full run on cluster
snakemake --profile profiles/slurm

Workflow Execution Options

Local Execution (No Cluster)

# Use all available cores
snakemake --use-conda --cores all

# Limit to 8 cores
snakemake --use-conda --cores 8

SLURM Cluster Execution

# Using the SLURM profile (recommended)
snakemake --profile profiles/slurm

# Manual SLURM submission (legacy)
snakemake --use-conda --cluster "sbatch --mem={resources.mem_mb} --time={resources.runtime}" --jobs 100

Re-running Failed Jobs

# Rerun incomplete jobs
snakemake --use-conda --cores 8 --rerun-incomplete

# Force re-run specific rule
snakemake --use-conda --cores 8 --forcerun call_peaks_macs3_narrow

Environment Management

The pipeline automatically creates conda environments for each tool group. To pre-create them:

# Create all environments in advance (recommended)
snakemake --use-conda --conda-create-envs-only --cores 1

# This creates environments in .snakemake/conda/

To clean up conda environments:

# Remove unused conda environments (Snakemake >= 7 flag)
snakemake --conda-cleanup-envs

Troubleshooting

Issue: "samples.json not found"

Solution: Run sample2json.py to generate it:

python sample2json.py /path/to/fastq meta.txt

Issue: "Reference genome not indexed"

Solution: Build BWA index:

bwa index genomes/mm10/mm10.fa
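A completed `bwa index` run leaves five companion files (.amb, .ann, .bwt, .pac, .sa) next to the FASTA, so a quick presence check tells you whether indexing finished. A sketch (the `check_bwa_index` helper is illustrative):

```shell
# check_bwa_index <fasta>: report any of bwa's five index files that are missing
check_bwa_index() {
    for ext in amb ann bwt pac sa; do
        [ -f "$1.$ext" ] || echo "missing: $1.$ext"
    done
}

# Example: check_bwa_index genomes/mm10/mm10.fa
```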

Issue: Conda environment creation fails

Solution: Use mamba instead of conda:

conda install -c conda-forge mamba
# Then use --conda-frontend mamba with snakemake

Issue: SLURM jobs fail with memory errors

Solution: Increase memory in the rule's resources: directive in Snakefile:

resources:
    mem_mb=32000  # Increase this value

Issue: "ChromHMM.jar not found"

Solution: ChromHMM is installed automatically via its conda environment. To install it manually:

# Download ChromHMM
wget https://compbio.mit.edu/ChromHMM/ChromHMM.zip
unzip ChromHMM.zip

Issue: Permissions error with phantompeakqualtools

Solution: The pipeline auto-downloads run_spp.R. Ensure write permissions:

chmod +w .

Next Steps

After successful installation:

  1. Prepare your data: Organize FASTQ files in a directory
  2. Create metadata: Write meta.txt describing your samples
  3. Configure pipeline: Edit config.yaml with your settings
  4. Generate samples.json: Run sample2json.py
  5. Dry run: Test with snakemake -n
  6. Execute: Run the pipeline!

Support

For issues and questions, open an issue on the GitHub repository.

Key Improvements in Modernized Version

  • ✅ Python 3 compatible (MACS3 instead of MACS1/MACS2)
  • ✅ Modern Snakemake syntax (v7+)
  • ✅ Conda environments for reproducibility
  • ✅ Updated tool versions (samtools 1.19+, deepTools 3.5+)
  • ✅ SLURM profiles (replaces cluster.json)
  • ✅ Resource specifications in Snakefile
  • ✅ Improved error handling and logging
  • ✅ Better documentation

Enjoy your modernized ChIP-seq pipeline!