# Installation Guide

This guide will help you set up the modernized pyflow-ChIPseq pipeline on your system.
- Prerequisites
- Installation
- Reference Genome Setup
- Configuration
- Testing the Installation
- Troubleshooting
## Prerequisites

### Required Software

- Conda or Mamba (for environment management)
  - Miniconda: https://docs.conda.io/en/latest/miniconda.html
  - Mamba (faster):

    ```bash
    conda install -c conda-forge mamba
    ```

- Snakemake >= 7.0

  ```bash
  conda create -n snakemake -c bioconda -c conda-forge 'snakemake>=7.0'
  conda activate snakemake
  ```

- Git (for cloning the repository)

  ```bash
  # Usually pre-installed on Linux/Mac
  git --version
  ```
### Hardware Requirements

- Memory: Minimum 32 GB RAM recommended (alignment and peak calling are memory-intensive)
- Storage: Depends on dataset size
  - Raw FASTQ: ~10-50 GB per sample
  - BAM files: ~5-20 GB per sample
  - Final outputs: ~2-10 GB per sample
- CPU: Multi-core processor (8+ cores recommended for parallel processing)
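The per-sample figures above can be turned into a quick upper-bound capacity check before you provision disk space. A minimal sketch (the sample count is a placeholder to adjust):

```bash
# Back-of-the-envelope storage estimate from the per-sample upper bounds above
N_SAMPLES=10                      # placeholder: your number of samples
PER_SAMPLE_GB=$((50 + 20 + 10))   # raw FASTQ + BAM + final outputs (GB, worst case)
echo "Plan for up to $((N_SAMPLES * PER_SAMPLE_GB)) GB for $N_SAMPLES samples"
```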
## Installation

### Step 1: Clone the Repository

```bash
# Clone from GitHub
git clone https://github.com/crazyhottommy/pyflow-ChIPseq.git
cd pyflow-ChIPseq

# Checkout the modernized branch
git checkout modernize-2025
```

### Step 2: Set Up the Snakemake Environment

```bash
# Create a conda environment for Snakemake
conda create -n snakemake python=3.11 -y
conda activate snakemake

# Install Snakemake and mamba
conda install -c bioconda -c conda-forge snakemake mamba -y
```

### Step 3: Verify the Installation

```bash
# Check Snakemake version (should be >= 7.0)
snakemake --version

# Check conda/mamba
conda --version
mamba --version
```

## Reference Genome Setup

The pipeline requires a reference genome in FASTA format and BWA indices.
### Mouse (mm10)

```bash
# Create genome directory
mkdir -p genomes/mm10
cd genomes/mm10

# Download from UCSC
wget http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz
# or: curl -O http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz
gunzip mm10.fa.gz

# Build BWA index
conda activate snakemake
mamba install -c bioconda bwa
bwa index mm10.fa

cd ../..
```

### Human (hg38)

```bash
# Create genome directory
mkdir -p genomes/hg38
cd genomes/hg38

# Download from UCSC
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz

# Build BWA index
mamba install -c bioconda bwa
bwa index hg38.fa

cd ../..
```

### Using an Existing Genome

If you already have a reference genome:

```bash
# Just create a symbolic link
ln -s /path/to/existing/genome.fa genomes/genome.fa
```

## Configuration

### Create Sample Metadata

Create a tab-delimited file (meta.txt) describing your samples:
```
sample_name  fastq_name                    factor   reads
Sample1      Sample1_Input_R1.fastq.gz     Input    R1
Sample1      Sample1_Input_R2.fastq.gz     Input    R2
Sample1      Sample1_H3K27ac_R1.fastq.gz   H3K27ac  R1
Sample1      Sample1_H3K27ac_R2.fastq.gz   H3K27ac  R2
Sample2      Sample2_Input_R1.fastq.gz     Input    R1
Sample2      Sample2_Input_R2.fastq.gz     Input    R2
Sample2      Sample2_H3K4me3_R1.fastq.gz   H3K4me3  R1
Sample2      Sample2_H3K4me3_R2.fastq.gz   H3K4me3  R2
```
Column descriptions:

- `sample_name`: Biological sample identifier
- `fastq_name`: FASTQ file name (must match actual files)
- `factor`: Chromatin mark or transcription factor (use "Input" for controls)
- `reads`: R1 for forward reads, R2 for reverse reads (paired-end)

For single-end data, only use R1.
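A mismatch between `fastq_name` and the files on disk is a common source of "missing input" errors later, so it can be worth cross-checking the metadata before running anything. A minimal sketch, assuming meta.txt is tab-delimited with a header row and `fastq_name` in the second column (the `check_fastq_names` helper name is hypothetical, not part of the pipeline):

```bash
# check_fastq_names DIR META: print every fastq_name entry (column 2 of META,
# header line skipped) that has no matching file in DIR
check_fastq_names() {
    tail -n +2 "$2" | cut -f2 | while read -r f; do
        [ -e "$1/$f" ] || echo "MISSING: $f"
    done
}

# Example (adjust the path):
# check_fastq_names /path/to/fastq meta.txt
```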
### Generate samples.json

```bash
# Run the metadata converter
python sample2json.py /path/to/fastq/directory meta.txt

# This creates samples.json which the pipeline uses
```

### Edit config.yaml

Update the configuration file with your settings:
```yaml
# Adjust these settings for your experiment
from_fastq: True
paired_end: True
long_reads: True

# IMPORTANT: Update this path
ref_fa: genomes/mm10/mm10.fa

# Genome size for MACS3
macs_g: mm
macs2_g: mm

# Control sample name (must match your metadata)
control: 'Input'

# Peak calling stringency
macs_pvalue: 0.05
macs2_pvalue: 0.05
macs2_pvalue_broad: 0.1

# Downsampling
downsample: True
target_reads: 50000000

# Optional analyses
chromHMM: False
```

### Create Log Directories

```bash
mkdir -p logs/slurm
mkdir -p 00log
```

## Testing the Installation

### Dry Run

Test the workflow without actually running jobs:
```bash
# Activate snakemake environment
conda activate snakemake

# Dry run to check for errors
snakemake -n --use-conda
```

### Local Test Run

Run on a small subset of data locally:

```bash
# Run with 4 cores
snakemake --use-conda --cores 4 --keep-going

# Or limit to specific targets
snakemake --use-conda --cores 4 02fqc/Sample1_Input_R1_fastqc.zip
```

### Cluster Test

If you have access to a SLURM cluster:
```bash
# Dry run with SLURM profile
snakemake --profile profiles/slurm -n

# Full run on cluster
snakemake --profile profiles/slurm
```

## Running the Pipeline

### Local Execution

```bash
# Use all available cores
snakemake --use-conda --cores all

# Limit to 8 cores
snakemake --use-conda --cores 8
```

### Cluster Execution

```bash
# Using the SLURM profile (recommended)
snakemake --profile profiles/slurm

# Manual SLURM submission (legacy)
snakemake --use-conda --cluster "sbatch --mem={resources.mem_mb} --time={resources.runtime}" --jobs 100
```

### Re-running Jobs

```bash
# Rerun incomplete jobs
snakemake --use-conda --cores 8 --rerun-incomplete

# Force re-run of a specific rule
snakemake --use-conda --cores 8 --forcerun call_peaks_macs3_narrow
```

## Conda Environment Management

The pipeline automatically creates conda environments for each tool group. To pre-create them:
```bash
# Create all environments in advance (recommended)
snakemake --use-conda --conda-create-envs-only --cores 1

# This creates environments in .snakemake/conda/
```

To clean up conda environments:

```bash
# Remove all conda environments
snakemake --use-conda --cleanup-conda
```

## Troubleshooting

### samples.json not found

Solution: Run sample2json.py to generate it:
```bash
python sample2json.py /path/to/fastq meta.txt
```

### BWA index not found

Solution: Build the BWA index:

```bash
bwa index genomes/mm10/mm10.fa
```

### Conda environment creation is slow

Solution: Use mamba instead of conda:

```bash
conda install -c conda-forge mamba
# Then use --conda-frontend mamba with snakemake
```

### Out-of-memory errors

Solution: Increase memory in the rule's resources: directive in the Snakefile:

```
resources:
    mem_mb=32000  # Increase this value
```

### ChromHMM not found

Solution: ChromHMM is installed via its conda environment. If installing manually:

```bash
# Download ChromHMM
wget https://compbio.mit.edu/ChromHMM/ChromHMM.zip
unzip ChromHMM.zip
```

### run_spp.R cannot be downloaded

Solution: The pipeline auto-downloads run_spp.R. Ensure write permissions:

```bash
chmod +w .
```

## Next Steps

After successful installation:
- Prepare your data: Organize FASTQ files in a directory
- Create metadata: Write meta.txt describing your samples
- Configure pipeline: Edit config.yaml with your settings
- Generate samples.json: Run sample2json.py
- Dry run: Test with `snakemake -n`
- Execute: Run the pipeline!
## Getting Help

For issues and questions:
- GitHub Issues: https://github.com/crazyhottommy/pyflow-ChIPseq/issues
- Original README: See README.md for usage examples
- Snakemake Documentation: https://snakemake.readthedocs.io/
## What's New in the Modernized Pipeline

- ✅ Python 3 compatible (MACS3 instead of MACS1/MACS2)
- ✅ Modern Snakemake syntax (v7+)
- ✅ Conda environments for reproducibility
- ✅ Updated tool versions (samtools 1.19+, deepTools 3.5+)
- ✅ SLURM profiles (replaces cluster.json)
- ✅ Resource specifications in Snakefile
- ✅ Improved error handling and logging
- ✅ Better documentation
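For reference, a SLURM profile is a directory (here profiles/slurm/) containing a config.yaml of command-line defaults that replaces the old cluster.json. The sketch below is illustrative only: its keys mirror the manual sbatch command shown earlier in this guide, and the profile actually shipped with the repository may differ.

```yaml
# profiles/slurm/config.yaml (illustrative sketch, not the repo's exact file)
cluster: "sbatch --mem={resources.mem_mb} --time={resources.runtime} --cpus-per-task={threads}"
jobs: 100
use-conda: true
rerun-incomplete: true
printshellcmds: true
```

With this in place, `snakemake --profile profiles/slurm` expands each key as the corresponding command-line option.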
Enjoy your modernized ChIP-seq pipeline!