# Host genomics (HG) data processing {#host-genomics-data-processing}
Bioinformatic procedures for host genomics can be grouped into two main tasks: generation of the reference genome and resequencing of the genome. The first aims at creating a (often single) reference genome sequence for a species, breed or population, which is then used to generate genomic profiles for multiple individuals. Reference genome generation requires a considerably higher sequencing effort than resequencing, and multiple sequencing technologies (long-read, Hi-C, etc.) are often needed to resolve the structural complexities of most eukaryotic organisms. Genome resequencing, in contrast, usually relies on short-read sequencing, which provides sufficient information for calling nucleotide variations.
Overviews of both procedures are shown in the following chapters:
* **[Host reference genome](#host-reference-genome)**
* **[Host genome resequencing](#host-resequencing)**
## Host reference genome {#host-reference-genome}
Genomes of eukaryotic organisms are generally complex: they often carry multiple copies of each chromosome and contain duplications, repetitive sequences, mobile elements, etc. Consequently, generating a high-quality reference genome that represents all this complexity is a demanding effort that today requires merging multiple complementary molecular techniques. Although multiple genome assembly protocols exist, in this guidebook we focus on the one employed in the Vertebrate Genomes Project (VGP), the largest consortium aiming at generating animal reference genomes in a standardised way [@Rhie2021-cf]. The VGP assembly pipeline uses data generated by a variety of technologies, including PacBio HiFi reads, Bionano optical maps, and Hi-C chromatin interaction maps.
### Genome quality {#host-genome-quality}
Before advancing with genome generation procedures, it is important to acknowledge that reference genomes can differ in quality. Quality is measured with assembly statistics, such as the N50 and L90 metrics, which provide an overview of the contiguity and completeness of the genome. Based on these metrics, eukaryotic genome assemblies are usually categorised into three levels:
**Contig level**: Contig level refers to the lowest level of genome assembly, where the genome is fragmented into pieces called contigs. Contigs are contiguous sequences of DNA whose length can range from hundreds of base pairs to several megabases, depending on the data and organism. Contig-level genome assemblies lack information about the order and orientation of the contigs and may contain gaps between them.
**Scaffold level**: Scaffold level is the next level of genome assembly, where contigs are linked together using paired-end reads or other genomic information to form larger structures called scaffolds. Scaffolds provide information about the order and orientation of contigs but may still contain gaps between them.
**Chromosome level**: Chromosome level is the highest level of genome assembly, where the genome is fully assembled into chromosomes. Chromosome-level assemblies provide the most complete and accurate representation of the genome, with few gaps and accurate order and orientation of genomic elements. These assemblies typically require multiple sources of genomic information and sophisticated computational tools to produce.
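To make the metric concrete, the N50 (the contig length at which the sorted, cumulative contig lengths reach half of the total assembly size) can be computed from a plain list of contig lengths; `lengths.txt` is a hypothetical one-length-per-line file used for illustration:

\small
```{sh eval=FALSE}
# Compute N50 from a list of contig lengths (one per line):
# sort descending, then walk the cumulative sum until it
# reaches half of the total assembly length.
sort -nr lengths.txt | awk '
  { len[NR] = $1; total += $1 }
  END {
    half = total / 2; cum = 0
    for (i = 1; i <= NR; i++) {
      cum += len[i]
      if (cum >= half) { print "N50: " len[i]; exit }
    }
  }'
```
\normalsize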
### Genome profile analysis {#host-genome-profile-analysis}
Gathering metrics on genome properties before initiating a de novo genome assembly project is very helpful in setting expectations for the assembly. In the past, DNA flow cytometry was commonly used to estimate genome size, but computational approaches have become the preferred method in recent times [@Wang2020-fj]. Currently, genome profiling is based on k-mer frequency analysis, which not only provides information on the genome's complexity, such as its size and levels of heterozygosity and repeat content, but also on the quality of the data.
k-mer spectra can be generated with Meryl, which produces a k-mer profile by decomposing the sequencing reads into k-length substrings and counting the occurrence of each k-mer to determine its frequency.
\small
```{sh eval=FALSE}
# Create a k-mer database from the sequencing reads
meryl count k=31 threads=4 output reads.meryl \
  reads_1.fastq reads_2.fastq
#Generate a k-mer spectrum
meryl histogram reads.meryl > reads.hist
```
\normalsize
The k-mer histogram produced by Meryl can be used to deduce genome properties with the help of GenomeScope2. This tool utilises a nonlinear least-squares optimisation to fit a combination of negative binomial distributions, providing estimates for genome size, repetitiveness, and heterozygosity rates [@Ranallo-Benavidez2020-am].
\small
```{sh eval=FALSE}
./genomescope2.pl -k 31 -i reads.hist -o reads_genomescope
```
\normalsize
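As a quick sanity check on the GenomeScope2 output, genome size can be roughly approximated directly from the histogram as the total number of k-mers divided by the homozygous coverage peak. A minimal sketch, in which the peak value (25) and the error cutoff (5) are placeholders to be read off the actual spectrum:

\small
```{sh eval=FALSE}
# Rough genome-size estimate from the k-mer histogram
# (columns: multiplicity, count). peak=25 stands in for the
# homozygous coverage peak; multiplicities below 5 are
# skipped as likely sequencing errors.
awk -v peak=25 '$1 >= 5 { total += $1 * $2 }
  END { printf "Estimated genome size: %.0f bp\n", total / peak }' reads.hist
```
\normalsize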
### Genome assembly using hifiasm {#host-genome-assembly-hifiasm}
Hifiasm is a powerful de novo assembler specifically developed for PacBio HiFi reads. One of the key advantages of hifiasm is that it allows us to resolve near-identical, but not exactly identical, sequences, such as repeats and segmental duplications [@Cheng2021-nh]. Hifiasm can be run in multiple modes depending on data availability:
#### Solo mode {- #host-genome-assembly-hifiasm-solo}
The solo mode generates a pseudohaplotype assembly, resulting in a primary and an alternate assembly solely using HiFi reads.
#### Hi-C-phased mode {- #host-genome-assembly-hifiasm-Hi-C}
The Hi-C-phased mode generates a hap1 assembly and a hap2 assembly, which are phased using the Hi-C reads from the same individual.
#### Trio mode {- #host-genome-assembly-hifiasm-trio}
The trio mode requires PacBio HiFi long reads from the child and Illumina short reads from both parents to generate a maternal assembly and a paternal assembly, which are phased using the parental reads.
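The three modes can be invoked roughly as follows; this is a sketch in which file names, thread counts and k-mer parameters are placeholders to adapt:

\small
```{sh eval=FALSE}
# Solo mode: primary + alternate assembly from HiFi reads only
hifiasm -o asm_solo -t 32 hifi_reads.fq.gz

# Hi-C-phased mode: hap1/hap2 assemblies phased with Hi-C reads
hifiasm -o asm_hic -t 32 --h1 hic_1.fq.gz --h2 hic_2.fq.gz hifi_reads.fq.gz

# Trio mode: build parental k-mer databases with yak,
# then phase the child assembly with them
yak count -k31 -b37 -t16 -o pat.yak paternal_reads.fq.gz
yak count -k31 -b37 -t16 -o mat.yak maternal_reads.fq.gz
hifiasm -o asm_trio -t 32 -1 pat.yak -2 mat.yak hifi_reads.fq.gz
```
\normalsize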
### Assembly evaluation {#host-genome-assembly-evaluation}
Assemblies can be evaluated using a variety of approaches that assess different properties of the assembled genomes.
**gfastats**: gfastats can be used for summary statistics (e.g., contig count, N50, NG50).
**BUSCO**: BUSCO assesses genome completeness from an evolutionary perspective. BUSCO genes are expected to be present in single copy in the genomes of a particular clade, and their presence, absence, or duplication can help researchers determine whether an assembly is missing significant regions or has multiple copies, which may necessitate purging [@Simao2015-ex].
**Merqury**: Merqury performs a reference-free assessment of assembly completeness and phasing based on k-mers, comparing the k-mers in the reads to those found in the assemblies, as well as the copy number (CN) of each k-mer in the assemblies [@Rhie2020-hw].
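Hedged examples of these three evaluations; tool availability and the BUSCO lineage dataset are assumptions to adapt:

\small
```{sh eval=FALSE}
# Summary statistics (contig count, N50, total length, etc.)
gfastats assembly.fa

# Completeness against single-copy orthologues;
# vertebrata_odb10 is a placeholder lineage dataset
busco -i assembly.fa -l vertebrata_odb10 -m genome -o busco_out

# Reference-free k-mer completeness and consensus quality,
# using the read database built earlier with meryl
merqury.sh reads.meryl assembly.fa merqury_out
```
\normalsize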
### Assembly scaffolding {#host-genome-scaffolding}
The following step in the process is to assemble contigs into scaffolds, i.e., to connect contigs interspaced with gaps. While traditionally, this process has been performed using paired-end short-read data with long insert-sizes, the VGP pipeline currently scaffolds using two more advanced technologies: Bionano optical maps and Hi-C data.
#### Scaffolding using Bionano optical maps {- #host-genome-scaffolding-bionano}
Content to be added.
#### Scaffolding using Hi-C data {- #host-genome-scaffolding-hi-c}
Content to be added.
### Final genome evaluation {#host-genome-final-evaluation}
Content to be added.
:::: {.authorbox}
Contents of this section were created by [Antton Alberdi](#antton-alberdi).
::::
### Reference genome annotation {#host-genome-annotation}
Content to be added.
## Host genome resequencing {#host-resequencing}
Once a reference genome is available, short-read sequencing data can be used to generate single nucleotide polymorphism (SNP) data. Although multiple options exist, the pipeline below describes a typical workflow: Bowtie2 for read mapping, Picard for marking duplicates, and GATK for variant calling. The resulting SNP data can be used for a wide range of downstream analyses, such as identifying genetic variants associated with diseases, studying population genetics, and performing genome-wide association studies (GWAS). The pipeline is customisable and can be modified to suit the specific needs of the researcher, for instance by changing the parameters of the tools used or incorporating additional analysis steps.
The first step is to map the reads against the reference genome:
\small
```{sh eval=FALSE}
# reference_genome_index is built beforehand with bowtie2-build
bowtie2 -x reference_genome_index \
  -1 forward_reads.fq \
  -2 reverse_reads.fq \
  -S mapped_reads.sam
```
\normalsize
If the mapping output is saved as an uncompressed SAM file, it should be compressed to BAM and sorted for downstream analyses.
\small
```{sh eval=FALSE}
samtools view -bS mapped_reads.sam > mapped_reads.bam
samtools sort mapped_reads.bam -o sorted_mapped_reads.bam
```
\normalsize
Picard can then be used to mark duplicates in the sorted BAM file.
\small
```{sh eval=FALSE}
java -jar picard.jar MarkDuplicates \
INPUT=sorted_mapped_reads.bam \
OUTPUT=dedup_sorted_mapped_reads.bam \
METRICS_FILE=metrics.txt VALIDATION_STRINGENCY=LENIENT
```
\normalsize
The deduplicated BAM file must be indexed before starting the variant calling.
\small
```{sh eval=FALSE}
samtools index dedup_sorted_mapped_reads.bam
```
\normalsize
Indel realignment was a separate step in GATK3 (the IndelRealigner tool); it was removed from GATK4 because HaplotypeCaller performs local realignment internally during variant calling. If working with GATK3, the step looks as follows:
\small
```{sh eval=FALSE}
# GATK3 syntax; intervals.list is produced beforehand
# by RealignerTargetCreator
java -jar GenomeAnalysisTK.jar -T IndelRealigner \
  -R reference_genome.fa \
  -I dedup_sorted_mapped_reads.bam \
  -targetIntervals intervals.list \
  -o realigned_reads.bam
```
\normalsize
Then, base quality score recalibration (BQSR) is performed with GATK4's BaseRecalibrator, which builds a recalibration model from a set of known variant sites.
\small
```{sh eval=FALSE}
gatk --java-options "-Xmx4g" BaseRecalibrator \
  -R reference_genome.fa \
  -I realigned_reads.bam \
  --known-sites known_snps.vcf \
  -O recal_data.table
```
\normalsize
Subsequently, the recalibration model is applied to the reads with ApplyBQSR.
\small
```{sh eval=FALSE}
gatk --java-options "-Xmx4g" ApplyBQSR \
  -R reference_genome.fa \
  -I realigned_reads.bam \
  --bqsr-recal-file recal_data.table \
  -O recal_reads.bam
```
\normalsize
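The text above describes variant calling with GATK, so a minimal sketch of that final step using HaplotypeCaller may be useful; file names follow the steps above, and the reference is assumed to be indexed:

\small
```{sh eval=FALSE}
# Call variants from the recalibrated BAM; the reference must be
# indexed (samtools faidx) and have a sequence dictionary
# (gatk CreateSequenceDictionary) before running this step
gatk --java-options "-Xmx4g" HaplotypeCaller \
  -R reference_genome.fa \
  -I recal_reads.bam \
  -O variants.vcf.gz
```
\normalsize

For multi-sample studies, calling per-sample GVCFs (`-ERC GVCF`) followed by joint genotyping is the commonly recommended GATK workflow.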