tags: Bioinformatics Shengxin
Genome assembly can be roughly divided into three steps: (1) build contigs from the overlaps between sequences; (2) join contigs into scaffolds using second-generation mate-pair libraries or optical maps; (3) order and orient the scaffolds to obtain the final pseudo-chromosome-level genome.
Current third-generation sequencing assemblers can handle the first and second steps. There are four options for upgrading contigs/scaffolds to the pseudo-chromosome level: genetic maps, BioNano DLS optical maps, chromosomal homology with closely related species, and HiC. Among them, HiC is the simplest: it requires neither a high-quality DNA library nor a large population, and the results are more accurate and credible.
The schematic diagram of HiC library construction is shown below. What we need is the distance relationship between the two ends of the final paired-end reads.

At present, assembly software that uses HiC data includes LACHESIS, HiRise, SALSA1, 3D-DNA, etc. These tools perform well on animal genomes and simple plant genomes, but are not suitable for direct use on polyploid and highly heterozygous species. The main reason is that the similarity of allelic sequences creates false signals between contigs from different chromosome sets, so those contigs end up being joined by mistake. The ALLHiC pipeline, recently published in Nature Plants, is designed to solve the HiC assembly problem for polyploid and highly heterozygous genomes.
ALLHiC is divided into five steps (see the figure below, Zhang et al., 2019): pruning, partition, rescue, optimization, and building. The required input files are a BAM file of the aligned HiC data and an Allele.ctg.table.

The pruning step is the key step that distinguishes ALLHiC from other software, so I single it out here. In the figure, the red solid line is a potentially collapsed region (sequences so similar that the assembler did not separate them), and the solid lines in other colors are different haplotypes (which I marked with light gray ellipses). The pink dotted lines are HiC signals between alleles, and the black dotted lines are HiC signals between the collapsed and uncollapsed regions.
In this step, ALLHiC filters out the HiC signals between alleles in the BAM file according to the supplied Allele.ctg.table, and picks out the HiC signals between the collapsed and uncollapsed regions. The latter are used in the rescue step to assign unanchored contigs to the already grouped contigs.
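To make the pruning idea concrete, here is a minimal sketch of "drop links between allelic contigs". It is not ALLHiC's implementation and does not read the real Allele.ctg.table or BAM format: it assumes a simplified, hypothetical allele file in which each line lists contigs that are alleles of one another, and a hypothetical links file with the columns "contigA contigB count". The function name prune_allelic_links is made up for illustration.

```shell
# Sketch only: remove contig pairs that are alleles of each other.
# $1: hypothetical allele file, one group of allelic contigs per line
# $2: hypothetical links file, "contigA contigB count" per line
prune_allelic_links() {
    awk '
    NR == FNR {                          # first file: remember every allelic pair
        for (i = 1; i <= NF; i++)
            for (j = 1; j <= NF; j++)
                if (i != j) allelic[$i, $j] = 1
        next
    }
    !(($1, $2) in allelic)               # second file: keep non-allelic links only
    ' "$1" "$2"
}
```

Called as prune_allelic_links alleles.txt links.txt, it prints only the links whose two contigs never appear on the same line of the allele file.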

Installing ALLHiC is very simple. Out of habit, I install software under ~/opt/biosoft.
mkdir -p ~/opt/biosoft && cd ~/opt/biosoft
git clone https://github.com/tangerzhang/ALLHiC
cd ALLHiC
mv allhic.v0.9.8 bin/allhic
chmod +x bin/*
chmod +x scripts/*
# Add to environment variables
PATH=$HOME/opt/biosoft/ALLHiC/scripts/:$HOME/opt/biosoft/ALLHiC/bin/:$PATH
export PATH
In addition, ALLHiC depends on samtools (v1.9), bedtools, and matplotlib (v2.0) in a Python 3 environment, all of which can be installed in one step with conda.
conda create -y -n allhic python=3.7 samtools bedtools matplotlib
Then check whether the installation is successful
$ conda activate allhic
$ allhic -v
$ ALLHiC_prune
You may encounter the following error
ALLHiC_prune: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by ALLHiC_prune)
This happens because the system's C++ runtime library (libstdc++) is too old. Do not try to upgrade the system GLIBC to fix it (you cannot afford the consequences). conda ships a relatively new version of the dynamic library, so the problem can be solved as follows:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/opt/miniconda3/lib
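Instead of testing the commands one by one, a small helper can report everything that is still missing from PATH. This is only a convenience sketch; the function name missing_tools is my own, and the tool list is taken from the commands used in this post. bwa is included because the alignment step uses it, although the conda command above does not install it.

```shell
# Print the name of every tool that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Tools used in this post; no output means everything was found.
missing_tools allhic ALLHiC_prune ALLHiC_partition ALLHiC_rescue ALLHiC_build \
    ALLHiC_plot PreprocessSAMs.pl filterBAM_forHiC.pl samtools bedtools bwa
```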
Thanks to Zhang Xingtan for providing the test data. I converted the BAM back into the original FastQ format so that the walkthrough can start from the beginning. The data is on Baidu Netdisk, and you can find the link at the corresponding location on my blog.
Four input files are needed: the contig sequences, the allelic contig information, and the two paired-end read files.
$ ls -1
Allele.ctg.table
draft.asm.fasta
reads_R1.fastq.gz
reads_R2.fastq.gz
Step 1: create an index
samtools faidx draft.asm.fasta
bwa index -a bwtsw draft.asm.fasta
Step 2: sequence alignment. The rate-limiting step here is bwa sampe, because it has no multithreading option. If the data set is large, you can split the original FASTQ files first, align the pieces and run bwa sampe on each of them separately, and finally merge the results into a single file.
bwa aln -t 24 draft.asm.fasta reads_R1.fastq.gz > reads_R1.sai
bwa aln -t 24 draft.asm.fasta reads_R2.fastq.gz > reads_R2.sai
bwa sampe draft.asm.fasta reads_R1.sai reads_R2.sai reads_R1.fastq.gz reads_R2.fastq.gz > sample.bwa_aln.sam
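The split-and-merge idea mentioned for large data sets could look roughly like the sketch below. It assumes GNU split (for the -d and -a options); the function names, the chunk file names (R1.part.NNN, R2.part.NNN, aln.part.NNN.bam) and the THREADS default are hypothetical. R1 and R2 must be split with the same number of reads per chunk so that the pairs stay in sync. The per-chunk BAMs are concatenated with samtools cat, and the resulting BAM can be passed to PreprocessSAMs.pl in place of the SAM file.

```shell
THREADS=${THREADS:-4}   # hypothetical default; tune to your machine

# Split a gzipped FASTQ into chunks of N reads (4 lines per read).
split_fastq() {
    fq=$1; reads_per_chunk=$2; prefix=$3
    gzip -dc "$fq" | split -l "$((reads_per_chunk * 4))" -d -a 3 - "$prefix"
}

# Align one pair of chunks and write an unsorted BAM.
align_one() {
    ref=$1; r1=$2; r2=$3; out=$4
    bwa aln -t "$THREADS" "$ref" "$r1" > "$r1.sai"
    bwa aln -t "$THREADS" "$ref" "$r2" > "$r2.sai"
    bwa sampe "$ref" "$r1.sai" "$r2.sai" "$r1" "$r2" | samtools view -b -o "$out" -
}

# Run all chunk pairs in the background, then concatenate the results.
align_chunks() {
    ref=$1
    for r1 in R1.part.*; do
        suffix=${r1#R1.part.}
        align_one "$ref" "$r1" "R2.part.$suffix" "aln.part.$suffix.bam" &
    done
    wait
    samtools cat -o sample.bwa_aln.bam aln.part.*.bam
}

# Usage (not run here):
#   split_fastq reads_R1.fastq.gz 10000000 R1.part.
#   split_fastq reads_R2.fastq.gz 10000000 R2.part.
#   align_chunks draft.asm.fasta
```

On a cluster you would more likely submit each align_one call as its own job rather than run them all in the background on one machine.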
Step 3: SAM preprocessing, which removes redundant and low-quality signals to improve processing efficiency.
PreprocessSAMs.pl sample.bwa_aln.sam draft.asm.fasta MBOI
# If there is a BAM file
# PreprocessSAMs.pl sample.bwa_aln.bam draft.asm.fasta MBOI
filterBAM_forHiC.pl sample.bwa_aln.REduced.paired_only.bam sample.clean.sam
samtools view -bt draft.asm.fasta.fai sample.clean.sam > sample.clean.bam
Here, filterBAM_forHiC.pl keeps a read only if its mapping quality (MQ) is higher than 30, it is uniquely aligned (XT:A:U), its edit distance (NM) is lower than 5, its number of mismatches (XM) is lower than 4, and it has no more than 2 gaps (XO, XG).
Step 4 (optional): For polyploid or highly heterozygous genomes, the high sequence similarity between alleles makes false signals between different chromosome sets very likely, so it is necessary to build an Allele.ctg.table, which is used to filter out such false signals.
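To see what these criteria mean on actual SAM records, the awk sketch below applies them as they are worded in this post. It is not the code of filterBAM_forHiC.pl; the function name is hypothetical, the thresholds are read literally from the description (the real script may use slightly different comparisons), and tags that are absent are treated as 0.

```shell
# Sketch of the filter criteria on SAM text (headers are passed through).
filter_sam_sketch() {
    awk -F '\t' '
    /^@/ { print; next }                      # keep header lines
    {
        uniq = 0; nm = 0; xm = 0; xo = 0; xg = 0
        for (i = 12; i <= NF; i++) {          # optional tags start at column 12
            if ($i == "XT:A:U")      uniq = 1
            else if ($i ~ /^NM:i:/)  nm = substr($i, 6) + 0
            else if ($i ~ /^XM:i:/)  xm = substr($i, 6) + 0
            else if ($i ~ /^XO:i:/)  xo = substr($i, 6) + 0
            else if ($i ~ /^XG:i:/)  xg = substr($i, 6) + 0
        }
        # MAPQ > 30, unique hit, NM < 5, XM < 4, at most 2 gap opens/extensions
        if ($5 + 0 > 30 && uniq && nm < 5 && xm < 4 && xo <= 2 && xg <= 2) print
    }' "$@"
}
```

For example, samtools view -h sample.bwa_aln.REduced.paired_only.bam | filter_sam_sketch would print the header plus the records that pass.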
ALLHiC_prune -i Allele.ctg.table -b sample.clean.bam -r draft.asm.fasta
This step generates prunning.bam for subsequent analysis.
Step 5: Group the contigs according to HiC signals; the number of groups is controlled by -k. If you skipped step 4, you can use the result of step 3, sample.clean.bam, directly.
ALLHiC_partition -b prunning.bam -r draft.asm.fasta -e AAGCTT -k 16
This step generates a series of files whose names begin with prunning.
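If the same script should work both with and without the optional pruning step, the input BAM for the partition step can be chosen automatically. This is a small sketch with a hypothetical helper name; it only prints the command so you can inspect it before running it.

```shell
# Use prunning.bam when step 4 was run, otherwise fall back to sample.clean.bam.
pick_partition_bam() {
    if [ -s prunning.bam ]; then
        echo prunning.bam
    else
        echo sample.clean.bam
    fi
}

K=16                          # number of groups, as in the command above
BAM=$(pick_partition_bam)
echo "ALLHiC_partition -b $BAM -r draft.asm.fasta -e AAGCTT -k $K"
```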
Step 6: Assign unanchored contigs to existing groups.
ALLHiC_rescue -b sample.clean.bam -r draft.asm.fasta \
-c prunning.clusters.txt \
-i prunning.counts_AAGCTT.txt
This step generates groupYY.txt from the corresponding prunning.counts_AAGCTT.XXgYY.txt of the previous step.
Step 7: Optimize the order and orientation of the contigs in each group
# Generate .clm file
allhic extract sample.clean.bam draft.asm.fasta --RE AAGCTT
# Optimization
for i in group*.txt; do
    allhic optimize "$i" sample.clean.clm
done
This step generates a groupYY.tour for each groupYY.txt.
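Because the optimization runs once per group, it is worth checking that every group really produced a tour before moving on. A minimal sketch, with a hypothetical function name, relying only on the groupYY.txt and groupYY.tour naming described above:

```shell
# List every groupYY.txt that has no non-empty groupYY.tour next to it.
groups_without_tour() {
    for g in group*.txt; do
        [ -e "$g" ] || continue            # the glob matched nothing
        [ -s "${g%.txt}.tour" ] || echo "$g"
    done
}
```

An empty output from groups_without_tour means all groups were optimized.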
Step 8: Convert the tour format to fasta format and generate the corresponding agp.
ALLHiC_build draft.asm.fasta
This step generates two files, groups.asm.fasta and groups.agp. Among them, groups.asm.fasta is the result we need.
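A quick way to get a feel for the result is to count how many contigs and how many bases ended up in each group. The sketch below assumes that groups.agp follows the standard tab-separated AGP layout, in which contig lines have component type W in column 5 and the component start and end in columns 7 and 8; the function name is hypothetical.

```shell
# Print "group <TAB> number of contigs <TAB> total contig length" per group.
summarize_agp() {
    awk -F '\t' '
    /^#/ { next }                                   # skip comment lines
    $5 == "W" { n[$1]++; bp[$1] += $8 - $7 + 1 }    # contig lines only, gaps ignored
    END { for (g in n) printf "%s\t%d\t%d\n", g, n[g], bp[g] }
    ' "$1" | sort
}
```

Run it as summarize_agp groups.agp and compare the per-group totals with the expected chromosome sizes.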
Step 9: Construct a chromatin interaction matrix and evaluate the results based on the heat map
samtools faidx groups.asm.fasta
cut -f 1,2 groups.asm.fasta.fai > chrn.list
ALLHiC_plot sample.clean.bam groups.agp chrn.list 500k pdf

Several precautions for using ALLHiC:
The result depends on the input contigs: a mosaic (chimeric) ratio of ~10% or a collapse ratio of ~20% in the contigs is a concern. Therefore, it is best to use assembly software similar to Canu that can distinguish haplotypes.
Copyright Notice: Unless otherwise stated, all articles in this blog are licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).
