🏕️

Hi-C data processing

For filtering valid Hi-C alignments, we use (previously called pairsamtools). Specifically, we use version 0.2.2 . The filtering workflow outputs a pairs file containing a list of valid contacts. This filtering workflow applies the following criteria: Reads marked as duplicates are removed. Full-length alignments that are unique are kept.

https://data.4dnucleome.org/resources/data-analysis/hi_c-processing-pipeline

4DN processing pipeline을 따라가자.

GitHub - 4dn-dcic/docker-4dn-hic at v43

This repo contains the source files for a docker image stored in duplexa/4dn-hic:v43. (we will change the docker hub account soon) git clone https://github.com/4dn-dcic/docker-4dn-hic cd docker-4dn-hic Major software tools used inside the docker container are downloaded by the script downloads.sh. This script also creates a symlink to a version-independent folder for each software tool.

https://github.com/4dn-dcic/docker-4dn-hic/tree/v43

https://micro-c.readthedocs.io/en/latest/contact_map.html

bwa-mem

bwa mem -SP5M -t<nthreads> <genome_index> <fastq1> <fastq2>
Shell
복사

•

-SP option is used to ensure the results are equivalent to that obtained by running bwa mem on each mate separately, while retaining the right formatting for paired-end reads. This option skips a step in bwa mem that forces alignment of a poorly aligned read given an alignment of its mate with the assumption that the two mates are part of a single genomic segment. 

◦

옵션은 각각

▪

-S 는 skip mate rescue

▪

-P는 skip pairing; mate rescue performed unless -S also in use

◦

를 나타낸다. 

◦

Mate rescue란? https://bioinformatics.stackexchange.com/questions/5514/what-is-mate-rescue-in-bwa-mem

▪

한 read는 제대로 매핑되고 다른 한 read는 잘 매핑이 되지 않았을 때, BWA는 unmapped mate에 대해서 Smith-Waterman algorithm을 수행하여 rescue를 시도함. -S는 이 mate rescue가 수행되지 않도록 함. (근데 어떤 read가 mate rescue에 의해 매핑되었는지는 결과로부터 알 수는 없다고 한다…)

▪

-SP 옵션을 같이 쓰면 결국 paired-end 데이터이지만 single-read 처럼 alignment를 수행한다고 보면 된다. 

•

-5 option is used to report the 5’ portion of chimeric alignments as the primary alignment. In Hi-C experiments, when a mate has chimeric alignments, typically, the 5’ portion is the position of interest, while the 3’ portion represents the same fragment as the mate. For chimeric alignments, bwa mem reports two alignments: one of them is annotated as primary and soft-clipped, retaining the full-length of the original sequence. The other end is annotated as hard-clipped and marked as either ‘supplementary’ or ‘secondary’. The -5 option forces the 5’ end to be always annotated as primary.

•

-M option is used to annotate the secondary/supplementary clipped reads as secondary rather than supplementary, for compatibility with some public software tools such as picard MarkDuplicates.

•

-t option is used for multi-threading and should not affect the result.

pairtools (previously pairsamtools)

pairtools

open2c

•

parse: read .sam files produced by bwa and form Hi-C pairs

•

sort: sort pairs files (the lexicographic order for chromosomes, the numeric order for the positions, the lexicographic order for pair types)

•

dedup: remove PCR duplicates from a sorted triu-flipped .pairs file

◦

주의 : markasdup 은 그냥 .pairs 파일 내에 존재하는 모든 pair의 type을 DD로 바꾸는 용도임. 

•

select: Select certain types of pairs based on their properties

◦

예) UU pair만 얻고 싶으면 pairtools select "mapq1>0 and mapq2>0" test.pairs.gz -o test.UU.pairs.gz 또는 pairtools select '(pair_type=="UU")' test.pairs.gz -o test.UU.pairs.gz

◦

예) UU or UC를 얻고 싶으면 pairtools select '(pair_type=="UU") or (pair_type=="UC")' test.pairs.gz -o test.UU_UC.pairs.gz

•

stats: Describe the types of distance properties of pairs

◦

pairtools stats test.pairs.gz -o test.stats

cooler

•

Cooler is the implementation of a data model for genomically-labeled sparse 2D arrays (matrices) with identical axes in HDF5. It is also the name of the Python package that supports the format.

→ Sparse한 2차원 배열을 다루기 위한 데이터 형식 + 패키지

•

Genomically-labeled 2D array란?

◦

Genomic bin의 쌍 각각에 값을 부여한 2차원 배열. 값이 0이거나 값이 없는 쌍을 제외하면 대부분 sparse한 경우가 많다.

•

파일 형식

◦

.cool : 단일 resolution의 genomically-labeled 2D array를 저장함.

◦

.mcool : multi-resolution genomically-labeld 2D array를 저장함. .cool 파일로부터 cooler zoomify 를 통해서 만들어짐.

Genomically-labeled array는 두가지 형식을 사용함.

•

BG2 (2차원 bedGraph라고 보면 됨)

◦

chrom1 / start1 / end1 / chrom2 / start2 / end2 / value

•

COO (Classic coordinate list)

◦

BG2에서 chrom-start-end triplet은 중복되는 경우가 많다. chrom-start-end interval → bin_id 로 유일하게 매핑해주는 다른 table을 하나 두고, 2개의 database를 maintain하는 방법이다.

bins → chrom / start / end 의 리스트

elements → bin1_id / bin2_id / value 의 리스트

Data model - 3개의 table을 이용하여 genomically-labeled sparse matrix를 나타냄. 위의 COO와 비슷한데, 세번째 table은 chromosome description table임.

chroms

Required columns: name[, length]

Order: enumeration

→ Chromosome의 semantic ordering. 이 순서대로 global genomic matrix의 순서가 구성된다. length 등의 추가적인 정보를 함께 나타낼 수 있다.

bins

Required columns: chrom, start, end[, weight]

Order: chrom (enum), start

→ An enumeration of the concatenated genomic bins that make up a single dimension or axis of the global genomic matrix. Bin size는 fixed일 수도 있고, variable size일 수도 있다. 순서는 chromosome enumeration 순서로 정렬되어야 한다.

pixels

Required columns: bin1_id, bin2_id, count

Order: bin1_id, bin2_id

→ Nonzero upper triangle element를 저장한 테이블. bin1_id, 그 다음 bin2_id 로 정렬된다.

Indexes

•

Compressed sparse row (CSR) sparse matrix representation.

pairtools 로 얻은 pair들을 cooler로 load하기

•

cooler cload pairs 이용.

•

cooler cload pairs [OPTIONS] BINS PAIRS_PATH COOL_PATH

cooler cload pairs \
	-c1 2 -p1 3 -c2 4 -p2 5 \
	-assembly hg19 \
	~/.local/share/genomes/hg38/hg38.fa.sizes:1000000 \
  test.nodups.UU.pairs.gz \
	test.hg38.1000000.cool
Python
복사

•

options:

◦

-c1 : chrom1 field number (one-based)

◦

-p1: pos1 field number (one-based)

◦

-c2: chrom2 field number (one-based)

◦

-p2: pos2 field number (one-based)

•

BINS 를 구성하는 방법

◦

<TEXT:INTEGER> 형식: TEXT = Path to a chromsizes file, INTEGER = Bin size in bp

◦

<TEXT> : Path to BED file defining the genomic bin segmentation.

•

4DN 파이프라인 보니까 cooler cload pairix 를 이용한다. (https://github.com/4dn-dcic/docker-4dn-hic/blob/v43/scripts/run-cooler.sh)

◦

Multithreading을 쓸 수 있는 것 같아서, 더 빠르게 처리 가능하지 않을까?

◦

똑같이 .cool 파일을 얻을 수 있는 건가?

◦

pairix로 얻은 .cool 파일은 뒤에 cooler zoomify를 통해서 multires cool 파일로 바꾸는 것 같음. https://github.com/4dn-dcic/docker-4dn-hic/blob/v43/scripts/run-cool2multirescool.sh

cooler zoomify

•

Generate a multi-resolution cooler file by coarsening.

•

Balancing (normalization 이라고 보면 됨) algorithm은 ICE (iterative correction and eigenvalue decomposition) 을 사용한다.

.pairsam 파일과 .pairs 파일의 차이?

https://pairtools.readthedocs.io/en/latest/formats.html

•

.pairs 파일 형식의 spec은 4DN Consortium 에서 정의하고 있음.

index	name	description
1	read_id	the ID of the read as defined in fastq files
2	chrom1	the chromosome of the alignment on side 1
3	pos1	the 1-based genomic position of the outer-most (5’) mapped bp on side 1
4	chrom2	the chromosome of the alignment on side 2
5	pos2	the 1-based genomic position of the outer-most (5’) mapped bp on side 2
6	strand1	the strand of the alignment on side 1
7	strand2	the strand of the alignment on side 2

•

(+) 여기에 pairtools 결과로 나온 .pairs 파일은 pair_type 컬럼이 추가된다는 특징이 있음.

index	name	description
8	pair_type	the type of a Hi-C pair

◦

pair type의 경우는 https://pairtools.readthedocs.io/en/latest/formats.html#pair-types 참조

•

.pairsam 포맷은 .pairs 포맷의 확장이라고 보면 된다. 1~8번째 컬럼까지는 같고, 두 개의 추가 컬럼을 사용한다.

index	name	description
9	sam1	the sam alignment(s) on side 1; separate supplemental alignments by NEXT_SAM
10	sam2	the sam alignment(s) on side 2; separate supplemental alignments by NEXT_SAM

◦

결국 끝에 개별 SAM entry 정보가 추가되는건데, 이미 SAM은 \t character를 구분자로 사용하고 있어서 .pairsam 파일의 구분자와 혼동된다. 따라서 SAM entry의 \t chraracter를 \031 UNIT SEPARATOR 로 치환하여 사용한다.

•

pairtools parse 옵션으로 --drop-seq --drop-sam 을 주게 되면 이 9/10번째 entry가 사라지면서 .pairsam 대신 일반적인 .pairs 포맷으로 변한다고 생각하면 될 듯.