Running Peakranger Bcp

So, I was interestd in running the software peakranger bcp recently, but stumbeled into a rather irritating probelm. I kept getting the classic message Segmentation fault (core dumped). Upon looking around for the solution to this problem I found nothing. So I thought I would document briefly what I did to fix it.

Initial Output

program version:          1.18
Data files:
 File format:             bam
 Sample file:             chip_rep1_BWA.sorted.bam
 Control file:            input_rep1_BWA.sorted.bam
Qualities:
 P value cut off:         0.0001
 sliding window size:     500
 Read extension length:   200
Output:
 Regions:                 test_region.bed
 HTML reports:            Disabled(--report not specified)

Reads statistics:

 Treatment reads +:       12002431
 Treatment reads -:       11987889
 Average read length:     30
 Control reads +:         11301505
 Control reads -:         11298535
 Average read length:     30
 Verifying reads...
Warning: B73V4_ctg162 only contains reads in the positive strand. The chromosome is removed.
Warning: B73V4_ctg162 only contains reads in the positive strand. The chromosome is removed.

Reads statistics after correction:

 Treatment reads +:       12002424
 Treatment reads -:       11987889
 Control reads +:         11301503
 Control reads -:         11298535

 Calling peaks...

Segmentation fault (core dumped)

Upon investigating this further, I decided to see if the issue was coming from the scaffold reported in the error, B73V4_ctg162. To see check this I first analysed how many reads were aligning to this contig in the maize genome using using a quick samtools command:

#Command:
samtools idxstats chip_B73_ear_H3K36me3_rep1_bowtie2algn.sorted.bam | grep B73V4_ctg162

#Output:
B73V4_ctg162    40634   0       0

So, there were ZERO reads aligning to this contig. Making me think the software might be chocking on this scaffold. But after removing this scaffold from the SAM file, I was still being greated by Segmentation fault (core dumped) . So, in order to try again I took a more drastic measure removed all scaffolds that had CTG in them. I did this after checking the gff3 file for all scaffolds/contigs, and coming to the conclusion that none of them are particularly gene rich and/or interesting for my questions. I removed ALL of these scaffolds using the following command:

grep -v "B73V4_ctg*" chip_rep1_bowtie2algn.sam  > chip_rep1_bowtie2algn.noctg_scaff.sam && grep -v "B73V4_ctg*" chip_rep1_bowtie2algn.sam  > chip_rep1_bowtie2algn.noctg_scaff.sam 

I eventuall re-ran this software using these generated SAM (I didn’t want to take the extra step of going from samtools view | grep _v back to sam) files, and was eventually rewarded with:

Final output:

program version:          1.18
Data files:
 File format:             sam
 Sample file:             chip_rep1_bowtie2algn.noctg_scaff.sam
 Control file:            chip_input_rep1_bowtie2algn.noctg_scaff.sam
Qualities:
 P value cut off:         0.0001
 sliding window size:     500
 Read extension length:   200
Output:
 Regions:                 test_sam_no162_region.bed
 HTML reports:            Disabled(--report not specified)

Reads statistics:

 Treatment reads +:       20131985
 Treatment reads -:       20126418
 Average read length:     76
 Control reads +:         15762557
 Control reads -:         15771271
 Average read length:     76
 Verifying reads...

Reads statistics after correction:

 Treatment reads +:       20131985
 Treatment reads -:       20126418
 Control reads +:         15762557
 Control reads -:         15771271

 Calling peaks...

Discovered 2693 regions in 1.
Discovered 1222 regions in 10.
Discovered 2065 regions in 2.
Discovered 1902 regions in 3.
Discovered 1879 regions in 4.
Discovered 2085 regions in 5.
Discovered 1538 regions in 6.
Discovered 1445 regions in 7.
Discovered 1649 regions in 8.
Discovered 1396 regions in 9.
Discovered 3    regions in Mt.
Discovered 0    regions in Pt.



Total regions discovered:       17877

Success! It appears that those nasty scaffolds/contigs were proving to be quite troublesome for this software. But, this isn’t an ideal solution. We’re still casting out genomic information that I would prefer to keep for analysis. So… There has to be a smarter way of doing this. Maybe pin pointing the exact scaffolds that were failing and why? But in my case, this got me the desired results I needed.