Genome


Coconut (Cocos nucifera L.), an important source of vegetable oil, nutraceuticals, functional foods, and housing materials, provides raw materials for a range of industries that manufacture cosmetics, soaps, detergents, paints, varnishes, and emulsifiers, among other products. The palm plays a vital role in sustaining the farming systems of the fragile island and coastal ecosystems of the tropics. In this study, we present the genome of a dwarf coconut variety, “Chowghat Green Dwarf” (CGD), from India, which possesses enhanced resistance to root (wilt) disease. Using short reads from the Illumina HiSeq 4000 platform and long reads from the Pacific Biosciences RSII platform, we assembled a draft genome of 1.93 Gb.

The genome is distributed over 26,855 scaffolds, with 81.56% of the assembled genome present in scaffolds longer than 50 kb. About 77.29% of the genome is composed of transposable elements and repeats. Gene prediction yielded 57,660 transcripts. A total of 112 nucleotide-binding and leucine-rich repeat loci, belonging to six classes, were detected. We have also assembled and annotated the CGD chloroplast and mitochondrial genomes. The availability of the dwarf coconut genome should prove invaluable for deducing the origin of dwarf coconut cultivars, dissecting the genes that control plant habit and fruit color, and accelerating breeding for improved agronomic traits.

Keywords: Cocos nucifera, dwarf cultivar, disease resistance, de novo assembly, organellar genomes, agri-genomics, nutrigenomics


Coconut Genome Assembly And Annotations

  1. Primary Assembly:

    Tool used - MaSuRCA (v 0.4.1)
    Filtered Nanopore reads and Illumina reads were combined in a hybrid assembly strategy using MaSuRCA (v 0.4.1) to construct the primary contig assembly.

  2. Final Assembly:

    Tools used - YaHS (v 1.1), HiCUP (v 0.7.4), Purge_haplotigs (v 1.1.1)
    The primary assembly was merged with contigs from the previous assembly [doi: 10.1089/omi.2020.0147; Epub 2020 Nov 10] and then scaffolded with Hi-C data using YaHS (Yet another Hi-C scaffolding tool, version 1.1). The Hi-C data were processed using HiCUP. This was followed by reference-based scaffolding against the Chinese dwarf coconut genome [GWHBEBT00000000 from CNCB; paper: doi.org/10.1186/s13059-021-02522-9]. Redundancy was removed using Purge_haplotigs. Scaffolds shorter than 1000 bp were removed.
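
The final length filter can be illustrated with a minimal sketch. It is not the command used for this assembly; the function names `read_fasta` and `filter_short_scaffolds` are hypothetical, and it assumes that "shorter than 1000 bp" means sequences of exactly 1000 bp are kept.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_short_scaffolds(handle: TextIO, min_length: int = 1000) -> List[Tuple[str, str]]:
    """Keep scaffolds whose length is at least min_length (assumed threshold semantics)."""
    return [(name, seq) for name, seq in read_fasta(handle) if len(seq) >= min_length]


# Toy input: scaffold "b" (999 bp) is dropped, "a" (1200 bp) and "c" (1000 bp) are kept.
toy = io.StringIO(">a\n" + "A" * 600 + "\n" + "C" * 600 + "\n>b\n" + "G" * 999 + "\n>c\n" + "T" * 1000 + "\n")
kept = filter_short_scaffolds(toy)
print([name for name, _ in kept])
```

For a genome of this size, a streaming writer would be preferable to collecting sequences in a list; the list is used here only to keep the sketch short.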

  3. Validation

    1. Benchmarking Universal Single-copy Orthologs

      Tool used - BUSCO (v5.4.7)
      The final assembly was assessed for genome completeness with BUSCO using the embryophyta_odb10 single-copy ortholog dataset.

    2. Alignment:

      Tool used - Bowtie2 (v2.4.1)
      The raw WGS Illumina short reads were aligned to the genome to determine the percentage of reads that map to the assembly.
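
As a sketch of how the mapping percentage might be collected, Bowtie2 prints an alignment summary whose last line has the form "NN.NN% overall alignment rate". The helper below (the name `overall_alignment_rate` is hypothetical) extracts that value from saved summary text. The numbers in the example are invented for illustration and are not results from this assembly.

```python
import re
from typing import Optional

# Matches the final line of a Bowtie2 alignment summary.
_RATE = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)?)% overall alignment rate", re.MULTILINE)


def overall_alignment_rate(summary_text: str) -> Optional[float]:
    """Return the overall alignment rate in percent, or None if the line is absent."""
    match = _RATE.search(summary_text)
    return float(match.group(1)) if match else None


# Invented summary text, for illustration only.
example = (
    "100 reads; of these:\n"
    "  100 (100.00%) were unpaired; of these:\n"
    "97.25% overall alignment rate\n"
)
print(overall_alignment_rate(example))
```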

  4. Repeat Masking:

    Tools used - RepeatModeler (v2.0.4), LTR_retriever (v2.9.0), LTRharvest (v0.6.5), RECON (v1.08) and RepeatMasker (v4.1.5)
    1. RepeatModeler version 2.0.4 for de novo transposable element (TE) family identification and modeling.
    2. LTR_retriever version 2.9.0 and LTRharvest version 0.6.5 for LTR identification.
    3. RECON version 1.08 for classification of the identified repeat elements.
    4. RepeatMasker version 4.1.5 for masking the annotated repeat elements.
    Repeats constitute 81.64% of the coconut genome, that is, approximately 2.19 Gb of the 2.69 Gb assembly.
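
The repeat content in base pairs follows directly from the reported percentage and the total assembly length given in the statistics table (2,688,203,608 bp). A short check, with hypothetical variable names:

```python
ASSEMBLY_BP = 2_688_203_608   # total bases, from the assembly statistics table
REPEAT_FRACTION = 0.8164      # 81.64% repeats, as reported above


def repeat_bases(assembly_bp: int, fraction: float) -> int:
    """Number of bases covered by repeats, rounded to the nearest base."""
    return round(assembly_bp * fraction)


repeat_bp = repeat_bases(ASSEMBLY_BP, REPEAT_FRACTION)
print(f"{repeat_bp / 1e9:.2f} Gb of {ASSEMBLY_BP / 1e9:.2f} Gb")
```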

  5. Gene Prediction and Annotation:

    Tools used - GeneMark-EP (v4.71), MAKER2 (v3.01.03), AUGUSTUS, BLAST (v2.12), CD-HIT (v4.8.1)
    The MAKER pipeline was used for gene prediction in the draft coconut genome assembly. De novo assembled transcript sequences from RNA-seq of embryonic callus, endosperm, and leaf tissue were selected and merged using CD-HIT to obtain a non-redundant set of transcript sequences, which served as EST evidence from coconut (est in MAKER). CDS and protein sequences from oil palm were obtained from the RefSeq database (GCF_000442705.1) and used as EST evidence from an alternative organism (alt_est in MAKER) and as protein evidence, respectively.
    The predictions obtained from the MAKER pipeline were filtered to retain those with an AED score less than 0.5 and a predicted protein product longer than 100 amino acids.
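
The filtering rule can be sketched as follows. This is an illustration under stated assumptions rather than the script used here: the `Prediction` record and `filter_predictions` function are hypothetical, and the AED and protein length are assumed to have already been extracted from the MAKER output. Both thresholds are treated as strict inequalities, as the text states.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Prediction:
    """Hypothetical record for one predicted transcript."""
    transcript_id: str
    aed: float            # annotation edit distance, 0 (best) to 1 (worst)
    protein_length: int   # length of the predicted protein in amino acids


def filter_predictions(
    predictions: Iterable[Prediction],
    max_aed: float = 0.5,
    min_protein_length: int = 100,
) -> List[Prediction]:
    """Keep predictions with AED < max_aed and protein length > min_protein_length."""
    return [
        p for p in predictions
        if p.aed < max_aed and p.protein_length > min_protein_length
    ]


# Invented records, for illustration only.
candidates = [
    Prediction("t1", 0.10, 350),   # kept
    Prediction("t2", 0.50, 350),   # dropped: AED is not below 0.5
    Prediction("t3", 0.20, 100),   # dropped: protein is not longer than 100 aa
]
print([p.transcript_id for p in filter_predictions(candidates)])
```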

  6. Non-coding feature prediction:

    Tools used - tRNAscan-SE (v2.0), barrnap, Infernal
    tRNAs were predicted using tRNAscan-SE, retaining hits with an Infernal score above 20. rRNA prediction was carried out using barrnap with the eukaryote option. Other ncRNAs were inferred from the Infernal results.


Comparison of Final Assembly Statistics of the Genome of Chowghat Green Dwarf Variety Using Two Assemblers

Assembly features              MaSuRCA*
Total no. of contigs           1,930
Total no. of bases (bp)        2,688,203,608 (~2.7 Gb)
Min contig length (bp)         1,013
Max contig length (bp)         241,587,684 (~241.6 Mb)
Average contig length (bp)     1,392,851 (~1.4 Mb)
N50 contig size (bp)           174,886,800 (~174.9 Mb)
(G+C)%                         37.70%
#N's                           0.12%

*MaSuRCA, Maryland Super-Read Celera Assembler.
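
For readers unfamiliar with the N50 statistic in the table, the sketch below shows the usual definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The function name `n50` is hypothetical and the toy lengths are not from this assembly. The average contig length in the table is simply the total number of bases divided by the number of contigs.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig where the cumulative length reaches half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy example: total is 54, and 10 + 9 + 8 = 27 reaches half, so N50 is 8.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))

# Average contig length from the table values (integer division).
print(2_688_203_608 // 1930)
```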