Improving processing and quality of DNA data for biodiversity research

Oscillatoria redekei bacteria, observed in Rambla del Puerto del Garruchal, Murcia, Spain. Credit: Vicente Franch Meneu

13 Sep 2021 - 17:43


  • ENA and the Global Biodiversity Information Facility set up automated processes for publishing better-organised data.
  • Sequencing is an important data feed for global biodiversity observation.
  • The ongoing ENA and GBIF collaboration will provide a robust body of observations for the scientific community working on biodiversity.

14 September, Cambridge – A collaboration between EMBL-EBI’s European Nucleotide Archive (ENA) and the Global Biodiversity Information Facility (GBIF) has established automated processes for publishing better organised, cleaner and more up-to-date datasets on GBIF.

These datasets reuse the globally comprehensive DNA sequence data that ENA and its partners, the National Center for Biotechnology Information (NCBI) and the DNA Data Bank of Japan, maintain in the International Nucleotide Sequence Database Collaboration (INSDC).

EMBL-EBI maintains ENA, which supplied the first DNA-derived dataset shared through GBIF in 2014. As a result of the recent collaboration, these records have been segmented into three different datasets containing sequence-based records, records associated with host organisms and records associated with environment sample identifiers.

An important data feed for biodiversity

"Sequencing is one of the most important data feeds for global biodiversity observation," said Guy Cochrane, head of ENA. "I am delighted that the GBIF and EMBL-EBI ENA teams are working together to extend and enhance the availability of comprehensive INSDC data through GBIF. Our continued work together on improving granularity and filtering of these data will provide an increasingly accurate and reliable body of openly available observations for the scientific community."

Many of the records coming from the EMBL-EBI datasets represent the sequences of specimens held in natural history institutions. Thanks to the clustering algorithm deployed last year and the inclusion of all specimen-related records from EMBL, many of these records now link to the originating museum records.

Find out more

Read the full announcement on the GBIF website.

This work complements EMBL-EBI and GBIF's earlier efforts to improve the connections between metagenomics and species occurrence data.


Contact the news team

Vicky Hatch | Communications Officer

Oana Stroe | Senior Communications Officer
