Models that map DNA and protein sequences into deep embeddings have recently been developed. While their ability to improve prediction in downstream tasks has been demonstrated, a clear account of the advantages and disadvantages of the different embedding types, and of the different ways of applying them, is not yet available. In this paper we compare five models (one for DNA, four for proteins) and several embedding aggregation methods with respect to their ability to preserve evolutionary and functional information, using a hierarchical tree approach. Specifically, we introduce a novel procedure that builds hierarchical clustering trees to assess how the relative positions of sequences in the embedding latent space compare with the phylogenetic and functional similarities between those sequences. The methods are benchmarked on five datasets from various organisms. The ESM protein language model and DNABert emerge as the best performers in different settings.

Tolloso, M., Galfre, S.G., Pavone, A., Podda, M., Sirbu, A., Priami, C. (2024). How Much Do DNA and Protein Deep Embeddings Preserve Biological Information? In Computational Methods in Systems Biology: 22nd International Conference, CMSB 2024, Pisa, Italy, September 16–18, 2024, Proceedings (pp. 209-225). Springer Science and Business Media Deutschland GmbH [10.1007/978-3-031-71671-3_15].

How Much Do DNA and Protein Deep Embeddings Preserve Biological Information?

Pavone A.
2024-01-01

2024
Sector INFO-01/A - Computer Science
ISBN: 9783031716706
ISBN: 9783031716713
Files in this record:
articolo.pdf (open access; publisher's version; Adobe PDF, 1.09 MB)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/692040