Artificial Intelligence and Pharmaceutical Chemistry

A study shows that transformer-based chemical language models can transform molecular cores and substituents into novel and diverse compounds. The generated molecules exhibit good synthetic accessibility and potential biological relevance, with significant implications for drug design.

Valentina Parrella

September 24, 2025

The use of artificial intelligence in drug discovery is entering a stage of maturity: no longer just a tool for data management or property prediction, but a truly generative instrument. A recent study published in the European Journal of Medicinal Chemistry by Lisa Piazza, Sanjana Srinivasan, Tiziano Tuccinardi, and Jürgen Bajorath demonstrates how Chemical Language Models (CLMs) can transform molecular fragments into structurally and topologically novel compounds, achieving a level of diversification that challenges conventional methods.

From natural language to chemistry

At the heart of this research lies a simple yet powerful idea: if transformers have revolutionized natural language processing, why not apply them to the “language of molecules”? Chemical compounds can be represented as strings (SMILES), that is, sequences of symbols encoding atoms and bonds. Just like the words in a sentence, molecular fragments can also be learned, combined, and transformed.
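To make the "language of molecules" idea concrete, here is a minimal sketch of how a SMILES string might be split into tokens before being fed to a language model. The regular expression and the function name `tokenize_smiles` are assumptions chosen for illustration; the pattern covers common SMILES syntax only, and it is not the tokenizer used in the study.

```python
import re

# Simplified pattern for common SMILES syntax (an illustrative assumption,
# not a complete SMILES grammar and not the authors' tokenizer).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"          # bracket atoms such as [NH3+] or [C@@H]
    r"|Br|Cl"              # two-letter atoms, matched before single letters
    r"|[BCNOPSFI]"         # one-letter aliphatic atoms
    r"|[bcnops]"           # aromatic atoms
    r"|%\d{2}"             # two-digit ring-closure labels
    r"|\d"                 # single-digit ring-closure labels
    r"|[=#$:/\\\-+().~*]"  # bonds, branches, disconnections, wildcard
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; raise if any character is skipped."""
    tokens = SMILES_TOKEN.findall(smiles)
    # findall silently skips unmatched characters, so verify full coverage.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Aspirin: every atom, bond, branch and ring label becomes one token.
    print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```

Each token plays the role that a word or subword plays in a sentence, which is what allows a transformer to treat a molecule as a sequence.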

The authors developed three models:

C model, trained exclusively on molecular cores (central scaffolds);
S model, trained on substituents (R-groups);
CS model, the most ambitious one, capable of integrating a core and two substituents without any prior information about their connection.
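The three setups differ mainly in which fragments the model receives as input. The sketch below shows one way such inputs could be assembled into a single source string for a sequence-to-sequence model. The separator token `<sep>`, the function `build_source`, and the example fragments are all hypothetical; the article does not describe the encoding the authors actually used.

```python
SEP = "<sep>"  # hypothetical separator token, not taken from the study


def build_source(core=None, substituents=()):
    """Join fragment SMILES into one input string.

    No attachment points or bonding information are included, mirroring the
    idea that the model is given fragments without connection rules.
    """
    parts = []
    if core:
        parts.append(core)
    parts.extend(substituents)
    if not parts:
        raise ValueError("at least one fragment is required")
    return SEP.join(parts)


if __name__ == "__main__":
    # Illustrative fragments only: a benzene core and two small substituents.
    print(build_source(core="c1ccccc1"))                        # C-style input
    print(build_source(substituents=["C(=O)O", "N"]))           # S-style input
    print(build_source(core="c1ccccc1", substituents=["C(=O)O", "N"]))  # CS-style input
```

In all three cases the target sequence would be the SMILES of a complete compound, so the model itself has to work out how the fragments fit together.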

Pushing beyond traditional boundaries

In traditional de novo design approaches, chemical constraints and bonding rules play a fundamental role. Here, however, the models were trained without any explicit connection rules, learning directly from the associations between molecular fragments and bioactive compounds extracted from the ChEMBL database.

The results are remarkable. The CS model achieved an 80% validity rate in generating compounds containing the input fragment combinations, with an unexpectedly high degree of diversification. Even the simpler models displayed noteworthy behavior: the C model was less productive, yet more than 70% of the scaffolds it generated had never been encountered in the training data, a level of chemical creativity that goes beyond mere imitation.

Novel structures and expanding chemical diversity

One of the most challenging aspects of molecular generation is distinguishing between mere reproduction and genuine innovation. In this study, the researchers demonstrated that the majority of the generated compounds were novel, with structures not present in the training set.

Analysis according to the Bemis–Murcko hierarchy revealed that more than 50% of the scaffolds and 30–40% of the carbon skeletons were original. This is therefore not just a case of cosmetic variations, but of new molecular topologies that concretely expand the chemical space available for exploration.
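The novelty percentages above boil down to a set comparison between generated and training structures. Here is a minimal sketch, assuming scaffolds have already been extracted and written as canonical strings so that identical structures compare equal. The function name and the example scaffolds are illustrative, and the paper's exact counting protocol may differ.

```python
def novelty_fraction(generated, training):
    """Fraction of unique generated scaffolds that are absent from training.

    Both arguments are iterables of canonical scaffold strings (assumed to
    be precomputed, for example with a cheminformatics toolkit).
    """
    unique_generated = set(generated)
    if not unique_generated:
        raise ValueError("no generated scaffolds supplied")
    novel = unique_generated - set(training)
    return len(novel) / len(unique_generated)


if __name__ == "__main__":
    # Toy data: benzene is known, pyridine and cyclohexane are new.
    generated = ["c1ccccc1", "c1ccncc1", "C1CCCCC1", "c1ccccc1"]
    training = ["c1ccccc1"]
    print(f"{novelty_fraction(generated, training):.0%} novel scaffolds")
```

The same function applies unchanged to carbon skeletons, the more abstract level of the hierarchy, provided they are also reduced to canonical strings first.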

Biological relevance and the bioactive landscape

An inevitable question is whether this creativity is pharmacologically useful. The data suggest that it is: by comparing the generated compounds with bioactive molecules in ChEMBL, the researchers identified thousands of structural analogues of active compounds, targeting roughly 1,300 to 1,600 different proteins. In other words, the models not only invent new molecules, but do so in regions of chemical space close to known bioactive compounds.

A particularly striking example is that of enzyme inhibitors reproduced almost exactly by the CS model, demonstrating the algorithm’s ability to capture implicit chemical rules that were never explicitly provided.

Synthetic accessibility and drug-likeness

A critical challenge for generative models concerns the practical feasibility of the molecules they produce. It would make little sense to propose structures that, while theoretically plausible, cannot be synthesized in the laboratory. To address this, the researchers calculated synthetic accessibility (SA) and drug-likeness (QED) scores, comparing them with real compounds from the ChEMBL database.

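The result is reassuring: the new candidates exhibited SA and QED values comparable to those of the ChEMBL reference compounds, with an average SA score of 2.44 (versus 2.73 for real compounds; the scale runs from 1 to 10, and lower values indicate easier synthesis) and an average QED of 0.56 for “chemical beauty” (versus 0.53; the scale runs from 0 to 1, and higher values indicate greater drug-likeness). In other words, the generated compounds are not only novel and diverse, but also credible as potential drug candidates.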

Strengths and limitations

The study presents several clear strengths:

it demonstrates the possibility of generating new molecules without predefined bonding constraints, relying solely on statistical learning;
it produces results relevant to fragment-based drug design, expanding the range of options for hit expansion and lead optimization;
it makes both the code and dataset publicly available, a significant contribution to the scientific community.

However, some limitations remain. The lower validity of the C model suggests that not all fragment-based approaches are equivalent and that combining multiple structural cues is crucial. Moreover, the actual biological activity of the generated compounds has yet to be demonstrated experimentally: the fact that they are analogues of known molecules is encouraging, but not sufficient.

Implications for pharmaceutical research

In a field where the average cost of developing a drug is often estimated at more than two billion dollars and timelines are increasingly compressed, tools capable of chemically navigating “empty spaces” hold enormous strategic potential. The approach proposed by Piazza and colleagues does not replace synthetic chemistry or pharmacological evaluation, but it can narrow the gap between existing data and new molecular hypotheses.

Moreover, the availability of open-source models and public datasets paves the way for a democratization of AI tools for drug discovery, with significant ethical and industrial implications — from accelerating academic research to enabling small biotech companies to access technologies that were once the exclusive domain of big pharma.

Conclusion

The study represents an important milestone in the convergence between artificial intelligence and pharmaceutical chemistry. If the molecular creativity of language models is confirmed in experimental phases, we could be witnessing a paradigm shift: no longer chemistry dictating rules to AI, but AI suggesting new chemistries to research.

As the authors emphasize, the method lends itself not only to the expansion of existing scaffolds but also to fragment-guided lead optimization, with the potential for immediate impact on library design and pharmaceutical innovation. The challenge now is to transform these computer-generated strings into real molecules — in the laboratory and, one day, in the clinic.

Reference: Lisa Piazza, Sanjana Srinivasan, Tiziano Tuccinardi, Jürgen Bajorath. Transforming molecular cores, substituents, and combinations into structurally diverse compounds using chemical language models. European Journal of Medicinal Chemistry, Volume 291, 2025, 117615, ISSN 0223-5234.