Tallinn University of Technology

Machine Learning Interatomic Potentials for Chemical Reactions

Predictive models are essential in early-stage drug development: they enable efficient screening of large compound libraries and early identification of candidates with favourable activity and properties. They help researchers guide lead discovery and lead optimisation, reducing reliance on costly laboratory experiments. They also promote more sustainable research by minimising waste, conserving resources, and reducing animal testing, aligning modern drug discovery with the principles of green chemistry.

Model types form a spectrum bounded by two clear extremes: purely empirical models built with statistical or machine learning approaches, and models based on physical principles, such as force field methods or ab initio calculations. In drug discovery, data are often scarce or highly variable, which makes it difficult to build reliable data-driven models with broad applicability. Conversely, physics-based approaches are time-consuming and impractical for screening large numbers of compounds.

The sweet spot on the spectrum is a hybrid approach that combines empirical data with faster, though less accurate, physics-based methods such as semi-empirical calculations. This strategy allows researchers to train models with fewer data points while leveraging physical principles to ensure broad applicability. As a result, it offers a practical balance between accuracy and scalability, making it especially valuable in early-stage drug discovery where rapid screening is essential.

In recent years, machine learning interatomic potentials (MLIPs) have become popular because they are faster than semi-empirical methods while being as accurate as the ab initio methods they have been trained on. However, MLIPs have often been trained on equilibrium or near-equilibrium geometries of neutral molecules, which causes significant inaccuracies when they are applied to ions, radicals or transition states.
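The extrapolation problem described above can be illustrated with a deliberately simple toy, which is not an MLIP. In the sketch below, a Morse curve stands in for the ab initio reference and a one-parameter harmonic fit stands in for a model trained only on near-equilibrium geometries. The parameter values (D_E, A, R_E) and all function names are hypothetical and chosen for illustration only.

```python
import math

# Hypothetical Morse parameters (illustrative only, not fitted to any real bond).
D_E = 4.5   # well depth, eV
A = 1.9     # width parameter, 1/angstrom
R_E = 1.09  # equilibrium bond length, angstrom


def morse(r):
    """Reference energy; a stand-in for an ab initio calculation."""
    return D_E * (1.0 - math.exp(-A * (r - R_E))) ** 2


def fit_harmonic(rs, es):
    """Least-squares fit of E = k * (r - R_E)**2 with a single parameter k."""
    num = sum(e * (r - R_E) ** 2 for r, e in zip(rs, es))
    den = sum((r - R_E) ** 4 for r in rs)
    return num / den


# "Training set": geometries within 0.05 angstrom of equilibrium.
train_r = [R_E + 0.01 * i for i in range(-5, 6)]
train_e = [morse(r) for r in train_r]
k = fit_harmonic(train_r, train_e)


def surrogate(r):
    """Toy model that has only ever seen near-equilibrium data."""
    return k * (r - R_E) ** 2


def abs_error(r):
    """Absolute deviation of the toy model from the reference, in eV."""
    return abs(surrogate(r) - morse(r))


# Largest error inside the training range versus error at a bond
# stretched by 1 angstrom, far outside the training range.
near_error = max(abs_error(r) for r in train_r)
stretched_error = abs_error(R_E + 1.0)

print(f"max error near equilibrium: {near_error:.4f} eV")
print(f"error at stretched bond:    {stretched_error:.2f} eV")
```

Under these assumptions, the toy model is accurate within the range it was fitted on and fails badly on the stretched bond, because the harmonic form cannot describe dissociation. Real MLIPs are far more flexible, but the same caution applies to geometries and species outside their training data.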

Predictive models that depend on such species and non-equilibrium geometries would greatly benefit from new MLIPs. The overarching goal of this research is to advance and expand existing MLIP approaches. The successful candidate will train MLIPs for radicals of drug-like molecules formed by hydrogen abstraction. The study will initially be limited to compounds containing the elements H, C, N and O, with the intention of subsequently expanding the set of elements. Once the MLIPs have been built, they will be integrated into models that predict metabolism mediated by the cytochrome P450 family of enzymes, the most important enzymes involved in xenobiotic metabolism. This integration ensures that the MLIPs have immediate practical relevance and are positioned to make a high impact in the scientific community, facilitating rapid adoption by domain experts.

Mario Öeren

Mario Öeren (ORCID ID: 0000-0003-4292-5557) obtained his PhD in computational chemistry from the Tallinn University of Technology (TalTech) in 2015. During his PhD, he collaborated with laboratories in both analytical and organic chemistry, focusing on the interpretation of complex IR and VCD spectra, and characterising weak non-covalent interactions governing the stability and specificity of host–guest complexes.

In 2017, Mario joined Optibrium (UK), a company that develops software for small molecule design, optimisation, and data analysis. At Optibrium, his research focuses on predictive metabolism models, combining data-driven QSAR and machine learning approaches with ligand-based methods that account for chemical reactivity and site accessibility.

In recent years, Mario has assisted his former group at TalTech in applying techniques similar to those used for predicting metabolism to the synthesis of novel host–guest systems.

His expertise lies primarily in computational chemistry, cheminformatics and machine learning. He is proficient in Python and has contributed to the development of Optibrium's software. He has co-authored 17 peer-reviewed publications and has presented his work at conferences around the world.

Mario has supervised six undergraduate projects, half of which resulted in peer-reviewed papers. In addition, he has co-supervised an industrial PhD student jointly affiliated with Optibrium and the University of Cambridge.

Current research focus: predictive models, machine learning in chemistry, computational chemistry and cheminformatics, metabolism, host–guest chemistry
Number of Publications: 17
Awards, memberships: Seal of Excellence from the European Commission, stipend for advancing the Estonian language in the field of chemistry

Matthew Segall (ORCID ID: 0000-0002-2105-6535) obtained a BA in Physics and an MSc in Computation from the University of Oxford, and a PhD in Theoretical Physics from the University of Cambridge, where he also held a Postdoctoral Research Associate position for approximately three years.

Matthew moved to industry in 2001, where, as Associate Director at Camitro (UK), ArQule Inc. and then Inpharmatica, he led a team developing predictive absorption, distribution, metabolism, and excretion (ADME) models and state-of-the-art intuitive decision-support and visualisation tools for drug discovery. In January 2006, he became responsible for the management of Inpharmatica's ADME business, including experimental ADME services and the StarDrop software platform. Following the acquisition of Inpharmatica, Matthew became Senior Director responsible for BioFocus DPI's ADMET division, and in 2009 he led a management buyout of the StarDrop business to found Optibrium, which develops software and AI solutions for small molecule design, optimisation and data analysis.

Matthew has 68 peer-reviewed publications with over 33,000 citations.

Matthew has co-supervised one PhD student and led an industry research group for over 20 years.

Toomas Tamm (ORCID ID: 0000-0002-5275-8580) has held professor-level and administrative roles at TalTech since 1998. His research has revolved around computational chemistry, primarily using density-functional theory methods. Over the years, he has applied these methods to electrochemical potentials, reaction equilibria and mechanisms, host–guest complexes, the interpretation of spectra, and other subjects. He has co-authored 45 peer-reviewed publications and has successfully supervised 4 PhD students, with 3 currently under supervision.

The candidate must be proficient in computational chemistry, cheminformatics, or a related field. Proficiency in at least one of data science, machine learning, or programming is also expected. Experience with Python and familiarity with libraries such as NumPy, pandas, and Matplotlib will be highly advantageous. Knowledge of Linux would also be beneficial.

The PhD candidate should be capable of presenting their research findings in group seminars, conferences, and peer-reviewed articles. Strong communication and writing skills are essential for conveying complex concepts to both specialists and broader audiences. The candidate should approach their research with openness and clarity, conveying developments as they evolve.