Creating a vaccine involves examining the structure of the virus and identifying its spike protein, which it uses to gain entry to the host’s cells. Antibodies that target the spike protein can block the virus from entering cells and so inhibit replication. In the case of the Covid-19 virus, the genetic sequence was released in January 2020, giving developers around the world a blueprint for their research.
Typically, drugs are developed by pharmaceutical companies analysing chemical compounds to assess drug-relevant properties such as absorption rate, metabolic stability and binding affinity.
Today’s commercially available anti-viral therapeutics target diseases such as influenza, hepatitis C, chickenpox, human papillomavirus and AIDS. The R&D needed to model properties for future anti-viral therapeutics is a significant part of the typical $2.5 billion cost of bringing a new drug to market.
Neural networks
To accelerate drug discovery, biotechnology companies are increasingly using artificial intelligence (AI). These systems are based on neural network architectures that are first pre-trained on large volumes of unlabelled data, then fine-tuned with supervised learning on smaller amounts of labelled data to make them useful for predictive drug development.
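In outline, the pattern looks like the sketch below: a small PyTorch model is first pre-trained on unlabelled token sequences (standing in for SMILES strings or protein sequences), then fine-tuned on a small labelled set to predict a measured property. The model, vocabulary and data are toy placeholders rather than any company’s actual pipeline.

# Minimal sketch of the pre-train / fine-tune pattern described above (PyTorch).
import torch
import torch.nn as nn

VOCAB, DIM = 64, 128                       # toy token vocabulary and embedding size

class Encoder(nn.Module):
    """Small transformer encoder standing in for a chemical language model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens):
        return self.encoder(self.embed(tokens)).mean(dim=1)    # pooled representation

# 1) Pre-training on unlabelled sequences with a crude masked-token objective.
encoder, mask_head = Encoder(), nn.Linear(DIM, VOCAB)
opt = torch.optim.Adam(list(encoder.parameters()) + list(mask_head.parameters()), lr=1e-3)
unlabelled = torch.randint(0, VOCAB, (256, 32))                # placeholder token ids
for batch in unlabelled.split(32):
    masked = batch.clone()
    masked[:, 0] = 0                                           # crudely mask one position
    loss = nn.functional.cross_entropy(mask_head(encoder(masked)), batch[:, 0])
    opt.zero_grad(); loss.backward(); opt.step()

# 2) Supervised fine-tuning on a small labelled set (e.g. a measured binding property).
prop_head = nn.Linear(DIM, 1)
opt = torch.optim.Adam(list(encoder.parameters()) + list(prop_head.parameters()), lr=1e-4)
labelled_x = torch.randint(0, VOCAB, (64, 32))                 # placeholder molecules
labelled_y = torch.randn(64, 1)                                # placeholder measurements
for x, y in zip(labelled_x.split(16), labelled_y.split(16)):
    loss = nn.functional.mse_loss(prop_head(encoder(x)), y)
    opt.zero_grad(); loss.backward(); opt.step()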
High performance computing (HPC) using GPUs can meet the compute-intensive demands of deep learning, for example when mining the many studies and papers in a particular research field.
NVIDIA’s Clara Discovery, for example, has GPU-accelerated libraries, software development kits (SDKs) and reference applications for AI-powered imaging and genomics.
Above: The A100 GPU from NVIDIA
It offers over 40 pre-trained models, more than a dozen libraries, AI application frameworks and reference applications. GlaxoSmithKline’s AI lab in the UK is using it, together with NVIDIA DGX A100 systems, to mine the pharmaceutical company’s imaging, genomic and genetic data sets in the search for new therapies.
The DGX A100 is built on NVIDIA’s A100 Tensor Core GPUs and can deliver the performance of 1,000 CPU servers, according to the company.
There is also the DGX SuperPOD, built from DGX A100 systems connected by NVIDIA’s HDR (high data rate) InfiniBand network switches. It has been developed in partnership with life sciences company Schrödinger to accelerate the speed and accuracy of its computational drug discovery software platform, which pharmaceutical companies use to perform physics-based simulations at the atomic level.
“Deep learning is proving to be essential to drug research and development,” said Rory Kelleher, Global Head of Business Development, NVIDIA Healthcare and Life Sciences. “From understanding the causal biology of disease to screening chemical space, to designing de-novo molecules with desired characteristics – the application of deep learning AI in drug discovery continues to expand and amaze.”
Schrödinger has recently implemented NVIDIA’s Multi-Instance GPU (MIG) feature, which allows users to split the A100 into several discrete GPU instances that can each run simulations in parallel to maximise overall throughput, explained Kelleher.
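The pattern can be sketched roughly as follows: each worker process is pinned to one MIG slice through the CUDA_VISIBLE_DEVICES environment variable. The sketch assumes the MIG instances have already been created on the A100 (for example with nvidia-smi) and that run_simulation.py is a hypothetical single-GPU simulation script; it illustrates the idea, not Schrödinger’s implementation.

# Minimal sketch of fanning simulations out across MIG slices of a single A100.
import os
import re
import subprocess

# nvidia-smi -L lists the physical GPUs and, underneath each, any MIG devices;
# the exact UUID format varies slightly between driver versions.
listing = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout
mig_uuids = re.findall(r"MIG-[0-9a-zA-Z/\-]+", listing)

jobs = ["ligand_batch_%03d.txt" % i for i in range(len(mig_uuids))]   # placeholder inputs

procs = []
for uuid, job in zip(mig_uuids, jobs):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=uuid)    # pin this worker to one MIG slice
    procs.append(subprocess.Popen(["python", "run_simulation.py", job], env=env))

for p in procs:                                          # wait for all slices to finish
    p.wait()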
At the University of Massachusetts in Boston, Symmetric Computing has developed a Virtual Drug Discovery Platform to accelerate the process of finding potential drugs for the treatment of diseases such as Alzheimer’s, Parkinson’s and diabetes.
Using Scripps Research’s AutoDock molecular modelling simulation software, the university’s Venture Development Center simulates protein-ligand docking (i.e., predicting binding behaviour) and screens the results with the NAMD molecular dynamics software to identify the compound with the most potential.
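A screening loop of this kind might look roughly like the sketch below, which calls the AutoDock Vina command-line tool for each candidate ligand and ranks compounds by predicted binding affinity. The receptor and ligand PDBQT file names and the search-box coordinates are placeholders, and the real platform distributes this work across GPU nodes.

# Rough sketch of a docking-and-rank loop around the AutoDock Vina command-line tool.
import glob
import subprocess

RECEPTOR = "target_protein.pdbqt"                  # prepared receptor structure (placeholder)
BOX = ["--center_x", "10", "--center_y", "12", "--center_z", "-5",
       "--size_x", "20", "--size_y", "20", "--size_z", "20"]

scores = {}
for ligand in glob.glob("ligands/*.pdbqt"):
    out = ligand.replace(".pdbqt", "_docked.pdbqt")
    subprocess.run(["vina", "--receptor", RECEPTOR, "--ligand", ligand,
                    "--out", out, "--exhaustiveness", "8"] + BOX, check=True)
    # Vina records each pose's predicted affinity (kcal/mol) on
    # "REMARK VINA RESULT: ..." lines in the output file; keep the best pose.
    with open(out) as f:
        affinities = [float(line.split()[3]) for line in f
                      if line.startswith("REMARK VINA RESULT")]
    scores[ligand] = min(affinities)               # more negative = stronger binding

# The most promising compounds go forward to molecular-dynamics checks (e.g. NAMD).
for ligand, score in sorted(scores.items(), key=lambda kv: kv[1])[:10]:
    print(f"{ligand}: {score:.1f} kcal/mol")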
The project also uses the TensorFlow open-source AI and machine learning software library.
Drug databases
“One of the challenges was the storage of data and lack of speed,” said Kimberly Stieglitz, Professor of Biotechnology and Chemistry at Roxbury Community College.
The project used a database of over 500 million molecules and 23,000 human proteins, both with 3D structures, along with data on all known drugs. Each compound had to be tested against all human proteins, not just those associated with Alzheimer’s, as a check for side effects – a process that could take years of laboratory-based testing.
Symmetric’s supercomputer consists of a dual-processor head node using two AMD EPYC 7601 32-core 2.7GHz CPUs with up to 4TB of memory. Three InfiniBand adapters connect it to three GPU compute nodes, each containing four Radeon Instinct MI25 accelerators managed by another AMD EPYC processor with 64GB of memory. The GPU compute nodes deliver up to 150TFLOPS of single-precision performance. Each EPYC processor-based head node supports up to 4TB of memory; with four head nodes, this provides up to 16TB of addressable memory. “This can dramatically improve performance,” said Richard Anderson, President and CTO of Symmetric Computing.
Above: The AMD EPYC server
“Thanks to this technology, we're able to identify the allosteric or control sites on a protein where small molecules bind,” said Stieglitz. “This used to be a very laborious process in the lab, but I’m now able to do it with NAMD, with speed and accuracy using the computational studies.”
“The work we are doing on developing our whole AI platform is completely dependent on the performance of the GPU and how fast it can talk to the CPUs, because that’s where all the machine learning training happens,” explained Anderson. “We hope to develop the AI techniques to limit the compound search space from 500 million to 50 million.”
“You’re not going to physically test 100 million chemicals . . . but with our platform, we can do the main work via the simulation.”
Anticipating the introduction of AMD’s 7nm EPYC processors, Anderson expects the AutoDock Vina software to double in performance.
“Right now, we're screening up to 400,000 compounds a day, but we have 500 million to get through so doing twice as many will make even more of a difference,” he enthused.
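A back-of-envelope calculation using the figures quoted shows how doubling the screening rate and shrinking the search space with AI would combine:

# Rough arithmetic based on the figures quoted above.
LIBRARY, REDUCED = 500_000_000, 50_000_000     # full library vs AI-reduced search space
RATE_NOW, RATE_2X = 400_000, 800_000           # compounds screened per day

for label, n, rate in [("today", LIBRARY, RATE_NOW),
                       ("doubled rate", LIBRARY, RATE_2X),
                       ("doubled rate + reduced space", REDUCED, RATE_2X)]:
    print(f"{label}: {n / rate:,.0f} days")    # roughly 1,250 -> 625 -> 62 days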
Deep learning
Another modelling system is AlphaFold, also based on a neural network, which uses deep learning to infer the 3D structure of a protein from its amino-acid sequence. Developed by DeepMind Technologies, the London-based subsidiary of Google’s parent company Alphabet, it was trained on 170,000 protein structures and can determine a protein’s structure in hours. As part of the efforts to combat the virus, it was used to predict structures of Covid-19 proteins within hours.
“AlphaFold inference can be run on a variety of generations of NVIDIA GPUs of Pascal, Volta, Turing and Ampere architecture,” said Kelleher. “Since the AlphaFold release, we’ve seen a community-driven effort of similar but uniquely different protein structure prediction models available in open source,” he added. “OpenFold, for example, is a faithful reproduction of AlphaFold that also provides a training pipeline so researchers can fine-tune the model to work with their own local or proprietary protein sequence databases.”
Another project, at Oak Ridge National Laboratory (ORNL), also using Scripps Research’s AutoDock, reduced years of work to hours by using accelerated computing to screen drug candidates against a protein target. The ORNL computer’s more than 27,000 NVIDIA GPUs helped the team to screen over 25,000 molecules per second and dock one billion compounds in less than 12 hours.
“Two application areas I’m particularly excited about are large language models [using question-and-answer and text summarisation techniques] and physics-informed neural networks,” said Kelleher.
For the first, once the fundamental protein, chemical and biological relationships have been “learned”, the trained models can be used to predict the properties or functions of molecules.
“Deep learning also has the ability to learn the laws of physics, and predict or project forward accurate simulations, approaching quantum level, at a fraction of the computational cost.
“These are just two examples of why this is going to be such an exciting space to watch in the coming years,” said Kelleher.
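For illustration, the second idea can be sketched in a few lines of PyTorch: a physics-informed neural network bakes the governing equation into the loss function, so the network learns an accurate solution without labelled simulation data. The toy equation here (du/dt = -u with u(0) = 1) is chosen purely for brevity; this is a didactic example of the technique, not a drug discovery model.

# Toy physics-informed neural network (PINN): the differential equation itself
# supplies the training signal, rather than labelled data.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(),
                    nn.Linear(32, 32), nn.Tanh(),
                    nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    t = torch.rand(128, 1, requires_grad=True)            # collocation points in [0, 1]
    u = net(t)
    du_dt = torch.autograd.grad(u, t, grad_outputs=torch.ones_like(u), create_graph=True)[0]
    residual = du_dt + u                                   # enforce du/dt = -u
    ic = net(torch.zeros(1, 1)) - 1.0                      # enforce initial condition u(0) = 1
    loss = (residual ** 2).mean() + (ic ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(net(torch.tensor([[1.0]])).item())                   # approaches exp(-1), i.e. ~0.37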