Enabling faster drug discovery using LLMs and RAG
Kevin Ha, Head of Data Science, Genomics – December 17, 2024
Highlights
Retrieval Augmented Generation (RAG) is a versatile and powerful technique for enabling large language models (LLMs) to interact with external data sources.
RAG can enhance drug discovery pipelines by streamlining access to complex, proprietary information, making insights more accessible and actionable.
To build effective RAG applications, it is crucial to design and implement the correct strategies for data preparation, document retrieval, prompt engineering, and deployment.
At BioSymetrics, we are excited by how generative AI and LLMs are transforming drug discovery. Recently, we have been experimenting with LLMs and integrating them into our Elion drug discovery platform. Here, we share some of our insights into, and challenges with, using LLMs, in particular Retrieval Augmented Generation (RAG), to enhance user productivity for our products. We first introduce RAG and its benefits, then highlight some considerations for developing effective RAG applications.
“LLMs do not know everything, including recent events and proprietary data.”
Because they are trained on diverse data sources from the internet, LLMs can contain a vast array of knowledge. However, LLMs do not know everything: they lack knowledge of events after their training cutoff, as well as of proprietary data. This limitation can be addressed in two ways: fine-tuning (continuing to train the LLM on new data) or RAG. In this article, we focus on the latter approach.
RAG is an increasingly popular technique for enabling LLMs to interact with external data. To understand how RAG works, let us consider one common LLM use case: a user uploads a document that the LLM may not have knowledge of and writes a prompt that instructs the LLM to summarize the document. In this scenario, the document provides the additional context – relevant information or data – needed by the LLM to generate a summary.
RAG operates on a similar principle but extends it to more complex scenarios. Suppose the user's single document is instead stored in a database containing many documents. Given only the user's query, RAG coordinates the retrieval of relevant contexts from the database by identifying contexts that share similar meanings or concepts with the query, a property known as semantic similarity. The retrieved contexts are then appended to the query, forming a new contextualized prompt that reflects both the user's intent and the relevant external information. This prompt is fed to the LLM, which now has the information it needs to generate a tailored response.
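To make this concrete, here is a minimal sketch of the retrieve-then-augment loop in Python. It assumes the sentence-transformers library for embeddings and a handful of toy documents; the final llm.generate call is a placeholder for whichever model you deploy.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy in-memory document store; in practice the embeddings would live in a
# vector database.
documents = [
    "Compound X showed strong binding affinity to target Y in the 2024 assay.",
    "Zebrafish screens link gene Z to cardiac phenotypes.",
    "Meeting notes: prioritize target Y for the next screening round.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = model.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most semantically similar to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q  # cosine similarity, since vectors are unit-length
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

query = "What do we know about target Y?"
context = "\n".join(retrieve(query))
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}"
)
# response = llm.generate(prompt)  # hypothetical call to your chosen LLM
```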
“[An] exciting enhancement is integrating RAG into the Phenograph, our phenomics-driven knowledge graph that connects human clinical data with model system phenotypes.”
For organizations, RAG is a versatile approach that enables LLMs to interact with proprietary, domain-specific data via natural language queries. Some advantages include:
Data control: Organizations retain full control of their data. It does not get “absorbed” by the LLM, unlike with fine-tuning. However, fine-tuning can be a valid approach for certain use cases, such as building new foundation models.
Flexibility in LLM selection: Organizations can choose the most suitable LLM. The number of free and commercial models available continues to grow, and these models can be deployed locally or in the cloud. By considering the needs of the organization, such as data privacy, performance, and cost, the appropriate model(s) can be chosen and evaluated.
Support for diverse data types: In addition to text documents, RAG can handle a wide range of data types such as code, structured and unstructured data, and more.
At BioSymetrics, we have been experimenting with RAG to empower our products and services. For example, we have developed a RAG application to interact with and query information from our vast internal knowledge base that includes technical documentation, lab reports, meeting summaries, and more. Another exciting enhancement is integrating RAG into the Phenograph, our phenomics-driven knowledge graph that connects human clinical data with model system phenotypes. This application provides a RAG-driven LLM interface to seamlessly tap into the Phenograph’s rich data connections.
Current challenges and considerations in developing effective RAG applications
RAG has clear benefits, but there are still limitations and challenges to consider. In this section, we share a few insights on some of these challenges. This discussion is not exhaustive and only scratches the surface of RAG development. Moreover, given the rapid pace of generative AI advancements, some of the challenges highlighted below may one day be resolved by improved methodologies.
Data preparation
“Careful attention is needed to ensure that the source content is comprehensive, accurate, and contextually relevant.”
Data preparation plays a critical role in developing RAG applications. For RAG to be reliable, it must be able to retrieve the relevant contexts from the source documents. Careful attention is needed to ensure that the source content is comprehensive, accurate, and contextually relevant; sources that lack sufficient context will likely lead to poor RAG performance. To address this, it may be necessary to enrich the source content with additional metadata and information from other sources.
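As a simple illustration, the snippet below enriches a source record with provenance metadata before indexing. The field names are hypothetical and would depend on your own knowledge base schema.

```python
# A minimal sketch of enriching source records before indexing; the field
# names here are hypothetical placeholders.
def enrich(record: dict) -> dict:
    """Attach metadata so each indexed chunk carries its own context."""
    header = (
        f"Source: {record['source']} | Date: {record['date']} | "
        f"Project: {record['project']}"
    )
    return {
        "text": f"{header}\n{record['text']}",  # prepend provenance to the text
        "metadata": {k: record[k] for k in ("source", "date", "project")},
    }

doc = {
    "source": "lab_report_042.pdf",
    "date": "2024-06-01",
    "project": "target-Y",
    "text": "Assay results indicate dose-dependent inhibition...",
}
indexed = enrich(doc)
```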
Context retrieval
“Optimizing chunking and retrieval is [...] a vital component of building an effective RAG application.”
Document processing for retrieval is also an important component of RAG. A common approach is to split documents into smaller fragments called chunks. These chunks are then encoded into embeddings, numerical vector representations of their content, using an embedding model and indexed in a vector database. One of the main rationales for chunking documents into smaller pieces is to improve retrieval accuracy: with a well-chosen chunk size, each chunk ideally encapsulates a single meaning or concept. In practice, however, optimizing chunk size is difficult, as there are trade-offs between relevance and comprehensiveness, which we discuss next.
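The sketch below illustrates this pipeline: a basic fixed-size chunker with overlap, followed by embedding and indexing. It assumes sentence-transformers and FAISS, though any embedding model and vector database would work, and the sample report text is a stand-in for a real document.

```python
import faiss  # local vector index; any vector database would work
from sentence_transformers import SentenceTransformer

def chunk(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows; the overlap keeps
    sentences that straddle a boundary retrievable from either side."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# Stand-in for a long source document.
report = "Assay results indicate dose-dependent inhibition of target Y. " * 40
chunks = chunk(report)

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(chunks, normalize_embeddings=True)

index = faiss.IndexFlatIP(vectors.shape[1])  # inner product = cosine on unit vectors
index.add(vectors)  # chunk i corresponds to row i of the index
```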
The choice of chunk size depends on the type of documents being processed. Smaller chunks, which are more likely to capture a single concept, are easier to index and retrieve accurately. However, this approach may sacrifice the broader context present in the source document; in a long-form scientific report, for example, related concepts may be separated by long stretches of text. Without broader context retention, the LLM may have only a limited contextual understanding, potentially leading to sub-optimal performance.
You may wonder, then, why not simply store larger chunks, or even entire documents, instead? While this approach helps preserve broader context, larger chunks tend to contain noisier information, resulting in embeddings that are less specific and meaningful. As a result, recalling the contexts that best match a user's query becomes more challenging, which can lead to irrelevant or misleading LLM responses. Moreover, although newer LLMs support increasingly large context windows, retrieval accuracy has been shown to decline as context length grows.
Optimizing chunking and retrieval is therefore a vital component of building an effective RAG application. Different document formats may require tailored chunking strategies. For example, one can employ a hybrid strategy in which small chunks are retrieved first and then replaced by the larger context that encompasses them, as sketched below. This offers the best of both worlds: retrieval of small, specific contexts combined with the benefits of broader context. Iterative testing and refinement are necessary to find the right balance and strategy.
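Here is a minimal sketch of this small-to-large pattern, assuming each small chunk records the identifier of the parent section it was cut from; the section texts and ids are placeholders.

```python
from sentence_transformers import SentenceTransformer

# Parent sections and the small chunks cut from them (placeholder content).
parents = {
    "methods": "Full text of the Methods section ...",
    "results": "Full text of the Results section ...",
}
small_chunks = [
    {"text": "compounds were screened at 10 uM", "parent": "methods"},
    {"text": "dose-dependent inhibition was observed", "parent": "results"},
]

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vectors = model.encode([c["text"] for c in small_chunks],
                             normalize_embeddings=True)

def retrieve_parents(query: str, k: int = 2) -> list[str]:
    """Match the query against small chunks, then return their parent sections."""
    q = model.encode([query], normalize_embeddings=True)[0]
    ranked = (chunk_vectors @ q).argsort()[::-1][:k]
    seen, contexts = set(), []
    for i in ranked:  # swap each small hit for its broader parent section
        pid = small_chunks[i]["parent"]
        if pid not in seen:
            seen.add(pid)
            contexts.append(parents[pid])
    return contexts
```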
There is also ongoing research and development proposing new solutions to improve context retention. For example, GraphRAG is a hierarchical approach that processes the source documents into a knowledge graph, making it more effective at retrieving related contexts that are scattered across distant parts of the source data. As RAG methodologies continue to improve, we expect these advancements to enhance the accuracy and versatility of RAG applications, especially in drug discovery.
Prompt engineering
“[It is essential] to iteratively test and refine prompt templates to optimize performance and reliability.”
Prompt engineering refers to designing a natural-language prompt that elicits the desired output from an LLM. In RAG, this involves designing prompt templates with concise yet clear instructions that guide the LLM in generating a tailored response. Effective prompts often include placeholders for both the retrieved contexts and the user query, along with clear instructions for the LLM to prioritize or synthesize the key information that aligns with the user's intent. Furthermore, the same prompt template can yield varying results across different LLMs. This makes it essential, as with retrieval above, to iteratively test and refine prompt templates to optimize performance and reliability.
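For illustration, here is one possible template with placeholders for the retrieved context and the user question. The wording is a starting point to iterate on per model, not a canonical recipe.

```python
# One possible RAG prompt template; refine the instructions per LLM.
TEMPLATE = """You are an assistant for drug discovery research.
Answer the question using ONLY the context below.
If the context is insufficient, say so rather than guessing.

Context:
{context}

Question:
{question}

Answer:"""

retrieved = ["Compound X binds target Y with nanomolar affinity."]  # from retrieval
prompt = TEMPLATE.format(context="\n\n".join(retrieved),
                         question="How potent is compound X against target Y?")
```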
Deploying RAG applications
“To mitigate these risks, it is crucial to employ the latest security best practices when deploying RAG applications.”
When deploying RAG applications, two considerations worth mentioning are hardware infrastructure and security.
While LLMs can operate in a variety of environments, they perform best when paired with GPUs. Whether deploying on local GPU hardware or in the cloud, it is important to evaluate the RAG application's requirements, including computational needs and operational costs, to ensure optimal performance. Alternatively, cloud-based LLM services, where queries are handled by a third party, offer scalability and reduce the burden of maintaining hardware infrastructure. However, this approach requires careful consideration of data security and privacy, especially for proprietary information.
Despite the benefits of RAG in connecting LLMs with external, proprietary data, it introduces new attack surfaces that can be exploited via adversarial attacks. For example, with carefully crafted prompts, a malicious user might trick the LLM into executing destructive queries on the underlying database, such as deleting or modifying data, or into exposing sensitive information without authorization. To mitigate these risks, it is crucial to employ the latest security best practices when deploying RAG applications. These include, but are not limited to, implementing robust access controls, limiting database permissions, and validating generated queries before execution.
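As one illustrative guardrail, the sketch below validates LLM-generated SQL against a simple read-only allow-list before execution. This is one layer of defense in depth, not a complete solution, and should be paired with read-only database roles and access controls.

```python
import re

# Accept only plain read-only SELECT statements; reject anything that
# contains a data-modifying keyword. A coarse filter, not a full parser.
READ_ONLY = re.compile(r"^\s*SELECT\b", re.IGNORECASE)
FORBIDDEN = re.compile(r"\b(DELETE|DROP|UPDATE|INSERT|ALTER|TRUNCATE)\b",
                       re.IGNORECASE)

def is_safe_query(sql: str) -> bool:
    """Return True only for queries that look strictly read-only."""
    return bool(READ_ONLY.match(sql)) and not FORBIDDEN.search(sql)

assert is_safe_query("SELECT name FROM compounds WHERE target = 'Y'")
assert not is_safe_query("DELETE FROM compounds")
assert not is_safe_query("SELECT 1; DROP TABLE compounds")
```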
Summary
The field of generative AI is transforming drug discovery in many meaningful ways. Through RAG, LLMs can be used to interact with external data sources, enabling highly customized applications. This approach offers significant benefits, such as enhancing user productivity by streamlining access to complex information, making insights more accessible and actionable.
While RAG is a relatively recent advancement, it is important to account for its current limitations, as highlighted above. Nevertheless, effective RAG applications can be built by carefully considering the use case requirements and implementing the appropriate strategies.
At BioSymetrics, we continue to find ways to innovate in drug discovery through our Elion platform. By harnessing the power of LLMs, we are further accelerating the drug discovery process to generate novel and impactful insights.