
In July, I started an exciting journey with Cisco via the acquisition of Armorblox. Armorblox was acquired by Cisco to further its AI-first Security Cloud by bringing generative AI experiences to Cisco’s security solutions. The transition was filled with excitement and a touch of nostalgia, as building Armorblox had been my focus for the past three years.

Soon, however, a new mission came my way: build generative AI assistants that help cybersecurity administrators quickly find the answers they seek, and therefore make their lives easier. This was an exciting mission, given the “magic” that Large Language Models (LLMs) are capable of and the rapid adoption of generative AI.

We started with the Cisco Firewall, building an AI Assistant that Firewall administrators can chat with in natural language. The AI Assistant can help with troubleshooting tasks such as locating policies, summarizing existing configurations, surfacing relevant documentation, and more.

Throughout this product development journey, I’ve encountered several challenges, and here, I aim to shed light on them.

1. The Evaluation Conundrum

The first and most obvious challenge has been evaluation of the model.

How do we know if these models are performing well?

There are several ways a model’s responses can be evaluated.

  • Automated Validation – using metrics computed automatically on AI responses without the need for any human review
  • Manual Validation – validating AI responses manually with human review
  • User Feedback Validation – signal directly from users or user proxies on model responses

Automated Validation

An innovative method proposed early on by the community was using LLMs to evaluate LLMs. This works wonders for generalized use cases, but can fall short when assessing models tailored for niche tasks. Niche use cases depend on unique or proprietary data that is inaccessible to standard models like GPT-4, which limits how reliably those models can judge the responses.

Alternatively, a precise Q&A set can pave the way for automated metrics, with or without an LLM. However, curating and bootstrapping such sets, especially ones demanding deep domain knowledge, can be a challenging task. And even with a perfect question and answer set, questions arise: Are these representative of user queries? How aligned are the golden answers with user expectations?
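To make this concrete, here’s a minimal sketch of what grading against a golden Q&A set with an LLM judge could look like. The judge model, the rubric wording, and the example golden entry are all illustrative assumptions, not our production setup.

```python
# Minimal sketch: grading assistant answers against a golden Q&A set
# with an LLM judge. Model name, rubric, and data are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

golden_set = [
    {
        "question": "Which access control rules in policy 'Branch-Edge' are set to block?",
        "golden_answer": "Rules 12 and 47 are configured with a Block action.",
    },
]

def judge(question: str, golden_answer: str, candidate_answer: str) -> bool:
    """Ask a judge model whether the candidate matches the golden answer."""
    prompt = (
        "You are grading an AI assistant for firewall administrators.\n"
        f"Question: {question}\n"
        f"Golden answer: {golden_answer}\n"
        f"Candidate answer: {candidate_answer}\n"
        "Reply with exactly PASS or FAIL."
    )
    response = client.chat.completions.create(
        model="gpt-4",  # judge model; an assumption for this sketch
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")

def evaluate(answer_fn) -> float:
    """Return the pass rate of the assistant under test on the golden set."""
    passes = sum(
        judge(item["question"], item["golden_answer"], answer_fn(item["question"]))
        for item in golden_set
    )
    return passes / len(golden_set)
```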

While automated metrics serve as a foundation, their reliability for specific use cases, especially in the initial phases, is debatable. However, as we accumulate more real user data that can be used for validation, the importance of automated metrics will grow. With real user questions, we can more appropriately benchmark against real use cases, and automated metrics become a stronger signal of a good model.

Manual Validation

Metrics based on manual validation have been particularly valuable early on. The first set of use cases for our AI assistant is aimed at making users more efficient, either by compiling and presenting data coherently or by making information more accessible. For example, a Firewall Administrator wants to quickly know which rules in a particular firewall policy are configured to block traffic, so they can consider making changes. Once the AI assistant summarizes their rule configuration, they want to know how to alter it. The AI assistant will give them guided steps to configure the policy as desired.

The information the assistant presents can be manually validated by our team. This has already given me insight into some of the hallucinations and poor assumptions the AI assistant makes.

Although manual metrics come with their own set of expenses, they can be more cost-effective than the creation of golden Q&A sets, which necessitate the involvement and expertise of domain specialists. It’s essential to strike a balance to ensure that the evaluation process remains both accurate and budget-friendly.

User Feedback Validation

Engaging domain experts as a proxy for real customers at pre-launch to test the AI assistant has proven invaluable. Their insights help develop tight feedback loops to improve the quality of responses.

Designing a seamless feedback mechanism for these busy experts is critical, so that they can provide as much information as possible on why responses are missing the mark. Instituting a regular team ritual to review and act on this feedback ensures continued alignment with expectations for the model’s responses.
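As an illustration of what such a mechanism might capture, here’s a hypothetical feedback record; the field names and reason codes are made up for this sketch, not our actual schema.

```python
# Hypothetical shape of a per-response feedback record; field names and
# reason codes are illustrative, not our production schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ResponseFeedback:
    conversation_id: str
    question: str
    assistant_answer: str
    rating: str                     # "thumbs_up" or "thumbs_down"
    reason: Optional[str] = None    # e.g. "hallucination", "wrong policy", "too verbose"
    comment: Optional[str] = None   # free-text detail from the reviewer
    reviewer: Optional[str] = None  # domain expert acting as a user proxy
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```

Keeping a structured reason code separate from the free-text comment makes it much easier to aggregate recurring failure modes during that review ritual.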

2. Prioritizing Initiatives based on Evaluation Gaps

Once evaluation gaps are identified, the immediate challenge lies in effectively addressing them and monitoring them through to resolution. User feedback and evaluation metrics often highlight many problem areas and errors at once. This naturally leads to the question: How do we prioritize and address these concerns?

Prioritizing the feedback we get is extremely important. The impact on the user experience and the potential loss of trust in the AI assistant are the core criteria for prioritization, along with the frequency of the issue.
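For illustration only, here’s a toy example of how impact and frequency could be folded into a single triage score; the categories, weights, and issues are invented for this sketch.

```python
# Toy triage score combining issue frequency with impact on user
# experience and trust; weights and categories are illustrative only.
IMPACT_WEIGHTS = {
    "annoyance": 1,          # formatting, verbosity
    "wrong_but_obvious": 3,  # the user can spot the mistake
    "trust_eroding": 8,      # confident hallucination, wrong config guidance
}

def triage_score(impact: str, weekly_frequency: int) -> int:
    """Higher score = fix sooner."""
    return IMPACT_WEIGHTS[impact] * weekly_frequency

issues = [
    ("hallucinated rule names in policy summaries", "trust_eroding", 4),
    ("overly verbose answers to simple lookups", "annoyance", 20),
]
for name, impact, freq in sorted(issues, key=lambda i: -triage_score(i[1], i[2])):
    print(f"{triage_score(impact, freq):>3}  {name}")
```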

The pathways for addressing evaluation gaps are varied, whether through prompt engineering, different models, or augmented model strategies like knowledge graphs. Given the plethora of options, it becomes imperative to lean on the expertise and insights of the ML experts on your team. Given the rapidly evolving landscape of generative AI, it’s also helpful to stay up to date with new research and best practices shared by the community. There are a number of newsletters and podcasts that I use to stay up to date with new developments. However, I’ve found that the most useful tool has been Twitter, where the generative AI community is particularly strong.

3. Striking a Balance: Latency, Cost, and Quality

In the early phases of LLM application development, the emphasis is primarily on ensuring high quality. Yet, as the solution evolves into a tangible, demoable product, latency, the amount of time it takes for a response to be returned to a user, becomes increasingly important. And when it’s time to introduce the product as generally available, striking a balance between delivering exceptional performance and managing costs is key.

In practice, balancing these is tricky. Take, for instance, building chat experiences for IT administrators. If the responses fall short of expectations, do we modify the system prompt to be more elaborate? Alternatively, do we shift our focus to fine-tuning, experimenting with different LLMs, embedding models, or expanding our data sources? Each adjustment cascades, impacting quality, latency, and cost, requiring a careful and data-informed approach.
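To keep those decisions data informed, it helps to measure each change rather than guess. Below is a minimal sketch of per-request latency and token tracking, assuming the OpenAI chat completions API; the model name and price figures are placeholders, not real rates.

```python
# Minimal sketch: record latency and token usage for each request so prompt
# or model changes can be compared on cost and speed, not just quality.
# Prices below are placeholders, not actual rates.
import time
from openai import OpenAI

client = OpenAI()
PRICE_PER_1K_INPUT = 0.01   # placeholder $/1K prompt tokens
PRICE_PER_1K_OUTPUT = 0.03  # placeholder $/1K completion tokens

def timed_completion(messages, model="gpt-4"):
    start = time.perf_counter()
    response = client.chat.completions.create(model=model, messages=messages)
    latency_s = time.perf_counter() - start
    usage = response.usage
    cost = (
        usage.prompt_tokens / 1000 * PRICE_PER_1K_INPUT
        + usage.completion_tokens / 1000 * PRICE_PER_1K_OUTPUT
    )
    return response.choices[0].message.content, {"latency_s": latency_s, "cost_usd": cost}
```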

Depending on the use case, you may find that users will accept additional latency in exchange for higher quality. Knowing the relative value your users place on each of these will help your team strike the right balance. For the sustained success of the project, it’s crucial for your team to monitor and optimize these three areas according to the tradeoffs your users deem acceptable.

The Future of LLM Applications

It’s been an exciting start to the journey of building products with LLMs and I can’t wait to learn more as we continue building and shipping awesome AI products.

It’s worth noting that my main experience has been with chat experiences using vector database retrieval augmented generation (RAG) and SQL agents. But with advancements on the horizon, I’m optimistic about the emergence of autonomous agents with access to multiple tools that can take actions for users.
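For readers new to the pattern, here’s a stripped-down sketch of vector-based RAG. The embedding model, the in-memory store, and the document snippets stand in for whatever vector database and data sources a real deployment would use.

```python
# Stripped-down RAG sketch: embed documents, retrieve by cosine similarity,
# and ground the answer in the retrieved context. In production a vector
# database replaces the in-memory list; snippets here are made up.
import numpy as np
from openai import OpenAI

client = OpenAI()

docs = [
    "Access control rules with a Block action drop matching traffic.",
    "Intrusion policies can be attached to allow rules for deep inspection.",
]

def embed(texts):
    result = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in result.data])

doc_vectors = embed(docs)

def answer(question: str) -> str:
    q_vec = embed([question])[0]
    sims = doc_vectors @ q_vec / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vec)
    )
    context = docs[int(np.argmax(sims))]  # top-1 retrieval for brevity
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```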

Recently, OpenAI released their Assistants API, which will enable developers to more easily tap the potential of LLMs to operate as agents with multiple tools and larger contexts. For a deeper dive into AI agents, check out this talk by Harrison Chase, the founder of LangChain, and this intriguing episode of the Latent Space podcast that explores the evolution and complexities of agents.
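As a rough sketch of the shape of that API (a beta interface at the time of writing, so details may shift), a hypothetical assistant using the built-in code interpreter tool might look like this; the instructions, model name, and prompt are placeholders.

```python
# Rough sketch of the OpenAI Assistants API (beta at the time of writing,
# so details may change); instructions and model name are placeholders.
import time
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    name="Firewall Helper (demo)",
    instructions="Answer questions about firewall policy configuration.",
    model="gpt-4-1106-preview",
    tools=[{"type": "code_interpreter"}],
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="Summarize rules that block traffic."
)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)

# Poll until the run finishes, then read the assistant's latest reply.
while run.status not in ("completed", "failed", "cancelled", "expired"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)
```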

Thanks for reading! If you have any comments or questions feel free to reach out.

You can follow my thoughts on X or connect with me on LinkedIn.


Authors

Ravi Bhandia

AI/ML Product Manager

Security Business Group