How Fractional AI Partnered with Zapier to Reduce Hallucinations by Over 80%
Who is Zapier?
Zapier, the leader in easy automation, helps over 2 million businesses automate workflows across 7,000+ apps.
Building and maintaining integrations across an ever-growing catalog of 7,000+ apps is no small task, so Zapier developed their “spec gen system.” This system uses large language models (LLMs) to build, maintain, and expand these integrations – automatically generating OpenAPI specs from an API’s documentation.
The Challenge
Zapier’s spec gen system was impressive, giving the team a running start on building integrations, but the team wanted it to be even better.
What was holding them back was the lack of a clear read on the system’s performance. No LLM system is 100% accurate, and the team wanted to:
Understand the system's current accuracy (50%? 80%?)
Measure the impact of changes to the LLM pipeline (If an engineer tweaked a prompt, did the change improve or worsen the result? And by how much?)
Effectively direct experiments to improve performance (Where should the team be focusing its energy?)
Solution
Fractional AI partnered with Zapier to take on this challenge with a two-part solution:
Define robust LLM evaluations:
We developed a data-driven framework to measure the accuracy of each subcomponent in the spec gen system (e.g., how often the spec gen system hallucinated paths or how often an endpoint described on the webpage showed up in the OpenAPI spec).
Direct experiments based on these evaluations:
With a clear bar for measuring success, we were able to run iterative experiments to test which changes to model selection, prompting, or pipeline order made the spec gen system even better.
The result? Dramatic improvement across key metrics
Take the incidence of hallucinated paths – before, 26% of paths produced by the spec gen system were hallucinated. After Fractional AI’s improvements, less than 1% (!) of paths were hallucinated.
Similarly, we were able to improve the accuracy of automatically detecting field types by nearly 2x (“FieldTyperScorer”). Check out the full breakdown below.
A more reliable spec gen system means more integrations with Zapier and more time saved for engineers building integrations.
Read on for more detail on how we built these specific evals and which experiments yielded the most impact.
Looking under the hood
Project setup
Data – To create the test data, we combined a curated selection of existing OpenAPI specs and Postman collections from representative API docs and then manually reviewed and corrected them.
Models – We used GPT-4o and Claude Sonnet 3.5 for different prompts in the pipeline.
Tooling – We used Braintrust to build our evals and run experiments.
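To give a flavor of the tooling, here is a minimal sketch of how an eval can be wired up with Braintrust’s Python SDK. The project name, task stub, test case, and scorer below are illustrative placeholders, not Zapier’s production code:

```python
# Minimal Braintrust eval sketch; project name, data, task, and scorer are placeholders.
from braintrust import Eval

def spec_gen_task(api_doc_markdown: str) -> dict:
    # Stand-in for the real spec gen pipeline: API doc markdown in, OpenAPI spec out.
    return {"paths": {"/admin/api/{version}/customers.json": {"get": {}}}}

def correct_paths_score(input, output, expected) -> float:
    # Recall-style metric: what fraction of verified endpoint paths did the system generate?
    expected_paths = set(expected["paths"])
    generated_paths = set(output["paths"])
    return len(expected_paths & generated_paths) / len(expected_paths) if expected_paths else 1.0

Eval(
    "spec-gen-evals",  # hypothetical project name
    data=lambda: [
        {
            "input": "## Customers\nGET /admin/api/{version}/customers.json ...",
            "expected": {"paths": {"/admin/api/{version}/customers.json": {"get": {}}}},
        }
    ],
    task=spec_gen_task,
    scores=[correct_paths_score],
)
```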
Baseline: Understanding Zapier’s spec gen system
Zapier had built an impressive spec gen system that took as input the URL for an API’s documentation and output an OpenAPI spec for each endpoint of that API.
Here’s a simplified view of the pipeline:
Here are the key steps of this workflow:
Web Scraping - A web scraper fetches developer-facing API doc webpages, converts each page into a markdown file, and stores those files separately in the database.
List Endpoints - An LLM scans each markdown page and produces a list of short descriptions of the endpoints found on that page. These short descriptions are then saved in a database.
Extract Endpoint Documentation - An LLM takes the short description and the full markdown of that webpage and attempts to return just the relevant markdown.
Generate Endpoint Specs - An LLM takes the relevant markdown and all the other information and attempts to generate an OpenAPI specification.
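In code, the flow of these four steps looks roughly like the sketch below. All names are placeholders and each LLM call is stubbed out; this shows the shape of the pipeline, not Zapier’s implementation:

```python
# Illustrative shape of the spec gen pipeline; all names are placeholders and
# each LLM call is stubbed with a canned return value.

def scrape_docs(docs_url: str) -> list[str]:
    # Real system: scrape each developer-facing doc page and store its markdown separately.
    return ["## Customers\nGET /admin/api/{version}/customers.json ..."]

def list_endpoints(page_markdown: str) -> list[str]:
    # Real system: an LLM returns short descriptions of the endpoints found on the page.
    return ["List customers"]

def extract_endpoint_markdown(description: str, page_markdown: str) -> str:
    # Real system: an LLM returns just the markdown relevant to this endpoint.
    return page_markdown

def generate_endpoint_spec(description: str, relevant_markdown: str) -> dict:
    # Real system: an LLM turns the relevant markdown into an OpenAPI spec fragment.
    return {"paths": {"/admin/api/{version}/customers.json": {"get": {}}}}

def run_spec_gen(docs_url: str) -> list[dict]:
    specs = []
    for page in scrape_docs(docs_url):
        for description in list_endpoints(page):
            relevant = extract_endpoint_markdown(description, page)
            specs.append(generate_endpoint_spec(description, relevant))
    return specs
```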
Evaluations: What to measure?
There are many possible ways to measure the quality of an LLM’s output, so the first things we had to decide were:
What were the most important attributes to measure?
How do we measure them?
Ultimately, we landed on the following metrics:
Correctness of Endpoint Paths: For all API endpoints present in source web pages, what percentage of them were generated by the spec gen system?
Hallucinated Endpoint Paths: For all API endpoints generated by the spec gen system, what percentage of them were actually in the source web page?
Required Field Names: For required fields in request bodies, what percentage of them were generated by the spec gen system?
Required Field Types: For required fields in request bodies, what percentage of them were generated by the spec gen system with the correct types?
Field Names: For all fields in request bodies, what percentage of them were generated by the spec gen system?
Field Types: For all fields in request bodies, what percentage of them were generated by the spec gen system with the correct types?
Hallucinated Field Names: What percentage of fields in the generated specs existed in the original API definitions?
In addition to the top-level eval score, we also reported details of why each metric didn’t score 100% – for example, for field names, we reported the difference between expected and generated field names to give us color on where to improve.
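As an illustration, here are simplified Python sketches of two of these scorers, each returning both a score and the details of what differed. The production scorers run inside Braintrust and are more involved:

```python
# Illustrative scorers; simplified compared to the production Braintrust scorers.

def score_hallucinated_paths(generated_spec: dict, expected_spec: dict) -> dict:
    """Of the paths the system generated, what fraction actually exist in the verified spec?"""
    generated = set(generated_spec.get("paths", {}))
    expected = set(expected_spec.get("paths", {}))
    hallucinated = sorted(generated - expected)
    score = 1.0 if not generated else 1 - len(hallucinated) / len(generated)
    # Returning the offending paths alongside the score gives color on where to improve.
    return {"score": score, "hallucinated_paths": hallucinated}

def score_field_types(generated_fields: dict, expected_fields: dict) -> dict:
    """Of the expected request-body fields, what fraction were generated with the correct type?"""
    mismatches = {
        name: {"expected": expected_type, "generated": generated_fields.get(name)}
        for name, expected_type in expected_fields.items()
        if generated_fields.get(name) != expected_type
    }
    score = 1.0 if not expected_fields else 1 - len(mismatches) / len(expected_fields)
    return {"score": score, "mismatches": mismatches}
```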
Evaluations: How to get the data to measure against?
In order to determine how accurate an LLM’s output is, you need an accurate source of truth to compare it to – test data.
Part of the challenge here was assembling the test data. If we had unlimited perfect OpenAPI specs, then Zapier wouldn’t need to automate the creation of them with their spec gen system.
We built the test data in four steps:
Identifying a representative sample of API docs spanning the 5 most common API doc frameworks (Readme, Swagger, etc.)
Looking for already vetted OpenAPI specs / Postman collections for those API docs
Manually reviewing the JSON to fix any mistakes that might be in these files
Supplementing with output from Zapier’s spec gen system that had been manually reviewed and labeled for accuracy
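As a simplified sketch of what the vetted sources boil down to, a reviewed OpenAPI spec can be reduced to the ground-truth paths and required fields that the scorers compare against. This sketch assumes inline JSON request-body schemas; real specs need extra handling (e.g., $ref resolution, multiple content types):

```python
# Illustrative: reduce a vetted OpenAPI spec to the ground truth the scorers compare against.
# Assumes inline JSON request-body schemas; real specs need $ref resolution and more.

def ground_truth_from_spec(spec: dict) -> dict:
    truth = {"paths": sorted(spec.get("paths", {})), "required_fields": {}}
    for path, path_item in spec.get("paths", {}).items():
        for verb in ("get", "post", "put", "patch", "delete"):
            operation = path_item.get(verb)
            if not operation:
                continue
            schema = (
                operation.get("requestBody", {})
                .get("content", {})
                .get("application/json", {})
                .get("schema", {})
            )
            truth["required_fields"][f"{verb.upper()} {path}"] = schema.get("required", [])
    return truth
```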
Experimentation: So how did hallucinations go from 26% to less than 1%?
Now that we had metrics defined and test data assembled, we could use the evals to experiment and drive iterative improvements.
Let’s look at just some of the experiments we ran to improve the score on our “Hallucinated Endpoint Paths” metric – reminder, this metric looked at which API endpoints generated by the spec gen system were actually in the source web page.
For example, Shopify’s APIs don’t have a “users” endpoint, so if the spec gen system outputs “.../admin/api/{version}/users.json” as a component of the spec, this would be a hallucination. Shopify does have a “customers” endpoint, so the spec gen system outputting “…/admin/api/{version}/customers.json” would not be a hallucination.
When we got started, 26% of endpoint paths were hallucinated – through a bunch of experiments, we brought that down to below 1%.
What worked:
Prompt engineering:
The balance was in adding targeted prompt language to address error patterns in a specific subset of APIs without overgeneralizing and affecting APIs outside that subset.
For example, the system sometimes generated endpoints like index.json, but the correct format was index.{format}, where {format} could be either xml or json. By adding language instructing the model to use the more general case when there was a discrepancy between summary documentation and examples, we solved the issue without altering other system behaviors.
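Concretely, a fix like this amounts to appending a narrowly scoped instruction to the spec-generation prompt. The wording below is purely illustrative, not the production prompt:

```python
# Illustrative only; the production prompt wording differs.
BASE_SPEC_GEN_PROMPT = "Generate an OpenAPI spec for the endpoint described in the markdown below."

PATH_FORMAT_GUARDRAIL = (
    "If the summary documentation and the example requests disagree about a path "
    "segment (e.g., examples show `index.json` but the docs describe `index.{format}` "
    "where {format} may be `json` or `xml`), prefer the more general form from the "
    "documentation."
)

spec_gen_prompt = BASE_SPEC_GEN_PROMPT + "\n\n" + PATH_FORMAT_GUARDRAIL
```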
Sharing more context to subsequent prompts:
At first, we passed only shorthand names for endpoints (e.g., "Update a project brief") to later prompts in the pipeline. This led to issues where the system would select related but incorrect endpoints, like “Update a project.”
Once we changed the system to pass along more context to the later LLM calls (Name, HTTP Verb, URL), it correctly identified the part of the webpage from which to read the endpoint information. Using structured output from prior prompts made it easy to pass along this additional context.
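A sketch of what that richer hand-off can look like, using a Pydantic model for the structured output of the endpoint-listing step; the field names and example values are illustrative, not Zapier’s actual schema:

```python
from pydantic import BaseModel

# Illustrative structured output from the "List Endpoints" step; field names and
# example values are placeholders.
class EndpointDescriptor(BaseModel):
    name: str       # e.g., "Update a project brief"
    http_verb: str  # e.g., "PUT"
    url: str        # e.g., "/project_briefs/{id}"

def extraction_prompt(endpoint: EndpointDescriptor, page_markdown: str) -> str:
    # Downstream steps now see the verb and URL, not just the shorthand name, which
    # disambiguates "Update a project brief" from "Update a project".
    return (
        "Find the documentation for this endpoint and return only its markdown.\n"
        f"Name: {endpoint.name}\n"
        f"HTTP verb: {endpoint.http_verb}\n"
        f"URL: {endpoint.url}\n\n"
        f"Page:\n{page_markdown}"
    )
```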
Testing different models for different prompts in the pipeline:
The final pipeline uses GPT-4o and Claude Sonnet 3.5 for different steps.
Switching to GPT-4o as our default:
Before having robust evals, we couldn’t know if GPT-4o would degrade performance in a way that would damage the system.
Once we were able to actually measure performance, we saw that the cheaper GPT-4o improved results, giving us the confidence to switch.
Using Claude Sonnet 3.5 for key prompts:
We found that Sonnet worked better for extracting relevant text from documents, reducing errors.
For example, for API docs with multiple endpoints on a single page, the system previously got confused about the correct web address, but switching to Sonnet solved this issue.
What didn’t work as well
Introducing a prompt for the system to correct itself:
We tried a final prompt where we fed the LLM the original markdown and the generated API spec, asking for any corrections.
We found that this step did catch a few mistakes but created more false positives than it was worth.
All of these improvements were only possible by having visibility into the performance of the system. Otherwise, we’d be operating in the dark with no idea if an intervention was helping (e.g., Sonnet 3.5 switch) or actually hurting (e.g., self-correction prompt).
Other tidbits: reducing costs
Another balance we’re always trying to strike when building with LLMs is minimizing costs without sacrificing performance. Two interventions helped here:
Adding prompt caching saved ~25% (~$6) per run (sketched below)
Switching to GPT-4o
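On the prompt caching side, the idea is to mark the large, repeated portion of the prompt (instructions plus scraped documentation) as cacheable so later calls in a run reuse it at a discount. The case study doesn’t say which provider’s caching was used; the sketch below assumes Anthropic-style explicit cache breakpoints (OpenAI’s prompt caching is automatic and needs no code change), and the model id and prompt contents are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

page_markdown = "## Customers\nGET /admin/api/{version}/customers.json ..."  # placeholder docs

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model id
    max_tokens=2048,
    system=[
        {
            "type": "text",
            # The large, repeated context (instructions plus scraped docs) is marked
            # cacheable; in practice it must exceed the provider's minimum cacheable size.
            "text": "You extract OpenAPI specs from API documentation.\n\n" + page_markdown,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Generate the OpenAPI spec for: List customers"}],
)
print(response.content[0].text)
```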
Key Takeaways
The power of evals - you can’t improve an AI system unless you know what you’re measuring. That starts with a robust evaluation framework.
Just keep experimenting - The path to improved accuracy is rapid, iterative experimentation.
Don’t underestimate the burden of assembling good test data - a big part of the time spent on this project was assembling robust test data. You need to account for this in your applied AI projects.