How Fractional AI & Airbyte 10x’d the Speed of Building Connectors

Who is Airbyte?

Airbyte is the leading open-source data integration engine that helps you consolidate your data in your warehouses, lakes, and databases.

Imagine you're an e-commerce company looking to combine Shopify sales data with Zendesk customer support data to better understand customer behavior. Airbyte allows you to easily set up a data pipeline to extract customer order data from Shopify, pull customer support tickets from Zendesk, and load all this data into your Snowflake data warehouse.

Extracting this data requires building API integrations (or “connectors”) with your data sources (e.g., Shopify, Zendesk). 

Problem: Building connectors is tedious and complicated

Airbyte already offers an impressive library of pre-built connectors, but there are thousands of connectors left to be built to support data connectivity across all data sources. Many of these are API integrations to SaaS products.

Ask anyone who has spent their day drudging through API documentation to build connectors if they'd like for someone else to handle it, and you'll get a resounding “Yes!”. You have to navigate lengthy API docs – all structured differently (see examples 1, 2, and 3), dig around to find the relevant details (How do I authenticate? How does pagination work for this API?), and then manually configure these and a dozen other fields. Beyond being time-consuming and complex, this process diverts technical talent from higher value work. 

Solution: Get AI to build API integrations for you

Airbyte engaged Fractional AI to help develop an AI-Powered Connector Builder, cutting down the time it takes to build a connector from hours to just a few minutes. Lowering the barrier to building connectors enables Airbyte to power even more data connectivity across more sources – in fact, Airbyte is already seeing a marked increase in the number of connectors in the wake of the AI Assist release.

This chart details the initial impact of AI Assist on the volume of connectors. You can check out more in Airbyte's Databytes recording.

Here’s how it works:
  1. The user inputs the URL for the API they’re trying to integrate with
  2. The AI Connector Builder crawls those API docs 
  3. The AI Connector Builder then pre-populates the connector-level fields – API base URL, authentication, etc.
  4. The AI Builder presents the full list of streams for that API (e.g., for Shopify, the streams might include “orders”, “fulfillments”, “discounts”)
  5. The user selects the streams of interest
  6. For each stream the user selects, the Builder pre-populates each field (pagination, URL path, etc.) for those streams into Airbyte’s connector builder UI (see the sketch after this list)
  7. The user is then able to review the AI’s draft and make any edits before finalizing the connector.  
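To make the output concrete, here is a heavily simplified, hypothetical sketch (in Python, purely for illustration – the real assistant fills in Airbyte’s connector builder UI, and the field names below are not Airbyte’s actual schema) of the kind of draft the AI pre-fills for review:

```python
# Hypothetical, simplified example of the fields the AI Assist drafts for review.
# Field names are illustrative, not Airbyte's actual connector schema.
draft_connector = {
    "base_url": "https://api.example.com/v1",
    "authentication": {
        "type": "api_key",
        "header": "X-API-Key",
    },
    "streams": {
        "orders": {
            "path": "/orders",
            "http_method": "GET",
            "primary_key": "id",
            "record_selector": ["data"],  # where records live in the response body
            "pagination": {"type": "cursor", "cursor_param": "page_token"},
        },
    },
}

# The user reviews and edits this draft in the UI before finalizing the connector.
```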

Check it out:

Building the end-to-end AI-Powered Connector Builder brought up a number of questions – from typical engineering considerations (e.g., How do we think about caching? Testing? Scalability?) to the broad range of AI questions necessary for production-ready AI features (e.g., How do we minimize hallucinations? How do we evaluate accuracy across connectors? How do we minimize model costs?). Read on for more detail on technical tradeoffs. 

Looking Under the Hood

Project Tools

  1. Models – We use GPT-4o. We explored 4o-mini and a fine-tuned version of 4o-mini as part of our core workflow, but ended up only using 4o-mini for integration tests. We also considered Claude but opted for GPT-4o because of its strict structured output capabilities.
  2. Tooling – We use OpenAI’s SDK to stitch together our prompts, LangSmith for observability and experimentation, and Jina and Firecrawl for crawling. We also use OpenAI’s built-in vector store where we need RAG, and Redis for caching and locking (see the caching sketch after this list).
  3. Data – We use the catalog of existing Airbyte connectors as benchmarks to measure the accuracy of the AI-powered Connector Builder and improve quality. While we don’t go into detail here, preparing the test data took significant effort (e.g., Were the benchmark connectors from the catalog fully up-to-date? Did they include all the relevant streams?) and should be a significant focus for any applied AI project.
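Caching matters because the same documentation pages get scraped and re-processed across flows and retries. As a rough illustration (not the production implementation), here is a minimal sketch of caching scraped pages in Redis, keyed by a hash of the URL, with a lock so concurrent requests don’t all hit the crawler:

```python
import hashlib
from typing import Callable

import redis

r = redis.Redis()  # assumes a reachable Redis instance

def cached_scrape(url: str, scrape_fn: Callable[[str], str], ttl_seconds: int = 24 * 3600) -> str:
    """Return cached page content for `url`, scraping (and caching) it on a miss."""
    key = "docs:" + hashlib.sha256(url.encode()).hexdigest()
    cached = r.get(key)
    if cached is not None:
        return cached.decode()

    # Lock so concurrent requests for the same URL don't all trigger a crawl.
    with r.lock("lock:" + key, timeout=60):
        cached = r.get(key)  # re-check after acquiring the lock
        if cached is not None:
            return cached.decode()
        content = scrape_fn(url)  # e.g., a Firecrawl or Jina call in the real system
        r.set(key, content, ex=ttl_seconds)
        return content
```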

How do you go from API docs to a working connector?

While the UI aims to make AI-assisted connector building as intuitive as possible, the AI-Powered Connector Builder is a highly complex product under the hood. The key question driving this build was: 

“How can we take a vast array of inputs (e.g., documentation for any API) and reliably generate an equally broad range of outputs (e.g., the configuration for any API integration)?”

The simplified workflow has 5 parts (a code sketch of the pipeline follows the list):
  1. Scrape the user-provided documentation page
  2. Use an LLM-powered crawling engine to find additional pages to scrape
  3. Convert HTML to markdown and remove as much noise as possible
  4. Extract the appropriate sections from the scraped pages and include them in carefully crafted, purpose-built prompts
  5. Translate LLM output into appropriate sections of connector definitions
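As a rough sketch of how those five parts might compose (the helper names and the naive HTML-to-markdown shortcut are ours, purely for illustration; the production pipeline is considerably more involved):

```python
import re
import requests

def scrape(url: str) -> str:
    """Steps 1–2: fetch a docs page (the real system uses Firecrawl/Jina plus LLM-guided crawling)."""
    return requests.get(url, timeout=30).text

def html_to_markdownish(html: str) -> str:
    """Step 3: crude noise removal – strip scripts, styles, and tags (a stand-in for real HTML-to-markdown)."""
    html = re.sub(r"<(script|style)[^>]*>.*?</\1>", "", html, flags=re.S | re.I)
    return re.sub(r"<[^>]+>", " ", html)

def extract_relevant_section(markdown: str, topic: str) -> str:
    """Step 4: naive stand-in for RAG / link-following – keep lines mentioning the topic."""
    return "\n".join(line for line in markdown.splitlines() if topic.lower() in line.lower())

def build_prompt(section: str, topic: str) -> str:
    """Step 4 (continued): wrap the extracted section in a purpose-built prompt."""
    return f"From the documentation below, extract the {topic} configuration.\n\n{section}"

def docs_to_connector_prompt(url: str, topic: str) -> str:
    """Steps 1–5 end to end; the LLM call and the translation into connector fields are elided here."""
    section = extract_relevant_section(html_to_markdownish(scrape(url)), topic)
    return build_prompt(section, topic)  # this prompt would then go to the LLM (step 5 maps its output)
```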

And, of course, there are edge cases. We use a different approach when an OpenAPI spec is provided, when the scraped docs don’t look right for a variety of reasons, or when we don’t find an answer in the section of the docs we’re looking at. 

Two things in particular complicated this workflow: 

Building unique variations of this flow across components of the API 

  • Overall, we have 7 unique flows:
    • 1 for authentication
    • 2 for pagination
    • 1 for finding streams (candidate stream names that show up in the autofill dropdown in the final product)
    • 1 for response structure - to identify parameters that we need to parse (record selector and primary key)
    • 1 for stream metadata (e.g. endpoint path and HTTP verb) 
    • 1 for determining the base URL of all streams

Ensuring compatibility with a large range of API docs

  • Developer docs are complicated and inconsistent – in structure, content, and quality. The relevant information is often a couple of sentences buried in a sea of pages and noise. 
  • Take these three examples – each requires a different approach to extract the information (a headless-browser sketch for dynamically loaded docs follows this list):
    • Confluence – Links don’t appear until you click on something
    • Drift – Crawlers don’t reliably wait long enough for the page source to appear
    • Gong – Uses custom components, and page content is dynamically loaded into view when you click on links
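For pages like these, where the content only exists after JavaScript runs or a click fires, one common workaround (not necessarily what the production crawlers do under the hood) is a headless browser that waits for the content to actually render before scraping. A minimal sketch with Playwright:

```python
# Minimal sketch: render a JavaScript-heavy docs page before scraping it.
# Assumes `playwright` is installed and browsers are set up (`playwright install`).
from playwright.sync_api import sync_playwright

def render_docs_page(url: str, wait_for_selector: str = "main") -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        # Wait for the docs content to actually appear instead of scraping too early.
        page.wait_for_selector(wait_for_selector, timeout=15_000)
        html = page.content()
        browser.close()
        return html
```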

Let’s take a look at a simplified version of the workflow just for authentication.

Here’s how we go from the URL to an API’s documentation to a populated connector spec for authentication: 

Step-by-step:

Step 1: Reading API Docs 

  • Challenges: a range of diverse docs, rate limits, latency, and difficulty flagging garbage output
  • Solution: a waterfall approach (sketched below):
    • We start with “is there an OpenAPI spec?” If yes, we pull authentication parameters directly from the OpenAPI spec.
    • If there is no OpenAPI spec, we have a lineup of two crawlers – Firecrawl and Jina – with built-in redundancy to address some of the challenges above.
    • Finally, if we are unable to extract the information using the OpenAPI spec, Firecrawl, or Jina, we use a combination of services like Serper and Perplexity as a last-ditch effort to find relevant information to feed into the later LLM calls.
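In code, the waterfall is essentially a chain of fallbacks. A simplified sketch (the fetcher functions are stand-ins for the real OpenAPI, Firecrawl, Jina, and Serper/Perplexity integrations):

```python
from typing import Callable, Optional

def get_auth_docs(url: str, fetchers: list[Callable[[str], Optional[str]]]) -> Optional[str]:
    """Try each source in order; the first one that returns usable content wins."""
    for fetch in fetchers:
        try:
            content = fetch(url)
        except Exception:
            continue  # rate limit, timeout, crawler error – fall through to the next source
        if content and looks_usable(content):
            return content
    return None  # nothing worked; fall back to manual entry in the UI

def looks_usable(content: str) -> bool:
    # Hypothetical sanity check to flag garbage output (empty pages, error pages, etc.).
    return len(content) > 500 and "<html" not in content.lower()

# Usage sketch – each fetcher name below is a stand-in for a real integration:
# docs = get_auth_docs(url, [fetch_openapi_spec, fetch_with_firecrawl, fetch_with_jina, search_with_serper])
```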

Step 2: Extracting Relevant Sections

  • Challenges: very large docs that exceed the context window, and identifying the authentication-relevant portion of the docs
  • Solution:
    • Big docs: We use OpenAI’s built-in RAG functionality to extract the sections of the documentation relating to authentication.
    • Small docs: We built a flow to i) extract links from the HTML, ii) ask an LLM which links look related to authentication, iii) scrape those pages, and iv) embed the content of those scraped pages into future prompts (sketched below).
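Here is a rough sketch of the small-docs flow, with the link-relevance question handled by a plain chat completion (the model choice, prompt, and helper names are illustrative, not the production code):

```python
import json
import re
from urllib.parse import urljoin

from openai import OpenAI

client = OpenAI()

def links_related_to_auth(html: str, base_url: str) -> list[str]:
    """i) extract links from the HTML, ii) ask an LLM which look related to authentication."""
    links = [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": 'Return a JSON object {"links": [...]} containing only the URLs that '
                       "likely document authentication for this API:\n" + "\n".join(links),
        }],
    )
    return json.loads(response.choices[0].message.content)["links"]

# iii) scrape the returned pages and iv) embed their content into the later
# authentication-extraction prompts (scraping and prompting omitted here for brevity).
```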

Step 3: Parsing and prompting the exact details from the HTML chunks

  • Challenges: coercing the LLM output into the exact format needed for the connector builder specification
  • Solution: We prompt with structured outputs to determine the authentication method, returned in the exact format needed to populate the connector builder (see the sketch below)
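A minimal sketch of that structured-output step, using the OpenAI SDK’s Pydantic-based parsing (the schema below is illustrative and far simpler than the real connector builder spec):

```python
from typing import Literal, Optional

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class AuthConfig(BaseModel):
    # Illustrative schema – the real connector builder spec is much richer than this.
    auth_type: Literal["api_key", "bearer", "basic", "oauth2", "no_auth"]
    header_name: Optional[str]             # e.g., "X-API-Key" for api_key auth
    token_refresh_endpoint: Optional[str]  # only relevant for oauth2

def extract_auth(docs_excerpt: str) -> AuthConfig:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Extract the API's authentication configuration."},
            {"role": "user", "content": docs_excerpt},
        ],
        response_format=AuthConfig,  # strict structured output coerces the reply into this schema
    )
    return completion.choices[0].message.parsed
```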

[Figure: extracting response structure for an endpoint from a messy documentation excerpt]

This is a simplified illustration of the workflow for just authentication – as you’ll see when you use the product, the AI connector builder autogenerates not only authentication but also:

  • Base URL
  • Pagination
  • Primary Key
  • Record Selection
  • Available streams
  • Stream configuration

And then, as with any other product build, we had to think about deployment, permissioning, testing, scalability, user experience… the list goes on!

And we love delighting the engineering team!

Key Takeaways

Supercharging developer productivity: There are many high-ROI places where the right AI applications can dramatically increase developer productivity.

Both an engineering and an AI problem: this project is a good reminder that the challenges of getting AI into production aren’t purely about wrangling LLMs. In this case, high-quality crawling – a problem as old as Google – posed a major challenge.

High-risk, high-reward projects: When we first connected with Airbyte less than 6 months ago, we didn’t even know whether an AI-powered connector builder accurate enough to be useful was possible. Starting with a POC helped us realize that a few months of investment could be a game-changer for the future of connector building.