How Fractional AI unlocked the value of Sincera's product data

Who is Sincera?

Digital advertising runs on data. Sincera is building the metadata of the internet, decoding the complexity of digital advertising into clean, precise, and actionable data.

Problem: Massive amounts of messy, unstructured data

Clean, structured data is a game-changer. Messy, inconsistent, and unstructured data is a headache – for humans and computers.

Each month, Sincera observes millions of unstructured records that describe products and services. These records come from a variety of sources on the open internet, and as a result, each record might represent something slightly different: one record could describe a customer segment, like “Lipton purchaser,” while another could be a product description, like “pork tenderloin.” 

Here’s a look at the raw data for five of these records:

This data is nearly impossible to use without extensive and time-consuming standardization. However, if organized into a consistent taxonomy, the data could become a valuable asset – almost like recycling data scraps into something new and useful. 

That’s exactly what the Sincera team aimed to do: map the millions of monthly records to Shopify’s product taxonomy (a standard in adtech) to unlock more data utility.  

Here’s a snapshot of Shopify’s product taxonomy.

Achieving this goal is easier said than done. The full taxonomy has 10,000 categories and up to 7 levels of nested hierarchy – a far cry from the starting place of millions of inconsistent records. Without genAI, this feat would be cost-prohibitive, requiring thousands of human hours each month.

Solution: Use AI to unlock the value of this data

Sincera hired Fractional AI to make this massive, unstructured data stream usable. 

The result: each record is mapped to its corresponding Shopify category and paired with a confidence score for that categorization – all in real time and with accuracy consistently above 85%. More broadly, this monthly stream of messy data is now a valuable data asset.

To make this data usable, we built an AI categorization system using a multi-step LLM pipeline, where each record (or row in a CSV) is evaluated by several agents.

Here’s how each step works:

  1. CLASSIFY: determines whether a record corresponds to a brand, a segment, or is likely uncategorizable due to a lack of sufficient information.
  2. ENRICH_BRAND and ENRICH_SEGMENT: expand the audience name into a richer description.
    • For example, the segment “Pepsi purchasers” will be enriched with a description of the types of products or services “Pepsi” sells, since it was recognized as a brand.
    • This output is then used to query the vector space of the available categories, and the top neighbors are selected as candidates.
  3. DISCERN: decides which of the candidate categories returned by the vector search is the best fit.
  4. JUDGE: evaluates `DISCERN`'s decision based on the original data and the arguments given by each previous agent, providing an overall confidence measurement.
  5. FINAL OUTPUT: the chosen category along with the confidence score for that output (a minimal code sketch of the full flow follows below).
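To make the flow concrete, here's a minimal sketch of how the stages chain together. The function names, types, and placeholder bodies are ours for illustration – this is not Sincera's production code, and the real stages are LLM and vector-search calls rather than the stubs shown here.

```python
# Illustrative sketch of the multi-step categorization pipeline.
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class Classification:
    kind: Literal["brand", "segment", "uncategorizable"]

@dataclass
class Judgment:
    category: str       # Shopify taxonomy path, e.g. "Food > Candy > Candy Bars"
    confidence: float   # 0-1, produced by the JUDGE step

def classify(record: str) -> Classification:
    # CLASSIFY: brand, segment, or uncategorizable (placeholder for an LLM call)
    return Classification(kind="segment")

def enrich(record: str, kind: str) -> str:
    # ENRICH_BRAND / ENRICH_SEGMENT: expand the terse record into a richer description
    return f"Consumers who purchase products such as {record}."

def retrieve_candidates(enriched: str, k: int = 20) -> list[str]:
    # Query the vectorized taxonomy with the enriched text; return the top-k neighbors
    return ["Food > Candy > Candy Bars", "Food > Snack Foods"]

def discern(record: str, candidates: list[str]) -> str:
    # DISCERN: pick the single best candidate (placeholder for an LLM call)
    return candidates[0]

def judge(record: str, choice: str) -> Judgment:
    # JUDGE: score the decision given the record and the prior agents' reasoning
    return Judgment(category=choice, confidence=0.9)

def categorize(record: str) -> Optional[Judgment]:
    c = classify(record)
    if c.kind == "uncategorizable":
        return None
    enriched = enrich(record, c.kind)
    candidates = retrieve_candidates(enriched)
    return judge(record, discern(record, candidates))

print(categorize("Mars/Snickers/KitKat - Holiday purchaser"))
```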

Check out the “Looking under the hood” section below for more detail on this pipeline and the methodologies used.

Taking Sincera’s AI Engineering skills to the next level

While the primary goal was to build an AI system to normalize messy data, perhaps the most valuable outcome was the Sincera team's increased confidence in their own AI capabilities. 

By working closely together – through twice-weekly standups, AMAs, and deep dives into the reasoning behind each AI decision – the Sincera team became better equipped for future projects.

One area of particular focus was LLM evaluations. We worked with the Sincera team to show them how to build robust evals, which tooling to use (Braintrust), and how to iterate on these evals on their own (without us) in the future (more on evals here).
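For a flavor of what these evals look like, here's a minimal sketch using Braintrust's Python SDK. The project name, dataset, scorer, and stubbed task are illustrative – the real evals run the full pipeline against Sincera's labeled examples – and executing it requires a Braintrust API key.

```python
# Illustrative Braintrust eval for the categorizer.
from braintrust import Eval

def categorize(record: str) -> str:
    # Placeholder for the full pipeline described above.
    return "Food > Candy > Candy Bars"

def exact_category(input, output, expected):
    # Simple scorer: 1 if the predicted taxonomy path matches the label, else 0.
    return 1.0 if output == expected else 0.0

Eval(
    "audience-categorization",  # illustrative project name
    data=lambda: [
        {
            "input": "Mars/Snickers/KitKat - Holiday purchaser",
            "expected": "Food > Candy > Candy Bars",
        },
    ],
    task=categorize,
    scores=[exact_category],
)
```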

Looking Under the Hood

Project Tools

  1. Models – We experimented with Claude-3.5-Sonnet and Llama-3.1 and ended up using GPT-4o-mini.
  2. Tools – We used Braintrust for building LLM evaluations and running experiments, and we made extensive use of OpenAI’s Structured Output.
  3. Data – We take as input the monthly stream of 1-2M records. A challenge here is the lack of robust, hand-labeled ground-truth data to serve as training, test, or validation data.
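As an example of how Structured Outputs fits in, here's what the CLASSIFY step could look like with the OpenAI Python SDK. The prompt, schema, and sample record are illustrative assumptions, not the production versions.

```python
# Illustrative CLASSIFY step using OpenAI Structured Outputs.
from typing import Literal
from pydantic import BaseModel
from openai import OpenAI

class ClassifyResult(BaseModel):
    kind: Literal["brand", "segment", "uncategorizable"]
    reasoning: str

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "Decide whether this ad-audience record names a brand, a "
                       "customer segment, or lacks enough information to categorize.",
        },
        {"role": "user", "content": "Lipton purchaser"},
    ],
    response_format=ClassifyResult,
)
print(completion.choices[0].message.parsed)
```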

Why canonical approaches didn’t work & what we did about it 

You might be wondering, “isn’t data classification a solved problem?” or “aren’t there known methods that can help with this?”

The short answer: while this looks like a canonical problem, there are a few key reasons why conventional approaches didn’t work for this particular application. We’ll focus on two of them: 

  1. Very little ground truth data
  2. The records themselves were very short, lacking information, and didn’t reliably look like the output

Challenge: Very little ground truth data

Conventionally, this is a basic classification project and would be a good candidate for a fine-tuned model or traditional ML model.

Unfortunately, we had very little ground truth data (only 160 labeled examples). To fully leverage the benefits of fine-tuning in this case, you’d need roughly 20,000–40,000 labeled examples, distributed across the 10,000 potential categories.

This challenge is representative of most companies’ reality: most companies don’t have perfect data and accumulating labeled data isn’t always the best way to get to a desired outcome. For our purposes, it meant that we quickly ruled out fine-tuning as a suitable technique and focused more on a modified RAG workflow. 

Challenge: The records themselves were very short, lacking information, and didn’t reliably look like the output

The typical RAG (Retrieval-Augmented Generation) workflow involves breaking down content into smaller chunks, vectorizing these chunks, and then comparing a given input’s vector to find the most similar chunks amongst the vectorized content. These retrieved chunks are used as context for generating responses, making the system more accurate and relevant.
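Mechanically, the retrieval half of that workflow boils down to embedding the content once, embedding each query, and taking the nearest neighbors. Here's a minimal sketch where the vectorized content is the taxonomy itself; the embedding model, category list, and query text are illustrative assumptions.

```python
# Illustrative retrieval step: embed taxonomy entries, then find nearest neighbors.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# A tiny slice of the taxonomy, for illustration only.
categories = [
    "Food > Candy > Candy Bars",
    "Food > Beverages > Soda",
    "Apparel & Accessories > Shoes",
]
category_vecs = embed(categories)

def top_k(query: str, k: int = 2) -> list[str]:
    q = embed([query])[0]
    sims = category_vecs @ q / (
        np.linalg.norm(category_vecs, axis=1) * np.linalg.norm(q)
    )
    return [categories[i] for i in np.argsort(-sims)[:k]]

# In practice the query is the enriched description, not the raw record.
print(top_k("Ready-to-drink soft drinks such as cola"))
```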

The difference here is that each record is context-poor and looks nothing like the output, since it was often labeled for an unrelated purpose. We discovered that, handed a record an advertising manager labeled something like “Mars/Snickers/KitKat - Holiday purchaser”, a typical embedding model won’t place that record anywhere near “Food > Candy > Candy Bars”.

Taking all this into consideration, here’s our solution: 

  • Identify any records that are so lacking in information that they can’t be made sense of
  • Enrich the sparse or vague records to add more detail and make them look more like the output space, so we can use a vector system
  • Vectorize this enriched version
  • Identify a large group of candidate categories by searching the vector space with this ‘enriched’ input
  • Use another LLM to judge which of this much smaller subset of candidate categories is best

Let’s look at an example: 
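Here's a sketch of what that flow could look like for the “Mars/Snickers/KitKat - Holiday purchaser” record from earlier. The prompts, hard-coded candidate list, and schema are our assumptions for illustration, not Sincera's production values, and DISCERN and JUDGE are collapsed into a single call for brevity.

```python
# Illustrative walk-through of one record: enrich, then pick and score a category.
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()
record = "Mars/Snickers/KitKat - Holiday purchaser"

# ENRICH: expand the terse record into text that resembles the taxonomy's language.
enriched = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f"Describe the products or services implied by this ad audience "
                   f"label in a few sentences: {record}",
    }],
).choices[0].message.content

# In the real pipeline these candidates come from the vector search over the
# taxonomy (see the retrieval sketch above); hard-coded here to stay self-contained.
candidates = ["Food > Candy > Candy Bars", "Food > Snack Foods", "Food > Beverages"]

class Judgment(BaseModel):
    category: str
    confidence: float  # 0-1

final = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f"Record: {record}\nEnriched description: {enriched}\n"
                   f"Candidate categories: {candidates}\n"
                   "Pick the best category and give a confidence from 0 to 1.",
    }],
    response_format=Judgment,
).choices[0].message.parsed

print(final)
```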

Key Takeaways

Data normalization and extrapolation is a strong use case for genAI.

Increasing AI readiness – this project is a good example of ‘bottom-up’ AI transformation: starting with a narrow automation goal and leveraging that project to increase the team’s AI confidence and skill for future projects.

Driving results outside the laboratory setting – Getting real results in production often means finding creative workarounds when conventional methodologies fail due to the realities of real enterprise AI projects (e.g., lack of labeled training data).