Digital advertising runs on data. Sincera is building the metadata of the internet, decoding the complexity of digital advertising into clean, precise, and actionable data.
Clean, structured data is a game-changer. Messy, inconsistent, and unstructured data is a headache – for humans and computers.
Each month, Sincera observes millions of unstructured records that describe products and services. These records come from a variety of sources on the open internet, and as a result, each record might represent something slightly different: one record could describe a customer segment, like “Lipton purchaser,” while another could be a product description, like “pork tenderloin.”
Here’s a look at the raw data for five of these records:
This data is nearly impossible to use without extensive and time-consuming standardization. However, if organized into a consistent taxonomy, the data could become a valuable asset – almost like recycling data scraps into something new and useful.
That’s exactly what the Sincera team aimed to do: map the millions of monthly records to Shopify’s product taxonomy (a standard in adtech) to unlock more data utility.
Achieving this goal is easier said than done. The full taxonomy has 10,000 categories and up to 7 levels of nested hierarchy – a far cry from the starting place of millions of inconsistent records. Without genAI, this feat would be cost-prohibitive, requiring thousands of human hours each month.
Sincera hired Fractional AI to make this massive, unstructured data stream usable.
The result: each record is not only mapped to its corresponding Shopify category but also tagged with a confidence level for that categorization – all in real time and with accuracy consistently above 85%. More broadly, this monthly stream of messy data is now a valuable data asset.
To make this data usable, we built an AI categorization system using a multi-step LLM pipeline, where each record (or row in a CSV) is evaluated by several agents.
Here’s how each step works:
Check out the “Looking under the hood” section below for more detail on this pipeline and the methodologies used.
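The exact prompts and agents are Sincera-specific, but a minimal sketch of the shape of this kind of multi-step, per-record flow might look like the following – the `ask` helper, the three agent prompts, and the gpt-4o model choice are illustrative assumptions, not the production implementation:

```python
from openai import OpenAI  # assumed LLM provider; any chat-completion client works

client = OpenAI()

def ask(prompt: str) -> str:
    """One LLM call; each pipeline step is a call like this with its own prompt."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def categorize(record: str) -> str:
    # Agent 1 (hypothetical): enrich the context-poor record.
    enriched = ask(f"In one sentence, what products or services does this advertising record refer to? {record!r}")
    # Agent 2 (hypothetical): propose candidate Shopify taxonomy categories.
    candidates = ask(f"List 5 plausible Shopify product taxonomy categories for: {enriched}")
    # Agent 3 (hypothetical): pick the best candidate and attach a confidence score.
    return ask(
        f"Record: {record!r}\nCandidates:\n{candidates}\n"
        "Return the single best Shopify category and a confidence from 0 to 1, "
        "formatted as 'category | confidence'."
    )
```

Decomposing the work this way means each step can be evaluated and improved independently, which is what makes the multi-agent approach worth the extra LLM calls.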
While the primary goal was to build an AI system to normalize messy data, perhaps the most valuable outcome was the Sincera team's increased confidence in their own AI capabilities.
By working closely together – through twice-weekly standups, AMAs, and deep dives into the reasoning behind each AI decision – the Sincera team became better equipped for future projects.
One area of particular focus was LLM evaluations. We worked with the Sincera team to show them how to build robust evals, which tooling to use (Braintrust), and how to iterate on these evals on their own (without us) in the future (more on evals here).
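As an illustration of what such an eval can look like, here is a minimal sketch using Braintrust's Python SDK – the project name, the expected labels, and the reuse of the hypothetical `categorize` function from the pipeline sketch above are all assumptions, not Sincera's actual eval suite:

```python
from braintrust import Eval
from autoevals import ExactMatch

# Run the classifier over labeled examples and score exact category matches.
# The expected labels here are illustrative, not real ground-truth data.
Eval(
    "shopify-taxonomy-mapping",  # hypothetical project name
    data=lambda: [
        {"input": "pork tenderloin", "expected": "Food > Meat > Pork"},
        {
            "input": "Mars/Snickers/KitKat - Holiday purchaser",
            "expected": "Food > Candy > Candy Bars",
        },
    ],
    task=lambda record: categorize(record).split(" | ")[0],  # keep the category, drop the confidence
    scores=[ExactMatch],
)
```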
You might be wondering “isn’t data classification a solved problem?” or “aren’t there known methods that can help with this?”
The short answer: while this looks like a canonical problem, there are a few key reasons why conventional approaches didn’t work for this particular application. We’ll focus on two of them:
Conventionally, this is a basic classification project and would be a good candidate for a fine-tuned model or traditional ML model.
Unfortunately, we had very little ground truth data (only 160 labeled examples). To fully leverage the benefits of fine-tuning in this case, you’d need roughly 20,000–40,000 labeled examples, distributed across the 10,000 potential categories.
This challenge is representative of most companies’ reality: perfect data is rare, and accumulating labeled data isn’t always the best way to reach a desired outcome. For our purposes, it meant that we quickly ruled out fine-tuning as a suitable technique and focused instead on a modified RAG workflow.
The typical RAG (Retrieval-Augmented Generation) workflow involves breaking down content into smaller chunks, vectorizing these chunks, and then comparing a given input’s vector to find the most similar chunks amongst the vectorized content. These retrieved chunks are used as context for generating responses, making the system more accurate and relevant.
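In code, the vanilla retrieval step looks roughly like this – a sketch assuming OpenAI embeddings and a toy three-chunk corpus (the taxonomy strings other than “Food > Candy > Candy Bars” are made up for illustration):

```python
import numpy as np
from openai import OpenAI  # assumed embedding provider; any embedding model works

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Vectorize the content once up front...
chunks = ["Food > Candy > Candy Bars", "Food > Meat > Pork", "Apparel > Shoes"]
chunk_vecs = embed(chunks)

# ...then, per input, retrieve the most similar chunks by cosine similarity.
query_vec = embed(["pork tenderloin"])[0]
sims = chunk_vecs @ query_vec / (
    np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
)
print(chunks[int(np.argmax(sims))])  # the nearest chunks become context for generation
```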
The difference here is that each record is context-poor and looks nothing like the desired output, since it was often labeled for an unrelated purpose. Handed a record labeled by an advertising manager – something like “Mars/Snickers/KitKat - Holiday purchaser” – we discovered that a typical embedding model won’t place that record ‘near’ “Food > Candy > Candy Bars”.
Taking all this into consideration, here’s our solution:
Let’s look at an example:
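As a sketch of how a modified workflow along these lines can look – reusing the hypothetical `ask` and `embed` helpers from the snippets above, and not necessarily Sincera’s exact implementation – the key move is to enrich the context-poor record with an LLM before embedding it, then let a final LLM call choose from the retrieved shortlist:

```python
# `ask` and `embed` (and numpy as np) are the helpers defined in the sketches above.
record = "Mars/Snickers/KitKat - Holiday purchaser"

# Step 1 (assumed): restate the record as a plain product description, so its
# embedding lands near the right part of the taxonomy.
enriched = ask(f"In one sentence, describe the products this advertising record refers to: {record!r}")
# e.g. something like "Buyers of chocolate candy bars around the holidays."

# Step 2: retrieve the nearest taxonomy entries using the enriched text, not the raw record.
taxonomy = ["Food > Candy > Candy Bars", "Food > Meat > Pork", "Apparel > Shoes"]
taxonomy_vecs = embed(taxonomy)
q = embed([enriched])[0]
sims = taxonomy_vecs @ q / (np.linalg.norm(taxonomy_vecs, axis=1) * np.linalg.norm(q))
shortlist = [taxonomy[i] for i in np.argsort(sims)[::-1][:2]]

# Step 3 (assumed): a final LLM call picks from the shortlist and rates its confidence.
print(ask(
    f"Record: {record!r}\nCandidates: {shortlist}\n"
    "Return the best Shopify category and a confidence from 0 to 1."
))
```

With the enrichment step in place, “Mars/Snickers/KitKat - Holiday purchaser” can land near “Food > Candy > Candy Bars” even though the raw label shares no vocabulary with the taxonomy.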
Data normalization and extrapolation is a strong use case for genAI.
Increasing AI readiness – this project is a good example of ‘bottom-up AI transformation’: starting with a narrow automation goal and leveraging that project to increase the AI confidence and skillfulness of the team for future projects.
Driving results outside the laboratory setting – getting results in production often means finding creative workarounds when conventional methodologies run up against the realities of enterprise AI projects (e.g., a lack of labeled training data).