Market Research Automation
An n8n workflow that mines public market data into a structured, deduplicated database of potential customers. Built for teams that did this by hand.
- · n8n
- · JavaScript
- · Residential proxies
- · Google Sheets
1 person, ~4 weeks, in production
400+ keywords processed per run; clean provider list with partner attribution, no duplicates, delivered by email
The biggest lesson from this build: not every workflow needs AI. The first version used a vision model to extract data. The final version uses no AI at all. Plain JavaScript, rule-based. Cheaper, faster, more reliable.
The problem
The team had to figure out, on a recurring basis, which providers were active in a given market and which of them looked like potential customers. The information is public: search results, directories, vertical portals. But gathering it by hand takes forever.
Dozens of results per keyword. Each result has to be checked, the provider identified, information compared. At 400–500 keywords and several thousand results per run, doing this manually is not realistic.
Why it ended up without AI
My first approach was an AI flow: screenshot result pages, feed them to a vision model. It didn't work. The model hallucinated data, invented provider names, mismatched fields, dropped entries. Depending on viewport, the same site renders content as carousel, sidebar, or grid. The model couldn't keep up with the shifting layouts.
So I rebuilt it. Structured data extraction from HTML doesn't need AI. A JavaScript parser looking for fixed patterns in the markup returns consistent results, regardless of how the page renders. No tokens spent, no hallucination risk.
The approach
Two phases. Phase 1 collects the public market data and extracts the relevant fields. Phase 2 aggregates, deduplicates, and enriches. A staging store sits between them so collection and analysis can run independently.
Keywords get processed in batches to respect API limits. Each batch runs to completion before the next starts. Requests route through proxy services; a cookie-banner handler keeps content reachable.
The hard parts
Writing the parser was the easy part. Most of the dev time went into edge cases, especially when scaling from small test runs to the full 400+ keywords.
The index problem
Under 10 results, everything was clean. The moment a new batch with another 10 results came in, fake duplicates appeared. If 9 of 10 requests come back (timeout, blocked request, empty response), the index shifts. Keyword 5 gets matched to keyword 6's results. The entire mapping breaks.
I fixed it by tying every result back to its original input directly. Not by array position. By explicit reference. Each result knows which keyword and which row it belongs to, no matter how many other requests failed.
The duplicate problem
A provider often shows up several times: once through their own account, once through a partner channel, sometimes both at once for different products. Dropping one of the rows loses the partner attribution.
Dedup fell out of the index fix. I first tried batching on other parameters, but in the end a "cluster and enrich" approach was needed: all entries for a provider get grouped. From the cluster, a master record is picked, with the most direct entry winning. The remaining entries are scanned for additional information and merged into the master. Two entries for the same provider get merged, not deleted.
Fuzzy duplicates
Tiny differences block clean grouping. Example GmbH and Example GmbH (trailing space) get treated as two providers. Casing, special characters, same problem. I normalise aggressively when grouping: lowercase, strip whitespace, strip special characters. Different spellings end up in the same cluster.
API constraints
Plenty of sites block automated requests without residential proxies. Cookie banners have to be accepted programmatically before content shows up. And export APIs often reject strings over a certain length. Long tracking URLs or HTML artefacts crash the export. Every field gets checked for length and special characters before it's saved.
What surprised me most: how hard it is to make this kind of automation run reliably at that size. Writing the parser is an afternoon. Validating the results until 400+ keywords and thousands of rows come through clean took weeks.
The result
What lands in the spreadsheet:
- One provider per row, no duplicates
- Partner attribution: which tools or services does the provider already use?
- Multiple mappings if a provider operates across channels
- Direct links to provider profiles
- Keyword mapping: which search term surfaced which provider
- Email notification when the run finishes
Feedback was strong. What used to be hours of manual research now runs end to end and produces a ready contact list. I'd set up flows like this faster today. I work through the problems more logically now. But every flow is different.
