Web Data Extract Summit 2024 Recap
The 2024 Web Data Extract Summit celebrated both its debut in Austin, USA and its sixth edition since launching in 2019.
The two-day event began with a day of hands-on technical workshops, followed by an action-packed second day of curated sessions:
Four talks explored AI-related applications of web data,
Two talks addressed business strategies for leveraging web data,
Three sessions focused on the infrastructure driving web scraping operations, and
Two sessions delved into legal, ethical, and compliance considerations.
Before we move on, let’s acknowledge that you may be exhausted from reading and hearing about AI at this point. However, its potential to impact your bottom line is difficult to ignore, especially if you’re leveraging web data extraction in any capacity.
What are others doing with their data extraction practices? What opportunities could you be missing? And most importantly, what should you be doing differently to stay ahead?
We hear you, and we get the AI fatigue. That’s why our team at Zyte made sure to cover all the bases, from the technical and business aspects of web scraping to the legal ones, when curating this year’s Extract Summit lineup of 11 talks.
Each technical talk offered unique perspectives and complemented the others well. However, we noticed several recurring themes. Keep reading to find out what those are.
To provide a clear overview, we've divided this recap into two main sections: one focusing on technical insights and the other on business implications.
Technical Insights for Developers Doing Web Data Extraction
Here are the five recurring topics across the technical talks:
Infrastructure as a service: We now have increasingly intelligent, distributed computational infrastructure, all available on demand. Three talks covered this topic, spanning proxies, browsers, and distributed compute.
Tapping into AI: We get a view of how AI changes the economics of build vs. buy at Zyte, and how Neelabh Pant’s team at Walmart uses AI agents to streamline their data pipeline orchestration.
Managing LLM costs: At this moment, LLMs + HTML = pricey. How to best manage this? How does this change with multimodal models? These are some of the motivating questions that Iván Sánchez from Zyte and Asim Shrestha from Reworkd unpacked in their talks.
Domain-specific prompt engineering and evals: To unlock the potential of LLMs for your data, you still need people with domain knowledge. Neelabh shared the technique that landed his team at a sweet spot: designing domain-specific prompts and leveraging AI agents. Asim from Reworkd then highlighted the importance of writing domain-specific evals to simplify the problem space (see the first sketch after this list).
Retrieval-augmented generation (RAG): Neelabh highlighted how RAG helped his team identify top similar products, stressing the need for careful experimentation with the number of items retrieved to maintain contextual accuracy and relevance. Jan from Apify then positioned RAG as a game-changer for commercial LLM applications, demonstrating a website content crawler integrated with RAG pipelines and a Pinecone vector database backend (see the second sketch after this list).
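To make the evals point concrete, here is a minimal sketch of a domain-specific eval for product-data extraction. It is not code from Asim’s talk; the gold records, field names, and the stand-in extractor are all illustrative. The idea is simply to score an LLM-backed extractor field by field against a handful of pages you know well, rather than grading free-form output.

```python
from typing import Callable

# Gold labels for a handful of pages you know well; URLs, fields, and values
# here are illustrative, not from the talk.
GOLD = [
    {"url": "https://example.com/p/1", "name": "Trail Runner 2", "price": 89.99, "currency": "USD"},
    {"url": "https://example.com/p/2", "name": "Canvas Tote", "price": 24.50, "currency": "USD"},
]

def field_accuracy(extract: Callable[[str], dict],
                   gold: list[dict],
                   fields: tuple[str, ...] = ("name", "price", "currency")) -> dict:
    """Score an extractor field by field instead of grading free-form LLM text."""
    hits = dict.fromkeys(fields, 0)
    for item in gold:
        predicted = extract(item["url"])  # your LLM-backed extractor plugs in here
        for f in fields:
            if predicted.get(f) == item[f]:
                hits[f] += 1
    return {f: hits[f] / len(gold) for f in fields}

if __name__ == "__main__":
    # Trivial stand-in extractor that always returns the same record.
    dummy = lambda url: {"name": "Trail Runner 2", "price": 89.99, "currency": "USD"}
    print(field_accuracy(dummy, GOLD))  # {'name': 0.5, 'price': 0.5, 'currency': 1.0}
```

Keeping the eval this narrow is the point: a handful of domain-specific fields and pages turns “is the LLM any good?” into a number you can track as prompts change.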
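And here is a toy end-to-end view of the RAG loop itself. This is a sketch, not Apify’s crawler or the Pinecone integration: it uses bag-of-words vectors and an in-memory dictionary as the vector store, where a production pipeline would use a real embedding model and Pinecone. It does show where retrieved chunks enter the prompt and where the number of items retrieved (k) becomes the tuning knob Neelabh described.

```python
import math
from collections import Counter

# Toy stand-in for an embedding model: bag-of-words vectors. In the pipeline
# Jan demonstrated, crawled page content is chunked, embedded with a real
# model, and stored in Pinecone rather than in an in-memory dict.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# "Index" of crawled page chunks (content is illustrative only).
chunks = [
    "Product A supports bulk export via the REST API.",
    "Product B is limited to manual CSV downloads.",
]
index = {chunk: embed(chunk) for chunk in chunks}

def retrieve(question: str, k: int = 1) -> list[str]:
    q = embed(question)
    ranked = sorted(index, key=lambda chunk: cosine(q, index[chunk]), reverse=True)
    return ranked[:k]  # k is the knob worth experimenting with

question = "Which product offers export via the REST API?"
context = "\n".join(retrieve(question, k=1))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # this prompt would then be sent to the LLM
```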
Ready to zoom in? Let’s—but not without a map.
If we think of the web data extraction stack as a layered structure, then we can map the different sessions onto its key components. Here’s a visual breakdown of how these sessions align:

You can also watch the respective sessions for each topic on demand here.
Beyond the technically focused talks, one session proved equally significant for developers and business leaders alike: A Practical Demonstration of How to Responsibly Use Big Data to Train LLMs by Joachim Masar.
Joachim offered three key, actionable recommendations for anyone involved in implementing a web data extraction stack:
AI-Assisted Data Cleaning: Use LLMs to assist in cleaning data by identifying and removing sensitive information like names and phone numbers (a minimal sketch follows these recommendations).
Privacy, Anonymisation, and Bias Mitigation: Prioritize filtering out Personally Identifiable Information (PII) during the data collection stage. This involves more than just removing usernames; it also requires a thorough examination of the content to ensure no sensitive information is inadvertently included. Be aware of potential biases in scraped data, such as demographic overrepresentation. Techniques like word clouds can help identify biases.
Data Security and Privacy Practices: Use techniques like differential privacy and human-in-the-loop systems to improve data handling processes.
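As a rough illustration of the first two recommendations, here is a small PII-scrubbing sketch (not from Joachim’s demonstration): a cheap regex pre-pass for emails and phone numbers, followed by an LLM pass, stubbed out here, for the things patterns miss, such as personal names in prose.

```python
import re

# Cheap pattern pre-pass for obvious PII; an LLM pass (stubbed out below)
# can then catch what the regexes miss, such as personal names in prose.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_patterns(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

def scrub_with_llm(text: str) -> str:
    """Stand-in for a real model call; the prompt is the important part."""
    prompt = (
        "Rewrite the text, replacing any person names or other personally "
        "identifiable information with [REDACTED]. Change nothing else.\n\n" + text
    )
    # return call_llm(prompt)  # wire up your model client of choice here
    return text  # no-op in this sketch

record = "Contact Jane Doe at jane.doe@example.com or +1 512 555 0100."
print(scrub_with_llm(scrub_patterns(record)))
```

A human-in-the-loop review of a sample of scrubbed records, as Joachim recommended, is still the backstop for whatever the patterns and the model both miss.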
He also delved into the challenges of using publicly scraped data, particularly the risk of a model memorising specific data points, which could compromise privacy and violate ethical guidelines. He highlighted key considerations for deploying models in low-resource contexts, where constraints like limited computational power and sparse training data demand creative and efficient solutions.
These thought-provoking questions were also addressed during the talk:
Should companies behind LLMs make their models open source to foster transparency and community collaboration?
Can niche models be improved by retrofitting context using datasets with similar themes, thereby enhancing their applicability to specialised tasks?
How can overfitting be mitigated when incorporating human-in-the-loop feedback, ensuring the model remains generalisable while benefiting from nuanced corrections?
Business Insights for Business Leaders Buying Web Data
If you're a business leader working with web data, these sessions are a must.
Closing Thoughts
We hope this snapshot piqued your curiosity and gave you a good foundation to start surfacing the rich insights from each talk. You can find the full playlist of the day-two sessions here.
You can also watch the on-demand talks from the past six years of Extract Summit here to gain a deeper perspective on how the landscape has evolved over the years.
If you didn’t manage to attend in 2024, here is your chance to register for the 2025 event. Do consider applying for a speaking slot!
