Pioneering web scraping technologies and building a developer community - Part 1
Welcome to our latest initiative, where we delve into the minds of industry innovators and leaders. Today, we are thrilled to introduce you to Voy Zeglin, the dynamic CEO of Data Miners. Voy has not only pioneered advanced web scraping technologies but also nurtured a thriving community of over 8,000 developers on Facebook’s Web Scraping World. In this exclusive interview, we explore Voy’s journey from being a seasoned software developer in Europe to becoming a market leader in data extraction.
Note: In light of the interview's comprehensive nature, we've thoughtfully divided it into two parts for a more in-depth exploration of the topics. Stay tuned for both Part 1 and Part 2.
The first part of our comprehensive interview with Voy Zeglin focuses on the foundational elements that have shaped his company and the web scraping industry. Voy takes us through his background and experience, shedding light on the initial projects that sparked Data Miners' inception.
Through his approach, we'll learn about the meticulous initiation and planning stages of their operations, the technical approach that underpins their sophisticated data extraction methods, and the challenges they've encountered — along with the innovative solutions they've implemented. Voy also discusses the crucial scalability strategies that have allowed Data Miners to adapt and thrive in an ever-evolving digital environment.
Join us as we unpack the layers of expertise and strategic thinking that have propelled Data Miners to the forefront of the web scraping field.
Background and Experience:
Could you share a bit about your background and how you got involved in web scraping?
With pleasure! It’s been a winding road that finally led me to focus on data extraction. I am a veteran software developer, originally from Poland. I spent a few years in London working with a fantastic selection of high-end clients for Joe Navin and his Winona eSolutions.
Then another company I worked for, managed by the brilliant Alexander Sharp, moved to Hong Kong, so I went along before returning to Europe around ten years ago and starting Data Miners, my own business. One thing led to another and, after an initial project with a single client, I identified a market gap and a growing demand for data extraction services, which was not really a thing back then. It was a completely different landscape, but starting early allowed me to better position my company and build a gratifying portfolio of prominent clients that I’m proud of.
What types of web scraping projects have you worked on, and in what capacities?
I believe I went through the whole palette by starting small, as a freelancer, back when there were no significant obstacles in mass-retrieving data from the Internet. Just think, no captchas, no bot detection mechanisms. No significant expenditure for infrastructure either.
Things started changing and gradually all methods needed to be adjusted. Companies realized the value of data, so dedicated budgets had to be assigned to address this area separately, which allowed my company to grow. Right now we focus mainly on mid to large scale data extraction projects, sometimes reaching an intensity that is not for the faint-hearted, haha.
It’s difficult to pinpoint one specific kind of thing we do, as we are also moving toward delivering insights, so processing and interpreting the data, beyond merely aggregating it. Just as an example, imagine that you run a manufacturing company that has tens of thousands of products — it can be anything, like household appliances, cosmetics, clothing, car parts. It would be great to know how much your products are sold for in different places on the web, and maybe also to know what your competitors are doing, preferably in real-time. Maybe they have cheaper products? Maybe they introduce new product lines? We can help you with all that. Insights like these greatly help C-level executives and managers to make better decisions, based on hard evidence coming from up-to-date information, from their market environment. Knowing certain things can help companies save millions of dollars per year.
Project Initiation and Planning:
How do you typically identify and define the scope of a new web scraping project?
It is case-by-case with each client. It’s impossible to give a one-size-fits-all recipe on how to do that. Communication is key, and so is identifying specific goals and results, which will be different for each client. Sometimes the clients only have a fuzzy idea about what they want to achieve, or they operate inside an isolated perception bubble, which is normal for many companies. A fresh, outside perspective can shed new light on how they function and help them move in an optimal direction.
I often spend weeks talking with my clients before we start the implementation phase. This is crucial. I strongly believe in the value of effective planning; it’s a paramount factor contributing to the success of any project.
Can you walk us through the initial steps you take when starting a large-scale web scraping initiative?
It’s like building a house. You need to decide if you want a small cabin in the woods, or a skyscraper in the middle of Manhattan. There are countless aspects that need to be considered. Bridging the gap between people from high-level decision-making spheres and technical engineers who look through a different lens is quite a challenge.
Project management structures can sometimes grow, but that’s something I’m trying to avoid as much as possible. I acknowledge the fact that companies often have to work with limited budgets, so clear business objectives need to be laid out and trimmed before moving further. In principle, it comes down to deciding what needs to be done, how fast it has to be delivered and how many resources can be assigned. Iterate through that several times and you have a complete project planning life-cycle. And, as a result, you understand project management memes better than normal people.
Technical Approach:
What tools and technologies do you prefer for web scraping projects and why?
Data extraction professionals can now indulge themselves with a myriad of languages, tools and open source libraries serving all purposes. This segment is growing and developers publish new gems every day, especially with the advent of AI technology.
There is no need to develop your whole web scraping toolset from scratch. Even with a very limited budget, you can construct quite a decent workshop. In general, there are two main programming languages utilized in data extraction: Python and JavaScript (JS). I’m personally a bigger fan of the latter, since it’s the native language of the Internet and sticking to one language simply makes sense, along the lines of the Occam’s razor principle. GitHub is the definitive go-to place for finding hidden treasures. And yes, recently ChatGPT, Copilot and Gemini too. They help aspiring developers greatly. We use these tools at my company as well.
How do you approach scraping dynamically generated content on websites?
Dynamically generated content can mean at least two things. Getting slightly more technical, this kind of content, as opposed to static content, appears on JS-heavy websites, which are most of the websites really. Dynamically here means pulled from some kind of database or a backend engine, not visible straight away. Sometimes it takes milliseconds, but sometimes longer, before a fully rendered page appears on your screen.
Another meaning of dynamically generated content is content that changes according to user behavior, location, interest, and other parameters. For example, suggested movies on Netflix or music playlists on Spotify fall into that category. You get different suggestions based on what you watched or listened to previously. This is only going to evolve.
With AI, we are moving towards a world where everyone gets not only personalized suggestions of existing content, like pre-made movies or music, or even games, but at some point the content will be generated on-the-fly, in my opinion leading to a society where we’re more isolated in some way, because we won’t have a common reference framework that binds us together. Everyone will have their own, uniquely generated Star Wars, Minecrafts, or music artists which they will be able to interact with. Crazy future ahead. But we’re getting a bit off-track here.
In general, technically speaking, you need to adapt your toolset to the changing environment. If simple tools like cURL or request libraries don’t work, try Playwright, Puppeteer or other headless browser tools. Most of the time they are the solution, when backed by geographically relevant IP addresses matched with your target website. This is only scratching the surface, but dive deeper and you’ll discover more.
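To make the "start simple, then escalate" idea concrete, here is a minimal Python sketch. It is an illustration only, not Data Miners' code. The two fetchers are hypothetical stand-ins that return canned HTML; in practice the first might wrap a plain HTTP client and the second a headless browser such as Playwright or Puppeteer. The `required_marker` check is likewise an assumed, deliberately crude way of deciding whether the page came back with the data in it.

```python
from typing import Callable, Iterable

# A fetcher takes a URL and returns HTML, or raises on failure.
Fetcher = Callable[[str], str]


def looks_complete(html: str, required_marker: str) -> bool:
    """Crude check that the data we need is actually present in the HTML."""
    return required_marker in html


def fetch_with_fallback(
    url: str,
    fetchers: Iterable[tuple[str, Fetcher]],
    required_marker: str,
) -> tuple[str, str]:
    """Try each fetcher in order, cheapest first; return (fetcher_name, html)."""
    errors = []
    for name, fetch in fetchers:
        try:
            html = fetch(url)
        except Exception as exc:  # e.g. a blocked or timed-out request
            errors.append(f"{name}: {exc}")
            continue
        if looks_complete(html, required_marker):
            return name, html
        errors.append(f"{name}: marker {required_marker!r} not found")
    raise RuntimeError("all fetchers failed: " + "; ".join(errors))


# Hypothetical stand-ins: the plain request sees an empty JS shell,
# the headless browser sees the rendered page.
def plain_http(url: str) -> str:
    return "<html><body><div id='app'></div></body></html>"


def headless_browser(url: str) -> str:
    return "<html><body><div id='app'><span class='price'>9.99</span></div></body></html>"


name, html = fetch_with_fallback(
    "https://example.com/product/1",
    [("plain_http", plain_http), ("headless_browser", headless_browser)],
    required_marker="class='price'",
)
print(name)  # headless_browser
```

Ordering the fetchers from cheapest to most expensive means the headless browser is only paid for when the simple request does not return the data.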
Challenges and Solutions:
What are the most significant challenges you've faced in web scraping, and how have you overcome them?
I think it’s the amount and complexity of data that are most significant. It’s not a problem to extract some initial data, whether by using low-code/no-code tools, browser plugins and add-ons, or even manually. The great challenge is to orchestrate larger scale operations, when you have thousands or millions of requests to perform in a short time period. This is where planning and preliminary assessment play a great role. This is often a situation where you have to resort to commercially available tools, like website unblockers, proxy clusters and sophisticated queueing methods, which will save you heaps of time. It’s not really possible to do all that on your own any more.
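As a small illustration of what "queueing" can mean at its simplest, the sketch below caps how many requests are in flight at once using Python's `asyncio.Semaphore`. It is a toy under stated assumptions: `fake_fetch` is a hypothetical stand-in that sleeps instead of touching the network, and real orchestration at the scale described above would add retries, proxies and persistent queues.

```python
import asyncio


async def run_bounded(urls, fetch, max_concurrency):
    """Run fetch(url) for every URL with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url):
        async with semaphore:
            return await fetch(url)

    # return_exceptions=True keeps one failed request from aborting the batch.
    return await asyncio.gather(*(guarded(u) for u in urls), return_exceptions=True)


# Hypothetical fetcher that only records how many calls overlap.
in_flight = 0
peak = 0


async def fake_fetch(url):
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)  # stands in for network latency
    in_flight -= 1
    return f"ok:{url}"


urls = [f"https://example.com/item/{i}" for i in range(50)]
results = asyncio.run(run_bounded(urls, fake_fetch, max_concurrency=5))
print(len(results), "results, peak concurrency", peak)
```

The results come back in the same order as the input URLs, which makes it easy to match failures to the requests that caused them.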
How do you deal with the uncertainty of large-scale projects? How do you ensure the system is anti-bot resistant, field break resistant and website layout resistant?
It’s always good to be prepared for unexpected situations and it’s always good to include that in your budget proposal for the client. Otherwise you may end up paying extra just to deliver the results and, in extreme cases, generating loss instead of generating income. Make sure to test your solution using two or three methods, at least. It will take some more time to prepare, but will leave you with options on the table. It may happen that you won’t know what anti-bot techniques will be brought against you right until the very end. You may have the whole system ready and set up to press the “launch” button, but then the target website will detect some patterns in your traffic and block you. If you have some auxiliary backup ideas in place, you’re good to go.
Scalability Strategies:
How do you ensure scalability in your web scraping projects?
That’s an important issue. We need to remember several key components, starting with choosing the right tools and paying attention to architectural solutions. It gets more difficult to introduce significant changes down the line, once we have already started building.
First of all, focus on the modularity of your architecture and ensure the components can be modified independently. This will save you from technical debt, which is what you accumulate by writing faster, but uglier code. Spend some more time on it, break down your software into smaller pieces, wrap key functions in libraries. In case something needs to be fixed, you can easily and quickly perform the necessary adjustments.
Another thing that shouldn’t be underestimated is simple and obvious, yet so often overlooked. Keep tabs on the amount of data you download and store locally. You can run out of space blazingly fast when you launch web scraping projects at scale, so make sure you have enough disk space: gigabytes at least, and sometimes more. Measure it beforehand.
The other side of the same coin is the target server capacity, in terms of the resources it can allocate to accommodate your requests. Run some tests first and don’t lock yourself out, so to speak. Sometimes our data sources run on surprisingly weak underlying hardware. But if you aim at highly popular websites, make sure you have enough resources on your side. For example, a big enough IP address pool at your disposal.
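One possible reading of "components that can be modified independently" is a pipeline whose fetch, parse and store stages are plain functions that can be swapped without touching each other. The Python sketch below is an assumption about how that might look, not a description of any real system; `static_fetch`, `parse_title` and the in-memory `saved` list are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pipeline:
    fetch: Callable[[str], str]    # URL -> raw HTML
    parse: Callable[[str], dict]   # raw HTML -> structured record
    store: Callable[[dict], None]  # record -> persisted somewhere

    def run(self, url: str) -> dict:
        record = self.parse(self.fetch(url))
        self.store(record)
        return record


# Hypothetical stages; each can be replaced without changing the others.
def static_fetch(url: str) -> str:
    return "<html><head><title>Kettle</title></head></html>"


def parse_title(html: str) -> dict:
    start = html.index("<title>") + len("<title>")
    end = html.index("</title>")
    return {"title": html[start:end]}


saved = []
pipeline = Pipeline(fetch=static_fetch, parse=parse_title, store=saved.append)
pipeline.run("https://example.com/p/1")
print(saved)  # [{'title': 'Kettle'}]
```

If the target site changes its layout, only `parse_title` needs to change; if a plain request stops working, only the fetch stage is replaced.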
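"Measure it beforehand" can start as back-of-the-envelope arithmetic. The sketch below multiplies page count, average page size and the number of snapshots kept, then compares the result with free disk space using `shutil.disk_usage` from the Python standard library. All the numbers are made up for illustration, and the 1.5x safety factor is an arbitrary assumption.

```python
import shutil


def estimate_storage_bytes(pages, avg_page_bytes, snapshots_kept=1, compression_ratio=1.0):
    """Rough estimate: pages x average size x retained snapshots, scaled by compression."""
    return int(pages * avg_page_bytes * snapshots_kept * compression_ratio)


def has_headroom(path, needed_bytes, safety_factor=1.5):
    """True if the disk holding `path` has needed_bytes free, plus a safety margin."""
    free = shutil.disk_usage(path).free
    return free >= needed_bytes * safety_factor


# Illustrative numbers only: 2 million pages, ~150 KB each, 7 daily snapshots.
needed = estimate_storage_bytes(2_000_000, 150_000, snapshots_kept=7)
print(f"{needed / 1e12:.1f} TB")  # 2,000,000 x 150,000 x 7 = 2.1e12 bytes -> 2.1 TB
print("enough space here:", has_headroom(".", needed))
```

A few hundred sampled pages are usually enough to get a usable average page size before committing to a full run.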
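One simple way to avoid overloading a weak target server is to enforce a minimum delay between consecutive requests to the same host. The Python sketch below is a minimal version of that idea; the `Throttle` class and the two-second interval are illustrative assumptions, and the clock and sleep functions are injectable so the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests to one host."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous request was released

    def wait(self):
        """Sleep until min_interval has passed since the last call; return the delay."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay:
                self._sleep(delay)
        self._last = now + delay
        return delay


# A fake clock makes the example instant and deterministic.
class FakeClock:
    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


clock = FakeClock()
throttle = Throttle(min_interval=2.0, clock=clock.time, sleep=clock.sleep)
delays = [throttle.wait() for _ in range(3)]
print(delays)  # [0.0, 2.0, 2.0]
```

In a real scraper, `throttle.wait()` would be called right before each request, with one `Throttle` instance per target host.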
What factors do you consider critical for maintaining performance in large-scale scraping operations?
From my early days I remember making the mistake of just writing and launching the scrapers and going home for the night, the fire-and-forget technique, which often backfired. This taught me a lesson about the importance of monitoring, and these days that means proactive monitoring rather than just reactive monitoring.
Proactive means preventing potential issues before they happen. That said, a basic alerting system that lets you know your scrapers have stopped is also a great place to start. Monitor whatever you can. As Reagan said: trust, but verify. We could talk about resource planning and allocation and many other factors here. Maintaining performance is an art in itself.
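The "basic alerting system" mentioned above can be as small as a heartbeat check: each scraper reports in periodically, and anything silent for too long is flagged. The Python sketch below shows that idea under stated assumptions; `ScraperWatchdog`, the scraper names and the 300-second threshold are all hypothetical, and how the alert is delivered (email, chat message, pager) is left out.

```python
import time


class ScraperWatchdog:
    """Track the last heartbeat of each scraper and report the silent ones."""

    def __init__(self, max_silence_seconds, clock=time.monotonic):
        self.max_silence = max_silence_seconds
        self._clock = clock
        self._last_seen = {}

    def heartbeat(self, scraper_name):
        """Called by a scraper whenever it makes progress."""
        self._last_seen[scraper_name] = self._clock()

    def stalled(self):
        """Names of scrapers that have been silent longer than max_silence."""
        now = self._clock()
        return sorted(
            name for name, seen in self._last_seen.items()
            if now - seen > self.max_silence
        )


# Simulated timeline using a fake clock instead of real waiting.
fake_now = [0.0]
watchdog = ScraperWatchdog(max_silence_seconds=300, clock=lambda: fake_now[0])
watchdog.heartbeat("prices")
watchdog.heartbeat("reviews")
fake_now[0] = 200.0
watchdog.heartbeat("prices")   # "prices" is still making progress
fake_now[0] = 400.0
print(watchdog.stalled())  # ['reviews']
```

This is the reactive end of monitoring; a more proactive setup would also watch trends such as rising error rates or falling records per minute before a scraper stops entirely.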
Are there potential changes in data volume or sources that might impact scalability requirements?
Yes. As a rule of thumb, the industry identifies what is called the “three Vs” when it comes to big data: volume, velocity and variety. They are pretty self-explanatory. In our setting, volume is the amount of data that we retrieve, velocity is the speed at which we do that, and variety can be understood as the range of data points per single source. All of the above can fluctuate and change during the process of data extraction.
If you write your software in the right way and think ahead, it should retrieve only the essential information, not every possible data stream. After that, you should also structure your data in the right way. Hash and compress information, minimize your footprint. We came up with this poster idea at my company some time ago, and it is still somewhere in our office: “less code, less CO2”. Optimize your code, save computational resources, and conserve energy. A kind of Zen Buddhist philosophy applied to programming.
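Because all three quantities can drift during a run, it can help to compute them per batch and watch how they change. The Python sketch below is one possible way to put rough numbers on them, following the definitions given above; the metric names and the choice to measure volume as serialized JSON size are assumptions made for illustration.

```python
import json


def three_vs(records, elapsed_seconds):
    """Rough per-batch measurements of volume, velocity and variety."""
    # Volume: size of the batch when serialized as JSON, in bytes.
    volume = sum(len(json.dumps(r).encode("utf-8")) for r in records)
    # Velocity: records retrieved per second.
    velocity = len(records) / elapsed_seconds if elapsed_seconds > 0 else 0.0
    # Variety: number of distinct data points (field names) seen in the batch.
    variety = len({key for r in records for key in r})
    return {
        "volume_bytes": volume,
        "records_per_second": velocity,
        "distinct_fields": variety,
    }


batch = [
    {"sku": "A1", "price": 9.99},
    {"sku": "B2", "price": 4.50, "currency": "EUR"},
]
stats = three_vs(batch, elapsed_seconds=4.0)
print(stats["records_per_second"], stats["distinct_fields"])  # 0.5 3
```

A sudden jump in `distinct_fields`, for example, can be an early hint that a source has changed its page structure.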
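A minimal Python sketch of "keep only the essentials, hash, compress" might look like the following. It uses only the standard library (`hashlib`, `gzip`, `json`). The list of essential fields and the sample records are hypothetical, and SHA-256 is used here purely as a fingerprint for spotting duplicate records, which is one reasonable interpretation of "hash" in this context.

```python
import gzip
import hashlib
import json

ESSENTIAL_FIELDS = ("sku", "price")  # hypothetical: whatever the project actually needs


def slim(record):
    """Drop everything except the essential fields."""
    return {k: record[k] for k in ESSENTIAL_FIELDS if k in record}


def fingerprint(record):
    """Stable SHA-256 digest of a record, used to detect duplicates."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def pack(records):
    """Slim, de-duplicate and gzip a batch of records as JSON lines."""
    seen, unique = set(), []
    for record in map(slim, records):
        digest = fingerprint(record)
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    payload = "\n".join(json.dumps(r, sort_keys=True) for r in unique)
    return gzip.compress(payload.encode("utf-8"))


def unpack(blob):
    """Inverse of pack: decompress and parse the JSON lines."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines()]


raw = [
    {"sku": "A1", "price": 9.99, "html": "<div>...</div>"},
    {"sku": "A1", "price": 9.99, "html": "<div>changed markup</div>"},
    {"sku": "B2", "price": 4.5, "html": "<div>...</div>"},
]
blob = pack(raw)
print(unpack(blob))  # two records remain: A1 and B2, without the html field
```

The two `A1` records differ only in markup, so once the non-essential `html` field is dropped they hash to the same fingerprint and are stored once.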
Continue reading the next part of the interview.