
Web snapshots? The what, the why, and the how

You may have come across this term before, or already be familiar with it.

For starters, it is key to identify which parts of a page are important to capture.

In this article, we will explore the various aspects involved in creating web snapshots of web pages.

What does “web snapshot” actually mean?

Simply put, taking a web snapshot means using an application that archives a live page in such a way that you can faithfully restore it later, offline. From the snapshot, you can extract text or HTML, or take screenshots, completely offline.

There are a few reasons why this is needed:

  • Link rot: pages are removed from websites all the time, websites go offline, and domains disappear
  • Websites change all the time, and it can be useful to take snapshots over time and observe the differences
  • Ensuring the information is preserved in an archive for future research, history, or the public
  • Saving long articles for reading in places where the Internet is not available (such as on a plane or a train)
  • Fighting censorship by making sure that sensitive information and vital proof is not lost

At Zyte we need this, because we are developing an automated data extraction API that uses AI to recognize parts of the page, such as title, images, price, etc.

To do this, we have a few tens of thousands of web snapshots of pages that we use for training and testing the model, and they need to capture the original state of the page as closely as possible.

Creating perfect web snapshots is not a simple problem, because web pages are complicated, and of course, we are not the first to try this.

Let’s look at the current methods and formats.

WARC

The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file, together with related information.

The WARC format is inspired by HTTP/1.0 streams, with a similar header block, and it uses CRLFs as delimiters, which makes it very conducive to crawler implementations.

Files are often compressed using GZIP, resulting in a .warc.gz extension.
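For illustration, a single response record looks roughly like this (the values below are made up): a block of WARC headers, a blank line, and then the captured HTTP response, with records separated by CRLF pairs.

    WARC/1.0
    WARC-Type: response
    WARC-Target-URI: https://example.com/
    WARC-Date: 2022-06-01T10:30:00Z
    WARC-Record-ID: <urn:uuid:...>
    Content-Type: application/http; msgtype=response
    Content-Length: 2048

    HTTP/1.1 200 OK
    Content-Type: text/html; charset=UTF-8

    <!doctype html>
    <html>...</html>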

WARC is considered the gold standard for archiving, used by all serious archiving initiatives.

Because of that, there are plenty of applications and programming libraries that can parse this format. The format is definitely very elegant and it generally works. Where it fails is on websites with lots of Javascript, especially if the Javascript is non-deterministic.

Unfortunately, it’s easy to break the restored page with a single line of Javascript. If the page creation date is saved in the HTML as a tag or a Javascript constant, or extracted from the headers of a resource, and there’s an if condition comparing that date against the current date, the page can refuse to render. The restored page will then be blank, or show an error.
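As a purely hypothetical illustration (the meta tag name and the 7-day threshold are invented), a check like this is harmless on the live site but blanks the page when the archive is replayed months later:

    // Hypothetical example of replay-hostile Javascript: the page refuses
    // to render if its embedded publish date looks "too old".
    const tag = document.querySelector('meta[name="publish-date"]');
    const publishDate = tag ? new Date(tag.getAttribute('content') || '') : new Date();
    const ageInDays = (Date.now() - publishDate.getTime()) / 86_400_000;
    if (ageInDays > 7) {
      // On replay, "now" has moved on, so this branch always runs
      // and the restored page shows nothing.
      document.body.innerHTML = '';
    }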

Another big problem is that after you create a WARC file, it’s surprisingly hard to find a reader to view it with. People capturing WARC files are mostly focused on archiving pages, not restoring them.

There’s ReplayWeb.page, which is nice, and anyone can use it without having to run lots of scary CLI commands in the terminal. And there’s Webrecorder’s pywb, which is also nice, but needs some setup and technical know-how.

That’s basically it. Both of them have their problems: some important shopping, real-estate, and job-posting domains can’t be saved or restored, for different reasons.

In their defense, the archiving solutions only care about preserving the knowledge of humanity, not the prices of random products, so they make sure that articles, forum posts and Twitter threads can be saved at least.

To summarize, WARC captures only the raw resources, so all the Javascript must be re-executed on replay, and this can lead to inconsistencies. Of course, some of these limitations may change in the future.

rrWeb: record and replay the web

Unfortunately, rrWeb is a very technical programming library, not a final product. There’s no easy way of capturing a page or looking at the result; you need to know exactly what you’re doing to apply it to web snapshots.

The documentation for getting started with rrWeb may not seem that helpful. I had to dig into the source, the tests, and the issues, and try different methods to figure out the best way to capture or restore pages. The library is written in TypeScript, and capturing boils down to calling the snapshot(document, { ... }) function once the page is completely loaded… Sounds easy enough!
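A minimal sketch of that call, assuming the rrweb-snapshot package (the available options and the exact return shape depend on the version you install):

    import { snapshot } from 'rrweb-snapshot';

    // Wait for the page to settle, then serialize the whole DOM into
    // rrWeb's JSON-friendly tree and persist it somewhere.
    window.addEventListener('load', () => {
      const serializedDom = snapshot(document, { inlineStylesheet: true });
      const payload = JSON.stringify(serializedDom);
      // ...send `payload` to whatever storage you use
    });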

I got really excited about this format once I spent some time understanding it. It has the potential to be what PDF is for document archival: basically a perfect copy of the structure of the DOM.

Then, I started using it and I noticed that very important resources were missing from the rrWeb snapshot. The images were not captured. This means that the web snapshot was not completely offline; if the domain goes dark, your web snapshots will lose all the images.

So I made a PR to implement capturing images and a series of fixes after that.

Capturing images is not hard per se, as long as you can get hold of the image data.

With that in mind, there are two ways to go about it:

  • Capture the binary stream of data by making a request with the image URL
  • Create a Canvas object, draw the image into it, and call its .toDataURL() function

Both methods are tricky, for different reasons:

  • Capturing the binary stream requires the rrWeb snapshot function to be async, or at least callback-based – and as I’m writing this article it is neither, so you would have to guess when the request has finished
  • Calling .toDataURL() requires the images to be from the same domain as the website, otherwise a normal browser will complain that “Tainted canvases may not be exported” and the function will fail

Of course, there are plenty of hacks you can do to overcome that, such as starting the browser with --disable-web-security and applying crossOrigin="anonymous" to all images.

MDN’s article “Allowing cross-origin use of images and canvas” is a useful guide for this.
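For illustration, here is a rough sketch of the canvas route (the helper name is mine); it only succeeds if the image is same-origin, or served with CORS headers and loaded with crossOrigin="anonymous":

    // Turn an image URL into a base64 data URI via a canvas.
    // toDataURL() throws "Tainted canvases may not be exported" if the
    // image was not loaded with cross-origin permission.
    function imageToDataURL(url: string): Promise<string> {
      return new Promise((resolve, reject) => {
        const img = new Image();
        img.crossOrigin = 'anonymous';
        img.onload = () => {
          const canvas = document.createElement('canvas');
          canvas.width = img.naturalWidth;
          canvas.height = img.naturalHeight;
          canvas.getContext('2d')!.drawImage(img, 0, 0);
          try {
            resolve(canvas.toDataURL('image/png'));
          } catch (err) {
            reject(err); // tainted canvas: cross-origin use was not allowed
          }
        };
        img.onerror = reject;
        img.src = url;
      });
    }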

Then, I discovered that the background images from the CSS stylesheets are not captured in the rrWeb snapshot. The background images can be very important for some pages and if they are missing, some buttons or product pictures will become invisible.

There are two types of background images: normal ones, which are applied to a real DOM node; and ones applied to the ::before or ::after pseudo-elements of a node, which basically don’t exist in the DOM and cannot be accessed from Javascript.

Capturing background images and web-fonts

To overcome this, I had to post-process the pages before calling rrWeb, replacing all background image URLs with their base64 representation, so that they are included in the snapshot and available offline.
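The post-processing looks roughly like the sketch below. This is a simplified illustration, not the actual code; it only handles same-origin stylesheets and ignores relative URLs and the pseudo-element cases described next.

    // Walk the accessible stylesheets and replace background-image URLs
    // with base64 data URIs, so they survive offline.
    async function inlineBackgroundImages(): Promise<void> {
      for (const sheet of Array.from(document.styleSheets)) {
        let rules: CSSRuleList;
        try {
          rules = sheet.cssRules; // throws for cross-origin stylesheets
        } catch {
          continue;
        }
        for (const rule of Array.from(rules)) {
          if (!(rule instanceof CSSStyleRule)) continue;
          const match = rule.style.backgroundImage.match(/url\(["']?(.+?)["']?\)/);
          if (!match || match[1].startsWith('data:')) continue;
          const blob = await (await fetch(match[1])).blob();
          const dataUri: string = await new Promise((resolve) => {
            const reader = new FileReader();
            reader.onload = () => resolve(reader.result as string);
            reader.readAsDataURL(blob);
          });
          rule.style.backgroundImage = `url("${dataUri}")`;
        }
      }
    }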

From there, I created new CSS classes derived from the URLs and applied them to the actual nodes, because you can’t hack the styles of those invisible pseudo-nodes. Then I discovered that web-fonts are not captured in rrWeb web snapshots either.

Web-fonts are not that important for styling text content. You can definitely have a working page with just the “Times Roman” font; that’s not the issue. The problem is that icon web-fonts like Font Awesome, Material Design Icons, or Bootstrap Icons are so important that some pages look completely broken without them.

So I had to post-process the pages again, to replace all WOFF2, WOFF, OTF, and TTF URLs with their base64 representation. Then, I realized that rrWeb doesn’t store the Javascript from the page. Yes, confusing, I know.

This is more of an rrWeb feature than a limitation: the snapshot already contains the whole DOM tree and doesn’t need Javascript to restore the structure. Running the Javascript twice can break the page, so skipping it makes sense.

However, in our case, keeping some of the scripts – specifically the type="application/ld+json" structured data – is critical.
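One simple way around it (not necessarily the only way) is to collect the structured-data scripts yourself and store them next to the snapshot:

    // Collect the JSON-LD structured data separately, since the rrWeb
    // snapshot drops <script> contents.
    function collectJsonLd(): string[] {
      const scripts = document.querySelectorAll('script[type="application/ld+json"]');
      return Array.from(scripts, (el) => el.textContent || '');
    }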

This was the final straw.

The series of workarounds needed to make this work had grown too big; it was starting to look like a completely new library.

Deep objects in rrWeb

Another potential issue with the rrWeb format is that it produces very, very deep objects. So deep that Python’s built-in JSON library hits its recursion limit, though at least you can raise that limit with sys.setrecursionlimit(10**5). Popular libraries like ujson and orjson will just fail loudly, without giving you any option to handle it.

To overcome this, I just keep the JSON data as a binary stream and don’t parse it in Python at all, unless I really have to.

After all of these issues, there were still edge cases and some domains were still consistently broken.

Despite all the problems I encountered, I have to say that rrWeb is impressive. It has tremendous potential and it seems to be moving in the right direction, e.g. the developers are thinking about making an async API.

To recap: rrWeb captures the DOM, but doesn’t embed all resources by default, such as images and fonts.

HTML

I have to mention this format and a few tools, even if it doesn’t work well for many pages. It’s generally more suited for articles, forum posts or documentation.

If you don’t care about maintaining the original page structure and resources, HTML can be the best option, because web pages are HTML by definition. However, you can’t simply download the HTML plus all the resources and expect perfect results: the Javascript will most likely break after some time, or it may already be broken the moment it’s downloaded.

In other words, the HTML and the resources must be processed before saving for archival. That’s why people are trying to create other kinds of formats, to solve this exact problem.

A few key tools to consider:

  • Y2Z/Monolith – Rust CLI tool to save complete web pages as a single HTML file
  • go-shiori/Obelisk – Go package and CLI tool for saving web page as single HTML file, inspired by Monolith
  • wabarc/Cairn – Typescript implementation of Obelisk
  • danBurzo/Percollate – CLI tool to turn web pages into beautiful, readable PDF, EPUB, or HTML docs
  • croqaz/Clean-Mark – Convert an article into clean text as Markdown or HTML

All these tools have the exact same problems as WARC tools, in the sense that a simple Javascript if condition to check the date can completely break a page on restore, even if all the resources are present and could potentially be restored.

Another common way pages break, which doesn’t happen with WARC, is that all Javascript requests fail, because the portable HTML is just a single file and there’s no server to handle the requests. But these cases don’t usually happen on article pages, which is great.

So basically, the HTML format is a great idea, but the tools I have checked so far don’t work with all kinds of pages.

Enter the “recorded” format.

The “recorded” format

This was the format that we used initially at Zyte, but it was generated with an old browser called Splash and the pages were not complete.

The name “recorded” is pretty uninspired, but it wasn’t meant to be public; that’s just what we called it internally.

This format is similar to WARC and HAR, but with a simple twist: the HTML saved in the snapshot is the final, settled HTML, after the page has fully loaded and the Javascript has run.

I made a public repository with an open-source implementation of this, similar to what we use, on GitHub: Croqaz Web-Snap.

When recording a page, a browser window opens and you can take as much time as you need to interact with the page: close popups, hover over some images, scroll, even inspect the HTML and delete parts of the page. In the end, Web-Snap captures the document’s inner HTML as is and saves it in the web snapshot.

On restore, a browser also opens and you can see the captured page and interact with it, fully offline. Of course, you can’t navigate to links that were never captured, but you can still interact with popups, buttons, and hover images.

Web-Snap is written in Node.js as a command-line app. It’s easier to use than webrecorder/pywb and much easier than the rrWeb library, for sure. The captured page could very easily be saved in the WARC file format, and that would make for an elegant implementation!

The reason I decided to use JSON instead of WARC is that the current WARC players don’t give you the option to enable or disable different features, e.g. you can’t restore a page with Javascript disabled, which is vital.

With Web-Snap, running the Javascript on restore is optional, and you also have the option to go online if you want. This format works really well.

How it solves WARC and rrWeb issues:

  • Captures absolutely all resources: CSS, JS, web fonts and images
  • Resources can optionally be post-processed to decrease their size
  • Generally the pages can be restored with JS disabled, but if that doesn’t work, JS can be run without many issues
  • Many pages from important domains can be captured this way

We use this format to capture live pages, and we can later apply different methods to extract text, HTML, screenshots, etc.

All that being said, this is still not perfect. I think the idea is great and it has huge potential, but the current implementation can still be improved. For example, we currently don’t capture iframes.

To recap: the “recorded” format captures the settled HTML and its resources, so JS doesn’t need to be re-executed, and the implementation is simple as long as we can intercept requests and replay responses.
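To make that concrete, here is a minimal sketch of such a capture flow using Puppeteer. This is an illustration of the idea, not the actual Web-Snap code, and it skips details like post-processing and iframe handling.

    import puppeteer from 'puppeteer';
    import { writeFileSync } from 'node:fs';

    // Record a page: save every network response, then save the settled
    // HTML after the page has finished loading and running its Javascript.
    async function recordPage(url: string, outFile: string): Promise<void> {
      const browser = await puppeteer.launch({ headless: false });
      const page = await browser.newPage();

      const resources: Record<string, { headers: Record<string, string>; body: string }> = {};
      page.on('response', async (response) => {
        try {
          const body = await response.buffer();
          resources[response.url()] = {
            headers: response.headers(),
            body: body.toString('base64'),
          };
        } catch {
          // some responses (e.g. redirects) have no body
        }
      });

      await page.goto(url, { waitUntil: 'networkidle0' });
      // (a real tool would pause here so the user can interact with the page)
      const html = await page.content(); // the final, settled HTML

      writeFileSync(outFile, JSON.stringify({ url, html, resources }));
      await browser.close();
    }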

Others

The above formats are by no means all of them! To name a few more: HAR, EPUB, MAFF, MHTML, PDF, screenshots, ZIM.

These are generally focused on capturing pages for debugging (HAR), or on long documents like books.

They don’t handle Javascript, so it’s impossible to capture many modern web pages.

Conclusion

There’s no silver bullet solution when it comes to web snapshots. We still have work to do. 

The archival of web pages is important. I’m passionate about the subject, and I will personally continue to think about it and work on it.

This article is the result of 6+ months of work trying to find the best way to create “perfect” web snapshots of web pages.

Get in touch to learn more about how we can help with your data extraction projects.