Inside Zyte's System Design Process: How We Build Scalable, Reliable Solutions
During interviews before joining Zyte, some developers were curious about what their developer experience would be like if they joined. I’m Alexander, and I work on systems architecture at Zyte. In this article, I’m going to explain how we do system design for our products.
Creating an Effective PRD
The first step is making a PRD, a product requirements document, in which product owners specify the feature: how it should work, how customers will interact with it, and so on. The most important things to understand at this point are the changes in user experience to be delivered and the cost, which dictates how much we can spend on developing and maintaining the feature. The audience of the document is the product team.
Here’s an example of the PRD:
The platform team is responsible for developing the sign-up flow and the Zyte API dashboard, where the API key is created. The key needs to be passed to the ZAPI backend to inform the system that there is a new user, grant access, set up rate limits, and apply organization-specific discounts. The development time should not exceed one week per person, and maintenance costs should be negligible compared to the main workflow; in other words, there is no dedicated budget for the maintenance of this functionality.
Technical requirements for PRD
The next step is making the technical requirements, a document containing functional and other requirements that formally explains the functionality; its audience is developers. Usually, we use a template to make things easier for writers. When filling out the template fields, one has to decide what availability, scalability, failover, and other concerns the system will need to meet.
Here’s an example of the technical requirements for the above PRD:
Functional requirements
The provisioning event for the user is generated in the dash web worker as a result of processing the response from the payments gateway. The structure to be passed will be a flat JSON containing a 20-character hex API key string, three integer rate limits (1, 5, and 15 minutes), and a float representing the organization discount applied to the account. The architecture of the web worker makes it difficult to arrange retries.
The system should be able to perform up to 10K API key checks per second for 1K users.
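To make the shape of that payload concrete, here is a minimal sketch of what the provisioning message could look like in Python. The field names and values are illustrative assumptions; only the overall structure (a flat JSON object with a hex API key, three rate limits, and a discount) comes from the requirements above.

```python
from dataclasses import asdict, dataclass
import json


@dataclass
class ProvisioningEvent:
    """Flat provisioning payload described in the functional requirements.

    The field names here are illustrative; the requirements only fix the
    shape: a 20-character hex API key, three integer rate limits and a
    float organization discount.
    """
    api_key: str         # 20-character hex string identifying the account
    rate_limit_1m: int   # requests allowed per 1 minute
    rate_limit_5m: int   # requests allowed per 5 minutes
    rate_limit_15m: int  # requests allowed per 15 minutes
    discount: float      # organization discount applied to the account


# Example message as it would be handed to the transport.
event = ProvisioningEvent(
    api_key="a1b2c3d4e5f60718293a",
    rate_limit_1m=100,
    rate_limit_5m=450,
    rate_limit_15m=1200,
    discount=0.15,
)
payload = json.dumps(asdict(event))
```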
Non-functional requirements
The provisioning process should complete within seconds. The process should be as reliable as possible, because the loss of provisioning events results in a very poor user experience, is hard for support to troubleshoot, and quite likely ends up in developers' sprints. In the case of failures, the functionality is expected to recover on its own. Data loss is unacceptable. The functionality may be extended in the future by adding more fields.
The following service level indicators must be introduced and monitored:
Number of provisioning events generated in the dashboard/sign-up system
Number of provisioning events accepted by ZAPI
Time required for a generated provisioning event to be accepted by the ZAPI
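As a sketch of how these indicators could be exposed, assuming a Prometheus-style setup (the metric names and the prometheus_client library are illustrative choices rather than part of the requirements):

```python
from prometheus_client import Counter, Histogram

# SLI 1: provisioning events generated in the dashboard / sign-up system.
provisioning_events_generated = Counter(
    "provisioning_events_generated_total",
    "Provisioning events generated in the dashboard/sign-up system",
)

# SLI 2: provisioning events accepted by ZAPI.
provisioning_events_accepted = Counter(
    "provisioning_events_accepted_total",
    "Provisioning events accepted by the ZAPI backend",
)

# SLI 3: time from generating a provisioning event to its acceptance by ZAPI.
provisioning_latency_seconds = Histogram(
    "provisioning_latency_seconds",
    "Time between generating a provisioning event and its acceptance by ZAPI",
)
```

Alerting on a growing gap between the first two counters would then surface lost provisioning events quickly.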
The fundamental difference between the two documents is the intended audience, and as a result, the level of detail and concepts used to describe the feature being designed.
Finally, when technical requirements are ready, we ensure that everyone involved in the design process understands them the same way. After we are done with the requirements, we start collecting possible solutions.
Any developer may come up with a half-page proposal explaining the core idea, and we add it to a document outlining all the options the team has developed.
Here are example half-pager ideas for the above technical requirements:
1. Use a Kafka topic to transfer the provisioning message. The web worker will produce the provisioning message to the topic, and the Zyte API Server will consume it.
Pros: Very low latency, low computational overhead.
Cons: A need for a healthy Apache Kafka instance and the cost associated with running it.
2. Generate a list of provisioned accounts from an async job in the web worker, upload it to block storage such as Amazon S3 or Google GCS, and signal the ZAPI API Server by means of Pub/Sub to download and update it (a sketch of this idea follows the list).
Pros: Transparency, easy to troubleshoot.
Cons: Lock-in on Google's/Amazon's Pub/Sub service for notification and block storage.
3. Periodically request the full user list from the Zyte API Server and update.
Pros: Easy integration, controlled frequency and timing of the updates.
Cons: Limited latency reduction options, scalability challenges, network/CPU overhead.
4. A microservice requests the full user list synchronously from the web workers and caches it, and the API Server requests it on a per-request basis. Suboptions include implementing it in various languages and frameworks.
Pros: Same as option 3, but the solution is optimized for API key lookups, so fewer of option 3's cons.
Cons: The need to develop and maintain a separate component.
5. Use a local MySQL replica of the users table from the web worker, and have the API Server query it directly in read-only mode.
Pros: (Kind of) easy to integrate.
Cons: MySQL replicas would have to be tuned to handle the load, and replication needs to be monitored and resynchronized in the case of failures.
6. Use Change Data Capture (Debezium) to populate the topic with the changes in the web worker users table.
Pros: No need to do anything on the web worker side.
Cons: Maintenance of Debezium and Kafka; the generation and handling of the event is non-transparent.
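To give a feel for how thin such a half-pager stays, here is what the core of option 2 could look like, assuming Google Cloud Storage and Pub/Sub; the bucket, project, and topic names are placeholders, not actual Zyte infrastructure.

```python
import json

from google.cloud import pubsub_v1, storage

# Placeholder names; the real bucket, project and topic are deployment details.
BUCKET = "zyte-provisioned-accounts"
PROJECT = "zyte-prod"
NOTIFY_TOPIC = "provisioning-updates"


def publish_provisioned_accounts(accounts: list[dict]) -> None:
    """Upload the full list of provisioned accounts and notify ZAPI."""
    # Async job in the web worker: dump the current list to block storage.
    blob = storage.Client().bucket(BUCKET).blob("accounts.json")
    blob.upload_from_string(json.dumps(accounts), content_type="application/json")

    # Signal the Zyte API Server via Pub/Sub that a new list is available.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT, NOTIFY_TOPIC)
    publisher.publish(topic_path, b"accounts.json updated").result()
```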
Once we have a document with possible solutions, we start to compare them. There is no single way of doing it, and sometimes it becomes frustrating, like comparing apples to parrots, but here are a few tips on how to make sense of it:
Discard minor details, and concentrate on the critical aspects.
Decide which critical aspects are more important than others (development time vs. cost for example).
Collapse similar solutions and make variations.
Summarize the options and their critical aspects in a single table, with as few words as possible, so the table fits a screen or a sheet of paper.
For the product, the critical aspects were the latency, development time, and reliability of the solution.
For example, the summary table for the above solutions could look like this:

| Category | Option | Latency | Reliability | Cost | Development time |
| --- | --- | --- | --- | --- | --- |
| Kafka-based | Kafka topic to transfer the provisioning message | Good | Low (because of Kafka) | Low | Average (library + fixing issues on the web worker side) |
| Kafka-based | Change Data Capture (Debezium) to populate the topic with changes in the web worker users table | Good | Low (the CDC will produce noise, schema migration issues, Debezium is hard to monitor) | Low | High (Debezium setup, testing various scenarios, learning the Debezium format) |
| Synchronous | Periodically request the full user list from the Zyte API Server and update | Bad | Good (if we discard the scaling issues) | High (network traffic) | High (the system would have to be rebuilt on the web worker side to support the new requirements) |
| Synchronous | Local MySQL replica of the users table from the web worker, queried directly by the API Server in read-only mode | Bad | Average (replication over a public network) | High (local replica maintenance) | High (deploying a new HA component and a client to access it) |
| Mixed | Microservice that requests the full user list synchronously from the web workers, caches it, and serves the API Server on a per-request basis | Bad | Depends on impl. details | Average (microservice presence) | Depends on impl. details, but the microservice would have to be developed |
| Other | Generate a list of provisioned accounts from an async job in the web worker, upload it to block storage (Amazon S3 / Google GCS), and signal ZAPI via Pub/Sub to download and update | Good (if using Pub/Sub, Bad otherwise) | High (if we exclude the Pub/Sub) | Around $370 | Average (GCS and Pub/Sub have Java libraries) |
Finally, we selected option one, because we already had Kafka provisioned at Zyte, and the latency and development time were low.
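For illustration, here is a minimal sketch of what the selected Kafka-based option could look like using the confluent-kafka Python client. The topic name, broker address, consumer group, and message fields are placeholders rather than the actual Zyte configuration.

```python
import json

from confluent_kafka import Consumer, Producer

TOPIC = "account-provisioning"   # placeholder topic name
BROKERS = "kafka:9092"           # placeholder bootstrap servers

# Web worker side: after the payment gateway response has been processed,
# produce the flat provisioning payload to the topic.
event = {
    "api_key": "a1b2c3d4e5f60718293a",  # illustrative field names and values
    "rate_limit_1m": 100,
    "rate_limit_5m": 450,
    "rate_limit_15m": 1200,
    "discount": 0.15,
}
producer = Producer({"bootstrap.servers": BROKERS})
producer.produce(TOPIC, json.dumps(event).encode("utf-8"))
producer.flush()  # make sure the message actually reaches the broker

# Zyte API Server side: consume provisioning messages and apply them.
consumer = Consumer({
    "bootstrap.servers": BROKERS,
    "group.id": "zapi-provisioning",  # placeholder consumer group
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,      # commit only after the account is provisioned
})
consumer.subscribe([TOPIC])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    account = json.loads(msg.value())
    # Here the real system would grant access, set the rate limits
    # and apply the organization discount for the new account.
    print("provisioned", account["api_key"])
    consumer.commit(msg)  # at-least-once: the event is not lost on failure
```

In the real system, committing the offset only after the account is durably provisioned is what keeps events from being lost on failure, which matters given that data loss was declared unacceptable in the requirements.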
Conclusion
The above process is an example of the design process developed and used at Zyte for delivering new functionality and maintaining the system. This design process has taken several evolutionary steps before reaching its current form. For example, initially there were no technical requirements documents, so a PRD was sent straight to the team for design.
It turned out that the requirements were understood differently by various team members, and as a result it took more time to discuss solutions, sometimes even leading to situations where no agreement was reached. To fix this, a new stage of writing technical requirements using a template was introduced. We also run weekly, on-demand open design sessions, where we can provide quick feedback on design artifacts, and there is an internal knowledge base where one can learn from examples of various artifacts when doing design work.