Site Reliability Engineering at Starship | Author: Martin Pihlak | Starship Technologies
Running autonomous robots on city streets is a huge software engineering challenge. Some of this software runs on the robot itself, but a lot of it actually runs in the backend: remote control, route finding, matching robots with customers, fleet health management, but also interactions with customers and merchants. All of this has to run 24×7, without interruptions, and scale dynamically to match the workload.
Starship’s SRE team is responsible for providing the cloud infrastructure and platform services for running these backend services. We have standardized on Kubernetes for our microservices and run it on top of AWS. MongoDb is the primary database for most backend services, but we also like PostgreSQL, especially where strong typing and transactional guarantees are required. For asynchronous messaging, Kafka is the platform of choice, and aside from shipping video streams from the robots we use it for almost everything. For observability we rely on Prometheus and Grafana, Loki, Linkerd and Jaeger. CI/CD is handled by Jenkins.
A large part of the SRE team’s time is spent maintaining and improving the Kubernetes infrastructure. Kubernetes is our main deployment platform and there is always something to improve, be it fine-tuning autoscaling settings, adding Pod disruption policies, or optimizing the use of Spot instances. Sometimes it is like laying bricks: simply installing a Helm chart to provide a particular piece of functionality. Often, however, the “bricks” need to be carefully chosen and evaluated (is Loki a good fit for log management; is a service mesh worth it, and if so, which one), and occasionally the functionality does not exist anywhere and has to be written from scratch. When that happens we usually turn to Python and Golang, but also to Rust and C when needed.
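To give a flavour of what one of these “bricks” looks like in code rather than prose, here is a minimal sketch of creating a Pod disruption budget with the Kubernetes Go client. The deployment name, namespace, and label selector are made up for illustration; in practice the same thing is often just a few lines of YAML.

```go
package main

import (
	"context"
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig; in-cluster config would work just as well.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Keep at least two replicas of a hypothetical "route-service" running
	// during voluntary disruptions such as node drains for Spot rotation.
	minAvailable := intstr.FromInt(2)
	pdb := &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "route-service-pdb", Namespace: "default"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "route-service"},
			},
		},
	}

	created, err := client.PolicyV1().PodDisruptionBudgets("default").
		Create(context.Background(), pdb, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created PDB:", created.Name)
}
```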
Another major piece of infrastructure that SRE is responsible for is data and databases. Starship started out with a single monolithic MongoDb, a strategy that has worked well so far. However, as the business grows we need to revisit this architecture and start thinking about supporting robots by the thousands. Apache Kafka is part of the scaling story, but we also need to work out the architecture for sharding, regional clustering, and per-microservice databases. On top of that we are constantly developing tools and automation to manage the current database infrastructure. Examples: adding MongoDb observability with a custom sidecar proxy to analyze database traffic, enabling PITR support for databases, automating routine failover and recovery tests, collecting metrics for Kafka re-sharding, enabling data retention.
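To illustrate the sidecar proxy idea, here is a hypothetical sketch (not our actual proxy code): a sidecar can start out as nothing more than a TCP forwarder sitting between the application and mongod that keeps traffic counters. The real work of parsing the MongoDb wire protocol to attribute traffic to collections and operations is left out here.

```go
package main

import (
	"io"
	"log"
	"net"
)

// handle forwards one client connection to the local mongod and counts the
// bytes moving in each direction.
func handle(client net.Conn, upstreamAddr string) {
	defer client.Close()
	server, err := net.Dial("tcp", upstreamAddr)
	if err != nil {
		log.Printf("dial upstream: %v", err)
		return
	}
	defer server.Close()

	done := make(chan int64, 1)
	go func() {
		// application -> mongod
		n, _ := io.Copy(server, client)
		server.Close() // unblocks the copy below; crude but fine for a sketch
		done <- n
	}()
	// mongod -> application
	fromServer, _ := io.Copy(client, server)
	client.Close() // unblock the other direction if it is still reading
	toServer := <-done

	// A real sidecar would export these as Prometheus metrics instead of logging.
	log.Printf("connection closed: %d bytes to server, %d bytes to client",
		toServer, fromServer)
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:27016") // applications connect here
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go handle(conn, "127.0.0.1:27017") // the real mongod
	}
}
```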
Finally, one of the most important goals of Site Reliability Engineering is to minimize downtime for Starship’s production. Although SRE is occasionally called upon to deal with infrastructure outages, the more impactful work goes into preventing outages and making sure we can recover quickly. This can be a very broad topic, ranging from having rock-solid K8s infrastructure all the way to engineering practices and business processes. There are great opportunities to make an impact!
A day in the life of an SRE
Arrive at work, some time between 9am and 10am (sometimes working remotely). Grab a coffee, check Slack messages and emails. Review the alerts that fired during the night and see if there is anything interesting there.
Notice that MongoDb connection latencies spiked during the night. Digging deeper into the Prometheus metrics in Grafana, find that this happens while the backups are running. Why is this suddenly a problem? We have been running these backups for ages. It turns out that we compress the backups very aggressively to save on network and storage costs, and this consumes all the available CPU. The load on the database seems to have grown just enough to make this noticeable. It is happening on a standby node, so production is not affected, but it is still a problem should the primary fail. Add a ticket to fix this.
While at it, change the MongoDb prober code (Golang) to add more histogram buckets to get a better picture of the latency distribution. Run a Jenkins pipeline to roll the new probe out to production.
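For context, “more histogram buckets” boils down to something like the following with the Prometheus Go client. The metric name, bucket boundaries, and probe interval here are illustrative, not the prober’s actual ones.

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Illustrative metric for a MongoDb prober: connection latency in seconds.
// Finer-grained buckets are what make it possible to see where in the
// distribution a slowdown (such as the backup-related one) actually lands.
var connectLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "mongodb_probe_connect_duration_seconds",
	Help:    "Time taken to establish a MongoDb connection from the prober.",
	Buckets: prometheus.ExponentialBuckets(0.001, 2, 14), // 1ms .. ~8s
})

func main() {
	prometheus.MustRegister(connectLatency)

	go func() {
		for {
			start := time.Now()
			err := connectToMongo() // stand-in for the prober's real dial logic
			connectLatency.Observe(time.Since(start).Seconds())
			if err != nil {
				log.Printf("probe failed: %v", err)
			}
			time.Sleep(15 * time.Second)
		}
	}()

	// Expose the metrics endpoint for Prometheus to scrape.
	log.Fatal(http.ListenAndServe(":9216", promhttp.Handler()))
}

// connectToMongo is a placeholder; the real prober dials MongoDb here.
func connectToMongo() error { return nil }
```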
There is a stand-up meeting at 10am; share your updates with the team and learn what others have been up to: setting up monitoring for a VPN server, instrumenting a Python application with Prometheus, setting up ServiceMonitors for external services, debugging MongoDb connectivity issues, piloting canary deployments with Flagger.
After the meeting, get started on the work planned for the day. One of the things I planned to do today was to set up an additional Kafka cluster in the test environment. We run Kafka on Kubernetes, so it should be straightforward to take the existing cluster YAML files and tweak them for the new cluster. Or, on second thought, should we use Helm instead, or maybe there is a good Kafka operator available by now? No, not going there: too much magic, I want more explicit control over my StatefulSets. Plain YAML it is. An hour and a half later a new cluster is up and running. The setup was fairly straightforward; only the init containers that register the Kafka brokers in DNS needed a configuration change. Generating the credentials for the applications required a small bash script to set up the accounts in Zookeeper. One bit that was left dangling was setting up Kafka Connect to capture database change log events: it turns out the test databases are not running in ReplicaSet mode, so Debezium cannot read an oplog from them. Put it in the backlog and move on.
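Once the brokers are up, a quick smoke test of the new cluster can be done from Go. The sketch below uses the segmentio/kafka-go client purely as an illustration; the broker address and topic name are invented for the example.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"strconv"

	"github.com/segmentio/kafka-go"
)

func main() {
	// Connect to any broker of the new test cluster (address is illustrative).
	conn, err := kafka.Dial("tcp", "kafka-test-0.kafka-test:9092")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Topic creation has to go through the controller broker.
	controller, err := conn.Controller()
	if err != nil {
		log.Fatal(err)
	}
	ctrlConn, err := kafka.Dial("tcp",
		net.JoinHostPort(controller.Host, strconv.Itoa(controller.Port)))
	if err != nil {
		log.Fatal(err)
	}
	defer ctrlConn.Close()

	// Create a small throwaway topic to confirm brokers, DNS registration and
	// replication are all working.
	err = ctrlConn.CreateTopics(kafka.TopicConfig{
		Topic:             "smoke-test",
		NumPartitions:     3,
		ReplicationFactor: 3,
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("cluster looks healthy, topic created")
}
```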
Now it is time to prepare a scenario for the Wheel of Misfortune exercise. At Starship we run these to improve our understanding of the systems and to share troubleshooting techniques. It works by breaking some part of the system (usually in test) and having an unfortunate person try to troubleshoot and mitigate the problem. In this case I will set up a load test with hey to overload the microservice that calculates routes. Deploy it as a Kubernetes Job called “haymaker” and hide it well enough so that it does not immediately show up in the Linkerd service mesh (yes, evil). Later, run the “Wheel” exercise and take note of any gaps we have in runbooks, metrics, alerts and so on.
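The “haymaker” itself needs to be little more than a concurrent HTTP hammer wrapped into a container image and run as a Kubernetes Job. A rough sketch of the idea follows; the route-service URL, worker count, and duration are invented for illustration.

```go
package main

import (
	"log"
	"net/http"
	"sync"
	"time"
)

// Minimal hey-style load generator: keep N workers firing requests at the
// route calculation service for a fixed duration.
func main() {
	const (
		target   = "http://route-service.test.svc.cluster.local/route" // hypothetical
		workers  = 50
		duration = 10 * time.Minute
	)

	deadline := time.Now().Add(duration)
	client := &http.Client{Timeout: 5 * time.Second}

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for time.Now().Before(deadline) {
				resp, err := client.Get(target)
				if err != nil {
					continue // errors are rather the point of the exercise
				}
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()
	log.Println("load test finished")
}
```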
In the last hours of the day, block all interruptions and try to get some coding done. I have re-implemented the Mongoproxy BSON parser as an asynchronous streaming parser (Rust + Tokio) and want to see how it behaves with real data. It looks like there is a bug somewhere in the parser guts, and I need to add some deep logging to figure it out. Find a wonderful tracing library for Tokio and get carried away with it …
Note: The events described here are based on a true story. Not everything happened on the same day. Some meetings and interactions with colleagues have been edited. We are hiring.