Avoiding Data Bottlenecks: The Starship Data Mesh | Author: Taavi Pungas | Starship Technologies
One gigabyte of data for a bag of food. That is what you get from a single robotic delivery. It’s a lot of data, especially when you repeat it more than a million times, as we have.
But the rabbit hole goes deeper. The data is also incredibly diverse: robot sensor and image data, user interactions with our apps, order transaction data, and much more. The use cases are just as varied, from training deep neural networks to creating polished visualizations for our merchant partners, and everything in between.
So far, we have managed all of this complexity with a centralized data team. But continued exponential growth has now led us to look for new ways of working to keep pace.
We have found that the data mesh paradigm is the best way forward. Below I describe Starship’s take on the data mesh, but first, let’s briefly summarize the approach and why we decided to adopt it.
What is a data mesh?
The data mesh framework was first described by Zhamak Dehghani. The paradigm rests on four core concepts: data products, data domains, data platform, and data governance.
The key intent of the data mesh framework is to help large organizations remove data engineering bottlenecks and deal with complexity. It therefore addresses many details that matter in an enterprise setting, from data quality, architecture, and security to governance and organizational structure. So far, only a handful of companies have publicly announced that they follow the data mesh paradigm, all of them large multi-billion-dollar enterprises. However, we believe it can also be applied successfully in smaller companies.
Data mesh at Starship
Do the data work close to the people who produce or consume the information
To run hyperlocal robotic delivery marketplaces around the world, we need to turn a wide variety of data into valuable data products. The data comes from robots (e.g., telemetry, routing decisions, ETAs), from merchants and customers (through their apps, orders, offerings, etc.), and from all operational aspects of the business (from brief remote operator tasks to the global logistics of spare parts and robots).
The diversity of use cases is the main reason we were drawn to the data mesh approach: we want the data work to happen very close to the people who produce or consume the information. By following data mesh principles, we hope to meet our teams’ diverse data needs while keeping central oversight relatively light.
Since Starship is not yet at enterprise scale, it is not practical for us to implement every aspect of a data mesh. Instead, we have settled on a simplified approach that makes sense for us now and puts us on the right path for the future.
Data products
Define what your data products are, each with an owner, an interface, and users
Applying product thinking to our data is the foundation of the whole approach. We think of anything that exposes data to other users or processes as a data product. It can expose its data in any form: as a BI dashboard, a Kafka topic, a data warehouse view, a response from a predictive microservice, and so on.
A simple example of a Starship data product is a BI dashboard for tracking a site’s turnover. A more elaborate example is a self-service pipeline that lets robot software engineers send any kind of driving information from robots to our data lake.
In any case, we do not treat our data warehouse (actually a Databricks lakehouse) as a single product, but as a platform that supports many interconnected products. These granular products are typically built and maintained by data scientists or engineers, not by dedicated product managers.
The product owner is expected to know who the product’s users are and which of their needs it addresses, and to define and meet the product’s quality expectations accordingly. Perhaps as a result, we have begun to pay more attention to interfaces: components that are crucial for usability but difficult to change.
Most importantly, understanding who uses each product and what value it creates for them makes it much easier to prioritize ideas. This is essential in a startup context, where you need to move fast and don’t have time to make everything perfect.
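To make the "owner, interface, users" idea concrete, here is a minimal sketch of how a data product could be described in code. It is only an illustration under assumptions: the `DataProduct` class, the `InterfaceKind` values, the owner labels, and the quality expectations are all hypothetical and do not come from Starship's actual tooling.

```python
from dataclasses import dataclass
from enum import Enum


class InterfaceKind(Enum):
    """Ways a data product can expose its data (list taken from the text above)."""
    BI_DASHBOARD = "bi_dashboard"
    KAFKA_TOPIC = "kafka_topic"
    WAREHOUSE_VIEW = "warehouse_view"
    MICROSERVICE = "microservice"
    PIPELINE = "pipeline"


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor: every product has an owner, an interface, and users."""
    name: str
    owner: str
    interface: InterfaceKind
    users: tuple[str, ...]
    quality_expectations: tuple[str, ...] = ()

    def __post_init__(self):
        # A product without an owner or without users is not a product in this sense.
        if not self.owner:
            raise ValueError(f"{self.name}: a data product needs an owner")
        if not self.users:
            raise ValueError(f"{self.name}: a data product needs at least one user")


# The two examples from the text, with made-up owners and expectations.
dashboard = DataProduct(
    name="site_turnover_dashboard",
    owner="data-scientist-a",  # hypothetical owner label
    interface=InterfaceKind.BI_DASHBOARD,
    users=("site leads",),
    quality_expectations=("refreshed daily",),
)
pipeline = DataProduct(
    name="driving_data_pipeline",
    owner="data-engineer-b",  # hypothetical owner label
    interface=InterfaceKind.PIPELINE,
    users=("robot software engineers",),
)
```

Writing the users down next to the owner is the point of the sketch: it forces the question of who the product is for before any work on it is prioritized.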
Data domains
Group your data products into domains that reflect the organizational structure of the company
Before we learned about the data mesh model, we had for some time successfully used a format of lightly embedded data scientists at Starship. In effect, some key teams had a data team member working with them part-time, whatever that meant in any given team.
We started defining data domains in line with our organizational structure, this time being careful to cover every part of the company. After mapping data products to domains, we assigned a data team member to look after each domain. This person is responsible for the domain’s entire set of data products; some of these are owned by that same person, others by other engineers in the domain team, and others by other members of the data team (e.g., for resourcing reasons).
We like several things about our domain setup. First, every area of the company now has a person who cares about its data architecture. Given the intricacies inherent in each domain, this is only possible because we have divided up the work.
Creating structure in our data products and interfaces has also helped us make better sense of our data world. For example, in a situation where there are more domains than data team members (currently 19 vs 7), we now do a better job of making sure that each of us works on a set of related topics. And we now understand that, to ease growing pains, we should minimize the number of interfaces used across domain boundaries.
Finally, a more subtle bonus of using data domains: we now feel we have a recipe for dealing with new situations. Every time a new initiative comes up, everyone sees much more clearly where it belongs and who should take care of it.
There are also some open questions. Some domains naturally lean mostly toward exposing source data and others toward consuming and transforming it, but some do a fair amount of both. Should we split these up when they grow too large? Or should we have subdomains within larger ones? We will have to make these decisions in the future.
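The point about minimizing interfaces across domain boundaries can be checked mechanically once products are mapped to domains. The sketch below is a hypothetical illustration: the product names, domain names, and dependency list are invented for the example, and only the idea (count the dependencies whose producer and consumer sit in different domains) comes from the text.

```python
from collections import Counter

# Hypothetical mapping of data products to domains.
PRODUCT_DOMAIN = {
    "robot_telemetry_topic": "robot_software",
    "driving_data_lake_tables": "robot_software",
    "order_transactions_view": "marketplace",
    "site_turnover_dashboard": "operations",
}

# Hypothetical dependencies as (consumer, producer) pairs.
DEPENDENCIES = [
    ("driving_data_lake_tables", "robot_telemetry_topic"),
    ("site_turnover_dashboard", "order_transactions_view"),
    ("site_turnover_dashboard", "driving_data_lake_tables"),
]


def cross_domain_interfaces(product_domain, dependencies):
    """Count dependencies that cross a domain boundary, keyed by (producer domain, consumer domain)."""
    counts = Counter()
    for consumer, producer in dependencies:
        consumer_domain = product_domain[consumer]
        producer_domain = product_domain[producer]
        if consumer_domain != producer_domain:
            counts[(producer_domain, consumer_domain)] += 1
    return counts


crossings = cross_domain_interfaces(PRODUCT_DOMAIN, DEPENDENCIES)
for (producer_domain, consumer_domain), n in sorted(crossings.items()):
    print(f"{producer_domain} -> {consumer_domain}: {n} interface(s)")
```

A count like this could also inform the open question of when to split a domain: a domain whose products mostly depend on each other is a more natural unit than one with many outward-facing interfaces.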
Data platform
Empower the people building your data products by standardizing without centralizing
The goal of the Starship data platform is straightforward: to make it possible for a single data person (usually a data scientist) to take care of a domain end to end, that is, to keep the central data platform team out of the day-to-day work. This requires giving domain engineers and data scientists good tools and standard building blocks for their data products.
Does this mean you need a full-blown data platform team for the data mesh approach? Not really. Our data platform team consists of a single data platform engineer, who at the same time spends half of their time embedded in a domain. The main reason we can be so lean in data platform engineering is our choice of Spark + Databricks as the backbone of our data platform. Our previous, more traditional data warehouse architecture imposed a high data engineering overhead on us because of the diversity of our data domains.
We have found it useful to make a clear distinction in the data stack between the components that are part of the platform and everything else. Some examples of what we offer domain teams as part of our data platform:
- Databricks + Spark as a versatile work environment and computing platform;
- one-line data ingestion functions, e.g., from Mongo collections or Kafka topics;
- an Airflow instance for orchestrating data pipelines;
- templates for building and deploying predictive models as a microservice;
- monitoring data product costs;
- BI and visualization tools.
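As an illustration of what a one-line ingestion function might look like from a domain owner's point of view, here is a minimal sketch. It is not Starship's implementation: the function names, the `IngestionJob` fields, the target-table naming scheme, and the default values are all assumptions, and the sketch only builds a job description rather than calling Spark, Kafka, or Mongo.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestionJob:
    """Hypothetical description of an ingestion job that a platform could run."""
    source_type: str
    source_name: str
    target_table: str
    schedule: str
    file_format: str


# Assumed platform-wide defaults; a domain owner overrides them only when needed.
PLATFORM_DEFAULTS = {"schedule": "hourly", "file_format": "delta"}


def _build_job(source_type, source_name, domain, **overrides):
    unknown = set(overrides) - set(PLATFORM_DEFAULTS)
    if unknown:
        raise TypeError(f"unsupported options: {sorted(unknown)}")
    settings = {**PLATFORM_DEFAULTS, **overrides}
    # Standardized naming (an assumption): <domain>.raw_<sanitized source name>.
    safe_name = re.sub(r"[^0-9a-zA-Z_]+", "_", source_name).lower()
    return IngestionJob(
        source_type=source_type,
        source_name=source_name,
        target_table=f"{domain}.raw_{safe_name}",
        **settings,
    )


def ingest_kafka_topic(topic, domain, **overrides):
    """One call per Kafka topic; everything else comes from platform defaults."""
    return _build_job("kafka", topic, domain, **overrides)


def ingest_mongo_collection(collection, domain, **overrides):
    """One call per Mongo collection; everything else comes from platform defaults."""
    return _build_job("mongo", collection, domain, **overrides)


# Usage: a single line per source in the domain owner's code.
job = ingest_kafka_topic("robot_telemetry", domain="robot_software")
daily = ingest_mongo_collection("orders.v2", domain="marketplace", schedule="daily")
```

The design choice being illustrated is standardizing without centralizing: naming and defaults live in the platform helper, while the decision about what to ingest, and when to deviate from the defaults, stays with the domain owner.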
As a general approach, our goal is to standardize as much as makes sense in our current context, even the parts we know will not stay standardized forever. As long as it helps productivity right now and doesn’t centralize any part of the process, we’re happy. And of course, some elements are completely missing from the platform today. For example, tooling for data quality assurance, data discovery, and data lineage are things we have left for the future.
Data governance
Strong personal ownership supported by peer feedback
Having fewer people and teams is actually an advantage in some aspects of governance; for example, it makes decisions much easier to make. On the other hand, our key governance question is also a direct consequence of our size. If there is only one data person per domain, they cannot be expected to be an expert in every potential technical aspect. Yet they are the only person who understands their domain in detail. How do we maximize the chances that they make good choices in their domain?
Our answer: through a culture of ownership, discussion, and feedback within the team. We have borrowed liberally from Netflix’s management philosophy and worked on the following:
- personal responsibility for the outcome (of one’s own products and domains);
- seeking different opinions before making decisions, especially those that affect other domains;
- soliciting feedback and code reviews, both as a quality mechanism and as an opportunity for personal growth.
We have also made some specific agreements on how we approach quality, written down our best practices (including naming conventions), and so on. But good feedback loops are the key ingredient in turning guidelines into reality.
These principles also apply outside the “building” work of our data team, which is what this blog post has focused on. Of course, there is much more to how our data scientists create value than providing data products.
A final reflection on governance: we will keep iterating on our ways of working. There will never be a single “right” way to do things, and we know we need to adapt over time.
Last words
That’s it! These were the 4 core concepts of the data mesh as applied at Starship. As you can see, we have found an approach to the data mesh that suits us as a nimble, growing company. If it sounds appealing in your context, I hope reading about our experience has been helpful.
If you would like to join our work, see our careers page for the list of open positions. Or check out our YouTube channel to learn more about the world’s leading robot delivery service.
Reach out to me if you have any questions or thoughts, and let’s learn from each other!