Hi! Thank you for your comment. I'm glad you enjoyed the article.
I have good experience with two tools for engineering and product metrics; each is a bit different, but they complement each other nicely.
The first one is DataDog. It's effectively an industry standard for engineering monitoring and observability. It is great for system reliability, error rates, monitoring, and alerting.
As an example of a reliability calculation, you can start simply by picking your most important user scenarios or API endpoints and calculating reliability over some time period as:
100 * (number of successful scenarios) / (number of successful + number of failed scenarios) -> this will give you a percentage, let's say 97%, which is the reliability of that scenario. Then error rate can be:
100 - reliability (in the example above 3%) over the same time period.
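To make the arithmetic above concrete, here is a minimal sketch in Python with hypothetical counts (the numbers are made up for illustration, not real data):

```python
# Hypothetical counts for one scenario (e.g. a checkout API endpoint)
# over some time period.
successful = 9700  # requests that completed successfully
failed = 300       # requests that failed

# Reliability as a percentage, per the formula above.
reliability = 100 * successful / (successful + failed)

# Error rate is simply the complement over the same period.
error_rate = 100 - reliability

print(f"reliability: {reliability:.1f}%")  # 97.0%
print(f"error rate: {error_rate:.1f}%")    # 3.0%
```

In a real setup you would pull these counts from your monitoring tool's query API instead of hard-coding them.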
In DataDog you can build both dashboards and monitors on top of such calculations. It scales well because you can manage them as Terraform templates.
For product-oriented metrics, Amplitude can be a good pick. Feature adoption can be as simple as:
100 * (number of users using the feature) / (overall number of active users) over a time period. Then you can observe on a dashboard whether the percentage increases over time. Of course, this can also be done in DataDog, but Amplitude is good at segmentation, funnels, longer data retention, etc., if you need that.
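The adoption formula works the same way; a small sketch with hypothetical numbers (again, these counts are made up and would normally come from your analytics tool):

```python
# Hypothetical counts over one time period.
feature_users = 450   # distinct users who used the feature
active_users = 3000   # all distinct active users in the same period

# Feature adoption as a percentage of the active user base.
adoption = 100 * feature_users / active_users

print(f"adoption: {adoption:.1f}%")  # 15.0%
```

Recomputing this per week or per month and plotting the series is what lets you see whether adoption is actually trending up.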
Of course, you can go with ELK stack, Prometheus, or similar for on-prem solutions. It all depends on the needed scale and budget.
Thanks for this detailed reply and the explanation with examples. I deeply appreciate your help! I will look into DataDog and the others to learn more…