Grafana is the de facto dashboarding solution for time-series data. It supports over 40 data sources (as of this writing), and the dashboarding story has matured considerably with new features, including the addition of teams and folders. We now want to move on from being a dashboarding solution to being an observability platform: the go-to place when you need to debug systems on fire.
Observability. There are a lot of definitions out there as to what that means. To me, observability is visibility into your systems and how they are behaving and performing.
I quite like the model where observability is split into three parts (or pillars): metrics, logs, and traces, each complementing the others to help you figure out what's wrong quickly. Prometheus sends me an alert that something is wrong, and I open the relevant dashboard for the service. If I find a panel or graph anomalous, I'll open the query in Grafana's new Explore UI for a deeper dive.
For example, if I find that one of the services is throwing 500 errors, I'll try to figure out whether a particular handler/route is throwing the error, whether all instances are throwing it, and so on. Next up, once I have a rough mental model of what is going wrong, or where, I'll look at the logs. Pre-Loki, I used kubectl to pull the relevant logs to see what the error was and whether I could do something about it.
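To make that concrete, here's the kind of query I might run in Explore to see whether the 500s come from one handler or one instance. This is only a sketch: the metric name `http_requests_total`, the `job="checkout"` selector, and the `status` label are assumptions about how the service is instrumented; substitute whatever your service actually exports.

```promql
# Rate of 500 responses over the last 5 minutes, broken out by handler and instance.
# Metric and label names here are hypothetical examples.
sum by (handler, instance) (
  rate(http_requests_total{job="checkout", status="500"}[5m])
)
```

And the pre-Loki log step looked something like this (the namespace and `app` label are made up for the example):

```bash
# Pull the last hour of logs from every pod of the hypothetical checkout service
# and grep for the errors seen in the dashboard. With a label selector,
# kubectl defaults to the last 10 lines per pod, so --tail=-1 asks for everything.
kubectl logs --namespace prod -l app=checkout --since=1h --tail=-1 | grep "500"
```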
This works great for errors, but sometimes I get paged because of high latency. In that situation, traces give me more information about what is slow: which method, operation, or function. We use Jaeger to get the traces.
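Before jumping into Jaeger, the latency panel I'm usually staring at is backed by something like the query below. Again, just a sketch: `http_request_duration_seconds_bucket` and the `handler` label are assumed names based on common Prometheus client-library conventions.

```promql
# 99th percentile request latency per handler over the last 5 minutes.
# If one handler's p99 spikes, that's the operation to go look for in Jaeger.
histogram_quantile(0.99,
  sum by (le, handler) (
    rate(http_request_duration_seconds_bucket{job="checkout"}[5m])
  )
)
```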
Source: grafana.com