Troubleshooting Database Issues with Lightrun

Eran Kinsbruner
6 min read · Jun 2, 2023

Databases are fundamental components of many applications. They persist data reliably so other components can remain stateless for better scalability, while still providing fast access to data when needed. It is therefore crucial to keep the database performant and resolve issues in a timely manner to meet SLAs and keep the entire application stack humming.

There are several common database-related issues that can arise, including:

  1. Connectivity issues: clients may not be able to connect to the database due to wrong credentials, IAM permissions, or private network restrictions.
  2. Performance issues: databases may degrade over time due to load or suboptimal design, leading to slow query times or timeouts.
  3. Data inconsistency: downstream components may encounter issues if the data is in a format that is incompatible with their systems (e.g., null values, wrong types, etc.).

A recent article that addressed debugging performance issues in databases within mobile applications also revealed many of the above challenges, among others.

Oftentimes, a comprehensive investigation is needed to identify database issues and re-architect the system to resolve them. In this article, we will dive into some common challenges in troubleshooting database-related issues, as well as the shortcomings of traditional tooling in combating them. Then we will look at how developers can use Lightrun’s dynamic instrumentation to make this process easier.

Challenges in Troubleshooting Database Issues

Troubleshooting database issues is often challenging, since application developers are usually not experts in administering and operating databases. It takes in-depth knowledge to operate the different types of databases (e.g., relational, NoSQL, time-series, graph) and architect the data formats for peak performance. For example, choosing the right index significantly determines the performance of relational queries. Likewise, limiting the degree of cardinality through table design can impact query performance in other databases.
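To make the indexing point concrete, here is a small Postgres example with a hypothetical `orders` table (invented for illustration). Adding an index on a frequently filtered column can turn a sequential scan into an index scan:

```sql
-- Hypothetical table whose hot queries filter by customer_id
CREATE INDEX idx_orders_customer_id ON orders (customer_id);

-- EXPLAIN shows whether the planner now uses the index
-- instead of a sequential scan over the whole table
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
```

Choosing which columns to index (and which to leave alone, since indexes slow down writes) is exactly the kind of expertise application developers often lack.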

Besides the challenges of simply operating the database, developers usually struggle with debugging database issues as the database they deal with might be different locally versus in production. For example, locally, an ephemeral in-memory database or a small dockerized database may be used. For teams not using Docker, there may even be a mocked database layer for running component tests. Then in production, a managed database (e.g., AWS RDS, DynamoDB) may be used, changing how client libraries must connect and authenticate. Lastly, the scale and complexity of the data in production is hard to replicate across other environments.

For most popular databases, using the native SDK solves some of these issues. However, catching these issues before deployment is often hard. Also, there could be changes in the infrastructure (e.g., database upgrade, rotation of authentication secrets, IAM changes) that may be opaque to the developer trying to debug for the first time.

Shortcomings of Standard Tooling

Most standard observability tooling focuses on monitoring the database and relying on existing logs to help debug database issues. While having metrics on the database is nice for catching CPU/memory spikes and slow queries, it only surfaces top-level symptoms. It is then up to the engineer to try to fix the problem without further affecting production workloads that are already impacted. One common way to do this is to export the database, or work with a read-only replica, and run test queries against it. Once the issue is found, a new version of the server or client software must be redeployed to verify the fix. Keep in mind that doing so in a production environment is extremely hard.

Also, when it comes to network- or config-related issues, it’s really hard to surface them without getting shell access to the underlying services. Most of the time, the errors are due to invalid IAM credentials or TLS settings. Since this is sensitive information, it is not logged, leaving the engineer to either look over Terraform configs or exec into the server and inspect the environment variables or loaded secret values. Neither of these avenues is ideal, as they require the engineer to spend more time either finding where those values are set or getting access to the production servers via break-glass procedures, which can take time.

Dynamic Instrumentation with Lightrun

Lightrun, on the other hand, provides a safe way for developers to dynamically debug database issues. First, with Lightrun Snapshots, developers can set virtual breakpoints from their IDEs and get read-only access to any code-level object. This can quickly help them rule out misconfigurations or authentication issues without having to wait for access or fetch secret values from a secret store.

Next, by adding dynamic logs, developers can quickly add debug logs to endpoints or functions that are causing the slow queries. This can help explain the source of the queries, especially if dynamic SQL queries are being constructed.
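To show what “dynamically constructed SQL” means here, consider this small Python sketch (the function name and schema are invented for illustration). The final query text depends on runtime input, so it never appears in the source code — which is why a dynamic log on this function is so useful:

```python
def build_user_query(filters):
    """Build a SELECT over the users table from a dict of column -> value filters.

    The resulting SQL depends on runtime input, so a Lightrun dynamic log
    on this function is an easy way to see exactly what gets executed.
    """
    query = "SELECT id, name, age FROM users"
    params = []
    if filters:
        clauses = []
        for column, value in sorted(filters.items()):
            # %s placeholders keep values parameterized (no SQL injection)
            clauses.append(f"{column} = %s")
            params.append(value)
        query += " WHERE " + " AND ".join(clauses)
    return query, params
```

A dynamic log printing the returned `query` and `params` reveals the exact statements hitting the database, without redeploying.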

Examples

Let’s take a look at some examples of these Lightrun features. We’ll start by locally running a Postgres database with Docker:

```
docker run --name postgres -p 5432:5432 -e POSTGRES_PASSWORD=mysecretpassword -d postgres
```

Let’s also do some setup via `psql`:

```
psql -h localhost -p 5432 -U postgres
```

We will create a table called `users` with id, name, and age columns:
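The exact DDL isn’t shown here, but a minimal version run inside the psql session might look like this (the sample rows are invented so the queries below have something to return):

```sql
CREATE TABLE users (
    id   SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    age  INT
);

-- A few sample rows for testing
INSERT INTO users (name, age) VALUES ('alice', 30), ('bob', 42);
```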

Then to simulate fetching data from Postgres, let’s have a local function that loops over 50 times to select all the rows from our table. In production, this will probably be triggered by some cronjob or external trigger, but to make testing easy we are creating a for-loop here.
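The function itself isn’t listed in the post; a minimal Python sketch using psycopg2 (an assumed client library — the names below are illustrative) could look like:

```python
import logging

logger = logging.getLogger("users-poller")

def fetch_users(cursor, iterations=50):
    """Select all rows from the users table `iterations` times.

    Only errors are logged -- exactly the situation where a Lightrun
    snapshot or dynamic log becomes useful.
    """
    rows = []
    for i in range(iterations):
        try:
            cursor.execute("SELECT id, name, age FROM users")
            rows = cursor.fetchall()
        except Exception:
            logger.exception("query %d failed", i)
    return rows

def main():
    # Connect to the dockerized Postgres started above
    # (pip install psycopg2-binary)
    import psycopg2
    conn = psycopg2.connect(
        host="localhost", port=5432,
        user="postgres", password="mysecretpassword",
    )
    with conn.cursor() as cur:
        print(fetch_users(cur))
    conn.close()
```

Running `main()` against the local container executes the loop end to end.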

Notice that we aren’t logging anything besides errors. At this point, in order for the engineer to start debugging, they would need access to the database or this function. But with Lightrun, we can easily create a virtual snapshot.

We can create a virtual breakpoint on the SQL query:

Once the agent hits that line in the code, we can get a sense of all the objects at that point. We can see the counter variable as well as the response object from the Postgres query.

If we’re interested in Postgres client settings, we can also create a dynamic log:

Then we can see the details in the Lightrun console:

Conclusion

Even though database usage is ubiquitous, debugging remains a challenge for most engineers. Not only does it require in-depth knowledge, but debugging with traditional tooling takes time to gather the context and access needed to really be useful. With Lightrun, developers can quickly create virtual breakpoints (Lightrun Snapshots) or use dynamic logging to get that information on the fly, without redeployment. Shorten the mean time to resolution for tricky database issues by giving Lightrun a try!


Eran Kinsbruner

Global Head of Product Marketing at Lightrun, Best Selling Author, Patent holder, CMMI and FinOps Practitioner Certified (https://lightrun.com)