

Understanding and Solving Performance Issues in Software Applications

A software engineer's role goes beyond just writing code to build a functional application that meets business requirements. In addition to development, engineers must also focus on system operation, monitoring, and applying well-chosen design patterns to enhance maintainability and extensibility.
Among these responsibilities, monitoring the system, detecting performance issues early, and addressing them effectively are crucial tasks.
Performance issues in a software application refer to problems that negatively affect its speed, responsiveness, or efficiency. These issues can manifest in various ways, including:
- Slow Response Time – The application takes too long to respond to user inputs or requests.
- High Resource Consumption – Excessive CPU, memory, disk, or network usage degrades performance.
- Scalability Problems – The application struggles to handle an increasing number of users or requests.
- Concurrency Issues – Poor handling of multiple simultaneous operations leads to bottlenecks or deadlocks.
- Unoptimized Code – Inefficient algorithms, redundant operations, or memory leaks slow down execution.
- Database Bottlenecks – Slow queries, improper indexing, or excessive locking impact performance.
- …
Performance issues can lead to poor user experience, increased operational costs, and system failures, making performance optimization a critical aspect of software development.
This blog will have three sections:
- Monitoring and Detection
  - Use tools like Prometheus, Grafana, and New Relic to track key metrics (latency, errors, resource usage).
  - Leverage logging and tracing (OpenTelemetry, Jaeger) to analyze request flows and pinpoint bottlenecks.
  - Set up automated alerts for anomalies such as high latency or resource exhaustion.
- Resolution Process
  - Diagnose root causes using monitoring data and structured checklists.
  - Apply fixes, including code optimization, query tuning, and scaling strategies.
  - Validate improvements through load testing to prevent regressions.
- Preventative Methods
  - Plan for capacity growth based on usage patterns.
  - Optimize performance with caching, CDNs, and edge computing.
  - Design systems with horizontal scalability for long-term stability.
1. Monitoring and Detection
1.1 Methods
1.1.1 Automated Monitoring
Utilizing tools and scripts to continuously track performance metrics and detect issues in real time.
- Real-Time Alerts: Automated alerts notify teams of threshold breaches, enabling immediate response.
- Examples: Rollbar alerts, Datadog Watchdog, etc.
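To make the instrumentation side concrete, here is a minimal sketch in Go using the prometheus/client_golang library to expose a request-latency histogram that an alert rule can watch. The metric name, buckets, and route are illustrative assumptions, not a convention required by any of the tools above.

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Request-latency histogram; name and buckets are illustrative choices.
var requestDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "Request latency by path.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"path"},
)

// instrument records how long each request to a path takes.
func instrument(path string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next(w, r)
		requestDuration.WithLabelValues(path).Observe(time.Since(start).Seconds())
	}
}

func main() {
	prometheus.MustRegister(requestDuration)
	http.HandleFunc("/orders", instrument("/orders", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	// Prometheus scrapes this endpoint; alert rules are defined on the series.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Once Prometheus scrapes /metrics, a rule such as "p99 latency above a threshold for five minutes" turns the passive series into the real-time alert described above.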
1.1.2 Statistical Analysis
Applying statistical methods to identify performance anomalies, trends, and potential bottlenecks.
- Dashboards: Visualizing key performance metrics for real-time assessment.
  - RDS Dashboard
  - Service Dashboard
    - CPU & Memory Usage
    - Response Time & API Latency
    - Error Rate
1.1.3 Human Evaluation
Incorporating manual assessment by engineers and feedback from users to identify performance issues.
- Developer Assessment: Engineers manually test and analyze system performance, for example by writing performance-testing scripts with K6.
- User Feedback Channels: Establishing reporting channels for users to submit performance-related concerns.
1.2 Performance Metrics
1.2.1 Tools for Benchmarking and Monitoring
- Monitoring Tools: Datadog, AWS CloudWatch, Rollbar
- Benchmarking Tools: JMeter, K6, etc.
1.2.2 Benchmarking Metrics
Select appropriate metrics based on the feature being analyzed. Fill in data size, threshold values, and current performance measurements.
Metric | Threshold | Current Value | Notes |
---|---|---|---|
Response Time | > [X] ms | [Current Value] | - |
Error Rate | > [Y]% | [Current Value] | - |
Throughput (Concurrent Users/Jobs) | N/A | [Current Value] | - |
App CPU & Memory Usage | N/A | [Current Value] | - |
DB CPU & Memory Usage | N/A | [Current Value] | - |
P90 Request Latency | N/A | [Current Value] | - |
P95 Request Latency | N/A | [Current Value] | - |
P99 Request Latency | N/A | [Current Value] | - |
Availability (Uptime %) | > 99.95% | [Current Value] | - |
DB Query Performance | Queries > [Z] ms | [Current Value] | - |
Queue Processing Time | < [W] ms | [Current Value] | - |
Cache Hit Rate | > [N]% | [Current Value] | - |
Network Latency | < [M] ms | [Current Value] | - |
Cold Start Time (for Serverless) | < [K] ms | [Current Value] | - |
Traffic Spikes & Auto-scaling Efficiency | N/A | [Current Value] | - |
Time to First Byte (TTFB) | < [L] ms | [Current Value] | - |
First Contentful Paint (FCP) | < [O] ms | [Current Value] | - |
Largest Contentful Paint (LCP) | < [P] ms | [Current Value] | - |
HTTP 5xx Errors | < [Q]% | [Current Value] | - |
DB Connection Failures | < [R]% | [Current Value] | - |
API Rate Limits / Throttling Issues | < [S]% | [Current Value] | - |
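Dedicated tools such as K6 or JMeter are the right way to fill in this table; as a self-contained illustration of how the P90/P95/P99 rows are derived, the Go sketch below fires requests at an endpoint and reports those percentiles. The URL and request count are placeholders.

```go
package main

import (
	"fmt"
	"net/http"
	"sort"
	"time"
)

// percentile returns the q-th percentile (0-100) of sorted durations.
func percentile(sorted []time.Duration, q float64) time.Duration {
	idx := int(float64(len(sorted)-1) * q / 100)
	return sorted[idx]
}

func main() {
	const n = 200                         // request count: placeholder
	url := "http://localhost:8080/orders" // endpoint: placeholder

	latencies := make([]time.Duration, 0, n)
	for i := 0; i < n; i++ {
		start := time.Now()
		resp, err := http.Get(url)
		if err != nil {
			continue // a real benchmark would count errors separately
		}
		resp.Body.Close()
		latencies = append(latencies, time.Since(start))
	}

	if len(latencies) == 0 {
		fmt.Println("no successful requests")
		return
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	for _, q := range []float64{90, 95, 99} {
		fmt.Printf("P%.0f: %v\n", q, percentile(latencies, q))
	}
}
```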
2. Resolution Process
This guide provides a structured approach to identifying and analyzing performance issues across different layers of a system, including databases, applications, and infrastructure. It helps diagnose and resolve bottlenecks to improve efficiency and scalability. Identifying the type of issue is essential for applying the appropriate analysis and resolution method.
2.1 Tools for Investigation
2.1.1 Application
- Enable APM Profiling in Datadog to gather more detailed performance insights. Datadog Profiler Guide
- Use distributed tracing and add more spans to your API to identify slow steps. Trace Context Propagation Guide
- Monitor system resources (CPU, memory usage) of both the application and database to identify peak-time performance issues. Sample Monitoring Dashboard
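For the tracing bullet, wrapping each suspect step in its own span is what makes slow-step analysis possible. A minimal sketch using the OpenTelemetry Go API (exporter and SDK setup omitted; the service and function names are hypothetical):

```go
package handler

import (
	"context"

	"go.opentelemetry.io/otel"
)

// tracer is normally created once per component; the name is illustrative.
var tracer = otel.Tracer("orders-service")

// fetchReport wraps each slow candidate step in its own span so the
// trace view shows exactly where the time goes.
func fetchReport(ctx context.Context, id string) error {
	ctx, span := tracer.Start(ctx, "fetchReport")
	defer span.End()

	if err := loadRows(ctx, id); err != nil {
		return err
	}
	return renderPDF(ctx, id)
}

func loadRows(ctx context.Context, id string) error {
	_, span := tracer.Start(ctx, "loadRows")
	defer span.End()
	// ... query the database ...
	return nil
}

func renderPDF(ctx context.Context, id string) error {
	_, span := tracer.Start(ctx, "renderPDF")
	defer span.End()
	// ... CPU-heavy rendering ...
	return nil
}
```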
Performance Debugging Tools
- For Golang: pprof for CPU, memory, and goroutine profiling (a minimal enablement sketch follows below), plus the execution tracer (go tool trace).
- For Java: JDK Flight Recorder / Java Mission Control, async-profiler, or VisualVM.
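Enabling pprof in a Go service requires only the standard library. A minimal sketch; the side port 6060 is a conventional but arbitrary choice:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on DefaultServeMux
)

func main() {
	// Serve pprof on a separate, non-public port.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... application setup continues here ...
	select {} // placeholder to keep this sketch alive
}
```

Profiles are then collected with go tool pprof http://localhost:6060/debug/pprof/profile (CPU) or the /debug/pprof/heap endpoint (memory).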
2.1.2 Database
- Analyze slow queries with EXPLAIN → EXPLAIN Query Guide
- Check database engine details → SHOW ENGINE Statement
- Monitor database deadlocks → Deadlock Monitoring Dashboard
2.2 Checklist for Identifying Root Causes & Suggested Solutions
2.2.1 Slow Queries in DB
Checklist to Find Root Cause | Suggested Solutions |
---|---|
Analyze the query execution plan (EXPLAIN) | Create or optimize indexes |
Check for missing indexes (primary keys, foreign keys, etc.) | Refactor queries (use JOIN instead of subqueries, fetch only necessary data) |
Identify table scans (avoid full table scans) | Optimize table structure (partition tables, use appropriate data types) |
Monitor high I/O operations (disk reads/writes) | Batch load & insert data |
Review SELECT * queries (fetch only required fields) | Implement caching strategies (Redis, Memcached, etc.) |
Check for excessive JOINs/subqueries | Tune database configuration (max_connections, buffer_pool_size, etc.) |
Identify functions in WHERE clauses that disable indexes (LOWER(), CAST(), etc.) | - |
Analyze query performance using replicas | Use read replicas (distribute read/write operations) |
Monitor database locks (deadlocks, lock contention) | Enable database monitoring (Prometheus, Grafana, Datadog) |
Identify long-running transactions | Optimize transaction handling (commit early, split large transactions) |
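To act on the first checklist item, the execution plan can also be captured from application code during an investigation. A minimal sketch, assuming MySQL (where EXPLAIN FORMAT=JSON returns the plan as a single JSON value); the orders query in the usage comment is hypothetical:

```go
package dbinvest

import (
	"context"
	"database/sql"
	"log"
)

// explainQuery logs the MySQL execution plan for a query, making full
// table scans and missing indexes visible before the query ships.
func explainQuery(ctx context.Context, db *sql.DB, query string, args ...any) error {
	var plan string
	err := db.QueryRowContext(ctx, "EXPLAIN FORMAT=JSON "+query, args...).Scan(&plan)
	if err != nil {
		return err
	}
	log.Println(plan)
	return nil
}

// Usage (hypothetical query whose WHERE column may lack an index):
// err := explainQuery(ctx, db, "SELECT id, status FROM orders WHERE customer_id = ?", 42)
```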
2.2.2 Deadlocks & Lock Wait Timeouts
Checklist to Find Root Cause | Suggested Solutions |
---|---|
Identify deadlocked queries (SHOW ENGINE INNODB STATUS) | Use shorter transactions (commit early, split transactions, use primary keys) |
Analyze lock contention and blocked queries | Use proper indexing to reduce locked rows |
Check for unnecessary pessimistic locks | Use SELECT FOR UPDATE only when necessary, consider optimistic locking |
Review isolation levels (READ COMMITTED, REPEATABLE READ, etc.) | Lower isolation level if possible (e.g., READ COMMITTED instead of SERIALIZABLE) |
Monitor lock wait times and blocked processes | Optimize query execution time (avoid long-held locks) |
Analyze large transactions | Break large transactions into smaller ones |
Check INSERT query performance under contention | Use database partitioning to reduce contention |
Check UPDATE/DELETE queries for poor conditions (locking too many rows) | Use primary key or unique indexes to minimize locked rows |
Evaluate auto-incremented primary keys (bottleneck in inserts) | Consider using ULID |
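A mitigation that complements the shorter-transaction advice is retrying a transaction when the database aborts it as a deadlock victim. A minimal sketch, assuming MySQL via the github.com/go-sql-driver/mysql driver, whose error 1213 (ER_LOCK_DEADLOCK) signals a retryable abort:

```go
package dbinvest

import (
	"context"
	"database/sql"
	"errors"

	"github.com/go-sql-driver/mysql"
)

// withDeadlockRetry runs fn in a transaction and retries a few times if
// MySQL picks it as a deadlock victim. Keeping fn short and touching rows
// in a consistent order reduces deadlocks in the first place; the retry
// is a safety net, not a fix.
func withDeadlockRetry(ctx context.Context, db *sql.DB, fn func(*sql.Tx) error) error {
	const maxAttempts = 3
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		err = runTx(ctx, db, fn)
		var myErr *mysql.MySQLError
		if errors.As(err, &myErr) && myErr.Number == 1213 {
			continue // deadlock victim: safe to retry the whole transaction
		}
		return err
	}
	return err
}

func runTx(ctx context.Context, db *sql.DB, fn func(*sql.Tx) error) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	if err := fn(tx); err != nil {
		tx.Rollback()
		return err
	}
	return tx.Commit()
}
```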
2.3 Application Performance Issues
2.3.1 Memory & CPU Usage
Checklist to Find Root Cause | Suggested Solutions |
---|---|
Monitor memory usage trends | Optimize memory allocation (use efficient data structures, object pooling) |
Identify high CPU-consuming processes | Optimize CPU-bound tasks (refactor algorithms, parallelize tasks) |
Detect memory leaks (pprof for Go) | Use memory profiling tools |
Analyze excessive garbage collection (GC logs) | Tune garbage collection (adjust GC thresholds) |
Identify inefficient loops or busy-waiting | Implement timeouts or backoff strategies |
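As one concrete form of the object-pooling suggestion above, Go's sync.Pool lets a hot path reuse scratch buffers instead of allocating per request, cutting garbage-collection pressure. A minimal sketch; the rendering step is a stand-in for real work:

```go
package mem

import (
	"bytes"
	"sync"
)

// bufPool recycles scratch buffers across requests so the garbage
// collector has fewer short-lived allocations to clean up.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func render(payload []byte) []byte {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset() // must reset before returning to the pool
		bufPool.Put(buf)
	}()

	buf.Write(payload)
	// ... additional serialization work ...
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes()) // copy out: the buffer is reused after Put
	return out
}
```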
2.3.2 Slow API Response Time
Checklist to Find Root Cause | Suggested Solutions |
---|---|
Identify slow API endpoints | Refactor inefficient logic, optimize database queries |
Detect slow database queries | Optimize queries, use indexes |
Examine large response payloads | Reduce response size (pagination, compression) |
Identify synchronous blocking operations | Use async processing (move long-running tasks to background) |
Monitor third-party API dependencies | Cache API responses (Redis, in-memory cache, cache promises if needed) |
Analyze peak request times | Implement auto-scaling |
Check load balancing performance | Optimize request distribution |
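The third-party row's "cache promises if needed" can be expressed in Go with golang.org/x/sync/singleflight, which collapses concurrent identical lookups into a single in-flight call. A minimal sketch; the unbounded, TTL-free map is a simplification, and fetch stands in for the real API client:

```go
package apicache

import (
	"sync"

	"golang.org/x/sync/singleflight"
)

var (
	group singleflight.Group
	mu    sync.RWMutex
	cache = map[string]string{} // unbounded and TTL-free: a sketch only
)

// getRate returns a cached value when present; otherwise all concurrent
// callers for the same key share a single call to fetch.
func getRate(key string, fetch func() (string, error)) (string, error) {
	mu.RLock()
	v, ok := cache[key]
	mu.RUnlock()
	if ok {
		return v, nil
	}
	res, err, _ := group.Do(key, func() (any, error) {
		v, err := fetch()
		if err != nil {
			return "", err
		}
		mu.Lock()
		cache[key] = v
		mu.Unlock()
		return v, nil
	})
	if err != nil {
		return "", err
	}
	return res.(string), nil
}
```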
2.4 Scaling for Concurrent Users & Multi-Tenant Systems
Checklist to Find Root Cause | Suggested Solutions |
---|---|
Monitor current user connections | Use connection pooling |
Check for resource contention (CPU, memory) | Optimize resource allocation |
Identify overwhelmed API/DB connections | Scale horizontally (add more servers, increase DB connections) |
Evaluate system response to sudden spikes | Implement auto-scaling |
Monitor slow/blocking operations | Optimize caching, async processing |
Track number of concurrent users | Plan system capacity scaling |
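For the connection-pooling row, Go's database/sql pool exposes three tuning knobs. The values below are illustrative only; derive real ones from the database's max_connections divided across application instances:

```go
package scale

import (
	"database/sql"
	"time"
)

// tunePool applies illustrative pool limits; derive real values from the
// database's max_connections divided across application instances, with
// headroom left for maintenance connections.
func tunePool(db *sql.DB) {
	db.SetMaxOpenConns(50)                  // hard cap on concurrent connections
	db.SetMaxIdleConns(25)                  // idle connections kept warm for reuse
	db.SetConnMaxLifetime(30 * time.Minute) // recycle before LB/DB-side idle timeouts
}
```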
3. Preventative Methods
3.1 System Design
Best Practices / Patterns | Description |
---|---|
Microservices Architecture | Break down the application into smaller, independent services that can scale individually, improving performance and maintainability. |
Load Balancing | Distribute incoming traffic evenly across multiple servers to avoid overloading any single instance, ensuring better system responsiveness. |
Caching (Client, Server, CDN) | Store frequently accessed data closer to the user (client-side or edge with CDN) to reduce latency and load on backend services. |
Database Sharding | Split large databases into smaller, manageable pieces (shards) to reduce query load and improve query performance across distributed systems. |
Event-Driven Architecture | Use asynchronous messaging (e.g., Kafka, RabbitMQ) to decouple services, which can reduce response time and avoid blocking operations. |
CQRS (Command Query Responsibility Segregation) | Separate reading and writing operations to optimize query performance and ensure write-heavy operations don’t affect read performance. |
Auto-scaling & Elasticity | Automatically scale resources up or down based on load, ensuring your system handles traffic spikes efficiently. |
API Rate Limiting | Prevent API abuse and overloading by limiting the number of requests a user can make within a specific time frame. |
Failure Isolation (Circuit Breaker Pattern) | Isolate failing services to prevent them from impacting the rest of the system, allowing graceful degradation during failures. |
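Of these patterns, API rate limiting is compact enough to sketch inline. A minimal per-client token bucket using golang.org/x/time/rate; the limits and the choice of remote address as the client key are illustrative assumptions (an API key or user ID is more robust in practice):

```go
package ratelimit

import (
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

var (
	mu       sync.Mutex
	limiters = map[string]*rate.Limiter{}
)

// limiterFor returns the token bucket for a client key, creating it on
// first sight. This map grows unbounded; a real service would evict.
func limiterFor(key string) *rate.Limiter {
	mu.Lock()
	defer mu.Unlock()
	l, ok := limiters[key]
	if !ok {
		l = rate.NewLimiter(rate.Limit(10), 20) // 10 req/s, bursts of 20: illustrative
		limiters[key] = l
	}
	return l
}

// rateLimit rejects requests that exceed the client's bucket with HTTP 429.
func rateLimit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiterFor(r.RemoteAddr).Allow() {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```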
3.2 Measuring and Improving System Performance
Action / Best Practice | Description |
---|---|
Establish a Performance Baseline | Measure current system performance using profiling tools (e.g., New Relic, Datadog, Prometheus) to understand the starting point for improvements. |
Identify Performance Bottlenecks | Use tools like APM (Application Performance Management) and load testing tools (e.g., JMeter, Gatling) to identify slow endpoints, database queries, or other bottlenecks. |
Set Performance KPIs | Define clear KPIs (e.g., response time, throughput, availability) to track performance improvement over time. |
Perform Load and Stress Testing | Simulate heavy traffic to assess how the system behaves under load and identify areas that may fail under high demand. |
Optimize Critical Bottlenecks | Prioritize optimizing the most impactful components (e.g., slow database queries, inefficient algorithms). |
Plan for Scaling | Based on current performance, design an auto-scaling or horizontal scaling plan to ensure the system can handle future traffic loads. |
Continuous Monitoring & Alerts | Set up monitoring systems to track performance in real-time, enabling proactive action when performance degrades. |
Implement Caching Strategy | Review data access patterns and identify opportunities to implement caching (e.g., Redis, Memcached) to reduce load on backend systems. |
Optimize Database Queries | Analyze query performance with tools like EXPLAIN and SQL Profiler to identify and optimize slow queries. |
Continuous Improvement | Performance tuning is an iterative process—establish a feedback loop to refine and enhance system performance continuously. |
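For the baseline row, Go's built-in benchmarks are a cheap way to pin a number to a hot path before touching it. A minimal sketch; processOrder and its order type are hypothetical stand-ins for the code under test:

```go
package orders

import "testing"

// Hypothetical code under test: substitute your real hot path.
type order struct{ items int }

var sampleOrder = order{items: 3}

func processOrder(o order) int { return o.items * 2 }

// BenchmarkProcessOrder pins down a baseline before any optimization.
// Run with: go test -bench=ProcessOrder -benchmem
func BenchmarkProcessOrder(b *testing.B) {
	for i := 0; i < b.N; i++ {
		processOrder(sampleOrder)
	}
}
```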
4. Key Takeaways
- Think Global, Fix Local: Adopt a "think global, fix local" approach when solving performance issues. This means considering the broader, system-wide impact of changes, but focusing on solving specific, localized problems first. By addressing issues in smaller, more manageable areas, you can minimize risk and deliver performance improvements efficiently.
- Identify Performance Bottlenecks Early: Use profiling tools and monitoring systems to pinpoint where performance issues occur, whether in CPU usage, memory leaks, or database queries.
- Optimize Algorithms and Data Structures: A slow algorithm or inefficient data structure can significantly affect application performance. Reviewing and optimizing code can yield substantial improvements.
- Database Optimization: Slow database queries, inefficient joins, and poor indexing are common performance killers. Regularly review database queries, use indexing wisely, and consider query caching where appropriate.
- Load Testing: Conduct load testing to simulate real-world usage and identify how your application behaves under heavy traffic. This can help in determining capacity limits and areas that need optimization.
- Asynchronous Processing: Offload heavy or time-consuming tasks to background processes using asynchronous patterns to ensure the main thread remains responsive.
- Caching: Implement caching strategies (e.g., in-memory caches) to reduce redundant operations, such as database queries or expensive computations, improving response times.
- Avoid Premature Optimization: Focus on solving the actual bottlenecks first. Premature optimization can lead to unnecessary complexity without improving performance.
- Resource Management: Efficiently manage system resources, like memory and network bandwidth, and use them only when needed to avoid overloading the system.
- Profiling and Benchmarking Tools: Use tools like profilers, debuggers, and A/B testing to continuously monitor performance, enabling quick identification and resolution of issues.
- Scale Horizontally or Vertically: If necessary, consider scaling your application to handle higher loads—either by scaling vertically (upgrading hardware) or horizontally (adding more machines).
- Continuous Improvement: Performance optimization is an ongoing process. Regularly monitor performance and tweak the application as it evolves and grows.
Like & follow MFV's channels for the latest blogs, best practices, and career stories from our Forwardians:
Facebook: https://www.facebook.com/moneyforward.vn
Linkedin: https://www.linkedin.com/company/money-forward-vietnam/
Youtube: https://www.youtube.com/channel/UCtIsKEVyMceskd0YjCcfvPg
