I'm working on a project that involves a large number of concurrent file operations on a Linux system (Ubuntu 24.04). The application is multithreaded and handles thousands of file transfers simultaneously. Occasionally we hit performance bottlenecks and errors caused by file descriptor limits (e.g., "too many open files").
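For context, the failure mode looks like the EMFILE condition below. This is a stripped-down, hypothetical reproducer, not our real transfer code; the thread count, batch size, and use of temp files are purely illustrative:

```python
# Hypothetical reproducer: hold open file handles across several threads
# until the per-process RLIMIT_NOFILE is exhausted (errno EMFILE).
import errno
import tempfile
import threading

held = []                    # keep references so descriptors stay open
lock = threading.Lock()

def hold_files(n):
    for _ in range(n):
        try:
            f = tempfile.TemporaryFile()
        except OSError as e:
            if e.errno == errno.EMFILE:   # "Too many open files"
                print("hit RLIMIT_NOFILE in", threading.current_thread().name)
                return
            raise
        with lock:
            held.append(f)

threads = [threading.Thread(target=hold_files, args=(500,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("descriptors held before failure:", len(held))
```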
Here’s what I've tried so far:
- Increased ulimit and adjusted fs.file-max through sysctl.
- Ensured proper locking mechanisms to prevent race conditions.
- Monitored disk I/O with iotop and traced syscalls with strace.
- Used systemd settings to increase limits for the specific process (see the sketch after this list).
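For reference, here is a minimal sketch of the kind of in-process check we pair with those settings. It assumes a Python process (the application language isn't really relevant) and an illustrative target of 65536 descriptors, which is an assumption rather than a recommendation:

```python
# Minimal sketch: inspect and raise the per-process descriptor limit,
# then count descriptors currently open by this process.
import os
import resource

# Current soft/hard limit on open file descriptors for this process
# (what `ulimit -n` reports, and what systemd's LimitNOFILE= controls).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE soft={soft} hard={hard}")

# Raise the soft limit toward the hard limit; raising the hard limit itself
# requires privileges (or the systemd/sysctl changes above).
target = min(hard, 65536)   # 65536 is an illustrative assumption
if soft < target:
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))

# Count descriptors open right now via /proc (Linux-specific), which helps
# spot descriptor leaks well before the limit is actually hit.
open_fds = len(os.listdir(f"/proc/{os.getpid()}/fd"))
print(f"open descriptors: {open_fds} / {resource.getrlimit(resource.RLIMIT_NOFILE)[0]}")
```

The in-process check mainly confirms that the ulimit, sysctl, and systemd changes actually took effect for the running service, since limits applied to a shell don't necessarily propagate to a daemonized process.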
Despite these efforts, we're still seeing occasional failures and reduced performance under high load.
What are some advanced techniques or best practices for debugging and optimizing such scenarios? Are there specific tools, system configurations, or architectural patterns that can help?
I'm especially interested in tips for systems that need to scale efficiently with minimal downtime.