Stack Overflow is the largest online community of developers, and the performance and reliability of our SQL Server databases are crucial to serving over 1.3 billion page views each month, each rendered in roughly 10-20 milliseconds. The Site Reliability Engineering (SRE) team is responsible for designing and maintaining our infrastructure (SQL Server, Redis, IIS, HAProxy, Elasticsearch, Fastly CDN) and for getting the most performance out of a minimal amount of physical hardware. We try to design all of our solutions to be as simple as possible, and we believe that troubleshooting should be a first-class feature of any critical system.
This high-level session will show how we use Availability Groups (AGs) to scale out the SQL workload and meet our current HA/DR needs, as well as our future plans to use Distributed AGs to solve a pesky issue called “the speed of light”. It will also show the monitoring systems we use, namely Opserver and Bosun (both open source!), to quickly identify production SQL issues and ensure that performance is a top-priority feature.
I will also outline some of our key development principles, such as collecting exceptions in a central location, the benefits of using a micro-ORM like Dapper, and the pros and cons of using the new JSON T-SQL functions. I will describe our experience finding the best tool for a given job (such as database migrations), where single-purpose tools often beat all-encompassing frameworks. Keep in mind that this information is heavily weighted toward the needs of a large .NET-based web property, so YMMV. My hope is that you will leave with a few new tricks and a better understanding of how far you can go with a simple SQL Server architecture.