I was recently contacted by Amazon Web Services to provide feedback on their relational database-as-a-service (RDS). I took the opportunity to outline my usage and suggestions for improvements. As there is so much “black magic” behind the scenes at AWS, I was really happy to receive an in-depth response from an executive there (shown below my response).
Overview of My RDS Usage
I manage the infrastructure for a consumer website offering games, trivia, surveys and other content. We have hundreds of thousands of users and rely on RDS for all our relational DB needs.
- Full stack running on EC2
- One master DB and one read-replica
- 100-200GB of data
- 350GB disk allotted (due to rumors on StackOverflow that 300GB+ volumes are striped)
- Not utilizing Multi-AZ, due to the large cost increase, and because I’m not sure I can trust that it will work. The last major outage, from the storm that hit Virginia, showed that even Multi-AZ doesn’t help if none of the infrastructure is available.
- Memcached (via ElastiCache) utilized
- MongoDB used for unstructured data requiring heavy writes
Suggestions for Service Improvements and New Features
- Direct URL to Cloudwatch graphs with public access. This would allow me to distribute bookmarks to my developers and other staff without giving them access to AWS. There is no private data shown on the graphs, so it should be quite safe to do.
- Access to bin logs for forensic dives into issues, e.g. how was all data dropped from every table? We recently had an outage where our entire database was wiped. We had no log of queries, and therefore couldn’t determine whether it was a bug in our code or a failure in MySQL that corrupted the data. I had to do a “restore to point in time”, which took 2.5 hours to complete. During this time I had no idea how long it would take to come back online, or whether the database would simply empty again.
- Feedback in the GUI or API as to how far along (%) a DB restore is. Currently there is no information, and I can never give anyone an ETA on when service might return.
- When creating a database, could you explicitly tell me at what disk size I will get higher performance from striping? I read on StackOverflow that it is enabled on disks larger than 300GB.
- I have set up a DB using the new PIOPS environment and began testing the load time. It seems that even with the largest instances available and best practices, it will still mean at least 5 hours of downtime for us. For this reason we’ll likely have to wait to take advantage of it. Do you have any idea when the ability to boot standard snapshots into PIOPS will become available?
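As an aside, the “restore to point in time” operation mentioned above can be scripted rather than clicked through the console. A minimal sketch using the modern boto3 library (which postdates this post); the instance identifiers and timestamp here are hypothetical, and the actual API call is shown commented out since it requires live AWS credentials:

```python
from datetime import datetime, timezone

def build_restore_params(source_id, target_id, restore_time):
    """Build the request parameters for an RDS point-in-time restore."""
    return {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,
        "RestoreTime": restore_time,
    }

params = build_restore_params(
    "prod-db",           # hypothetical source instance name
    "prod-db-restored",  # hypothetical target instance name
    datetime(2012, 10, 1, 4, 30, tzinfo=timezone.utc),
)

# With credentials configured, the actual restore would be:
#   import boto3
#   boto3.client("rds").restore_db_instance_to_point_in_time(**params)
print(params["TargetDBInstanceIdentifier"])
```

Note that the restore always creates a new instance; the application still has to be repointed at it once the restore completes, which is part of why the downtime window is so hard to estimate.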
Response from AWS
October 3rd 2012
I’ve passed this doc to our development team and documented these requests into a requirements doc that we maintain for them. Some of the items are already on their radar, so your input will influence their priority. One point you raise (in both your configuration detail and in one question towards the end of the doc) is in regards to scaling storage:
You will realize improvements with RDS throughput by scaling storage as high as 500GB, and this effect starts at a level well under 100GB (ie: striping occurs at a far lower level than 300GB).
The most important factor in realizing this throughput potential is the instance class. Specifically, the following instance classes are considered High I/O instances:
These instances have large network bandwidth available to them, so the upgrade that you mentioned on StackOverflow (to the m2.2xlarge instance) was likely the main reason you saw a leap in throughput. If you stripe your current storage as high as 500GB, this will continue to increase. With provisioned IOPS support for RDS (PIOPS, announced last night), throughput will now scale linearly all the way to 1TB.
With PIOPS, the throughput rate you can expect is currently associated with the amount of allocated storage. For Oracle and MySQL databases, you will realize a very consistent 1,000 IOPS for each 100GB you allocate – resulting in a potential throughput max of 10K IOPS. The (current, temporary) downside is that you will need to unload/load data to migrate an existing app to the PIOPS RDS.
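The scaling rule quoted above is simple enough to express in code. A tiny sketch (a hypothetical helper, not an AWS API) of the 1,000-IOPS-per-100GB relationship, capped at the 1TB/10K-IOPS maximum:

```python
def provisioned_iops(storage_gb):
    """IOPS available under the quoted rule: 1,000 IOPS per 100GB
    of allocated storage, scaling linearly up to 1TB (10,000 IOPS)."""
    capped_gb = min(storage_gb, 1000)  # linear scaling stops at 1TB
    return capped_gb // 100 * 1000

print(provisioned_iops(350))   # the 350GB allotment above -> 3000
print(provisioned_iops(1000))  # the 1TB maximum -> 10000
```

Under this rule, the 350GB allotment described earlier would land at 3,000 IOPS, well short of the 10K ceiling.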
Loading snapshots into PIOPS instances is still a few months away, but the team is committed to delivering this as quickly as possible. We understand the downtime impact and recommend that PIOPS instances be used for testing, benchmarking and new workloads. Existing workloads that need PIOPS are mostly sensitive to downtime, so we don’t anticipate a lot of migration until we can provide a more seamless transition.
Regarding Multi-AZ deployment… we’re constantly improving the back-plane of RDS to ensure that MAZ is failure-proof. Until we’re at 100% protection, however, the work continues – to the point that it often pushes back more visible roadmap features.
My Thoughts on the AWS response
The AWS team is very sharp. The speed at which they iterate their products with customer demand is incredible. Their response to my concerns with the RDS product clearly demonstrates this.
RDS has been a huge success for me. Though there have been a couple of periods of downtime due to EC2 apocalypse-like events, the ability to focus on product development instead of mundane DB/sysadmin tasks is priceless. Even more important is the peace of mind I can have as a sysadmin. Typically, database backup, storage, rotation, testing, and recovery are arduous processes requiring constant attention. Giving up a couple of control knobs for all this automation is absolutely the right decision for any startup.
I’m excited to see what the AWS team comes out with next.