What is a database?
At any given time, companies like Amazon and Twitter are storing a tremendous amount of information (inventory in the case of Amazon, messages and interrelationships in the case of Twitter) and making it available via the web. Further, the information is fairly fluid; every minute, many thousands of requests pour in, each causing information to be accessed, changed, added, or deleted.
Companies like these store their information in databases. A database is a collection of data, often (though not always) organized into tables consisting of rows and columns. There are many important requirements guiding the handling of the data in a company's databases; some of the most important requirements are:
- Correctness. Obviously, the database needs to store and retrieve information correctly in all cases. For example, if I buy a book from Amazon, the purchase shouldn't be charged to someone else's credit card or sent to someone else's address.
- Low latency. The database should be able to handle a request as quickly as possible, so that the web site can be as responsive as possible. If I perform a search on eBay for all auctions with the words "counting crows tickets," I should get a response within a few seconds, even if millions of auctions are in progress and millions of searches are executed per hour.
- Persistence and fault tolerance. All of the data should be stored in permanent storage, typically on some combination of many hard disks. Additionally, even in the event of a power failure, or a failure of a hard disk, the data should remain intact; care must be taken to ensure this.
- Transactional integrity. The database should never be corrupted by a partially completed series of operations in which one operation fails while being processed. The simplest example is a bank database. If I attempt to transfer money from one account to another, the transaction consists of a withdrawal from one account and a deposit into another. If the withdrawal works but the deposit doesn't, it's in my best interest as the customer that the withdrawal should not take effect. Similarly, if the deposit works but the withdrawal doesn't, it's in the bank's best interest that the deposit should not take effect. To solve this problem, a well-designed database will allow you to designate a series of operations as being part of one transaction and, on failure of any of the operations in that series, will automatically roll back all of the operations, as though the transaction had never happened. Only if all of the operations are completed successfully will the transaction be committed, meaning that the changes will be saved to the database permanently and become visible to others.
- Security. It's important to ensure that sensitive information can only be accessed by those with the proper authorization. (Of course, corporations and individuals don't always agree on what information is sensitive and what information isn't.)
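The bank transfer described under transactional integrity can be sketched in a few lines. This is only an illustration, not part of the project: it uses Python's standard `sqlite3` module with an in-memory database, and the `accounts` table, the `make_bank`, `transfer`, and `balances` helpers, and the rule that a deposit into a nonexistent account counts as a failure are all assumptions made up for the example.

```python
import sqlite3


def make_bank():
    """Create a throwaway in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, from_id, to_id, amount):
    """Withdraw from one account and deposit into another as one transaction."""
    # Used as a context manager, the connection commits if the block
    # finishes normally and rolls back if the block raises an exception.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, from_id),
        )
        cur = conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, to_id),
        )
        # Assumed failure rule for this sketch: the deposit "fails" when
        # the destination account does not exist (no row was updated).
        if cur.rowcount != 1:
            raise ValueError("deposit failed: no such account")


def balances(conn):
    """Return the current balances as a dict of account id -> balance."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


conn = make_bank()
transfer(conn, "alice", "bob", 30)         # both updates succeed and are committed
try:
    transfer(conn, "alice", "nobody", 30)  # withdrawal succeeds, deposit fails
except ValueError:
    pass                                   # the withdrawal has been rolled back
print(balances(conn))
```

After the failed second transfer, alice's balance is still 70 rather than 40: the withdrawal was undone along with the rest of the transaction, as though the transfer had never been attempted.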
Naturally, this is a complex set of requirements; implementing such a system is not trivial. Fortunately, most companies that need to store large amounts of data share a similar set of requirements: correctness, low latency, persistence, fault tolerance, transactional integrity, and security are important for everybody! So, rather than every company implementing its own database system, a few companies (and a few open-source efforts) have implemented database management systems (DBMSs). A DBMS does all of the dirty work involved in managing a database: organizing the data into rows and columns; storing the data so that it can be efficiently accessed, updated, added, or removed; and handling persistence, fault tolerance, and transactional integrity requirements, among others. Companies like Oracle, IBM, and Microsoft have built well-known, battle-tested DBMSs, and sell them for sometimes as much as hundreds of thousands of dollars. This was a worthwhile investment on their part, as the core of most web-based businesses includes at least one database.
While there is a great deal of complexity in a DBMS, far more than we're equipped to handle in this course, two requirements that we can address with the tools we've learned are correctness and low latency. For this project, you'll implement a very rudimentary database, capable of storing data in in-memory tables consisting of rows and columns, quickly looking that data up based on a search key, updating the data, and removing it.
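To make those four operations concrete, here is a minimal sketch of an in-memory table. It is not the required design for the project: the `Table` class, its method names, and the choice of Python's built-in `dict` as the index on the search key are all assumptions made for illustration, and the project may call for a different language and data structure.

```python
class Table:
    """A tiny in-memory table whose rows are looked up by one key column."""

    def __init__(self, columns, key_column):
        if key_column not in columns:
            raise ValueError("key column must be one of the table's columns")
        self.columns = tuple(columns)
        self.key_column = key_column
        # Maps search key -> row. A dict is a hash table, so lookups take
        # constant time on average, regardless of how many rows are stored.
        self._rows = {}

    def insert(self, row):
        """Add a row, given as a dict of column name -> value."""
        if set(row) != set(self.columns):
            raise ValueError("row must have exactly the table's columns")
        key = row[self.key_column]
        if key in self._rows:
            raise KeyError(f"duplicate key: {key!r}")
        self._rows[key] = dict(row)  # store a copy, not the caller's dict

    def lookup(self, key):
        """Return a copy of the row with this key, or None if absent."""
        row = self._rows.get(key)
        return dict(row) if row is not None else None

    def update(self, key, **changes):
        """Change non-key columns of an existing row."""
        row = self._rows[key]  # raises KeyError if the key is absent
        # Validate every change before applying any of them.
        for column in changes:
            if column not in self.columns or column == self.key_column:
                raise ValueError(f"cannot update column: {column!r}")
        row.update(changes)

    def remove(self, key):
        """Delete the row with this key; return True if one was removed."""
        return self._rows.pop(key, None) is not None


inventory = Table(["sku", "title", "quantity"], key_column="sku")
inventory.insert({"sku": "B001", "title": "Dune", "quantity": 3})
inventory.update("B001", quantity=2)
print(inventory.lookup("B001"))
inventory.remove("B001")
```

Validating rows on insert and checking every change before applying an update are small steps toward the correctness requirement, while indexing rows by their search key addresses low latency.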