In corporate IT, databases have become synonymous with such products as Oracle, MySQL, and DB2 — all so-called relational databases that use a variant on standard query language, or SQL. However, if you remember what you might have learned in an information technology degree program, relational databases are not the only type. Depending on what you do, other types of databases might be better choices. With them come different approaches to handling, managing, and storing data.
For example, lower level programming often requires developers to create databases. You might store data in an array or in a structure like a b-tree for speed of access and sorting. It’s possible to write such data out to a file for persistent access between times of running the application. Although you could use a relational database through an application programming interface (API), that would add significant performance overhead and might require additional license fees.
There are times that linking to an RDBMS from some code will make sense, but quite often there are better ways to approach the problem. There is also a growing body of NoSQL databases, as Andrew Oliver points out in InfoWorld, that are particular helpful in big data analysis and use. Big data, some of which may exist in an RDBMS, frequently lacks the structure that is a must in relational databases. To use an RDMBS in big data applications, or in anything that uses a variety of data types that you may not be able to define in advance, can mean massive work in cleaning and formatting the data just to import it into the RDBMS. In addition, there are also tradeoffs that you make, whether you realize it or not. According to the CAP theorem in computer science theory, it is impossible in any distributed computer system — and that includes distributed big data analysis — to guarantee the following three conditions:
- consistency, where data and operations are undertaken and recognized in an absolute way, so every point in the system sees the same thing;
- availability, where the service is available for use; and
- partition tolerance, meaning that even if some segments of the system fail, the overall system can still respond correctly.
This is the computer science equivalent of the old saying that if you want work done, you can choose any two of speed, quality, and low cost. Depending on the type of data you use and what you want to do with it, a relational database might provide the right set of capabilities, or it might not. Distributed data becomes much harder to manipulate because an RDBMS depends on centralized control of the data. And different NoSQL databases offer various mixes of capabilities, and so might lend themselves to certain implementation types. And even if you have an application that lends itself to relational databases, do you need a full-blown Oracle, DB2, and their equivalents? Would an open source product be a better choice? Or maybe you need something simpler, like Microsoft Access. Before starting that next database application, take some time to be sure you’re using the right database tool for the job in question.