Thursday, October 29, 2015

The DBA’s guide to the [new] galaxy: Part II

| Part 2: Adapt to the change

In the previous post we talked about how the database world has changed and keeps changing. With this in mind, let’s continue the discussion.

Knowledge & Expertise
Let’s take me for example -
I’ve been working with SQL Server for many years (and still am): loving this platform, developing tools that make it more fun to work with, staying active in the great SQL community and making sure to always be up to date with the latest builds.

For many years, I’ve spent a lot of free time extending my knowledge of SQL Server, while some of it came as part of my day-to-day roles.
Just like with any other technology, the knowledge and expertise one can gather (in SQL Server, for example) is almost endless - given enough time dedicated to it.

But when you look at this from 30,000ft you come to wonder -
At which point does knowing the really deep-dive material (like memorizing trace flags, knowing the internals of resource allocation and lock acquisition, mastering the in-depth tricks to force the server to build a specific execution plan, and the list goes on…) become a classic case of diminishing returns?

I argue that in most cases, knowledge gained beyond the orange marker is rarely required in practice.
Of course, given unlimited learning time and resources, go ahead and acquire all possible knowledge, but when this is not the case - maybe this time should be spent learning new things?
Instead of a DBA, why not become a multicultural DBA?

(And just to be clear - getting to the orange marker takes many years of experience!)

Getting out of your comfort zone

So to rephrase, what I’m trying to say is the following:
Given the data stores available today, and given the limited amount of learning time an average full-time employee has (say, per week) -
There is a higher probability you will need to understand how to work with, say, Apache Spark, than to know how to force an execution plan to use parallelism in SQL Server. Again, just an example.
Expanding your knowledge on various data storage/processing/retrieval engines for at least some of the leading new platforms available today is extremely important!

Please read the two lines ^above^ again. This is pretty much the most important essence of this article.

Being familiar with various data stores, as well as understanding theories like the “CAP Theorem” (below), is important for being able to choose the right platform for your next project.

Where should I begin?

451 Research produced a beautiful map of data platforms:
Besides the actual content, it is apparent that the list is huge! Not only that - it is rapidly growing (consider that the report was created sometime around last year, so the list is bigger today).
It is practically impossible to dive into all of these products, let alone master them.
So my suggested approach is as follows:

  • Understand the concept and the key points of different data stores, as well as where they would fit in a high-level theory (like the CAP theorem above)
    So, assuming you already know ‘relational’ pretty well - focus on non-relational, key-value, document based, distributed frameworks.
  • Choose a leading platform from each area. This can be a popular/trending platform (such as Spark, Hadoop, MongoDB, Redis, Cassandra and others)
    or a framework that is already being used within your organization (which can help you connect the learning material with practical implementation)
  • Practice:
    Reading is fine, but hands-on experience is extremely important. If you are a Windows user you may need to install a Linux VM, as many of the new platforms are Linux-native.
    Some vendors offer a ready-to-go image. Some offer online simulators. Use those!
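
As a small warm-up for the first bullet, here is a minimal sketch of how the same record might look in a relational, a key-value and a document-style model. It uses only Python's standard library; the customer data, the key format "customer:1" and the plain dict/list stand-ins are all made up for illustration and are not any product's API.

```python
import json
import sqlite3

# Hypothetical sample record, invented for this illustration.
customer = {
    "id": 1,
    "name": "Dana",
    "orders": [{"sku": "A-100", "qty": 2}, {"sku": "B-200", "qty": 1}],
}

# 1) Relational: the record is normalized into two tables and joined at query time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (customer_id INTEGER REFERENCES customers(id), sku TEXT, qty INTEGER)"
)
conn.execute("INSERT INTO customers VALUES (?, ?)", (customer["id"], customer["name"]))
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(customer["id"], o["sku"], o["qty"]) for o in customer["orders"]],
)
relational_total = conn.execute(
    "SELECT SUM(o.qty) FROM customers c JOIN orders o ON o.customer_id = c.id WHERE c.id = ?",
    (customer["id"],),
).fetchone()[0]

# 2) Key-value: the whole record is an opaque value fetched by its key.
#    A plain dict stands in for the store; the application decodes the value itself.
kv_store = {"customer:1": json.dumps(customer)}
kv_total = sum(o["qty"] for o in json.loads(kv_store["customer:1"])["orders"])

# 3) Document: the nested structure is kept as-is and can be filtered by field.
#    A list of dicts stands in for a collection.
documents = [customer]
doc_total = sum(
    o["qty"] for doc in documents if doc["name"] == "Dana" for o in doc["orders"]
)

print(relational_total, kv_total, doc_total)
```

All three answer the same question (total quantity ordered by one customer); what differs is where the structure lives: in the schema, in the application, or in the stored document itself.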

Emphasizing the last item: if your natural [os] habitat is Windows, be aware that the vast majority of new systems are not Windows-native.
In fact, it makes sense that open-source services run on a free operating system. It’s time to refresh your Linux skills!

One more thing - Consider your existing knowledge as a great advantage!

For example, SQL (in different variations), as a querying language, is still one of the most popular languages to query data.
You will find a lot of SQL implementations well integrated into newer technologies (Hive and Spark SQL, to name a few).
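
To make the point about SQL carrying over a bit more concrete, here is a minimal sketch. The table name `sales`, its columns and the sample rows are hypothetical, and Python's built-in sqlite3 is used only as a stand-in engine so the example runs anywhere.

```python
import sqlite3

# A plain aggregate query of the kind a SQL Server DBA writes every day.
# Table and column names are hypothetical.
QUERY = """
SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
FROM sales
GROUP BY region
ORDER BY revenue DESC
"""


def run_on_sqlite(rows):
    """Load the sample rows into an in-memory SQLite table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn.execute(QUERY).fetchall()


sample_rows = [("EMEA", 120.0), ("APAC", 80.0), ("EMEA", 30.0)]
result = run_on_sqlite(sample_rows)
print(result)
```

In Spark SQL, a query string like this would typically be passed to `spark.sql(QUERY)` once the data has been registered as a view (for example with `createOrReplaceTempView("sales")`). Dialect details differ between engines, so treat this as an illustration of familiarity rather than a portability guarantee.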
Opening up to learning new data products will probably boost your résumé, but this is only a side effect. The main advantage is that it will help you choose the right data platform(s) for your next project!

More about that in Part 3...

Tuesday, October 20, 2015

The DBA’s guide to the [new] galaxy

| Prologue

I’ve been wanting to write this article for quite a while. Consider this as some sort of a high-level summary of today’s data world, given from my own personal perspective.
Specifically, it is an overview for all of my fellow SQL DBAs (admins, developers, architects, warehouse/BI specialists) who are being exposed to new data platforms getting built around them,
and who quietly wonder how to adjust to these changes and what the right approach should be (I’ll give you a quick tl;dr hint - denial is not the right one!)

Are you ready? Let’s go!

| Part 1: A quick “12-steps” phase to acknowledge the data world is changing

Until not too long ago, a typical organization would hold its entire back-end data stack in one or more relational databases.
To handle the data, that typical organization would hire one or more DBAs responsible for tasks such as storage planning, administration, development, tuning, DR and more.
Also, a typical DBA would usually specialize in one (usually relational) database.

While this is still relevant, reality has forced changes on the traditional data world; here’s a brief summary of what and why:

  • Over the years, the volume of collected data has been growing exponentially!
  • There’s an ever-growing need to collect and store more data, while retaining historical data.
  • The new data is often richer in content and structure, or sometimes intentionally lacks any formal pre-defined structure.
  • Storage prices are constantly dropping, especially for commodity storage, which is becoming a more popular choice than a very high-end server connected to high-end storage.

The need to meet these ever-growing requirements at minimal cost has led to new platforms, services and frameworks being built, whether cloud-based or on-premises.

So, the data world has changed, dramatically!

Let’s look at these changes from a different angle:
  • New products within new startups, as well as within existing organizations, will not necessarily (gently put) choose a highly priced relational data store (not naming names, but you know which ones), unless the business model specifically requires one.
  • There are a *lot* of new technologies, mostly open-source, that have already reached a level of maturity at which many organizations trust them as their production source of record.
  • Often, regardless of pricing, a required solution does not even fit inside a relational model. As a result, unlike a decade ago, you will see fewer and fewer relational databases trying to imitate processes that were never intended to be done inside a relational database. Need some examples? Key-value stores, document-based databases, unstructured data, true scale-out (shared-nothing) architectures, queues, graph data and more.
  • In addition to cutting down licensing costs, scaling out the data (both storage and processing) reduces the hardware cost. A high-end server can easily cost as much as a new house!

Given the above, once an organization is assured that a certain *free technology can handle its data services better (or even just ‘good enough’), it is very likely to choose that technology over its costly rivals.

*This “Free” has its costs, but I’ll get to that later in the article.

Do you see this happening in your organization? If not, it’s just a question of when, not if.
If any of the above is news to you, or if you knew ‘something was going on’ but were gently ignoring it, you may feel a bit of discomfort. (If you do, that is totally fine.)

First, it is important to be aware of what’s out there.
Second, it is also important to keep in mind that relational databases are still very strong and dominant, and will stay that way for many good reasons.
In fact, for structured data with constraints and relationships to other data objects, which needs to be consistent, isolated and transaction-safe, there is no better solution than the relational model.

Let’s have a look at some graphs, shall we?

 (Note: relevant to this article date, so - 2015-ish)

So, breathe in, breathe out! Here are some “perspective” graphs --

The first one, coming from the “DB-engines” website’s ‘popularity and trends’, shows that the commercial relational databases are still well above everyone else.
However, they are merely holding steady, if not slowly declining, while other newer “players” gradually increase.

The second graph is coming from Google Trends:

The trends graph shows the same picture - while the “SQL Server” search term used here is still above typical newer databases & services, the trend is pretty clear.
We can reasonably assume that popularity roughly reflects estimated usage.

OK, so the data world has changed. Now what?

Continued in Part 2