Wednesday, November 18, 2015

The DBA’s guide to the [new] galaxy: Part III

| Part 3: Use the force, Luke!



Choosing the right architecture --

Experience with a wide variety of data stores doesn’t only help with choosing the right architecture; it also gives you the confidence to veto bad designs.
This cuts both ways, as the following (intentionally over-exaggerated) scenarios show -

Scenario 1:
Requirement:
A new service which keeps track of your company’s orders.
Data specs: roughly 100 orders per day are expected, and the information must be immediately consistent. Orders relate to products and customers.
Occasionally, ad-hoc reports are needed, mostly aggregating orders per customer, product or date range.
Selected solution:
the development team decided to save the orders to Redis, then use a queue to move the data into an HDFS store,
out of which map-reduce jobs return the relevant data.

So yes, the dev team got to play with trendy new technologies, and they can now add “NoSQL” and “Big Data” to their résumés, but in fact this is a very inappropriate solution for the problem.
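For contrast, here is what a right-sized solution could look like: one relational schema and a plain aggregate query. This is an illustrative sketch using SQLite as a stand-in; the table and column names are my own, not from any real system.

```python
import sqlite3

# A minimal relational sketch of the orders service. Table and column
# names here are illustrative. At ~100 orders per day, a single
# relational database handles both the OLTP writes and the ad-hoc
# reports with no extra moving parts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products  (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        product_id  INTEGER REFERENCES products(id),
        ordered_at  TEXT
    );
""")
conn.execute("INSERT INTO customers VALUES (1, 'Acme Corp')")
conn.execute("INSERT INTO products VALUES (1, 'Widget', 9.99)")
conn.execute("INSERT INTO orders VALUES (1, 1, 1, '2015-11-18'), (2, 1, 1, '2015-11-18')")

# The occasional ad-hoc report is a plain aggregate query: instantly
# consistent, and no map-reduce jobs required.
rows = conn.execute("""
    SELECT c.name, COUNT(*) AS order_count, SUM(p.price) AS total
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    JOIN products p  ON p.id = o.product_id
    GROUP BY c.name
""").fetchall()
print(rows)
```

The same schema and query translate directly to SQL Server, MySQL or PostgreSQL; the point is that the whole requirement fits comfortably in one relational engine.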


Here’s another example -

Scenario 2:
Requirement:
Track the company app’s site visitor interaction in real-time on a heat map. In addition, show geo-locations trend over time.
Data specs: the app is extremely popular, receiving ~2,500 interactions per second.
Proposed solution:
Use a relational database on a very strong, high-end machine to be able to support the 2,500/sec OLTP rate.
Create additional jobs running every few minutes to aggregate the time frames and perform IP lookups.

Although this might work, it is not the preferred solution either.
In this case a true scale-out architecture will perform better at a lower cost.
A typical ELK stack, for example, could handle this scenario more naturally.
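To make the ELK suggestion concrete, here is a rough sketch of how interaction events could be shipped to Elasticsearch (the “E” in ELK) using its bulk API. The index name and event fields are hypothetical, and a real deployment would typically use Logstash or an official client rather than hand-built payloads.

```python
import json

def to_bulk_payload(events, index="interactions"):
    """Build an Elasticsearch bulk-API body (NDJSON): one action line
    followed by one source line per event. The index name and event
    fields used below are illustrative assumptions."""
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(event))
    return "\n".join(lines) + "\n"  # bulk bodies must end with a newline

# Batching ~2,500 events/sec into periodic bulk requests spreads the
# write load across a scale-out cluster instead of hammering one
# high-end machine.
events = [
    {"ip": "203.0.113.7", "x": 120, "y": 340, "ts": "2015-11-18T10:00:00Z"},
    {"ip": "198.51.100.2", "x": 15, "y": 88, "ts": "2015-11-18T10:00:01Z"},
]
payload = to_bulk_payload(events)
print(payload)
```

Aggregations over time frames and geo trends would then be handled by Elasticsearch queries and visualized in Kibana, rather than by custom jobs running every few minutes.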

Can you see where this is going? Knowledge is what lets you steer the architecture in the right direction.
Different specs require different platform implementations, and in many cases a mix of them.

Inappropriate platform selection can come from many sources, including:

  • Developers who want to experiment with different technologies
  • PMs/POs who lack the detailed knowledge
  • CTOs wishing to brag about using some new technology (preferably while playing golf with other CTOs)
Choosing the right platform for a given solution is extremely important, even when the recommended technology is not the trending one.

| Epilogue

The world of data is constantly changing. The variety of solutions for storing & retrieving data in various shapes and sizes is rapidly growing.
Relational databases are strong, and will stay strong, for storing and retrieving relational data, usually when no scale-out is required. For all the other types, the world has already changed.
If there’s one thing you should take away, it’s this: looking forward, embrace these new technologies, get to know them and understand the architectural position of the dominant ones.


Links to Part 1, Part 2 

Wednesday, November 11, 2015

How does ClipTable work?

A few days ago I was approached by a coworker who had some files that needed to be imported into one of our SQL Servers.
The file format wasn't very intuitive, and it caused her a lot of different errors when she tried the "Import and Export Wizard" (which is a basic SSIS task behind the scenes).

Anyway, after being asked to "give it a shot", I naturally turned to ClipTable and, of course, got it done in less than two minutes.

So this coworker asked me how it was done so fast, and what she was doing wrong.
I explained that I was using a different tool, and told her a bit about ClipTable.

"Yeah, but how was ClipTable able to determine how to turn the file into a table without any configuration?" she asked.

So I answered her, and decided to write this post about some of ClipTable's behind-the-scenes techniques.

If you're new to the tool -
ClipTable is a very fast SQL Import tool, which allows you to turn almost any tabular data into a SQL Server/Oracle/MySQL/PostgreSQL/Firebird Table.

Here is a behind-the-scenes glimpse into ClipTable's engine and some of the things it is able to do:


So, when you import from memory (clipboard) or from a file, here is a partial list of what happens:

[The engine analyzes the data, answering the following questions]

  • Is this an XML string?
    • Can it turn into a 2-D table?
  • Is this JSON data?
    • Is this serial JSON data (repeated columns)?
    • Which parts of the data can turn into a 2-D table?
  • Is this textual data?
    • Does the data seem valid?
    • Judging by the string pattern, what are the most likely column/row delimiters?
    • Does the data include headers?
    • Is there whitespace to be removed?
  • Are there any duplicate field names?
  • What is the best data type for each column?
  • What's the fastest method to write the table into the destination server?
    • Does this server support Bulk Copy/Insert?
    • Can insert statements be sent in one big batch?
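Two of the checklist items, delimiter detection and data-type inference, can be sketched roughly like this. Note that this is an illustrative reconstruction, not ClipTable's actual engine code:

```python
def guess_delimiter(text, candidates=(",", "\t", ";", "|")):
    """Pick the candidate whose per-row count is most consistent:
    a real delimiter splits every row into the same number of columns."""
    rows = [r for r in text.splitlines() if r]
    best, best_score = None, -1
    for d in candidates:
        counts = [row.count(d) for row in rows]
        if counts[0] == 0:
            continue  # candidate absent from the header row
        consistency = sum(1 for c in counts if c == counts[0])
        if consistency > best_score:
            best, best_score = d, consistency
    return best

def guess_type(values):
    """Choose the narrowest type that fits every value in a column."""
    for caster, name in ((int, "INTEGER"), (float, "FLOAT")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "TEXT"

data = "id\tname\tscore\n1\talice\t9.5\n2\tbob\t7.0"
delim = guess_delimiter(data)
rows = [r.split(delim) for r in data.splitlines()]
types = [guess_type(col) for col in zip(*rows[1:])]
print(delim, types)  # prints the tab delimiter and ['INTEGER', 'TEXT', 'FLOAT']
```

The real engine obviously handles many more cases (quoted fields, headers, duplicate names), but the shape of the analysis is the same: score candidate interpretations, then pick the most consistent one.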
These engine features let ClipTable turn raw clipboard or file input into a ready database table in just a few seconds. (The original post showed a series of before/after screenshots of this transformation here.)

Wednesday, November 4, 2015

SQL ClipTable now offers better JSON handling and SQL Server 2016 support


There's a new version of SQL ClipTable released today, which has multiple engine updates as well as two important features:

  • SQL Server 2016 support (tested against CTP's, will make sure RTM is fully supported upon release)
  • JSON format support has been extended to handle a 'serial' JSON file, which is basically a file in which each line represents a new JSON-based row with a similar schema.
As a reminder, ClipTable lets you import almost any data, of any shape and form, into your database. It supports all major relational database engines available today, including SQL Server, Oracle, MySQL, PostgreSQL, Firebird and more.


Want to see the extended JSON support in action?
Here's an example of such a file: try copying the lines below and pasting them into ClipTable.

{"accept_lang":"hr-HR","object_id":"149987216","object_pos":1,"client_ip":"212.92.194.156","correlation_id":"0AF0010A-B638_AC110006-0050_560B25D6_1FDF5E8-0001","event_ids":["496538287"],"gallery_object_count":15,"gallery_title":"Jimmy Page","gallery_type":"personality","gallery_type_id":"208663","geo":{"city_name":"","continent_code":"EU","coordinates":[15.5,45.1667],"country_code2":"HR","latitude":45.1667,"longitude":15.5,"timezone":"Europe/Zagreb"},"has_ads":true,"referer":"http://www.google.hr/url?sa=i","signal_brand":"dl","signal_event":"object Viewed","signal_source":"dl","spectrum":"dl-prod-producer-51","timestamp":"2015-09-30T00:00:03Z","url":"http://www.retrofile.com/detail/news-photo/","user_agent":"Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko","user_id":"0","visitor_id":"5bf0ba11-2373-44c0-b4d3-b0ccad75f4fa"}
{"accept_lang":"en-US,en;q=0.5","object_id":"160011161","object_metadata":{"object_family":"Creative","object_type":"image","editorial_products":[],"has_nudity":false},"client_ip":"12.204.91.250","correlation_id":"0AF0010A-E611_AC110006-0050_560B25ED_1FE07C4-0001","geo":{"city_name":"Riverside","continent_code":"NA","coordinates":[-117.4031,33.9059],"country_code2":"US","latitude":33.9059,"longitude":-117.4031,"timezone":"America/Los_Angeles"},"queried_keywords_by_clause":null,"referer":"http://www.doccolabs.com/","signal_brand":"dl","signal_event":"object Details Viewed","signal_source":"dl","spectrum":"dl-prod-producer-51","timestamp":"2015-09-30T00:00:02Z","url":"http://www.doccolabs.com/blog","user_agent":"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0","user_id":"0","visitor_id":"0a67788b-0022-455c-85b3-6e420787a069"}
{"accept_lang":"en-GB","ads_viewed":[8.2626951421e+10],"object_id":"175554847","object_pos":12,"client_ip":"81.108.221.204","correlation_id":"0AF0010A-C299_AC110006-0050_560B25E6_200D348-0001","event_ids":["526528765"],"gallery_object_count":15,"gallery_title":"Una Healy","gallery_type":"personality","gallery_type_id":"5523039","geo":{"city_name":"Royal Leamington Spa","continent_code":"EU","coordinates":[-1.5047000000000001,52.3123],"country_code2":"GB","latitude":52.3123,"longitude":-1.5047000000000001,"timezone":"Europe/London"},"has_ads":true,"referer":"http://www.doccolabs.com/purchase/","signal_brand":"dl","signal_event":"object Viewed","signal_source":"dl","spectrum":"dl-prod-producer-51","timestamp":"2015-09-30T00:00:01Z","url":"http://www.doccolabs.com/purchase/","user_agent":"Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko","user_id":"0","visitor_id":"c902ed6c-96ff-4778-a85e-ca9fc7a9dd13"}
{"accept_lang":"en-US,en;q=0.8","object_id":"112027935","object_pos":0,"client_ip":"108.54.239.65","correlation_id":"0AF0010A-0B19_AC110006-0050_560B25F5_20070C2-0001","event_ids":["75330148"],"gallery_object_count":15,"gallery_title":"Charmed Celebrates 150 Episodes and First Season on DVD","gallery_type":"event","gallery_type_id":"75330148","geo":{"city_name":"Jamaica","continent_code":"NA","coordinates":[-73.7939,40.7148],"country_code2":"US","latitude":40.7148,"longitude":-73.7939,"timezone":"America/New_York"},"has_ads":true,"referer":"http://www.google.com/","signal_brand":"dl","signal_event":"object Viewed","signal_source":"dl","spectrum":"dl-prod-producer-51","timestamp":"2015-09-30T00:00:01Z","url":"http://www.google.com/flights","user_agent":"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36","user_id":"0","visitor_id":"44e8d178-4022-4495-9ea7-412e797ae96b"}
{"accept_lang":"hr-HR","object_id":"149987216","object_pos":1,"client_ip":"212.92.194.156","correlation_id":"0AF0010A-B638_AC110006-0050_560B25D6_1FDF5E8-0001","event_ids":["496538287"],"gallery_object_count":15,"gallery_title":"Jimmy Page","gallery_type":"personality","gallery_type_id":"208663","geo":{"city_name":"","continent_code":"EU","coordinates":[15.5,45.1667],"country_code2":"HR","latitude":45.1667,"longitude":15.5,"timezone":"Europe/Zagreb"},"has_ads":true,"referer":"http://www.google.hr/url?sa=i","signal_brand":"dl","signal_event":"object Viewed","signal_source":"dl","spectrum":"dl-prod-producer-51","timestamp":"2015-09-30T00:00:03Z","url":"http://www.retrofile.com/detail/news-photo/","user_agent":"Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko","user_id":"0","visitor_id":"5bf0ba11-2373-44c0-b4d3-b0ccad75f4fa"}
{"accept_lang":"en-us","client_ip":"107.220.186.143","correlation_id":"0AF0010A-F236_AC110006-0050_56297879_115B0572-0001","gallery":{"object_count":8,"related_celeb":0,"title":"46th Mt. San Antonio College Relays - April 18, 2004","type":"event","type_id":"75059348"},"geo":{"city_name":"Cumming","continent_code":"NA","coordinates":[-84.1404,34.1483],"country_code2":"US","latitude":34.1483,"longitude":-84.1404,"timezone":"America/New_York"},"referer":"https://www.google.com/","session_id":"276095c4fc","signal_brand":"dl","signal_event":"Gallery Viewed","signal_source":"dl","spectrum":"dl-prod-producer-52","timestamp":"2015-10-23T00:00:00Z","url":"http://www.doccolabs.com/","user_agent":"Mozilla/5.0 (iPhone; CPU iPhone OS 9_0_2 like Mac OS X) Safari/601.1","user_id":"0","visitor_id":"81ebb423-bda9-4707-b897-7cc86d58de91"}
{"accept_lang":"","client_ip":"81.158.98.51","correlation_id":"0AF0010A-1680_AC110006-0050_56297882_115B0C02-0001","gallery":{"object_count":15,"related_celeb":0,"title":"Prince Harry Visits Nottingham","type":"event","type_id":"166614266"},"geo":{"city_name":"Dundee","continent_code":"EU","coordinates":[-2.9766,56.461],"country_code2":"GB","latitude":56.461,"longitude":-2.9766,"timezone":"Europe/London"},"referer":"http://www.google.com/search","session_id":"9f1bcbe265","signal_brand":"dl","signal_event":"Gallery Viewed","signal_source":"dl","spectrum":"dl-prod-producer-52","timestamp":"2015-10-23T00:00:03Z","url":"http://www.doccolabs.com/detail/news-photo/prince-harry-signs-his-name-with-a-paint-pen-on-a-large-news-photo/167504061","user_agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview) Chrome/27.0.1453 Safari/537.36","user_id":"0","visitor_id":"1e65bdc1-7862-438c-99e5-55a0d89b7cbd"}
{"accept_lang":"","client_ip":"173.88.159.62","correlation_id":"0AF0010A-3757_AC110006-0050_56297881_1148FB3A-0001","gallery":{"object_count":15,"related_celeb":0,"title":"Sports Contributor Archive 2015","type":"event","type_id":"530880577"},"geo":{"city_name":"North Ridgeville","continent_code":"NA","coordinates":[-82.019,41.3895],"country_code2":"US","latitude":41.3895,"longitude":-82.019,"timezone":"America/New_York"},"referer":"http://www.google.com/search","session_id":"7125b47d91","signal_brand":"dl","signal_event":"Gallery Viewed","signal_source":"dl","spectrum":"dl-prod-producer-52","timestamp":"2015-10-23T00:00:02Z","url":"http://www.doccolabs.co.uk/detail/news-photo/andre-mcgee-of-the-louisville-cardinals-in-action-during-a-news-photo/493431424","user_agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview) Chrome/27.0.1453 Safari/537.36","user_id":"0","visitor_id":"fc049f9e-22c0-408f-8466-fa4ee376ba10"}
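To illustrate what handling 'serial' JSON involves, here is a minimal sketch that flattens lines like the ones above into a 2-D table. It is a toy reconstruction, not ClipTable's actual code; nested objects are left unflattened here, while a real importer might expand them into extra columns.

```python
import json

def serial_json_to_table(text):
    """Parse one JSON object per line, union all top-level keys into a
    single column list, and emit one row per line (missing keys -> None)."""
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:
                columns.append(key)  # preserve first-seen column order
    rows = [[rec.get(col) for col in columns] for rec in records]
    return columns, rows

# Two trimmed-down lines in the same style as the sample above.
sample = '\n'.join([
    '{"object_id": "149987216", "geo": {"country_code2": "HR"}, "has_ads": true}',
    '{"object_id": "160011161", "user_id": "0"}',
])
columns, rows = serial_json_to_table(sample)
print(columns)  # ['object_id', 'geo', 'has_ads', 'user_id']
print(rows)
```

Because each line shares a similar schema, the union of keys converges quickly, and the result maps directly onto a table definition.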

Thursday, October 29, 2015

The DBA’s guide to the [new] galaxy: Part II


| Part 2: Adapt to the change





In the previous post we talked about how the database world has changed and is constantly changing. With this in mind, let’s continue the discussion.





Knowledge & Expertise
Let’s take me, for example -
I’ve been working with SQL Server for many years (and still am). I love this platform, develop tools that make it more fun to work with, am active in the great SQL community, and make sure to always stay updated with the latest builds coming out.

For many years, I’ve spent a lot of free time extending my knowledge of SQL Server, while some of it came as part of my day-to-day roles.
Just like with any other technology, the knowledge and expertise one can gather (in SQL Server, for example) is almost endless - given the amount of time dedicated to it.

But when you look at this from 30,000 ft, you come to wonder -
At which point does knowing the really deep-dive material (like memorizing trace flags, knowing the internals of resource allocation and lock acquisition, mastering the in-depth tricks to force the server to build a specific execution plan, and the list goes on…)
become a classic case of diminishing returns? <wiki>




I argue that in most cases, knowledge gained beyond the orange marker is rarely required in practice.
Of course, given unlimited learning time and resources, go ahead and acquire all possible knowledge, but when this is not the case - maybe this time should be spent to learn new things?
Instead of a DBA, why not become a multicultural DBA?

(And just to be clear - getting to the orange marker takes many years of experience!)






Getting out of your comfort zone

So to rephrase, what I’m trying to say is the following:
Given the data stores available today, and given the limited learning time an average full-time employee has (say, per week) -
There is a higher probability you will need to understand how to work with, say, Apache Spark than to know how to force an execution plan to use parallelism in SQL Server. Again, just an example.
Expanding your knowledge on various data storage/processing/retrieval engines for at least some of the leading new platforms available today is extremely important!

Please read the two lines ^above^ again. They are pretty much the essence of this article.

Being familiar with various data stores, as well as understanding theories like the “CAP Theorem” (below), is essential to being able to choose the right platform for your next project.



Where should I begin?

451 Research compiled a beautiful map of data platforms: https://451research.com/images/Marketing/dataplatformsmapoctober2014.pdf
Beyond the actual content, it is apparent that the list is huge! Not only that - it is rapidly growing (the report was created about a year before this article, so the list is even bigger today).
It is nearly impossible to dive into all of these products, let alone master them all.
So my suggested approach is as follows:



  • Understand the concepts and key points of different data stores, as well as where they fit in a high-level theory (like the CAP theorem above).
    So, assuming you already know ‘relational’ pretty well - focus on non-relational: key-value, document-based and distributed frameworks.
  • Choose a leading platform from each area. This can be a popular/trending platform (such as Spark, Hadoop, MongoDB, Redis, Cassandra and others)
    or a framework that is already being used within your organization (which can help you connect the learning material with a practical implementation).
  • Practice:
    Reading is fine, but hands-on experience is extremely important. If you are a Windows user you may need to install a Linux VM, as many of the new platforms are Linux-native.
    Some vendors offer a ready-to-go image. Some offer online simulators. Use those!


To emphasize the last item: if your natural [OS] habitat is Windows, be aware that the vast majority of new systems are not Windows-native.
In fact, it makes sense that open-source services run on a free operating system. It’s time to refresh your Linux skills!

One more thing - Consider your existing knowledge as a great advantage!

For example, SQL (in its different variations) is still one of the most popular languages for querying data.
You will find many SQL implementations well integrated into newer technologies (Hive and Spark SQL, to name just a couple).
Opening up to new data products will probably boost your résumé, but that is only a side effect. The main advantage is that it will help you choose the right data platform(s) for your next project!

More about that in Part 3...

Tuesday, October 20, 2015

The DBA’s guide to the [new] galaxy




| Prologue

I’ve been wanting to write this article for quite a while. Consider it a high-level summary of today’s data world, given from my own personal perspective.
Specifically, it is an overview for all of my fellow SQL DBAs (admins, developers, architects, warehouse/BI specialists) who are getting exposed to the new data platforms being built around them,
and who quietly wonder how to adjust to these changes and what the right approach should be (a quick tl;dr hint - denial is not the right one!)

Are you ready? Let’s go!



| Part 1: A quick “12-steps” phase to acknowledge the data world is changing


Until not too long ago, a typical organization would hold its entire back-end data stack in one or more relational databases.
To handle the data, that typical organization would hire one or more DBA’s responsible for tasks such as storage planning, administration, development, tuning, DR and more.
Also, a typical DBA would usually specialize in one (usually relational) database.

While this is still relevant, reality has forced changes on the traditional data world; here’s a brief look at what changed and why:

Over the years, data collection volumes have grown exponentially!
There’s an ever-growing need to collect and store more data, while persisting historical data.
The new data is often richer in content and structure, or sometimes intentionally lacks any formal pre-defined structure.
Storage prices are constantly dropping, especially for commodity storage,
which is increasingly used instead of a very high-end server connected to high-end storage.


The need to facilitate these ever-growing requirements with minimal cost friction has led to new platforms, services and frameworks being built, whether cloud-based or on-premises.

So, the data world has changed, dramatically!

Let’s look at these changes from a different angle:
  •  New products, within new startups as well as existing organizations, will not necessarily (gently put) choose a highly priced relational data store (not naming names, but you know which ones), unless the business model specifically requires one.
  •  There are a *lot* of new technologies, mostly open-source, that have already reached a level of maturity at which many organizations trust them as their production source of record.
  • Often, regardless of pricing, the required solution does not even fit a relational model. As a result, unlike a decade ago, you will see fewer and fewer relational databases trying to imitate processes that were never intended to be done inside a relational database. Need some examples? Key-value stores, document-based databases, unstructured data, true scale-out (shared-nothing) architectures, queues, graph data and more.
  •  In addition to cutting licensing costs, scaling out the data (both storage and processing) reduces hardware cost. A high-end server can easily cost as much as a new house!

Given the above, and with the assurance that a certain *free technology can be better (let alone ‘good enough’) at handling its data services - it is very likely such a technology will be chosen by almost any organization over its costly rivals.

*This “Free” has its costs, but I’ll get to that later in the article.



Do you see this happening in your organization? If not, it’s just a question of when, not if.
If any of the above is news to you, or if you knew ‘something was going on’ but were gently ignoring it, you may feel a bit of discomfort. (If you do, that is totally fine.)

First, it is important to be aware of what’s out there.
Second, it is also important to keep in mind that relational databases are still very strong and dominant, and will stay that way for many good reasons.
In fact, for structured data with constraints and relationships to other data objects that need to be consistent, isolated and transaction-safe, there is no better solution than the relational model.

Let’s have a look at some graphs, shall we?

 (Note: relevant to this article’s date, so 2015-ish)

So, breathe in, breathe out! Here are some “perspective” graphs --

The first one, from the “DB-Engines” website’s ‘popularity and trends’ section, shows the commercial relational databases flying high above everyone else.
However, their scores are stable, if slowly decreasing, while the newer “players” gradually climb.





The second graph is coming from Google Trends:

The trends graph shows the same picture - while the “SQL Server” search term used here still ranks above typical newer databases and services, the trend is pretty clear.
We can reasonably treat search popularity as a rough proxy for usage.

OK, so the data world has changed. Now what?

Continued in Part 2