
Thursday, November 21, 2013

Rise Of Big Data On Cloud


Growing up as an engineer and a programmer, I was reminded at every step along the way that resources, computing as well as memory, are scarce. Programs were designed around these constraints. Then the cloud revolution happened, and we told people not to worry about scarce computing. We saw the rise of MapReduce, Hadoop, and countless other NoSQL technologies. Software was the new hardware. We owe it to all this software development, especially the computing frameworks, that developers could leverage the cloud's computational elasticity without having to understand the complexity underneath it. What has changed in the last two to three years is that a) the underlying file systems and computational frameworks have matured, and b) the adoption of Big Data is driving demand for scale-out and responsive I/O in the cloud.

Three years back I wrote a post, The Future Of BI In The Cloud, where I highlighted two challenges of using the cloud as a natural platform for Big Data. The first was creating a large-scale data warehouse in the cloud, and the second was the lack of scale-out computing for I/O-intensive applications.

A year back Amazon announced Redshift, a data warehouse service in the cloud, and last week they announced high I/O instances for EC2. We have come a long way, and the more I look at the current capabilities and trends, the more realistic Big Data at scale on the cloud seems.

From batch data warehouses to interactive analytic applications:

Hadoop was never designed for I/O-intensive applications, but because it is such a compelling computational scale-out platform, developers had a strong desire to use it for their data warehousing needs. This made Hive and HiveQL popular analytic frameworks, but it was a suboptimal solution: it worked well for batch loads and wasn't suitable for responsive, interactive analytic applications. Several vendors realized there's no real reason to stick to the original style of MapReduce. They kept HDFS but invested significantly in alternatives to Hive that are far faster.
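To ground the HiveQL discussion, here is a minimal sketch, with a hypothetical sales table and partition, of how such an aggregation is typically submitted as a batch job through the Hive CLI. Hive compiles the query into MapReduce jobs, which is why even a simple query pays job-startup and disk-shuffle costs; engines such as Impala and Presto accept largely the same SQL but execute it without that overhead.

    # A minimal sketch (hypothetical table and partition) of running a HiveQL
    # aggregation as a batch job via the Hive CLI; assumes a configured Hive client.
    import subprocess

    query = """
    SELECT region, SUM(amount) AS revenue
    FROM sales
    WHERE dt = '2013-11-01'
    GROUP BY region
    ORDER BY revenue DESC
    LIMIT 10
    """

    # `hive -e` executes a quoted HiveQL string; Hive turns it into MapReduce jobs.
    subprocess.call(["hive", "-e", query])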

There is a series of such projects and products being developed with HDFS and MapReduce as the foundation, adding special data management layers on top to run interactive queries much faster than plain vanilla Hive. Some examples are Impala from Cloudera and Apache Drill from MapR (both inspired by Dremel), HAWQ from EMC, Stinger from Hortonworks, and many other start-ups. It's not only vendors: early adopters such as Facebook have built alternatives as well. Facebook created Presto, a distributed SQL engine that is far faster than Hive for interactive queries, and recently open sourced it.

From raw data access frameworks to higher-level abstraction tools:

As vendors continue to build more and more Hive alternatives, I am also observing vendors investing in higher-level abstraction frameworks. Pig was among the first higher-level frameworks that made it easier to express data analysis programs. Now we are witnessing even richer frameworks, such as Cascading and Cascalog, that let developers go beyond SQL-style queries and write data programs in higher-level languages such as Java and Clojure. I'm a big believer in empowering developers with the right tools. Working directly against Hadoop has a significant learning curve, and developers often end up spending time on plumbing and other things that can be abstracted away by a tool. In web development, the popularity of Angular and Bootstrap shows how the right frameworks and tools can make developers far more efficient by sparing them from raw HTML, CSS, and JavaScript controls.
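To illustrate the plumbing point, here is a minimal word-count sketch in the style of Hadoop Streaming (job submission and file wiring omitted). The tab-delimited contract, the reliance on Hadoop's sort phase, and the mapper/reducer boilerplate are exactly the kind of detail that Pig, Cascading, or Cascalog collapse into a few declarative lines.

    #!/usr/bin/env python
    # Word-count mapper/reducer for Hadoop Streaming - a sketch of the plumbing a
    # developer writes when working directly against Hadoop.
    import sys

    def mapper():
        # Emit "word<TAB>1" for every word read from stdin.
        for line in sys.stdin:
            for word in line.split():
                print("%s\t%d" % (word, 1))

    def reducer():
        # Hadoop sorts mapper output by key, so counts for a word arrive contiguously.
        current, count = None, 0
        for line in sys.stdin:
            word, n = line.rsplit("\t", 1)
            if word != current:
                if current is not None:
                    print("%s\t%d" % (current, count))
                current, count = word, 0
            count += int(n)
        if current is not None:
            print("%s\t%d" % (current, count))

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()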

From solid state drives to in-memory data structures: 

Solid state drives were the first step of upstream innovation to make I/O much faster, but I am observing this trend go further, with vendors investing in building in-memory data management layers on top of HDFS. Shark and Spark are among the popular ones. Databricks has made big bets on Spark and recently raised $14M. Shark, built on Spark, is designed to be compatible with Hive but to run queries up to 100x faster by using in-memory data structures and columnar representation, and by not writing intermediate results back to disk the way MapReduce does. This looks a lot like MapReduce Online, a research paper published a few years back; I do see a UC Berkeley connection here.
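For a flavor of the in-memory approach, here is a minimal PySpark sketch (the HDFS path and log format are hypothetical): the working set is cached in cluster memory after the first pass, so subsequent queries avoid re-reading HDFS or spilling intermediate results to disk.

    # A minimal PySpark sketch of the in-memory idea; paths and fields are assumptions.
    from pyspark import SparkContext

    sc = SparkContext(appName="inmemory-demo")

    # Load once from HDFS, parse, and keep the working set in cluster memory.
    events = (sc.textFile("hdfs:///data/clickstream/*.log")
                .map(lambda line: line.split("\t"))
                .cache())

    # The first action materializes and caches; later actions reuse the in-memory copy.
    total = events.count()
    top_pages = (events.map(lambda fields: (fields[1], 1))
                       .reduceByKey(lambda a, b: a + b)
                       .takeOrdered(10, key=lambda kv: -kv[1]))

    print(total, top_pages)
    sc.stop()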

Photo courtesy: Trey Ratcliff

Wednesday, January 16, 2013

A Journey From SQL to NoSQL to NewSQL


Two years back I wrote that the primary challenge with NoSQL is that it's not SQL. SQL has played a huge role in making relational databases popular for the last forty years or so. Whenever developers wanted to design an application, they put an RDBMS underneath and used SQL from all possible layers. Over a period of time, the RDBMS grew in functions and features, such as binary storage, faster access, clustering, and sophisticated access control, and applications reaped these benefits. The traditional RDBMS became a poor fit for cloud-scale applications that fundamentally required scale at a whole different level. Traditional RDBMSs could not support this scale, and even when they could, they were prohibitively expensive for developers to use. Traditional RDBMSs also became too restrictive due to their strict upfront schema requirements, which are not suitable for modern, large-scale consumer web and mobile applications. For these two primary reasons, among many others, we saw the rise of NoSQL. The cloud movement further fueled this growth, and we started to see a variety of NoSQL offerings.

Each NoSQL store is unique in how a programmer accesses it. NoSQL solved the scalability and flexibility problems of the traditional database, but it introduced a set of new problems, the primary ones being a lack of ubiquitous access and of consistency options, especially for OLTP workloads, in schema-less data stores.
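A contrived sketch of that access gap, assuming a local Redis server and the redis-py client purely for illustration: the relational version is the same SQL everywhere, while the key-value version is a store-specific API with key naming, serialization, and secondary lookups pushed into the application.

    # The same "look up a user" task against SQL and against a key-value store.
    import sqlite3
    import redis

    # SQL: one standard, declarative language, regardless of vendor.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    db.execute("INSERT INTO users VALUES (42, 'Ada')")
    name = db.execute("SELECT name FROM users WHERE id = ?", (42,)).fetchone()[0]

    # Key-value store: a store-specific API; data modeling becomes an application concern.
    kv = redis.Redis(host="localhost", port=6379)
    kv.set("user:42:name", "Ada")
    name_kv = kv.get("user:42:name")

    print(name, name_kv)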

This has now led to the NewSQL movement (a term coined by Matt Aslett in 2011), whose working definition is: "NewSQL is a class of modern relational database management systems that seek to provide the same scalable performance of NoSQL systems for OLTP workloads while still maintaining the ACID guarantees of a traditional single-node database system." NewSQL's focus appears to be on gaining performance and scalability for OLTP workloads by supporting SQL as well as custom programming models, and on eliminating cumbersome, error-prone management tasks such as manual sharding, all without breaking the bank. It's a good first step toward a scalable distributed database that supports SQL. It doesn't say anything about mixed OLTP and OLAP workloads, which are one of the biggest challenges for organizations that want to embrace Big Data.

From SQL to NoSQL to NewSQL, one thing remains common: SQL.

Let's not underestimate the power of a simple non-procedural language such as SQL. I believe programmers should focus on the what (non-procedural, such as SQL) and not the how. Exposing the "how" invariably ends up making the system harder to learn and harder to use. Hadoop is a great example of this phenomenon. Even though Hadoop has seen widespread adoption, it's still limited to silos within organizations; you won't find a large number of applications written exclusively for Hadoop. Developers first have to learn how to structure and organize data in a way that makes sense for Hadoop, and then write extensive procedural logic to operate on that dataset. Hive is an effort to simplify many of these steps, but it still hasn't gained the desired popularity. The lesson here for the NewSQL vendors is: don't expose the internals to the application developers. Let the few developers who are closer to the database deal with storing and configuring the data, but give application developers easy, ubiquitous access. Enterprise software is all about SQL, and embracing, extending, and augmenting SQL is the smart thing to do. I expect all the vendors to converge somewhere. This is how RDBMSs and SQL grew: the initial RDBMSs were far from perfect, but SQL always worked, and the RDBMSs eventually got better.

Distributed databases are just one part of the bigger puzzle. Enterprise software is more about mixing OLAP and OLTP workloads, and that is the biggest challenge. SQL skills and tools are highly prevalent in this ecosystem, and more importantly, people have a SQL mindset that is much harder to change. The challenge for vendors is to keep this abstraction intact and extend it without exposing the underlying architectural decisions to the end users.

The challenge that I threw out a couple of years back was:

"Design a data store that has ubiquitous interface for the application developers and is independent of consistency models, upfront data modeling (schema), and access algorithms. As a developer you start storing, accessing, and manipulating the information treating everything underneath as a service. As a data store provider you would gather upstream application and content metadata to configure, optimize, and localize your data store to provide ubiquitous experience to the developers. As an ecosystem partner you would plug-in your hot-swappable modules into the data stores that are designed to meet the specific data access and optimization needs of the applications."

We are not there yet, but I do see signs of convergence. As a Big Data enthusiast, I love this energy. Curt Monash has started his year blogging about NewSQL. I have blogged about a couple of NewSQL vendors, NimbusDB (now NuoDB) and GenieDB, in the past, and I have also discussed the challenges with OLAP workloads in the cloud due to their I/O-intensive nature. I am hoping that NewSQL will be inclusive of OLAP and keep SQL as its first priority. The industry is finally on to something, and some of these start-ups are set to disrupt in a big way.

Photo Courtesy: Liz

Friday, May 06, 2011

Disruptive Cloud Start-Ups - Part 1: NimbusDB

Attending Under The Radar (UTR) - watching disruptive companies present and networking with entrepreneurs, thought leaders, and venture capitalists - is an annual tradition I don't miss. I have blogged about the disruptive start-ups I saw in previous years. The biggest exit out of UTR that I have witnessed so far is Salesforce.com's $212 million acquisition of Heroku. This post is about one of the disruptive start-ups that I saw at UTR this year - NimbusDB.

I met Barry Morris, the CEO and founder of NimbusDB, at a reception the night before and had a long conversation with him about the issues with legacy databases, NoSQL, and of course NimbusDB. I must say that, after a long time, I have seen a company applying all the right design principles to solve a chronic problem - how do you make SQL databases scale so that they don't suck?

One of the main issues with legacy relational databases is that they were never designed to scale out in the first place. A range of NoSQL solutions addressed the scale-out issue, but the biggest problem with NoSQL is that NoSQL is not SQL. This is why I was excited when I saw what NimbusDB has to offer: it's a SQL database on the surface, but it has a radically modern architecture underneath that borrows ideas from MapReduce to divide and conquer queries, from BitTorrent for peer-to-peer messaging, and from Dynamo for persistence.

NimbusDB's architecture isolates transactions from storage and uses asynchronous messaging across nodes - a non-blocking atom commit protocol - to gain horizontal scalability. At the application layer, it supports most of the SQL-99 features and doesn't require developers to re-learn or re-code. The architecture doesn't involve any kind of sharding, and the nodes can scale on commodity machines across a variety of operating systems. This eliminates the explicit need for a separate hot backup, since any and all nodes serve as a live database in any zone. That makes NimbusDB an always-live system, which also solves a major problem with traditional relational databases - high availability. It's an insert-only database that versions every single atom/record, which is also how it achieves MVCC. The data is compressed on disk and accessed from in-memory nodes.
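To make the insert-only, versioned-record idea concrete, here is a toy sketch of my own (an illustration of the general MVCC idea, not NimbusDB code): every write appends a new version, and a reader sees the newest version that existed at its snapshot, so writers never block readers.

    # Toy append-only store illustrating versioned records and snapshot (MVCC-style) reads.
    import itertools

    class VersionedStore:
        def __init__(self):
            self._versions = {}              # key -> list of (txn_id, value); never updated in place
            self._clock = itertools.count(1)

        def write(self, key, value):
            txn = next(self._clock)          # every write appends a new version under a new txn id
            self._versions.setdefault(key, []).append((txn, value))
            return txn

        def snapshot(self):
            return next(self._clock)         # a reader remembers the clock at snapshot time

        def read(self, key, snapshot_id):
            # Return the newest version written before the snapshot; no locks, no blocking.
            for txn, value in reversed(self._versions.get(key, [])):
                if txn < snapshot_id:
                    return value
            return None

    store = VersionedStore()
    store.write("x", 1)
    snap = store.snapshot()
    store.write("x", 2)                      # a later write does not disturb the earlier snapshot
    print(store.read("x", snap))             # -> 1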

I asked Barry about using NimbusDB as an analytic database, and he said that the database is currently not optimized for analytic queries, but he does not see why it can't be tuned and configured as one, since the inherent architecture doesn't really have to change. During his pitch, though, he did mention that NimbusDB may have challenges with heavy reads and heavy writes. I personally believe that solving the problem of analytic queries on large volumes of data is a much bigger challenge in the cloud due to its inherently distributed nature. Similarly, building a heavy-insert system is equally difficult. However, most systems fit somewhere in between, and that could be a great target market for NimbusDB.

I haven't played around with the database, but I do intend to. At a cursory look, it seems to defy the CAP theorem; Barry disagrees with me. The founders of NimbusDB have great backgrounds. Barry was the CEO of IONA and StreamBase and has extensive experience in building and leading technology companies. If NimbusDB can execute on the principles it is designed around, this will be a huge breakthrough.

As a general trend, I see a clear transition: people finally agree that SQL is still the preferred interface, and the key is to rethink the underlying architecture.

Update: After I published the post, Benjamin Block raised concerns about NimbusDB not getting the CAP theorem. As I mentioned in the post, I had the same concern, but I would give them the benefit of the doubt for now and watch the feedback as the product goes into beta.

Check out their slides and the presentation:

Tuesday, March 01, 2011

Making The Cut To Favorite Cloud, SaaS, And Tech Bloggers

Dealmaker Media has published a list of their favorite Cloud, SaaS, and Tech bloggers. Once again, I am happy to report that I made the cut. I am also glad to see my fellow bloggers Krishnan and Zoli, the driving force behind CloudAve, on this list. I was on a similar list of top cloud, virtualization, and SaaS bloggers that they published in the past.

Under The Radar is one of the best conferences I go to. It is the best place for disruptive start-ups to pitch and get noticed, and they make a great effort to connect entrepreneurs with investors and bloggers like me. I have blogged about disruptive early-stage cloud computing start-ups as well as disruptive start-ups in the NoSQL and virtualization categories. Most of these start-ups have either had a good exit or are doing well. The best example so far is Heroku's $212M exit; I met the Heroku founders at Under The Radar a couple of years back.

I am looking forward to soaking up even more innovation this year!

Tuesday, November 09, 2010

Challenging Stonebraker’s Assertions On Data Warehouses - Part 2

Check out Part 1, if you haven't already read it, to better understand the context and my disclaimer. This is Part 2, covering assertions 6 to 10.

Assertion 6: Appliances should be "software only."

“In my 40 years of experience as a computer science professional in the DBMS field, I have yet to see a specialized hardware architecture—a so-called database machine—that wins.”

This is the black swan effect: just because someone hasn't seen an event occur in his or her lifetime doesn't mean it won't happen. This statement could also be rewritten as "In my 40 years of experience, I have yet to see a social network that is used by 500 million people." You get the point. I would be the first to vote in favor of commodity hardware over specialized hardware, but there are very specific reasons why specialized hardware makes sense in some cases.

“In other words, one can buy general purpose CPU cycles from the major chip vendors or specialized CPU cycles from a database machine vendor.”

Specialized machines don't necessarily mean specialized CPU cycles. I hope the term "CPU cycles" is used as a metaphor and not in its literal meaning.

“Since the volume of the general purpose vendors are 10,000 or 100,000 times the volume of the specialized vendors, their prices are an order of magnitude under those of the specialized vendor.”

This isn't true. The vendors who make general-purpose hardware also make specialized hardware, and no, it's not an order of magnitude more expensive.

“To be a price- performance winner, the specialized vendor must be at least a factor of 20-30 faster.”

It's a wrong assumption that BI vendors use specialized hardware just for performance reasons. The "specialized" part of an appliance is, in many cases, simply a specialized configuration. Appliance vendors also leverage their relationships with hardware vendors to fine-tune configurations to their requirements, negotiate hefty discounts, and execute a joint go-to-market strategy.

Enterprise software follows value-based pricing, not cost-based pricing. The price difference between a commodity system and a specialized appliance is not just the difference in the cost of the hardware it runs on.

“However, every decade several vendors try (and fail).”

I am not sure what success criteria this assertion uses to declare someone a winner or a failure. The acquisitions of Netezza, Greenplum, and Kickfire are recent examples of how well appliance companies have performed, and the incumbent appliance vendors are doing great, too.

“Put differently, I think database appliances are a packaging exercise”

Appliances are far more than a packaging exercise. Beyond making sure that the software works on the selected hardware, commoditized or otherwise, they give customers a black-box approach to lifecycle management. The upfront cost of an appliance is a small fraction of the overall money customers end up spending during the entire lifecycle of the appliance and the related BI efforts. Customers do welcome an approach where they are responsible for managing one appliance instead of five different systems at ten different levels with fifteen different technology-stack versions.

Assertion 7: Hybrid workloads are not optimized by "one-size fits all."

Yes, I agree, but that's not the point. It's difficult to optimize hybrid workloads for a row or a column store, but it is not as difficult if it's a hybrid store.

“Put differently, two specialized systems can each be a factor of 50 faster than the single "one size fits all" system in solution 1.”

Once again, I agree, but it does not apply to all situations. As I discussed earlier, performance is not the only criterion that matters in the BI world. In fact, I would argue the opposite: because OLTP and OLAP systems are orthogonal, vendors compromised everything else to gain performance. Now that's changing. Take the example of an operational report, the kind of report that only has value if consumed in real time. For such reports, users can't wait until the data is extracted out of the OLTP system, cleaned up, and transferred into the OLAP system. Yes, the query could be 50 times faster, but it is completely useless if you have missed the boat.

Hybrid systems, the ones that combine OLTP and OLAP, are fairly new, but they promise to solve a very specific problem: true real-time. While the hybrid systems evolve, the computational capabilities of OLTP and OLAP systems have started to change as well. I now see OLAP systems supporting write-backs with reasonable throughput, and OLTP systems with good BI-style query performance, all achieved through modern hardware and clever use of architectural components.

Let's not forget what optimization really is: desired functionality at reasonable performance. A real-time report that takes ten seconds to run can be far more valuable than a report that runs in under ten milliseconds, three days later.

“A factor of 50 is nothing to sneeze at.”

Yes, point taken. :-)

Assertion 8: Essentially all data warehouse installations want high availability (HA).

No, they don't. This is like saying all customers want a five-nines SLA on the cloud. I don't underestimate the business criticality of a DW going down, but not all DWs are used 24x7 and are mission critical. One size doesn't fit all. And if your DW is not required to be highly available, you need to ask yourself whether it is fair to pay the architectural cost of HA you don't need. Tiered SLAs are not new, and tiered HA is not a terrible idea.

Let's talk about the DWs that do need to be highly available.

“Moreover, there is no reason to write a DBMS log if this is going to be the recovery tactic. As such, a source of run-time overhead can be avoided.”

I am a little confused by how this is worded. Which logs are we referring to - those of the source systems or of the target system? The source systems are beyond the control of a BI vendor. There are newer approaches to designing an OLTP system without a log, but that's not up for discussion in this assertion. If the assertion refers to the logs of the target system, how does that become a run-time overhead? Traditional DW systems are read-only at runtime; they don't write logs back to the system. If he is referring to logging while the data is being moved into the DW, that's not really run-time, unless we treat it as a hot transfer.

There is one more approach, NoSQL, where eventual consistency is achieved over a period of time and the concept of a "corrupted system" goes away. Incomplete data is expected behavior, and people should plan for it. That's the norm, regardless of whether a system is HA or not. Recently, Netflix moved some of its applications to the cloud and designed a background data fixer to deal with data inconsistencies.

HA is not black and white, and there are many more approaches, beyond logs, to achieve the desired outcome.

Assertion 9: DBMSs should support online reprovisioning.

“Hardly anybody wants to take the required amount of down time to dump and reload the DBMS. Likewise, it is a DBA hassle to do so. A much better solution is for the DBMS to support reprovisioning, without going offline. Few systems have this capability today, but vendors should be encouraged to move quickly to provide this feature.”

I agree, and I would add one thing. Even today, vendors have trouble supporting offline reprovisioning to cater to increasing load. Online reprovisioning is not trivial, since in many cases it requires re-architecting their systems. Vendors typically get away with this, since most customers don't do capacity planning in real time. Unfortunately, traditional BI systems are not commodity systems where customers can plug in more blades when they want them and take them out when they don't.

This is the fundamental reason why the cloud, with its elastic computing, is a great BI platform for addressing such reprovisioning issues. Read my post "The Future Of BI In The Cloud" if you are inclined to understand how horizontal scale-out systems can help.

Assertion 10: Virtualization often has performance problems in a DBMS world.

This assertion, and the one before it, made me write the post "The Future Of BI In The Cloud". I won't repeat what I wrote there, but I will quickly highlight what is relevant.

“Until better and cheaper networking makes remote I/O as fast as local I/O at a reasonable cost, one should be very careful about virtualizing DBMS software.”

Virtualized I/O is not a solution for a large DW with complex queries. However, as I wrote in that post, a good solution is not to make remote I/O faster, but rather to tap into the innovation of software-only SSD block I/O that is local.

“Of course, the benefits of a virtualized environment are not insignificant, and they may outweigh the performance hit. My only point is to note that virtualizing I/O is not cheap.”

This is what disruption initially looks like. You start seeing good-enough value in an approach for certain types of solutions, while it still seems expensive for others. Over a period of time, rapid innovation and economies of scale remove the price barrier. I think that's where virtualization stands today. Organizations have started to use the cloud, IaaS as well as SaaS, for a variety of solutions, including good-enough self-service BI and performance optimization. I expect to see more and more innovation in this area, to the point where traditional large DWs will be able to get enough value out of the cloud even after paying the virtualization overhead.

Thursday, October 28, 2010

Challenging Stonebraker’s Assertions On Data Warehouses - Part 1

I have tremendous respect for Michael Stonebraker. He is a true visionary, and what I like most about him is his drive and passion to commercialize academic concepts. ACM recently published his article "My Top 10 Assertions About Data Warehouses." If you haven't read it, I would encourage you to do so.

I agree with some of his assertions and disagree with a few. I am grounded in reality, but I do have a progressive viewpoint on this topic. This is my attempt to bring an alternate perspective to the rapidly changing BI world I am seeing, and I hope readers take it as constructive criticism. This post has been sitting in my drafts folder for a while; I finally managed to publish it. This is Part 1, covering assertions 1 to 5. Part 2, with the rest of the assertions, will follow in a few days.

“Please note that I have a financial interest in several database companies, and may be biased in a number of different ways.”

I appreciate Stonebraker's disclaimer. I do believe that his view is skewed toward what he has seen and invested in, but I don't believe there is anything wrong with that. I like it when people put their money where their mouth is.

As you might know, I work for SAP, but this is my independent blog; these are my views and not SAP's. I also try hard not to reference SAP products or strategy on this blog, to maintain a neutral perspective and avoid any possible conflict of interest.

Assertion 1: Star and snowflake schemas are a good idea in the data warehouse world.

This reads like an incomplete statement. Star and snowflake schemas are a good idea because they have been proven to perform well in the data warehouse world with row and column stores. However, I have started to see emergent NoSQL-based data warehouse architectures that are far from a star or a snowflake; they are, in fact, schemaless.

“Star and Snowflake schemas are clean, simple, easy to parallelize, and usually result in very high-performance database management system (DBMS) applications.”

The following statement contradicts the statement above.

“However, you will often come up with a design having a large number of attributes in the fact table; 40 attributes are routine and 200 are not uncommon. Current data warehouse administrators usually stand on their heads to make "fat" fact tables perform on current relational database management systems (RDBMSs).”

There are a couple of problems with this assertion:
  1. The schema is not simple: 200 attributes, fact tables, and complex joins. What exactly is simple?
  2. Efficient parallelization of a query depends on many factors beyond the schema - how the data is stored and partitioned, the performance of the database engine, and the hardware configuration, to name a few.

"If you are a data warehouse designer and come up with something other than a snowflake schema, you should probably rethink your design."

Really?

The requirement that the schema has to be perfect upfront has introduced most of the problems in the BI world. I call it design-time latency: the time between a business user deciding what report or information to request and actually getting it (mostly the wrong one). The problem is that you can only report on what is already in your DW and what has been tuned.

This is why the schemaless approach seems more promising: it can cut down the design-time latency by allowing business users to explore the data and run ad hoc queries without locking themselves into a specific structure.
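A contrived sketch of what that looks like, with plain Python standing in for a schemaless document store purely for illustration: records arrive with whatever attributes they have, and the question is formulated after the fact rather than baked into an upfront star schema.

    # Schemaless records: no upfront fact/dimension design, attributes vary per record.
    events = [
        {"type": "order", "region": "EMEA", "amount": 120.0, "channel": "web"},
        {"type": "order", "region": "APJ", "amount": 80.0},            # no "channel" attribute
        {"type": "return", "region": "EMEA", "amount": -30.0, "reason": "damaged"},
    ]

    # An ad hoc question decided after the data landed: EMEA order revenue by channel.
    revenue = {}
    for e in events:
        if e.get("type") == "order" and e.get("region") == "EMEA":
            channel = e.get("channel", "unknown")
            revenue[channel] = revenue.get(channel, 0.0) + e["amount"]

    print(revenue)   # {'web': 120.0}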

Assertion 2: Column stores will dominate the data warehouse market over time, replacing row stores.

This assertion assumes that there are only two ways of organizing data: in a row store or in a column store. That is not true. See my NoSQL explanation above, and my post "The Future Of BI In The Cloud", for an alternate storage approach.

This assertion also assumes that access performance is tightly dependent on how the data is stored. While this is true in most cases, many vendors are challenging the assumption by introducing an acceleration layer on top of the storage layer. This approach makes it feasible to achieve consistent query performance through a clever acceleration architecture that acts as an access layer and does not depend on how the data is stored and organized.

“Since fact tables are getting fatter over time as business analysts want access to more and more information, this architectural difference will become increasingly significant. Even when "skinny" fact tables occur or where many attributes are read, a column store is still likely to be advantageous because of its superior compression ability."

I don't agree that the solution is fatter fact tables when business analysts want more information. And even if it is, how will a column store remain advantageous once the data grows beyond the point where compression isn't that useful?

“For these reasons, over time, column stores will clearly win”

Even if it were only about rows versus columns, the column store may not be the clear commercial winner in the marketplace. Runtime performance is just one of many factors that customers consider when investing in DW and business intelligence.

“Note that almost all traditional RDBMSs are row stores, including Oracle, SQLServer, Postgres, MySQL, and DB2.”

Exactly!

Row stores, with optimization and acceleration, have demonstrated good enough performance to stay competitive. Not that I favor one over the other, but not all row-based DWs are that large, growing rapidly, or suffering performance issues serious enough to warrant a switch from rows to columns.

This leads me to my last issue with this assertion: what about a hybrid store, combining row and column? Many vendors are trying to figure this one out, and if they are successful, it could change the BI outlook. I will wait and watch.

Assertion 3: The vast majority of data warehouses are not candidates for main memory or flash memory.

I am assuming that he is referring to flash used as a memory tier and not to flash as block storage; SSD block storage has huge potential in the BI world.

“It will take a long time before main memory or flash memory becomes cheap enough to handle most warehouse problems.”

Not all DWs are growing at the same speed; one size does not fit all. Even if I agree that prices won't go down significantly, at the current price point main memory and flash memory can speed up many DWs without breaking the bank.

The cost of flash memory is a small fraction of the overall cost of a DW: hardware, licenses, maintenance, and people. If the added cost of flash memory makes the business more agile, reduces maintenance costs, and allows companies to make faster decisions based on smarter insights, it's worth it. The upfront capital cost is not the only deciding factor for BI systems.

“As such, non-disk technology should only be considered for temporary tables, very "hot" data elements, or very small data warehouses.”

This is easier said than done. Customers will spend significantly more time and energy on a complicated architecture to isolate the hot elements and run them on a different software/hardware configuration.

Assertion 4: Massively parallel processor (MPP) systems will be omnipresent in this market.

Yes, MPP is the future; no disagreement there. The assertion is not about on-premise versus the cloud, but I truly believe the cloud is the future of MPP. Other BI issues need to be addressed before the cloud becomes a good platform for a massive-scale DW, but the cloud will beat any other platform when it comes to MPP with computational elasticity.

Assertion 5: "No knobs" is the only thing that makes any sense.

“In other words, look for "no knobs" as the only way to cut down DBA costs.”

I agree that "no knobs" is what customers should strive for to simplify and streamline their DW administration, but I don't expect these knobs to significantly drive down the overall operational cost, or even just the cost associated with DBAs. Not all DBAs have a full-time job managing and tuning the DW. DW deployments go through a cycle whose tasks include requirements gathering, schema design, ETL design, and so on; tuning, or using the "knobs", is just one of many tasks that DBAs perform. I absolutely agree that no knobs would take some burden off the shoulders of a DBA, but I disagree that it would result in significant DBA cost savings.

For a fairly large deployment, there is significant cost associated with the number of IT layers responsible for channeling reports to the business users. There is an opportunity to invest in the right kind of architecture, the right technology stack for the DW, and the tools on top of it to increase the ratio of business users to BI IT. This should also help speed up decision-making based on the insights gained from the data; isn't that the purpose of having a DW in the first place? I see self-service BI as the only way to make IT scale. Instead of cutting DBA costs, I would rather focus on scaling BI IT with the same budget and broader coverage among the business users in an organization.

Monday, October 25, 2010

The Future Of BI In The Cloud



Actual numbers vary based on whom you ask, but the general consensus is that Business Intelligence (BI) and analytics in the cloud is a fast-growing market; IDC expects a compound annual growth rate (CAGR) of 22.4% through 2013. This growth is primarily driven by two kinds of SaaS applications. The first is a purpose-specific, analytics-driven application for business processes such as financial planning, cost optimization, and inventory analysis. The second is a self-service, horizontal analytics application or tool that allows customers and ISVs to analyze data and to create, embed, and share analyses and visualizations.

The category that is still nascent, and will require significant work, is traditional general-purpose BI on large data warehouses (DW) in the cloud. For most enterprises, not only are all the DWs on-premise, but the majority of the business systems that feed data into those DWs are on-premise as well. If these enterprises were to adopt BI in the cloud, it would mean moving all the data, the warehouses, and the associated processes such as ETL into the cloud. But then, some of the biggest opportunities to innovate in the cloud lie just outside of it. I see significant potential for black-box, appliance-style systems that sit on-premise and encapsulate the on-premise complexity - ETL, lifecycle management, and integration - of moving the data to the cloud.

Assuming that enterprises succeed in moving data to the cloud, I see a couple of challenges that, if treated as opportunities, will spur the most BI innovation in the cloud.

Traditional OLAP data warehouses don’t translate well into the cloud:

The majority of on-premise data warehouses run on some flavor of a relational or columnar database, and most BI tools use SQL to access data from these DWs. These databases are not inherently designed to run natively on the cloud. On top of that, the optimizations performed on these DWs, such as sharding, indices, and compression, don't translate well into the cloud either, since the cloud is a horizontally elastic scale-out platform and not a vertically integrated scale-up system.

Organizations are rethinking their persistence options as well as their access languages and algorithms while moving their data to the cloud. Recently, Netflix started moving its systems into the cloud. It's not a BI system, but it has similar characteristics, such as a high volume of read-only data and a few index-based look-ups. The new system uses S3 and SimpleDB instead of on-premise Oracle, and during this transition Netflix picked availability over consistency. Eventual consistency is certainly an option that BI vendors should consider in the cloud. I have also started seeing DWs in the cloud that use HDFS, Dynamo, and Cassandra. Not all relational and columnar DW systems will translate well into NoSQL, but I cannot overemphasize the importance of re-evaluating your persistence store and access options when you decide to move your data into the cloud.

Hive, a DW infrastructure built on top of Hadoop, is a MapReduce-meets-SQL approach. Facebook has 15 petabytes of data in its Hive-based DW supporting its BI needs. Very few companies require such scale, but the best thing about this approach is that you can grow linearly, technologically as well as economically.

The cloud is not yet a good platform for I/O-intensive applications such as BI:

One of the major issues with large data warehouses is, well, the data itself. Any complex query typically involves intensive I/O. But I/O virtualization on the cloud simply does not work for large data sets, and remote I/O, due to its latency, is not a viable option. Block I/O is a popular approach for I/O-intensive applications, and Amazon EC2 does provide block I/O for each instance, but it obviously can't hold all the data, and it's still a disk-based approach.

For BI in the cloud to be successful, what we really need is scale-out block I/O, just like scale-out computing. The good news is that there is at least one company I know of, SolidFire, working on it. I met Dave, the founder, at the Structure conference reception, and he explained what he is up to. SolidFire has a software solution that uses solid state drives (SSDs) as scale-out block I/O. I see huge potential in how this can be used for BI applications.

When you put all the pieces together, it makes sense. The data is distributed across the cloud on a number of SSDs that are available to the processors as block I/O. You run some flavor of NoSQL to store and access this data, leveraging modern algorithms and, more importantly, a horizontally elastic cloud platform. What you get is commodity, blazingly fast BI at a fraction of the cost, with a pay-as-you-go subscription model.

Now, that's what I call the future of BI in the cloud.

Thursday, April 22, 2010

Disruptive Cloud Computing Startups At Under The Radar - NoSQL - Aspirin, Vicodin, and Vitamin

It was great to be back at Under The Radar this year. I wrote about the disruptive cloud computing start-ups I saw at Under The Radar last year, and since then cloud computing has gained significant momentum. This was evident from talking to the entrepreneurs who pitched their start-ups this year: at the conference there was no discussion of what cloud computing is or why anyone should use it. It was all about how, not why. We have crossed the chasm. The companies that presented want to solve "cloud scale" problems as they relate to databases, infrastructure, development, management, and so on. This year, I have decided to break my impressions down into more than one post.

NoSQL has seen staggering innovation in the last year. Here are the two companies in the NoSQL category that I liked at Under The Radar:

NorthScale was in stealth mode for a while and officially launched four weeks back. Their product is essentially a commercial version of memcached that sits in front of an RDBMS to help customers deal with the scaling bottlenecks of a typical large RDBMS deployment. This is not a unique concept - developers have been using memcached for a while for horizontal, cloud-like scaling - but it is an interesting offering that attempts to productize an open source component. Cloudera has achieved reasonable success commercializing Hadoop, and it is good to see more companies believing in the open source business model. They have another product called Membase, which is a replicated persistence store for memcached - yes, a persistence layer on top of a caching layer. It is designed to provide eventual consistency with tunable blocking and non-blocking I/O. NorthScale has signed up Heroku and Zynga as customers and is already making money.
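For context, the usual way memcached sits in front of an RDBMS is the cache-aside pattern, sketched minimally below (assuming a local memcached, the python-memcached client, and a hypothetical users table): reads try the cache first and fall back to the database, which is what takes the load off the RDBMS.

    # Minimal cache-aside sketch: memcached in front of an RDBMS (assumed schema).
    import sqlite3
    import memcache

    cache = memcache.Client(["127.0.0.1:11211"])
    db = sqlite3.connect("app.db")

    def get_user(user_id):
        key = "user:%d" % user_id
        row = cache.get(key)                      # 1. try the cache
        if row is None:
            cur = db.execute("SELECT id, name FROM users WHERE id = ?", (user_id,))
            row = cur.fetchone()                  # 2. on a miss, go to the RDBMS
            if row is not None:
                cache.set(key, row, time=300)     # 3. populate the cache for five minutes
        return row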

As more and more deployments face scaling issues, NorthScale has an interesting value proposition: helping customers with their scaling pain by selling them an aspirin or a Vicodin. NorthScale won the best-in-category award. Check out their pitch and the Q&A:




GenieDB is a UK-based start-up whose product allows developers to use MySQL as a relational database as well as a key-value store, with support for replication with immediate consistency. A few weeks back I wrote a post, NoSQL Is Not SQL And That's A Problem, and GenieDB seems to solve that problem to some extent. Much of transactional enterprise software still runs on an RDBMS and depends on the data being immediately consistent. Enterprise software can certainly leverage key-value stores for certain features where an RDBMS is simply overhead, but using a key-value store that is not part of the same logical data source is an impediment in many different ways; developers want to access data from a single logical system. GenieDB allows table joins between SQL and NoSQL stores. I also like their vertical approach of targeting popular platforms on top of MySQL, such as WordPress and Drupal, and they have plans to support Rails by supporting ActiveRecord natively on their platform. This is a vitamin that, if sold well, has significant potential.

They didn't win a prize at the conference. I believe it wasn't about not having a good product; rather, their pitch failed to convey the magnitude of the problem they could help solve. My advice to them would be to dial up their marketing, hone the value proposition, and set up business development and operations in the US. On a side note, the founder and CEO, Dr. Jack Kreindler, is a "real" doctor - a physician who paid his way through medical school by building healthcare IT systems. Way to go, doc! Check out their pitch and the Q&A:

Friday, March 05, 2010

NoSQL Is Not SQL And That’s A Problem

I do recognize the thrust behind the NoSQL movement. While some are announcing the end of an era for MySQL and memcached, others are questioning the arguments behind Cassandra's OLTP claims and the scalability and universal applicability of NoSQL. It is great to see innovative data persistence and access solutions challenge the long-lasting legacy of the RDBMS. Competition between HBase and Cassandra is heating up, and Amazon now supports a variety of consistency models on EC2.

However, none of the NoSQL solutions solve a fundamental underlying problem: a developer has to pick persistence, consistency, and access options for an application upfront.

I would argue that the RDBMS has been popular for the last 30 years because of ubiquitous SQL. Whenever developers wanted to design an application, they put an RDBMS underneath and used SQL from all possible layers. Over a period of time, the RDBMS grew in functions and features, such as binary storage, faster access, and clustering, and applications reaped these benefits.

I still remember the days when you had to use a rule-based optimizer to teach the database how best to execute a query. These days, cost-based optimizers can find the best plan for a SQL statement and take the guesswork out of the equation. This evolution teaches us an important lesson: application developers, and to some extent even database developers, should not have to learn the underlying data access and optimization techniques. They should expect an abstraction that allows them to consume data, with consistency and persistence optimized based on the application's needs and the content being persisted.

SQL did a great job as a non-procedural language (what to do) compared with many past and current procedural languages (how to do). SQL did not, however, solve the problem of staying independent of the schema; developers still had to learn how to model the data. When I first saw schema-less data stores, I thought we would finally solve the age-old problem of making an upfront decision about how data is organized. We did solve that problem, but we introduced a new one: a lack of ubiquitous access and consistency options for schema-less data stores. Each of these data stores came with its own set of access APIs, not necessarily complicated but uniquely tailored to address parts of the mighty CAP theorem. Some solutions went further and optimized for specific consistency levels, such as eventual consistency, weak consistency, and so on.
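As one concrete illustration of store-specific consistency tuning, here is a sketch using a present-day Cassandra Python driver and a hypothetical users table (both assumptions, for illustration only): the developer chooses a consistency level per statement, trading latency and availability against consistency.

    # Per-statement tunable consistency in Cassandra (DataStax Python driver assumed).
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("demo")

    # The same logical read, but the developer must decide how consistent it is:
    # ONE favors latency and availability, QUORUM favors consistency.
    fast_read = SimpleStatement("SELECT name FROM users WHERE id = %s",
                                consistency_level=ConsistencyLevel.ONE)
    safe_read = SimpleStatement("SELECT name FROM users WHERE id = %s",
                                consistency_level=ConsistencyLevel.QUORUM)

    print(session.execute(fast_read, (42,)).one())
    print(session.execute(safe_read, (42,)).one())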

I am always in favor of giving developers more options; it's usually a good thing. However, what worries me about NoSQL is that it is not SQL. There simply isn't enough push for ubiquitous and universal design-time abstractions. The runtime is certainly getting better, cheaper, and faster, but it is being pushed directly to developers, skipping a whole lot of layers in between. Google designed BigTable and MapReduce. Facebook took the best of BigTable and Dynamo to design Cassandra, and Yahoo! wanted scripting instead of low-level programming on Hadoop and hence designed Pig. These companies spent significant time and resources for one reason: to make their applications run faster and better. What about the rest of the world? Not all applications share the same characteristics as Facebook and Twitter, and enterprise software is certainly quite different.

I would like to throw out a challenge. Design a data store that has ubiquitous interface for the application developers and is independent of consistency models, upfront data modeling (schema), and access algorithms. As a developer you start storing, accessing, and manipulating the information treating everything underneath as a service. As a data store provider you would gather upstream application and content metadata to configure, optimize, and localize your data store to provide ubiquitous experience to the developers. As an ecosystem partner you would plug-in your hot-swappable modules into the data stores that are designed to meet the specific data access and optimization needs of the applications.

Are you up for the challenge?