Showing posts with label big data.

Thursday, July 09, 2015

The Discriminatory Dark Side Of Big Data


It has happened again. Researchers have discovered that Google’s ad-targeting system is discriminatory. Male web users were more likely than female visitors to be shown ads for high-paying executive jobs. The researchers published a paper that was presented at the Privacy Enhancing Technologies Symposium in Philadelphia.

I blogged about the dark side of Big Data almost two years back. Latanya Sweeney, a Harvard professor, Googled her own name and found an ad next to her name for a background check hinting that she had been arrested. She dug deeper and concluded that so-called black-identifying names were significantly more likely to be the targets of such ads. She documented this in her paper, Discrimination in Online Ad Delivery. Google denied that AdWords was discriminatory in any way then, and Google is denying that it is discriminatory now.

I want to believe Google. I don’t think Google believes it is discriminating. And that’s the discriminatory dark side of Big Data. I have no intention of painting a gloomy picture or blaming technology, but I find it scary that technology is changing much faster than the brightest minds can comprehend its impact.

A combination of massively parallel computing, sophisticated algorithms that leverage this parallelism, and the ability of those algorithms to learn and adapt without any manual intervention to become more relevant, almost in real time, is going to cause many more such issues to surface. As a customer, you simply don't know whether the products or services you are offered, or not offered, at a certain price are based on discriminatory practices. To complicate this further, in many cases even companies don't know whether the insights they derive from vast amounts of internal as well as external data are discriminatory. This is the dark side of Big Data.

The challenge with Big Data is not Big Data itself but what companies could do with your data, combined with any other data, without your explicit understanding of how the algorithms work. To prevent discriminatory practices, we audit employment practices to ensure equal opportunity and college admissions to ensure a fair admission process, but I don't see how anyone is going to audit these algorithms and data practices.

Disruptive technology always surfaces socioeconomic issues that either didn't exist before or were not obvious and imminent. Some people get worked up because they don't quite understand how the technology works. I still remember politicians trying to blame GMail for "reading" emails to show ads. I believe that Big Data is yet another such disruption that is going to cause similar issues, and it is disappointing that not much has changed in the last two years.

It has taken Internet companies a while to figure out how to safeguard our personal data, and they are not quite there yet; their ability to control how this data gets used is even more questionable. Let’s not forget that data does not discriminate, people do. We should not shy away from these issues; we should collaboratively work hard to highlight and amplify what these issues might be and address them, as opposed to blaming technology for being evil.

Photo courtesy: Kurt Bauschardt

Monday, March 16, 2015

Chasing Unknown Unknown, The Spirit Of Silicon Valley


A framework that I use to think about problems disruptive technology could help solve is based on what Donald Rumsfeld wrote in his memoir, Known and Unknown:
Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know. And if one looks throughout the history of our country and other free countries, it is the latter category that tend to be the difficult ones.
A couple of decades ago technology was seen as a means to automate manual processes and bring efficiency. While automation is largely a prerequisite in the modern economy, the role of technology has significantly changed: it is now expected to create unique differentiation and competitive advantage against peers in an industry. Many people are working on making things better, cheaper, or faster, or some combination of the three. This approach—solving the known knowns—does provide incremental or evolutionary innovation, and the market does reward it.

But Silicon Valley thinks differently.

Silicon Valley loves to chase known unknown problems, the moonshots, such as self-driving vehicles, providing internet access to every single human being on earth, and private shuttles to space. These BHAGs are totally worth chasing. To a certain degree, we know and experience what the actual problem is, and we can even visualize what a possible solution could look like. As counterintuitive as it may sound, it is relatively easy to get entrepreneurs and investors to rally toward a solution if they can visualize an outcome, even if solving the problem means putting in a monumental effort.
"We can be blind to the obvious, and we are also blind to our blindness.” - Daniel Kahneman
Most disruptive products or business models have a few things in common: they focus on the latent needs of customers, they imagine new experiences and deliver them to customers, and most importantly they find and solve problems people didn’t know they had and couldn’t imagine could be solved - the unknown unknowns.

Chasing unknown unknowns requires bold thinking and a strong belief in your quest and your methods of getting there. Traditional analytical thinking will take you to the next quarter or the next year with double-digit growth, but it won’t bring exponential growth. These unknown problems excite me the most and I truly enjoy working on them. Unknown unknowns are the framework I use to understand the potential of disruptive technology such as Big Data and the Internet of Things. If technology can solve any problem, the question becomes which problem you want it to solve; that is how I think.

Chasing unknown unknowns is not an alternative to going after moonshots; we need both, and in many cases the journey of solving an unknown unknown starts by converting it into a known unknown. The key difference between the two is where you spend your time - looking for a problem and reframing it, or finding a breakthrough innovation for a known, thorny problem. A very small number of people can think across this entire spectrum; most people are good at either finding a problem or solving it, but not both.

Discovering unknown problems requires a qualitative and abductive approach as well as the right set of tools, techniques, and mindset. Simply asking people what problems they want solved won’t take you anywhere when they don’t know they have those problems. I am a passionate design thinker, and I practice, and highly encourage others to practice, qualitative methods of design and design thinking to chase unknown unknowns.

I hope that, as Silicon Valley, we don’t lose the spirit of going after unknown unknowns, since it is hard to raise venture capital and rally people around a problem that people don’t know exists. Empowering people to do things they could not have done before, or never even imagined they could do, is a dream that I want entrepreneurs to have.

Photo courtesy: Ahmed Rabea

Wednesday, October 22, 2014

Disruptive Enterprise Platform Sales: Why Buy Anything, Buy Mine, Buy Now - Part III


This is the third and the last post in the three-post series on challenges associated with sales of disruptive platforms such as Big Data and how you can effectively work with your prospects and others to mitigate them. If you missed the previous posts the first post was about “why buy anything” and the second post was about “why buy mine." This post is about “why buy now."

Platform sales is oftentimes perceived as a solution looking for a problem, a.k.a. a hammer looking for a nail. In most cases your prospects don’t have a real urgency to buy your platform, making it difficult to get them to commit to an opportunity. There are a few things you can do to deal with this situation:

Specific business case

It’s typical for vendors to create a business case positioning their solutions to their prospects. These business cases include details such as a solution summary, pricing, ROI, and so on. If you’re a platform vendor, not only do you have to create this basic business case, you also have to go beyond it. It’s relatively hard to quantify the ROI of a platform since it doesn’t solve one specific problem; it could solve many. It is extremely difficult to quantify the impact of lost opportunities. If your prospect doesn’t buy anything, do they lose money on lost opportunities? Traditional NPV-style analysis goes for a toss in such situations.
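To make this concrete, a business case for a platform looks less like a single-point NPV and more like a probability-weighted sum over the scenarios the platform could enable. Here is a minimal, hypothetical sketch; the use cases, probabilities, and dollar figures are invented for illustration, not numbers from any real deal.

```python
# Hypothetical scenario-weighted platform business case (illustrative numbers only).
# Instead of a single-point NPV for one known problem, sum the probability-weighted
# value of every use case the platform could enable, then subtract the investment.

discount_rate = 0.10
platform_cost = 1_000_000  # assumed upfront investment

# (use case, probability it materializes, annual value if it does)
scenarios = [
    ("promotion forecast optimization", 0.6, 500_000),
    ("customer churn prediction",       0.4, 300_000),
    ("unplanned future use cases",      0.3, 400_000),
]

def expected_npv(scenarios, cost, rate, years=3):
    """Probability-weighted NPV over a fixed horizon."""
    value = 0.0
    for _, prob, annual_value in scenarios:
        for year in range(1, years + 1):
            value += prob * annual_value / (1 + rate) ** year
    return value - cost

print(f"Expected 3-year NPV: ${expected_npv(scenarios, platform_cost, discount_rate):,.0f}")
```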

As a vendor, not only will you have to articulate the problems (scenarios/use cases) you identified leading up to this step, you might also have to include scenarios that were not specifically discussed during the evaluation phase. Getting validation from the business on the expected return on their investment, while fulfilling their vision, is crucial, since your numbers will most likely be challenged when your prospect creates its own business case to secure the investment needed to buy your platform.

Leveraging the excitement

What seemed like a problem when you worked with a variety of people inside your prospect’s organization may not seem like a problem in a few weeks or months. It’s very important in a platform sales cycle not to lose momentum. Find a champion of your pilot and keep socializing the potential of your platform inside your prospect’s organization as much as you can while you work on the commercials of your opportunity. People should be talking about your disruptive platform and wanting to work with you. Seize that moment to close.

Knowing who will sign the check

Platform sales are convoluted. The people who have helped you so far may not necessarily help you with the last step—not that they don’t want to, but they may not be the buyers who will sign the check. It’s not uncommon in enterprise software sales to have influencers who are not the final buyers, but those buyers at least have a somewhat defined procurement process for standard solutions. When it comes to buying a platform, many buyers don’t quite understand why they should spend money on a disruptive platform that may or may not solve a specific problem.

To complicate this further, disruptive technology typically gets cheaper as it matures. This gives your prospect one more reason to wait and not buy your platform now. As I mentioned in the previous post, your focus should never be on pricing (unless, of course, you are the best and cheapest vendor by a wide margin) but on immediate returns, avoiding lost opportunities, and helping your prospect gain competitive differentiation in their industry.

Despite working with your prospect for a while, helping them define problems and piloting your platform to prove out the value proposition, you might be asked to do it all over again. There are two ways to mitigate this situation: a) involve buyers early in the sales process, not at the very end, so that they are part of the journey, and b) work aggressively with your influencers to establish appropriate communication channels with the buyers so that it’s the influencer’s voice they hear and not yours.

Happy selling!

Photo Courtesy: Wierts Sebastien  

Monday, September 22, 2014

Disruptive Enterprise Platform Sales: Why Buy Anything, Buy Mine, Buy Now - Part II


This is the second post in the three-post series on challenges associated with sales of disruptive platforms such as Big Data and how you can effectively work with your prospects and others to mitigate them. If you missed the first post in the series it was about “why buy anything.” This post is about “why buy mine."

Convincing your prospects they need to buy a platform is just the first step in the sales process. You need to work with them to convince them to buy not just any platform but your platform.

Asking the right questions - empathy for business

This is the next logical step after you have managed to generate organic demand in your prospect’s organization, a.k.a. “why buy anything,” as I mentioned in Part I. Unlike applications, platforms don’t answer a specific set of questions (functional requirements). You can’t really position and demonstrate the power of your platform unless you truly understand what questions your prospect needs answered. Understanding your prospect’s questions means working closely with them to understand their business and their latent needs. Your prospect may or may not tell you what they might want to do with your platform; you will need to figure it out for them. You will have to orchestrate the strategic conversations that have investment legs and understand the problems that are not solvable by the standard off-the-shelf solutions your prospect already has access to.

Answering the right questions - seeing is believing

One of the key benefits of SaaS solutions is your prospect’s ability to test drive your software before they buy it. Platforms, on-premise or SaaS, need to follow the same approach. There are two ways to do this: you either give your prospect access to your platform and let them test drive it, or you work with your prospect, guiding them through how a pilot can answer their questions and tracking their progress. While the latter approach is a high-touch sale, I would advise you to practice it if it fits your cost structure. More on why it is necessary to stay involved during the pilot in the next and last post (Part III) in this series.

Proving unique differentiation

Once your prospect starts evaluating whether or not to buy your platform, it will be compared with competing products as part of their due diligence. This is where you want to avoid an apples-to-apples comparison and focus on unique differentiation.

Even though enterprise platform deals are rarely won on price alone, don’t try to sell something that solves a problem your competitors can solve at the same or a cheaper price. Don’t compete on price unless you are significantly cheaper than your competitors. The best way to position your platform is to demonstrate a few unique features that are absolutely essential to solving your prospect’s core problems and are not just nice-to-haves.

Care deeply for what your prospects truly care about and prove you’re unique.

The next and the last post in this series will be about “why buy now.”

Photo courtesy: Flickr 

Monday, September 01, 2014

Disruptive Enterprise Platform Sales: Why Buy Anything, Buy Mine, Buy Now - Part I


I think of enterprise software in two broad categories: products (or solutions) and platforms. The simplest definition of a platform is something you use to build the solution you need. While I have largely been a product person, I have had significant exposure to the enterprise platform sales process. I have worked with many sales leaders, influencers, and buyers. Whether you're a product person or in a role where you facilitate sales, I hope this post gives you some insights as well as food for thought on the challenges associated with sales of disruptive platforms such as Big Data, and on how you can effectively work with customers and others to mitigate some of these challenges.

I like Mark Suster's sales advice to entrepreneurs through his framework of "why buy anything," "why buy mine," and "why buy now." I am going to use the same framework. Platform sales is still sales in the end, and all the sales rules, tips, and tricks you know still apply. The objective here is to focus on how disruptive enterprise software platform sales is different and what you can do about it.

The first part of this three-post series focuses on "why buy anything."

Companies look for solutions to problems they know exist. Not having a platform is typically not considered a good-enough problem to go and buy something. IT departments also tend to use the tools and technology they already have to solve problems for which they decide to "build" as opposed to "buy." Making your prospects realize they need to buy something is a very important first step in the sales process.

Generating organic demand:

Hopefully you have good marketing people who are generating enough demand and interest in your platform and the category it belongs to. Unfortunately, even great marketing won't be sufficient to generate organic demand for a platform with your prospect. When it comes to platform sales, your job is to create organic demand before you can fulfill it. This is hard, and it doesn't come naturally to many good salespeople I have known. By and large, salespeople are good at three things: i) listen: understand what customers want; ii) orchestrate: work with a variety of people to demonstrate that their product is the best feature and price fit; iii) close: identify the right influencers and work with a buyer to close an opportunity. While platform sales requires these three qualities like any other sale, creating demand, or appetite, is the one very few salespeople have. You have to go beyond what your prospects tell you; you have to assess their latent needs. Your prospects won't tell you they need a disruptive platform, simply because they don't know it.

You're assuming a 1:1 marketing role to create this desire. Connect your prospects with (non-sales) thought leaders inside as well as outside your organization, and invite them to industry conferences to educate them on the category to which your platform belongs. Platform conversations, in most cases, start in unusual places inside your prospect's organization. People who are seen as technology thought leaders, are responsible for "labs" inside their company, or self-select as nerds or tinkerers are the ones you need to evangelize to and win over. These people typically don't sit in the traditional IT organization you know of, and even if they do, they are not the ones who make decisions. They are simply passionate people who love working on disruptive technology and have a good handle on some of the challenges their companies are facing.

Dance with the business and the IT:

As counterintuitive as it may sound, working with non-IT people to sell a technology platform to IT is a good way to go. The "business" is always problem-centric and IT is always solution-centric. Remember, you're chasing a problem and not a solution. Identify a few folks in a line of business who are willing to work with you. This is not easy, especially if you're a technology-only vendor. Identify their strategic challenges that have legs — money attached to them. Evangelize these challenges with IT to generate interest in a disruptive platform that could be a good fit for them.

IT doesn't like disruption, regardless of what they tell you. If they are buying your disruptive platform, they are not buying something else, and they stop using some of the existing platforms or tools they already have. There are people who have built their careers on solutions built with existing tools and technology, and they simply don't want to see that go away. You will have to walk this fine line: get these people excited about a new platform in a way that doesn't threaten their jobs, and perhaps show them how their careers could accelerate if they get onto an emerging technology that very few people in the company know but that is seen as highly strategic in the market. Don't bypass IT; it won't work. Make them your friends, give them an opportunity to shine in front of the business, and give them credit for all the work.

Chasing the right IT spend:

Most enterprise software salespeople generally know two things about their customers: i) the overall IT spend, and ii) how much of it is spent with them. What they typically don't know is how much of that overall IT spend goes to similar technology or platforms but doesn't come their way. There are two ways to execute a sales opportunity: either you find something to sell for the amount your customer typically spends with you on an annual basis, or you go after the larger IT spend and expand your share of the overall pie. It's the latter that is relevant when you're selling a platform to an existing customer (as opposed to a prospect).

A platform, in most cases, is a budgeted investment that falls under the "innovation" or "modernization" category. If you're focused only on your customer's current spending pattern, you may not be able to generate demand for your platform. It is your job to convince your customer to look beyond how they see you as a vendor and be open to investing in a category they might otherwise be reluctant about.

The next post in this series will be about "why buy mine."

Photo courtesy: Stef

Monday, June 30, 2014

Chasing Qualitative Signal In Quantitative Big Data Noise


Joey Votto, who plays for the Cincinnati Reds, is one of the best hitters in MLB. Lately he has received a lot of criticism for not swinging at strikes when there are runners on base. FiveThirtyEight decided to analyze this criticism with the help of data. They found the criticism to be true: his swings at strike-zone pitches, especially fastballs, have significantly declined. But they all agree that Votto is still a great player. This is how I see many Big Data stories go: you can explain the "what" but you can't explain the "why." In this story, no one (that I know of) actually went and asked Votto, "Hey, why are you not swinging at all those fastballs in the strike zone?"

This is not just about sports. I see it every day in my work in enterprise software, working with customers to help them with their Big Data scenarios such as optimizing promotion forecasts in retail, predicting customer churn in telco, or managing risk exposure in banks.

What I find is that as you add more data, it creates a lot more noise in these quantitative analyses rather than getting you closer to a signal. On top of this noise, people expect there to be a perfect model to optimize and predict. Quantitative analysis alone doesn't find the needle in the haystack, but it does help identify which part of the haystack the needle could be hiding in.
"In many walks of life, expressions of uncertainty are mistaken for admissions of weakness." - Nate Silver
I subscribe to and strongly advocate Nate Silver's philosophy of thinking of "predictions" as a series of scenarios with probabilities attached to them, as opposed to a deterministic model. If you are looking for a precise binary prediction, you're most likely not going to get one. Fixating on a model and perfecting it makes you focus on over-fitting the model to past data. In other words, you spend too much time on signal or knowledge that already exists, as opposed to using it as a starting point (a Bayesian prior) and being open to running as many experiments as you can to refine your models as you go. The context that turns your (quantitative) information into knowledge (signal) is your qualitative aptitude and attitude towards the analysis. If you are willing to ask a lot of "why"s once your model tells you the "what," you are more likely to get closer to the signal you're chasing.
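To make the Bayesian starting-point idea concrete, here is a minimal sketch of treating a prediction as a distribution that is refined with each experiment and reported as scenarios with probabilities rather than a single deterministic answer. The prior and the observation counts are invented for illustration.

```python
# A minimal Beta-Binomial updating sketch: start from a prior belief about a
# conversion rate, refine it as new experiments come in, and report scenarios
# with probabilities instead of a single deterministic prediction.
from scipy import stats

# Prior belief (assumed): roughly a 10% conversion rate, held weakly.
alpha, beta = 2, 18

experiments = [(12, 100), (30, 180)]  # (conversions, trials) from successive experiments
for conversions, trials in experiments:
    alpha += conversions
    beta += trials - conversions

posterior = stats.beta(alpha, beta)
print(f"Most likely rate: {posterior.mean():.3f}")
print(f"P(rate > 12%): {1 - posterior.cdf(0.12):.2f}")   # a scenario with a probability
print(f"90% credible interval: {posterior.interval(0.90)}")
```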

Not all quantitative analyses have to be followed by a qualitative exercise to look for a signal. Validating an existing hypothesis is one of the biggest Big Data weapons developers use, since SaaS has made it relatively easy not only to instrument applications to gather and analyze all kinds of usage data but also to trigger changes that influence users' behavior. Facebook's recent psychology experiment to test whether emotions are contagious has attracted a lot of criticism. Setting aside the ethical and legal issues - the accusation that Facebook manipulated 689,003 users' emotions for science - this quantitative analysis is a validation of an existing phenomenon in a different world. Priming is a well-understood and proven concept in psychology, but we didn't know of a published test proving the same in a large online social network. The objective here was not to chase a specific signal but to validate a hypothesis - a "what" - for which the "why" has been well understood in a different domain.

About the photo: the Laplace transform is one of my favorite mathematical tools, since it turns complex problems (differential equations) into a simpler form (algebraic equations) that is relatively easy to solve. It helps reframe problems in your endeavor to get to the signal.

Sunday, June 01, 2014

Optimizing Data Centers Through Machine Learning

Google has published a paper outlining their approach to using machine learning, a neural network to be specific, to reduce energy consumption in their data centers. Joe Kava, VP of Data Centers at Google, also has a blog post explaining the background and their approach. Google has one of the best data center designs in the industry and takes its PUE (power usage effectiveness) numbers quite seriously. I blogged about Google's approach to optimizing PUE almost five years back! Google has come a long way, and I hope they continue to publish such valuable information in the public domain.



There are a couple of key takeaways.

In his presentation at Data Centers Europe 2014 Joe said:  
As for hardware, the machine learning doesn’t require unusual computing horsepower, according to Kava, who says it runs on a single server and could even work on a high-end desktop.
This is a great example of a small-data Big Data problem. The neural network is a supervised learning approach: you create a model from certain attributes and fine-tune the collective impact of those attributes to achieve a desired outcome. Unlike an expert system, which emphasizes an upfront, logic-driven approach, a neural network continuously learns from the underlying data and is tested against its predicted outcome. The outcome has no dependency on how large your data set is, as long as it is large enough to include relevant data points with a good history. The "Big" part of Big Data misleads people into believing they need a fairly large data set to get started. This optimization debunks that myth.
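As a rough illustration of this kind of supervised model (not Google's actual implementation, which their paper describes), here is a minimal sketch that trains a small neural network to predict PUE from a handful of operational attributes. The attributes and data are synthetic placeholders.

```python
# Minimal sketch: a small neural network regressor predicting PUE from a few
# data-center attributes. The features and data are synthetic placeholders,
# not Google's 19 attributes or their real telemetry.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n = 2000  # modest sample size; no "Big Data" required

# Synthetic operational attributes: server load (kW), outside air temp (C),
# number of chillers running, cooling-water setpoint (C).
X = np.column_stack([
    rng.uniform(2000, 8000, n),
    rng.uniform(-5, 35, n),
    rng.integers(1, 6, n),
    rng.uniform(16, 24, n),
])
# Synthetic PUE with some structure plus noise.
y = 1.05 + 0.00002 * X[:, 0] + 0.004 * X[:, 1] + 0.01 * X[:, 2] + rng.normal(0, 0.02, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0))
model.fit(X_train, y_train)
print("R^2 on held-out data:", round(model.score(X_test, y_test), 3))
```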

The other fascinating part of Google's approach is that they are not only using machine learning to optimize the PUE of current data centers but are also planning to use it to design future data centers more effectively.

As with many other physical systems, there are certain attributes you have operational control over and can change fairly easily, such as cooling systems and server load, but there are quite a few attributes you only control during the design phase, such as the physical layout of the data center and the climate zone. If you decide to build a data center in Oregon, you can't simply move it to Colorado. These neural networks can significantly help make those upfront, irreversible decisions that are not tunable later on.

One of the challenges with neural networks, or for that matter many other supervised learning methods, is that it takes a lot of time and precision to perfect (train) the model. Joe describing it as "nothing more than a series of differential calculus equations" downplays the model. Neural networks are useful when you know what you are looking for - in this case, a lower PUE. In many cases you don't even know what you are looking for.

Google mentions identifying 19 attributes that have some impact on PUE. I wonder how they shortlisted these attributes. In my experience, unsupervised machine learning is a good place to shortlist attributes before moving on to supervised machine learning to fine-tune them. Unsupervised machine learning combined with supervised machine learning can yield even better results, if used correctly.
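Purely as an illustration of that shortlisting idea (not how Google actually did it), one unsupervised option is to rank candidate attributes by how strongly they load on the leading principal components, and hand only the top-ranked ones to the supervised model:

```python
# Illustrative attribute shortlisting with unsupervised learning (PCA):
# rank candidate attributes by their variance-weighted loadings on the leading
# components, then feed only the top-ranked ones to the supervised PUE model.
# Data and attribute names are synthetic placeholders.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
attribute_names = [f"attr_{i}" for i in range(19)]          # stand-ins for 19 candidate attributes
X = rng.normal(size=(1000, 19))
X[:, 3] = 2.5 * X[:, 0] + rng.normal(scale=0.1, size=1000)  # inject some structure

Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=5).fit(Z)

# Score each attribute by its absolute loadings, weighted by explained variance.
scores = np.abs(pca.components_.T) @ pca.explained_variance_ratio_
ranked = sorted(zip(attribute_names, scores), key=lambda t: -t[1])
print("Top candidate attributes:", [name for name, _ in ranked[:6]])
```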

Thursday, November 21, 2013

Rise Of Big Data On Cloud


Growing up as an engineer and a programmer, I was reminded at every step along the way that resources—computing as well as memory—are scarce. Programs were designed around these constraints. Then the cloud revolution happened and we told people not to worry about scarce computing. We saw the rise of MapReduce, Hadoop, and countless other NoSQL technologies. Software was the new hardware. We owe it to all the software development, especially computing frameworks, that allowed developers to leverage the cloud—computational elasticity—without having to understand the complexity underneath it. What has changed in the last two to three years is that a) the underlying file systems and computational frameworks have matured, and b) adoption of Big Data is driving demand for scale-out and responsive I/O in the cloud.

Three years back I wrote a post, The Future Of The BI In Cloud, where I highlighted two challenges of using the cloud as a natural platform for Big Data: the first was creating a large-scale data warehouse, and the second was the lack of scale-out computing for I/O-intensive applications.

A year back Amazon announced RedShift, a data warehouse service in the cloud, and last week they announced high-I/O instances for EC2. We have come a long way, and the more I look at current capabilities and trends, the closer Big Data, at scale, in the cloud, seems to reality.

From a batched data warehouse to interactive analytic applications:

Hadoop was never designed for I/O-intensive applications, but because Hadoop is a compelling computational scale-out platform, developers had a strong desire to use it for their data warehousing needs. This made Hive and HiveQL popular analytic frameworks, but it was a suboptimal solution that worked well for batch loads and wasn't suitable for responsive, interactive analytic applications. Several vendors realized there's no real reason to stick to the original style of MapReduce. They stuck with HDFS but invested significantly in alternatives to Hive that are much faster.

There is a series of such projects/products that use HDFS and MapReduce as a foundation but add special data management layers on top to run interactive queries much faster than plain vanilla Hive. Some examples are Impala from Cloudera and Apache Drill from MapR (both inspired by Dremel), HAWQ from EMC, Stinger from Hortonworks, and many other start-ups. It's not only vendors; early adopters such as Facebook, the original creator of Hive, have built projects such as Presto, an accelerated alternative, which they recently open sourced.

From raw data access frameworks to higher level abstraction tools: 

As vendors continue to build more and more Hive alternatives, I am also observing vendors investing in higher-level abstraction frameworks. Pig was among the first higher-level frameworks that made it easier to express data analysis programs. But now we are witnessing even richer, higher-level frameworks such as Cascading and Cascalog that let developers go beyond SQL queries and write interactive programs in higher-level languages such as Clojure and Java. I'm a big believer in empowering developers with the right tools. Working directly against Hadoop has a significant learning curve, and developers often end up spending time on plumbing and other things that could be abstracted away by a tool. In web development, the popularity of Angular and Bootstrap shows how the right frameworks and tools can make developers far more efficient by not having to deal with raw HTML, CSS, and JavaScript controls.

From solid state drives to in-memory data structures: 

Solid state drives were the first step in upstream innovation to make I/O much faster, but I am observing this trend go further, with vendors investing in building memory-resident data management layers on top of HDFS. Shark and Spark are among the popular ones. Databricks has made big bets on Spark and recently raised $14M. Shark (and hence Spark) is designed to be compatible with Hive but to run queries up to 100x faster by using in-memory data structures, columnar representation, and by not writing intermediate MapReduce results back to disk. This looks a lot like MapReduce Online, a research paper published a few years back. I do see a UC Berkeley connection here.
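As a small illustration of the in-memory idea, using today's PySpark API rather than the Shark-era stack described above, caching a dataset keeps it memory-resident so repeated interactive queries avoid re-reading from disk. The file path and column names are placeholders.

```python
# Minimal PySpark sketch: cache a dataset in memory so repeated interactive
# queries avoid going back to disk. File path and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("interactive-analytics").getOrCreate()

events = spark.read.json("hdfs:///data/events.json")  # hypothetical input path
events.cache()  # keep the dataset memory-resident after the first action

# Repeated queries now hit the in-memory copy instead of HDFS.
events.groupBy("country").count().show()
events.filter(events.status == "error").groupBy("service").count().show()

spark.stop()
```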

Photo courtesy: Trey Ratcliff

Monday, October 21, 2013

Big Data Platform As Technology Continuum

Source: Wikipedia
A Russian chemist, Dimitri Mendeleev, invented the first periodic table of elements. Prior to that, scientists had identified a number of elements, but the scientific world lacked a consistent framework to organize them. Mendeleev built upon the existing work of these scientists and invented the first periodic table based on a set of design principles. What fascinates me most about his design is that he left gaps in the table empty because he predicted that new elements would be discovered. Not only did he design the first periodic table to create a foundation for how elements can be organized, he also anticipated what might happen in the future and included that consideration in his design.

It is unfortunate that a lot of us are trained to chase a perfect answer as opposed to designing something that is less than perfect but useful and inspires future generations to build on it. We look at technology in a small snapshot and ask what it can do for us now. We don't think of technology disruption as a continuum that solves a series of problems. The Internet started that way, and the first set of start-ups failed because they defined the problem too narrowly. The companies that succeeded, such as Google, Amazon, and eBay, saw the Internet as a long-term trend and didn't think of it in a small snapshot. Cloud and Big Data are the same. Every day I see problems being narrowly defined as if this were just a fad that companies want to capitalize on before it disappears.

Build that first table of elements and give others the imagination to extend it. As an entrepreneur, you were not the first and you will not be the last trying to solve this problem.

Tuesday, October 01, 2013

The Dark Side Of Big Data


Latanya Sweeney, a Harvard professor, Googled her own name and found an ad next to her name for a background check hinting that she had been arrested. She dug deeper and concluded that so-called black-identifying names were significantly more likely to be the targets of such ads. She documented this in her paper, Discrimination in Online Ad Delivery. It is up to an advertiser how they pick keywords and other criteria to show their ads. Google, like most other companies for which advertising is the primary source of revenue, would never disclose the details of the algorithms behind their ad offerings. Google denied that AdWords was discriminatory in any way.

Facebook just announced that they are planning to give users more options to provide feedback on which ads are relevant to them and which are not. While on the surface this might sound like a good idea to get rid of irrelevant ads and keep marketers as well as users happy, the approach has far more severe consequences than you might think. In the Google AdWords discrimination scenario the algorithm is supposedly blind and has no knowledge of who is searching for what (assuming you're not logged in and there is no cookie effect), but in Facebook's case, ads are targeted based on you as an individual and on what Facebook might know about you. Algorithms are written by human beings, and knowingly or unknowingly they can certainly introduce subtle or blatant discrimination. As marketers, and the companies that serve ads on their behalf, know more about you as an individual and about your social and professional network, they are a step closer to discriminating against their users, knowingly or unknowingly. There's a fine line between stereotyping and what marketers call "segmentation."

AirBnB crunched their data and concluded that older hosts tend to be more hospitable and younger guests tend to be more generous with their reviews. If this is just for informational purposes, it's interesting. But what if AirBnB uses this information to knowingly or unknowingly discriminate against younger hosts and older guests?

A combination of massively parallel computing, sophisticated algorithms that leverage this parallelism, and the ability of those algorithms to learn and adapt to become more relevant, almost in real time, is going to cause many more such issues to surface. As a customer, you simply don't know whether the products or services you are offered, or not offered, at a certain price are based on discriminatory practices. To complicate this further, in many cases even companies don't know whether the insights they derive from vast amounts of internal as well as external data are discriminatory. This is the dark side of Big Data.

The challenge with Big Data is not Big Data itself but what companies could do with your data, combined with any other data, without your explicit understanding of how the algorithms work. To prevent discriminatory practices, we audit employment practices to ensure equal opportunity and college admissions to ensure a fair admission process, but I don't see how anyone is going to audit these algorithms and data practices.

I have no intention of painting a gloomy picture and blaming technology. Disruptive technology always surfaces socioeconomic issues that either didn't exist before or were not obvious and imminent. Some people get worked up because they don't quite understand how the technology works. I still remember politicians trying to blame GMail for "reading" emails to show ads. I believe that Big Data is yet another such disruption that is going to cause similar issues. We should not shy away from these issues; we should collaboratively work hard to highlight and amplify what these issues might be and address them, as opposed to blaming technology for being evil.

Photo Courtesy: Jonathan Kos-Read 

Thursday, August 01, 2013

Chasing That Killer Application Of Big Data

I often get asked, "what is the killer application of Big Data?" Unfortunately, the answer is not that simple.

In the early days of enterprise software, it was automation that fueled the growth of enterprise applications. The vendors that eventually managed to stay in business and get bigger were, and are, the ones that expanded their footprint to automate more business processes in more industries. The idea behind the "killerness" of some of these applications was merely the existence, and somewhat the maturity, of business processes in alternate forms. Organizations did have financials and supply chains, but those processes were paper-based or partially realized in a set of tools that didn't scale. The objective was to replace these homegrown, non-scalable processes and tools with standardized packaged software that would automate the processes after being customized to the needs of the organization. Some vendors did work hard to understand the problems they had set out to solve, but most didn't; they poured concrete into existing processes.

The traditional Business Intelligence (BI) market grew the same way: customers were looking for a specific set of reporting problems to be solved in order to run their business. The enterprise applications that automated the business processes were not powerful enough to deliver the kind of reporting organizations expected in order to gain insights into their operations and make decisions. These applications were designed to automate processes, not to provide insights. BI vendors created packaged tools and technology solutions to address this market. Once again, the vendors didn't have to think about what application problems the organizations were trying to solve.

Now, with the rise of Big Data, the same vendors, and some new ones, are asking the same question: what's the killer application? If Big Data turns out to be as big a wave as the Internet or the cloud, we are certainly at a very early stage. This wave is very different from the previous ones in a few ways; it is technology-led innovation that is opening up new ways of running a business. We are at an inflection point of cheap commodity hardware and MPP software designed from the ground up to treat data as a first-class citizen. This is not about automation or filling a known gap. I live this life working with IT and business leaders of small and large organizations worldwide as they struggle to figure out how best to leverage Big Data. These organizations know there's something in this trend for them, but they can't quite put a finger on it.

As a vendor, the best way to look at your strategy is to help customers with their Big Data efforts without chasing a killer application. The killer applications will emerge as you pay attention and observe patterns across your customers. Make Big Data tangible for your customers and design tools that take them away from the complexity of the technology layer. Organizations continue to have massive challenges with semantics as well as the location and format of their data sources. This is not an exciting domain for many vendors, but help these organizations bring their data together. And, most importantly, try hard to become a trusted advisor and a go-to vendor for Big Data regardless of your portfolio of products and solutions. Waiting for a killer application before getting started, or marketing your product as THE killer application of Big Data, are perhaps not the smartest things to do right now.

Big Data is a nascent category: an explosive, promising, but nascent category. Organizations are still trying to get a handle on what it means to them. The maturity of business processes and of well-defined unsolved problems in this domain is not yet clear. While this category plays out, don't chase killer applications or place your bets on one or two of them. Just get started and help your customers. I promise you will stumble upon that killer application during your journey.

About the picture: I took this picture inside a historic fort in Jaisalmer, India, which has a rich history. History has taught me a lot about all things enterprise software, and plenty of things beyond it.

Thursday, June 13, 2013

Hacking Into The Indian Education System Reveals Score Tampering


Debarghya Das has a fascinating story on how he managed to bypass a silly web security layer to get access to the results of 150,000 ICSE (10th grade) and 65,000 ISC (12th grade) students in India. While the lack of security and the total failure to safeguard sensitive information is an interesting topic in itself, what is more fascinating about this episode is the analysis of the results, which unearthed score tampering. The school boards changed students' scores to give them "grace" points and bump them up to the passing level. The boards also seem to have tampered with some other scores, but the motive for that tampering remains unclear (at least to me).

I would encourage you to read the entire analysis and the comments, but a tl;dr version is:

32, 33 and 34 were visibly absent. This chain of 3 consecutive numbers is the longest chain of absent numbers. Coincidentally, 35 happens to be the pass mark.
Here's a complete list of unattained marks -
36, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 56, 57, 59, 61, 63, 65, 67, 68, 70, 71, 73, 75, 77, 79, 81, 82, 84, 85, 87, 89, 91, 93. Yes, that's 33 numbers!


The comments are even more fascinating, with people pointing out flaws in his approach and challenging the CLT (central limit theorem), followed by a rebuttal. If there had been no tampering with the scores, the observed distribution would defy the CLT, something so improbable that I can't even compute it. In other words, the chances of this guy being wrong about his inferences and conclusions are almost zero, if not zero.

He used fairly simple statistical techniques and MapReduce-style computing to analyze a fairly decent-sized data set to infer and prove a specific hypothesis (most people, including me, believed that grace points existed, but we had no evidence to prove it). He even created a public GitHub repository of his work, which he later made private.
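The core of that analysis is little more than a frequency count over the marks. A minimal sketch of the idea, on made-up marks rather than the actual scraped results, might look like this:

```python
# Minimal sketch of the frequency analysis: count how often each mark occurs
# and flag marks that never appear. The marks list here is synthetic; the real
# analysis ran over the scraped ICSE/ISC results.
from collections import Counter
import random

random.seed(0)
marks = [random.randint(0, 100) for _ in range(150_000)]          # placeholder data
marks = [m if m not in (32, 33, 34) else 35 for m in marks]        # simulate "grace" bumping

counts = Counter(marks)
absent = sorted(m for m in range(0, 101) if counts[m] == 0)
print("Marks that never occur:", absent)
print("Students at the pass mark (35):", counts[35])
```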

I am not a lawyer and I don't know whether what he did is legal, but I do admire his courage in not posting this anonymously, as many people in the comments suggested he should have. I hope he doesn't get into any trouble.

Having spent a little more time trying to comprehend this situation, I have two thoughts:

The first, shocking but unfortunately not surprising, observation is how careless the school boards were in making such sensitive information available on their website without basic security. It is not as if it is hard to find web developers in India who understand basic or even advanced security; it's simply laziness and carelessness on the school boards' part not to bother with it. I hope that all government as well as non-government institutions will learn from this breach and tighten up their access and data security.

The second revelation: it's not a terribly bad idea to publicly distribute this and similar datasets, after removing PII (personally identifiable information), to let people legitimately go crazy with them. If this dataset were publicly available, people would analyze it, find patterns, and challenge fundamental education practices. Open source has been living proof that software becomes more secure when it is opened up to the public to hack and find flaws so they can be fixed. Knowing the Indian bureaucracy, I don't see them going in this direction. It turns out I have seen this movie before: I have been an advocate of making electronic voting machines available to researchers to examine the validity of a fair election process. Instead of allowing security researchers access to an electronic voting machine, Indian officials accused a researcher of stealing a voting machine and arrested him. However, if India is serious about competing globally in education, this might very well be the first step toward transparency.

Friday, May 31, 2013

Unsupervised Machine Learning, Most Promising Ingredient Of Big Data


Orange (France Telecom), one of the largest mobile operators in the world, issued a challenge, "Data for Development," by releasing a dataset about their subscribers in Ivory Coast. The dataset contained 2.5 billion records: calls and text messages exchanged between 5 million anonymous users. Various researchers got access to this dataset and submitted proposals on how the data could be used for development purposes in Ivory Coast. It would be an understatement to say these proposals and projects were mind-blowing. I have never seen so many different ways of looking at the same data to accomplish so many different things. Here's a book [very large pdf] that contains all the proposals. My personal favorite is AllAboard, where IBM researchers used the cell-phone data to redraw optimal bus routes. The researchers used several algorithms, including supervised and unsupervised machine learning, to analyze the dataset, resulting in a variety of scenarios.

In my conversations and work with CIOs and LOB executives, the breakthrough scenarios always come from a problem they didn't even know existed or could be solved. For example, the point-of-sale data you use for out-of-stock analysis could reveal new hyper-segments you didn't know existed, using clustering algorithms such as k-means, and could also help you build a recommendation system using collaborative filtering. The data you use to manage your fleet could help you identify outliers or unproductive routes using SOMs (self-organizing maps) with dimensionality reduction. Smart meter data you use for billing could help you identify outliers and prevent theft using a variety of ART (Adaptive Resonance Theory) algorithms. I see endless scenarios based on a variety of unsupervised machine learning algorithms, similar to using cell-phone data to redraw optimal bus routes.
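As a rough illustration of the point-of-sale example above, here is a minimal k-means sketch that surfaces shopper segments nobody defined upfront; the features and data are invented placeholders.

```python
# Minimal unsupervised-learning sketch: cluster point-of-sale-style records with
# k-means to surface shopper segments that were never defined upfront.
# Features and data are synthetic placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Columns: basket size ($), visits per month, share of discounted items.
baskets = np.vstack([
    rng.normal([20, 12, 0.6], [5, 3, 0.1], size=(500, 3)),   # frequent bargain hunters
    rng.normal([90, 2, 0.1], [20, 1, 0.05], size=(500, 3)),  # occasional big spenders
])

X = StandardScaler().fit_transform(baskets)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

for label in range(2):
    segment = baskets[kmeans.labels_ == label]
    print(f"Segment {label}: {len(segment)} shoppers, "
          f"avg basket ${segment[:, 0].mean():.0f}, "
          f"{segment[:, 1].mean():.1f} visits/month")
```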

Supervised and semi-supervised machine learning algorithms are equally useful, and I see them complement unsupervised machine learning in many cases. For example, in retail you could start with k-means to unearth new shopping behavior, then use Bayesian regression followed by exponential smoothing to predict future behavior under targeted campaigns and further monetize this newly discovered behavior. However, unsupervised machine learning algorithms are by far the best I have seen at unearthing breakthrough scenarios, due to their very nature of not requiring you to know a lot of detail upfront about the data (labels) to be analyzed. In most cases you don't even know what questions you could ask.

Traditionally, BI has been built on the pillars of highly structured data with well-understood semantics. This legacy has made most enterprise people operate with a narrow mindset: I know the exact problem I want to solve, I know the exact question I want to ask, and Big Data is going to make all of this possible and faster. This is the biggest challenge I see in embracing and realizing the full potential of Big Data. With Big Data there's an opportunity to ask questions you never thought or imagined you could ask. Unsupervised machine learning is the most promising ingredient of Big Data.

Wednesday, May 01, 2013

Justifying Big Data Investment


Traditionally, companies invest in software that has been proven to meet their needs and has a clear ROI. This model falls apart when disruptive technology such as Big Data comes around. Most CIOs have started to hear about Big Data, and depending on where they sit on the spectrum from conservative to progressive, they have either started to think about investing or have already started investing. The challenge these CIOs face is not so much whether they should invest in Big Data but what they should do with it. Large companies have complex landscapes that serve multiple LOBs, and all these LOBs have their own ideas about what they want to get out of Big Data. Most LOB executives are even more excited about the potential of Big Data but are less informed about the upstream technical impact and the change of mindset IT will have to go through to embrace it. But these LOBs do have a stronger lever: money to spend, if they see that technology can help them accomplish something they could not accomplish before.

As more and more IT executives get excited about the potential of Big Data, they underestimate the challenge of getting access to meaningful data in a single repository. Data movement has been one of the most painful problems of traditional BI systems, and it stays that way for Big Data systems. The vast majority of companies have most of their data locked into on-premise systems. It is not only inconvenient but actually impractical to move this data to the cloud for analysis if the Big Data platform happens to be a cloud platform. These companies also have a hybrid landscape where a subset of data resides in the cloud, inside some of the cloud solutions they use. It's even harder to get data out of these systems and move it to either a cloud-based or an on-premise Big Data platform. Most SaaS solutions are designed to support ad hoc point-to-point or hub-and-spoke RESTful integration, but they are not designed to efficiently dump data for external consumption.

Integrating semantics is yet another challenge. As organizations start to combine several data sources, the quality as well as the semantics of the data remain big challenges. Managing semantics for a single source isn't easy in itself; when you add multiple similar or dissimilar sources to the mix, the challenge is amplified. It has been the job of the application layer to make sense of the underlying data, but when that layer goes away, the underlying semantics become even more challenging.

If you're a vendor, you should work hard thinking about the business value of your Big Data technology - not what it is to you but what it could do for your customers. The spending pie for customers hasn't changed, and coming up with money to spend on (yet another) technology is quite a challenge. My humble opinion is that vendors have to go beyond technology talk, start understanding the impact of Big Data and the magnitude of these challenges, and then educate customers on the potential and, especially, help them with a business case. I disagree with people who think Big Data is a technology play/sale. It is not.

Photo Courtesy: Kurtis Garbutt

Sunday, March 31, 2013

Strive For Precision Not Accuracy


Jake Porway, who was a data scientist at the New York Times R&D Lab, has a great perspective on why multidisciplinary teams are important to avoid bias and bring different perspectives into data analysis. He discusses a story where data gathered by Uber in Oakland suggested that prostitution arrests increased in Oakland on Wednesdays, but increased arrests didn't necessarily imply increased crime. He also outlines the data analysis done by the Grameen Foundation, where the analysis of Ugandan farm workers could result in the farmers being labeled "good" or "bad" depending on which perspective you consider. This story validates one more attribute of my point of view regarding data scientists: data scientists should be design thinkers. Working in a multidisciplinary team where people champion their own perspectives is one of the core tenets of design thinking.

One of the viewpoints of Jake that I don't agree with:

"Any data scientist worth their salary will tell you that you should start with a question, NOT the data."

In many cases you don't even know what question to ask. Sometimes an anomaly or a pattern in the data tells a story, and that story informs us what questions we might ask. I do see that many data scientists start by knowing the question ahead of time and then pull in the data they need, but I advocate the other side, where you bring in the sources and let the data tell you a story. Referring to design, Henry Ford once said, "Every object tells a story if you know how to read it." Listen to the data—the story—without any preconceived bias and see where it leads you.

You can only ask what you know to ask, and that limits your ability to unearth groundbreaking insights. Chasing a perfect answer to a perfect question is a trap many data scientists fall into. In reality, what the business wants is a good-enough answer to a question, or an insight that is actionable. In most cases, getting to an answer that is 95% accurate requires relatively little effort, but getting the remaining 5% requires exponentially disproportionate time with disproportionately low return.

Strive for precision, not accuracy. The first answer could well be of low precision; that's perfectly acceptable as long as you know what the precision is and can continuously refine it until it is good enough. Being able to rapidly iterate and reframe the question is far more important than knowing upfront what question to ask; data analysis is a journey, not a step in a process.

Photo credit: Mario Klingemann

Thursday, February 28, 2013

A Data Scientist's View On Skills, Tools, And Attitude



I recently came across this interview (thanks, Dharini, for the link!) with Nick Chamandy, a statistician, a.k.a. a data scientist, at Google. I would encourage you to read it; it has some great points. I found the following snippets interesting:

Recruiting data scientists:
When posting job opportunities, we are cognizant that people from different academic fields tend to use different language, and we don’t want to miss out on a great candidate because he or she comes from a non-statistics background and doesn’t search for the right keyword. On my team alone, we have had successful “statisticians” with degrees in statistics, electrical engineering, econometrics, mathematics, computer science, and even physics. All are passionate about data and about tackling challenging inference problems.
I share the same view. The best data scientists I have met are not statisticians by academic training. They are domain experts and design thinkers, and they all share one common trait: they love data! When asked how to build a team of data scientists, I highly recommend that people look beyond conventional wisdom. You will be in good shape as long as you don't end up in a situation like this :-)

Skills:
The engineers at Google have also developed a truly impressive package for massive parallelization of R computations on hundreds or thousands of machines. I typically use shell or python scripts for chaining together data aggregation and analysis steps into “pipelines.”
Most companies won't have the kind of highly skilled development army that Google has, but then not all companies have Google-scale problems to deal with. I do suggest two things: a) build a strong community of data scientists using social tools so they can collaborate on the challenges and tools they use, and b) make sure the chief data scientist (if you have one) has a very high level of management buy-in to make things happen; otherwise he or she will spend all the time in "alignment" meetings instead of doing the real work.
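For illustration only, here is a toy version of such a pipeline in plain Python. It is not Google's tooling or anything from the interview, and the file and field names are made up.

```python
# A minimal sketch of chaining data aggregation and analysis steps into a
# "pipeline" as plain functions. Illustrative only; the file name and the
# "region"/"revenue" fields are hypothetical.
import csv
from collections import defaultdict

def load(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def aggregate(rows):
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += float(row["revenue"])
    return totals

def analyze(totals):
    grand_total = sum(totals.values())
    return {region: value / grand_total for region, value in totals.items()}

def pipeline(seed, *steps):
    result = seed
    for step in steps:
        result = step(result)
    return result

shares = pipeline("sales.csv", load, aggregate, analyze)
print(shares)
```

Each step stays small and testable, and swapping an aggregation or adding an analysis step doesn't disturb the rest of the chain.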

Data preparation:
There is a strong belief that without becoming intimate with the raw data structure, and the many considerations involved in filtering, cleaning, and aggregating the data, the statistician can never truly hope to have a complete understanding of the data.
I disagree. I strongly believe the tools need to evolve to do some of these things, and data scientists should not be spending their time compensating for the inefficiencies of the tools. Becoming intimate with the data—having empathy for the problem—is certainly a necessity, but spending time pulling, fixing, and aggregating data is not the best use of it.
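As a sketch of where the tools should take this, the filtering, cleaning, and aggregating can be expressed declaratively in a few lines (the file and column names below are hypothetical), leaving the data scientist's time for the problem rather than the plumbing.

```python
# A hedged sketch of declarative filtering, cleaning, and aggregating with
# pandas. The file and the "region"/"resolution_hours" columns are made up.
import pandas as pd

raw = pd.read_csv("service_calls.csv")

clean = (
    raw.dropna(subset=["region"])                 # drop records missing a region
       .assign(resolution_hours=lambda d:
               pd.to_numeric(d["resolution_hours"], errors="coerce"))
       .query("resolution_hours >= 0")            # discard obviously bad values
)

summary = (
    clean.groupby("region")["resolution_hours"]
         .agg(["count", "median"])                # aggregate per region
         .sort_values("median")
)
print(summary)
```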

Attitude:
To me, it is less about what skills one must brush up on, and much more about a willingness to adaptively learn new skills and adjust one’s attitude to be in tune with the statistical nuances and tradeoffs relevant to this New Frontier of statistics.
As I would say: bring tools and knowledge, but leave bias and expectations aside. The best data scientists are the ones who are passionate about data, can quickly learn a new domain, and are willing to make and fail and fail and make.

Image courtesy: xkcd

Friday, February 15, 2013

Commoditizing Data Science



My ongoing conversations with several people continue to reaffirm my belief that Data Science is still perceived to be a sacred discipline, and data scientists are perceived to be highly skilled statisticians who walk around wearing white lab coats. The best data scientists are not the ones who know the most about data; they are the ones who are flexible enough to take on any domain with the curiosity to unearth insights. Apparently this is not well understood. There are two parts to data science: domain and algorithms, or, in other words, knowledge about the problem and knowledge about how to solve it.

One of the main aspects of Big Data that I get excited about is an opportunity to commoditize this data science—the how—by making it mainstream.

The rise of interest in Big Data platforms—disruptive technology and a desire to do something interesting with data—opens up opportunities to make some of these known algorithms easy to execute without a performance penalty. Run K-means if you want, and if you don't like the result, run Bayesian linear regression or something else. Access to algorithms should not be limited to the "scientists"; anyone who wants to look at their data to know the unknown should be able to execute those algorithms without sophisticated training, experience, or skills. You don't have to be a statistician to find the standard deviation of a data set. Do you really have to be a statistician to run a classification algorithm?
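Here is a minimal sketch of that idea, assuming scikit-learn and synthetic data: the common fit/predict interface makes swapping one algorithm for another a one-line change, not a statistics degree.

```python
# A minimal sketch of "just run the algorithm": cluster synthetic data with
# K-means, and swap in a different algorithm by changing a single line.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

model = KMeans(n_clusters=3, n_init=10, random_state=0)    # don't like the result?
# model = GaussianMixture(n_components=3, random_state=0)  # ...swap it out
labels = model.fit_predict(X)
print(labels[:10])
```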

Data science should not be a sacred discipline, and data scientists shouldn't be voodoo practitioners.

There should not be any performance penalty, nor any upfront hesitation about deciding what to do with the data. People should be able to iterate as fast as possible to get to the result they want without worrying about how to set up a "data experiment." Data scientists should be design thinkers.

So, what about traditional data scientists? What will they do?

I expect people who are "scientists" in the traditional sense to elevate themselves in their Maslow's hierarchy by focusing on more advanced aspects of data science and machine learning, such as designing tools that recommend algorithms that might fit the data (we have already witnessed this trend in visualization). There is also significant potential to invent new algorithms based on existing machine learning algorithms that have been around for a while. Which algorithms to execute, and when, could still be a science to some extent, but that's what data scientists should focus on, not on sampling, preparing, and waiting for hours to analyze their data sets. We finally have Big Data for that.
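As a toy sketch of such a recommendation tool, assuming scikit-learn and synthetic data (real tooling would be far richer), the shape of it is simply: try a few candidates and surface the best one.

```python
# A hedged sketch of "recommend an algorithm that fits the data": score a
# handful of candidate models with cross-validation and report the winner.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1_000),
    "random_forest": RandomForestClassifier(random_state=0),
    "naive_bayes": GaussianNB(),
}

scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(f"recommended: {best} (mean CV accuracy {scores[best]:.3f})")
```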

Image courtesy: scikit-learn

Wednesday, January 16, 2013

A Journey From SQL to NoSQL to NewSQL


Two years back I wrote that the primary challenge with NoSQL is that it's not SQL. SQL has played a huge role in making relational databases popular for the last forty years or so. Whenever developers wanted to design any application, they put an RDBMS underneath and used SQL from all possible layers. Over time, the RDBMS grew in functions and features such as binary storage, faster access, clusters, and sophisticated access control, and applications reaped these benefits. The traditional RDBMS became a poor fit for cloud-scale applications that fundamentally required scale at a whole different level. Traditional RDBMSs could not support this scale, and even when they could, it became prohibitively expensive for developers to use them. Traditional RDBMSs also became too restrictive due to their strict upfront schema requirements, which are not suitable for modern large-scale consumer web and mobile applications. For these two primary reasons, and many others, we saw the rise of NoSQL. The cloud movement further fueled this growth, and we started to see a variety of NoSQL offerings.

Each NoSQL store is unique in how a programmer accesses it. NoSQL did solve the scalability and flexibility problems of traditional databases, but it introduced a set of new problems, the primary ones being a lack of ubiquitous access and limited consistency options for schema-less data stores, especially for OLTP workloads.

This has now led to the NewSQL movement (a term coined by Matt Aslett in 2011), whose working definition is: "NewSQL is a class of modern relational database management systems that seek to provide the same scalable performance of NoSQL systems for OLTP workloads while still maintaining the ACID guarantees of a traditional single-node database system." NewSQL's focus appears to be on gaining performance and scalability for OLTP workloads by supporting SQL as well as custom programming models and by eliminating cumbersome, error-prone management tasks such as manual sharding, all without breaking the bank. It's a good first step in the direction of a scalable distributed database that supports SQL. It doesn't say anything about mixed OLTP and OLAP workloads, which is one of the biggest challenges for organizations that want to embrace Big Data.

From SQL to NoSQL to NewSQL, one thing remains common: SQL.

Let's not underestimate the power of a simple non-procedural language such as SQL. I believe programmers should focus on the what (non-procedural, such as SQL) and not the how. Exposing the "how" invariably ends up making the system harder to learn and harder to use. Hadoop is a great example of this phenomenon. Even though Hadoop has seen widespread adoption, it's still limited to silos in organizations. You won't find a large number of applications that are written exclusively for Hadoop. Developers first have to learn how to structure and organize data in a way that makes sense for Hadoop, and then write extensive procedural logic to operate on that dataset. Hive is an effort to simplify many of these steps, but it still hasn't gained the desired popularity. The lesson here for the NewSQL vendors is: don't expose the internals to application developers. Let a few developers who are closer to the database deal with storing and configuring the data, but provide easy, ubiquitous access to application developers. Enterprise software is all about SQL. Embracing, extending, and augmenting SQL is a smart thing to do. I expect all the vendors to converge somewhere. This is how the RDBMS and SQL grew: the initial RDBMSs were far from perfect, but SQL always worked, and the RDBMSs eventually got better.
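To see the difference between the "what" and the "how," here is a small sketch using SQLite with made-up table and column names: the same per-region total, first declared in SQL and then spelled out procedurally by the application.

```python
# A minimal sketch of "what vs. how": a declarative SQL aggregate versus the
# equivalent procedural loop. The "orders" table and its columns are made up.
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("west", 10.0), ("west", 5.0), ("east", 7.5)])

# The "what": declare the result you want and let the engine decide how.
declarative = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region").fetchall()

# The "how": the same logic written out step by step by the application.
procedural = defaultdict(float)
for region, amount in conn.execute("SELECT region, amount FROM orders"):
    procedural[region] += amount

print(declarative, dict(procedural))
```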

Distributed databases are just one part of a bigger puzzle. Enterprise software is more about mixing OLAP and OLTP workloads, and that is the biggest challenge. SQL skills and tools are highly prevalent in this ecosystem, and, more importantly, people have a SQL mindset that is much harder to change. The challenge for vendors is to keep this abstraction intact and extend it without exposing the underlying architectural decisions to end users.

The challenge that I threw out a couple of years back was:

"Design a data store that has ubiquitous interface for the application developers and is independent of consistency models, upfront data modeling (schema), and access algorithms. As a developer you start storing, accessing, and manipulating the information treating everything underneath as a service. As a data store provider you would gather upstream application and content metadata to configure, optimize, and localize your data store to provide ubiquitous experience to the developers. As an ecosystem partner you would plug-in your hot-swappable modules into the data stores that are designed to meet the specific data access and optimization needs of the applications."

We are not there yet, but I do see signs of convergence. As a Big Data enthusiast I love this energy. Curt Monash has started his year blogging about NewSQL. I have blogged about a couple of NewSQL vendors, NimbusDB (NuoDB) and GenieDB, in the past, and I have also discussed the challenges with OLAP workloads in the cloud due to their I/O-intensive nature. I am hoping that NewSQL will be inclusive of OLAP and keep SQL as its first priority. The industry is finally on to something, and some of these start-ups are set to disrupt in a big way.

Photo Courtesy: Liz

Tuesday, December 18, 2012

Objectively Inconsistent




During his recent visit to the office of 37signals, Jeff Bezos said, "to be consistently objective, one has to be objectively inconsistent." I find this perspective very refreshing, and it is applicable to all things and all disciplines in life beyond just product design. As a product designer you need a series of points of view (POVs) that may look inconsistent when seen together, but each POV at any given time will be consistently objective. This is what design thinking, and especially prototyping, is all about. It shifts a subjective conversation between people to an objective conversation about a design artifact.

As I have blogged before, I see data scientists as design thinkers. Most data scientists that I know of have the knowledge curse. I would like them to be consistently objective by going through the journey of analyzing data without any pre-conceived bias. The knowledge curse makes people commit more mistakes. It also makes them defend their POV instead of looking for new information and having the courage to challenge and change it. I am a big fan of the work of Daniel Kahneman, and I would argue that prototyping helps deal with what Kahneman describes as "cognitive sophistication."
The problem with this introspective approach is that the driving forces behind biases—the root causes of our irrationality—are largely unconscious, which means they remain invisible to self-analysis and impermeable to intelligence.
This very cognitive sophistication works against people who cannot self-analyze and be critical of their own POV. Prototyping brings in objectivity and external validation to eliminate this unconscious-driven irrationality. It's fascinating what happens when you put prototypes in the hands of users: they interact with them in unanticipated ways. These discoveries are not feasible if you hold on to a single POV and defend it.

Let it go. Let the prototype speak your design—your product POV—and not your unconscious.

Photo courtesy: New Yorker

Tuesday, October 16, 2012

Analytics-first Enterprise Applications


This is the story of Tim Zimmer, who has been working as a technician for one of the large appliance store chains. His job is to attend to service calls for washers and dryers. He has seen a lot in his life; a lot has changed, but a few things have stayed the same.

The 80s saw the rise of homegrown IT systems, and the 90s was the decade of standardized backend automation, when a few large vendors as well as quite a few small vendors built and sold solutions to automate a whole bunch of backend processes. Tim experienced this firsthand. He started getting printed invoices that he could hand out to his customers. He also heard his buddies in finance talking about a week-long training class to learn "computers" and some tools to make journal entries. Tim's life didn't change much. He would still get a list of customers handed to him in the morning. He would go visit them. He would manually turn in a part-request form for the parts he didn't carry in his truck, and life went on. Without knowing what it might look like, Tim always knew there must be a better way to work. Automation did help companies run their businesses faster and increase their revenue and margins, but the lives of their employees, such as Tim, didn't change much.

The mid-to-late 90s saw the rise of CRM and self-service HCM, where vendors started referring to "resources" as "capital" without really changing the fundamental design of their products. Tim heard about some sales guys entering information into such systems after they had talked to their customers. They didn't quite like the system, but their supervisors and their supervisors' supervisors had asked them to do so. Tim thought the company must somehow benefit from this, but he didn't see his buddies' lives get any better. He did receive a rugged laptop to enter information about his tickets and resolutions. The tool still required him to enter a lot of data, screen by screen. He didn't really like the tool, and the tool didn't make him any better or smarter, but he had no choice but to use it.

Tim heard that management gets weekly reports of all the service calls he makes. He was told that the parts department uses this information to create a "part bucket" for each region. He thought it didn't make any sense: by the time management receives the part information, analyzes it, and gives me parts, I'm already on a few calls where I'm running out of the parts I need. He also received an email from the "Center of Excellence" (he couldn't tell what it was, but guessed, "must be those IT guys") asking whether he would like to receive some reports. He inquired. The lead time for what he thought was a simple report, once he submitted a request, was 8-10 weeks, and the "project" would require three levels of approval. He saw no value in it and decided not to pursue it. While watching a football game over beer, his buddy in IT told him that "management" had bought very expensive software to run these reports and was hiring a lot of people who would understand how to use it.

One day, he received a tablet. He thought this must be yet another devious idea from his management to make him do more work that doesn't really help him or his customers. A fancy toy, he thought. For the first time in his life, the company positively surprised him. The tablet came with an app that did what he thought the tool should have done all along. As soon as he launched the app, it showed him a graphical view of his service calls and the parts required for those calls, based on historic analysis of those appliances. It showed him which trucks have what parts and which of his team members are better off visiting which customers, based on their skill sets and their demonstrated ability to have solved those problems in the past. Tim makes a couple of clicks to analyze that data, drills down into line-item detail in real time, and accepts recommendations with one click. He assigns the service calls to his team members and drives his truck to the customer he assigned to himself. As soon as he is done, he pulls out his tablet. He clicks a button to acknowledge the completion of the service call. He is presented with a new analysis, updated in real time, of the available parts in his truck as well as in his teammates' trucks. He clicks around, makes some decisions, cranks up the radio in his truck, and he is off to help the next customer. No more filling out long, meaningless screens. His view of his management has changed for good for the very first time.

As the world moves towards building mobile-first or mobile-only applications, I am proposing that we build analytics-first enterprise applications that are mobile-only. Finally, we have access to sophisticated big data products, frameworks, and solutions that can help analyze large volumes of data in real time. Large-scale hardware — commodity, specialized, or virtualized — is accessible to developers to do some amazing things. We are at an inflection point. There is no need to discriminate between transactional and analytic workloads. Navigating from aggregated results to line-item details should be just one click, instead of punching out into a separate system. Many processes, if re-imagined without any pre-conceived bias, would start with an analysis at the very first click and then guide the user to more fine-grained data-entry or decision-making screens. If mobile-first is the mindset for getting right the 20% of your application's scenarios that are used 80% of the time, then analytics-first is a design that should strive to move the 20% of decision-making workflows used 80% of the time away from the maze of data entry and the beautiful but completely isolated, outdated, and useless reports they currently end in.
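As a rough sketch of the "one click from aggregate to line item" idea (the table and column names below are made up), the summary and its drill-down can run against the same store instead of punching out into a separate reporting system.

```python
# A hedged sketch of analytics-first drill-down: the aggregate view the user
# lands on and the line-item detail one tap away, served from the same store.
# The "service_calls" table and its columns are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE service_calls
                (call_id INTEGER, technician TEXT, part TEXT, hours REAL)""")
conn.executemany("INSERT INTO service_calls VALUES (?, ?, ?, ?)", [
    (1, "tim", "drum belt", 1.5),
    (2, "tim", "heating element", 2.0),
    (3, "maria", "drum belt", 1.0),
])

# The analytics-first view: aggregates first, not a blank data-entry form.
summary = conn.execute(
    "SELECT part, COUNT(*), SUM(hours) FROM service_calls GROUP BY part").fetchall()

# One "click" later: line-item detail for the row the user tapped.
detail = conn.execute(
    "SELECT call_id, technician, hours FROM service_calls WHERE part = ?",
    ("drum belt",)).fetchall()

print(summary)
print(detail)
```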

Let's rethink enterprise applications. Today's analytics is the end result of years of neglecting to understand the human need to analyze and then decide, as opposed to decide and then analyze. Analytics should not be a category of its own, disconnected from the workflows and processes that applications have automated for years to make businesses better. Analytics should be an integral part of an application: not embedded, not contextual, but a lead-in.