Big data - don't judge a book by its cover

  • 28 March 2022

This content, written by Frank Bien, was initially posted in Looker Blog on Oct 31, 2013. The content is subject to limited support.

I can’t remember a term so over-hyped in my time in tech as “big data.” But the shift in technologies surrounding data and analytics is very real, and it provides true value to organizations that can harness seas of bits to reveal important insights and new knowledge. Here’s the way I think about the evolution towards large-scale data analysis.

The BI card catalog. The analogy I like to use is a library, the kind where you check out books. Traditional BI was like the card catalog. As new books arrived, data about the books was summarized, aggregated, and put in a big BI-like filing system. Because databases were traditionally expensive and overtapped, it made tons of sense to pull summary data, put it in a BI “card catalog,” and operate on that data. How many books do we have? How many are mysteries? What is the average page count? That kind of stuff. We make dashboards and reports about the status of books, how many are coming and going, and which get checked out the most. All valuable stuff.

The shift to detail: words on pages. But soon, people wanted to actually get at the data itself: the pages and the words in ALL the books. It was great to operate on the summaries, but the next level of value was found by analyzing the detail. Were books with more adjectives more or less apt to be checked out? Could we analyze word patterns and understand better what made a really desirable book? Could we cohort books by analyzing the detail more effectively than the human-generated card catalog was doing? This first tier of big data was well served by a new era in analytic databases: scale-out, MPP, column-store, and in-memory designs made this kind of analysis inexpensive and fast. The funny thing is, there weren’t many tools built to go on top of these systems and let analysts and end users really explore these immense and complex datasets. So analysts went back to hand-coding SQL or writing MapReduce jobs, and to building proprietary discovery tools that could harness the power of these new data systems.

The big stuff generated by machines. So in the library analogy, what could be bigger than analyzing all the words in all the books? That’s where big data moved next. Rather than limit ourselves to the books, what if we extended the analogy into the reader’s realm? What if we started to capture not just the actual books, but all the events surrounding how people used the books? The sphere gets big pretty fast, and it results in another giant increase in data size. For every book, how many times was page 54 read? Did people skip to the end or start at the beginning? On average, how long did people read the book in each sitting? All of this data dwarfs the size of the actual book; it could easily be 100 times bigger. Multiply that by all the books and readers in the library and we have a really large bucket of bits. This is the world of machine-generated big data: event logs and click streams.

Where's the value? The most important point is that there’s value at every level. And in my humble opinion, we tech professionals have recently focused on machine-generated “big data” at the expense of much better access to, and discovery of, data at every level. We have separate big data systems, different transactional systems, and different tools to peer into each. The real value moving forward will be providing analysis across all of this data, big or small.

When we join event data with transaction data, we can get answers to some really interesting questions. We could, for example, cohort readers by age (user database), correlating groups who check out books that frequently mention “attention deficit disorder” (detail data analysis) with whether they always skip to the back 5% of the book to see how the story ends (event data analysis).
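To make that join concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration: the tiny in-memory tables, the function names, the decade-wide age cohorts, the "mentions the phrase at least twice" threshold, and the rule that a reader "skipped to the end" if they jumped from the first half of a book straight into its last 5%. A real system would pull these three sources from separate stores and would likely do the join in SQL.

```python
from collections import defaultdict

# Hypothetical user database: reader_id -> age
readers = {1: 17, 2: 34, 3: 36, 4: 62}

# Hypothetical detail data: book_id -> page count and full text
books = {
    "b1": {
        "pages": 200,
        "text": "A story about attention deficit disorder. "
                "Attention deficit disorder shapes the plot.",
    },
    "b2": {"pages": 100, "text": "A quiet mystery set in a library."},
}

# Hypothetical transaction data: (reader_id, book_id) checkouts
checkouts = [(1, "b1"), (2, "b1"), (3, "b2"), (4, "b1")]

# Hypothetical event data: (reader_id, book_id, page), in reading order
events = [
    (1, "b1", 1), (1, "b1", 195),
    (2, "b1", 1), (2, "b1", 2),
    (3, "b2", 1), (3, "b2", 99),
    (4, "b1", 1), (4, "b1", 198),
]


def age_cohort(age):
    """Bucket an age into a decade-wide cohort label such as '30-39'."""
    decade = (age // 10) * 10
    return f"{decade}-{decade + 9}"


def mentions_often(text, phrase, min_count=2):
    """Detail analysis: does the text contain the phrase at least min_count times?"""
    return text.lower().count(phrase.lower()) >= min_count


def skipped_to_end(pages_viewed, page_count):
    """Event analysis: did the reader jump from the first half into the last 5%?"""
    cutoff = page_count * 0.95
    return any(
        prev <= page_count / 2 and nxt > cutoff
        for prev, nxt in zip(pages_viewed, pages_viewed[1:])
    )


def cohort_skip_rates(readers, books, checkouts, events, phrase):
    """Join all three sources; return cohort -> share of checkouts that skipped ahead."""
    views = defaultdict(list)
    for reader_id, book_id, page in events:
        views[(reader_id, book_id)].append(page)

    totals = defaultdict(int)
    skips = defaultdict(int)
    for reader_id, book_id in checkouts:
        book = books[book_id]
        if not mentions_often(book["text"], phrase):
            continue  # only books that frequently mention the phrase
        cohort = age_cohort(readers[reader_id])
        totals[cohort] += 1
        if skipped_to_end(views[(reader_id, book_id)], book["pages"]):
            skips[cohort] += 1
    return {cohort: skips[cohort] / totals[cohort] for cohort in totals}


rates = cohort_skip_rates(
    readers, books, checkouts, events, "attention deficit disorder"
)
print(rates)
```

The point of the sketch is only that the interesting answer needs all three layers at once: the cohort comes from the user database, the book filter from the detail data, and the skip flag from the event stream.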
