Apache HBase is designed for random, real-time, relatively low-latency read/write access to big data. HBase's goal is to store very large tables, with billions of rows and millions of columns, on clusters of commodity hardware.

The following characteristics make an application suitable for HBase:

  • Large quantities of data, on the scale of hundreds of GBs to TBs and PBs. HBase is not suitable for small-scale data

  • Fast, random access to data

  • A variable, flexible schema, in which each row is (or could be) different

  • Key-based access to data when storing, loading, searching, retrieving, serving, and querying

  • Data stored in collections. For example, some metadata, message data, or binary data all keyed on the same value

  • High throughput, on the order of thousands of records per second

  • Horizontally scalable cache capacity. Capacity can be increased simply by adding nodes

  • The data layout is designed for key lookup with no overhead for sparse columns

  • Data-centric model rather than a relationship-centric model. Not suitable for an ERD (entity relationship diagram) model

  • Strong consistency and high availability are requirements, with consistency favored over availability

  • Lots of insertion, lookup, and deletion of records

  • Write-heavy applications

  • Append-style writing (inserting and overwriting) rather than heavy read-modify-write
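Several of the characteristics above (key-based access, a flexible per-row schema, and no overhead for sparse columns) follow from HBase's underlying data model: conceptually, a table is a sorted map from (row key, column, timestamp) to value, and absent cells simply are not stored. The following is a minimal, illustrative sketch of that model in plain Python; it is not the HBase API, and the table and column names are hypothetical.

```python
# Conceptual sketch of HBase's storage model (NOT the HBase client API):
# each cell is an entry in a map keyed by (row key, column, timestamp).
# Rows with different columns cost nothing extra, because a cell that
# was never written is simply not present in the map.

store = {}  # (row_key, column, timestamp) -> value

def put(row, column, value, ts):
    """Write one cell; older timestamps remain as earlier versions."""
    store[(row, column, ts)] = value

def get_row(row):
    """Return the latest-version value for each column of a row."""
    cells = {}
    for (r, col, ts), val in store.items():
        if r == row and (col not in cells or ts > cells[col][0]):
            cells[col] = (ts, val)
    return {col: val for col, (ts, val) in cells.items()}

# Rows may have entirely different columns (flexible schema):
put("user1", "info:name", "Alice", 1)
put("user1", "info:email", "alice@example.com", 1)
put("user2", "info:name", "Bob", 1)
put("user2", "stats:logins", "7", 1)  # absent on user1: no storage cost

print(get_row("user1"))
print(get_row("user2"))
```

Note how all access is by row key: there is no query planner and no secondary structure in this sketch, which mirrors why key-based access patterns suit HBase while ad hoc relational queries do not.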

Some use-cases for HBase are as follows:

  • Audit logging systems

  • Tracking user actions

  • Answering queries such as

    • What are the last 10 actions made by the user?

    • Which users logged into the system on a particular day?

  • Real-time analytics

    • Real-time counters

    • Interactive reports showing trends and breakdowns

    • Time series databases

  • Monitoring systems

  • Message-centered systems (Twitter-like messages and statuses)

  • Content management systems serving content out of HBase

  • Canonical use-cases such as storing web pages fetched while crawling the Web
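Queries such as "What are the last 10 actions made by the user?" are typically answered in HBase through row key design rather than indexes: a common pattern is a composite key of the form user_id plus a reverse timestamp, so that a prefix scan over one user's keys returns the newest actions first. The sketch below simulates that pattern over a plain sorted key space; it is illustrative only (not the HBase scan API), and all names and the timestamp bound are assumptions.

```python
# Illustrative row key design (NOT the HBase API): keys of the form
# <user_id>#<reverse_timestamp> sort newest-first lexicographically,
# so "last N actions" becomes a short prefix scan.

MAX_TS = 10**13  # assumed upper bound on millisecond timestamps

def row_key(user_id, ts_millis):
    # Zero-padding keeps lexicographic order equal to numeric order.
    return f"{user_id}#{MAX_TS - ts_millis:013d}"

table = {}  # a sorted key space standing in for an HBase table

def log_action(user_id, ts_millis, action):
    table[row_key(user_id, ts_millis)] = action

def last_actions(user_id, n):
    """Prefix-scan one user's keys; reversed timestamps yield newest first."""
    prefix = f"{user_id}#"
    keys = sorted(k for k in table if k.startswith(prefix))
    return [table[k] for k in keys[:n]]

log_action("u1", 1000, "login")
log_action("u1", 3000, "purchase")
log_action("u1", 2000, "view")
print(last_actions("u1", 2))  # ['purchase', 'view'] (newest first)
```

The same composite-key idea underlies the time-series and audit-logging use-cases above: the key orders the data so that the common query is a cheap, contiguous scan.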

HBase is not suitable for, or optimized for,

  • Classical transactional applications or relational analytics

  • Batch MapReduce (not a substitute for HDFS)

  • Cross-record transactions and joins

HBase is not a replacement for an RDBMS or for HDFS. HBase is suitable for

  • Large datasets

  • Sparse datasets

  • Loosely coupled (denormalized) records

  • Many concurrent clients

HBase is not suitable for

  • Small datasets (unless many of them)

  • Highly relational records

  • Schema designs requiring transactions

Summary

In this chapter, I discussed the characteristics that make an application suitable for Apache HBase, including fast, random access to large quantities of data with high throughput. Characteristics that make an application unsuitable for HBase were also discussed. In the next chapter, I will discuss the physical storage in HBase.