Palash Chauhan

Putting the SQL back in NoSQL

Part 2 of 3 in Phoenix Fundamentals

In the last post we saw that HBase is a bare, sorted key-value store with a small API and one superpower: coprocessors. Apache Phoenix takes that and makes it feel like a relational database.

The trick is not a new engine. Phoenix is really just two things: a client-side JDBC driver and a set of HBase coprocessors. Everything else is how those two pieces map relational concepts onto HBase. This post is that map. We will not go deep into any single feature yet, and we will save how a query or an upsert actually runs for the next post.

Phoenix is not a server

A traditional RDBMS is a process you connect to. Phoenix is not. It is a JDBC driver on the client plus coprocessors that already live inside HBase. There is nothing new to run.

| | Traditional RDBMS | Phoenix on HBase |
| --- | --- | --- |
| Where the logic runs | a dedicated database server | a client driver, plus coprocessors inside HBase |
| Where data is stored | its own files | HBase, on top of HDFS |
| What you install | a database | a JDBC jar; coprocessors load into HBase |

From SQL to HBase

Almost everything Phoenix does is one of these mappings:

| Relational concept | How Phoenix does it on HBase |
| --- | --- |
| Table | An HBase table (rows and column families) |
| Schema and catalog | Stored as rows in the SYSTEM.CATALOG table |
| Primary key | Encoded into the HBase rowkey |
| Columns and data types | Typed values serialized into cells (HBase only sees bytes) |
| Constraints and defaults | Enforced by coprocessors on write |
| Secondary indexes | Extra HBase tables kept in sync by coprocessors |
| TTL, atomic updates, CDC | Built on HBase features and coprocessors |
| SQL queries | Compiled by the driver into Get and Scan calls |

Two of these are worth seeing up close.

The primary key becomes the rowkey

HBase only has one sorted axis: the rowkey. So Phoenix takes your primary key, say (customer_id, order_id), and encodes its columns one after another into a single rowkey. A zero byte follows the variable-length customer_id to separate it from the order_id. The pieces sit contiguously, and because the whole key is sorted, all orders for the same customer naturally land next to each other.

block-beta
  columns 3
  a["customer_id (VARCHAR)"] b["0x00 separator"] c["order_id (INTEGER)"]
  rowkey["one contiguous, sorted HBase rowkey"]:3
  style b fill:#6b7280,color:#fff,stroke:#374151
  style rowkey stroke:#f59e0b,stroke-width:4px

That single fact is why primary key design matters so much in Phoenix: the key is not just an identifier, it is also the physical sort order of your data.
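To make the encoding concrete, here is a minimal Python sketch of the idea, not Phoenix's actual serialization code. The function names are made up for illustration. It assumes a VARCHAR is written as UTF-8 bytes and an INTEGER as four big-endian bytes with the sign bit flipped, so that plain byte comparison matches numeric order. The `prefix_scan` helper imitates a range scan over sorted keys with a binary search.

```python
import bisect
import struct


def encode_varchar(value: str) -> bytes:
    """Variable-length UTF-8 bytes (assumed encoding for this sketch)."""
    return value.encode("utf-8")


def encode_integer(value: int) -> bytes:
    """4 big-endian bytes with the sign bit flipped, so byte order == numeric order."""
    if not -2**31 <= value < 2**31:
        raise ValueError("value does not fit in a 32-bit INTEGER")
    return struct.pack(">I", (value & 0xFFFFFFFF) ^ 0x80000000)


def encode_rowkey(customer_id: str, order_id: int) -> bytes:
    """customer_id, a zero-byte separator, then the fixed-width order_id."""
    return encode_varchar(customer_id) + b"\x00" + encode_integer(order_id)


def prefix_scan(sorted_keys: list, customer_id: str) -> list:
    """Return every key for one customer from an already sorted key list."""
    start = encode_varchar(customer_id) + b"\x00"
    stop = encode_varchar(customer_id) + b"\x01"  # first key past the prefix
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]
```

Sorting a handful of encoded keys and calling `prefix_scan(keys, "acme")` returns one contiguous slice: every order for that customer, in order_id order, and nothing from a neighbouring customer such as `acme2`. That contiguity is what lets a query on the leading key column become a narrow range scan rather than a full table read.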

One row becomes many cells

HBase does not store a row as a single record. Each non-key column is stored as its own cell, and every cell repeats the full rowkey. So a Phoenix row with two key columns (customer_id, order_id) and two regular columns (amount, status) is laid out as one cell per regular column. Each grouped box below is a single HBase cell: a rowkey, a column, and a value, with the rowkey copied into each.

block-beta
  columns 1
  block:cell1
    columns 3
    t1["cell-1"]:3
    a1["rowkey: acme | 1007"] a2["column: amount"] a3["value: 99.50"]
  end
  block:cell2
    columns 3
    t2["cell-2"]:3
    b1["rowkey: acme | 1007"] b2["column: status"] b3["value: shipped"]
  end
  style a1 stroke:#f59e0b,stroke-width:4px
  style b1 stroke:#f59e0b,stroke-width:4px

The rowkey (highlighted) is repeated in every cell, which is why short keys and short column names keep storage in check.
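The layout above can be sketched in a few lines of Python. This is a simplified model, not HBase's real cell format: a real cell also carries a column family, a timestamp, and a type, and the `Cell`, `row_to_cells`, and `repeated_key_bytes` names are invented for this example. Values are kept as plain strings to keep the sketch short.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """A simplified cell: real HBase cells also hold family, timestamp, and type."""
    rowkey: bytes
    column: str
    value: bytes


def row_to_cells(rowkey: bytes, regular_columns: dict) -> list:
    """Emit one cell per regular column, each carrying a full copy of the rowkey."""
    return [
        Cell(rowkey, name, value.encode("utf-8"))
        for name, value in regular_columns.items()
    ]


def repeated_key_bytes(cells: list) -> int:
    """Total bytes spent on rowkey copies across all cells of a row."""
    return sum(len(cell.rowkey) for cell in cells)
```

For a row with a 9-byte stand-in key such as `b"acme|1007"` and the two columns `amount` and `status`, `row_to_cells` yields two cells and `repeated_key_bytes` reports 18 bytes of key. The key cost grows with the number of columns, which is the reason for keeping keys short.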

There is also one more cell we did not draw. Phoenix allows a table in which every column is part of the primary key, but a rowkey on its own is not a cell, and HBase stores only cells. A row with no regular column values would therefore not exist in HBase at all. So Phoenix writes one tiny empty cell per row, which guarantees the row is present even when it has nothing but a key. It looks trivial, but it is the linchpin for a lot of what comes later, so we will keep referring back to it.
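A small sketch shows why the extra cell matters. It is a toy model under stated assumptions: cells are plain tuples, and the `_0` qualifier and the `cells_for_row` helper are placeholders chosen for illustration rather than a statement of Phoenix's exact on-disk naming.

```python
EMPTY_QUALIFIER = "_0"  # placeholder qualifier for the empty cell (illustrative)


def cells_for_row(rowkey: bytes, regular_columns: dict) -> list:
    """Return (rowkey, column, value) tuples, always starting with the empty cell."""
    cells = [(rowkey, EMPTY_QUALIFIER, b"")]
    cells += [(rowkey, name, value) for name, value in regular_columns.items()]
    return cells
```

With this in place, a key-only row such as `cells_for_row(b"acme|1007", {})` still produces exactly one cell, so the row is stored and can be found by a scan. Without the empty cell the same call would produce nothing to store.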

Your schema is just data

Phoenix stores your schema in an ordinary HBase table, SYSTEM.CATALOG. Tables, columns, types, and primary keys are just rows:

| TABLE_NAME | COLUMN_NAME | DATA_TYPE | PK position |
| --- | --- | --- | --- |
| ORDERS | CUSTOMER_ID | VARCHAR | 1 |
| ORDERS | ORDER_ID | INTEGER | 2 |
| ORDERS | AMOUNT | DECIMAL | none |

So how do schema changes get validated? When you run CREATE TABLE or ALTER TABLE, something has to enforce the rules: that a column exists, that the change does not conflict with existing metadata. Enter coprocessors. An endpoint coprocessor on SYSTEM.CATALOG runs that logic right where the metadata lives, with no separate server involved.

flowchart LR
  client["Client: ALTER TABLE orders"]
  subgraph cat [SYSTEM.CATALOG region]
    ep(["Endpoint coprocessor attached here"])
  end
  client --> ep
  ep -->|"valid"| ok["Update the catalog"]
  ep -->|"breaks a rule"| reject["Reject the change"]
  style ep stroke:#f59e0b,stroke-width:4px
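
The validate-then-update flow can be imitated with an in-memory catalog. This is only a sketch of the idea, assuming the catalog is a list of row dictionaries like the table above. `SchemaError`, `sample_catalog`, and `add_column` are hypothetical names, and the two rules checked here are examples, not Phoenix's full rule set.

```python
class SchemaError(Exception):
    """Raised when a schema change conflicts with existing metadata."""


def sample_catalog() -> list:
    """Catalog rows mirroring the ORDERS example: one row per column."""
    return [
        {"table": "ORDERS", "column": "CUSTOMER_ID", "type": "VARCHAR", "pk_position": 1},
        {"table": "ORDERS", "column": "ORDER_ID", "type": "INTEGER", "pk_position": 2},
        {"table": "ORDERS", "column": "AMOUNT", "type": "DECIMAL", "pk_position": None},
    ]


def add_column(catalog: list, table: str, column: str, data_type: str) -> None:
    """Validate an ADD COLUMN request against the catalog, then record it."""
    table_rows = [row for row in catalog if row["table"] == table]
    if not table_rows:
        raise SchemaError(f"table {table} does not exist")
    if any(row["column"] == column for row in table_rows):
        raise SchemaError(f"column {column} already exists on {table}")
    # The change is valid: the schema update is just one more catalog row.
    catalog.append(
        {"table": table, "column": column, "type": data_type, "pk_position": None}
    )
```

Adding a new `STATUS` column appends a fourth row, while adding `AMOUNT` again raises `SchemaError` and leaves the catalog unchanged. In Phoenix the equivalent check runs inside the region that holds the metadata, so the validation and the write happen in the same place.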

And this is only the beginning. Coprocessors are not a one-off for the catalog; they are how Phoenix does almost everything, and they will keep showing up throughout this series.

A familiar SQL grammar

Phoenix speaks a broad, mostly standard SQL dialect: SELECT with the usual clauses, joins, and subqueries. The main Phoenix-specific twist is UPSERT, one statement that inserts or updates a row. For the full reference, see the Phoenix SQL grammar.
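The insert-or-update behaviour of UPSERT can be modelled with a dictionary keyed by primary key. This is a rough sketch of the semantics only, and `upsert` is a made-up helper. It assumes that columns not named in the statement keep whatever value they already had, which follows from each column being its own cell.

```python
def upsert(table: dict, rowkey: tuple, columns: dict) -> None:
    """Create the row if it is missing, otherwise overwrite only the named columns."""
    table.setdefault(rowkey, {}).update(columns)
```

Calling `upsert` twice with the same key, first with `amount` and `status` and then with only `status`, leaves a single row whose `amount` is untouched and whose `status` holds the newer value. There is no separate INSERT or UPDATE path to choose between.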

Up next

Next, we will follow a query and an upsert from Phoenix SQL all the way down to HBase and back.