Putting the SQL back in NoSQL
In the last post we saw that HBase is a bare, sorted key-value store with a small API and one superpower: coprocessors. Apache Phoenix takes that and makes it feel like a relational database.
The trick is not a new engine. Phoenix is really just two things: a client-side JDBC driver and a set of HBase coprocessors. Everything else is how those two pieces map relational concepts onto HBase. This post is that map. We will not go deep into any single feature yet, and we will save how a query or an upsert actually runs for the next post.
Phoenix is not a server
A traditional RDBMS is a process you connect to. Phoenix is not. It is a JDBC driver on the client plus coprocessors that already live inside HBase. There is nothing new to run.
| | Traditional RDBMS | Phoenix on HBase |
|---|---|---|
| Where the logic runs | a dedicated database server | a client driver, plus coprocessors inside HBase |
| Where data is stored | its own files | HBase, on top of HDFS |
| What you install | a database | a JDBC jar; coprocessors load into HBase |
From SQL to HBase
Almost everything Phoenix does is one of these mappings:
| Relational concept | How Phoenix does it on HBase |
|---|---|
| Table | An HBase table (rows and column families) |
| Schema and catalog | Stored as rows in the SYSTEM.CATALOG table |
| Primary key | Encoded into the HBase rowkey |
| Columns and data types | Typed values serialized into cells (HBase only sees bytes) |
| Constraints and defaults | Enforced by coprocessors on write |
| Secondary indexes | Extra HBase tables kept in sync by coprocessors |
| TTL, atomic updates, CDC | Built on HBase features and coprocessors |
| SQL queries | Compiled by the driver into Get and Scan calls |
Two of these are worth seeing up close.
The primary key becomes the rowkey
HBase only has one sorted axis: the rowkey. So Phoenix takes your primary key, say (customer_id, order_id), and encodes its columns one after another into a single rowkey. A zero byte follows the variable-length customer_id to separate it from the order_id. The pieces sit contiguously, and because the whole key is sorted, all orders for the same customer naturally land next to each other.
block-beta
columns 3
a["customer_id (VARCHAR)"] b["0x00 separator"] c["order_id (INTEGER)"]
rowkey["one contiguous, sorted HBase rowkey"]:3
style b fill:#6b7280,color:#fff,stroke:#374151
style rowkey stroke:#f59e0b,stroke-width:4px
That single fact is why primary key design matters so much in Phoenix: the key is not just an identifier, it is also the physical sort order of your data.
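To make the encoding concrete, here is a minimal Python sketch of the idea, not Phoenix's actual serializer. It assumes customer_id is written as UTF-8 bytes followed by the zero-byte separator, and that order_id is written as a 4-byte big-endian integer with its sign bit flipped so that the raw bytes sort in numeric order. The function names are our own, for illustration only.

```python
import struct

SEPARATOR = b"\x00"  # zero byte that ends the variable-length customer_id


def encode_int(value: int) -> bytes:
    """4-byte big-endian with the sign bit flipped (assumed encoding),
    so that comparing the bytes matches comparing the numbers."""
    return struct.pack(">I", (value & 0xFFFFFFFF) ^ 0x80000000)


def decode_int(raw: bytes) -> int:
    """Inverse of encode_int: flip the sign bit back and read as signed."""
    flipped = struct.unpack(">I", raw)[0] ^ 0x80000000
    return struct.unpack(">i", struct.pack(">I", flipped))[0]


def encode_rowkey(customer_id: str, order_id: int) -> bytes:
    """Concatenate the key columns into one contiguous rowkey."""
    return customer_id.encode("utf-8") + SEPARATOR + encode_int(order_id)


def decode_rowkey(rowkey: bytes) -> tuple:
    """Split at the first zero byte; the customer_id itself contains none."""
    cut = rowkey.index(SEPARATOR)
    return rowkey[:cut].decode("utf-8"), decode_int(rowkey[cut + 1:])


# Rows arrive in arbitrary order; HBase keeps them sorted by rowkey bytes.
rows = [("zeta", 1), ("acme", 1007), ("acme", -5), ("acme", 12), ("bolt", 3)]
sorted_keys = sorted(encode_rowkey(c, o) for c, o in rows)

for key in sorted_keys:
    print(key.hex(), decode_rowkey(key))
```

Sorting the raw bytes groups every acme order together and orders them by order_id, which is the property a range scan over one customer relies on.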
One row becomes many cells
HBase does not store a row as a single record. Each non-key column is stored as its own cell, and every cell repeats the full rowkey. So a Phoenix row with two key columns (customer_id, order_id) and two regular columns (amount, status) is laid out as one cell per regular column. Each grouped box below is a single HBase cell: a rowkey, a column, and a value, with the rowkey copied into each.
block-beta
columns 1
block:cell1
columns 3
t1["cell-1"]:3
a1["rowkey: acme | 1007"] a2["column: amount"] a3["value: 99.50"]
end
block:cell2
columns 3
t2["cell-2"]:3
b1["rowkey: acme | 1007"] b2["column: status"] b3["value: shipped"]
end
style a1 stroke:#f59e0b,stroke-width:4px
style b1 stroke:#f59e0b,stroke-width:4px
The rowkey (highlighted) is repeated in every cell, which is why short keys and short column names keep storage in check.
There is also one more cell we did not draw. Phoenix allows a table in which every column is part of the primary key. A row in such a table has no regular columns, so it would have no cells at all, and HBase stores only cells. So Phoenix writes one tiny empty cell per row, which guarantees that every row has at least one cell carrying its rowkey. It looks trivial, but it is the linchpin for a lot of what comes later, so we will keep referring back to it.
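The layout can be sketched in a few lines of Python. This is a simplified model, not Phoenix code: values are kept as plain strings rather than typed, serialized bytes, column families are left out, and the qualifier used for the empty cell (`_0`) is an assumption made for illustration.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    rowkey: bytes
    qualifier: bytes
    value: bytes


EMPTY_QUALIFIER = b"_0"  # assumed name for the per-row empty marker cell


def row_to_cells(rowkey: bytes, columns: dict) -> list:
    """Explode one logical row into cells, each carrying the full rowkey."""
    # The empty marker cell is always written, even if there are no columns.
    cells = [Cell(rowkey, EMPTY_QUALIFIER, b"")]
    for name, value in columns.items():
        cells.append(Cell(rowkey, name.encode("utf-8"), value.encode("utf-8")))
    return cells


rowkey = b"acme|1007"  # readable stand-in for the encoded rowkey
cells = row_to_cells(rowkey, {"amount": "99.50", "status": "shipped"})
key_only_cells = row_to_cells(rowkey, {})  # a row whose columns are all in the key

# Bytes spent on repeated rowkeys alone, before any column name or value.
rowkey_overhead = sum(len(c.rowkey) for c in cells)

for cell in cells:
    print(cell)
print("rowkey bytes stored:", rowkey_overhead)
```

Even in this toy model, the rowkey is stored three times for a row with two regular columns, and the key-only row still produces exactly one cell.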
Your schema is just data
Phoenix stores your schema in an ordinary HBase table, SYSTEM.CATALOG. Tables, columns, types, and primary keys are just rows:
| TABLE_NAME | COLUMN_NAME | DATA_TYPE | PK position |
|---|---|---|---|
| ORDERS | CUSTOMER_ID | VARCHAR | 1 |
| ORDERS | ORDER_ID | INTEGER | 2 |
| ORDERS | AMOUNT | DECIMAL | none |
So how do schema changes get validated? When you run CREATE TABLE or ALTER TABLE, something has to enforce the rules: that a column exists, that the change does not conflict with existing metadata. Enter coprocessors. An endpoint coprocessor on SYSTEM.CATALOG runs that logic right where the metadata lives, with no separate server involved.
flowchart LR
client["Client: ALTER TABLE orders"]
subgraph cat [SYSTEM.CATALOG region]
ep(["Endpoint coprocessor attached here"])
end
client --> ep
ep -->|"valid"| ok["Update the catalog"]
ep -->|"breaks a rule"| reject["Reject the change"]
style ep stroke:#f59e0b,stroke-width:4px
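As a rough illustration of "schema as rows plus a validation step", here is a small Python sketch. It models the catalog as a list of dictionaries and checks two rules before adding a column. The rule set, the function name add_column, and the SchemaError type are hypothetical; the real endpoint coprocessor enforces far more than this.

```python
# A toy catalog: one row per column, mirroring the table above.
CATALOG = [
    {"TABLE_NAME": "ORDERS", "COLUMN_NAME": "CUSTOMER_ID", "DATA_TYPE": "VARCHAR", "PK_POSITION": 1},
    {"TABLE_NAME": "ORDERS", "COLUMN_NAME": "ORDER_ID", "DATA_TYPE": "INTEGER", "PK_POSITION": 2},
    {"TABLE_NAME": "ORDERS", "COLUMN_NAME": "AMOUNT", "DATA_TYPE": "DECIMAL", "PK_POSITION": None},
]


class SchemaError(Exception):
    """Raised when a schema change breaks a rule (hypothetical type)."""


def add_column(catalog: list, table: str, column: str, data_type: str) -> dict:
    """Validate an ADD COLUMN request against the catalog, then record it."""
    table_rows = [r for r in catalog if r["TABLE_NAME"] == table]
    if not table_rows:
        raise SchemaError(f"table {table} does not exist")
    if any(r["COLUMN_NAME"] == column for r in table_rows):
        raise SchemaError(f"column {column} already exists on {table}")
    # Valid: the schema change is just one more row in the catalog.
    new_row = {"TABLE_NAME": table, "COLUMN_NAME": column,
               "DATA_TYPE": data_type, "PK_POSITION": None}
    catalog.append(new_row)
    return new_row


added = add_column(CATALOG, "ORDERS", "STATUS", "VARCHAR")
print(added)
```

The point of running this kind of check inside a coprocessor is locality: the rows being validated and the code validating them sit in the same region.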
And this is only the beginning. Coprocessors are not a one-off for the catalog; they are how Phoenix does almost everything, and they will keep showing up throughout this series.
A familiar SQL grammar
Phoenix speaks a broad, mostly standard SQL dialect: SELECT with the usual clauses, joins, and subqueries. The main Phoenix-specific twist is UPSERT, one statement that inserts or updates a row. For the full reference, see the Phoenix SQL grammar.
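The insert-or-update behavior of UPSERT is easy to model. The sketch below is a toy in-memory table in Python, not a Phoenix client: the table is a dictionary keyed by primary key, and the upsert function is our own illustrative name. It assumes that an upsert naming only some columns leaves the other stored columns untouched, which follows from each column being its own cell.

```python
def upsert(table: dict, primary_key: tuple, values: dict) -> None:
    """Insert the row if the key is new, otherwise overwrite the given columns."""
    table.setdefault(primary_key, {}).update(values)


orders = {}

# Roughly: UPSERT INTO orders (customer_id, order_id, amount, status)
#          VALUES ('acme', 1007, 99.50, 'pending')
upsert(orders, ("acme", 1007), {"amount": "99.50", "status": "pending"})

# Same statement shape, same key: this time it behaves like an update.
# Roughly: UPSERT INTO orders (customer_id, order_id, status)
#          VALUES ('acme', 1007, 'shipped')
upsert(orders, ("acme", 1007), {"status": "shipped"})

print(orders)
```

There is no separate existence check in this model: the same call covers both cases, and the primary key alone decides which one applies.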
Up next
Next, we will follow a query and an upsert from Phoenix SQL all the way down to HBase and back.