Secondary Indexes · Palash Chauhan

In Series 1 we saw that HBase indexes exactly one thing: the rowkey. So a Phoenix query on any other column has to scan the whole table. A secondary index fixes that, letting you look rows up by some other column quickly. The interesting part is how Phoenix keeps that index correct.

An index is a separate table

Phoenix implements a secondary index as a separate HBase table, sorted by the indexed column. Say we want to look orders up by customer:

CREATE TABLE orders (
  order_id BIGINT NOT NULL PRIMARY KEY,
  customer VARCHAR,
  amount   DECIMAL,
  status   VARCHAR
);

CREATE INDEX orders_by_customer ON orders (customer);

The index table’s rowkey is the indexed value plus the data table’s primary key, so it is sorted by customer and still maps back to the exact order:

flowchart LR
  subgraph data ["ORDERS (data table, by order_id)"]
    direction TB
    d1["1001 | acme | shipped"]
    d2["1002 | globex | pending"]
    d3["1003 | acme | shipped"]
  end
  subgraph idx ["orders_by_customer (index, by customer)"]
    direction TB
    i1["acme | 1001"]
    i2["acme | 1003"]
    i3["globex | 1002"]
  end
  i1 --> d1
  i2 --> d3
  i3 --> d2

Applications never touch the index table. They query orders as usual, and Phoenix uses the index behind the scenes.

Why this is hard

The data table and the index are two separate HBase tables, almost always on different region servers. Keeping them in sync sounds trivial: on every write, update both. In a distributed system, it is not.

If the client wrote to both itself, any of these would break it:

one write succeeds and the other fails, or the client crashes in between, and the two tables drift out of sync,
two updates to the same order race, and the index is left with the wrong entry,
a reader follows the index to a row that no longer matches, or misses one that does.

And every application would have to get this exactly right. So Phoenix does it centrally, on the server, with a protocol built for failure.

Enter 2PC and the empty column

Remember the empty cell every Phoenix row carries, from the fundamentals? On index rows, Phoenix puts it to work. It stores a status there: unverified or verified.

The data table is the source of truth. Index maintenance runs in a coprocessor attached to the data table: it intercepts every write to orders and keeps the index table in step, around the data write, in three ordered steps:

flowchart LR
  w["Client writes an order"] --> cp
  subgraph data ["ORDERS data table"]
    cp(["coprocessor attached here"])
    drow["the order row"]
  end
  idx["index table"]
  cp -->|"1. index row = unverified"| idx
  cp -->|"2. update"| drow
  cp -->|"3. index row = verified"| idx
  style cp stroke:#f59e0b,stroke-width:4px

This unverified-then-verified status, applied in two steps around the data write, is essentially a two-phase commit on the index row: the first step prepares it, the second commits it once the data write has succeeded.

Concurrent updates to the same row are serialized with a row lock, so the index never sees a half-applied state.

What if a crash interrupts the sequence? The data table decides the outcome either way. A crash after step 1, before the data row is written, leaves an unverified index row for a change that never committed. A crash after step 2, with the data row written but not yet marked verified, leaves a committed change whose index row is still unverified. In both cases you are left with an unverified or stale index row, which is fine, because of the next piece.

Read repair

A coprocessor on the index table runs this check on every scan. When a query lands on an unverified row, it does not trust it. It goes back to the data table, rebuilds the correct answer, and repairs the index row before returning results:

flowchart LR
  q["Query via the index"] --> c{"check the empty column"}
  c -->|"verified"| use["use the index row"]
  c -->|"unverified"| repair["repair from the data table, then use it"]
  style repair stroke:#f59e0b,stroke-width:4px

Because unverified rows only appear after a failure or a race, this happens rarely, and a read is always consistent with the data table. A background process sweeps up stale rows over time.

Covered, uncovered, and partial indexes

Indexes come in a few shapes:

Covered: the index also stores extra columns, so a query that needs only those can be answered from the index alone, with no trip to the data table.
Uncovered: the index stores only the indexed column and the primary key, so it finds the matching rows and Phoenix reads the rest from the data table. Carrying fewer columns makes it smaller and cheaper to keep in sync on writes; the tradeoff is that extra lookup at read time.
Partial: the index covers only the rows matching a condition, so if you query just a subset, the index stays small and cheap.
View index: an index on a view rather than a table. Phoenix keeps the view indexes for all views over the same base table in one shared physical HBase table, so the region count stays low even with many views.

An index is a separate table

Why this is hard

Enter 2PC and the empty column

Read repair

Covered, uncovered, and partial indexes

Further reading