Date: 2026-03-10
Reviewer: Technical Architecture Review
Scope: Kessel authorization (ReBAC) integration for Cost Management (Koku) on-prem
References: PR #5933, kessel-ocp-integration.md, rebac-bridge-design.md
Koku's on-prem deployment previously depended on the SaaS RBAC service, which is unavailable outside cloud.redhat.com. The integration replaces this with Kessel (SpiceDB-based ReBAC) to provide:
- Fine-grained access control — workspace-based, resource-specific permissions
- Relationship-based authorization — principals, roles, workspaces, and resources modeled as a graph
- On-prem independence — no dependency on external SaaS authorization services
Kessel is Red Hat's platform-level authorization system built on SpiceDB (Zanzibar-inspired). It provides:
- Inventory API (gRPC) — `Check`, `StreamedListObjects`, `ReportResource`, `DeleteResource`
- Relations API (REST + gRPC) — tuple CRUD for SpiceDB relationships
- ZED schema — declarative authorization model (resources, relations, permissions)
Kessel is the single source of truth for authorization decisions in on-prem Koku.
User Request
│
▼
Koku API (Django)
│
▼
IdentityHeaderMiddleware
│
├─► get_access_provider() → KesselAccessProvider (ONPREM) or RBACAccessProvider (SaaS)
│
▼
KesselAccessProvider.get_access_for_user()
│
├─► For each resource type:
│ ├─► Check(rbac/workspace:{org_id}, permission, rbac/principal:{user}) [workspace-level]
│ └─► StreamedListObjects(resource_type, relation, principal) [per-resource fallback]
│
▼
Kessel Inventory API (gRPC)
│
▼
SpiceDB (authorization engine)
│
▼
Decision: access map { "openshift.cluster": {"read": ["*"] | ["id1","id2"] }, ... }
│
▼
request.user.access populated → Permission classes & query layer apply filters

The ReBAC Bridge (described in rebac-bridge-design.md) is a separate Go microservice — not part of this PR. It provides:
- insights-rbac v1 compatible REST API for roles, groups, principals, access
- Translation from high-level RBAC operations to SpiceDB tuples
- Management plane for on-prem admins (group creation, role assignment, resource assignment)
This PR implements the Koku application layer — KesselAccessProvider, resource_reporter, middleware, and integration hooks. The ReBAC Bridge is a future deliverable.
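The access map that `KesselAccessProvider` ultimately populates on `request.user.access` can be sketched as plain data. The helper below is hypothetical; it only illustrates how the query layer might interpret a wildcard versus an explicit ID list.

```python
# Illustrative shape of the access map (values are examples, not real IDs):
# "*" means org-wide visibility; a list of IDs means per-resource access.
access = {
    "openshift.cluster": {"read": ["*"], "write": []},
    "openshift.project": {"read": ["proj-a", "proj-b"], "write": []},
}


def resource_ids_for(access, resource_type, operation="read"):
    """Hypothetical helper: None means no filtering (wildcard grant);
    otherwise the query layer restricts rows to the returned IDs
    (an empty list yields no rows, i.e., no access for that type)."""
    ids = access.get(resource_type, {}).get(operation, [])
    if "*" in ids:
        return None
    return ids
```

Under this sketch, `resource_ids_for(access, "openshift.cluster")` yields `None` (no filter applied), while `"openshift.project"` yields the two explicit project IDs.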
| Resource Type | Kessel Type | Relations | Purpose |
|---|---|---|---|
| OCP Cluster | `cost_management/openshift_cluster` | `t_workspace` → org workspace | Cluster visibility |
| OCP Node | `cost_management/openshift_node` | `t_workspace`, `has_cluster` | Node visibility |
| OCP Project | `cost_management/openshift_project` | `t_workspace`, `has_cluster` | Project/namespace visibility |
| Integration | `cost_management/integration` | `t_workspace`, `has_cluster`, `has_project` | Source visibility (computed) |
| Cost Model | `cost_management/cost_model` | `t_workspace` | Cost model visibility |
| Settings | `rbac/workspace` | Check-only (capability) | Settings access |
| AWS/Azure/GCP | Pre-provisioned in schema | `t_workspace` | Future cloud provider support |
Key relationships:
- `resource#t_workspace → rbac/workspace:{org_id}` — primary org-level visibility
- `integration#has_cluster → openshift_cluster:{id}` — structural containment for computed permissions
- `rbac/role_binding#t_binding → rbac/workspace` — role bindings scoped to workspaces
Strengths:
- OCP resources (cluster, node, project) map cleanly to Kessel types
- Integration as first-class resource with structural relationships enables computed visibility (project access → cluster → integration)
- Pre-provisioned schema for AWS, Azure, GCP supports future expansion
- Permission hierarchy (`_view = _read + _all + all_read + all_all`) matches SaaS RBAC semantics
Concerns:
- Provider UUID vs cluster_id: `provider_builder._report_ocp_resource()` passes `str(instance.uuid)` (the provider UUID) as the resource_id for `openshift_cluster`. The architecture docs and query layer expect cluster_id (e.g., `"my-ocp-cluster-1"`), and the API filters by `cluster_id` in report queries. This may cause a mismatch — StreamedListObjects would return provider UUIDs, but the query layer filters by cluster_id from the database.
- `t_workspace` — resource belongs to workspace (primary visibility)
- `has_cluster`, `has_project` — structural containment for integration
- `t_parent` — workspace hierarchy (team workspaces inherit from org)
- `t_binding`, `t_granted`, `t_subject` — role binding chain
The model supports team-based access grants and cross-team resource sharing via multiple t_workspace tuples per resource.
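The multi-tuple sharing model above can be illustrated with plain tuple payloads. The dictionary shape below is an assumption modeled on SpiceDB relationship tuples, not the confirmed Relations API schema; consult the Kessel Relations API documentation for the real payload format.

```python
def workspace_tuples(resource_type, resource_id, workspace_ids):
    """Build one t_workspace tuple per workspace a resource is shared with.

    Cross-team sharing is expressed by writing multiple t_workspace tuples
    for the same resource. Payload shape is illustrative (assumption).
    """
    return [
        {
            "resource": {"type": resource_type, "id": resource_id},
            "relation": "t_workspace",
            "subject": {"type": "rbac/workspace", "id": workspace_id},
        }
        for workspace_id in workspace_ids
    ]


# A cluster visible to its org workspace and one team workspace:
tuples = workspace_tuples(
    "cost_management/openshift_cluster", "my-ocp-cluster-1", ["org-123", "team-7"]
)
```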
| RBAC Type | Kessel Permission | Granularity |
|---|---|---|
| `openshift.cluster read` | `cost_management_openshift_cluster_view` | Appropriate |
| `openshift.cluster *` | `cost_management_openshift_cluster_all` | Appropriate |
| `settings` | Check-only on workspace | Capability, not per-resource |
Permissions are neither overly coarse nor overly granular for the current scope.
- Org scoping: `workspace_id = org_id`; all resources and role bindings are org-scoped
- Schema isolation: Koku's tenant schemas (`org{org_id}`) remain separate; Kessel uses `rbac/workspace:{org_id}` as the authorization boundary
- No cross-org leakage: StreamedListObjects and Check are scoped to the workspace; SpiceDB enforces relationship boundaries
- SpiceDB handles millions of tuples; the workspace model limits tuple count to resources × workspaces (not resources × users)
- Batch `StreamedListObjects` per resource type avoids O(N) per-resource Check calls
- Cache (300 s TTL) reduces Kessel load for repeated requests
- Schema is additive; new resource types can be added without breaking existing deployments
- `KOKU_TO_KESSEL_TYPE_MAP` and `IMMEDIATE_WRITE_TYPES` are centralized for easy extension
- ReBAC Bridge design allows new resource assignment endpoints
| Risk | Mitigation |
|---|---|
| Kessel unavailable | Fail-open per-type: failing types return no access; other types proceed. Cache mitigates transient outages. |
| Malformed identity | Middleware validates x-rh-identity; missing/invalid → 401 |
| Principal format | redhat/{username} convention is consistent; Keycloak is source of truth for user existence |
| ENHANCED_ORG_ADMIN | When True, admin bypasses access lookup; must be False when using Kessel |
Note: KesselConnectionError is defined and caught by middleware (HTTP 424), but KesselAccessProvider never raises it — it catches all exceptions internally and returns empty access. The fail-open behavior is documented but differs from the original fail-closed (424) recommendation.
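The per-type fail-open behavior described in the note can be sketched as follows. The client methods (`check_workspace`, `list_objects`) are illustrative stand-ins for the real gRPC calls, not the actual SDK API.

```python
import logging

log = logging.getLogger(__name__)


def access_for_type(client, resource_type, principal):
    """Per-type fail-open: any Kessel error yields empty access for this
    type only; other types are still queried independently.

    Empty access denies data for the affected type (the middleware treats
    a fully empty map as PermissionDenied for non-admins), so a failure
    never *grants* visibility, it only withholds it.
    """
    try:
        if client.check_workspace(resource_type, principal):  # illustrative call
            return {"read": ["*"]}  # workspace-level grant: wildcard
        return {"read": list(client.list_objects(resource_type, principal))}
    except Exception:
        log.exception("Kessel lookup failed for %s; returning no access", resource_type)
        return {"read": []}  # fail-open per type: no data, not wrong data
```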
| Component | Purpose |
|---|---|
| Kessel Inventory API | gRPC (Check, StreamedListObjects, ReportResource, DeleteResource) |
| Kessel Relations API | REST (t_workspace, structural tuples) |
| SpiceDB | Backend (never accessed directly by Koku) |
| Keycloak | OAuth2 client_credentials for Kessel API auth (when KESSEL_AUTH_ENABLED) |
New Python dependencies: `kessel-sdk` (gRPC stubs) and `grpcio`; `requests` is already present.
- Deployment: Requires the Kessel stack (SpiceDB + Inventory API + Relations API) plus the ZED schema and role seeding
- Configuration: `ONPREM=true`, `AUTHORIZATION_BACKEND=rebac`, `KESSEL_INVENTORY_*`, `KESSEL_RELATIONS_*`, optional `KESSEL_AUTH_*`
- Role seeding: Platform responsibility; no auto-seeding on Kessel-only deployments — operators must run `kessel-admin.sh seed-roles` or equivalent
- Kessel Inventory API (gRPC, default 9081)
- Kessel Relations API (REST, default 8100)
- SpiceDB (backend for both)
- Keycloak (for Kessel API auth when enabled)
| Scenario | Behavior |
|---|---|
| Cache HIT | Cached access used; request proceeds |
| Cache MISS, Check/StreamedListObjects fails | Per-type fail-open: that type returns no access ([] or no wildcard); other types still queried. User sees no data for affected types, not incorrect data. |
| KesselConnectionError raised | Middleware catches it → HTTP 424 Failed Dependency (but current code path never raises it) |
- Cache backend: `CacheEnum.kessel` when `AUTHORIZATION_BACKEND=rebac`
- TTL: 300 seconds (from `settings.CACHES["kessel"]["TIMEOUT"]`)
- Key: `{user.uuid}_{org_id}`
- Invalidation: Per-request; no explicit invalidation on role/resource changes (stale for up to 5 minutes)
- All Kessel calls are synchronous — `get_access_for_user` blocks until all resource types are resolved
- Latency: ~N × (Check + StreamedListObjects), where N = number of resource types (~11)
- Mitigated by cache; the first request per user/org pays the full cost
- First request (cache miss): Multiple gRPC round-trips; expect 100–500 ms depending on network
- Cached requests: No Kessel calls
- Recommendation: Monitor p95 latency for `get_access_for_user` and Kessel Check/StreamedListObjects
Strengths:
- Clean adapter pattern: `get_access_provider()` returns `KesselAccessProvider` or `RBACAccessProvider`; middleware and permission classes are unchanged
- `KOKU_TO_KESSEL_TYPE_MAP` centralizes type mapping
- Workspace resolution (`ShimResolver`, `RbacV2Resolver`) abstracts org_id → workspace_id
Issues:
- provider_builder OCP resource ID: `_report_ocp_resource(str(instance.uuid), self.org_id)` uses the provider UUID, but the query layer and API filter by `cluster_id`. The Kessel resource for `openshift_cluster` should use `cluster_id` from `instance.authentication.credentials.get("cluster_id")` to align with report filtering. The same applies to `_report_integration(..., str(instance.uuid), ...)` — the `has_cluster` subject should be the cluster_id.
- Singleton `KesselClient` with double-checked locking
- gRPC channel supports TLS + OAuth2 call credentials when `KESSEL_AUTH_ENABLED`
- Relations API uses `requests.post` for tuple creation; auth headers come from `get_http_auth_headers()`
- Check-first pattern: For per-resource types, workspace Check runs first; if allowed → wildcard; if denied → StreamedListObjects for specific IDs. Reduces unnecessary StreamedListObjects for org-wide admins.
- Write-grants-read: When write access is granted, read is also populated
- Settings: Check-only (no StreamedListObjects)
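The Check-first flow can be sketched as a single function. The client methods here are illustrative stand-ins for the Inventory API's `Check` and `StreamedListObjects` RPCs, not the real SDK signatures.

```python
def resolve_type_access(client, workspace, permission, principal, resource_type):
    """Check-first pattern: one workspace-level Check; only on denial do we
    fall back to StreamedListObjects to enumerate the specific resources the
    principal can see. Org-wide admins therefore cost a single Check call."""
    if client.check(workspace, permission, principal):  # workspace-level Check
        return ["*"]  # wildcard: every resource of this type is visible
    # Denied at the workspace level: enumerate per-resource grants instead.
    return list(client.streamed_list_objects(resource_type, permission, principal))
```

This is why the review notes that StreamedListObjects is a "per-resource fallback": it only runs for principals without an org-wide grant.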
- `KesselAccessProvider`: All exceptions are caught in `_check_workspace_permission` and `_streamed_list_objects`; they return `False`/`[]`. No propagation.
- `resource_reporter`: gRPC/HTTP errors are logged, never propagated. Provider creation/deletion succeeds even if Kessel sync fails.
- Gap: No retry logic for transient Kessel failures; no circuit breaker.
- Authorization logic: Isolated in `koku_rebac`; permission classes and views delegate to `request.user.access`
- Business logic: Provider creation and cost queries are unchanged; hooks (`on_resource_created`, `on_resource_deleted`) are called at integration points
- No leakage: Views do not import Kessel directly; they rely on middleware-populated access
- Readability: Clear module structure; docstrings explain Check-first pattern and dual-write
- Test coverage: Unit tests for access_provider, client, config, resource_reporter, middleware; contract tests; E2E regression
- Extensibility: Adding a new resource type requires (1) a `KOKU_TO_KESSEL_TYPE_MAP` entry, (2) `IMMEDIATE_WRITE_TYPES` if needed, and (3) hook calls at the appropriate lifecycle points
- Privilege escalation: Permission checks flow through SpiceDB; no client-side override
- Validation: Identity header validated; `org_id` from the identity is trusted for workspace resolution
- Bypass: `ENHANCED_ORG_ADMIN` bypasses the access lookup; it must be disabled for Kessel
Potential bypass: If `KesselAccessProvider` returns empty access for all types (e.g., Kessel down, all exceptions), `request.user.access` is empty. The middleware checks `not request.user.admin and not request.user.access` — empty access raises `PermissionDenied` for non-admins. So fail-open per-type does not grant access; it denies access for affected types. This is the correct behavior.
| Risk | Description | Mitigation |
|---|---|---|
| Resource ID mismatch | provider_builder reports `openshift_cluster` with the provider UUID instead of cluster_id. The query layer filters by cluster_id, so StreamedListObjects may return IDs that don't match DB columns. | Extract `cluster_id` from `instance.authentication.credentials` and use it for the `openshift_cluster` resource_id and the `has_cluster` subject. |
| Risk | Description | Mitigation |
|---|---|---|
| Role seeding gap | On-prem has no auto-seeding; operators must manually create role instances. Blocks deployment if not documented. | Document in operator guide; provide kessel-admin.sh seed-roles or Helm hook; verify platform tooling. |
| Cache staleness | 5-minute TTL; role/resource changes take up to 5 minutes to take effect. | Document; consider shorter TTL or cache invalidation on admin actions (ReBAC Bridge). |
| Kessel API gaps | ADR documents: no structural relationship support in Relations API; schema deployment not integrated. | Track upstream; use workarounds (Koku writes structural tuples via REST). |
| Risk | Description | Mitigation |
|---|---|---|
| No retry for Kessel | Transient gRPC/HTTP failures cause immediate empty access. | Add retry with backoff for Check/StreamedListObjects; consider circuit breaker. |
| Principal prefix fragility | `redhat/` hardcoded in multiple codebases. | Extract to a `KESSEL_PRINCIPAL_PREFIX` env var. |
| ReBAC Bridge not implemented | Management plane (groups, roles, resource assignment) requires manual Kessel API or future Bridge. | Document kessel-admin.sh usage; prioritize Bridge delivery. |
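The retry-with-backoff mitigation suggested for transient Kessel failures could be sketched as below. The helper and its parameters are assumptions; a real client would retry on gRPC `UNAVAILABLE`/`DEADLINE_EXCEEDED` statuses rather than `ConnectionError`.

```python
import logging
import time

log = logging.getLogger(__name__)


def with_retries(call, attempts=3, base_delay=0.1, retriable=(ConnectionError,)):
    """Retry a Kessel RPC with exponential backoff on transient errors.

    Only the final failure propagates, so the caller can still choose
    between fail-open (empty access) and fail-closed (HTTP 424).
    """
    for attempt in range(attempts):
        try:
            return call()
        except retriable as exc:
            if attempt == attempts - 1:
                raise  # retries exhausted: surface the error to the caller
            delay = base_delay * (2 ** attempt)
            log.warning("Kessel call failed (%s); retrying in %.2fs", exc, delay)
            time.sleep(delay)
```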
| Risk | Description | Mitigation |
|---|---|---|
| Backward compatibility | `/status` simplified to `{"status": "OK"}` — breaking for clients expecting the detailed response. | Document in release notes. |
| Trino JVM config | `GCLockerRetryAllocationCount` removed; may affect stability. | Verify with the Trino team. |
- Clarify resource ID semantics: Document that the `openshift_cluster` resource_id must be `cluster_id` (not the provider UUID) to align with report queries. Fix provider_builder.
- Fail-closed option: Consider a configurable mode (e.g., `KESSEL_FAIL_CLOSED=true`) that returns HTTP 424 when Kessel is unreachable, for high-security deployments.
- Schema versioning: Add `KESSEL_SCHEMA_VERSION` to settings and document the upgrade path for schema changes.
- Audit role seeding: Verify that all 5 system roles from `seed-roles.yaml` are correctly wired in the ZED schema and that custom roles can be created via the future Bridge.
- Integration visibility: Ensure the `integration.read` computed permission correctly cascades from `has_cluster` and `has_project`; validate with E2E tests.
- Extract cluster_id helper: Add `get_cluster_id_from_provider(provider)` to avoid duplication and ensure correct extraction in provider_builder.
- Raise KesselConnectionError on connection failure: Consider raising when the gRPC channel fails or when all resource types fail, so middleware can return 424 for observability.
- Resource reporter auth: Ensure `get_http_auth_headers()` is used for Relations API DELETE (it is for `_delete_resource_tuples`); verify POST also uses it.
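The recommended `get_cluster_id_from_provider` helper might look like the sketch below. The attribute path follows the review's description of `instance.authentication.credentials` and should be treated as an assumption about the provider model.

```python
def get_cluster_id_from_provider(provider):
    """Extract the OCP cluster_id for Kessel reporting.

    Falls back to the provider UUID only when credentials carry no
    cluster_id; that fallback reintroduces the ID-mismatch risk, so
    real code should log it loudly.
    """
    credentials = getattr(provider.authentication, "credentials", None) or {}
    cluster_id = credentials.get("cluster_id")
    if cluster_id:
        return str(cluster_id)
    return str(provider.uuid)  # last resort; misaligns with query-layer filters
```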
- Health check: Add `/api/cost-management/v1/kessel/health` or similar that probes Kessel Inventory API reachability.
- Metrics: Add Prometheus counters for Kessel Check/StreamedListObjects latency, errors, and cache hit rate.
- Runbook: Document "Kessel is down" scenarios and recovery steps.
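A minimal stand-in for the suggested instrumentation is sketched below; in production these counters would be `prometheus_client` Counters and Histograms rather than a plain dict.

```python
import time
from collections import defaultdict

# Per-RPC call count, error count, and cumulative latency (illustrative only).
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(name, call):
    """Record call count, error count, and wall-clock latency for a Kessel RPC."""
    start = time.perf_counter()
    try:
        return call()
    except Exception:
        METRICS[name]["errors"] += 1
        raise  # preserve the caller's fail-open/fail-closed handling
    finally:
        METRICS[name]["calls"] += 1
        METRICS[name]["total_seconds"] += time.perf_counter() - start
```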
- Kessel dev stack: `dev/kessel/docker-compose.yml` and a README are present; ensure they work with `pipenv run` and local testing.
- Contract tests: Keep contract tests for Inventory API v1beta2; run against real Kessel in CI when available.
- Resource ID for openshift_cluster: Should the Kessel resource_id for `openshift_cluster` be `cluster_id` (from credentials) or `provider_uuid`? The query layer filters by cluster_id; using provider_uuid would require mapping in the access layer.
- Role seeding ownership: Who provides the tooling to seed roles from `rbac-config/roles/cost-management.json` into Kessel for on-prem? Is there a Helm hook or script that operators run?
- Cache invalidation: When an admin assigns a resource to a team via the future ReBAC Bridge, how will Koku's cache be invalidated? Is there a webhook or pub/sub, or do we rely on TTL only?
- KesselConnectionError: Should `KesselAccessProvider` raise `KesselConnectionError` when the gRPC connection fails (e.g., channel creation or all RPCs fail), so operators get HTTP 424 instead of a silent deny?
- Structural tuples via Relations API: The ADR says the Relations API doesn't support structural relationships. Does Koku's `create_structural_tuple` (Relations API REST) work for `has_cluster`/`has_project`, or does it require direct SpiceDB access?
- Provider deletion and Kessel cleanup: When a provider is deleted via the Sources API, is `on_resource_deleted` called? The provider_builder `destroy_provider` uses `ProviderManager.remove`; does that trigger resource cleanup in Kessel?
- Multi-cluster provider: For OCP-on-AWS (one provider, multiple clusters), how are clusters reported? One `openshift_cluster` per cluster_id, or one per provider?
- ENHANCED_ORG_ADMIN: Is `ENHANCED_ORG_ADMIN` ever True in on-prem? The docs say it must be False when using Kessel.
- ReBAC Bridge timeline: When is the ReBAC Bridge expected? Without it, how do operators manage groups and resource assignments today?
- Upstream schema PRs: PR #5933 references rbac-config#737 and inventory-api#1243. What is the merge timeline, and how will Koku handle the transition when upstream schema changes?
| Dimension | Score | Rationale |
|---|---|---|
| Architecture quality | 8/10 | Well-structured adapter pattern, clear separation of Kessel vs RBAC paths, comprehensive ZED schema. Minor gaps: resource ID semantics, ReBAC Bridge not yet delivered. |
| Implementation quality | 7/10 | Clean code, good test coverage, proper error handling in most paths. Issue: provider_builder may use wrong resource ID for OCP cluster. |
| Operational readiness | 6/10 | Documentation is strong; role seeding and Kessel deployment require operator expertise. No health check endpoint; cache staleness may surprise admins. |
| Security model | 8/10 | SpiceDB as source of truth; no client-side bypass. Identity validation and org scoping are correct. ENHANCED_ORG_ADMIN must be disabled. |
The Kessel ReBAC integration is architecturally sound and implements a clean authorization abstraction. The design documents are thorough and the implementation follows established patterns. The main concerns are:
- Verify/fix OCP cluster resource_id — ensure it aligns with query layer expectations (cluster_id vs provider_uuid).
- Operational readiness — role seeding, health checks, and cache behavior need clear operator guidance.
- ReBAC Bridge dependency — management plane (groups, resource assignment) is not yet available; document workarounds.
Recommendation: Address the resource ID question and add a health check before production rollout. The design is suitable for production with these clarifications and the ReBAC Bridge (or equivalent management tooling) for day-two operations.
- Model used: composer-1.5
- Generated on: 2026-03-10