Kessel ReBAC Integration — Technical Architecture & Implementation Review

Date: 2026-03-10
Reviewer: Technical Architecture Review
Scope: Kessel authorization (ReBAC) integration for Cost Management (Koku) on-prem
References: PR #5933, kessel-ocp-integration.md, rebac-bridge-design.md


1. High-Level Architecture Review

1.1 Problem Statement

Koku's on-prem deployment previously depended on the SaaS RBAC service, which is unavailable outside cloud.redhat.com. The integration replaces this with Kessel (SpiceDB-based ReBAC) to provide:

  • Fine-grained access control — workspace-based, resource-specific permissions
  • Relationship-based authorization — principals, roles, workspaces, and resources modeled as a graph
  • On-prem independence — no dependency on external SaaS authorization services

1.2 Role of Kessel

Kessel is Red Hat's platform-level authorization system built on SpiceDB (Zanzibar-inspired). It provides:

  • Inventory API (gRPC): Check, StreamedListObjects, ReportResource, DeleteResource
  • Relations API (REST + gRPC): tuple CRUD for SpiceDB relationships
  • ZED schema: declarative authorization model (resources, relations, permissions)

Kessel is the single source of truth for authorization decisions in on-prem Koku.

1.3 Authorization Decision Flow

User Request
    │
    ▼
Koku API (Django)
    │
    ▼
IdentityHeaderMiddleware
    │
    ├─► get_access_provider() → KesselAccessProvider (ONPREM) or RBACAccessProvider (SaaS)
    │
    ▼
KesselAccessProvider.get_access_for_user()
    │
    ├─► For each resource type:
    │   ├─► Check(rbac/workspace:{org_id}, permission, rbac/principal:{user})  [workspace-level]
    │   └─► StreamedListObjects(resource_type, relation, principal)           [per-resource fallback]
    │
    ▼
Kessel Inventory API (gRPC)
    │
    ▼
SpiceDB (authorization engine)
    │
    ▼
Decision: access map { "openshift.cluster": {"read": ["*"] | ["id1","id2"] }, ... }
    │
    ▼
request.user.access populated → Permission classes & query layer apply filters

1.4 ReBAC Bridge Responsibility

The ReBAC Bridge (described in rebac-bridge-design.md) is a separate Go microservice — not part of this PR. It provides:

  • insights-rbac v1 compatible REST API for roles, groups, principals, access
  • Translation from high-level RBAC operations to SpiceDB tuples
  • Management plane for on-prem admins (group creation, role assignment, resource assignment)

This PR implements the Koku application layer: KesselAccessProvider, resource_reporter, middleware, and integration hooks. The ReBAC Bridge is a future deliverable.

1.5 Resources and Relationships Modeled

Resource Type | Kessel Type | Relations | Purpose
OCP Cluster | cost_management/openshift_cluster | t_workspace → org workspace | Cluster visibility
OCP Node | cost_management/openshift_node | t_workspace, has_cluster | Node visibility
OCP Project | cost_management/openshift_project | t_workspace, has_cluster | Project/namespace visibility
Integration | cost_management/integration | t_workspace, has_cluster, has_project | Source visibility (computed)
Cost Model | cost_management/cost_model | t_workspace | Cost model visibility
Settings | rbac/workspace | Check-only (capability) | Settings access
AWS/Azure/GCP | Pre-provisioned in schema | t_workspace | Future cloud provider support

Key relationships:

  • resource#t_workspace → rbac/workspace:{org_id} — primary org-level visibility
  • integration#has_cluster → openshift_cluster:{id} — structural containment for computed permissions
  • rbac/role_binding#t_binding → rbac/workspace — role bindings scoped to workspaces
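To make the key relationships above concrete, here is a minimal sketch that renders two of them in the SpiceDB-style textual form type:id#relation@type:id. The RelationTuple class and the tuples_for_cluster helper are hypothetical names used only for illustration; they are not taken from the PR, and the actual Relations API payload shape may differ.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RelationTuple:
    """One relationship: resource#relation -> subject (hypothetical helper)."""

    resource_type: str
    resource_id: str
    relation: str
    subject_type: str
    subject_id: str

    def as_zed(self) -> str:
        # SpiceDB-style textual form: type:id#relation@type:id
        return (
            f"{self.resource_type}:{self.resource_id}"
            f"#{self.relation}@{self.subject_type}:{self.subject_id}"
        )


def tuples_for_cluster(cluster_id: str, org_id: str, integration_id: str) -> list[RelationTuple]:
    """Build the workspace tuple and the structural tuple for one cluster."""
    return [
        # resource#t_workspace -> rbac/workspace:{org_id}
        RelationTuple(
            "cost_management/openshift_cluster", cluster_id,
            "t_workspace", "rbac/workspace", org_id,
        ),
        # integration#has_cluster -> openshift_cluster:{id}
        RelationTuple(
            "cost_management/integration", integration_id,
            "has_cluster", "cost_management/openshift_cluster", cluster_id,
        ),
    ]


for t in tuples_for_cluster("my-ocp-cluster-1", "org123", "int-1"):
    print(t.as_zed())
```

The sketch uses cluster_id as the resource ID, which is the alignment the review argues for in section 2.1.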

2. Authorization Model Evaluation

2.1 Resource Modeling

Strengths:

  • OCP resources (cluster, node, project) map cleanly to Kessel types
  • Integration as first-class resource with structural relationships enables computed visibility (project access → cluster → integration)
  • Pre-provisioned schema for AWS, Azure, GCP supports future expansion
  • Permission hierarchy (_view = _read + _all + all_read + all_all) matches SaaS RBAC semantics

Concerns:

  • Provider UUID vs cluster_id: provider_builder._report_ocp_resource() passes str(instance.uuid) (provider UUID) as the resource_id for openshift_cluster. The architecture docs and query layer expect cluster_id (e.g., "my-ocp-cluster-1"). The API filters by cluster_id in report queries. This may cause a mismatch — StreamedListObjects would return provider UUIDs, but the query layer filters by cluster_id from the database.

2.2 Relationship Definitions

  • t_workspace — resource belongs to workspace (primary visibility)
  • has_cluster, has_project — structural containment for integration
  • t_parent — workspace hierarchy (team workspaces inherit from org)
  • t_binding, t_granted, t_subject — role binding chain

The model supports team-based access grants and cross-team resource sharing via multiple t_workspace tuples per resource.

2.3 Permission Mapping

RBAC Type | Kessel Permission | Granularity
openshift.cluster read | cost_management_openshift_cluster_view | Appropriate
openshift.cluster * | cost_management_openshift_cluster_all | Appropriate
settings | Check-only on workspace | Capability, not per-resource

Permissions are neither overly coarse nor overly granular for the current scope.

2.4 Tenancy Isolation

  • Org scoping: workspace_id = org_id; all resources and role bindings are org-scoped
  • Schema isolation: Koku's tenant schemas (org{org_id}) remain separate; Kessel uses rbac/workspace:{org_id} as the authorization boundary
  • No cross-org leakage: StreamedListObjects and Check are scoped to the workspace; SpiceDB enforces relationship boundaries

2.5 Scalability of Relationship Graph

  • SpiceDB handles millions of tuples; the workspace model limits tuple count to resources × workspaces (not resources × users)
  • Batch StreamedListObjects per resource type avoids O(N) per-resource Check calls
  • Cache (300s TTL) reduces Kessel load for repeated requests

2.6 Extensibility

  • Schema is additive; new resource types can be added without breaking existing deployments
  • KOKU_TO_KESSEL_TYPE_MAP and IMMEDIATE_WRITE_TYPES are centralized for easy extension
  • ReBAC Bridge design allows new resource assignment endpoints

2.7 Authorization Bypass Risks

Risk | Mitigation
Kessel unavailable | Fail-open per-type: failing types return no access; other types proceed. Cache mitigates transient outages.
Malformed identity | Middleware validates x-rh-identity; missing/invalid → 401
Principal format | redhat/{username} convention is consistent; Keycloak is source of truth for user existence
ENHANCED_ORG_ADMIN | When True, admin bypasses access lookup; must be False when using Kessel

Note: KesselConnectionError is defined and caught by middleware (HTTP 424), but KesselAccessProvider never raises it — it catches all exceptions internally and returns empty access. The fail-open behavior is documented but differs from the original fail-closed (424) recommendation.


3. On-Prem Integration Concerns

3.1 Dependency Footprint

Component | Purpose
Kessel Inventory API | gRPC (Check, StreamedListObjects, ReportResource, DeleteResource)
Kessel Relations API | REST (t_workspace, structural tuples)
SpiceDB | Backend (never accessed directly by Koku)
Keycloak | OAuth2 client_credentials for Kessel API auth (when KESSEL_AUTH_ENABLED)

New Python deps: kessel-sdk (gRPC stubs), grpcio, requests (already present)

3.2 Operational Complexity

  • Deployment: Requires Kessel stack (SpiceDB + Inventory API + Relations API) + ZED schema + role seeding
  • Configuration: ONPREM=true, AUTHORIZATION_BACKEND=rebac, KESSEL_INVENTORY_*, KESSEL_RELATIONS_*, optional KESSEL_AUTH_*
  • Role seeding: Platform responsibility; no auto-seeding on Kessel-only deployments — operators must run kessel-admin.sh seed-roles or equivalent

3.3 Required Services

  • Kessel Inventory API (gRPC, default 9081)
  • Kessel Relations API (REST, default 8100)
  • SpiceDB (backend for both)
  • Keycloak (for Kessel API auth when enabled)

3.4 Failure Modes When Kessel Is Unavailable

Scenario | Behavior
Cache HIT | Cached access used; request proceeds
Cache MISS, Check/StreamedListObjects fails | Per-type fail-open: that type returns no access ([] or no wildcard); other types still queried. User sees no data for affected types, not incorrect data.
KesselConnectionError raised | Middleware catches it → HTTP 424 Failed Dependency (but current code path never raises it)

3.5 Caching

  • Cache backend: CacheEnum.kessel when AUTHORIZATION_BACKEND=rebac
  • TTL: 300 seconds (from settings.CACHES["kessel"]["TIMEOUT"])
  • Key: {user.uuid}_{org_id}
  • Invalidation: Per-request; no explicit invalidation on role/resource changes (stale for up to 5 minutes)
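The caching behavior listed above can be illustrated with a small in-memory stand-in. This is only a sketch under assumptions: Koku uses Django's cache framework rather than this TTLCache class, and access_cache_key is a hypothetical helper that mirrors the documented {user.uuid}_{org_id} key format.

```python
import time


def access_cache_key(user_uuid, org_id):
    """Mirror the documented key format {user.uuid}_{org_id}."""
    return f"{user_uuid}_{org_id}"


class TTLCache:
    """Tiny in-memory cache with a fixed TTL (stand-in for the Django cache)."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}

    def set(self, key, value):
        # Store the value together with its absolute expiry time.
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry so the next request goes back to Kessel.
            del self._entries[key]
            return None
        return value
```

Because nothing removes an entry before its expiry, a role or resource change made just after a cache write stays invisible for up to the full 300 seconds, which is the staleness window noted above.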

3.6 Synchronous Authorization

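  • All Kessel calls are synchronous: get_access_for_user blocks until all resource types are resolved
  • Latency: worst case ~N × (Check + StreamedListObjects), where N = number of resource types (~11); the Check-first pattern skips StreamedListObjects for any type whose workspace Check succeeds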
  • Mitigated by cache; first request per user/org pays full cost

3.7 Latency Impact

  • First request (cache miss): Multiple gRPC round-trips; expect 100–500 ms depending on network
  • Cached requests: No Kessel calls
  • Recommendation: Monitor p95 latency for get_access_for_user and Kessel Check/StreamedListObjects

4. Implementation Review (PR #5933)

4.1 Integration Layer

Strengths:

  • Clean adapter pattern: get_access_provider() returns KesselAccessProvider or RBACAccessProvider; middleware and permission classes unchanged
  • KOKU_TO_KESSEL_TYPE_MAP centralizes type mapping
  • Workspace resolution (ShimResolver, RbacV2Resolver) abstracts org_id → workspace_id
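The adapter pattern described above might look roughly like the following. This is a sketch, not the PR's code: the provider bodies are placeholders, and the selection rule (ONPREM together with AUTHORIZATION_BACKEND=rebac) is an assumption inferred from the configuration listed in section 3.2.

```python
class RBACAccessProvider:
    """Placeholder for the SaaS RBAC-backed provider."""

    def get_access_for_user(self, user):
        return {}  # the real provider would call the RBAC service


class KesselAccessProvider:
    """Placeholder for the Kessel-backed provider."""

    def get_access_for_user(self, user):
        return {}  # the real provider would call the Kessel Inventory API


def get_access_provider(onprem, authorization_backend):
    """Pick a provider from settings; callers only see get_access_for_user()."""
    if onprem and authorization_backend == "rebac":
        return KesselAccessProvider()
    return RBACAccessProvider()


provider = get_access_provider(onprem=True, authorization_backend="rebac")
print(type(provider).__name__)
```

Since both providers expose the same method, the middleware and permission classes do not need to know which backend answered.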

Issues:

  • provider_builder OCP resource ID: _report_ocp_resource(str(instance.uuid), self.org_id) uses provider UUID. The query layer and API filter by cluster_id. The Kessel resource for openshift_cluster should use cluster_id from instance.authentication.credentials.get("cluster_id") to align with report filtering. Same for _report_integration(..., str(instance.uuid), ...) — the has_cluster subject should be cluster_id.
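One way to centralize the extraction suggested above is a small helper. The sketch below assumes the provider object exposes authentication.credentials as a dict, as the expression quoted in this section implies; the SimpleNamespace objects are stand-ins for the real Django model instances.

```python
from types import SimpleNamespace


def get_cluster_id_from_provider(provider):
    """Return the OCP cluster_id from provider credentials, or None if absent."""
    authentication = getattr(provider, "authentication", None)
    credentials = getattr(authentication, "credentials", None) or {}
    # Treat an empty string the same as a missing value.
    return credentials.get("cluster_id") or None


# Stand-in objects for illustration only.
ocp_provider = SimpleNamespace(
    uuid="00000000-0000-0000-0000-000000000001",
    authentication=SimpleNamespace(credentials={"cluster_id": "my-ocp-cluster-1"}),
)
print(get_cluster_id_from_provider(ocp_provider))
```

Returning None rather than falling back to the provider UUID keeps the caller responsible for deciding what to report when no cluster_id exists, so a wrong identifier is never written to Kessel silently.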

4.2 Kessel Client Usage

  • Singleton KesselClient with double-checked locking
  • gRPC channel supports TLS + OAuth2 call credentials when KESSEL_AUTH_ENABLED
  • Relations API uses requests.post for tuple creation; auth headers from get_http_auth_headers()
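For readers unfamiliar with the singleton approach mentioned above, double-checked locking in Python can be sketched as follows. KesselClientSketch and its get() method are hypothetical; the real KesselClient would also create the gRPC channel and credentials, which this sketch omits.

```python
import threading


class KesselClientSketch:
    """Illustrative singleton; not the PR's KesselClient."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self, target):
        self.target = target  # a real client would open the gRPC channel here

    @classmethod
    def get(cls, target="localhost:9081"):
        # First check without the lock keeps the common path cheap.
        if cls._instance is None:
            with cls._lock:
                # Second check under the lock: another thread may have
                # created the instance while this one was waiting.
                if cls._instance is None:
                    cls._instance = cls(target)
        return cls._instance
```

Every caller receives the same instance, so arguments passed after the first call are ignored; that is worth keeping in mind if settings can change at runtime or between tests.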

4.3 Authorization Request Flow

  • Check-first pattern: For per-resource types, workspace Check runs first; if allowed → wildcard; if denied → StreamedListObjects for specific IDs. Reduces unnecessary StreamedListObjects for org-wide admins.
  • Write-grants-read: When write access is granted, read is also populated
  • Settings: Check-only (no StreamedListObjects)
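The Check-first and write-grants-read rules above can be sketched for a single resource type as below. The check and list_objects callables are assumed stand-ins for the workspace Check and StreamedListObjects calls, and the way write IDs are merged into read is one plausible reading of "read is also populated", not necessarily what the PR does.

```python
WILDCARD = "*"


def resolve_access(check, list_objects):
    """Resolve read/write access for one resource type.

    check(operation) -> bool        stands in for the workspace-level Check
    list_objects(operation) -> list stands in for StreamedListObjects
    """
    access = {}
    for operation in ("read", "write"):
        if check(operation):
            # Workspace-level grant: no need to enumerate resources.
            access[operation] = [WILDCARD]
        else:
            # Fall back to the specific resource IDs the user can reach.
            access[operation] = list(list_objects(operation))

    # Write-grants-read: anything writable is also readable.
    if WILDCARD in access["write"]:
        access["read"] = [WILDCARD]
    elif WILDCARD not in access["read"]:
        extra = [rid for rid in access["write"] if rid not in access["read"]]
        access["read"] = access["read"] + extra
    return access


ids = {"read": ["cluster-a"], "write": ["cluster-b"]}
print(resolve_access(lambda op: False, lambda op: ids[op]))
```

For an org-wide admin both Check calls succeed, so the listing call is never made for that type.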

4.4 Error Handling

  • KesselAccessProvider: All exceptions caught in _check_workspace_permission and _streamed_list_objects; return False / []. No propagation.
  • resource_reporter: gRPC/HTTP errors logged; never propagated. Provider creation/deletion succeeds even if Kessel sync fails.
  • Gap: No retry logic for transient Kessel failures; no circuit breaker.
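A retry wrapper of the kind this gap calls for could be as small as the sketch below. The function name, the default attempt count and delays, and the choice of retryable exceptions are all assumptions; a real implementation would more likely retry on specific gRPC status codes such as UNAVAILABLE.

```python
import time


def call_with_retry(fn, attempts=3, base_delay=0.1,
                    retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fn(), retrying on transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            # Delay doubles each time: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * (2 ** attempt))
```

Wrapping Check and StreamedListObjects this way would let a brief network blip recover within the request instead of producing empty access that is then cached for the full TTL.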

4.5 Separation of Concerns

  • Authorization logic: Isolated in koku_rebac; permission classes and views delegate to request.user.access
  • Business logic: Provider creation, cost queries unchanged; hooks (on_resource_created, on_resource_deleted) are called at integration points
  • No leakage: Views do not import Kessel directly; they rely on middleware-populated access

4.6 Maintainability

  • Readability: Clear module structure; docstrings explain Check-first pattern and dual-write
  • Test coverage: Unit tests for access_provider, client, config, resource_reporter, middleware; contract tests; E2E regression
  • Extensibility: Adding a new resource type requires: (1) KOKU_TO_KESSEL_TYPE_MAP, (2) IMMEDIATE_WRITE_TYPES if needed, (3) hook calls in appropriate lifecycle points

4.7 Security

  • Privilege escalation: Permission checks flow through SpiceDB; no client-side override
  • Validation: Identity header validated; org_id from identity is trusted for workspace resolution
  • Bypass: ENHANCED_ORG_ADMIN bypasses access lookup; must be disabled for Kessel

Potential bypass: If KesselAccessProvider returns empty access for all types (e.g., Kessel is down and every call raises), request.user.access is empty. The middleware tests (not request.user.admin and not request.user.access), so empty access raises PermissionDenied for non-admins. Despite the "fail-open" label, a per-type failure never grants access; it denies access for the affected types. This is the correct behavior.
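The middleware condition described above can be reproduced in isolation. In this sketch PermissionDenied is a local stand-in for django.core.exceptions.PermissionDenied, and enforce_access is a hypothetical function name.

```python
class PermissionDenied(Exception):
    """Local stand-in for django.core.exceptions.PermissionDenied."""


def enforce_access(is_admin, access):
    """Deny non-admins whose access map is empty; otherwise pass it through."""
    if not is_admin and not access:
        raise PermissionDenied("user has no access")
    return access


try:
    enforce_access(is_admin=False, access={})
except PermissionDenied as exc:
    print("denied:", exc)
```

One Python detail is worth checking against the real provider: a dict that has keys but only empty lists, such as {"openshift.cluster": {"read": []}}, is truthy, so this condition only denies when the access map itself is empty.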


5. Risks and Architectural Weaknesses

Critical

Risk | Description | Mitigation
Resource ID mismatch | provider_builder reports openshift_cluster with provider UUID instead of cluster_id. Query layer filters by cluster_id. StreamedListObjects may return IDs that don't match DB columns. | Extract cluster_id from instance.authentication.credentials and use it for openshift_cluster resource_id and has_cluster subject.

High

Risk | Description | Mitigation
Role seeding gap | On-prem has no auto-seeding; operators must manually create role instances. Blocks deployment if not documented. | Document in operator guide; provide kessel-admin.sh seed-roles or Helm hook; verify platform tooling.
Cache staleness | 5-minute TTL; role/resource changes take up to 5 minutes to take effect. | Document; consider shorter TTL or cache invalidation on admin actions (ReBAC Bridge).
Kessel API gaps | ADR documents: no structural relationship support in Relations API; schema deployment not integrated. | Track upstream; use workarounds (Koku writes structural tuples via REST).

Medium

Risk | Description | Mitigation
No retry for Kessel | Transient gRPC/HTTP failures cause immediate empty access. | Add retry with backoff for Check/StreamedListObjects; consider circuit breaker.
Principal prefix fragility | redhat/ hardcoded in multiple codebases. | Extract to KESSEL_PRINCIPAL_PREFIX env var.
ReBAC Bridge not implemented | Management plane (groups, roles, resource assignment) requires manual Kessel API or future Bridge. | Document kessel-admin.sh usage; prioritize Bridge delivery.
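For the principal prefix item in the table above, the proposed extraction could be sketched as below. KESSEL_PRINCIPAL_PREFIX is the variable name suggested in this review and does not exist in the PR; principal_id is a hypothetical helper, and defaulting to redhat/ is an assumption that preserves current behavior.

```python
import os


def principal_id(username, environ=os.environ):
    """Build the Kessel principal ID, with the prefix configurable via env."""
    prefix = environ.get("KESSEL_PRINCIPAL_PREFIX", "redhat/")
    return f"{prefix}{username}"


print(principal_id("alice", environ={}))
```

Any other codebase that writes principals (for example the future ReBAC Bridge) would need to read the same setting, otherwise tuples written with one prefix would not match checks made with another.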

Low

Risk | Description | Mitigation
Backward compatibility | /status simplified to {"status": "OK"}; breaking for clients expecting detailed response. | Document in release notes.
Trino JVM config | GCLockerRetryAllocationCount removed; may affect stability. | Verify with Trino team.

6. Suggested Improvements

Architecture

  1. Clarify resource ID semantics: Document that openshift_cluster resource_id must be cluster_id (not provider UUID) for alignment with report queries. Fix provider_builder.
  2. Fail-closed option: Consider a configurable mode (e.g., KESSEL_FAIL_CLOSED=true) that returns HTTP 424 when Kessel is unreachable, for high-security deployments.
  3. Schema versioning: Add KESSEL_SCHEMA_VERSION to settings and document upgrade path for schema changes.
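The fail-closed option in item 2 could be sketched as a final step after the per-type lookups. Everything here is illustrative: finalize_access is a hypothetical function, None is an assumed marker for a failed lookup, and the local KesselConnectionError class stands in for the one defined in the PR.

```python
class KesselConnectionError(Exception):
    """Local stand-in for the exception the middleware maps to HTTP 424."""


def finalize_access(per_type_results, fail_closed):
    """Combine per-type lookups into an access map.

    per_type_results maps resource type -> access dict, or None when the
    Kessel lookup for that type failed (assumed convention for this sketch).
    """
    failed = sorted(t for t, result in per_type_results.items() if result is None)
    if fail_closed and failed:
        # Surface the outage instead of silently returning reduced access.
        raise KesselConnectionError(f"Kessel lookup failed for: {', '.join(failed)}")
    # Default mode: failed types simply contribute no access.
    return {t: r for t, r in per_type_results.items() if r is not None}
```

With fail_closed read from a setting such as the suggested KESSEL_FAIL_CLOSED, high-security deployments would get an explicit 424 while others keep the current degrade-per-type behavior.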

Authorization Model

  1. Audit role seeding: Verify all 5 system roles from seed-roles.yaml are correctly wired in the ZED schema and that custom roles can be created via the future Bridge.
  2. Integration visibility: Ensure integration.read computed permission correctly cascades from has_cluster and has_project; validate with E2E tests.

Code Structure

  1. Extract cluster_id helper: Add get_cluster_id_from_provider(provider) to avoid duplication and ensure correct extraction in provider_builder.
  2. Raise KesselConnectionError on connection failure: Consider raising when gRPC channel fails or when all resource types fail, so middleware can return 424 for observability.
  3. Resource reporter auth: Ensure get_http_auth_headers() is used for Relations API DELETE (it is for _delete_resource_tuples); verify POST also uses it.

Operational Resilience

  1. Health check: Add /api/cost-management/v1/kessel/health or similar that probes Kessel Inventory API reachability.
  2. Metrics: Add Prometheus metrics for Kessel Check/StreamedListObjects: histograms for latency, counters for errors and for cache hits and misses (to derive hit rate).
  3. Runbook: Document "Kessel is down" scenarios and recovery steps.

Developer Experience

  1. Kessel dev stack: dev/kessel/docker-compose.yml and README are present; ensure they work with pipenv run and local testing.
  2. Contract tests: Keep contract tests for Inventory API v1beta2; run against real Kessel in CI when available.

7. Questions for Design Authors

  1. Resource ID for openshift_cluster: Should the Kessel resource_id for openshift_cluster be cluster_id (from credentials) or provider_uuid? The query layer filters by cluster_id; using provider_uuid would require mapping in the access layer.

  2. Role seeding ownership: Who provides the tooling to seed roles from rbac-config/roles/cost-management.json into Kessel for on-prem? Is there a Helm hook or script that operators run?

  3. Cache invalidation: When an admin assigns a resource to a team via the future ReBAC Bridge, how will Koku's cache be invalidated? Is there a webhook or pub/sub, or do we rely on TTL only?

  4. KesselConnectionError: Should KesselAccessProvider raise KesselConnectionError when gRPC connection fails (e.g., channel creation or all RPCs fail), so operators get HTTP 424 instead of silent deny?

  5. Structural tuples via Relations API: The ADR says Relations API doesn't support structural relationships. Does Koku's create_structural_tuple (Relations API REST) work for has_cluster/has_project, or does it require direct SpiceDB access?

  6. Provider deletion and Kessel cleanup: When a provider is deleted via Sources API, is on_resource_deleted called? The provider_builder destroy_provider uses ProviderManager.remove; does that trigger resource cleanup in Kessel?

  7. Multi-cluster provider: For OCP-on-AWS (one provider, multiple clusters), how are clusters reported? One openshift_cluster per cluster_id, or one per provider?

  8. ENHANCED_ORG_ADMIN: Is ENHANCED_ORG_ADMIN ever True in on-prem? The docs say it must be False when using Kessel.

  9. ReBAC Bridge timeline: When is the ReBAC Bridge expected? Without it, how do operators manage groups and resource assignments today?

  10. Upstream schema PRs: PR #5933 references rbac-config#737 and inventory-api#1243. What is the merge timeline, and how will Koku handle the transition when upstream schema changes?


8. Final Assessment

Dimension | Score | Rationale
Architecture quality | 8/10 | Well-structured adapter pattern, clear separation of Kessel vs RBAC paths, comprehensive ZED schema. Minor gaps: resource ID semantics, ReBAC Bridge not yet delivered.
Implementation quality | 7/10 | Clean code, good test coverage, proper error handling in most paths. Issue: provider_builder may use wrong resource ID for OCP cluster.
Operational readiness | 6/10 | Documentation is strong; role seeding and Kessel deployment require operator expertise. No health check endpoint; cache staleness may surprise admins.
Security model | 8/10 | SpiceDB as source of truth; no client-side bypass. Identity validation and org scoping are correct. ENHANCED_ORG_ADMIN must be disabled.

Summary

The Kessel ReBAC integration is architecturally sound and implements a clean authorization abstraction. The design documents are thorough and the implementation follows established patterns. The main concerns are:

  1. Verify/fix OCP cluster resource_id — ensure it aligns with query layer expectations (cluster_id vs provider_uuid).
  2. Operational readiness — role seeding, health checks, and cache behavior need clear operator guidance.
  3. ReBAC Bridge dependency — management plane (groups, resource assignment) is not yet available; document workarounds.

Recommendation: Address the resource ID question and add a health check before production rollout. The design is suitable for production with these clarifications and the ReBAC Bridge (or equivalent management tooling) for day-two operations.


Review Metadata

  • Model used: composer-1.5
  • Generated on: 2026-03-10

Kessel ReBAC Integration — Technical Architecture & Implementation Review

Date: 2026-03-10
Reviewer: Technical Architecture Review
Scope: Kessel authorization (ReBAC) integration for Cost Management (Koku) on-prem
References: PR #5933, kessel-ocp-integration.md, rebac-bridge-design.md


1. High-Level Architecture Review

1.1 Problem Statement

Koku's on-prem deployment previously depended on the SaaS RBAC service, which is unavailable outside cloud.redhat.com. The integration replaces this with Kessel (SpiceDB-based ReBAC) to provide:

  • Fine-grained access control — workspace-based, resource-specific permissions
  • Relationship-based authorization — principals, roles, workspaces, and resources modeled as a graph
  • On-prem independence — no dependency on external SaaS authorization services

1.2 Role of Kessel

Kessel is Red Hat's platform-level authorization system built on SpiceDB (Zanzibar-inspired). It provides:

  • Inventory API (gRPC)Check, StreamedListObjects, ReportResource, DeleteResource
  • Relations API (REST + gRPC) — tuple CRUD for SpiceDB relationships
  • ZED schema — declarative authorization model (resources, relations, permissions)

Kessel is the single source of truth for authorization decisions in on-prem Koku.

1.3 Authorization Decision Flow

User Request
    │
    ▼
Koku API (Django)
    │
    ▼
IdentityHeaderMiddleware
    │
    ├─► get_access_provider() → KesselAccessProvider (ONPREM) or RBACAccessProvider (SaaS)
    │
    ▼
KesselAccessProvider.get_access_for_user()
    │
    ├─► For each resource type:
    │   ├─► Check(rbac/workspace:{org_id}, permission, rbac/principal:{user})  [workspace-level]
    │   └─► StreamedListObjects(resource_type, relation, principal)           [per-resource fallback]
    │
    ▼
Kessel Inventory API (gRPC)
    │
    ▼
SpiceDB (authorization engine)
    │
    ▼
Decision: access map { "openshift.cluster": {"read": ["*"] | ["id1","id2"] }, ... }
    │
    ▼
request.user.access populated → Permission classes & query layer apply filters

1.4 ReBAC Bridge Responsibility

The ReBAC Bridge (described in rebac-bridge-design.md) is a separate Go microservice — not part of this PR. It provides:

  • insights-rbac v1 compatible REST API for roles, groups, principals, access
  • Translation from high-level RBAC operations to SpiceDB tuples
  • Management plane for on-prem admins (group creation, role assignment, resource assignment)

This PR implements the Koku application layerKesselAccessProvider, resource_reporter, middleware, and integration hooks. The ReBAC Bridge is a future deliverable.

1.5 Resources and Relationships Modeled

Resource Type Kessel Type Relations Purpose
OCP Cluster cost_management/openshift_cluster t_workspace → org workspace Cluster visibility
OCP Node cost_management/openshift_node t_workspace, has_cluster Node visibility
OCP Project cost_management/openshift_project t_workspace, has_cluster Project/namespace visibility
Integration cost_management/integration t_workspace, has_cluster, has_project Source visibility (computed)
Cost Model cost_management/cost_model t_workspace Cost model visibility
Settings rbac/workspace Check-only (capability) Settings access
AWS/Azure/GCP Pre-provisioned in schema t_workspace Future cloud provider support

Key relationships:

  • resource#t_workspace → rbac/workspace:{org_id} — primary org-level visibility
  • integration#has_cluster → openshift_cluster:{id} — structural containment for computed permissions
  • rbac/role_binding#t_binding → rbac/workspace — role bindings scoped to workspaces

2. Authorization Model Evaluation

2.1 Resource Modeling

Strengths:

  • OCP resources (cluster, node, project) map cleanly to Kessel types
  • Integration as first-class resource with structural relationships enables computed visibility (project access → cluster → integration)
  • Pre-provisioned schema for AWS, Azure, GCP supports future expansion
  • Permission hierarchy (_view = _read + _all + all_read + all_all) matches SaaS RBAC semantics

Concerns:

  • Provider UUID vs cluster_id: provider_builder._report_ocp_resource() passes str(instance.uuid) (provider UUID) as the resource_id for openshift_cluster. The architecture docs and query layer expect cluster_id (e.g., "my-ocp-cluster-1"). The API filters by cluster_id in report queries. This may cause a mismatch — StreamedListObjects would return provider UUIDs, but the query layer filters by cluster_id from the database.

2.2 Relationship Definitions

  • t_workspace — resource belongs to workspace (primary visibility)
  • has_cluster, has_project — structural containment for integration
  • t_parent — workspace hierarchy (team workspaces inherit from org)
  • t_binding, t_granted, t_subject — role binding chain

The model supports team-based access grants and cross-team resource sharing via multiple t_workspace tuples per resource.

2.3 Permission Mapping

RBAC Type Kessel Permission Granularity
openshift.cluster read cost_management_openshift_cluster_view Appropriate
openshift.cluster * cost_management_openshift_cluster_all Appropriate
settings Check-only on workspace Capability, not per-resource

Permissions are neither overly coarse nor overly granular for the current scope.

2.4 Tenancy Isolation

  • Org scoping: workspace_id = org_id; all resources and role bindings are org-scoped
  • Schema isolation: Koku's tenant schemas (org{org_id}) remain separate; Kessel uses rbac/workspace:{org_id} as the authorization boundary
  • No cross-org leakage: StreamedListObjects and Check are scoped to the workspace; SpiceDB enforces relationship boundaries

2.5 Scalability of Relationship Graph

  • SpiceDB handles millions of tuples; the workspace model limits tuple count to resources × workspaces (not resources × users)
  • Batch StreamedListObjects per resource type avoids O(N) per-resource Check calls
  • Cache (300s TTL) reduces Kessel load for repeated requests

2.6 Extensibility

  • Schema is additive; new resource types can be added without breaking existing deployments
  • KOKU_TO_KESSEL_TYPE_MAP and IMMEDIATE_WRITE_TYPES are centralized for easy extension
  • ReBAC Bridge design allows new resource assignment endpoints

2.7 Authorization Bypass Risks

Risk Mitigation
Kessel unavailable Fail-open per-type: failing types return no access; other types proceed. Cache mitigates transient outages.
Malformed identity Middleware validates x-rh-identity; missing/invalid → 401
Principal format redhat/{username} convention is consistent; Keycloak is source of truth for user existence
ENHANCED_ORG_ADMIN When True, admin bypasses access lookup; must be False when using Kessel

Note: KesselConnectionError is defined and caught by middleware (HTTP 424), but KesselAccessProvider never raises it — it catches all exceptions internally and returns empty access. The fail-open behavior is documented but differs from the original fail-closed (424) recommendation.


3. On-Prem Integration Concerns

3.1 Dependency Footprint

Component Purpose
Kessel Inventory API gRPC (Check, StreamedListObjects, ReportResource, DeleteResource)
Kessel Relations API REST (t_workspace, structural tuples)
SpiceDB Backend (never accessed directly by Koku)
Keycloak OAuth2 client_credentials for Kessel API auth (when KESSEL_AUTH_ENABLED)

New Python deps: kessel-sdk (gRPC stubs), grpcio, requests (already present)

3.2 Operational Complexity

  • Deployment: Requires Kessel stack (SpiceDB + Inventory API + Relations API) + ZED schema + role seeding
  • Configuration: ONPREM=true, AUTHORIZATION_BACKEND=rebac, KESSEL_INVENTORY_*, KESSEL_RELATIONS_*, optional KESSEL_AUTH_*
  • Role seeding: Platform responsibility; no auto-seeding on Kessel-only deployments — operators must run kessel-admin.sh seed-roles or equivalent

3.3 Required Services

  • Kessel Inventory API (gRPC, default 9081)
  • Kessel Relations API (REST, default 8100)
  • SpiceDB (backend for both)
  • Keycloak (for Kessel API auth when enabled)

3.4 Failure Modes When Kessel Is Unavailable

Scenario Behavior
Cache HIT Cached access used; request proceeds
Cache MISS, Check/StreamedListObjects fails Per-type fail-open: that type returns no access ([] or no wildcard); other types still queried. User sees no data for affected types, not incorrect data.
KesselConnectionError raised Middleware catches it → HTTP 424 Failed Dependency (but current code path never raises it)

3.5 Caching

  • Cache backend: CacheEnum.kessel when AUTHORIZATION_BACKEND=rebac
  • TTL: 300 seconds (from settings.CACHES["kessel"]["TIMEOUT"])
  • Key: {user.uuid}_{org_id}
  • Invalidation: Per-request; no explicit invalidation on role/resource changes (stale for up to 5 minutes)

3.6 Synchronous Authorization

  • All Kessel calls are synchronousget_access_for_user blocks until all resource types are resolved
  • Latency: ~N × (Check + StreamedListObjects) where N = number of resource types (~11)
  • Mitigated by cache; first request per user/org pays full cost

3.7 Latency Impact

  • First request (cache miss): Multiple gRPC round-trips; expect 100–500 ms depending on network
  • Cached requests: No Kessel calls
  • Recommendation: Monitor p95 latency for get_access_for_user and Kessel Check/StreamedListObjects

4. Implementation Review (PR #5933)

4.1 Integration Layer

Strengths:

  • Clean adapter pattern: get_access_provider() returns KesselAccessProvider or RBACAccessProvider; middleware and permission classes unchanged
  • KOKU_TO_KESSEL_TYPE_MAP centralizes type mapping
  • Workspace resolution (ShimResolver, RbacV2Resolver) abstracts org_id → workspace_id

Issues:

  • provider_builder OCP resource ID: _report_ocp_resource(str(instance.uuid), self.org_id) uses provider UUID. The query layer and API filter by cluster_id. The Kessel resource for openshift_cluster should use cluster_id from instance.authentication.credentials.get("cluster_id") to align with report filtering. Same for _report_integration(..., str(instance.uuid), ...) — the has_cluster subject should be cluster_id.

4.2 Kessel Client Usage

  • Singleton KesselClient with double-checked locking
  • gRPC channel supports TLS + OAuth2 call credentials when KESSEL_AUTH_ENABLED
  • Relations API uses requests.post for tuple creation; auth headers from get_http_auth_headers()

4.3 Authorization Request Flow

  • Check-first pattern: For per-resource types, workspace Check runs first; if allowed → wildcard; if denied → StreamedListObjects for specific IDs. Reduces unnecessary StreamedListObjects for org-wide admins.
  • Write-grants-read: When write access is granted, read is also populated
  • Settings: Check-only (no StreamedListObjects)

4.4 Error Handling

  • KesselAccessProvider: All exceptions caught in _check_workspace_permission and _streamed_list_objects; return False / []. No propagation.
  • resource_reporter: gRPC/HTTP errors logged; never propagated. Provider creation/deletion succeeds even if Kessel sync fails.
  • Gap: No retry logic for transient Kessel failures; no circuit breaker.

4.5 Separation of Concerns

  • Authorization logic: Isolated in koku_rebac; permission classes and views delegate to request.user.access
  • Business logic: Provider creation, cost queries unchanged; hooks (on_resource_created, on_resource_deleted) are called at integration points
  • No leakage: Views do not import Kessel directly; they rely on middleware-populated access

4.6 Maintainability

  • Readability: Clear module structure; docstrings explain Check-first pattern and dual-write
  • Test coverage: Unit tests for access_provider, client, config, resource_reporter, middleware; contract tests; E2E regression
  • Extensibility: Adding a new resource type requires: (1) KOKU_TO_KESSEL_TYPE_MAP, (2) IMMEDIATE_WRITE_TYPES if needed, (3) hook calls in appropriate lifecycle points

4.7 Security

  • Privilege escalation: Permission checks flow through SpiceDB; no client-side override
  • Validation: Identity header validated; org_id from identity is trusted for workspace resolution
  • Bypass: ENHANCED_ORG_ADMIN bypasses access lookup; must be disabled for Kessel

Potential bypass: If KesselAccessProvider returns empty access for all types (e.g., Kessel is down and every call raises), request.user.access is empty. The middleware checks not request.user.admin and not request.user.access, so empty access raises PermissionDenied for non-admins. Swallowing errors per type therefore never grants access; it denies access for the affected types. This deny-by-default outcome is the correct behavior.


5. Risks and Architectural Weaknesses

Critical

| Risk | Description | Mitigation |
| --- | --- | --- |
| Resource ID mismatch | provider_builder reports openshift_cluster with provider UUID instead of cluster_id. Query layer filters by cluster_id. StreamedListObjects may return IDs that don't match DB columns. | Extract cluster_id from instance.authentication.credentials and use it for openshift_cluster resource_id and has_cluster subject. |

High

| Risk | Description | Mitigation |
| --- | --- | --- |
| Role seeding gap | On-prem has no auto-seeding; operators must manually create role instances. Blocks deployment if not documented. | Document in operator guide; provide kessel-admin.sh seed-roles or Helm hook; verify platform tooling. |
| Cache staleness | 5-minute TTL; role/resource changes take up to 5 minutes to take effect. | Document; consider shorter TTL or cache invalidation on admin actions (ReBAC Bridge). |
| Kessel API gaps | ADR documents: no structural relationship support in Relations API; schema deployment not integrated. | Track upstream; use workarounds (Koku writes structural tuples via REST). |
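To make the cache staleness mitigation concrete, here is a sketch of a TTL cache that also supports explicit invalidation per org. It is an in-memory stand-in, not Koku's actual cache backend; TTLAccessCache, the (org_id, username) key shape, and invalidate_org are all hypothetical names.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

CacheKey = Tuple[str, str]  # (org_id, username), an assumed key shape


class TTLAccessCache:
    """In-memory access cache with a TTL and per-org invalidation."""

    def __init__(
        self,
        ttl_seconds: float = 300.0,  # mirrors the 5-minute TTL noted above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[CacheKey, Tuple[float, Any]] = {}

    def set(self, key: CacheKey, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: CacheKey) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop and report a miss
            return None
        return value

    def invalidate_org(self, org_id: str) -> None:
        """Drop every cached entry for one org, e.g. after an admin change."""
        for key in [k for k in self._entries if k[0] == org_id]:
            del self._entries[key]
```

An admin action handled by a management component could call invalidate_org so role changes take effect immediately, with the TTL remaining as a safety net.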

Medium

| Risk | Description | Mitigation |
| --- | --- | --- |
| No retry for Kessel | Transient gRPC/HTTP failures cause immediate empty access. | Add retry with backoff for Check/StreamedListObjects; consider circuit breaker. |
| Principal prefix fragility | redhat/ hardcoded in multiple codebases. | Extract to KESSEL_PRINCIPAL_PREFIX env var. |
| ReBAC Bridge not implemented | Management plane (groups, roles, resource assignment) requires manual Kessel API or future Bridge. | Document kessel-admin.sh usage; prioritize Bridge delivery. |

Low

| Risk | Description | Mitigation |
| --- | --- | --- |
| Backward compatibility | /status simplified to {"status": "OK"}; breaking for clients expecting detailed response. | Document in release notes. |
| Trino JVM config | GCLockerRetryAllocationCount removed; may affect stability. | Verify with Trino team. |

6. Suggested Improvements

Architecture

  1. Clarify resource ID semantics: Document that openshift_cluster resource_id must be cluster_id (not provider UUID) for alignment with report queries. Fix provider_builder.
  2. Fail-closed option: Consider a configurable mode (e.g., KESSEL_FAIL_CLOSED=true) that returns HTTP 424 when Kessel is unreachable, for high-security deployments.
  3. Schema versioning: Add KESSEL_SCHEMA_VERSION to settings and document upgrade path for schema changes.
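The fail-closed option in item 2 could look roughly like the sketch below. Everything here is an assumption for illustration: KESSEL_FAIL_CLOSED is the example flag name suggested above, finalize_access is a hypothetical hook, and the exception class is defined locally as a stand-in for the KesselConnectionError this review refers to. The middleware would be responsible for mapping the exception to HTTP 424.

```python
import os
from typing import Dict, List, Mapping, Set


class KesselConnectionError(Exception):
    """Local stand-in: raised when Kessel could not be reached at all."""


def fail_closed_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Read the (hypothetical) KESSEL_FAIL_CLOSED flag; default is off."""
    value = environ.get("KESSEL_FAIL_CLOSED", "false")
    return value.strip().lower() in {"1", "true", "yes"}


def finalize_access(
    access_by_type: Dict[str, List[str]],
    failed_types: Set[str],
    fail_closed: bool,
) -> Dict[str, List[str]]:
    """Raise instead of returning empty access when every lookup failed."""
    all_failed = bool(failed_types) and failed_types >= set(access_by_type)
    if fail_closed and all_failed:
        raise KesselConnectionError(
            f"Kessel lookups failed for all {len(failed_types)} resource types"
        )
    return access_by_type
```

Raising only when every type failed keeps partial outages on the existing deny-per-type path while still letting a total outage surface as a dependency failure rather than a silent deny.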

Authorization Model

  1. Audit role seeding: Verify all 5 system roles from seed-roles.yaml are correctly wired in the ZED schema and that custom roles can be created via the future Bridge.
  2. Integration visibility: Ensure integration.read computed permission correctly cascades from has_cluster and has_project; validate with E2E tests.

Code Structure

  1. Extract cluster_id helper: Add get_cluster_id_from_provider(provider) to avoid duplication and ensure correct extraction in provider_builder.
  2. Raise KesselConnectionError on connection failure: Consider raising when gRPC channel fails or when all resource types fail, so middleware can return 424 for observability.
  3. Resource reporter auth: Ensure get_http_auth_headers() is used for Relations API DELETE (it is for _delete_resource_tuples); verify POST also uses it.

Operational Resilience

  1. Health check: Add /api/cost-management/v1/kessel/health or similar that probes Kessel Inventory API reachability.
  2. Metrics: Add Prometheus metrics for Kessel Check/StreamedListObjects: a latency histogram, an error counter, and cache hit/miss counters (from which the hit rate can be derived).
  3. Runbook: Document "Kessel is down" scenarios and recovery steps.
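As a starting point for the health check in item 1, the sketch below wraps an injected probe callable and reports status plus latency. The probe is an assumption: in practice it might issue a cheap Check call against the Inventory API, but no specific RPC is implied here, and kessel_health is a hypothetical name.

```python
import time
from typing import Callable, Dict


def kessel_health(
    probe: Callable[[], None],
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, object]:
    """Run a reachability probe and summarize the outcome for a health view."""
    started = clock()
    try:
        probe()  # expected to raise if Kessel is unreachable
    except Exception as exc:  # health checks should report, not propagate
        return {
            "status": "unavailable",
            "error": type(exc).__name__,
            "latency_ms": round((clock() - started) * 1000, 1),
        }
    return {"status": "ok", "latency_ms": round((clock() - started) * 1000, 1)}
```

A view built on this could return HTTP 200 for "ok" and 503 for "unavailable", and the same latency figure could feed the metrics proposed in item 2.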

Developer Experience

  1. Kessel dev stack: dev/kessel/docker-compose.yml and README are present; ensure they work with pipenv run and local testing.
  2. Contract tests: Keep contract tests for Inventory API v1beta2; run against real Kessel in CI when available.

7. Questions for Design Authors

  1. Resource ID for openshift_cluster: Should the Kessel resource_id for openshift_cluster be cluster_id (from credentials) or provider_uuid? The query layer filters by cluster_id; using provider_uuid would require mapping in the access layer.

  2. Role seeding ownership: Who provides the tooling to seed roles from rbac-config/roles/cost-management.json into Kessel for on-prem? Is there a Helm hook or script that operators run?

  3. Cache invalidation: When an admin assigns a resource to a team via the future ReBAC Bridge, how will Koku's cache be invalidated? Is there a webhook or pub/sub, or do we rely on TTL only?

  4. KesselConnectionError: Should KesselAccessProvider raise KesselConnectionError when gRPC connection fails (e.g., channel creation or all RPCs fail), so operators get HTTP 424 instead of silent deny?

  5. Structural tuples via Relations API: The ADR says Relations API doesn't support structural relationships. Does Koku's create_structural_tuple (Relations API REST) work for has_cluster/has_project, or does it require direct SpiceDB access?

  6. Provider deletion and Kessel cleanup: When a provider is deleted via Sources API, is on_resource_deleted called? The provider_builder destroy_provider uses ProviderManager.remove; does that trigger resource cleanup in Kessel?

  7. Multi-cluster provider: For OCP-on-AWS (one provider, multiple clusters), how are clusters reported? One openshift_cluster per cluster_id, or one per provider?

  8. ENHANCED_ORG_ADMIN: Is ENHANCED_ORG_ADMIN ever True in on-prem? The docs say it must be False when using Kessel.

  9. ReBAC Bridge timeline: When is the ReBAC Bridge expected? Without it, how do operators manage groups and resource assignments today?

  10. Upstream schema PRs: PR #5933 references rbac-config#737 and inventory-api#1243. What is the merge timeline, and how will Koku handle the transition when upstream schema changes?


8. Final Assessment

| Dimension | Score | Rationale |
| --- | --- | --- |
| Architecture quality | 8/10 | Well-structured adapter pattern, clear separation of Kessel vs RBAC paths, comprehensive ZED schema. Minor gaps: resource ID semantics, ReBAC Bridge not yet delivered. |
| Implementation quality | 7/10 | Clean code, good test coverage, proper error handling in most paths. Issue: provider_builder may use wrong resource ID for OCP cluster. |
| Operational readiness | 6/10 | Documentation is strong; role seeding and Kessel deployment require operator expertise. No health check endpoint; cache staleness may surprise admins. |
| Security model | 8/10 | SpiceDB as source of truth; no client-side bypass. Identity validation and org scoping are correct. ENHANCED_ORG_ADMIN must be disabled. |

Summary

The Kessel ReBAC integration is architecturally sound and implements a clean authorization abstraction. The design documents are thorough and the implementation follows established patterns. The main concerns are:

  1. Verify/fix OCP cluster resource_id — ensure it aligns with query layer expectations (cluster_id vs provider_uuid).
  2. Operational readiness — role seeding, health checks, and cache behavior need clear operator guidance.
  3. ReBAC Bridge dependency — management plane (groups, resource assignment) is not yet available; document workarounds.

Recommendation: Address the resource ID question and add a health check before production rollout. The design is suitable for production with these clarifications and the ReBAC Bridge (or equivalent management tooling) for day-two operations.


Review Metadata

  • Model used: gpt-5.3-codex
  • Generated on: 2026-03-10