Date: 2026-03-10
Reviewer: Technical Architecture Review
Scope: Kessel authorization (ReBAC) integration for Cost Management (Koku) on-prem
References: PR #5933, kessel-ocp-integration.md, rebac-bridge-design.md
Koku's on-prem deployment previously depended on the SaaS RBAC service, which is unavailable outside cloud.redhat.com. The integration replaces this with Kessel (SpiceDB-based ReBAC) to provide:
- Fine-grained access control — workspace-based, resource-specific permissions
- Relationship-based authorization — principals, roles, workspaces, and resources modeled as a graph
- On-prem independence — no dependency on external SaaS authorization services
Kessel is Red Hat's platform-level authorization system built on SpiceDB (Zanzibar-inspired). It provides:
- Inventory API (gRPC) — `Check`, `StreamedListObjects`, `ReportResource`, `DeleteResource`
- Relations API (REST + gRPC) — tuple CRUD for SpiceDB relationships
- ZED schema — declarative authorization model (resources, relations, permissions)
Kessel is the single source of truth for authorization decisions in on-prem Koku.
User Request
│
▼
Koku API (Django)
│
▼
IdentityHeaderMiddleware
│
├─► get_access_provider() → KesselAccessProvider (ONPREM) or RBACAccessProvider (SaaS)
│
▼
KesselAccessProvider.get_access_for_user()
│
├─► For each resource type:
│ ├─► Check(rbac/workspace:{org_id}, permission, rbac/principal:{user}) [workspace-level]
│ └─► StreamedListObjects(resource_type, relation, principal) [per-resource fallback]
│
▼
Kessel Inventory API (gRPC)
│
▼
SpiceDB (authorization engine)
│
▼
Decision: access map { "openshift.cluster": {"read": ["*"] | ["id1","id2"] }, ... }
│
▼
request.user.access populated → Permission classes & query layer apply filters

The ReBAC Bridge (described in rebac-bridge-design.md) is a separate Go microservice — not part of this PR. It provides:
- insights-rbac v1 compatible REST API for roles, groups, principals, access
- Translation from high-level RBAC operations to SpiceDB tuples
- Management plane for on-prem admins (group creation, role assignment, resource assignment)
This PR implements the Koku application layer — KesselAccessProvider, resource_reporter, middleware, and integration hooks. The ReBAC Bridge is a future deliverable.
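The access map that `KesselAccessProvider` ultimately populates on `request.user.access` can be sketched as plain data. The helper below is hypothetical; it only illustrates how the query layer might interpret a wildcard versus an explicit ID list.

```python
# Illustrative shape of the access map (values are examples, not real IDs):
# "*" means org-wide visibility; a list of IDs means per-resource access.
access = {
    "openshift.cluster": {"read": ["*"], "write": []},
    "openshift.project": {"read": ["proj-a", "proj-b"], "write": []},
}


def resource_ids_for(access, resource_type, operation="read"):
    """Hypothetical helper: None means no filtering (wildcard grant);
    otherwise the query layer restricts rows to the returned IDs
    (an empty list yields no rows, i.e., no access for that type)."""
    ids = access.get(resource_type, {}).get(operation, [])
    if "*" in ids:
        return None
    return ids
```

Under this sketch, `resource_ids_for(access, "openshift.cluster")` yields `None` (no filter applied), while `"openshift.project"` yields the two explicit project IDs.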
| Resource Type | Kessel Type | Relations | Purpose |
|---|---|---|---|
| OCP Cluster | `cost_management/openshift_cluster` | `t_workspace` → org workspace | Cluster visibility |
| OCP Node | `cost_management/openshift_node` | `t_workspace`, `has_cluster` | Node visibility |
| OCP Project | `cost_management/openshift_project` | `t_workspace`, `has_cluster` | Project/namespace visibility |
| Integration | `cost_management/integration` | `t_workspace`, `has_cluster`, `has_project` | Source visibility (computed) |
| Cost Model | `cost_management/cost_model` | `t_workspace` | Cost model visibility |
| Settings | `rbac/workspace` | Check-only (capability) | Settings access |
| AWS/Azure/GCP | Pre-provisioned in schema | `t_workspace` | Future cloud provider support |
Key relationships:
- `resource#t_workspace → rbac/workspace:{org_id}` — primary org-level visibility
- `integration#has_cluster → openshift_cluster:{id}` — structural containment for computed permissions
- `rbac/role_binding#t_binding → rbac/workspace` — role bindings scoped to workspaces
Strengths:
- OCP resources (cluster, node, project) map cleanly to Kessel types
- Integration as first-class resource with structural relationships enables computed visibility (project access → cluster → integration)
- Pre-provisioned schema for AWS, Azure, GCP supports future expansion
- Permission hierarchy (`_view = _read + _all + all_read + all_all`) matches SaaS RBAC semantics
Concerns:
- Provider UUID vs cluster_id: `provider_builder._report_ocp_resource()` passes `str(instance.uuid)` (the provider UUID) as the resource_id for `openshift_cluster`. The architecture docs and query layer expect cluster_id (e.g., `"my-ocp-cluster-1"`), and the API filters by `cluster_id` in report queries. This may cause a mismatch — StreamedListObjects would return provider UUIDs, but the query layer filters by cluster_id from the database.
- `t_workspace` — resource belongs to workspace (primary visibility)
- `has_cluster`, `has_project` — structural containment for integration
- `t_parent` — workspace hierarchy (team workspaces inherit from org)
- `t_binding`, `t_granted`, `t_subject` — role binding chain
The model supports team-based access grants and cross-team resource sharing via multiple t_workspace tuples per resource.
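The multi-tuple sharing model above can be illustrated with plain tuple payloads. The dictionary shape below is an assumption modeled on SpiceDB relationship tuples, not the confirmed Relations API schema; consult the Kessel Relations API documentation for the real payload format.

```python
def workspace_tuples(resource_type, resource_id, workspace_ids):
    """Build one t_workspace tuple per workspace a resource is shared with.

    Cross-team sharing is expressed by writing multiple t_workspace tuples
    for the same resource. Payload shape is illustrative (assumption).
    """
    return [
        {
            "resource": {"type": resource_type, "id": resource_id},
            "relation": "t_workspace",
            "subject": {"type": "rbac/workspace", "id": workspace_id},
        }
        for workspace_id in workspace_ids
    ]


# A cluster visible to its org workspace and one team workspace:
tuples = workspace_tuples(
    "cost_management/openshift_cluster", "my-ocp-cluster-1", ["org-123", "team-7"]
)
```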
| RBAC Type | Kessel Permission | Granularity |
|---|---|---|
| `openshift.cluster read` | `cost_management_openshift_cluster_view` | Appropriate |
| `openshift.cluster *` | `cost_management_openshift_cluster_all` | Appropriate |
| `settings` | Check-only on workspace | Capability, not per-resource |
Permissions are neither overly coarse nor overly granular for the current scope.
- Org scoping: `workspace_id = org_id`; all resources and role bindings are org-scoped
- Schema isolation: Koku's tenant schemas (`org{org_id}`) remain separate; Kessel uses `rbac/workspace:{org_id}` as the authorization boundary
- No cross-org leakage: StreamedListObjects and Check are scoped to the workspace; SpiceDB enforces relationship boundaries
- SpiceDB handles millions of tuples; the workspace model limits tuple count to resources × workspaces (not resources × users)
- Batch `StreamedListObjects` per resource type avoids O(N) per-resource Check calls
- Cache (300 s TTL) reduces Kessel load for repeated requests
- Schema is additive; new resource types can be added without breaking existing deployments
- `KOKU_TO_KESSEL_TYPE_MAP` and `IMMEDIATE_WRITE_TYPES` are centralized for easy extension
- ReBAC Bridge design allows new resource assignment endpoints
| Risk | Mitigation |
|---|---|
| Kessel unavailable | Fail-open per-type: failing types return no access; other types proceed. Cache mitigates transient outages. |
| Malformed identity | Middleware validates x-rh-identity; missing/invalid → 401 |
| Principal format | redhat/{username} convention is consistent; Keycloak is source of truth for user existence |
| ENHANCED_ORG_ADMIN | When True, admin bypasses access lookup; must be False when using Kessel |
Note: KesselConnectionError is defined and caught by middleware (HTTP 424), but KesselAccessProvider never raises it — it catches all exceptions internally and returns empty access. The fail-open behavior is documented but differs from the original fail-closed (424) recommendation.
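The per-type fail-open behavior described in the note can be sketched as follows. The client methods (`check_workspace`, `list_objects`) are illustrative stand-ins for the real gRPC calls, not the actual SDK API.

```python
import logging

log = logging.getLogger(__name__)


def access_for_type(client, resource_type, principal):
    """Per-type fail-open: any Kessel error yields empty access for this
    type only; other types are still queried independently.

    Empty access denies data for the affected type (the middleware treats
    a fully empty map as PermissionDenied for non-admins), so a failure
    never *grants* visibility, it only withholds it.
    """
    try:
        if client.check_workspace(resource_type, principal):  # illustrative call
            return {"read": ["*"]}  # workspace-level grant: wildcard
        return {"read": list(client.list_objects(resource_type, principal))}
    except Exception:
        log.exception("Kessel lookup failed for %s; returning no access", resource_type)
        return {"read": []}  # fail-open per type: no data, not wrong data
```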
| Component | Purpose |
|---|---|
| Kessel Inventory API | gRPC (Check, StreamedListObjects, ReportResource, DeleteResource) |
| Kessel Relations API | REST (t_workspace, structural tuples) |
| SpiceDB | Backend (never accessed directly by Koku) |
| Keycloak | OAuth2 client_credentials for Kessel API auth (when KESSEL_AUTH_ENABLED) |
New Python dependencies: `kessel-sdk` (gRPC stubs) and `grpcio`; `requests` is already present.
- Deployment: Requires the Kessel stack (SpiceDB + Inventory API + Relations API) plus the ZED schema and role seeding
- Configuration: `ONPREM=true`, `AUTHORIZATION_BACKEND=rebac`, `KESSEL_INVENTORY_*`, `KESSEL_RELATIONS_*`, optional `KESSEL_AUTH_*`
- Role seeding: Platform responsibility; no auto-seeding on Kessel-only deployments — operators must run `kessel-admin.sh seed-roles` or equivalent
- Kessel Inventory API (gRPC, default 9081)
- Kessel Relations API (REST, default 8100)
- SpiceDB (backend for both)
- Keycloak (for Kessel API auth when enabled)
| Scenario | Behavior |
|---|---|
| Cache HIT | Cached access used; request proceeds |
| Cache MISS, Check/StreamedListObjects fails | Per-type fail-open: that type returns no access ([] or no wildcard); other types still queried. User sees no data for affected types, not incorrect data. |
| KesselConnectionError raised | Middleware catches it → HTTP 424 Failed Dependency (but current code path never raises it) |
- Cache backend: `CacheEnum.kessel` when `AUTHORIZATION_BACKEND=rebac`
- TTL: 300 seconds (from `settings.CACHES["kessel"]["TIMEOUT"]`)
- Key: `{user.uuid}_{org_id}`
- Invalidation: Per-request; no explicit invalidation on role/resource changes (stale for up to 5 minutes)
- All Kessel calls are synchronous — `get_access_for_user` blocks until all resource types are resolved
- Latency: ~N × (Check + StreamedListObjects), where N = number of resource types (~11)
- Mitigated by cache; the first request per user/org pays the full cost
- First request (cache miss): Multiple gRPC round-trips; expect 100–500 ms depending on network
- Cached requests: No Kessel calls
- Recommendation: Monitor p95 latency for `get_access_for_user` and Kessel Check/StreamedListObjects
Strengths:
- Clean adapter pattern: `get_access_provider()` returns `KesselAccessProvider` or `RBACAccessProvider`; middleware and permission classes are unchanged
- `KOKU_TO_KESSEL_TYPE_MAP` centralizes type mapping
- Workspace resolution (`ShimResolver`, `RbacV2Resolver`) abstracts org_id → workspace_id
Issues:
- provider_builder OCP resource ID: `_report_ocp_resource(str(instance.uuid), self.org_id)` uses the provider UUID, but the query layer and API filter by `cluster_id`. The Kessel resource for `openshift_cluster` should use `cluster_id` from `instance.authentication.credentials.get("cluster_id")` to align with report filtering. The same applies to `_report_integration(..., str(instance.uuid), ...)` — the `has_cluster` subject should be the cluster_id.
- Singleton `KesselClient` with double-checked locking
- gRPC channel supports TLS + OAuth2 call credentials when `KESSEL_AUTH_ENABLED`
- Relations API uses `requests.post` for tuple creation; auth headers come from `get_http_auth_headers()`
- Check-first pattern: For per-resource types, workspace Check runs first; if allowed → wildcard; if denied → StreamedListObjects for specific IDs. Reduces unnecessary StreamedListObjects for org-wide admins.
- Write-grants-read: When write access is granted, read is also populated
- Settings: Check-only (no StreamedListObjects)
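The Check-first flow can be sketched as a single function. The client methods here are illustrative stand-ins for the Inventory API's `Check` and `StreamedListObjects` RPCs, not the real SDK signatures.

```python
def resolve_type_access(client, workspace, permission, principal, resource_type):
    """Check-first pattern: one workspace-level Check; only on denial do we
    fall back to StreamedListObjects to enumerate the specific resources the
    principal can see. Org-wide admins therefore cost a single Check call."""
    if client.check(workspace, permission, principal):  # workspace-level Check
        return ["*"]  # wildcard: every resource of this type is visible
    # Denied at the workspace level: enumerate per-resource grants instead.
    return list(client.streamed_list_objects(resource_type, permission, principal))
```

This is why the review notes that StreamedListObjects is a "per-resource fallback": it only runs for principals without an org-wide grant.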
- `KesselAccessProvider`: All exceptions are caught in `_check_workspace_permission` and `_streamed_list_objects`; they return `False`/`[]`. No propagation.
- `resource_reporter`: gRPC/HTTP errors are logged, never propagated. Provider creation/deletion succeeds even if Kessel sync fails.
- Gap: No retry logic for transient Kessel failures; no circuit breaker.
- Authorization logic: Isolated in `koku_rebac`; permission classes and views delegate to `request.user.access`
- Business logic: Provider creation and cost queries are unchanged; hooks (`on_resource_created`, `on_resource_deleted`) are called at integration points
- No leakage: Views do not import Kessel directly; they rely on middleware-populated access
- Readability: Clear module structure; docstrings explain Check-first pattern and dual-write
- Test coverage: Unit tests for access_provider, client, config, resource_reporter, middleware; contract tests; E2E regression
- Extensibility: Adding a new resource type requires (1) a `KOKU_TO_KESSEL_TYPE_MAP` entry, (2) `IMMEDIATE_WRITE_TYPES` if needed, and (3) hook calls at the appropriate lifecycle points
- Privilege escalation: Permission checks flow through SpiceDB; no client-side override
- Validation: Identity header validated; `org_id` from the identity is trusted for workspace resolution
- Bypass: `ENHANCED_ORG_ADMIN` bypasses the access lookup; it must be disabled for Kessel
Potential bypass: If `KesselAccessProvider` returns empty access for all types (e.g., Kessel down, all exceptions), `request.user.access` is empty. The middleware checks `not request.user.admin and not request.user.access` — empty access raises `PermissionDenied` for non-admins. So fail-open per-type does not grant access; it denies access for affected types. This is the correct behavior.
| Risk | Description | Mitigation |
|---|---|---|
| Resource ID mismatch | provider_builder reports `openshift_cluster` with the provider UUID instead of cluster_id. The query layer filters by cluster_id, so StreamedListObjects may return IDs that don't match DB columns. | Extract `cluster_id` from `instance.authentication.credentials` and use it for the `openshift_cluster` resource_id and the `has_cluster` subject. |
| Risk | Description | Mitigation |
|---|---|---|
| Role seeding gap | On-prem has no auto-seeding; operators must manually create role instances. Blocks deployment if not documented. | Document in operator guide; provide kessel-admin.sh seed-roles or Helm hook; verify platform tooling. |
| Cache staleness | 5-minute TTL; role/resource changes take up to 5 minutes to take effect. | Document; consider shorter TTL or cache invalidation on admin actions (ReBAC Bridge). |
| Kessel API gaps | ADR documents: no structural relationship support in Relations API; schema deployment not integrated. | Track upstream; use workarounds (Koku writes structural tuples via REST). |
| Risk | Description | Mitigation |
|---|---|---|
| No retry for Kessel | Transient gRPC/HTTP failures cause immediate empty access. | Add retry with backoff for Check/StreamedListObjects; consider circuit breaker. |
| Principal prefix fragility | `redhat/` hardcoded in multiple codebases. | Extract to a `KESSEL_PRINCIPAL_PREFIX` env var. |
| ReBAC Bridge not implemented | Management plane (groups, roles, resource assignment) requires manual Kessel API or future Bridge. | Document kessel-admin.sh usage; prioritize Bridge delivery. |
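The retry-with-backoff mitigation suggested for transient Kessel failures could be sketched as below. The helper and its parameters are assumptions; a real client would retry on gRPC `UNAVAILABLE`/`DEADLINE_EXCEEDED` statuses rather than `ConnectionError`.

```python
import logging
import time

log = logging.getLogger(__name__)


def with_retries(call, attempts=3, base_delay=0.1, retriable=(ConnectionError,)):
    """Retry a Kessel RPC with exponential backoff on transient errors.

    Only the final failure propagates, so the caller can still choose
    between fail-open (empty access) and fail-closed (HTTP 424).
    """
    for attempt in range(attempts):
        try:
            return call()
        except retriable as exc:
            if attempt == attempts - 1:
                raise  # retries exhausted: surface the error to the caller
            delay = base_delay * (2 ** attempt)
            log.warning("Kessel call failed (%s); retrying in %.2fs", exc, delay)
            time.sleep(delay)
```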
| Risk | Description | Mitigation |
|---|---|---|
| Backward compatibility | `/status` simplified to `{"status": "OK"}` — breaking for clients expecting the detailed response. | Document in release notes. |
| Trino JVM config | `GCLockerRetryAllocationCount` removed; may affect stability. | Verify with the Trino team. |
- Clarify resource ID semantics: Document that the `openshift_cluster` resource_id must be `cluster_id` (not the provider UUID) to align with report queries. Fix provider_builder.
- Fail-closed option: Consider a configurable mode (e.g., `KESSEL_FAIL_CLOSED=true`) that returns HTTP 424 when Kessel is unreachable, for high-security deployments.
- Schema versioning: Add `KESSEL_SCHEMA_VERSION` to settings and document the upgrade path for schema changes.
- Audit role seeding: Verify that all 5 system roles from `seed-roles.yaml` are correctly wired in the ZED schema and that custom roles can be created via the future Bridge.
- Integration visibility: Ensure the `integration.read` computed permission correctly cascades from `has_cluster` and `has_project`; validate with E2E tests.
- Extract cluster_id helper: Add `get_cluster_id_from_provider(provider)` to avoid duplication and ensure correct extraction in provider_builder.
- Raise KesselConnectionError on connection failure: Consider raising when the gRPC channel fails or when all resource types fail, so middleware can return 424 for observability.
- Resource reporter auth: Ensure `get_http_auth_headers()` is used for Relations API DELETE (it is for `_delete_resource_tuples`); verify POST also uses it.
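The recommended `get_cluster_id_from_provider` helper might look like the sketch below. The attribute path follows the review's description of `instance.authentication.credentials` and should be treated as an assumption about the provider model.

```python
def get_cluster_id_from_provider(provider):
    """Extract the OCP cluster_id for Kessel reporting.

    Falls back to the provider UUID only when credentials carry no
    cluster_id; that fallback reintroduces the ID-mismatch risk, so
    real code should log it loudly.
    """
    credentials = getattr(provider.authentication, "credentials", None) or {}
    cluster_id = credentials.get("cluster_id")
    if cluster_id:
        return str(cluster_id)
    return str(provider.uuid)  # last resort; misaligns with query-layer filters
```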
- Health check: Add `/api/cost-management/v1/kessel/health` or similar that probes Kessel Inventory API reachability.
- Metrics: Add Prometheus counters for Kessel Check/StreamedListObjects latency, errors, and cache hit rate.
- Runbook: Document "Kessel is down" scenarios and recovery steps.
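A minimal stand-in for the suggested instrumentation is sketched below; in production these counters would be `prometheus_client` Counters and Histograms rather than a plain dict.

```python
import time
from collections import defaultdict

# Per-RPC call count, error count, and cumulative latency (illustrative only).
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(name, call):
    """Record call count, error count, and wall-clock latency for a Kessel RPC."""
    start = time.perf_counter()
    try:
        return call()
    except Exception:
        METRICS[name]["errors"] += 1
        raise  # preserve the caller's fail-open/fail-closed handling
    finally:
        METRICS[name]["calls"] += 1
        METRICS[name]["total_seconds"] += time.perf_counter() - start
```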
- Kessel dev stack: `dev/kessel/docker-compose.yml` and a README are present; ensure they work with `pipenv run` and local testing.
- Contract tests: Keep contract tests for Inventory API v1beta2; run against real Kessel in CI when available.
- Resource ID for openshift_cluster: Should the Kessel resource_id for `openshift_cluster` be `cluster_id` (from credentials) or `provider_uuid`? The query layer filters by cluster_id; using provider_uuid would require mapping in the access layer.
- Role seeding ownership: Who provides the tooling to seed roles from `rbac-config/roles/cost-management.json` into Kessel for on-prem? Is there a Helm hook or script that operators run?
- Cache invalidation: When an admin assigns a resource to a team via the future ReBAC Bridge, how will Koku's cache be invalidated? Is there a webhook or pub/sub, or do we rely on TTL only?
- KesselConnectionError: Should `KesselAccessProvider` raise `KesselConnectionError` when the gRPC connection fails (e.g., channel creation or all RPCs fail), so operators get HTTP 424 instead of a silent deny?
- Structural tuples via Relations API: The ADR says the Relations API doesn't support structural relationships. Does Koku's `create_structural_tuple` (Relations API REST) work for `has_cluster`/`has_project`, or does it require direct SpiceDB access?
- Provider deletion and Kessel cleanup: When a provider is deleted via the Sources API, is `on_resource_deleted` called? The provider_builder `destroy_provider` uses `ProviderManager.remove`; does that trigger resource cleanup in Kessel?
- Multi-cluster provider: For OCP-on-AWS (one provider, multiple clusters), how are clusters reported? One `openshift_cluster` per cluster_id, or one per provider?
- ENHANCED_ORG_ADMIN: Is `ENHANCED_ORG_ADMIN` ever True in on-prem? The docs say it must be False when using Kessel.
- ReBAC Bridge timeline: When is the ReBAC Bridge expected? Without it, how do operators manage groups and resource assignments today?
- Upstream schema PRs: PR #5933 references rbac-config#737 and inventory-api#1243. What is the merge timeline, and how will Koku handle the transition when upstream schema changes?
| Dimension | Score | Rationale |
|---|---|---|
| Architecture quality | 8/10 | Well-structured adapter pattern, clear separation of Kessel vs RBAC paths, comprehensive ZED schema. Minor gaps: resource ID semantics, ReBAC Bridge not yet delivered. |
| Implementation quality | 7/10 | Clean code, good test coverage, proper error handling in most paths. Issue: provider_builder may use wrong resource ID for OCP cluster. |
| Operational readiness | 6/10 | Documentation is strong; role seeding and Kessel deployment require operator expertise. No health check endpoint; cache staleness may surprise admins. |
| Security model | 8/10 | SpiceDB as source of truth; no client-side bypass. Identity validation and org scoping are correct. ENHANCED_ORG_ADMIN must be disabled. |
The Kessel ReBAC integration is architecturally sound and implements a clean authorization abstraction. The design documents are thorough and the implementation follows established patterns. The main concerns are:
- Verify/fix OCP cluster resource_id — ensure it aligns with query layer expectations (cluster_id vs provider_uuid).
- Operational readiness — role seeding, health checks, and cache behavior need clear operator guidance.
- ReBAC Bridge dependency — management plane (groups, resource assignment) is not yet available; document workarounds.
Recommendation: Address the resource ID question and add a health check before production rollout. The design is suitable for production with these clarifications and the ReBAC Bridge (or equivalent management tooling) for day-two operations.
- Model used: composer-1.5
- Generated on: 2026-03-10