
@RussellSpitzer
Created March 8, 2026 02:03
Apache Iceberg review comments: 58,381 comments from 4,309 merged PRs — source data for AGENTS.md
This file has been truncated; only the first portion is shown.
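The records below follow a simple shape: the top-level JSON object maps PR numbers to lists of comment records, each with `author`, `body`, `path`, `line`, and `type` (`"inline"` for file-anchored comments, `"review_body"` for top-level review text; `path` and `line` may be null). A minimal Python sketch of reading this shape — the embedded sample is copied from the first record in this file, and the iteration pattern is an assumption about how the data is meant to be consumed:

```python
import json

# Minimal sample mirroring this dataset's structure: PR number -> list of
# comment records. The record below is the first comment in this file.
sample = json.loads("""
{"9695": [{"author": "jackye1995", "body": "nit: `PreplanTable`",
           "path": "open-api/rest-catalog-open-api.yaml",
           "line": null, "type": "inline"}]}
""")

for pr_number, comments in sample.items():
    for c in comments:
        kind = c["type"]                 # "inline" or "review_body"
        where = c["path"] or "(whole PR)"  # review_body comments have no path
        print(f"PR {pr_number} [{kind}] {c['author']} on {where}: {c['body']}")
```

To process the full (untruncated) file, the same loop applies after `json.load(open(path))` on the complete JSON document.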
{"9695": [{"author": "jackye1995", "body": "nit: `PreplanTable`", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "I think we should avoid using the wordings like `\"preplan\"`, `\"plan\"`. That makes the documentation a bit informal (at least to me)", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "just nit about the word \"preplan\", looks like it can exist either as one whole word, or two words `pre-plan`: https://www.merriam-webster.com/dictionary/preplan\r\n\r\nI think we should be consistent for this wording in the spec. If `preplan` is the word we will always use, We should always stick to `Preplan` instead of `PrePlan` for camel casing.", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "nit: `PlanTable`", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "I don't think we need to mention anything recommended. However, we should mention that it should be supplied as-is as input of PlanTable", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "I think quite a few things are required, based on FileScanTaskParser.fromJson. \r\n\r\nAnd interestingly data-file is actually not required:\r\n\r\n```\r\nDataFile dataFile = null;\r\n if (jsonNode.has(DATA_FILE)) {\r\n dataFile = (DataFile) ContentFileParser.fromJson(jsonNode.get(DATA_FILE), spec);\r\n }\r\n```\r\n\r\nAlthough I am not sure why the parser decides to allow null for the data file. 
@stevenzwu could you help explain that?", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "I don't think data-sequence-number and file-sequence-number are in the ContentFileParser, why are they added here?", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "similar comment, please check ContentFileParser.fromJson for what are required", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "nit: avoid unrelated line changes. If this is breaking checkstyle, let's put a separated PR to fix it.", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "I don't think this is needed, it can be a part of each plan task's opaque object content", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "To be consistent with other results, I think we should simply say something in the line of \"Result of preplanning a table\"", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "similar comment on description, just \"Result of planning a table\" should be fine", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "\"that user specifies during a query\" seems very specific to query use case. 
I think we can just say \"A list of the selected columns\"", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "missing ref", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "missing from-snapshot-id and to-snapshot-id for incremental read support.", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "missing case-sensitive", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "missing project(Schema)", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "nit: more consistent with other descriptions, like \"result of preplanning a table scan\"", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "nit: more consistent with other descriptions, like \"result of planning a table scan\"", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "I don't think we want to say anything about recommendation here.", "path": "open-api/rest-catalog-open-api.py", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "This is not just a string map I think, it needs proper parsing", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "At least it should not be \r\n\r\n```\r\n additionalProperties:\r\n type: string\r\n```\r\n\r\nAnd ideally we should match the behavior of the parser", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "I am talking about this: https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/TableScan.java#L48", "path": "open-api/rest-catalog-open-api.yaml", "line": null, 
"type": "inline"}, {"author": "jackye1995", "body": "I think this should just be a single plan task, based on the devlist discussion", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "I am not sure if we should directly reference all properties of another request. I'd rather have something like a `PlanContext` that holds all the planning configurations, and let both `PreplanTableRequest` and `PlanTableRequst` have it", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "I am talking about this:\r\n\r\n```\r\nPlanTable w/ a plan task: POST /v1/namespaces/ns/tables/t/plan\r\n{ \"filter\": {\"type\": \"in\", \"term\": \"x\", \"values\": [1, 2, 3] }, \"select\": [\"x\", \"a.b\"], \"plan-task\": { ... } }\r\n\r\n{ \"file-scan-tasks\": [ { ... }, { ... } ] }\r\n```\r\n\r\nSo a single plan task is in the request.", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "dramaticlly", "body": "If your change is based on latest master, can leverage the openAPI spec validation task `:iceberg-open-api:build` I added in #9344 to identify the potential problems.\r\n\r\n", "path": null, "line": null, "type": "review_body"}, {"author": "dramaticlly", "body": "I believe case does not match here and it's `PrePlanTableRequest` below at line 3099, ", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "nit: 1 newline instead of 2", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "nit: 1 newline instead of 3", "path": "open-api/rest-catalog-open-api.yaml", "line": 629, "type": "inline"}, {"author": "jackye1995", "body": "nit: 1 newline instead of 2", "path": "open-api/rest-catalog-open-api.yaml", "line": 797, "type": "inline"}, {"author": "jackye1995", "body": "Nit: remove quotation for as is", 
"path": "open-api/rest-catalog-open-api.py", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: remove \"flexible\". stick to plain fact.", "path": "open-api/rest-catalog-open-api.py", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "in `IncrementalScan`, we have `fromSnapshot` API/semantics of `inclusive` or `exclusive`. do we need to replicate the same behavior here?", "path": "open-api/rest-catalog-open-api.py", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "alias `timestamp-ms` -> `as-of-time`?", "path": "open-api/rest-catalog-open-api.py", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: alias to `use-snapshot`?", "path": "open-api/rest-catalog-open-api.py", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "this summary description would suggest that query should be part of the request body. should it be changed to `plan context`?", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "avoid words like `we`.", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "this sentence reads a little weird. maybe `For distributed planning, use plan-tasks to pass in a portion of scan planning`?", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I agree with Jack. Let's revert this change.", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "It looks like this simply copies the Java API. That's not what we want to do when designing a REST endpoint because it can end up overly complicated. In the Java API we have some options for the caller's convenience, but those don't make sense if translated to an API contract like this.\r\n\r\nThe difference between `select` and `project` is a good example. In Java, both are exposed for different use cases. 
If the caller has a list of columns from SQL, we can accept those via `select`. And if the ca", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think that `plan-task` is incompatible with some of the `PlanContext` arguments. If `plan-task` is present, then the version selection arguments (`snapshot-id` and `snapshot-range`) must be omitted. I'm not sure it makes sense to use a shared set of arguments like this because of the changes to behavior that need to be documented with the extra option.", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I would not include the `stats-fields` list in this request. That option only applies when producing scan tasks. I think that I would not use `PlanContext` given that this is a subset of the arguments and the addition of `plan-task` below changes whether other arguments like `snapshot-id` are allowed.", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Let's keep threads open until they are resolved.", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What is this? Why is it estimated?", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What's the purpose of sending a schema? What schema is this, the table schema, read schema, data file schema, or partition schema?", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why send back the partition spec? Could you send an ID instead?", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I'm not sure that we want to do this on the service. This would cause the service to run fairly expensive analysis and would inflate the response size. 
In some cases, that response size could get really large. For instance, if you send an IN predicate with a large key set.\r\n\r\nThis is also not widely used. In Spark, the original predicate is run for every task instead of the residual.\r\n\r\nI think the solution here is to document that `residual-filter` is optional. If it is not present, the residua", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think that start and length are required. The service sends back the file size in bytes and breaking files into splits should be a client-side operation. That keeps the response a bit simpler.", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think it makes sense for this to be part of `PrimitiveTypeValue`. I also can't think of a situation in which this is going to be used so I think you can remove it.", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "This is reviewed in more details in https://github.com/apache/iceberg/pull/9717, you might want to jump there. This is pending the change there to update this part.", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think that this is necessary.\r\n\r\nThere are two places that might send typed data values:\r\n1. The partition struct of a data file\r\n2. The lower and upper bound maps of a data file\r\n\r\nBoth of those only allow primitive values, so this can be simpler:\r\n```\r\n CountMap:\r\n type: object\r\n properties:\r\n keys:\r\n type: array\r\n items:\r\n type: integer\r\n values:\r\n type: array\r\n items:\r\n type: integer\r\n format: int64\r\n\r\n ValueMap", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Do we really need to enumerate the types? 
To me, this seems like a placeholder for a value that matches a known type and will be parsed/validated using an Iceberg type and that type's single-value JSON encoding.", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why is partition here? It is also in `ContentFile`. I would send it back just once.", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "These 4 fields are all int to long maps. I think that should be a specific type. I called it `CountMap` above.", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "These two are value maps. I have a more strict definition above.", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This should be removed.", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "You can remove this.", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "If `spec-id` is required, should `partition` also be required?", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "You can remove this because the content file has a file-size-in-bytes. The caller can calculate this.", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This should not have a trailing `/`.", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What does it mean for a user to submit a query? I think that this needs to be more direct and clear about what this endpoint specifically does.\r\n\r\n> Scan pre-planning creates a set of opaque planning tasks for a set of scan configuration options. 
Each task can be passed to the plan endpoint to fetch a (disjoint) subset of the file scan tasks for the scan.\r\n>\r\n> Scan pre-planning enables breaking scan planning across multiple tasks. This can be used to parallelize scan planning requests, use fewe", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this also needs a response code that instructs the caller to use the pre-planning endpoint.\r\n\r\nI don't think any of the 3xx responses fit because those are redirecting the request and this indicates that the response type will change. There are some 4xx responses that would make sense:\r\n\r\n* `413\u200a\u2014\u200aPayload Too Large` could indicate that the result is too large. If the request had no `plan-task`, then the solution is to use pre-planning to break down the work. I think this makes sense if t", "path": "open-api/rest-catalog-open-api.yaml", "line": 764, "type": "inline"}, {"author": "rdblue", "body": "Thanks, @rahil-c! I did a thorough review, so there are a lot of comments. This is looking good!", "path": null, "line": null, "type": "review_body"}, {"author": "rdblue", "body": "I suggested a description update below.", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Sorry, I think I was wrong here. Row group offsets are needed for task planning. I confused this information about where files can be split with information about how to split files.", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "So far we have found quite a few places that we might need to diverge from the existing FileScanTaskParser (and other related parsers). 
Let's make sure we get the spec right, and then change the parser implementation or potentially create a different parser.", "path": "open-api/rest-catalog-open-api.yaml", "line": null, "type": "inline"}], "12774": [{"author": "liurenjie1024", "body": "Thanks @pvary for this pr, left some comments, genearlly looks great!", "path": null, "line": null, "type": "review_body"}, {"author": "liurenjie1024", "body": "```suggestion\r\n \"Unable to register object model {} for data files: {}\", objectModel, e.getMessage());\r\n```", "path": "data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java", "line": null, "type": "inline"}, {"author": "liurenjie1024", "body": "This implementation seems odd to me, it throws an exception to avoid duplicated registeration. How about just like this:\r\n```\r\nval ret = OBJECT_MODELS.putIfAbsent(k, v);\r\nif (ret != null) {\r\n log.info(...)\r\n}\r\n```", "path": "data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java", "line": null, "type": "inline"}, {"author": "liurenjie1024", "body": "```suggestion\r\n private static final ConcurrentMap<Key, ObjectModel<?>> OBJECT_MODELS = Maps.newConcurrentMap();\r\n```\r\n\r\nIf we want to allow registering non predefined formats at runtime, it should be a concurrrent map?", "path": "data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java", "line": null, "type": "inline"}, {"author": "liurenjie1024", "body": "Do we really need this?", "path": "data/src/main/java/org/apache/iceberg/data/AppenderBuilder.java", "line": null, "type": "inline"}, {"author": "liurenjie1024", "body": "Do we really need this?", "path": "data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java", "line": null, "type": "inline"}, {"author": "liurenjie1024", "body": "Position deletes's input maybe sorted, should we also have a config key for the case?", "path": "data/src/main/java/org/apache/iceberg/data/PositionDeleteWriterBuilder.java", "line": null, "type": "inline"}, {"author": 
"liurenjie1024", "body": "How is this different from `DATA_WRITER`?", "path": "core/src/main/java/org/apache/iceberg/io/AppenderBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "@danielcweeks and @rdblue specifically asked for harder guardrails around registering object models. This check prevents users to accidentally overwrite embedded object models.\r\n\r\nIf the community has consensus around changing this behavior, I'm happy to do either way.", "path": "data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "The deletes are always expected to be sorted. The writer/appender doesn't sort them. There are several abstraction layers above the actual writers which do the sorting, but at this level we always expect sorted input.", "path": "data/src/main/java/org/apache/iceberg/data/PositionDeleteWriterBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Good point. I included it since I introduced a new mode for every writer mode, but they are always the same.\r\nI will remove this.", "path": "core/src/main/java/org/apache/iceberg/io/AppenderBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Fixed. 
Thx!", "path": "data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "It was concurrent already, but we used only the API provided by the `Map`.\r\nBut agree, that it is better to explicitly define as concurrent.\r\n\r\nFixed", "path": "data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Flink depends on `BaseTaskWriter` to implement the writers, and `BaseTaskWriter` needs `FileAppenderFactory` and `Appender`a to work.\r\nAlso there might be some other engines still depending on the `FileAppenderFactory` and `FileAppender` API which is not deprecated.\r\n\r\nAll-in-all, I think we need this.", "path": "data/src/main/java/org/apache/iceberg/data/AppenderBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Same as above.", "path": "data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java", "line": null, "type": "inline"}, {"author": "liurenjie1024", "body": "Seems reasonable to me. Another point is should we consider throwing an exception if the `ObjectModel` has been registered? Current behavior just prints a message.", "path": "data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java", "line": null, "type": "inline"}, {"author": "liurenjie1024", "body": "Oh, sorry for misclarification. I mean we already have `AppenderBuilder` in iceberg-core, why we need another one here?", "path": "data/src/main/java/org/apache/iceberg/data/AppenderBuilder.java", "line": null, "type": "inline"}, {"author": "liurenjie1024", "body": "Oh, right, this should only focus on format part.", "path": "data/src/main/java/org/apache/iceberg/data/PositionDeleteWriterBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Ehh.. 
that was a mistake on my side.\r\nThe `try/catch` is not needed after the refactors.\r\nFixed.", "path": "data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "There are a few convenience methods in the `data.AppenderBuilder`, like `B set(Map<String, String> properties)`, or `B meta(Map<String, String> properties)` what we don't enforce for the `io.AppenderBuilder` to implement, but these could be handled by default methods.\r\nThe main thing is that we don't want to expose the `<D> FileAppender<D> build(WriteMode mode) throws IOException;` to the data writers.", "path": "data/src/main/java/org/apache/iceberg/data/AppenderBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Tried my hands on moving the WriterMode to the constructor for the appender, but that would mean even more serious refactor of the old codebase. I'm happy to do it, if the community thinks that is the direction moving forward, but currently I'm struggling to find reviewers for the current code. I would be reluctant to add any more big changes unless specifically asked", "path": "data/src/main/java/org/apache/iceberg/data/AppenderBuilder.java", "line": null, "type": "inline"}, {"author": "liurenjie1024", "body": "I think I get your point, and I'm fine with this approach. Would you mind to add doc in this class to explain why we need to have this `AppenderBuilder`?", "path": "data/src/main/java/org/apache/iceberg/data/AppenderBuilder.java", "line": null, "type": "inline"}, {"author": "liurenjie1024", "body": "Thanks @pvary for this pr, LGTM!", "path": null, "line": null, "type": "review_body"}, {"author": "pvary", "body": "Good news!\r\nI have found a way to relatively simply remove the `data.AppenderBuilder`. 
I needed to add `WriteMode mode` parameter to the `ObjectModel.appenderBuilder(OutputFile outputFile, WriteMode mode);`, but I think this is an acceptable tradeoff.\r\n\r\n@liurenjie1024, @jbonofre could you please double check?\r\n\r\nThanks,\r\nPeter", "path": "data/src/main/java/org/apache/iceberg/data/AppenderBuilder.java", "line": null, "type": "inline"}, {"author": "liurenjie1024", "body": "LGTM, just some nits!", "path": null, "line": null, "type": "review_body"}, {"author": "liurenjie1024", "body": "```suggestion\r\n {@link DataWriterBuilder}, {@link EqualityDeleteWriterBuilder}, {@link\r\n```", "path": "data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java", "line": null, "type": "inline"}, {"author": "liurenjie1024", "body": "I would prefer to add them one by one when we implement each model.", "path": "data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Moved a bit.\r\nThanks!", "path": "data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Yeah.. 
that was a mistake.\r\nThanks for catching!", "path": "data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "wondering why this is not called `WriteBuilder`?", "path": "core/src/main/java/org/apache/iceberg/io/AppenderBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "why we are introducing a deprecated method in a brand new interface?", "path": "core/src/main/java/org/apache/iceberg/io/AppenderBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "As an end result, we need to provide 4 interfaces from the ObjectModel:\r\n- Appender - generates a `FileAppender` which just writes the data to an output file\r\n- DataWriter - generates a `DataWriter` which generates DataFiles\r\n- PositionDeleteWriter - generates a `PositionDeleteWriter` which generates DeleteFile\r\n- EqualityDeleteWriter - generates a `EqualityDeleteWriter` which generates DeleteFile\r\n\r\nThe PR doesn't touch the results (`FileAppender`, `DataWriter`, `PositionDeleteWriter`, `Equalit", "path": "core/src/main/java/org/apache/iceberg/io/AppenderBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "There are several public APIs where the contract requires us to keep the current `overwrite(boolean)`. These APIs will be deprecated, but in 1.10 we still need to provide them. OTOH we are planning to use the new API internally in every case, so as a temporary measure we will provide `overwrite(boolean)`. When the old APIs are removed, we will not need the new method in the new API, so it could be removed and streamlined.\r\n", "path": "core/src/main/java/org/apache/iceberg/io/AppenderBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "left some initial comments on the interfaces. 
will still need to take a look at the other bigger PR to understand more on the work as a whole.", "path": null, "line": null, "type": "review_body"}, {"author": "stevenzwu", "body": "`ObjectModel` seems too generic and doesn't really capture the responsibility of this factory class. maybe `DataIOFactory`?\r\n\r\n`FileIO` is already taken for another context. Otherwise, `FileIOFactory` may be good.", "path": "core/src/main/java/org/apache/iceberg/io/ObjectModel.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I see. It is odd that the symmetric counter part of `ReadBuilder` is not `WriteBuilder`.\r\n\r\nour current class naming is a bit inconsistent in terms of `Writer` vs `Appender`. E.g. we have `ParquetWriter implements FileAppender`. In my mind, `FileAppender` should be named as `FileWriter` for write data to a specific file format. And the current `FileWriter` should be called `ContentFileWriter`, because `ContentFile` is a data, position delete, or equality delete file. But I guess we might be too ", "path": "core/src/main/java/org/apache/iceberg/io/AppenderBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "shouldn't this be in the same package as `ObjectModel` in iceberg-core module?", "path": "data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "wondering if we need the `WriteMode` enum. 
Somewhere downstream, there is probably a switch case.\r\n\r\nInstead, we can have 3 different methods like `dataWriteBuilder`, `positionDeleteWriteBuilder`, and `equalityDeleteWriteBuilder`.\r\n\r\nIf we want to stick with a enum arg, we can probably just use the existing `FileContent` enum.", "path": "core/src/main/java/org/apache/iceberg/io/ObjectModel.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: inconsistent naming: append and read (or) appender and reader.", "path": "core/src/main/java/org/apache/iceberg/io/ObjectModel.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: add the acronym in the comment. When I first saw `aadPrefix`, I thought it is a typo of `addPrefix` :)\r\n```\r\nadditional authentication data (aad)\r\n```", "path": "core/src/main/java/org/apache/iceberg/io/AppenderBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "maybe `dataSchema` is better than `engineSchema`?", "path": "core/src/main/java/org/apache/iceberg/io/AppenderBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "inconsistent naming. some methods use `with`, while some don't", "path": "core/src/main/java/org/apache/iceberg/io/ReadBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "ContentWriteBuilder? also, can we merge the `WriteBuildBase` into this interface too?", "path": "data/src/main/java/org/apache/iceberg/data/FileWriteBuilderBase.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Maybe `DataFileIOFactory`? But I would like to hear more voices on this.\r\n\r\nI definitely don't like `FileIOFactory`. We don't create `FileIO`s with it.", "path": "core/src/main/java/org/apache/iceberg/io/ObjectModel.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "If I remember correctly @danielcweeks had a comment suggesting to move as much as possible to `data`. 
I agree with him, as I feel that the `core` module starts to get bloated, and everything is pushed there.\r\n\r\nThe `data` module already contains classes for accessing data independently of the actual File Format. I think the ObjectModelRegistry has the same function.\r\n\r\nAnother possibility to separate the classes based on the user/implementer.\r\n\r\nThe `ObjectModel`, `AppenderBuilder`, `ReadBuilder", "path": "data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java", "line": null, "type": "inline"}, {"author": "liurenjie1024", "body": "I prefer `DataIOFactory`, `DataFileIOFactory` is a little odd to me", "path": "core/src/main/java/org/apache/iceberg/io/ObjectModel.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "This is a very good point.\r\nThe values in the Mode kept shrinking, and now it can be replaced by `FileContent`.\r\n\r\nThanks for the suggestion!", "path": "core/src/main/java/org/apache/iceberg/io/ObjectModel.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "These are the methods which are not pushed down to the `Appender`. They were different in the previous API, and I have kept it so, but I'm open to change it. 
This would mean more code change when migrating to the new API, but I am fine if the community decides so", "path": "core/src/main/java/org/apache/iceberg/io/ReadBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Good point.\r\nAfter removing the Appender from the data module, we don't need the base.", "path": "data/src/main/java/org/apache/iceberg/data/FileWriteBuilderBase.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Also, shall we use aadPrefix/aADPrefix/withAadPrefix in the new API?\r\n\r\nCurrently it is `withAADPrefix`", "path": "core/src/main/java/org/apache/iceberg/io/ReadBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "We have a matrix between the `File Format` X `Engine Internal Object Model`.\r\nI don't really like the `DataIOFactory` as it doesn't contain any of the `File Format` or `Engine Internal Object Model`, but agree with you that even `ObjectModel` is a bit odd.", "path": "core/src/main/java/org/apache/iceberg/io/ObjectModel.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Went ahead and used `FileContent`.\r\n\r\nAbout the 3 different method approach:\r\n- We would like to expose the same Appender interface for Flink (or other users) that it is provided by the File Format. 
There we don't want any mention of \"FileContent\" or \"WriteMode\" or multiple builder methods.\r\n\r\nThat is why I opted for passing the mode in the constructor.\r\n", "path": "core/src/main/java/org/apache/iceberg/io/ObjectModel.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Went ahead, and removed the `with` prefix from everywhere.\r\nIt's much more consistent, and the users should not know about the internal details.", "path": "core/src/main/java/org/apache/iceberg/io/ReadBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "I needed to define a boundary for this change.\r\n\r\nMy decision was to keep the `FileAppender` (for writes), and the `CloseableIterable` (for writes) which are both widely used in the codebase.\r\nIn every other place I tried to stick to the API defined by the InternalData API where WriteBuilder and ReadBuilder was defined.\r\n\r\nThis dictates many of the choices/discrepancies you highlight here.\r\n\r\nIf we stick to the `FileAppender`, should we build it with `AppenderBuilder`, or `WriteBuilder`? Shall w", "path": "core/src/main/java/org/apache/iceberg/io/AppenderBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "since we are doing a major refactoring of writer interface, I am wondering if there is an appetite for revisiting the current writer class organization. I just found it weird to have the opposite side of `ReadBuilder` to be `AppenderBuilder` and not `WriteBuilder`.\r\n\r\n> FileAppender should be named as `FileWriter` for write data to a specific file format. 
And the current `FileWriter` should be called `ContentFileWriter` or `ContentWriter`, because it handles `ContentFile` which is a data, positi", "path": "core/src/main/java/org/apache/iceberg/io/AppenderBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I understand the concern of including `Data` would that it might be confused with `Data/PositionDelete/EqualityDelete` writer. This factory is focused on file format (like Parquet).\r\n\r\nmaybe `FileAccessFactory`. access could cover both read and write/appender.", "path": "core/src/main/java/org/apache/iceberg/io/ObjectModel.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "oh. this also include the `Engine Internal Object Model` dimension. instead of bundling both dimensions in one interface, I am wondering if it can be split to two separate interfaces: one for file format access factory and another for engine specific model. maybe the engine specific implementation can encapsulate the file format access factory as a class member.", "path": "core/src/main/java/org/apache/iceberg/io/ObjectModel.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "if we can separate out the file format handling and the engine in-memory object model, then this class (file format specific) can be called `FileFormatIOFactory` or `FileAccessFactory`.\r\n\r\nEngine specific in-memory model doesn't need to be registered in the `ObjectModelRegistry` in the `iceberg-data` module. probably only file format factories needs to be registered in core/data module.\r\n\r\nEngine can define and use its own interface local to the engine module.", "path": "core/src/main/java/org/apache/iceberg/io/ObjectModel.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "If we rename all of the `FileAppender` occurrences to `FileWriter` we need to modify 130 files (not checked every file, but seems correct by randomly checking files). 
Seems a bit excessive to me.\r\n\r\nWe can still decide to call the class WriteBuilder and return FileAppender (this is how it is done in InternalData)", "path": "core/src/main/java/org/apache/iceberg/io/AppenderBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "I tried several ways, but the information required from the engines is different for different file formats:\r\n\r\nFor Avro:\r\n```\r\n public ObjectModel(\r\n String name,\r\n BiFunction<org.apache.iceberg.Schema, Map<Integer, ?>, DatumReader<?>> readerFunction,\r\n BiFunction<Schema, E, DatumWriter<?>> writerFunction,\r\n BiFunction<Schema, E, DatumWriter<?>> deleteRowWriterFunction)\r\n```\r\n\r\nFor Parquet:\r\n```\r\n private ObjectModel(\r\n String name,\r\n ReaderFuncti", "path": "core/src/main/java/org/apache/iceberg/io/ObjectModel.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Changed the `AppenderBuilder` to `WriteBuilder`, but kept the `FileAppender` as it is.", "path": "core/src/main/java/org/apache/iceberg/io/AppenderBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Renamed the `ObjectModel` to `FileAccessFactory` as I find this as a best candidate for now.\r\n\r\nAlso updated a ton of javadoc, so it is easier to understand the responsibilities of the different components.", "path": "core/src/main/java/org/apache/iceberg/io/ObjectModel.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "sounds good, this seems like a good choice for now. 
`FileAppender` and `FileWriter` renaming are disruptive and probably can be discussed separately", "path": "core/src/main/java/org/apache/iceberg/io/AppenderBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: is `fileSchema` a little more clear?\r\n\r\nSo the assumption is that the file writer would convert Iceberg schema to the underneath file schema like (Parquet `MessageType`)?\r\n", "path": "core/src/main/java/org/apache/iceberg/formats/ModelWriteBuilder.java", "line": 43, "type": "inline"}, {"author": "stevenzwu", "body": "we changed the method name to `dataSchema`. but `engine` is still referenced in many places (arg and Javadoc) in this class. maybe they can be renamed to `data` as well?", "path": "core/src/main/java/org/apache/iceberg/io/WriteBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "`D` is not defined. should `D` be added to the class generic type params?\r\n\r\nIf we use Flink as an example, `E` would be Flink `RowType` and `D` would be `RowData`?\r\n", "path": "core/src/main/java/org/apache/iceberg/io/WriteBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: I see both `aadPrefix` and `fileAADPrefiix` are used in existing code. maybe `fileAADPrefix` is a little more appealing to the eye and avoid the confusion with typo. It is also more consistent with the above method naming `fileEncryptionKey`", "path": "core/src/main/java/org/apache/iceberg/io/WriteBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "should we break the `caseSensitive` to a separate method and remove this one? that seems to be the approach for the existing code.", "path": "core/src/main/java/org/apache/iceberg/io/ReadBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "```\r\n * <p>If {@link #rowSchema(Schema)} is configured, the position delete records should include the\r\n * content of the deleted rows. 
These row values should match the engine schema specified via\r\n * {@link #dataSchema(Object)} and will be converted to the target Iceberg schema defined by\r\n * {@link #rowSchema(Schema)}.\r\n```\r\n\r\nMaybe move the Javadoc from the method below to this method. do we just need a boolean flag? otherwise, we need to validate that `rowSchema` matches `dataSchema", "path": "data/src/main/java/org/apache/iceberg/data/PositionDeleteWriteBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I know this is the practice of existing code. but these constants have been repeated in multiple classes (for each file format). should we define some constants in the `iceberg-data` module?", "path": "data/src/main/java/org/apache/iceberg/data/ContentFileWriteBuilderImpl.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "this interface is highly redundant with the new `WriteBuilder` from core module. maybe extract some base interface, like the `WriterBuilderBase`. the base interface would need to live in core so that the `WriteBuilder` can use it.", "path": "data/src/main/java/org/apache/iceberg/data/ContentFileWriteBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "does it make sense to have those 3 additional interfaces? can they be merged into the `ContentFileWriteBuilder` interface if these 3 builder interfaces aren't needed outside the data module?", "path": "data/src/main/java/org/apache/iceberg/data/ContentFileWriteBuilderImpl.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Why should the writer need to know the content type? 
Shouldn't it just need a schema?", "path": "core/src/main/java/org/apache/iceberg/io/WriteBuilder.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Why are we adding in a deprecated method to a new interface?", "path": "core/src/main/java/org/apache/iceberg/io/WriteBuilder.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "\r\n```suggestion\r\n * Sets the encryption key used for writing the file. If the writer does not support encryption,\r\n```", "path": "core/src/main/java/org/apache/iceberg/io/WriteBuilder.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Do we need a special class here? We already have Pair I think which basically does this and isn't private?", "path": "data/src/main/java/org/apache/iceberg/data/FileAccessFactoryRegistry.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "I think the current api is misleading. The case sensitivity is not working with column names. The columns are identified by ids, or name mapping, but never by names. So the reader only uses the case sensitivity for the filters. I wanted to highlight this with this change in the API.", "path": "core/src/main/java/org/apache/iceberg/io/ReadBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Discussed with @RussellSpitzer, and we can't deprecate deletes for now. We will support V2 tables for quite a while, so we need this on the API.", "path": "data/src/main/java/org/apache/iceberg/data/PositionDeleteWriteBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "This will be the only place for these constants after the refactor and deprecation.\r\nSo I would keep them as it is for now.", "path": "data/src/main/java/org/apache/iceberg/data/ContentFileWriteBuilderImpl.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "I intentionally try to keep a very separate API for the content file writers from the actual writers. 
In my mental model they are just \"accidentally\" similar. One is for the file formats to implement, one is for the users to write content files.", "path": "data/src/main/java/org/apache/iceberg/data/ContentFileWriteBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "@rdblue asked to separate out them as they have different configuration possibilities, and I agree with him. One doesn't need to set the `equalityFieldIds` for the `PositionDeleteWriteBuilder`. ", "path": "data/src/main/java/org/apache/iceberg/data/ContentFileWriteBuilderImpl.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "There are 2 main things:\r\n- WriterContext - the writer needs the info if it is a data or a delete file we are writing. Based on this, different table properties are used, like `write.delete.parquet.row-group-size-bytes` or `write.parquet.row-group-size-bytes`\r\n- The logic of writing the positional deletes are dependent on the file format `PositionDeleteStructWriter` for Parquet, `PositionAndRowDatumWriter` for Avro, `GenericOrcWriters.positionDelete` for ORC. They need different handling/parame", "path": "core/src/main/java/org/apache/iceberg/io/WriteBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "There are several public APIs where the contract requires us to keep the current `overwrite(boolean)`. These APIs will be deprecated, but in 1.10 we still need to provide them. OTOH we are planning to use the new API internally in every case, so as a temporary measure we will provide `overwrite(boolean)`. When the old APIs are removed, we will not need the new method in the new API, so it could be removed and streamlined.", "path": "core/src/main/java/org/apache/iceberg/io/WriteBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "The `WriteBuilder` appender will change the input type based on the ContentType parameter passed on the constructor. 
If `D` is the type of the object model records, then for `DATA` and `EQUALITY_DELETES` the appender will expect `D`, but for `POSITION_DELETES` it will expect `PositionDelete<D>`.\r\nAlso this would change the method signature for the existing ORC/Parquet/Avro`<D> CloseableIterable<D> ReadBuilder.build()`, <D> FileAppender<D> WriteBuilder.build()` methods. This makes creating backwa", "path": "core/src/main/java/org/apache/iceberg/io/WriteBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Actually the `WriteBuilder` has this responsibility. The file format specific code converts the Iceberg schema to the file format specific schema. In many cases (not in Avro) the file format specific schema is passed to the engine specific code and use as a base for conversion.", "path": "core/src/main/java/org/apache/iceberg/formats/ModelWriteBuilder.java", "line": 43, "type": "inline"}, {"author": "pvary", "body": "Went through the doc, and in most of the places changed the engine-specific to `input/output`, but kept the engine in some places, because that helps understanding the reason behind this schema.", "path": "core/src/main/java/org/apache/iceberg/io/WriteBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "moved them to `fileAADPrefix` in both the `WriteBuilder` and the `ContentFileWriteBuilder`", "path": "core/src/main/java/org/apache/iceberg/io/WriteBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Also moved/rewritten the javadoc", "path": "data/src/main/java/org/apache/iceberg/data/PositionDeleteWriteBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Thanks for catching. 
Fixed", "path": "core/src/main/java/org/apache/iceberg/io/WriteBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Moved to using Pair.\r\nOriginally I have chickened out using the Pair because of the schema cache, but after double checking, I have realized that it is not used in our case.\r\nThanks for the suggestion!", "path": "data/src/main/java/org/apache/iceberg/data/FileAccessFactoryRegistry.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "the `FileAccessFactoryRegistry` already have different write builder methods for different content type (data, equality, position delete). it seems reasonable to have similar model for the `FileAccessFactory` too.\r\n\r\nAnyway, I need to spend more time to better understand the current class hierarchy to further review the overall work.", "path": "core/src/main/java/org/apache/iceberg/io/WriteBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "yeah. I agree. There are duplicates because `FileWriter` would encapsulate `FileAppender` object. Hence `FileWriterBuilder` need to forward those to the `FileAppenderBuilder`.", "path": "data/src/main/java/org/apache/iceberg/data/ContentFileWriteBuilder.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Something I've been running into is that the CloseableIterator here means that we can build a reader, but can't access the file metadata if it exists. I'm wondering if this should be shaped a little differently, May CloseableIterator<D> with an HasMetadata or something interface?", "path": "core/src/main/java/org/apache/iceberg/io/ReadBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Ok. 
This is the 3rd, or 4th time that this one was requested:\r\n- https://github.com/apache/iceberg/pull/12298#discussion_r2038212300 - by @danielcweeks\r\n- In our Slack discussions by @stevenzwu \r\n- IIRC even @liurenjie1024 suggested something like this.\r\n\r\nI have originally pushed back on this based on @rdblue's and @danielcweeks's previous discussion on one of the `InternalData` PRs: https://github.com/apache/iceberg/pull/12476#discussion_r1985473714\r\n\r\nAdding a separate interface will help in ", "path": "core/src/main/java/org/apache/iceberg/io/ReadBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think that we should add this and should instead track down and remove cases where we are using the file metadata to recover a partition spec. I think that only happens for unused and deprecated calls in public methods exposed in the core module.", "path": "api/src/main/java/org/apache/iceberg/io/FileReader.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What is being accessed from `iceberg-common`?", "path": "build.gradle", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This list shouldn't be in core, even in a comment. It will just become stale and inaccurate.", "path": "core/src/main/java/org/apache/iceberg/io/FileAccessFactory.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "As above, I don't think that specific object models should be listed here. 
Also, if they are listed anywhere they should be listed once rather than in repeated lists that can get out of sync with the code and one another.", "path": "core/src/main/java/org/apache/iceberg/io/FileAccessFactory.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "It's a best practice to avoid the name `Factory` unless the class is actually used as a factory, meaning that it is created with configuration and then used to create objects using that configuration similar to how the `FileWriterFactory` is used. In addition, it isn't clear what \"access\" means here and I would expect the class to have something to do with _accessing_ files rather than creating builders for a specific object model.\r\n\r\nI see that there was [a thread](https://github.com/apache/ice", "path": "core/src/main/java/org/apache/iceberg/io/FileAccessFactory.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Set the projection schema.", "path": "core/src/main/java/org/apache/iceberg/io/ReadBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think implementations should need to consider this, rather than automatically skipping it in the interface.", "path": "core/src/main/java/org/apache/iceberg/io/ReadBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This should not be ignored by default. Implementations should have to implement this.", "path": "core/src/main/java/org/apache/iceberg/io/ReadBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't agree that the current API is misleading, but I do understand the concern. The main problem is that this needs to change uses that currently call `caseSensitive(boolean)`. 
Also, if there are future cases that need to be case insensitive, I don't think it makes sense for them to each be configured individually given that the user/caller expectation is that all name matching is case sensitive or not.\r\n\r\nFor instance, other interfaces (including `Scan`) also expose `select(String... columns", "path": "core/src/main/java/org/apache/iceberg/io/ReadBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why can't implementations throw `UnsupportedOperationException`?", "path": "core/src/main/java/org/apache/iceberg/io/ReadBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This doesn't appear to be used and also should not be a on the interface.", "path": "core/src/main/java/org/apache/iceberg/io/ReadBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This is the schema class used by the object model, right? I think that could be more obvious from both the name (S?) and description.", "path": "core/src/main/java/org/apache/iceberg/io/FileAccessFactory.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What do we typically use for this type? Is it mostly D or mostly T?", "path": "core/src/main/java/org/apache/iceberg/io/FileAccessFactory.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "> Another possibility could be that we define an intermediate Object Model (maybe something like Arrow), and provide a double transformation File Format -> Arrow -> Engine, and Engine -> Arrow -> File Format. If we don't materialize the intermediate model, then we lose performance only on the double transformation. The issue with this is that it is an even bigger overhaul of the reader/write API, and I expect that there will be a serious performance hit.\r\n\r\nThis is worth exploring. 
It may seem l", "path": "core/src/main/java/org/apache/iceberg/io/ObjectModel.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "DynMethods in the `ObjectModelRegistry`", "path": "build.gradle", "line": null, "type": "inline"}, {"author": "pvary", "body": "Currently ORC doesn't have this option, and automatically reuses the containers (at least based on some comments). Defaulting to not reusing could be a breaking change. If we accept this then we can change this behavior. Changed the comment, and we can deal with this later.", "path": "core/src/main/java/org/apache/iceberg/io/ReadBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Reverted.\r\nCC: @RussellSpitzer, @danielcweeks ", "path": "api/src/main/java/org/apache/iceberg/io/FileReader.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Removed the list", "path": "core/src/main/java/org/apache/iceberg/io/FileAccessFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "changed it to \"for example\", so we don't need to keep the models in sync", "path": "core/src/main/java/org/apache/iceberg/io/FileAccessFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "This is a sizable change. 
I will first revisit https://github.com/apache/iceberg/pull/12774#discussion_r2162731545, and will revert if we still need this", "path": "core/src/main/java/org/apache/iceberg/io/FileAccessFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "I followed the pattern that if something is not mandatory on the API then I provided a default implementation for it.\r\nThis is an edge case, as most reader will have some kind of config, so removed the default implementation.", "path": "core/src/main/java/org/apache/iceberg/io/ReadBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "I followed the pattern that if something is not mandatory on the API then I provided a default implementation for it.", "path": "core/src/main/java/org/apache/iceberg/io/ReadBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Left it here, to show how vectorized reading will be implemented. This could be a common configuration which could be used instead of the original `recordsPerBatch`. The user could use`set(String, String)` to set this. ", "path": "core/src/main/java/org/apache/iceberg/io/ReadBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "For Spark:\r\n- E is StructType\r\n- D is InternalRow/ColumnarBatch\r\n\r\nFor Flink:\r\n- E is RowType\r\n- D is RowData\r\n\r\nE supposed to be engine specific object model type\r\nD supposed to be data type", "path": "core/src/main/java/org/apache/iceberg/io/FileAccessFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "> > Another possibility could be that we define an intermediate Object Model (maybe something like Arrow), and provide a double transformation File Format -> Arrow -> Engine, and Engine -> Arrow -> File Format.[..]\r\n> \r\n> This is worth exploring. It may seem like it would be slower, but vectorized reads are much faster so I think it would be better overall as long as we could adapt to the right object model. 
It may be slightly slower for Avro (that can't vectorize reads and can produce specific ", "path": "core/src/main/java/org/apache/iceberg/io/ObjectModel.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Reverted anyway", "path": "core/src/main/java/org/apache/iceberg/io/FileAccessFactory.java", "line": null, "type": "inline"}], "12808": [{"author": "nastra", "body": "there isn't a bigquery-bundle, so I believe this can be removed", "path": ".github/labeler.yml", "line": null, "type": "inline"}, {"author": "nastra", "body": "this should be referencing the version defined in `libs.versions.toml` via `libs.versions.hadoop3.common`", "path": "build.gradle", "line": null, "type": "inline"}, {"author": "nastra", "body": "this dependency should already be included via https://github.com/apache/iceberg/blob/main/build.gradle#L205, so can be removed here", "path": "build.gradle", "line": null, "type": "inline"}, {"author": "ebyhr", "body": "This method is unused. ", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetaStoreClient.java", "line": null, "type": "inline"}, {"author": "ebyhr", "body": "\"retry on UnknownHostException\" is obvious when reading the code. It would be nice to explain **_why_** we retry on such exceptions. ", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetaStoreClientImpl.java", "line": null, "type": "inline"}, {"author": "ebyhr", "body": "What is `b/399885863`? ", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetaStoreClientImpl.java", "line": null, "type": "inline"}, {"author": "ebyhr", "body": "We could replace these code with lambda `() -> create(dataset)`", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetaStoreClientImpl.java", "line": null, "type": "inline"}, {"author": "ebyhr", "body": "nit: BigQueryMetaStoreClient -> BigQueryMetastoreClient? 
", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetaStoreClient.java", "line": null, "type": "inline"}, {"author": "ebyhr", "body": "Return value of createDataset, setDatasetParameters, removeDatasetParameters are never used. ", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetaStoreClient.java", "line": null, "type": "inline"}, {"author": "ebyhr", "body": "This repository isn't only for Google engineers :)", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetaStoreClientImpl.java", "line": null, "type": "inline"}, {"author": "ebyhr", "body": "What happens if the dataset is changed between L223-237 concurrently? Same for removeDatasetParameters. ", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetaStoreClientImpl.java", "line": null, "type": "inline"}, {"author": "ebyhr", "body": "This method can be static. Same for convertExceptionIfUnsuccessful. ", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetaStoreClientImpl.java", "line": null, "type": "inline"}, {"author": "ebyhr", "body": "Does `locationUri` never end with `/`? ", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "ebyhr", "body": "`commitStatus == BaseMetastoreOperations.CommitStatus.FAILURE` is always true in the current logic. ", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryTableOperations.java", "line": null, "type": "inline"}, {"author": "ebyhr", "body": "How about renaming to FakeBigQueryMetaStoreClient? ", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/FakeBigQueryMetaStoreClient.java", "line": null, "type": "inline"}, {"author": "ebyhr", "body": "Unused methods. 
", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/FakeBigQueryMetaStoreClient.java", "line": null, "type": "inline"}, {"author": "ebyhr", "body": "```suggestion\r\n assertThat(metadataLocation).isPresent();\r\n```", "path": "bigquery/src/test/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreTestUtils.java", "line": null, "type": "inline"}, {"author": "ebyhr", "body": "Can we include the table name in the error message? ", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetaStoreClientImpl.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "we typically use `Preconditions.checkArgument` for argument validations like these", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "let's omit the `PROPERTIES_KEY` prefix as I don't think that helps in readability. Also the project doesn't use something like this in any other place around the codebase", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "why is this and the gcp-location needed in this class for BigQuery? 
I think these changes here should be reverted as this class holds stuff for the FileIO implementation and not for a particular catalog impl", "path": "gcp/src/main/java/org/apache/iceberg/gcp/GCPProperties.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "please use // for inline comments like these", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n private String catalogName;\r\n```", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "IMO this should be rather `Preconditions.checkArgument(.. != null, errorMsg)` as it is clearer for a reader of the error msg to know why something isn't a valid argument. ", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreClientImpl.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "TBH I find those TODOs with that custom tag a bit confusing for readers of the code as it's like adding references to internal JIRA tickets. I understand that this is meant for Google engineers but I don't think stuff like this should be added to Iceberg becuse only a subset of people know what it actually refers to.", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreUtils.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "does it make sense to add `testing-enabled` and `filter-unsupported-tables` to the Iceberg codebase? 
I would suggest that we find a way on how we can prevent introducing those properties", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "IMO it's better to use `Preconditions.checkArgument(null != ..., \"Invalid BigQuery client: null\")` because it indicates that a wrong argument was passed in a clearer way than a NPE", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "this should be defined in `libs.versions.toml` as `hive4-metastore` and a separate `hive4` that points to that version", "path": "build.gradle", "line": null, "type": "inline"}, {"author": "nastra", "body": "what does the comment mean here?", "path": "build.gradle", "line": null, "type": "inline"}, {"author": "nastra", "body": "TBH I don't really understand why all of these custom tests are needed and also why certain tests are Disabled as they all pass (maybe the disabling was from the old PR?). I was able to trim down the test class to the diff below and everything was passing.\r\n```\r\npublic class TestBigQueryCatalog extends CatalogTests<BigQueryMetastoreCatalog> {\r\n\r\n @TempDir private File tempFolder;\r\n private BigQueryMetastoreCatalog catalog;\r\n\r\n @BeforeEach\r\n public void before() {\r\n catalog = initCatalog(\"", "path": "bigquery/src/test/java/org/apache/iceberg/gcp/bigquery/TestBigQueryCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I don't think any of these tests that use optional fields are needed here. 
The goal here is to verify all of the functionality at the catalog level", "path": "bigquery/src/test/java/org/apache/iceberg/gcp/bigquery/TestBigQueryCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "method names should follow Java conventions and not contain _ in their names", "path": "bigquery/src/test/java/org/apache/iceberg/gcp/bigquery/BigQueryTableOperationsTest.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "maybe just `Dataset create(Dataset dataset)` as it's obvious from the parameter that this creates a Dataset?", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "same as above, maybe just name it `delete(DatasetReference datasetReference)`?", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "`load(DatasetReference datasetReference)`?", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n Table loadTable(TableReference tableReference);\r\n```\r\nthe codebase typically tries to not use `get` prefixes, so `load` might be better here", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "what does patching a table mean? Should this maybe be called `updateTable`?", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "TBH I don't think this flag is actually needed. 
Since you already have defined another `initialize` method that takes `BigQueryMetastoreClient` as a parameter, you can pass the fake client there for testing", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "the hive catalog uses `list-all-tables`, would it make sense to name it the same way here?", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I would probably just remove this comment", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I don't think I understand this comment. Isn't it really because BigQuery only support single-level namespaces? At least that's what is being indicated in `validateNamespace`. So maybe just remove this comment to avoid confusion for readers?", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "nit: please add a newline after a closing }. See also https://iceberg.apache.org/contribute/#block-spacing", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": 155, "type": "inline"}, {"author": "nastra", "body": "this should probably also be checked in `dropNamespace`/ `createNamespace`.", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "this needs to check whether the namespace exists or not and throw a `NoSuchNamespaceException` if the namespace doesn't exist. 
I also realized we don't have a test for this in `CatalogTests`, so I opened https://github.com/apache/iceberg/pull/12873", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": 224, "type": "inline"}, {"author": "nastra", "body": "ah nvm, I missed that the validation is already part of `toDatasetReference`", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "it seems a bit weird to handle IAE like this and then throw a NSNE", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": 278, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n private Namespace toNamespace(Datasets dataset) {\r\n```", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n private Map<String, String> toMetadata(Dataset dataset) {\r\n```", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "is this the same as `gcp-location`?", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": 332, "type": "inline"}, {"author": "nastra", "body": "this TODO still refers to an internal ticketing system", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "both comments seem rather unnecessary to have", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "why not check if the namespaces actually exist via `namespaceExists(..)`?", "path": 
"bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "why not check this via `tableExists()`?", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I would probably move this check up to when the other namespace checks are done", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\npublic class BigQueryCatalog extends BaseMetastoreCatalog\r\n```", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "this can be private as it's not used anywhere else", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryTableOperations.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "maybe just pass the `TableReference` directly to the constructor instead of having to pass three different parameters", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryTableOperations.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n private boolean hiveEngineEnabled(TableMetadata metadata) {\r\n```", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryTableOperations.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "you might want to double-check whether cleanup is done correctly here. My local `bigquery` directory had leftover folders after running these tests", "path": "bigquery/src/test/java/org/apache/iceberg/gcp/bigquery/TestBigQueryCatalog.java", "line": 53, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Should these be in GCPProperties? 
", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Can we follow a similar pattern to what the `GlueCatalog` implementation does? We can have an initialize whihc accepts the client, and for testing purposes we pass in a mocked client. Instead of having to handle some \"testing-property\" and branch off of that", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Seems like we do that below, do we need the testingEnabled flag then? Doesn't seem like we should need it", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "ebyhr", "body": "That is not true since `assertThat(Optional.empty())` always passes. ", "path": "bigquery/src/test/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreTestUtils.java", "line": null, "type": "inline"}, {"author": "ebyhr", "body": "These constants can be private. ", "path": "bigquery/src/test/java/org/apache/iceberg/gcp/bigquery/TestBigQueryCatalog.java", "line": null, "type": "inline"}, {"author": "ebyhr", "body": "> drop with purge\r\n\r\nI would remove this comment. It's obvious from the code. 
", "path": "bigquery/src/test/java/org/apache/iceberg/gcp/bigquery/TestBigQueryCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I would probably also update the naming here and for the remove and omit the `Dataset` from the method name", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I don't think we should keep this flag as we don't do something similar in other catalog implementations in the Iceberg codebase and that would set a precedent that using something like this is ok while in fact there are better ways to tackle this (having a separate `initialize` method that takes a custom client impl needed for testing)", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "see my comment in https://github.com/apache/iceberg/pull/12808/files#r2059731731 on this", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "looks like indentation is off here", "path": "build.gradle", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n ImmutableMap.of(\r\n```", "path": "bigquery/src/test/java/org/apache/iceberg/gcp/bigquery/BigQueryTableOperationsTest.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "is this one needed given that it's only used at a single place?", "path": "bigquery/src/test/java/org/apache/iceberg/gcp/bigquery/TestBigQueryCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "this can probably also be inlined, since it's only used at a single place", "path": "bigquery/src/test/java/org/apache/iceberg/gcp/bigquery/TestBigQueryCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "is this still true? 
when I removed the `Disabled` flag the test would pass", "path": "bigquery/src/test/java/org/apache/iceberg/gcp/bigquery/TestBigQueryCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "given that the tests here that live in the Iceberg codebase are always running against the Fake client, I would suggest to remove all of those `Disabled` tests (and basically trim the test down to what I suggested in https://github.com/apache/iceberg/pull/12808/files#r2055539916), since they all pass against the Fake client. If you want to run the catalog tests against an actual client, you can then basically just create a new `TestRealBigQueryCatalog` (I'm sure there's a better name) that initi", "path": "bigquery/src/test/java/org/apache/iceberg/gcp/bigquery/TestBigQueryCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "@amogh-jahagirdar I don't think they should be in `GCPProperties` because these properties here are related to the actual catalog and not the FileIO implementation", "path": "bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryMetastoreCatalog.java", "line": null, "type": "inline"}], "4537": [{"author": "findepi", "body": "This needs to be addressed before the PR is merged.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Because Iceberg is a library, we can't use annotations to control serialization. The JVM makes no guarantee that annotations are available at runtime and users are in control of the final classpath. So this brings in a new set of correctness problems. 
That's why we always serialize with a `SomethingParser` class.", "path": "core/src/main/java/org/apache/iceberg/stats/BlobMetadata.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This shouldn't be needed.", "path": "build.gradle", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Iceberg avoids the use of `get` in method names. In cases like this one, it's an unnecessary additional word. And in most other cases, there's a more specific verb that should be used instead, like `find` or `fetch` that give the reader more context about what's happening.", "path": "core/src/main/java/org/apache/iceberg/stats/BlobMetadata.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why optional rather than returning `null`?", "path": "core/src/main/java/org/apache/iceberg/stats/BlobMetadata.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why not just include the values from the spec here?", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "And that should be documented in the spec, right?", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "We don't generally work with `byte[]`, especially with no offset or length, because it requires new allocations all the time. Can these work with `ByteBuffer` instead?", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "Right, i took a shortcut here and forgot to mark it as a TODO item. 
Sorry for that.", "path": "core/src/main/java/org/apache/iceberg/stats/BlobMetadata.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "good catch, see https://github.com/apache/iceberg-docs/pull/69#issuecomment-1096264428", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "LZ4 frame support draft in aircompressor: https://github.com/airlift/aircompressor/pull/142", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "No worries! This is a problem most people aren't aware of so it's pretty common.", "path": "core/src/main/java/org/apache/iceberg/stats/BlobMetadata.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "i would ask the question the other way around, but fine, changed to nullable.", "path": "core/src/main/java/org/apache/iceberg/stats/BlobMetadata.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "This will require https://github.com/airlift/aircompressor/pull/142#discussion_r848749110.\r\nTo unblock this, I will drop LZ4 for now from this PR.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "`jackson-datatype-jdk8` no longer used here", "path": "build.gradle", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "we are calling `writeEndObject` twice, is it intentional ? 
", "path": "core/src/main/java/org/apache/iceberg/stats/FileMetadataParser.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "can add a break here instead and let the exp in L#112 take care of throwing UnsupportedException, something done above.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "can call `checkNotFinished()`", "path": "core/src/main/java/org/apache/iceberg/stats/StatsWriter.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "first ends `properties` field, the second end the whole `FileMetadata`.", "path": "core/src/main/java/org/apache/iceberg/stats/FileMetadataParser.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "thanks for the tip", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Nit: Usually we use the relocated `Preconditions` form.\r\n\r\nAlso, does it make more sense to just maintain a mapping between the two vs iterating over them?", "path": "core/src/main/java/org/apache/iceberg/stats/StatsCompressionCodec.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Question: Does this particular magic bytes correlate with anything in particular or did we just decide on it?", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "from perf perspective, for enum with two elements, i don't know which is actually faster\r\nfrom readability perspective, it's probably subjective which is more readable, so which one you'd prefer?", "path": "core/src/main/java/org/apache/iceberg/stats/StatsCompressionCodec.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "This is from the format spec https://github.com/apache/iceberg-docs/pull/69:\r\n\r\n> `Magic` is four bytes 0x50, 0x46, 0x49, 0x53 (short for: 
Plain Format for\r\n Indices and Statistics),\r\n", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "This is the set of fieldIds in the Iceberg table this blob applies to? (Probably should doc that) If these are field ID's is the association between this object and the table made at another level? (I'll probably find out if I keep reading)", "path": "core/src/main/java/org/apache/iceberg/stats/BlobMetadata.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "yes and yes", "path": "core/src/main/java/org/apache/iceberg/stats/BlobMetadata.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "empty JavaDoc", "path": "core/src/main/java/org/apache/iceberg/stats/StatsCompressionCodec.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "we tend to do these like\r\n\r\nhttps://github.com/apache/iceberg/blob/c75ac359c1de6bf9fd4894b40009c5c42d2fee9d/api/src/main/java/org/apache/iceberg/DistributionMode.java#L51-L54\r\n\r\nWhere we leverage `valueOf` to throw IllegalArgs for us and check for valid enum names\r\n\r\n", "path": "core/src/main/java/org/apache/iceberg/stats/StatsCompressionCodec.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Could we make a static final variable for this like FOOTER_COMPRESSION_CODEC, or something like that just to make it clear in the code?", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Isn't this required for the footers at the moment?", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "swap is a bit unclear here since we actually are just reversing the byte order here. 
Maybe \"invert\" or \"reverse\"?", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Should this be \r\n`SUPPORTED_FLAGS = FLAG_COMPRESSED`\r\nThen when we add more flags it's just\r\n`SUPPORTED_FLAGS = FLAG_COMPRESSED | FLAG_OTHER`", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Is this possible? Just wondering if ZstdDecompressor.getDecompressedSize() can return an invalid length. If so should we be throwing a \"Corrupted file error or something like that?\"", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "This always allocates bytes for the Magic byte[] array, Is this because we want to re-use the array we are allocating here? Just wondering if this could be a static final byte array since it get's used in some contexts (like counting length) which don't require the allocation.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Checking for the magic at the beginning and end of the footer?", "path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Maybe make 4 here a final constant FOOTER_MAGIC_OFFSET", "path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "8 => FOOTER_FILE_FORMAT_OFFSET", "path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "FOOTER_FLAG_OFFSET", "path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "FOOTER_RESERVED_OFFSET", 
"path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Should add the full java doc for this as well not just exception stubs, also nit: we don't do vertical alignment usually", "path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "What's the math here? If the file length is less than 20 I thought we can't have a valid footer? (Magic + Reserved + Flag + Format + Magic) * 4 = 20?", "path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Like before I think it would be helpful if a lot of these magic numbers had constants to label them.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "CC: @ggershinsky Could you take a look at this as well I'm not sure how we fit this into our current encryption plans since we are compressing blob blocks within this file seperately we would end up with issues if we encrypted the entire file. Is there an easy way we can compress/encrypt the blobs inside the index files?", "path": null, "line": null, "type": "review_body"}, {"author": "RussellSpitzer", "body": "What is the benefit of the multiple blobs in a file if we don't have footer information saying why we may or may not need to check a specific blob? I guess currently we can prune them based on \"columns\" but I feel like maybe we would benefit from other information? Like if a particular file had sketches that applied to different row groups in a file so the correct sketch to use would be based on min and max column values or something ? 
Just thinking about possible future directions.", "path": "core/src/main/java/org/apache/iceberg/stats/BlobMetadata.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "That's an empty block comment. That's a what we use in Trino to make formatter happy and to allow addition of new enum options without changing lines of existing options (clean git annotate output).", "path": "core/src/main/java/org/apache/iceberg/stats/StatsCompressionCodec.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Is there an additional `NOTICE` / license file that needs to be added to the releases with the new dependency?", "path": "build.gradle", "line": 223, "type": "inline"}, {"author": "kbendick", "body": "For the `compressionCodec`, instead of using null for uncompressed, would it make sense to use an enum and then define the possible compression types and have a value for none / uncompressed?\r\n\r\nThis would be similar to `TableMetadataParser.Codec`, which has a `NONE` value, though there might be reasons not to do that.", "path": "core/src/main/java/org/apache/iceberg/stats/BlobMetadata.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Is the `ZstdDecompressor` something that we can allocate once and share somehow?", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "You might be interested in `org.apache.iceberg.util.ByteBuffers`, which has a `toByteArray` method that looks very similar to this.\r\n\r\nThere are other methods in that class you might consider using too, or placing any additional new `ByteBuffer` utility functions.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "For 2 elements actually it\u2019s probably overkill in my opinion. 
Good point.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsCompressionCodec.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "https://github.com/apache/iceberg-docs/pull/69 mandates the name to be lowercase.\r\nalso, since it's not internal, i wanted to have conversion to and from file-level name explicit.\r\n\r\n", "path": "core/src/main/java/org/apache/iceberg/stats/StatsCompressionCodec.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "no, it's optional", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "renaming to `swapBytes`", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "it cannot return an invalid length", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "Obviously there is a bunch of numbers that are hard to follow, and 4 is one of them.\r\nI figured that defining constants for each of them would only blur the code. \r\nThis is a low-level code and i don't think there are any abstractions here.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "As said above, defining the constants doesn't make the code clear, at least not in my eyes.\r\nThe code is organized after and follows the raw footer layout. That's clearly indicated by the numbers subsequently used in the code (4, 8, 12, 16). 
Constants wouldn't convey that meaning, and wouldn't make evolving the code easier either.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "> we don't do vertical alignment usually\r\n\r\nThat's what the formatter (imported from the definition linked in the project) produced for me, not my own art.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "good point", "path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "The footer contains a type of every blob. In particular, reading application will read only the blobs of types it can utilize for current operation. Same/similar information will be added to table metadata, so that reading application doesn't need to read stats file footer if no blobs are useful to it.", "path": "core/src/main/java/org/apache/iceberg/stats/BlobMetadata.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "I don't know when a new notice should be added, deferring to committers' judgement", "path": "build.gradle", "line": 223, "type": "inline"}, {"author": "findepi", "body": "None is not a compression. The file format spec defines there should be no mention of compression coded, when blob data is not compressed. Internally, in Java code we could still have an enum with additional value, but it would need to require special casing when reading, writing and compressing. 
Originally i had `Optional` here, but got a comment to use a nullable value instead.", "path": "core/src/main/java/org/apache/iceberg/stats/BlobMetadata.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "it's mutable", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "ByteBuffers.toByteArray is awesome, thanks!", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "findepi", "body": "`byte[]` is mutable, so i would need to `.clone()`/ copy when returning from here.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This is already included because it is pulled in through ORC. Thanks for asking about it, @kbendick. We always need to be careful about this!", "path": "build.gradle", "line": 223, "type": "inline"}, {"author": "rdblue", "body": "Nit: we would normally include an empty newline between the control flow block and the following statement.", "path": "core/src/main/java/org/apache/iceberg/util/JsonUtil.java", "line": 180, "type": "inline"}, {"author": "rdblue", "body": "I think columns should probably be a list rather than a set because order matters when computing inputs to blobs. 
For example, if you have a blob for NDV across two columns you'd need to know how to construct the value tuple that is used as input to the theta sketch.", "path": "core/src/main/java/org/apache/iceberg/stats/BlobMetadata.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Style: We generally prefer to keep `Preconditions` first in the method, and not embed them in other statements.", "path": "core/src/main/java/org/apache/iceberg/stats/FileMetadata.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "As I noted on the spec, I think this should be `compression-codec` to align with the convention used in other JSON parsers.", "path": "core/src/main/java/org/apache/iceberg/stats/FileMetadataParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Is this used elsewhere or can it be package-private?", "path": "core/src/main/java/org/apache/iceberg/stats/FileMetadataParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Style: Iceberg doesn't use `get` in method names. It usually signals that there's either a more specific verb, like `load` or `find` or `fetch`, or is just needless filler. In this case, I'd omit it.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsCompressionCodec.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I agree, we can use the `valueOf` with `toUpperCase`.\r\n\r\nEven if the spec uses lower case, this doesn't need to be that strict when interpreting input. Be liberal with what you accept and strict with what you produce.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsCompressionCodec.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "While it is mutable, it should be fine to rely on people not modifying it. 
This isn't public.\r\n\r\nI'd just make this a constant, `MAGIC`.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why use `0xFF & value` above, but `Byte.toUnsignedInt` here?", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Style: `get` in a method name.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "If this isn't supported, then why list it in the spec? I think having it in the spec implies that it is required.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can this be package-private?", "path": "core/src/main/java/org/apache/iceberg/stats/StatsCompressionCodec.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Style: Iceberg adds empty lines between control flow blocks and the following statements.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Is there a maximum size that we could copy out in this case?", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I guess it doesn't matter since we use the byte array path anyway.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Does this modify `input`? If it does, then we should pass in `input.duplicate()` instead so that the original buffer is unchanged.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsFormat.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What is expected for `footerSize`? 
Should that match `FooterPayloadSize` or be larger by some constant to account for magic and other non-payload bytes?", "path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "It seems strange to me to check for null with an `Optional`. We typically don't use `Optional` and just pass `empty` as `null`. Since you're checking anyway, that seems like a reasonable update.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "In Iceberg, we typically use `null` to signal that a lazy variable hasn't been initialized. That works cleanly with `transient`.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think converting the magic number to an int makes this code harder to reason about. Magic is not checked in a tight loop, so I don't think that the optimization of comparing the bytes all at once is worth it. I'd prefer to just check each byte in a loop and avoid needing to validate that this is correct:\r\n\r\n```java\r\nstatic final int MAGIC_AS_NUMBER_LE = new BigInteger(swapBytes(getMagic())).intValueExact();\r\n```", "path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What do you think about using the same optimization as in Parquet and reading the last 16kb of the file, then reading more if the footer size is larger than that? I think we may as well start with that.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "The new `RangeReadable` interface has a `readTail` method that you can use to avoid needing `fileLength`. 
S3 allows you to read the last N bytes of the file without knowing where to seek to.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This breaks forward compatibility if we ever decide to use the reserved bytes, which defeats the purpose. If we have to increment the format version to use them, we should just remove them.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why not translate to/from `StatsCompressionCodec` in the parser? That would make more sense to me.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "You can use Iceberg's `Pair` for this, rather than `Map.Entry`.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "In Iceberg, we typically omit `IOException` and wrap with `UncheckedIOException` because that has to happen in so many places anyway. I think you can remove the exception.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why return a `Stream`? 
Is there a case where you wouldn't read all of the blobs into memory at once?\r\n\r\nWe also typically use `Iterable` rather than `Stream`.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "You might consider lazily opening the stream.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsReader.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Missing newline between methods.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsWriter.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Shouldn't compression be configured for the writer and then applied to every blob uniformly?", "path": "core/src/main/java/org/apache/iceberg/stats/StatsWriter.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I agree with Kyle. I think this would be cleaner if you added `NONE` and always passed the codec as the enum.", "path": "core/src/main/java/org/apache/iceberg/stats/BlobMetadata.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "When modifying state by setting an instance field, we like to use `this.` so it is obvious that it isn't a local variable.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsWriter.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "It doesn't look like you need a separate declaration anymore.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsWriter.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Looks like channels aren't guaranteed to write all of the bytes, so this should check the return value from `write`.", "path": "core/src/main/java/org/apache/iceberg/stats/StatsWriter.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this should use the builder pattern that we use for other file formats. 
That way all of the metadata and config are set ahead of time and the writer just needs to append each payload. You can see `Parquet` for examples, but it would look like this:\r\n\r\n```java\r\n try (FileAppender<Blob> writer = StatsFiles.write(outputFile)\r\n .set(\"key\", \"value\")\r\n .compressFooter()\r\n .build()) {\r\n writer.addAll(blobs);\r\n }\r\n```", "path": "core/src/main/java/org/apache/iceberg/stats/StatsWriter.java", "line": null, "type": "inline"}], "14117": [{"author": "RussellSpitzer", "body": "I was wondering if a UDTF in this context is just a UDF that returns List<Struct<>>\r\n?", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "because here we say `return either:`, maybe the bullets should be some thing like\r\n\r\n* a single scalar value for scalar functions (UDFs)\r\n* a table with zero or multiple rows for table functions (UDTFs)\r\n\r\nI don't know if zero row is allowed or not for table functions during the discussion", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: change `Most` to `Many`", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: `function-uuid` to `uuid`. function is the assumed context", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "do we need a small `Terms` section to explain the term `secure function`?", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "struct or list of structs?", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I am wondering if we can just add a `current-version` in the overload definition directly, e.g. above line 149?\r\n\r\nI will be easier to correlate that way. 
we can also avoid repeating the overload-uuids here.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: `definitions`?", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "can add another simple example for UDTF and illustrate the return type struct?", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "wondering if it is helpful to add a `Field type` to those metadata tables?\r\n\r\nI know most tables in view spec don't have the type column. [one table](https://iceberg.apache.org/view-spec/#sql-representation) does include the type column", "path": "format/udf-spec.md", "line": 52, "type": "inline"}, {"author": "flyrain", "body": "Agreed that it seems redundant. It is self-contained at the same time. I was trying to be consistent with table/view specs, which contains `table-uuid` and `view-uuid`. ", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Good point, will add it", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I see. agree that we can keep it consistent with `table-uuid` and `view-uuid`", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Per discussion, we will use a dedicated attribute to indicate whether it's UDTF.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Not sure I understood. The `Representation` node used to have a type(must be `sql`) at this moment. Per offline discussion, we think it's redundant, we could add it back when we introduce other type(e.g., python) UDF. 
", "path": "format/udf-spec.md", "line": 52, "type": "inline"}, {"author": "flyrain", "body": "Added in the new commit ", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "made the change", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Added one example for udtf", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "I think struct is fine. For UDF, I think it's OK to support all Iceberg type. ", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I mean json field type. E.g., the `secure` field is a boolean type, the `location` field is a string type, the `format-version` is a number.", "path": "format/udf-spec.md", "line": 52, "type": "inline"}, {"author": "stevenzwu", "body": "`definition-log`? similar to the `snapshot-log` field in table metadata.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "might be more readable if this is an object (instead of an array). e.g.\n\n{\n \"overload-uuid\": \"d2c7dfe0-54a3-4d5f-a34d-2e8cfbc34111\",\n \"overload-version-id\", 2\n}", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "we already have a `function-uuid`. wondering if this can be a simple monotonically increasing number maybe called `overload-id`", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "`body` is a bit too general here. maybe this array can be called `overloads`", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Yeah, one of my versions actually have it, but most of view and table spec don't have it. I think this is mainly because the json data type is pretty straightforward, basically they are string, number, boolean, object and array. https://datatracker.ietf.org/doc/html/rfc8259#section-7. 
", "path": "format/udf-spec.md", "line": 52, "type": "inline"}, {"author": "flyrain", "body": "I'd try to position the `definition-versions` as a monolithically increased snapshot of all versioned things in UDF spec. In that case, it has the lineage and it's different from things like `metadata-log` or `snapshot-log` in table spec. Does that make sense? I'm open to name change as well.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "I was debating with myself on this, I think the overload uuid would make more sense than a monotonically increasing number. We won't be confused by the number, when overload creation and deletion could happen, and the same number may point to different overloads.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "or `definitions`? I'm open to a better name.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Nice suggestion! Let me make the change.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "changed it to `overload-versions` in the new commit. Any name suggestion is welcome.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "sounds good. 
JSON example is also a reference", "path": "format/udf-spec.md", "line": 52, "type": "inline"}, {"author": "stevenzwu", "body": "`overload-versions` looks good to me", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "danielcweeks", "body": "We might want to define the result zero or more rows with a uniform schema.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "danielcweeks", "body": "Nit: I think we need to be careful about the wording here because we don't intend to actually replace the file, but rather, update a reference atomically.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "danielcweeks", "body": "I feel like we are missing the type information for each of the defined fields. Are they a primitive (like a string/Boolean) or a structured/nested type defined below.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "danielcweeks", "body": "We refer to these as \"Scalar UDFs\" after this, should we just call it there here as well?", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Thanks a lot for the review, @danielcweeks ! @stevenzwu asked a similar question here, I answered here, https://github.com/apache/iceberg/pull/14117#discussion_r2373667060", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Debated this with myself a bit. Should we still call it Scalar UDFs if it can return `struct`, `list`, etc.? My thoughts is to not mentioned `Scalar` to avoid confusion, I can remove them in the other places.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "That's a good point, let me reword it.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Emphasis on a \"uniform\" schema, right? 
Good point, let me change it.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Thinking a bit more, it should be OK to use \"scalar\" to refer one value as a counterpart of multiple rows. Made the change, thanks for the suggestion.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Per community sync, changed it to `definition-log`, which doesn't keep lineage of the global version change. `timestamp` is still used to indicate when the change happens.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Changed it to monotonically increasing number id", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Added it to the spec. Please take a look!", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Added type for each field, please take a look.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I thought we agreed to change it to `Scalar functions (UDFs)` in Dan's comment. Scalar function is a pretty common term (used in Spark, Flink, Trino.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "> can be moved across catalogs\r\n\r\nwhat does this mean?", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit on description\r\n```\r\nA string to string map of UDF properties\r\n```", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "struct is also an Iceberg field type. can it also be for scalar function (struct -> struct)?", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "should `name` be immutable? 
typically function signature (like Java) doesn't include parameter name", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "Since Iceberg data type uses list, should we use `list` instead of `array` here?", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "there is no enum type in Iceberg. I guess we should use `string` here. we can clarify the valid values as `udf` and `udtf` in the type column or the description column.\r\n\r\ne.g. the table spec has the following definition for data file `content` type\r\n```\r\nint with meaning: 0: DATA, 1: POSITION DELETES, 2: EQUALITY DELETES\r\n```", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "do we need to call out these two are mutually exclusive in the notes here or the description column?", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "> Highest `overload-id` currently assigned in this metadata file\r\n\r\nnit: Is this more clear?\r\n```\r\nHighest `overload-id` currently assigned for this UDF\r\n```", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "For example, user can copy a UDF metadata json file from one catalog to another. It should still work in the new catalog after a `registration`", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "The name itself doesn't have to be immutable for callers, as the order of parameter matters more. Names are mainly used by the versioned representations, they should be consistent across multiple versions. Otherwise, the rollback would be problematic. 
For example, we need to keep the name the same when add/rollback versions.\r\n```\r\n { \"name\": \"x\", \"type\": \"int\", \"doc\": \"Input integer\" }\r\n ...\r\n \"overload-version-id\": 1,\r\n \"deterministic\": true,\r\n ", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "not sure if I understand how the parameter renaming cause problem for rollback.\r\n\r\n> To change them, a new overload must be created. \r\n\r\nIs it ok to add an overload with only parameter name change, while params type and order are the same? How would client/engine resolve to the correct overload?\r\n", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "> not sure if I understand how the parameter renaming cause problem for rollback.\r\n\r\nTaking the example I put above. If we rename it to `y` at some point, then rollback to v1 or v2 will cause inconsistency between representation `x+1` and parameter name `y`.\r\n\r\n> Is it ok to add an overload with only parameter name change, while params type and order are the same? How would client/engine resolve to the correct overload?\r\n\r\nIt shouldn't be allowed, as the signatures are the same. ", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "I was trying to use the json data type name here, https://www.w3schools.com/js/js_json_datatypes.asp. I'm OK if we agreed to use Iceberg data type names here, maybe that's more consistent with other type we put here, like `long`", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "yes, removed \"`struct` for UDTF\" to reduce confusion", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "> The metadata is self-contained and can be moved across catalogs.\r\n\r\nGot it. I am not sure we need to call this out at high-level goals. 
E.g., we didn't seem to call this out for the view spec.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I see. using JSON types make sense to me, since the metadata is stored as a JSON file in the end. We just need to be consistent. E.g., we can change the `long` type to `number`. ", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "got it. basically, parameter renaming is not allowed. hence, we require the `name` and `type` are immutable.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Because we are intending to open this up to other languages like Python in the future, maybe we want this to be titled \"Iceberg UDF Spec\"?", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "```suggestion\r\n- **Scalar functions (UDFs)** \u2013 returns a scalar value, which may be a primitive type (e.g., `int`, `string`) or a non-primitive type (e.g., `struct`, `list`).\r\n```", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "```suggestion\r\n- **Table functions (UDTFs)** \u2013 returns a table with zero or more rows of columns with a uniform schema.\r\n```", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "> semantics for representing\r\n\r\nIs this talking about semantics or representation? I think it is probably more representation than semantics.", "path": "format/udf-spec.md", "line": 37, "type": "inline"}, {"author": "rdblue", "body": "These names are a product of the evolution of this spec and I think we are now in a good place to consider choosing more consistent names. It is strange that \"definitions\" is actually a list of \"overloads\". I think we should choose a single term for a function signature and use that here. 
To me, \"definition\" inside \"definitions\" makes the most sense.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "> Invocation order **must** match this list.\r\n\r\nDoes this need to be a requirement for all engines? What if the engine has a way to match parameters by name?", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "If this is an Iceberg data type, then it can't be a `string` because complex Iceberg types are represented as objects:\r\n\r\n```json\r\n{\r\n \"type\": \"list\",\r\n \"element-id\": 3,\r\n \"element-required\": true,\r\n \"element\": \"string\"\r\n}\r\n```\r\n\r\nIceberg does not parse these types other than to/from the JSON representation and that representation requires field IDs. If we want to omit the field IDs then that's fine but I think we need to be more specific. We don't want to need to parse complex types as stri", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why use a `long`? Do we expect more than 2 billion versions of an overload?", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Do we need an ID for the overload? Isn't it identified by the function parameters?", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Where do we specify the requirements for the returned values? I think that the function must produce values of that type, right?", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This has the same problem as above. 
We need some representation of an Iceberg type that doesn't require string parsing.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Do we want to mention that the parameters do not have default values?", "path": "format/udf-spec.md", "line": 90, "type": "inline"}, {"author": "rdblue", "body": "Is there a way to specify that null values are not allowed? In Iceberg nested types, each field/element/value is required or optional.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I agree. Since these names are embedded in the SQL text of a function, they can't be changed without breaking older versions.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What if the return type generated by one engine is different? For instance, what if I have an `index_of` function that looks for a value in a list. In one engine, lists are limited to < Integer.MAX_VALUE and in another they are not so the engines produce `int` and `long` return values. I think it would be okay to promote a definition that produces `int` to `long`. I'm also not sure how you would do it without replacing the value. Otherwise you'd have multiple definitions with the same parameters", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think that this section needs to match the View spec. The representation should have only a `type` field. The only currently-defined type is `sql`. Then the dialect and body are fields defined by the SQL representation. Also, body should use the field name `\"sql\"` to match. I don't think that we want to have different rules for essentially the same structure.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think that `overload-id` introduces a slight problem. 
It allows defining functions with conflicting/identical input parameters because the ID identifies the overload rather than the parameters themselves. Is this the only place where `overload-id` is used? ", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "How about `version-id` since we don't need more than one version concept in this spec?", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Is this a hint? I think if it is present then it needs to correctly describe behavior.\r\n\r\nThis also appears to be confusing to me. From a direct reading, I would expect `returns_null` to indicate that the function may return null for any input, not that for a particular input it MUST return null.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think we need to define the cases where you must use the same set of parameters. For instance, you cannot have different return values for two functions with the same parameters (`(int, long) -> long` and `(int, long) -> int` cannot coexist). Similarly, you can have similar parameters in different overloads, like `(int) -> int` and `(bigint) -> bigint`.", "path": "format/udf-spec.md", "line": 90, "type": "inline"}, {"author": "rdblue", "body": "It is unclear to me what \"immutable\" means. Does it mean that you can't change these without updating the `overload-id`? That seems incorrect to me because the overload ID is more about tracking than identity. I think a better way to phrase this is:\r\n1. Function definitions are identified by the tuple of types and there can be only one definition for a given tuple\r\n2. 
All parameter names must match the definition in all versions and representations", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "When needed and where it does not conflict with the rule above.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "I think this is more about semantics than representation. The goal is to ensure consistent semantics(e.g., same behaviors) across engines, while allowing dialectic representations for each engine. Representations don\u2019t have to be identical, but they should preserve the same meaning. I'm open to suggestion if wording could be improved.", "path": "format/udf-spec.md", "line": 37, "type": "inline"}, {"author": "rdblue", "body": "After talking with Dan about the issue we discussed in the sync, I think that it makes sense to have a list of parameter names in the SQL representation. That way each representation is self-contained and consistent. And there's no need to have restrictions on whether names can change. The names in the definition and docs are shown as the definition, but the names used in SQL are specific to that SQL. It's the same idea as having a param name in a Java interface that can differ in the definition", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "That sounds a good idea. To avoid duplication as most of representations may not need different names, we might still allow SQL representation to use the default parameters. So that only renaming triggers the copying of parameters to individual representations. ", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Was removed due to an offline feedback. I'm OK with either way. Added it back. 
", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Changed it to int", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Changed the name from `overload` to `definition`", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Changed the id from `long` to a parameter tuple string per discussion. Please take a look.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "It\u2019s the only place in the spec that uses this. The ID will be used for conflict detection. I\u2019ve updated it to use the parameter tuple as the identifier instead of a numeric value.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "That\u2019s a good question. Searched a bit, some engines like Spark[1] and Snowflake[2] support named parameter invocation, similar to how Python supports named arguments. Trino doesn't. Because of that, parameter names become part of the function\u2019s signature, which means that renaming parameters can break existing invocations.\r\n\r\nFor example, a Spark app might call a function defined as `foo(int a, int b)` using named arguments: `foo(a => 1, b => 2)`. If the function definition later changes to `fo", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Added a new note to clarify that parameter names must not change due to named parameter invocation.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Added an optional parameter list in the representation, also clarified that the tuple of types identify a definitioin.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Added a note to call it out that parameters have no default value. Thanks for the suggestions. We may think about support it later. 
Snowflake[1] and Postgres support it. Spark[2] plans to support it. \r\n\r\n1. \u201cEach optional argument has a default value that is used when the argument is omitted.\u201d \u2013[ Snowflake Docs](https://docs.snowflake.com/en/developer-guide/udf/udf-calling-sql) \u00b7[ Snowflake Blog](https://medium.com/snowflake/snowflake-supports-named-and-optional-arguments-389d2500726f)\r\n2. [UDFs", "path": "format/udf-spec.md", "line": 90, "type": "inline"}, {"author": "flyrain", "body": "For https://github.com/apache/iceberg/pull/14117/files#r2445396315", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "> define the cases where you must use the same set of parameters. \n\nThis was done when we used the type tuples as definition id. It guarantees that functions with the same type tuples cannot co-exist, even their return types are different.", "path": "format/udf-spec.md", "line": 90, "type": "inline"}, {"author": "flyrain", "body": "Changed to `called_on_null_input` and `returns_null_on_null_input`. These are common terms used across engines, like Trino, Microsoft SQL Server, and Snowflake. Keeping them consistent would be preferable. \r\n\r\n> Is this a hint?\r\n\r\nIt should not be a hint for engines supporting `null handling` flag, but it's practically a hint for engines not supporting it. For example, Spark will always be with `called_on_null_input` even the UDF is marked as `returns_null_on_null_input`, of course, the UDF aut", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Made the change accordingly. Added the requirement for returned values.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Yes, we don't want different definitions with the same parameters. I think it\u2019s fine to define the return type as long in that case, and let individual engine handle the type promotion separately. 
In that case, the UDF definition is still consistent across engines. WDYT? ", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Added one rule for inputs to clarify it.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "[doubt]\n1/ What would engines do if they can't enforce these requirements? FAIL? Otherwise one can just use an engine that doesn't enforce this and get the udf definition. \n2/ Do we want to say something about a predicate re-order attack, implying the udf should be executed completely (not participating in the optimizer's predicate reorder) and in isolation? Example attack: https://docs.snowflake.com/en/user-guide/views-secure#how-might-data-be-exposed-by-a-non-secure-view\n", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "Do the names need to be considered case-sensitive or case-insensitive always? We might need this info when binding to get the field-ID.\n\nA quick search shows PG supports case-sensitive params:\n\n```\nCREATE FUNCTION get_double(input_value int, \"MyArg\" int)\nRETURNS int AS $$\nBEGIN\n -- You MUST use quotes to access it\n RETURN input_value + \"MyArg\";\nEND;\n$$ LANGUAGE plpgsql;\n```", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "[doubt] If engines are deciding the resolution rule, then wouldn't the output of the udf become engine-dependent? For example, if one engine prefers double over float and both overloads exist, they would produce different output.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "1. Good point. I think the intent is that if an engine declares support for `secure=true`, it must enforce these requirements; otherwise it should reject the function as unsupported. That way, users can rely on consistent behavior instead of silent non-compliance.\r\n2. 
Yes, I agree we should mention that secure UDFs must be executed atomically and not be reordered or optimized away in predicate pushdowns. I\u2019ll add a note.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Right, and the spec can only minimize that.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Good catch. I think we could standardize parameter names as case-sensitive, matching SQL identifier rules when quoted (e.g., \"MyArg\"). Engines that fold identifiers to lower case can normalize internally, but the canonical definition in the metadata.json should preserve the declared casing. WDYT? We can bring it up to the community sync. ", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "+1 on storing this as case sensitive always as there may be cases where engine storing UDF is operating on case insensitive mode (Thank you for bringing this in the sync today). 
Nevertheless, at runtime, an engine that tries to execute the UDF in case-sensitive or case-insensitive mode can call the appropriate bind method.", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Shall we specify that if a property is not supported by an engine, it will be ignored (the engine may log a warning)?", "path": "format/udf-spec.md", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Shall we also specify that if an engine does not support secure UDFs, it MUST reject functions with secure=true at create/load/registration time and MUST NOT execute them?", "path": "format/udf-spec.md", "line": null, "type": "inline"}], "4037": [{"author": "kbendick", "body": "This function is written such that it can be used in `getStringMap` and the differences between `getStringMap` and `getStringMapOrNull` would be in precondition checks, but I chose not to make that change to the existing code for now.", "path": "core/src/main/java/org/apache/iceberg/util/JsonUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think there's probably a better name for this module. Using \"Jackson\" ties this to the implementation, which we probably don't want. And \"module\" seems like an extra word that isn't carrying a lot of meaning, to me at least. 
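The proposal in this thread — store parameter names case-sensitively in the metadata, and let each engine choose a case-sensitive or case-insensitive bind at runtime — might look roughly like this Python sketch (hypothetical helper, not part of the spec):

```python
# Hypothetical bind helper: declared parameter names keep their original
# casing; the engine decides at bind time whether to match case-sensitively
# or fold case for the lookup.
def bind(declared_params, name, case_sensitive=True):
    if case_sensitive:
        matches = [p for p in declared_params if p == name]
    else:
        matches = [p for p in declared_params if p.lower() == name.lower()]
    if len(matches) != 1:
        raise KeyError(f"cannot bind parameter {name!r}")
    return matches[0]

# Names preserved exactly as declared (cf. the PG "MyArg" example above).
params = ["input_value", "MyArg"]
```

A case-insensitive engine can still bind `myarg` to the canonical `MyArg`, while a case-sensitive engine rejects it — without the stored metadata losing the declared casing.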
It looks like this is probably related to how Jackson is configured?\r\n\r\nHow about `RESTSerializers` instead?", "path": "core/src/main/java/org/apache/iceberg/rest/RESTJacksonModule.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can you help me understand what these configurations are doing?", "path": "core/src/main/java/org/apache/iceberg/rest/RESTJacksonModule.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What is the scope of the context that this is modifying?\r\n\r\nDoes the schema need to be deserialized before the partition spec?", "path": "core/src/main/java/org/apache/iceberg/rest/RESTJacksonModule.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Do you mean deserialization?", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why not just copy the namespace levels? It looks like this would continue adding to the namespace every time `withNamespace` is called, rather than setting the namespace?", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think it's clear from the name that this is not replacing the set of properties. I think a verb like `set` would fix this. 
For example, `setProperties(map)` implies to me that you're adding all the new properties and not removing anything.", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This is good because if we're relying on Jackson to automatically decide how to serialize a payload, we need to ensure that it matches the spec.", "path": "core/src/test/java/org/apache/iceberg/rest/requests/TestCreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Looks like you can call these directly rather than using satisfies.", "path": "core/src/test/java/org/apache/iceberg/rest/requests/TestCreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I would put this assertion in a separate test. The result is going to be the same for all instances of the object, right? So we don't need to check both expected and actual, nor do we need to do it when comparing expected and actual.", "path": "core/src/test/java/org/apache/iceberg/rest/requests/TestCreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this should be `assertEquals` rather than `assertSame`. 
`Assert.assertSame` in the JUnit API tests object identity, not equality and I think we should use the same definition.", "path": "core/src/test/java/org/apache/iceberg/rest/requests/TestCreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Better name might be \"EMPTY_PROPERTIES\"", "path": "core/src/test/java/org/apache/iceberg/rest/requests/TestCreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This behavior is fine with me.", "path": "core/src/test/java/org/apache/iceberg/rest/requests/TestCreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think we need to test 3 things: missing `properties`, `properties` that is explicitly null, and empty properties.\r\n\r\nOn the serialization side, it's perfectly fine if no properties are serialized as an empty map only.", "path": "core/src/test/java/org/apache/iceberg/rest/requests/TestCreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "remove / removeAll?", "path": "core/src/main/java/org/apache/iceberg/rest/requests/UpdateNamespacePropertiesRequest.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "update / updateAll?", "path": "core/src/main/java/org/apache/iceberg/rest/requests/UpdateNamespacePropertiesRequest.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "You can copy this builder for the create request.", "path": "core/src/main/java/org/apache/iceberg/rest/responses/CreateNamespaceResponse.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can we use `GetNamespaceResponse` for this as well? They're exactly the same thing.", "path": "core/src/main/java/org/apache/iceberg/rest/responses/CreateNamespaceResponse.java", "line": 33, "type": "inline"}, {"author": "rdblue", "body": "The spec needs to be updated to match this. 
I agree with returning the namespace and actual properties so that the caller doesn't have to immediately send a GET request, but we should make sure the implementation and spec match.", "path": "core/src/main/java/org/apache/iceberg/rest/responses/CreateNamespaceResponse.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "add / addAll?", "path": "core/src/main/java/org/apache/iceberg/rest/responses/ListNamespacesResponse.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "add / addAll?", "path": "core/src/main/java/org/apache/iceberg/rest/responses/ListTablesResponse.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "No need to use the methods here. It will just create a useless empty list.", "path": "core/src/main/java/org/apache/iceberg/rest/responses/UpdateNamespacePropertiesResponse.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "addMissing / addRemoved / addUpdated?", "path": "core/src/main/java/org/apache/iceberg/rest/responses/UpdateNamespacePropertiesResponse.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I'd probably remove these. The requests are likely to be built and automated. Better to have an `isEmpty` method on the request, but I probably wouldn't do that either.", "path": "core/src/main/java/org/apache/iceberg/rest/requests/UpdateNamespacePropertiesRequest.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Yeah this one is strange for sure. I'm going to use what's in `GetNamespaceRequest`.", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "danielcweeks", "body": "These are a workaround for Jackson since Iceberg doesn't use the standard `get/set` bean notation. This allows Jackson to work with the fields more directly and not require custom serializers (with basically a ton of boilerplate code) for all of the request/response objects. 
", "path": "core/src/main/java/org/apache/iceberg/rest/RESTJacksonModule.java", "line": null, "type": "inline"}, {"author": "danielcweeks", "body": "I think it's fair to separate the two as the may evolve independently over time. I'd prefer not to try optimizing by reusing request/response objects.", "path": "core/src/main/java/org/apache/iceberg/rest/responses/CreateNamespaceResponse.java", "line": 33, "type": "inline"}, {"author": "kbendick", "body": "Agreed and updated", "path": "core/src/main/java/org/apache/iceberg/rest/responses/ListTablesResponse.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Fair point. I was using it to use the `toString` method of `ImmutableList` because Jackson can inject a list of some other type, but `ArrayList<String>` prints correctly so the defensive method calls aren't needed.\r\n\r\nRemoved.", "path": "core/src/main/java/org/apache/iceberg/rest/responses/UpdateNamespacePropertiesResponse.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Good call.", "path": "core/src/main/java/org/apache/iceberg/rest/responses/UpdateNamespacePropertiesResponse.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "didn't want to repeat the same comments across all classes, so just wanted to mention that my comments on `CreateNamespaceRequest` + `TestCreateNamespaceRequest` can be applied to the other classes as well", "path": null, "line": null, "type": "review_body"}, {"author": "nastra", "body": "just wondering whether it would make sense to have `CreateNamespaceRequest` just implement `equals()` rather than having to do it in test code?", "path": "core/src/test/java/org/apache/iceberg/rest/requests/TestCreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "just curious whether it would make sense to test more non-happy paths, where you're passing:\n* a null/empty json\n* a json with fields that that Ser/De components don't understand (think about 
somebody mistyping `namespace` or `properties`)\n* a json where `namespace` / `properties` get strings that they can't interpret (such as numbers)", "path": "core/src/test/java/org/apache/iceberg/rest/requests/TestCreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "do we need a unit test for this maybe? I know this is implicitly tested in the happy path, but I believe we should still test the non-happy path and verify that we're getting the error message that we're expecting", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "I'll be adding more unit tests for the non-happy paths. I removed some of the tests I had for preconditions (as some of the preconditions were determined to not be needed) and I might have accidentally removed this one. I'll find where I stashed it and add it in if needed (there's also been thought given to removing the `Preconditions` to provide more flexibility to implementors). \ud83d\udc4d \r\n\r\nBut agree on the non-happy path testing. I didn't want to clutter the PR too much at first in case things need", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Opened a PR to update the spec: https://github.com/apache/iceberg/pull/4039", "path": "core/src/main/java/org/apache/iceberg/rest/responses/CreateNamespaceResponse.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "would it make sense to add a comment to the code describing this workaround? 
I think this isn't clear to everyone stumbling across this part of the code why it's being done", "path": "core/src/main/java/org/apache/iceberg/rest/RESTJacksonModule.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "* null/empty JSON - I agree we should make sure that required fields are present\r\n* JSON with extra fields - I think extra fields should be ignored by the client so that we can evolve the payloads. Clients shouldn't fail if the service returns extra fields, and services shouldn't necessarily fail if the client sends extra fields.\r\n* JSON with invalid data - I agree", "path": "core/src/test/java/org/apache/iceberg/rest/requests/TestCreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "+1, Module has a specific meaning in jackson https://fasterxml.github.io/jackson-databind/javadoc/2.7/com/fasterxml/jackson/databind/Module.html", "path": "core/src/main/java/org/apache/iceberg/rest/RESTJacksonModule.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "I wonder if we should keep the consistency here: schema, partition spec and sort order all have their own `Parser` that is Iceberg's bridge for JSON serde. Should we also first have `TableIdentifierParser`, `NamespaceParser` (or maybe just everything in 1 parser class is fine...) and then have these Jackson specific serializers. This would also help Trino/Presto integration because registering a Jackson serializer is not an option.", "path": "core/src/main/java/org/apache/iceberg/rest/RESTJacksonModule.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "I think we can have the javadoc here for every request saying it maps to the XXXX in the OpenAPI spec, for more clarity of the readers who encounter this class. 
(I don't know if we can link that spec, if so it would be better)", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": 31, "type": "inline"}, {"author": "jackye1995", "body": "can we have a separated PR just for this to fix the quotation and other unrelated stuffs?", "path": "rest_docs/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Agreed. It was originally `Module` in the Jackson sense, as I was going to export the `Module` for others to add. But decided that it was easier to contain inside a method for people.\r\n\r\nWill rename.", "path": "core/src/main/java/org/apache/iceberg/rest/RESTJacksonModule.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Renamed to `RESTSerializers`.", "path": "core/src/main/java/org/apache/iceberg/rest/RESTJacksonModule.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Added the comment that Dan made almost word for word, though it lives in `JsonUtil` now. Open to moving the visibility changing back into this class and changing the comment as well.", "path": "core/src/main/java/org/apache/iceberg/rest/RESTJacksonModule.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Updated the comment.", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Updated (and added a check for `null` on the input as well as I see people do that sometimes in tests etc).", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Good point. I'll do it in just a separate test, one time.", "path": "core/src/test/java/org/apache/iceberg/rest/requests/TestCreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Removed to not use `satisfies`. 
I'd hoped that would provide more benefit than it did. Will remove the other `satisfies` instances as well.", "path": "core/src/test/java/org/apache/iceberg/rest/requests/TestCreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "As per a comment from @nastra, I chose to implement `equals` directly on the class and then this method became unnecessary as I can just do the equality check in line in the test.", "path": "core/src/test/java/org/apache/iceberg/rest/requests/TestCreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Yeah I agree. I changed it to simply implement `equals` (and `hashCode`). Now none of the tests will have this `assertEquals` method, just use the `equals` defined on the class.", "path": "core/src/test/java/org/apache/iceberg/rest/requests/TestCreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Added a few tests for malformed.\r\n\r\nSpecifically added:\r\n- an empty json\r\n- (String) null - simulating if the response entity has no body at all\r\n- fields are spelled incorrectly\r\n- `namespace` / `properties` have fields that aren't the correct type (invalid data)\r\n\r\nI agree that extra fields should just be ignored.", "path": "core/src/test/java/org/apache/iceberg/rest/requests/TestCreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "This specific Precondition was moved to the constructor (so that Jackson would go through the same path), but I have asserted on the message.", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Oh yeah. I don't even know how those changes got into this branch. I'll definitely revert them \ud83d\udc4d ", "path": "rest_docs/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "kbendick", "body": "These have been removed. 
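The negative-path cases listed in these comments (empty JSON, a null body, misspelled fields, wrongly typed values) amount to one contract: reject any payload whose required field cannot be recovered. A minimal Python sketch of that parsing contract (hypothetical function; the real code is Java/Jackson):

```python
import json

# Hypothetical parser that rejects a null body, an empty object, and a
# wrongly typed "namespace" field, while defaulting missing/null
# "properties" to an empty map.
def parse_request(body):
    if body is None:
        raise ValueError("request body is null")
    node = json.loads(body)
    namespace = node.get("namespace")
    if not isinstance(namespace, list):
        raise ValueError("namespace is required and must be a list")
    return {"namespace": namespace, "properties": node.get("properties") or {}}
```

Note that a misspelled key (e.g. `"namespaces"`) falls out of the same check, since the required `"namespace"` field is then missing.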
Sorry about that!", "path": "rest_docs/rest-catalog-open-api.yaml", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Earnest question: what should we say it maps to? The name? The name is the same as the classes in the OpenAPI spec. Is there something else we should link to?\r\n\r\nAs for actually linking, I agree that would be better though I'm not sure if it's doable. I'll look into it, but I suspect that it's not.", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": 31, "type": "inline"}, {"author": "kbendick", "body": "Not sure where my original comment went, so repeating it here.\r\n\r\nI had originally done it that way, but made a last-minute change. I was going to just use `TableIdentifierParser`, because `Namespace` is just an array of strings. However, if Trino needs it then that's a different story.\r\n\r\nI'll work on adding the `*Parser` back in for `TableIdentifier` (and possibly for `Namespace` as it sounds like Trino will need it).", "path": "core/src/main/java/org/apache/iceberg/rest/RESTJacksonModule.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "I can move the `TableIdentifierParser` and associated tests into another PR if that would be helpful @jackye1995.\r\n\r\nI'm also thinking that `NamespaceParser` would in fact be needed, because of the way we explicitly write `startObject` and `endObject` when generating json.", "path": "core/src/main/java/org/apache/iceberg/catalog/TableIdentifierParser.java", "line": 100, "type": "inline"}, {"author": "kbendick", "body": "If we're good with the tests, I'll start adding them to more of the request / response objects.", "path": "core/src/test/java/org/apache/iceberg/rest/requests/TestCreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Ok. Great. 
I've updated the constructors to default `null` / missing to empty map.", "path": "core/src/test/java/org/apache/iceberg/rest/requests/TestCreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "This has been merged.", "path": "core/src/main/java/org/apache/iceberg/rest/responses/CreateNamespaceResponse.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "I added `TableIdentifierParser` and a suite of tests for it that are similar to the ones used in the recent `SnapshotsRefParser` PR.\r\n\r\nI might move that to its own PR so it can be merged sooner.", "path": "core/src/main/java/org/apache/iceberg/rest/RESTJacksonModule.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "The way Jackson handles this with the current `visibilityModifiers` based approach for deserialization, things like `{ \"dropped\": 123 }` still deserialize as though they were `{dropped: true}`.\r\n\r\nPresumably anything that's truthy, but boxing `this.dropped` from primitive `boolean` helped somewhat (at least with strings). I'll add more tests.", "path": "core/src/main/java/org/apache/iceberg/rest/responses/DropNamespaceResponse.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "I opened an issue to follow up on this. 
Since the class names currently match up with the class names in the spec, I think we can handle that as a follow up (I genuinely need to see in which ways is that possible - though maybe somebody else knows better already).\r\n\r\nhttps://github.com/apache/iceberg/issues/4068", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": 31, "type": "inline"}, {"author": "kbendick", "body": "Closing this as we've decided to keep the responses separate for now.", "path": "core/src/main/java/org/apache/iceberg/rest/responses/CreateNamespaceResponse.java", "line": 33, "type": "inline"}, {"author": "rdblue", "body": "Annotation classes are ignored if they are not present in the classpath. Relying on annotations brings in a new class of problems caused by the end user messing up their classpath, so I am strongly in favor of avoiding them. If Jackson can infer everything that we need it to without them, then we can use that inference. Otherwise I think we should make serialization more explicit.\r\n\r\nI believe that we don't currently use Jackson annotations, so this change would also require updating all of our ", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Let me check if Jackson can infer everything.\r\n\r\nI would be in favor of the more explicit SerDe as well given what's mentioned above (or using a library to generate the classes for us would be my preferred approach but I know that comes with its own set of issues).", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "So I realized the issue. 
We can't validate the input if we just let Jackson infer things.\r\n\r\nFor example, if a server sends a `CreateNamespaceResponse` of `{ }`, it will not fail on deserialization and we will get `CreateNamespaceResponse(namespace=null, properties=null)`.\r\n\r\nThe annotations force it to go through the constructor, which is where we throw if the required / not-null field `namespace` isn't found.", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Lazily instantiated so that we serialize null vs empty in a way that will match what is parsed with equals. Accessor methods return immutable copies, so null values are tolerated.", "path": "core/src/main/java/org/apache/iceberg/rest/requests/UpdateNamespacePropertiesRequest.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "ImmutableMap.Builder will throw on a null map, so this allows for overwriting with `null`.\r\n\r\nPossibly we should throw instead so people don't use the builder improperly?", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "This is the form I'm planning on using for the valid SerDe tests so the named constants for the JSON can be inlined.", "path": "core/src/test/java/org/apache/iceberg/rest/responses/TestUpdateNamespacePropertiesResponse.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think this should be `public` because we don't want to leak Jackson in our API. 
Can it be private or package-private?", "path": "core/src/main/java/org/apache/iceberg/catalog/TableIdentifierParser.java", "line": 72, "type": "inline"}, {"author": "kbendick", "body": "Presently it's used in `org.apache.iceberg.rest.RESTSerializers`, so it can't be made package-private as is.\r\n\r\nThe `JsonSerializer<TableIdentifier>` (and corresponding Deserializer) could be moved into another class that is public, which calls into this one that is package private? \r\n", "path": "core/src/main/java/org/apache/iceberg/catalog/TableIdentifierParser.java", "line": 72, "type": "inline"}, {"author": "kbendick", "body": "Here's the code that calls it: https://github.com/apache/iceberg/pull/4037/files#diff-ba7699a6ab80be9031fd6324b347a087f10646d57339a47339f60699a6bf4b37R103-R109", "path": "core/src/main/java/org/apache/iceberg/catalog/TableIdentifierParser.java", "line": 72, "type": "inline"}, {"author": "kbendick", "body": "`TableMetadataParser` has the same `public` type signature in it as well.", "path": "core/src/main/java/org/apache/iceberg/catalog/TableIdentifierParser.java", "line": 72, "type": "inline"}, {"author": "rdblue", "body": "Okay, should be fine then. Thanks!", "path": "core/src/main/java/org/apache/iceberg/catalog/TableIdentifierParser.java", "line": 72, "type": "inline"}, {"author": "rdblue", "body": "It looks like this could be put in `JsonUtil` as `getStringArray`.", "path": "core/src/main/java/org/apache/iceberg/rest/RESTSerializers.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Most util classes in the codebase use a singular `Util` suffix. Not a big deal, but it would be nice to match.", "path": "core/src/main/java/org/apache/iceberg/rest/RESTUtils.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Minor: This seems like an odd location for a `builder()` factory method. 
I normally put these just above the builder class.", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What fails about this? Is it that there are unknown keys in the request?", "path": "core/src/test/java/org/apache/iceberg/rest/requests/TestCreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This class appears to only be used in tests. Is it necessary to make it public or part of src/main?", "path": "core/src/main/java/org/apache/iceberg/rest/RESTUtils.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This doesn't work like the other builders where we have, for example, `update` and `updateAll` that don't replace the entire map.\r\n\r\nI would say that if you pass `null` here then this method should fail with a `NullPointerException` in `properties.putAll`.", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": 86, "type": "inline"}, {"author": "kbendick", "body": "It fails because there aren't values for what is needed. That's why `validate` is required.\r\n\r\nWith the way the ObjectMapper is presently configured, it doesn't fail on unknown fields (though other object mappers might).", "path": "core/src/test/java/org/apache/iceberg/rest/requests/TestCreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "I updated it to a `ValidationException` before I saw this. 
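The Preconditions discussion here — fail fast with an exception when a null map is passed to the builder, rather than silently tolerating it — can be sketched in Python; `ValueError` stands in for the `NullPointerException`/`ValidationException` debated in the thread, and the names are hypothetical:

```python
# Hypothetical builder: set_properties rejects null input up front
# (akin to Guava's Preconditions.checkNotNull) instead of deferring
# the failure to build time.
class RequestBuilder:
    def __init__(self):
        self._properties = {}

    def set_properties(self, properties):
        if properties is None:
            raise ValueError("Invalid properties map: null")
        self._properties.update(properties)
        return self

    def build(self):
        return dict(self._properties)
```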
Can change to an NPE if you'd like.", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": 86, "type": "inline"}, {"author": "rdblue", "body": "I'd probably use Preconditions.", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Since this led to modifying the object inside of `validate`, I think it is better to remove `equals` and `hashCode` and default to `ImmutableMap.of()` in the `properties` getter. You can add `assertEquals` in the tests to reuse code.", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think you can make `expected` an object and pass that to the object mapper so that these can be shared across test classes.", "path": "core/src/test/java/org/apache/iceberg/rest/requests/TestCreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Default properties here rather than modifying the object.", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "I moved it to `test` for now. 
We might wind up needing it later, but given that it's not needed presently anywhere outside of tests I think that's not a bad idea to keep it out of any potentially public APIs for now.", "path": "core/src/main/java/org/apache/iceberg/rest/RESTUtils.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Updated to NPE via `Preconditions.checkNotNull`.", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": 86, "type": "inline"}, {"author": "kbendick", "body": "Updated to use `Preconditions`.", "path": "core/src/main/java/org/apache/iceberg/rest/requests/CreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "I just used a templated abstract base class instead. That way the code is shareable but we don't have to cast. Let me know if you'd prefer just using `Object`.", "path": "core/src/test/java/org/apache/iceberg/rest/requests/TestCreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Would be nice to have error message validation.", "path": "core/src/test/java/org/apache/iceberg/rest/requests/TestCreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Will add that in!", "path": "core/src/test/java/org/apache/iceberg/rest/requests/TestCreateNamespaceRequest.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think it's fine to catch empty namespace in the `CreateNamespaceRequest` object, but this should allow an empty namespace because it is a valid thing. 
You can have tables in the catalog root, so their namespaces are empty.", "path": "core/src/main/java/org/apache/iceberg/catalog/TableIdentifierParser.java", "line": null, "type": "inline"}], "1525": [{"author": "kbendick", "body": "Some initial comments (in part so I can more easily follow along), but this is very exciting.", "path": null, "line": null, "type": "review_body"}, {"author": "kbendick", "body": "EDIT: I goofed. We're creating iceberg tables from Spark tables. So my 1st suggestion should be ignored entirely. Of course spark tables won't have hidden partition specs as that's an iceberg concept and we're using this to convert tables from spark to iceberg \ud83e\udd26 . Is it possible to re-use a spark table's bucket based partitioning? I've never personally used it due to the small files problem it can generate, but are our hash functions for bucketing so different from the Spark bucketing (or some o", "path": "spark/src/main/java/org/apache/iceberg/spark/SparkSchemaUtil.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "I think that you can get the table DDL via a utility function in catalyst, and then there's a function that returns a schema from DDL. However, if this is returning an Iceberg schema then do please ignore me :)", "path": "spark/src/main/java/org/apache/iceberg/spark/SparkSchemaUtil.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Can we create a separate PR for this and merge it in sooner? 
I imagine this PR will take longer to merge because of its large scope, but this seems useful now.", "path": "spark/src/test/java/org/apache/iceberg/spark/SparkTestBase.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Definitely want this to be more clear, I think your documentation is correct, either the table is already partitioned with identity transforms, if so we need to know which, or it has no partitioning information.", "path": "spark/src/main/java/org/apache/iceberg/spark/SparkSchemaUtil.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Yeah this is the IcebergSchema :)", "path": "spark/src/main/java/org/apache/iceberg/spark/SparkSchemaUtil.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Sure, I just added this in because my tests were yelling at me all the time because I like to throw random kill codes around when debugging. I decided to just make sure it would stop bothering me.", "path": "spark/src/test/java/org/apache/iceberg/spark/SparkTestBase.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "I figured once I typed it out. I should really start deleting my comments once I come to the answer rubber duck debugging with myself in the github comments \ud83d\ude05 .", "path": "spark/src/main/java/org/apache/iceberg/spark/SparkSchemaUtil.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "So I've thought about this more, and I think that what might be missing / what is bothering me is honestly just an empty line between the doc comment and the section for params. 
\ud83d\ude05 \r\n\r\nSo definitely file that one under `nits`.", "path": "spark/src/main/java/org/apache/iceberg/spark/SparkSchemaUtil.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "I can create a separate PR to handle this now if you don't mind.", "path": "spark/src/test/java/org/apache/iceberg/spark/SparkTestBase.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "@RussellSpitzer I created a PR that cherry picks just this change as this PR's scope is pretty large and this small change can be merged in much more quickly: https://github.com/apache/iceberg/pull/1529", "path": "spark/src/test/java/org/apache/iceberg/spark/SparkTestBase.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think that we should leak the v1 Spark API (`TableIdentifier`) in a util class like this. What about passing database and name separately? Then the caller is responsible for adding database, which avoids the need to use v1 `spark.catalog()` to default it.", "path": "spark/src/main/java/org/apache/iceberg/spark/SparkSchemaUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "#1529 was merged.", "path": "spark/src/test/java/org/apache/iceberg/spark/SparkTestBase.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think that `spark.catalog().currentDatabase()` is correct. 
I thought that `spark.catalog()` always returns the built-in v1 catalog.", "path": "spark3/src/main/java/org/apache/iceberg/spark/CreateActions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Is it possible to use the default `V2SessionCatalog` and Iceberg's non-session catalog?", "path": "spark3/src/test/java/org/apache/iceberg/spark/actions/TestCreateActions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This doesn't look qualified?", "path": "spark3/src/test/java/org/apache/iceberg/spark/actions/TestCreateActions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this is referred to as \"snapshot\" in the tests, right? (From the `SNAPSHOT TABLE` command?)\r\n\r\nAnd the other one is a \"migrate\"?", "path": "spark3/src/main/java/org/apache/iceberg/spark/MigrateAction.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "We normally put constructors before public methods. Minor, but this was harder to find than needed.", "path": "spark3/src/main/java/org/apache/iceberg/spark/MigrateAction.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "How does the user know which one is going to happen? Table names are passed into all of the methods in `CreateActions`.", "path": "spark3/src/main/java/org/apache/iceberg/spark/MigrateAction.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I'm finding it hard to follow when this would migrate a table vs snapshot a table, and it also requires a separate class.\r\n\r\nDid you consider an API using verbs directly? 
I'm imagining something like this:\r\n\r\n```java\r\nActions.migrate(\"db.table\").execute();\r\nActions.snapshot(\"db.table\").as(\"db.table_iceberg\").execute();\r\n\r\n// maybe even create like?\r\nActions.createTable(\"db.table\").like(\"db.table_hive\").execute()\r\n```\r\n\r\nWe could add methods as well, but those verbs correspond to the SQL that we ", "path": "spark3/src/main/java/org/apache/iceberg/spark/CreateActions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Does this mean that only Spark tables are supported and not Hive tables? I don't think that Hive tables have providers.", "path": "spark3/src/main/java/org/apache/iceberg/spark/MigrateAction.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Style nit: Both \"Spark\" and \"OrFail\" are implied, so I think they are just making this method name longer. It could be `getSessionCatalog`.", "path": "spark3/src/main/java/org/apache/iceberg/spark/MigrateAction.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think everything after the staged table is created should be in the `try` block so that any exception thrown will roll back any changes that were made.", "path": "spark3/src/main/java/org/apache/iceberg/spark/MigrateAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "If we wanted the most limited api I think we want\r\nMigrate - Takes one arg (identifier)\r\nSnapshot - Takes two args (identifier source, identifier dest, location)\r\n\r\nI'm a little worried about having a \"snapshot\" verb with \"as\" since I feel like it also requires an \"at\" for a location and then I have to do a runtime check to make sure they are both used unless I make a series of chaining classes.\r\n\r\nSnapshotStart --- .as(identifier) --> SnapshotWithDest --- at(location) --> CreateAction\r\n\r\nAlt", "path": "spark3/src/main/java/org/apache/iceberg/spark/CreateActions.java", "line": null, "type": "inline"}, {"author": 
"RussellSpitzer", "body": "Sure, the main issue is that the old version just always used the string literal \"default\" no matter what. Which I don't think is the right thing to do.", "path": "spark/src/main/java/org/apache/iceberg/spark/SparkSchemaUtil.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I actually am not using this method any more so I'll drop it and revert back to \"default in the base specForTable", "path": "spark/src/main/java/org/apache/iceberg/spark/SparkSchemaUtil.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I took a more locked down approach to this for the mean time. We only expose a\r\nActions.migrate\r\nActions.snapshot\r\n\r\nTo the public, this class instead becomes package protected with a few public methods, \r\nThose inherited from Action\r\nwithProperty/ies : for adding table properties\r\nas : for changing the destination", "path": "spark3/src/main/java/org/apache/iceberg/spark/CreateActions.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Changed up the api to be much more explicit about what is happening.", "path": "spark3/src/main/java/org/apache/iceberg/spark/MigrateAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Added in a Hive test (at least I think so, used CREATE EXTERNAL TABLE LOCATION ....) 
which I though triggered the hive path, works for this.", "path": "spark3/src/main/java/org/apache/iceberg/spark/MigrateAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "The provider for hive tables is \"hive\", doublechecked", "path": "spark3/src/main/java/org/apache/iceberg/spark/MigrateAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Sounds good, fixed", "path": "spark3/src/main/java/org/apache/iceberg/spark/MigrateAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Yes added in test for this, but removed the Hadoop one which I think may not work for either Snapshot or Migrate because of directory requirements. I believe this was passing previously because some of the names were colliding.", "path": "spark3/src/test/java/org/apache/iceberg/spark/actions/TestCreateActions.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "It was once :P, renamed", "path": "spark3/src/test/java/org/apache/iceberg/spark/actions/TestCreateActions.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Removed this entirely, we now require a Catalog and Identifier to remove any ambiguity here. 
We parse this pair from the name as a multipart identifier, see new utility methods.", "path": "spark3/src/main/java/org/apache/iceberg/spark/CreateActions.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Base class for Actions that are common to both Spark 2 and 3", "path": "spark/src/main/java/org/apache/iceberg/actions/CommonActions.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "This change has the unfortunate consequence that there is now an Actions in both Spark2 and Spark3 modules, which means that our abstract tests which use both have to delegate their \"Actions.forTable\" method to their implementations.", "path": "spark/src/main/java/org/apache/iceberg/actions/CommonActions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Typo: Should be \"migrate\" not \"snapshot\".", "path": "spark/src/main/java/org/apache/iceberg/actions/Actions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "The doc here is really specific to Spark, but I don't think there is a need for it. How about something like \"The table will no longer be accessible using the previous implementation\"?\r\n\r\nAlso, new data will be written to the `data` directory to avoid breaking any copies of the table. When we migrate, we leave a `db.table_hive` copy that can be renamed back in place to roll back the operation. The user is responsible for dropping the `_hive` copy or doing renames to rollback. Not sure how much o", "path": "spark/src/main/java/org/apache/iceberg/actions/Actions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Same here, should be \"migrate\".", "path": "spark/src/main/java/org/apache/iceberg/actions/Actions.java", "line": 154, "type": "inline"}, {"author": "rdblue", "body": "The wording in this Javadoc and the argument names seem a little confusing to me. 
I think it should be \"Creates a new Iceberg table that is a snapshot of the given source table\", then \"The new table can be altered, . . .\". Referring to the table that gets created as a table should help users understand what is happening a bit better than referring to it as a \"snapshot\", which has a conflicting meaning when used as a noun. Here, I think we want to stick to using it as a verb.\r\n\r\nSince the \"source", "path": "spark/src/main/java/org/apache/iceberg/actions/Actions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "It may be helpful to link to the `CreateAction` API rather than just calling out that it is an Action.", "path": "spark/src/main/java/org/apache/iceberg/actions/Actions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "`Actions.migrate` returns a `CreateAction`, but `as` seems to change the operation from an in-place migration to something unclear. Is the original table retained if the name doesn't conflict?\r\n\r\nSimilarly, I don't think there is a need for this with `Actions.snapshot` because snapshot accepts a destination table name. When I suggested `as`, my intent was to use it as a way to pass the destination table name for snapshot. But if you can't create a `SnapshotAction` without a table name we don't n", "path": "spark/src/main/java/org/apache/iceberg/actions/CreateAction.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why not `set` and `setAll`? Is \"additional\" more clear?", "path": "spark/src/main/java/org/apache/iceberg/actions/CreateAction.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can we avoid Scala in this API to avoid breaking changes? 
I think converting to `List` immediately and then calling this method is going to be better.", "path": "spark3/src/main/java/org/apache/iceberg/spark/Spark3Util.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why is this needed?", "path": "spark3/src/test/java/org/apache/iceberg/actions/TestCreateActions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What is `IcebergV2Catalog`? It looks like `SparkCatalog` supports staged tables with both Hadoop and Hive catalogs.", "path": "spark3/src/test/java/org/apache/iceberg/actions/TestCreateActions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Are these private?", "path": "spark3/src/test/java/org/apache/iceberg/actions/TestCreateActions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Nit: extra newline", "path": "spark3/src/test/java/org/apache/iceberg/actions/TestCreateActions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can you use `Assume` for this instead of `if`? That way it shows up as a skipped case.", "path": "spark3/src/test/java/org/apache/iceberg/actions/TestCreateActions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I like the code reuse that these tests have, but I don't think that the pattern of a single method like this is very readable. For example, it isn't clear what `expectedMigratedFiles` is until you read this, but it differs between test cases.\r\n\r\nI think a better pattern for reuse is to use separate methods that are well named. These tests have a great example of what I'm talking about with `testIsolatedSnapshot`. You could add a boolean to `testCreate` for whether this method should call `testIs", "path": "spark3/src/test/java/org/apache/iceberg/actions/TestCreateActions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Session catalog?\r\n\r\nDo we have a way around this restriction? 
In our implementation, we load the source table using our Hive or Spark table implementation and check that it is what we expect. Then we use that implementation to load the partitions. Would we similarly require a v2 table implementation to make this catalog agnostic?\r\n\r\n(This isn't a blocker, just curious to hear your ideas.)", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3CreateAction.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This excludes Hive tables that don't use Spark's provider, including those created with `STORED AS parquet` instead of `USING parquet`. I don't think that is necessary. Isn't the only requirement that all of the partitions are a supported format?", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3CreateAction.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Should this use a static set of table providers instead of passing them in?", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3CreateAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "This was actually part of an earlier discussion I had with @aokolnychyi, We were discussing whether we should be preserving the properties set in the origin table for the operation. I had \"additional\" here because I settled on a behavior we had previously be using internally which copied over whatever properties the original source originally had. 
\r\n\r\nI can go back to \"set and setAll\" but I would like to hear your opinion on copying properties from the origins.", "path": "spark/src/main/java/org/apache/iceberg/actions/CreateAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Already upset about that Scala 2.13 change :) Yeah no problem, I'll stay Java only in message signatures with the exception of Spark Specific classes.", "path": "spark3/src/main/java/org/apache/iceberg/spark/Spark3Util.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "The limitation here is based on how the current SparkUtil.import code is written. It can't handle DSV2 catalogs or things like that so I thought this would be the easiest approach for now.\r\n\r\nI think what we should do is modify that utility function to be less SparkCatalog specific, or at least have ways of calling it that describe a table based on partitions (like in your code) and other properties directly rather than assuming it is a \"table name\" that can be looked up in the catalog.\r\n\r\nThere", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3CreateAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Yeah I can probably allow this and test it better. I was running into issues because the HadoopOps fails if you manually specify a location for metadata or data since that violates the table structure. Since the current implementation always manually specifies these locations it breaks with the Hadoop Catalog if you don't pick out the exact right location when snapshotting.\r\n\r\nLet me go back and see if I can make that less confusing and support the Hadoop backed Catalog better", "path": "spark3/src/test/java/org/apache/iceberg/actions/TestCreateActions.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Yep, just learned about Assume last week. 
Will implement it here!", "path": "spark3/src/test/java/org/apache/iceberg/actions/TestCreateActions.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "This is mostly because the tests take way too long when I was running them locally because of HiveMetastore retries. When every test had a unique name I didn't have to drop tables and I wouldn't have to wait for the metastore to check for 2 minutes to see whether the table existed when I called DROP IF EXISTS. \r\n\r\nThis is one of the reasons I brought up the retry length a while back, I just couldn't efficiently test this code on my local machine.", "path": "spark3/src/test/java/org/apache/iceberg/actions/TestCreateActions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "It sounds like creating a snapshot table should just use the new table's default location rather than requiring a location. Why did you choose to require a location?", "path": "spark3/src/test/java/org/apache/iceberg/actions/TestCreateActions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think a `DROP IF EXISTS` should take 2 minutes with the retry. That sounds like something in the catalog is broken. The retry should only happen if the metadata file is known, but can't be loaded. In that case, the table does exist.\r\n\r\nMaybe the test before/after methods are dropping the temporary files before dropping the table?", "path": "spark3/src/test/java/org/apache/iceberg/actions/TestCreateActions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think both migrate and snapshot should copy everything from the original table, including properties. 
That's one reason why I like `set`: it makes no guarantees about the other table's properties, only that the given key/value will be set in the new table.", "path": "spark/src/main/java/org/apache/iceberg/actions/CreateAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I'll consider the fall back behavior in the action itself, I think we could leave another copy behind but I think that probably should be a user operation rather than a automatic part of migrate. I could be convinced otherwise though :shrug:", "path": "spark/src/main/java/org/apache/iceberg/actions/Actions.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I can remove it, for snapshot it doesn't make sense but for Migrate it allowed you to use the same location for data as the original table but using a different catalog identifier, it wouldn't remove the old identifier if it didn't match. That may be of limited utility so I'll drop it.", "path": "spark/src/main/java/org/apache/iceberg/actions/CreateAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Yep, switching.", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3CreateAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Yeah let me double check this.", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3CreateAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Just had an internal user asking the same thing, gonna allow both.", "path": "spark3/src/test/java/org/apache/iceberg/actions/TestCreateActions.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "So this is ok, Managed tables return a provider of hive.\r\n\r\n![image](https://user-images.githubusercontent.com/413025/99304613-33e67700-2818-11eb-8f25-03cfa32e5fef.png)\r\n", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3CreateAction.java", "line": null, "type": 
"inline"}, {"author": "RussellSpitzer", "body": "Also added managed hive table tests\r\n", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3CreateAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Well, it seems we rely too much on reflection here. When we decided to use reflection to split actions, we did not have these static methods and it was reasonable since the scope of the change was smaller compared to introducing `BaseActions`. I am no longer sure reflection is a good idea here as making these methods work with reflection is more complicated than having `BaseActions`. On top, we don't have compile time checks. Users will call these methods in Spark 2 and will get runtime exceptio", "path": "spark/src/main/java/org/apache/iceberg/actions/Actions.java", "line": 92, "type": "inline"}, {"author": "aokolnychyi", "body": "What are your thoughts on this, @RussellSpitzer @rdblue?\r\n\r\nI think it is not too late to reconsider this.", "path": "spark/src/main/java/org/apache/iceberg/actions/Actions.java", "line": 92, "type": "inline"}, {"author": "rdblue", "body": "I think not implementing these in Spark 2 is definitely a reason to reconsider this.\r\n\r\nWhat about somewhere in the middle? We could introduce an API that isn't static, then call its methods from the static ones here. Then we just need an implementation class, which we could load dynamically.", "path": "spark/src/main/java/org/apache/iceberg/actions/Actions.java", "line": 92, "type": "inline"}, {"author": "RussellSpitzer", "body": "I still prefer having BaseActions with an Actions per module. I think having two classes each which determine their implementation at runtime is not much better than all the static method reflection currently in the PR. 
The user experience would be worse, since there is a weird middle method to call, and we still would have runtime exceptions for bad method calls.\r\n\r\nMaybe I don't understand the API that isn't static plan.", "path": "spark/src/main/java/org/apache/iceberg/actions/Actions.java", "line": 92, "type": "inline"}, {"author": "aokolnychyi", "body": "If we don't assign an explicit location while snapshotting, how are we going to validate the new location is different compared to the existing table location?", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3CreateAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I am not sure this is correct. Shouldn't we be assigning the table location, not data location?", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3CreateAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I don't think there is a way we can actually know what the Metastore does, so there is a possibility that a user's default location could be the same", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3CreateAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "We have to check both the data and metadata locations eventually to ensure users are not changing it through table properties.", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3CreateAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Is it enough to check if it is instance of `BaseCatalog` we introduced recently?", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3CreateAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Yep, Sorry this PR has been in progress for a very long time, I haven't kept up with all the other changes.", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3CreateAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I 
think we can change `SparkTableUtil` if we need to. Can be done later, though.", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3CreateAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I feel like asking for an explicit location in case of snapshot is safer but I can be convinced otherwise.", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3CreateAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "We can't actually do this with the current package structure since baseCatalog is package private in Spark and this is in Actions. We could always move spark's action package though", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3CreateAction.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "`specForTable` is a bit hacky. At a minimum, it seems strange that this would create a string table name just for `specForTable` to split on `.` immediately.\r\n\r\nSince the source table is loaded as a v1 table just after this, why not use the schema and partition fields from the `CatalogTable` instead?\r\n\r\nWe should probably also deprecate `specForTable` if we are going to maintain this, since this can be a much better utility for conversion.", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3CreateAction.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why not use `TableCatalog`?", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3CreateAction.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "How is `hive` handled? 
Do we just assume that the partitions are a supported format?", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3CreateAction.java", "line": 53, "type": "inline"}, {"author": "rdblue", "body": "It would probably be better to use a mutable map because this one will reject duplicate properties instead of overwriting.\r\n\r\nYou can also use `ImmutableMap.builder()` instead of supplying the key and value types.", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3MigrateAction.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can't this use `stagedTable` properties instead?", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3MigrateAction.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Since we know the properties are in the table, should we look up this location from properties?", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3MigrateAction.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This is another place where we might want to update our own API rather than creating a special instance (`v2BackupIdentifier`) to pass through.", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3MigrateAction.java", "line": 86, "type": "inline"}, {"author": "rdblue", "body": "I think it is usually better to do work like this in a `finally` block using a `threw` boolean:\r\n\r\n```java\r\nboolean threw = true;\r\ntry {\r\n // something that might fail\r\n threw = false;\r\n} finally {\r\n if (threw) {\r\n // clean up\r\n }\r\n}\r\n```\r\n\r\nThat has the advantage that whatever happened in the block, you get the correct exception rather than needing to throw a generic `RuntimeException` which would make it difficult to use this action in a higher-level application. 
Also, there are technic", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3MigrateAction.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Since the changes are only staged, this should happen after restoring the backup table, in case there is a failure in the abort.", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3MigrateAction.java", "line": 104, "type": "inline"}, {"author": "rdblue", "body": "I think of \"apply\" as using the name mapping. What about \"add\" instead because this creates it and adds it to the table?", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3CreateAction.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Here as well, this will fail if the properties conflict. Since we know that Iceberg won't modify them it is safe to use a mutable map.", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3SnapshotAction.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can you add more whitespace between control flow blocks in this PR?", "path": "spark3/src/main/java/org/apache/iceberg/actions/SparkActions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why is the location passed to `snapshot`? That seems like something that should be set on the action instead because it isn't required.", "path": "spark3/src/test/java/org/apache/iceberg/actions/TestCreateActions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I would probably remove this. 
I don't see a need to supply location this way and it adds complexity with an additional dynamic method call.", "path": "spark/src/main/java/org/apache/iceberg/actions/Actions.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "This seems to be still open.", "path": "spark/src/main/java/org/apache/iceberg/actions/Actions.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I think the last time we chatted having `SparkActions` or moving current actions to `org.apache.iceberg.spark` was one of the most promising ideas due to its simplicity. \r\n\r\nI agree it is not something we should address in this PR but I'd try to solve it before 0.11.", "path": "spark/src/main/java/org/apache/iceberg/actions/Actions.java", "line": 92, "type": "inline"}, {"author": "RussellSpitzer", "body": "woops only converted the lower one, this one is fixed too now", "path": "spark/src/main/java/org/apache/iceberg/actions/Actions.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Removed, added another Interface SnapshotAction, with \"withLocation\" method for supplying location", "path": "spark/src/main/java/org/apache/iceberg/actions/Actions.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "This isn't actually used, I forgot to remove it, we now use the Transforms out of the V1Table representation", "path": "spark3/src/main/java/org/apache/iceberg/actions/Spark3CreateAction.java", "line": null, "type": "inline"}], "10981": [{"author": "szehon-ho", "body": "@jiayuasu do you have any suggestion for default here?", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "I could add this section later too, once its implemented (same for ORC below)", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "How does this work? Does Iceberg need to interpret each WKB to produce this value? 
Will it be provided by Parquet?", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think that this could ever be a `geometry` because the only way that would happen is using the `identity` transform, which is probably not allowed.", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Except for `geometry`?", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Yes, once we switch to Geometry logical type from Parquet we will get these stats from Parquet.", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "My bad, let me revert this one", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Should we mention that it is the parquet type `BoundingBox`?", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Yep thanks", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "@rdblue @flyrain hmm is there a specific reason? I think it could work technically as Geometry can be compared, unless I miss something", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "yea will add a footnote here", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Maybe that's fine if it is comparable, but practically people will always use `xz2`, right? 
I'm not sure, but wondering if there is some implications, e.g., too expensive, or super high cardinality, so that we don't recommend user to use the original GEO value as the partition spec.", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Are we going to support bucketing on GEO?", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Yea I think z, m are just 0 in this case?", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "yea I think its possible to do (it's just the wkb value after all), you are right , not sure if any good use case. Yea we have to get the wkb in any case, i am not sure if its that expensive, but can check. But I guess the cardinality is the same consideration as any other type (uuid for example), and we let the user choose ?", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "I think its possible, again not sure the utility. Geo boils down to just WKB bytes", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "yea sure, i had Edges already as E, so need to think of another letter", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "> We should say For Geometry type, this is a WKB-encoded Point composed of the min value of each dimension in all Points in the Geometry. Then we don't have to worry about the Z and M value.\r\n\r\n@jiayuasu @Kontinuation @wgtmac Done, thanks.", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "@flyrain \r\n> Should we mention that it is the parquet type BoundingBox?\r\n\r\nActually looking again after some time, not sure how to mention this here, as that is filetype specific. 
This is an optional field, and only set if type is parquet and bounding_box is set, but that's an implementation detail.", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Changed from type to encoding.\r\n\r\nActually i made all three have default values (C=\"OGC:CRS84\", CE=\"PROJJSON\", E=\"planar\"), will that make sense? Trying to make the common case be less verbose. @wgtmac @jiayuasu ", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Related to above comment, I think these will all be optional (take a default value if not specified).", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "I think I copied from Parquet pr wording at the time. Done, thanks.", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Done, let me know if it sounds ok or not.", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "flyrain", "body": "I feel that the argument for `identity` can apply here as well. In that case, we can support it, but it's users' call to use it or not.", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "@jiayuasu i put it in the example (if you render the page). let me know if its not what you meant.", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "@dmitrykoval i was debating this. \r\n\r\nI guess we talked about it before, but the Java reference implementation, we cannot easily do pruning (file level, row level, or partition level) because the JTS library and the XZ2 only support non-spherical. 
We would need new metrics types, new Java libraries, and new partition transform proposals if we wanted to support it in Java reference implementation.\r\n\r\nBut if we want to support it, I'm ok to list it here and have checks to just skip pruning for s", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Yea, forgot to mention explicitly that in Iceberg, pruning is always an optional feature for reads, so no issue.", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Is there anything for this specific line we need to change? As long as we get from Parquet some way we are ok here, but is the format of the lower/upper bound still ok?", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I think maybe we should just link out for the requirements here since it's a bit complicated.\r\n\r\nThe description as well could be \r\n\r\nSimple feature geometry [Appendix G](#appendix-g), Parameterized by ....\r\n\r\nI also don't think we should allow it to be unset ... can we just require that a subclass is always picked? 
We could recommend a set of defaults for engines to set on field creation but I'm not sure we need to be that opinionated here.", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Dead link, can we get a reference to ArchiveX or something stable?\r\n\r\nI found one other reference that works here, \r\nhttps://www.dbs.ifi.lmu.de/Publikationen/Papers/SSD-XZ-Order.final.pdf\r\n\r\nBut Universities are generally terrible at keeping their links up", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Style request, let's try to move these notes into footnote [2] Or a reference to a Geometric spec doc spot since this description is already quite long", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Style request ```[Appendix G](#appendix-g)```", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "```[Appendix G](#appendix-g)```", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "```[Appendix G](#appendix-g)```", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "nit: missing . after 3", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Link ```[Appendix G](#appendix-g)```", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "link again", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "might as well fix this one too while we are at it :)", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I think maybe we should just link out for the requirements here since it's a bit complicated. 
Remove \"an object\"\r\n\r\nThe description as well could be\r\n\r\nSimple feature geometry [Appendix G](https://github.com/apache/iceberg/pull/10981#appendix-g), Parameterized by ....\r\n\r\nI also don't think we should allow it to be unset ... can we just require that a subclass is always picked? We could recommend a set of defaults for engines to set on field creation but I'm not sure we need to be that opinionate", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Let's move these details out of the description and either into the footnotes or another section for geometry. ", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think we can be more specific here and call out the standard that we are referencing, like we do with IEEE 754. This is \"Geometry features as WKB(link) stored in coordinate reference system C and edge type E (see Appendix G)\"\r\n\r\nI would also say that \"If not specified, C is OGC:CRS84 and E is planar\".", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "We may want to change this to be like `identity`, using `Any except [...]`.\r\n\r\nI would not include geo as a source column for bucketing because there is not a clear definition of equality for geo. The hash would depend on the structure of the object and weird things happen when two objects are \"equal\" (for some definition) but have different hash values.", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think that we need clear details about how to implement this in the Iceberg spec, rather than a reference to an external paper (although that is good to include as well).\r\n\r\nWhat about removing xz2 for now and adding it later? We've already made the changes to the spec to support adding new transform functions without breaking forward-compatibility, so we don't need to couple these things together. 
If we uncouple the additions, then it is much more likely that geometry can go into v3.", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Could you remove the reformatting so we can more easily look at the changes?", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why a WKB point rather than a set of 3 or 4 values?", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this was on purpose to separate the sections in markdown.", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I would probably _not_ specify how to hash geometry because we don't yet know how to do it correctly. The reason why we have the second table (hash requirements that are not part of bucket) is that we don't want anyone to forget that float and double should hash to the same value.", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think the example should be a well-known CRS, like `OGC:CRS84`. We should also add a _note_ that states that implementations are encouraged to store the PROJJSON representation of the CRS in a table property using the ID from this field.", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "@rdblue am not entirely sure where this part of the spec is used. Should it also match the above (the more optimized serialization for stats)?", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think we need to be a little more specific here. The way I read this is that you can take min and max values for each dimension in the point, but that isn't sufficient for spherical edges.\r\n\r\nI think this needs to state that the lower and upper bounds must be less than or equal (or greater than or equal) to the values of any point that is located on an edge of the geometry object. 
In other words, the bounding box must contain all points that are in the geometry object.\r\n\r\nIf we don't have that", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I'd prefer not to reference a PR.", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why big endian? Most of the time we use little endian in the format, with the only exception being the encoded decimal values.", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "@szehon-ho, I don't think we should specify this or allow geometry in bucket transforms because of issues with equality.", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Also, what is the encoding for these values? 8-byte IEEE 754?", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This is used for default values and for encoding values in JSON expressions for filtering.\r\n\r\nWhat about using WKT here instead of WKB?", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Per a previous conversation , it could be beneficial to have it in this form even for spherical edges. An engine could do some conversion from the bounding box to be useful for spherical edge. \r\n\r\nDo you mean, you do not want this option at all ? 
(I suppose due to risk of mis-interpreting it)", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Makes sense, added little-endian and 8-byte IEEE 754 (ie double type) for each coordinate", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Removed these, can add the logical types once the pr is merged in Parquet", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Good idea, added.", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Thanks @Kontinuation for the suggestion. Added specification to use NaN for a coordinate if unset.", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "@jiayuasu @rdblue does this sound better now?\r\n\r\nWondering, can we just say 'westernmost bound of all geometries in the file' for both PLANAR and SPHERICAL", "path": "format/spec.md", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "@wgtmac @paleolimbot as well", "path": "format/spec.md", "line": null, "type": "inline"}], "2660": [{"author": "rdblue", "body": "@HeartSaVioR, can you help us find a better way to do this? It would be great to have your review here since you're the expert on Spark streaming!", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "We're trying not to expand our use of Guava because it needs to be shaded and can change between versions. I think there should be an easy non-Guava way to do this. 
Can you try that instead?", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "We could add a `fromJson` method that accepts an `InputStream` instead of using `CharStreams` as well.", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Nit: we don't typically use `final` for local variables because the compiler is good at inferring it and it doesn't show up in byte code. There is little value in adding it.", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "NPE? If currentSnapshot() is null then we would npe here calling parentID", "path": "core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I would recommend we pass only the SparkSession here and not the context as well.\r\nThis would avoid any possibility that you accidentally pass a SparkSession from another context. Also fine to do SparkSession.active.get() (i forget if that's the real call)", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "This may cause us issues in the future, HDFSMetadataLog is a little internal but it extends Logging which is very internal and has burned me in the past. That said I think this would be good re-use .... 
Anyone else have strong feelings on this?", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "One other issue, we can't support S3 based paths with this \r\n```\r\nNote: HDFSMetadataLog doesn't support S3-like file systems as they don't guarantee listing files in a directory always shows the latest files.\r\n```", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "In the current implementation this should probably remove offsets from the offset log that are older than \"end\" but I'm not sure we want to keep that Spark internal class in the implementation here", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": 163, "type": "inline"}, {"author": "RussellSpitzer", "body": "This implementation is I think a bit different than the contract in Spark Datastream.\r\n\r\nIt says\r\n``` /**\r\n * Returns the initial offset for a streaming query to start reading from. 
Note that the\r\n * streaming data source should not assume that it will start reading from its initial offset:\r\n * if Spark is restarting an existing query, it will restart from the check-pointed offset rather\r\n * than the initial one.\r\n */\r\n ```\r\n \r\n Which I believe means we should never be looking a", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": 153, "type": "inline"}, {"author": "RussellSpitzer", "body": "HDFSMetadataLog.purge()", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": 163, "type": "inline"}, {"author": "RussellSpitzer", "body": "Should add \"This version of Iceberg only supports version CURR_VERSION\"", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/StreamingOffset.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "why call initialOffset here? Doesn't seem to be used and I'm not sure we want to init it here?", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "May be cheaper to read this out of the summary", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Will this be a problem if we call latestOffset multiple times and get back the same snapshot every time?\r\nFor example\r\n\r\nStart ----\r\nLatestOffset().snapshotId() == InitialOffset\r\nnoChanges\r\nLatestOffset().snapshotId() == InitialOffset\r\n?", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I think I noted this below, but I believe this may end up with us doing full scans when we recover from checkpoints since the initialOffset gets reset if the checkpointLog is constructed. 
Not sure though", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Probably want to clarify this, \"Cannot read data from a snapshot which has already been expired, snapshot id\" or something like that", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Not sure if we should break or ignore when we hit something we can't deal with ....", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "A single test is probably not sufficient :) The issue being that the list operation in S3 was eventually consistent (at least in the past) so it wouldn't guarantee that getLatest would return the latest written file. I know they fixed some of those issues but I think they were just around reading your writes, not around the list operation ... maybe someone with more S3 knowledge than me can chime in", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Oh i see in your link they say list is fixed, then we should be fine. 
My concern about using this class is that it's internal to Spark, so we would be subject to breakage if say we build against Spark 3.0.1 and they make a change in Spark 3.0.2", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "But what about\r\n\r\n```\r\n if (offsetLog.isOffsetLogInitialized()) { \r\n initialOffset = offsetLog.getLatest(); \r\n return initialOffset; \r\n }\r\n ```\r\n \r\n Won't this get hit if I\"m recovering from a checkpoint?", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I think I was confused because we are using the metadata log class to store a single file. So what we are really doing here is checking whether this particular instance was created with a checkpoint directory that has an initial offset stored or not. If so I think we should break this logic out into the constructor or a function called by the constructor.", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": 153, "type": "inline"}, {"author": "RussellSpitzer", "body": "I think i understand now, I was confused as to why we had the MetadataLog class but we are only ever storing one value. 
In that case we don't have anything we need to do here", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": 163, "type": "inline"}, {"author": "RussellSpitzer", "body": "I think we should probably initialize InitialOffset here and make it final", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "It would be very helpful to me if these tests extracted some of their common code into helper functions so it was easier to determine what the tests are doing.", "path": "spark3/src/test/java/org/apache/iceberg/spark/source/TestStructuredStreamingRead3.java", "line": 252, "type": "inline"}, {"author": "jackye1995", "body": "Yes S3 is now strongly consistent in listing and read-after-write, so that warning message is out of date.\r\n\r\nFor the point that Russell pointed out, I remember we had a discussion in the last sync meeting about if we would like to have 1 runtime for 1 Spark minor release. I don't recall we arrived at a conclusion on that. If we decide to go with that approach, then this method is fine.\r\n\r\nIf we do not go with that approach, then it is probably gonna be a trouble in the future. Because S3 is not", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why public and not package-private? If we can avoid making this public, then that's a good thing.\r\n\r\nAlso, should we move these inner classes out of the `Scan` now that they are used by both `Batch` and `MicroBatchStream`?", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkBatchScan.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Nit: we normally would not add newlines between each argument like this. 
The creation of the `StreamingOffset` can fit on one line.", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "When starting from an existing snapshot (`initialOffset` is not `START_OFFSET`), the end offset for the initial snapshot is the total number of data files in the table and `scanAllFiles` should be true.\r\n\r\nThe logic here will produce a `latestOffset` that uses the number of files added in that starting snapshot, even though `scanAllFiles` is true. I don't think that's correct.\r\n\r\nFor example, if I have a table with the following snapshot list:\r\nid | added_files | existing files\r\n-- | -- | --\r\n1 ", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This should not be a local variable. Can you either use 0 directly or make this a common constant?", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can this be delegated to `initialOffsetStore`? It seems like something the store could handle if we initialized it with access to the table.", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Nit: use of `get` in a method name is typically superfluous or not descriptive. In this case, is there a better verb for what is happening in this method? 
`get` rarely gives the context about what is happening when reading the calling code.", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: Seems like we init this var but never use?", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkBatchScan.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "We may want to split `SparkBatchScan` that currently implements both `Scan` and `Batch` in the future (not now).", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkBatchScan.java", "line": 113, "type": "inline"}, {"author": "aokolnychyi", "body": "The stats estimation we use in the scan that relies on the latest snapshot should work for batch and micro-batch, right? I think Structured Streaming does not push predicates currently so we will use the snapshot summary if the underlying table is partitioned?", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkBatchScan.java", "line": 114, "type": "inline"}, {"author": "aokolnychyi", "body": "Is this used? I think it would be a good idea to add log messages that will help debugging. ", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: I think the formatting we have in `SparkFilesScan` is slightly easier to follow (splitSize, splitLookback).", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: Do we need `initialOffsetStore` as a var? Seems like we only use it to init `initialOffset`.", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "How will this interface be used? Do we plan to have alternative implementations? 
Is there a particular reason to expose a number of methods instead of just `initialOffset` and hide the rest in the implementation class instead of the `getOrWriteInitialOffset` method.", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Absolutely! We need to check how Spark uses stats for streaming relations. Maybe, it can actually be fine this way as the scan object is reused between batch and micro-batch. ", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkBatchScan.java", "line": 114, "type": "inline"}, {"author": "aokolnychyi", "body": "I think it is reasonable to try splitting in a follow-up PR. We can separate and reuse the `Scan` abstraction and then have independent hierarchies for `Batch` and `MicroBatch`. Not something to worry about now.", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkBatchScan.java", "line": 113, "type": "inline"}, {"author": "aokolnychyi", "body": "We don't clean up anything cause we don't store anything except the initial offset? The rest is managed by Spark?", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": 163, "type": "inline"}, {"author": "aokolnychyi", "body": "It looks like we expose all available data at once. This may be too much to be processed in one micro batch. I know Spark 3.1 added a limit API. Do we plan to leverage it once we switch to 3.1?", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": 106, "type": "inline"}, {"author": "aokolnychyi", "body": "@rdblue @RussellSpitzer, how do you feel about using snapshot summary props here? 
Do we consider them reliable enough?", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Essentially, we need a limit for how much data we should expose in a single micro-batch (think should be size driven).", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": 106, "type": "inline"}, {"author": "aokolnychyi", "body": "What would 0 indicate here?", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java", "line": null, "type": "inline"}], "7880": [{"author": "nastra", "body": "there are a few things in this class that are code duplicates of `BaseMetastoreTableOperations` so we might want to consider refactoring this out and make it re-usable", "path": "core/src/main/java/org/apache/iceberg/view/BaseViewOperations.java", "line": 205, "type": "inline"}, {"author": "jackye1995", "body": "can we just let `BaseMetastoreCatalog` implement `ViewCatalog`? Feels like we are treating `ViewCatalog` more like a `SupportsView` mixin", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreViewCatalog.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "Is there really a case we will not use `catalog()` for the view catalog? I think the implementation always have to implement both, otherwise you cannot do things like check if a view is a table.", "path": "core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "I am starting to wonder, do we still need this `ViewBuilder` for all these parts? We have switched to immutable for everything else, would it be better to also do that for `View`? 
And we just keep the `create()`, `replace()`, `createOrReplace()` to complete the creation", "path": "api/src/main/java/org/apache/iceberg/view/ViewBuilder.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "this assumes we have SQL representation, and we don't really have a path forward to adopt other representations in this ViewBuilder interface. That goes back to the last comment, should we just use the immutable builder to build the information, and only keep the `create()`, `replace()` and `createOrReplace()` in the current ViewBuilder to complete the operation?", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreViewCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "we probably could do that, it just requires a few small tricks to make the generic definition in `ViewCatalogTests<...>` work properly. \r\nThis would mean that the default implementation in `BaseMetastoreCatalog` would have to throw an UOE for all the methods from `ViewCatalog`. Any catalog that wants to support Views, would then only have to implement", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreViewCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I've done that in one of the latest commits", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreViewCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "yes I agree, the current version of `ViewBuilder` is a bit awkward, because you're not able to specify multiple representations, when you create or replace the view. 
I was thinking that we deprecate the representation-specific methods in `ViewBuilder` so that we end up with:\r\n\r\n```\r\npublic interface ViewBuilder {\r\n ViewBuilder withSchema(Schema schema);\r\n\r\n default ViewBuilder withRepresentation(ViewRepresentation representation) {\r\n throw new UnsupportedOperationException(\"Adding a view re", "path": "api/src/main/java/org/apache/iceberg/view/ViewBuilder.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "I think `ViewBuilder` is a class that we added a long long time ago without much concrete knowledge about how we are going to do other things, so I am good with deprecating all those methods. I am thinking if we could simplify it further, can we do something like `BaseViewBuilder extends ImmutableView.Builder`?", "path": "api/src/main/java/org/apache/iceberg/view/ViewBuilder.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "what's the difference of this vs `implements Catalog, ViewCatalog`?", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "Both `Catalog` and `ViewCatalog` define `name()` and `initialize(..)` methods and the main purpose of `CatalogWithViews` is to de-conflict this. 
For `BaseMetastoreCatalog` we can de-conflict this inside `BaseMetastoreCatalog` but there's no way to de-conflict this for `ViewCatalogTests` because we can't define `ViewCatalogTests<C extends Catalog & SupportsNamespaces & ViewCatalog>` without errors", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "We had a short discussion with @danielcweeks / @rdblue / @Fokko last week on the `ViewBuilder` API and I just wanted to give below an overview of what options we have when\r\n1) creating a view with one (or more) view representations\r\n2) adding additional representations, updating an existing representation and deleting representations\r\n\r\nWe currently don't know yet how the final Builder would look like, so the purpose here is to facilitate a discussion and to define the API and its behavior.\r\nIt", "path": "api/src/main/java/org/apache/iceberg/view/ViewBuilder.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "+1 for B, I was asking about extending `ImmutableView.Builder` basically because I think we should directly expose the representation object in the builder, so plan B looks good to me. I hope there could be an easier way to express updating a representation instead of doing a remove and re-add, but it might be just unnecessarily complex, I am good with the current proposed `UpdateViewRepresentation` interface.", "path": "api/src/main/java/org/apache/iceberg/view/ViewBuilder.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "With this semantics, we are basically treating `dialect` as the unique identifier of a representation within a list of representations. Should we make that explicit in the spec?", "path": "api/src/main/java/org/apache/iceberg/view/UpdateViewRepresentation.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "And should we also prevent adding views with duplicated dialect name? 
We today don't check that in https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/view/ViewVersionParser.java#L78", "path": "api/src/main/java/org/apache/iceberg/view/UpdateViewRepresentation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Is this actually necessary? Can't a catalog implement two interfaces that have identical overlap?", "path": "api/src/main/java/org/apache/iceberg/catalog/CatalogWithViews.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I'm not a big fan of making API changes just to facilitate tests. Can we use `instanceof` checks to have multiple ways of interacting with the catalog instead?", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Are we sure that it's a good idea to have `ViewOperations`?", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": 321, "type": "inline"}, {"author": "rdblue", "body": "I don't think that it was a good idea to add the Table version of this. Can we remove it?", "path": "core/src/main/java/org/apache/iceberg/view/HasViewOperations.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "The purpose of `newTableOps` is to be able to define `loadTable` in this class. I think this should either not use `ViewOperations` or should implement `loadView` in a similar way.", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "It's weird to have such a long variable name when it conflicts with the type name. 
Normally we'd use simply `ident` or `identifier` because we prefer shorter names if the extra words are not adding clarity.", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Should `t` be `v`?", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This isn't necessarily true. If the identifier is invalid then this should be thrown. If the view doesn't exist, this should be `View does not exist: %s`.", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "For views, do we want to continue using the String name or should we return an identifier?", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Looks like this isn't thread-safe. Do we note that anywhere in Javadoc for this class? Seems likely that the table methods are also not thread-safe.", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": 332, "type": "inline"}, {"author": "rdblue", "body": "Why \"location\"?", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This is already checked above by `views.containsKey(from)`. I think it would make more sense to use `get` there and do the null check, then this can be skipped. Then the logic here can be:\r\n\r\n```java\r\nviews.put(to, fromView);\r\nviews.remove(from);\r\n```", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why does this need to supply a `FileIO`? Engines should not be reading files from what is in the view, so I don't think it is necessary to expose one. 
It's not in the `View` API.", "path": "core/src/main/java/org/apache/iceberg/view/ViewOperations.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Packages in Iceberg are typically plural, like `actions`, `events`, `exceptions`, `expressions`, `metrics`, `transforms`, or `types`.", "path": "core/src/main/java/org/apache/iceberg/view/BaseView.java", "line": 19, "type": "inline"}, {"author": "rdblue", "body": "This should be package-private since it implements a public interface. No need to expose it. That also means we can drop the `View` prefix since it's in a `views` package.", "path": "core/src/main/java/org/apache/iceberg/view/ViewPropertiesUpdate.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "Yes this is necessary to make the code compile", "path": "api/src/main/java/org/apache/iceberg/catalog/CatalogWithViews.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "@rdblue I'm not sure I understand what you mean by using `instanceof` checks to have multiple ways of interacting with the catalog. Prior to https://github.com/apache/iceberg/pull/7880/commits/fa3ce01275e890fd639b53ee19ab360bf35e4e22 there was a `BaseMetastoreViewCatalog` implementation that had `extends BaseMetastoreCatalog implements ViewCatalog`. In the `ViewCatalogTests` we'd then define a `Catalog catalog()` and a `ViewCatalog viewCatalog()` method to interact with one or the other catalog", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "makes sense, I've updated `loadView()` but omitted the metadata handling that `loadTable()` is doing ", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "yes, good catch. 
updated it", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I've corrected this and moved it to `BaseMetastoreCatalog`", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "yes this isn't thread-safe (same applies for `renameTable()`) and we currently don't have Javadoc or anything that would indicate this. I'll make a separate PR and will follow-up on this", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": 332, "type": "inline"}, {"author": "nastra", "body": "`InMemoryFileIO` uses the `location` as the key when it stores files", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "makes sense, done", "path": "core/src/main/java/org/apache/iceberg/view/ViewPropertiesUpdate.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "we're using `io()` mainly to read+write the view metadata in `BaseViewOperations`. Are you suggesting we should remove `io()` from `ViewOperations` and rather add `protected abstract FileIO io()` to `BaseViewOperations`?", "path": "core/src/main/java/org/apache/iceberg/view/ViewOperations.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "are you suggesting to rename `org.apache.iceberg.view` to `org.apache.iceberg.views`?", "path": "core/src/main/java/org/apache/iceberg/view/BaseView.java", "line": 19, "type": "inline"}, {"author": "nastra", "body": "good point and this relates to https://github.com/apache/iceberg/issues/7419. \r\nShall we maybe pass the catalog name and the identifier to the view? 
Or do you think we won't need the catalog name and we should just keep the identifier and change the `View` API to return `TableIdentifier identifier()` rather than `String name()`?", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Yes. This should not be public.", "path": "core/src/main/java/org/apache/iceberg/view/ViewOperations.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This should not be modified by a property update. Changing properties does not alter the current version of the view. I think the confusion is that there is a table \"version\" that is updated every time we write the metadata file. But that version is not the format version nor the snapshot ID, it's just a counter that we increment to help identify the metadata files.\r\n\r\nThis version is more like the snapshot ID, but it is assigned sequentially. This should be entirely handled by the `ViewOperatio", "path": "core/src/main/java/org/apache/iceberg/view/ViewPropertiesUpdate.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I wasn't entirely sure at the beginning but in hindsight I agree that changing properties of a view shouldn't change the view's version id. 
I've updated the code and the test to reflect this", "path": "core/src/main/java/org/apache/iceberg/view/ViewPropertiesUpdate.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "We did some API exploration with Ryan and I've opened https://github.com/apache/iceberg/pull/7992 to discuss API improvements (to not pollute this PR with more changes)", "path": "api/src/main/java/org/apache/iceberg/view/ViewBuilder.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I've added the catalog name and the identifier to `BaseView` and `BaseView#identifier()` returns the identifier while `BaseView#name()` returns the fully-qualified name that includes the catalog name.\r\n\r\nNot sure if it's worth, but we could introduce a `FullIdentifier` class that would hold the original catalog name and the underlying `TableIdentifier` (which could also be helpful for #7419). With that we could gradually move away from the plain string to `FullIdentifier`.", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I'm currently exploring how the APIs would look without having `ViewOperations`", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": 321, "type": "inline"}, {"author": "rdblue", "body": "If we haven't already released with this name, then yes.", "path": "core/src/main/java/org/apache/iceberg/view/BaseView.java", "line": 19, "type": "inline"}, {"author": "rdblue", "body": "Sorry, I should be more clear. An interface like this is a bit of a red flag because it is just combining two other interfaces. When I asked if this is necessary, I wanted to understand _why_ it is needed. What doesn't compile? 
What alternatives have you tried?", "path": "api/src/main/java/org/apache/iceberg/catalog/CatalogWithViews.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I think it's too late to change package names because we released already with this naming", "path": "core/src/main/java/org/apache/iceberg/view/BaseView.java", "line": 19, "type": "inline"}, {"author": "nastra", "body": "I've tried to describe the reasoning in https://github.com/apache/iceberg/pull/7880#discussion_r1250333139. Basically you'll get a compilation error saying `org.apache.iceberg.BaseMetastoreCatalog inherits unrelated defaults for initialize(String, Map<String, String>) from types org.apache.iceberg.catalog.Catalog and org.apache.iceberg.catalog.ViewCatalog` and `org.apache.iceberg.BaseMetastoreCatalog inherits abstract and default for name() from types org.apache.iceberg.catalog.Catalog and org.a", "path": "api/src/main/java/org/apache/iceberg/catalog/CatalogWithViews.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Minor: the project's style is to use short names instead of names that duplicate the type. The variable or argument name should be simple and is intended to tell you what you're working with, not to do the same thing as the type name.", "path": "core/src/main/java/org/apache/iceberg/view/BaseView.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think it's a bit misleading to have this here.\r\n\r\nTables have a similar version number, but tables also use \"snapshot\" rather than \"version\". Since \"version\" has a specific meaning in view metadata, we shouldn't have a separate version number here.\r\n\r\nThe table equivalent was unused for years, so I don't think this is needed. 
Looks like `InMemoryTableOperations` started using it, but unless I'm missing something it should be using `writeNewMetadataIfRequired` instead.", "path": "core/src/main/java/org/apache/iceberg/view/BaseViewOperations.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Yeah, `InMemoryCatalogTests` pass with this patch:\r\n\r\n```patch\r\ndiff --git a/core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java b/core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java\r\nindex 3956e9192a..5aac3478b7 100644\r\n--- a/core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java\r\n+++ b/core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java\r\n@@ -301,7 +301,7 @@ public class InMemoryCatalog extends BaseMetastoreCatalog implements Supports", "path": "core/src/main/java/org/apache/iceberg/view/BaseViewOperations.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "The table equivalent is only used in children and in tests. I think we can make this `protected`.", "path": "core/src/main/java/org/apache/iceberg/view/BaseViewOperations.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think these `doRefresh` and `doCommit` methods should be abstract. The reason why they have default implementations in `BaseMetastoreTableOperations` is that they were added later when `refresh` and `commit` implementations were introduced.", "path": "core/src/main/java/org/apache/iceberg/view/BaseViewOperations.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this can be private. Only `InMemoryTableOperations` calls it on the table side and that's probably incorrect.", "path": "core/src/main/java/org/apache/iceberg/view/BaseViewOperations.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Where do we ensure that `metadata.location()` does not have a trailing slash? 
I think we need to call `LocationUtil.stripTrailingSlash`.", "path": "core/src/main/java/org/apache/iceberg/view/BaseViewOperations.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Do we allow compressing the view metadata? I think that we should. I'd also rather default to gzip than uncompressed.", "path": "core/src/main/java/org/apache/iceberg/view/BaseViewOperations.java", "line": 151, "type": "inline"}, {"author": "rdblue", "body": "This changes so seldom that I think we're okay for now.", "path": "core/src/main/java/org/apache/iceberg/view/BaseViewOperations.java", "line": 205, "type": "inline"}, {"author": "rdblue", "body": "Minor: prefer `io` to `fileIO`. Similar to my other comment, it isn't necessary to duplicate the information from the type.", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this is unnecessary. It gets `currentVersion()` only to pass it back to a method defined by the parent. Instead, `BaseViewOperations` should handle the version internally with a method like `writeNewMetadataIfNeeded`.", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This is incorrect because there is no guarantee that `base` and `currentMetadataLocation` match. `ViewMetadata` needs to carry the location it was loaded from so it can be used here.", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": 452, "type": "inline"}, {"author": "rdblue", "body": "Minor: looks like the behavior of `InMemoryCatalog` differs from other catalogs here. 
The Hive catalog includes the catalog name in `tableName()`, but both in memory tables and views use just `identifier.toString()` that doesn't include the catalog name.\r\n\r\nIt would be nice to follow up with a fix that creates a string per table/view operations that matches the behavior of other catalogs.", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "There is no equivalent check in `TableOperations` above. I think that's needed.\r\n\r\nAlso, using two maps (`tables` and `views`) introduces a race condition between this check and the `views.compute` below. In order for this class to be thread safe (and I think it needs to be) this should have a `synchronized` block when checking and modifying catalog state.", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I like deconflicting in `BaseMetastoreCatalog`.\r\n\r\nFor the test issue, we can use `instanceof` checks rather than having one variable that has all of the traits. For example:\r\n\r\n```java\r\n public boolean supportsViews() {\r\n return catalog() instanceof ViewCatalog;\r\n }\r\n\r\n public ViewCatalog viewCatalog() {\r\n if (catalog() instanceof ViewCatalog) {\r\n return (ViewCatalog) catalog();\r\n }\r\n }\r\n```", "path": "api/src/main/java/org/apache/iceberg/catalog/CatalogWithViews.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Does this need to be public or could we use `protected` to allow people to extend it?", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This needs to use `setCurrentVersion(viewVersion, schema)` instead. That's how versions should be added through the `ViewMetadata` API. 
The `addVersion` and `setCurrentVersionId` methods are for the parser because they depend on the version ID not being reassigned.", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I thought that the schema was required now? Shouldn't this be an exception?", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "There should be no need to create a `ViewOperations` multiple times, since the identifier doesn't change.", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think this should use `apply` because it is calling `setProperties` and `removeProperties` directly. If you call `setProperties` with `newProperties` then `ViewMetadata` is responsible for merging the properties.\r\n\r\nI'd probably update `apply` to run some `internalApply(ViewMetadata)` that produces `ViewMetadata`. Then `apply` can call that and return `properties` and this can also call it to build metadata for the commit.", "path": "core/src/main/java/org/apache/iceberg/view/PropertiesUpdate.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this is incorrect. This seems to imply that if there are no representations added that the replace is a noop, but that's clearly not the caller's intent. 
I think that this API should fail if there were no representations added.", "path": "core/src/main/java/org/apache/iceberg/view/ViewVersionReplace.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Need to make sure schema is handled correctly here as well.", "path": "core/src/main/java/org/apache/iceberg/view/ViewVersionReplace.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "See the PR I opened for an example of how to remove `CatalogWithViews`.", "path": "core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java", "line": 54, "type": "inline"}, {"author": "rdblue", "body": "This test case is doing too much. This should test just the behavior of the view catalog, not the behavior when a catalog supports both views and tables. Using a separate test case for additional tests when a catalog supports tables is a good idea because it also allows us to turn off the tests when working with only a view catalog.\r\n\r\nIn addition, I think that a result of testing this way (focusing on `viewExists`) results in missing test cases that should be here. 
For example, `tableExists` sh", "path": "core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Is it possible to create an expected version and have just one assertion?", "path": "core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "updated, thanks", "path": "core/src/main/java/org/apache/iceberg/view/BaseViewOperations.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "should be fixed now", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I added `currentMetadataLocation` tracking to `ViewMetadata`, so this should be fixed now", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": 452, "type": "inline"}, {"author": "nastra", "body": "oops, that's indeed an oversight, which I fixed as part of this PR", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "the equivalent code in `BaseMetastoreTableOperations` seemed to be using `LocationUtil.stripTrailingSlash` only for the case when `TableProperties.WRITE_METADATA_LOCATION` was set, which we currently don't have here.\r\nRelated to this, should we maybe have a similar property in `ViewProperties`?", "path": "core/src/main/java/org/apache/iceberg/view/BaseViewOperations.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "yeah I wasn't sure whether we wanted to add that in the first version. Would you like me to add that in this PR or in a follow-up? 
We would probably also add respective properties to `ViewProperties` to control this", "path": "core/src/main/java/org/apache/iceberg/view/BaseViewOperations.java", "line": 151, "type": "inline"}, {"author": "nastra", "body": "`protected` should be ok here", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "makes sense, updated", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "you're right. This is code that was aligned with the old builder. I've updated this to how you suggested it", "path": "core/src/main/java/org/apache/iceberg/view/PropertiesUpdate.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I've updated this so that empty commits are not allowed", "path": "core/src/main/java/org/apache/iceberg/view/ViewVersionReplace.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I've split this out into separate tests and also added a test that verifies you can't create a table if a view with the same name already exists", "path": "core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "good idea, i've done that here and in a few other places", "path": "core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "While doing this refactoring for `renameViewUsingDifferentNamespace()` I've noticed that the `defaultNamespace` of the current view version was still set to the old namespace, which is wrong. \r\nI think we should disallow renaming a view if the namespaces are different of the `from` / `to` identifiers. 
Thoughts on this?", "path": "core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "this is a leftover from how the old builder handled this. I've removed the null check and we're always setting the schema now (if it ends up being null, we'll also propagate this to the caller)", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "> I think we should disallow renaming a view if the namespaces are different of the from / to identifiers. Thoughts on this?\r\n\r\nShould be same as rename table behavior I guess. (Which will allow renaming tables to different namespaces)", "path": "core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "should be fixed now", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "> Should be same as rename table behavior I guess.\r\n\r\nIn that case the rename would be a full view metadata replace, because we need to write a new view version and update the default namespace being used in the view's current version", "path": "core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java", "line": null, "type": "inline"}], "10179": [{"author": "nastra", "body": "please update these to Junit5 tests that use AssertJ assertions. 
You can use `TestRewriteDataFilesAction` as a reference", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/sink/committer/TestFlinkManifest.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "please update this (and other tests in this PR) to JUnit5 + AssertJ, because we're in the process of migrating JUnit4 to JUnit5 and new tests should be written using JUnit5 + AssertJ", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/sink/committer/TestSinkV2Committer.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n@ExtendWith(ParameterizedTestExtension.class)\r\n```", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/sink/committer/TestSinkV2Committer.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n @TempDir private Path temp;\r\n```", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/sink/committer/TestSinkV2Committer.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Is this different in anyway from the current `DeltaManifests`?\r\nIf not, we should reuse the old one, or move it here. Having 2 classes with the same implementation should not happen", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/DeltaManifests.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Could we remove the V2 from the name?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkV2Aggregator.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "I do not get this - we should not read any of the table metadata after the writer has been initialised for now.\r\nLet's talk about this.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/writer/RowDataTaskWriterFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "I would like to avoid this. 
Seems like a bad idea", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/CachingTableSupplier.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "nit: newline", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": 523, "type": "inline"}, {"author": "pvary", "body": "This implementation does not refresh anything... so the class comment doesn't make sense.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/SimpleTableSupplier.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Could this be an inline static class in IcebergSink?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/SimpleTableSupplier.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "How much of these are really needed to be generic util methods?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/IcebergFlinkManifestUtil.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "How much is the difference between this, and the `ManifestOutputFileFactory`?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/IcebergManifestOutputFileFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Do we need a state for this?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkAggregator.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "nit:newline", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkCommitter.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "What's the difference between this, and `org.apache.iceberg.flink.sink.BaseDeltaTaskWriter`?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/writer/BaseDeltaTaskWriter.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Apart from this, and a few other classes, 
I think most other classes could be shared between the two Sinks.\r\nDid I forget something? \ud83d\ude04 ", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/writer/IcebergStreamWriter.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "My mistake", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/CachingTableSupplier.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "My mistake", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/writer/RowDataTaskWriterFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Should we do that in a different PR? Either before, or after this?\r\nOne of the important things with this PR is, that the original behaviour is not changing. It is hard to reason about it, if we change the test files in the same PR.", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/sink/committer/TestSinkV2Committer.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "In all of these cases where we need to expose the class, we might want to add the `@Internal` annotation to highlight that these are internal classes", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/BaseDeltaTaskWriter.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "We do not need this to be public\r\n\r\nPlease go through all of the files to check if we need them to be public", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergFilesCommitter.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "This is not true - we use a different committer.\r\nPlease check the comments", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Please check the comments", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": null,
"type": "inline"}, {"author": "pvary", "body": "Do we handle this correctly?\r\nIs the comment correct?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "nit: newline", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "nit: newline", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": 509, "type": "inline"}, {"author": "pvary", "body": "What is this `_1` in the name?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "What is this `_1` in the name?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Why is this not private?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Why is this not private?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Do we need a class for this?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/IcebergFlinkManifestUtil.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "The `create()` means, that we do not really have the `checkpointId` in the filename anymore.\r\nCould we find a way to get this back?\r\n\r\nAlso do we have the parameters at hand, to use the original `ManifestOutputFileFactory`?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/IcebergManifestOutputFileFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "This could be interesting with unaligned checkpoints", 
"path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkAggregator.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Please read and fix the comments", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkCommitter.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Is this different from the `org.apache.iceberg.flink.sink.IcebergStreamWriterMetrics`?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/writer/IcebergStreamWriterMetrics.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "nit: Please move these to the bottom of the class", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/sink/TestIcebergSink.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Why do we have tests with `@Ignore`?", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/sink/TestIcebergSink.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Why do we have tests with `@Ignore`?", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/sink/TestIcebergSink.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Is this needed?", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/sink/committer/TestIcebergFlinkManifest.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "nit: remove", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/sink/committer/TestIcebergFlinkManifest.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Is this needed?", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/sink/committer/TestSinkCommitter.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Is this used?", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/sink/committer/TestSinkCommitter.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Let me rephrase: It's 
only used in a single place, do we really need a constant for it? \ud83d\ude04 ", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/sink/committer/TestSinkCommitter.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Missed this one", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/sink/committer/TestSinkCommitter.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "But only once", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/sink/committer/TestIcebergFlinkManifest.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "But only once", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/sink/committer/TestIcebergFlinkManifest.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Discussed it with @gyfora, and the solution is correct.\r\nIn the `prepareSnapshotPreBarrier` we should emit the aggregated records and then we do not need to have a state", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkAggregator.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Since FLIP-372, we can separate the input, and the output type.\r\nThe `IcebergSinkWriter` could emit `WriteResult` objects, and we can emit `SinkCommittable` which is basically just manifest file, and a checkpoint id.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkAggregator.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "nit: formatting", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Maybe `<p>`?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "nit: formatting", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": 
null, "type": "inline"}, {"author": "pvary", "body": "Not too elegant - could we find some workaround for setting the committer for tests, or inheriting from IcebergSink for testing, or something else?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "nit: newline", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkAggregator.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "nit: why is the default value? I do not think we need this", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkCommittable.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "I think we can do this in the constructor", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkCommitter.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "How do we handle Metrics?\r\nFLIP-371 was for enabling metrics for the Committers. We need to use it.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkCommitter.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Add a version too, so if we want to move forward from here, then we can keep the state.\r\n```\r\nint version = in.readInt();\r\n```\r\n```\r\nout.writeInt(this.getVersion());\r\n```", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/WriteResultSerializer.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "emits a `WriteResult`. 
Please revisit the comment", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/writer/IcebergSinkWriter.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "nit: newline", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/sink/committer/TestIcebergFlinkManifest.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "nit: newline", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/sink/committer/TestIcebergFlinkManifest.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "nit: comment is not fixed", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "nit: Fix the comment to match the implementation", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkAggregator.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "nit: this is not needed", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkAggregator.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Do we need this `initialized`?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkAggregator.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Can we add the `operatorId` to the method as well?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkCommittable.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Add the version to the serialized byte array too, and read it back", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkCommittableSerializer.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "nit: could we please order the constants based on the visibility", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkCommitter.java", "line": 
null, "type": "inline"}, {"author": "pvary", "body": "Do we need to increment `continuousEmptyCheckpoints` here, and handle `maxContinuousEmptyCommits`?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkCommitter.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Or just remove this, and leave it to the flow to handle this directly?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkCommitter.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "nit: Could we put this comment to the message body, so we don't duplicate the comment?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/writer/IcebergSinkWriter.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "What was the original intention of this test? Why it was V2? Do we still has test for that case?\r\nSame question for `TestFlinkIcebergSinkV2Base`", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/sink/TestFlinkIcebergSinkV2.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Use `RewriteDataFilesAction` in Spark, or Flink could help you with this. Also this PR is part of the ongoing work to provide continuous table maintenance in Flink ", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": 219, "type": "inline"}, {"author": "stevenzwu", "body": "> Commit only on committer 1\r\n\r\nThis has always been a sticking issue to me. It will be confusing for users when looking at the DAG and wondering why other committer subtasks get no traffic or why there are so many committer subtasks.\r\n\r\nI asked the Flink community on this before. There was a suggestion of supporting control of committer parallelism. That will be sufficient for this purpose. But I think it is not formally pursued and completed. 
\r\nhttps://lists.apache.org/thread/82bgvlton9olb591", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": 117, "type": "inline"}, {"author": "stevenzwu", "body": "I think we should pass in `manifest` as `DeltaManifests` (not as `byte[]`). I know current code does byte[]. We can move the serialization and deserialization to the `SinkCommittableSerializer`", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkCommittable.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: `IcebergSink.` not needed. applicable to other places", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: `snapshotSummary` is probably more accurate. properties can cause a little confusion as table properties", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": 131, "type": "inline"}, {"author": "stevenzwu", "body": "is this a valid requirement to have multiple sinks writing to the same Iceberg table in a single Flink job? can't the streams be unionized first before writing to Iceberg?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": 177, "type": "inline"}, {"author": "pvary", "body": "I wouldn't want to wait for another release from Flink for this. The SinkV2 will be functional without this so I don't theink we need to block the whole post commit compaction on this.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": 117, "type": "inline"}, {"author": "pvary", "body": "This restriction would be a regression compared to the old sink implementation. 
I have seen users having multiple sinks to the same table.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": 177, "type": "inline"}, {"author": "stevenzwu", "body": "nit: remove unnecessary whitespace change", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/BaseDeltaTaskWriter.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "why do we copy the code here? \r\n\r\nor do we just want to move the code here? I didn't see the old file removed. so doesn't look like a move to me.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/writer/IcebergStreamWriter.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "why would `sinkCommitter` not be null? is it related to the `setCommitter` for testing purpose? if yes, should we use Mockito to replace committer during unit test?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: `@Internal`", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/writer/IcebergSinkWriter.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "pass the `InitContext` directly to the `IcebergSinkWriter` constructor directly. move the `IcebergStreamWriterMetrics` inside the constructor too. feels a bit cleaner this way", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": 195, "type": "inline"}, {"author": "stevenzwu", "body": "nit: `this.` is not needed", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "oh. I see why you named it `snapshotProperties` earlier. it is to be consistent with the current `FlinkSink`. I will take back my earlier comment. 
I guess it is probably good to keep the name consistent", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: add some comment that this forces commit to happen only on committer subtask 0.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": 250, "type": "inline"}, {"author": "stevenzwu", "body": "nit: maybe `IcebergCommitter` is a more specific/informative name.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkCommitter.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "`IcebergCommittable` is probably more accurate/specific", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkCommittable.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "maybe `IcebergWriteAggregator` or sth else more specific than `Sink`", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkAggregator.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "probably can keep those static util methods inside `FlinkManifestUtil`?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkCommitter.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "probably can keep those static util methods inside `FlinkManifestUtil`?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkCommitter.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "this doesn't need to be copied here. we can probably call the static util method directly `FlinkManifestUtil#writeDataFiles`. 
same thing for other methods copied from `FlinkManifestUtil `", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkAggregator.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "`WriteResult` doesn't implement `toString` method. it is probably not meaningful to log the value. if we want, we can probably log a few summary attributes like\r\n\r\n```\r\nIceberg writer subtask {} attempt {} flushed {} data files and {} delete files \r\n```", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/writer/IcebergSinkWriter.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: why not just `VERSION`? it could mean the current version. or just get rid of it completely based on the way it is used currently. ", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkCommittableSerializer.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "why not return this byte[] directly? why do we need to wrap?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/WriteResultSerializer.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "what's the reason of adding this?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkCommitter.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "this class seems to share a lot of code with current `IcebergFilesCommitter`. Can we make them static util method to share the code? 
I would imagine both old and new sink will be maintained for quite a long time (like 1 year or longer).", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkCommitter.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "this seems not used here", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkCommitter.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "how do we notify commit success/failure to the `CommitRequest`?\r\n\r\n```\r\n\r\n /**\r\n * The commit failed for known reason and should not be retried.\r\n *\r\n * <p>Currently calling this method only logs the error, discards the comittable and\r\n * continues. In the future the behaviour might be configurable.\r\n */\r\n void signalFailedWithKnownReason(Throwable t);\r\n\r\n /**\r\n * The commit failed for unknown reason and should not be retried.\r", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkCommitter.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "this involves a check/read of the table snapshot history for every commit, which add to latency. the behavior/overhead is different. Current code only retrieve the max committed checkpoint id during job initialization phase.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkCommitter.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "```\r\nglobal forces all output records send to subtask 0 of the downstream committer operator. This is to ensure commit only happen in one committer subtask. Once upstream Flink provides the capability of setting committer operator parallelism to 1, this can be removed. 
\r\n\r\nlink to Flink discussion thread.\r\n```", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": 250, "type": "inline"}, {"author": "stevenzwu", "body": "Iceberg style is to only use `this.` when changing the value, like from constructor or setter.\r\n```\r\nthis.varA = newValue\r\n```", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "yeah. `CheckpointCommittableManagerImpl` marks those requests as committed if no retryLater or failure when the commit doesn't throw an exception.\r\n```\r\n @Override\r\n public Collection<CommittableWithLineage<CommT>> commit(\r\n boolean fullyReceived, Committer<CommT> committer)\r\n throws IOException, InterruptedException {\r\n Collection<CommitRequestImpl<CommT>> requests = getPendingRequests(fullyReceived);\r\n requests.forEach(CommitRequestImpl::setSelected);\r", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkCommitter.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I understand this is needed. it is unfortunate that the new sink V2 API enforce unnecessary check/overhead to Iceberg sink. we don't have to resolve it in this PR. but it is sth worth follow up to see if we can avoid it.\r\n\r\nNew sink V2 tried to hide a lot things from connectors, which also result in this side effect that connectors lose control. 
", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/SinkCommitter.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "is `prefix` used here?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/ManifestOutputFileFactory.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "do we need to remember `builder` like here?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "is this style common? builder modify the target object state directly? should those be done in the sink constructor (instead of Builder.build method)? This way, the class members can be final.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "curious why some class members are final and set in the constructor here while others are final and set in the builder?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "it is probably better to keep it next to the `.global()` than method Javadoc", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "this can be problematic. see the `generatePath` method.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/ManifestOutputFileFactory.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: typically use the constant `1` here, as VERSION is just the latest/default version. 
see example from Flink's `FileSourceSplitSerializer`", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/IcebergCommittableSerializer.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: mark as `@Internal`? same for other classes in this sub package?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/IcebergCommitter.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "why is this resolved without response?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/WriteResultSerializer.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "for consistency, `VERSION` marks the current/latest version as returned by `getVersion`. deserialize use constants directly for all supported versions.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/WriteResultSerializer.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "this static util method already exists in `FlinkManifestUtil`", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/IcebergWriteAggregator.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I think this check `element.getValue() instanceof CommittableWithLineage` probably should be moved inside the if section with a `Preconditions` check. otherwise, if upstream changed, we can fail silently here.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/IcebergWriteAggregator.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "this.prefix doesn't seem to be used later on. it seems `prefix` is only used as worker pool size. technically, it is not used as `prefix`.\r\n\r\nold code use operator ID in the worker pool name. 
I guess it is not obtainable in the v2 sink?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/IcebergCommitter.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "why do we add an empty commit request in this case? can the method simply return here?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/IcebergCommitter.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I am wondering if it is correct to use the `jobId` and `operatorId` from the committable or it should be the current running job? Imagine the scenario where the commitable was restored from a checkpoint/savepoint from a previous job.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/IcebergCommitter.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "this was a static method from the old code. can we reuse it here?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/IcebergCommitter.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: this can be made a static util method for reuse if table io is passed in", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/IcebergCommitter.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "let's also consider another scenario where the commit requests may accumulated over multiple jobs (after checkpoint and restart). what should be the expected behavior?\r\n\r\nalso noticed that in v1 case multiple commitable can be collapsed to a single commit to Iceberg, while in v2 table they are committed separately and sequentially. 
", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/IcebergCommitter.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "thinking about it again, current behavior seems correct with the way `getMaxCommittedCheckpointId` is implemented", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/IcebergCommitter.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "should we use Flink's `CheckpointStoreUtil#INVALID_CHECKPOINT_ID` instead?\r\n\r\n`INITIAL` is also not accurate here. technically, initial checkpoint id is `1` as defined in Flink code\r\n<img width=\"476\" alt=\"image\" src=\"https://github.com/apache/iceberg/assets/1545663/bd5037ec-1138-415f-bd65-a4e4be0159d0\">\r\n", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/committer/IcebergCommitter.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "let's mark it as `@Experimental` to start with", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java", "line": 123, "type": "inline"}], "4071": [{"author": "jackye1995", "body": "I think last time we ended our discussion here. For branch, we can either create and fail when exist, or do upsert. It seems like there are valid use cases for both, so what about `branch(String name, long snapshotId, boolean replace)`? The same `replace` flag can also be potentially added to `tag` if fit. 
@rdblue ", "path": "api/src/main/java/org/apache/iceberg/UpdateSnapshotRefs.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this can be more direct: \"Create a new branch pointing to the given snapshot id\"", "path": "api/src/main/java/org/apache/iceberg/ManageSnapshots.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "\"Remove a branch by name\"?", "path": "api/src/main/java/org/apache/iceberg/ManageSnapshots.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this should state the behavior of branch properties, like snapshots to keep.\r\n\r\nAlso, are you sure that this should create or update the branch? I think that makes sense for writing to a branch (since there is a job that has done work) but I'm not sure about doing it for metadata-only operations.", "path": "api/src/main/java/org/apache/iceberg/ManageSnapshots.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "\"Rename a branch\"?", "path": "api/src/main/java/org/apache/iceberg/ManageSnapshots.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "You can remove \"snapshot reference\". It reads fine as \"Create a new tag referring to the given snapshot id\". You can also use that pattern for branch javadoc above.", "path": "api/src/main/java/org/apache/iceberg/ManageSnapshots.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "`manageSnapshots` isn't implemented in transaction tables. I think you need to add package-private operations like `CherrypickOperation` from the last PR.", "path": "core/src/main/java/org/apache/iceberg/SnapshotManager.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Oh yes definitely. I was mostly going for the API definition here and filling in a dummy implementation. 
I will add these operations in this PR as well", "path": "core/src/main/java/org/apache/iceberg/SnapshotManager.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Yeah, strangely enough I cut out \"snapshot ref\" for the branch javadoc, but still had it for tag.", "path": "api/src/main/java/org/apache/iceberg/ManageSnapshots.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Yeah (so sorry for the late reply on this), I also missed out on defining APIs for updating reference retention properties. I definitely agree with creating/updating the branch after writing. If someone wants to do an operation to reset an existing branch to a new snapshot they would have to do a drop + create, which is slightly more cumbersome. Perhaps this is not really a meaningful metadata operation and also the explicit drop + create is better for safety (a user who wants to do this know e
What would createBranch do differently than createTag beyond some validation that the ref is a branch or a tag? ", "path": "core/src/main/java/org/apache/iceberg/TableMetadata.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Iceberg doesn't use one line per argument. Can you reformat these cases?", "path": "core/src/main/java/org/apache/iceberg/ManageSnapshotReferenceOperation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I see the rationale for this, but I don't think there is a reason to have separate `CreateSnapshotRef` and `SetSnapshotRef` classes. I think we should just add the ref metadata to the `SetSnapshotRef` class. `SetSnapshotRef` was intended to be an idempotent update that creates or updates a ref.", "path": "core/src/main/java/org/apache/iceberg/MetadataUpdate.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think APIs that force you to pass in `null` are generally hard to use and force you into code that is difficult to read. For example:\r\n\r\n```java\r\ntable.manageSnapshots()\r\n .createBranch(\"test\", snapshotId, null, 100, null)\r\n .commit();\r\n```\r\n\r\nI think it is better to use separate methods for each ref config setting:\r\n\r\n```java\r\ntable.manageSnapshots()\r\n .createBranch(\"test\", snapshotId)\r\n .setMinSnapshotsToKeep(\"test\")\r\n .commit();\r\n```\r\n\r\nI'd replace all of the variations of cr", "path": "api/src/main/java/org/apache/iceberg/ManageSnapshots.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I would use specific methods for minSnapshopsToKeep, maxAge, and maxSnapshotAge", "path": "api/src/main/java/org/apache/iceberg/ManageSnapshots.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What about `replaceTag(name, snapshotId)` to update a tag to a new snapshot? We should probably also have replace for branches.\r\n\r\nShould we also have methods that reference other branches or tags? 
For example, `createTag(\"test-2022-03-11\", \"test\")` where \"test\" is a branch name?", "path": "api/src/main/java/org/apache/iceberg/ManageSnapshots.java", "line": 130, "type": "inline"}, {"author": "rdblue", "body": "I'd probably use `Map<String, SnapshotRef>` and return the resulting ref map rather than building an object to hold the changes. That way, the result of `apply` isn't an intermediate value, but the actual ref map that will be produced.", "path": "core/src/main/java/org/apache/iceberg/ManageSnapshotReferenceOperation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Is there a better name for this class? Maybe `UpdateReferencesOperation`?", "path": "core/src/main/java/org/apache/iceberg/ManageSnapshotReferenceOperation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can you update this so that the last `manageSnapshotReferenceOperation` is reused? That way we don't create a ton of unnecessary operations in the transaction.", "path": "core/src/main/java/org/apache/iceberg/SnapshotManager.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Rebase? 
I think these are already in master.", "path": "core/src/main/java/org/apache/iceberg/TableMetadata.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Is this needed?", "path": "core/src/main/java/org/apache/iceberg/TableMetadata.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think in the builder, we want `removeTag`/`removeBranch` that produce `RemoveSnapshotRef` and `setTag`/`setBranch` methods that produce `SetSnapshotRef`.", "path": "core/src/main/java/org/apache/iceberg/TableMetadata.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "It would be fine to have this as an internal method, since it looks like this is used internally.", "path": "core/src/main/java/org/apache/iceberg/TableMetadata.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Ah I missed that we already had a SetSnapshotRef. Considering that it's idempotent, then yes, we should just add the ref information to it.", "path": "core/src/main/java/org/apache/iceberg/MetadataUpdate.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Makes sense, will update.", "path": "core/src/main/java/org/apache/iceberg/SnapshotManager.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "If I am understanding properly, this effectively follows a similar pattern to what we have in UpdateProperties? 
In that case, that makes sense to me.", "path": "core/src/main/java/org/apache/iceberg/ManageSnapshotReferenceOperation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Yeah, exactly.", "path": "core/src/main/java/org/apache/iceberg/ManageSnapshotReferenceOperation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Should be \"UpdateSnapshotReferencesOperation\".", "path": "core/src/main/java/org/apache/iceberg/BaseTransaction.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Else should be on the same line as the end of the if block: `} else {`", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Please add an empty line between control flow blocks and the following statement.", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This should check that no ref was returned by `put`. Otherwise this just overwrites an existing branch.", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Same here. This should check that there is no existing branch.", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this needs to remove the original branch, or else it is a copy.", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why is age `Long` instead of `long`?", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Will this throw an exception if `ref` is null?\r\n\r\nEven if it does, I think checking here is a better idea. 
You can give a better error message than \"Invalid ref\" because the problem was that the caller is trying to update a reference that doesn't exist.", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Does this throw a reasonable error message when the ref is a tag?", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": 160, "type": "inline"}, {"author": "rdblue", "body": "Should there also be a `setMaxSnapshotAgeMs` as well?", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": 155, "type": "inline"}, {"author": "rdblue", "body": "This is good, but it also needs to commit this if there are `cherrypick` or `rollback` calls. Table modifications need to be kept in order, but we want to make as many ref changes as we can simultaneously.", "path": "core/src/main/java/org/apache/iceberg/SnapshotManager.java", "line": 134, "type": "inline"}, {"author": "rdblue", "body": "I think these methods don't quite get the behavior correct. In the table update APIs, changes should be applied in sequence so that each one updates the table as though it were part of its own commit. For example, in the `UpdateSchema` API, two calls to `moveFirst` will move one column, then the second. The two operations don't conflict.\r\n\r\nHere, calling `createBranch` followed by `setMinSnapshotsToKeep` on the same branch will result in an error because this adds the new ref to `refsToUpdate` a", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": 54, "type": "inline"}, {"author": "rdblue", "body": "I don't think this operation should retry, like SchemaUpdate. Updating a schema applies changes to the current schema, just like this applies changes to the current set of refs. 
In order to retry, this would need to keep track of the changes and re-apply them each time it attempts to commit. Since this isn't tracking changes (and would be harder to do that way) I suggest just not retrying.", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "@rdblue I was still working on changes and iteratively pushing to my branch (sorry for the confusion), I think this was reviewed when it wasn't yet complete on my side. Let me know what you think on the latest!\r\n\r\nFor supporting transaction.updateSnapshotReferencesOperation().createBranch(...).setMinSnapshotsToKeep() to avoid the error, in my latest change when we look for refs in setMinSnapshotsToKeep we always search for the \"current\" ref. Where current is either the ref in the refsToUpdate ", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": 54, "type": "inline"}, {"author": "rdblue", "body": "Okay, as long as this achieves the expected behavior it should be okay. It does seem like copying the refs into a new map and only working with that map is going to be the most straightforward approach though. Falling back means you have to remember to do it.", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": 54, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "On line 197 we check if it is null. We fail with \"No such ref with name <name>\". Let me know if you meant something else! 
", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "It should not be Long, just a remnant from my previous code that I forgot to update.", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I believe I added that, line 182? ", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": 155, "type": "inline"}, {"author": "rdblue", "body": "Yeah, this is fine now. I don't think you mean to be validating that the ref is a branch though.", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Good point. If we do transaction.updateSnapshotReferenceOperations().createBranch(\"branch1\", 1).createBranch(\"branch1\", 2) this should fail, because independent commits would fail due to the ref already existing. To have equivalence with the chaining, it should also fail. Will update this.", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Yeah, I think when I enumerate the different cases explicitly in tests we will get better confidence on if this is the right way. I will also look into just using a copy of the refs. ", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": 54, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Ah yes, that should be removed. 
There is another place where I accidentally copied that validation check, but it should not be there", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "@rdblue I was iterating some more on this PR today, thank you so much for taking a quick look! I think some of your comments were on my initial changes, which I resolved in later iterations. Let me know your thoughts on the latest version. @jackye1995 as well.\r\n\r\nI think from this point, I will be adding tests to validate the expected behavior which can surface any other bugs/cases in the implementation which I haven't caught yet. ", "path": null, "line": null, "type": "review_body"}, {"author": "amogh-jahagirdar", "body": "This isn't right, we should be producing metadata updates when we clean up the dangling references. I will update to use removeBranch/removeTag depending on the ref type.", "path": "core/src/main/java/org/apache/iceberg/TableMetadata.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Definitely, we're idempotent for creates/updates on the name of the reference. For replace, we can just do an update on reference structure. For a proper rename though, we need to remove the branch with the old name. ", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "https://github.com/amogh-jahagirdar/iceberg/blob/master/api/src/main/java/org/apache/iceberg/SnapshotRef.java#L156\r\n\r\nIt throws \"Tags do not support setting minSnapshotsToKeep\" which I think is reasonable to surface ", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": 160, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Yup, updated! 
I was thinking if there's a good abstraction here, but I think for now it's simpler just to keep track if there's an ongoing ref operation. In the cherrypick/rollback operations, prior to actually performing those operations we would check if there's an ongoing ref operation and commit it.\r\n\r\nFor ref metadata chains, a caller would call the commit on the snapshot manager, which would then accomplish our goal for wrapping as many ref changes as we can in a single commit. Let me know", "path": "core/src/main/java/org/apache/iceberg/SnapshotManager.java", "line": 134, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "When writing tests and debugging, I realized that I need to emit a MetadataUpdate here. Will do that on the next revision", "path": "core/src/main/java/org/apache/iceberg/TableMetadata.java", "line": 1048, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Should be updated to call setBranch depending on if it's a branch or setTag if it's a tag", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I probably can break this up to test branch/tag creation separately. Same applies with removal. ", "path": "core/src/test/java/org/apache/iceberg/TestSnapshotManager.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I have updated the PR. I have implemented the snapshot ref operation API to just operate on a shallow copy of the refs, and just do the diffing at apply time, rather than keeping structures which keep track of the diffs. I have also added test cases. 
Let me know what you think @jackye1995 @rdblue !", "path": null, "line": null, "type": "review_body"}, {"author": "amogh-jahagirdar", "body": "Just diff should be sufficient", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Fix indent", "path": "core/src/test/java/org/apache/iceberg/TestSnapshotManager.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Updated to just use a copy of the refs ", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": 54, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Yes, makes sense. Updated to remove retrying", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "The [`SetSnapshotRef` update](https://github.com/apache/iceberg/blob/master/open-api/rest-catalog-open-api.yaml#L1233-L1242) in the REST spec has all of the fields here, rather than holding a `SnapshotRef` object.\r\n\r\nTo make this serialize like the spec, this should have `type`, `maxRefAgeMs`, `maxSnapshotAgeMs`, and `minSnapshotsToKeep` fields.", "path": "core/src/main/java/org/apache/iceberg/MetadataUpdate.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This isn't correct. This needs to create a tag or branch based on the existing one in `metadataBuilder` and set the fields that are non-null in this object.", "path": "core/src/main/java/org/apache/iceberg/MetadataUpdate.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Style: always add a newline between control flow blocks and the following statement.", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this should return `updatedRefs` instead of the changed refs. 
I think there is more value in returning to the caller the final state, rather than changes.\r\n\r\nThat way you also don't need to explain in detail what the map contains (e.g., null values for removals).", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Since I think `apply` should return the full updated map of references, I think this could be merged into `internalApply`. That would probably be more straightforward than returning a map and then interpreting it.", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This check duplicates the validation above that checks `updatedRefs.get(name) == null`. It's okay to ignore the return value of `put` or to remove the `get` check in favor of this one.\r\n\r\nI would remove the other check and just use this one.", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Same here. I'd remove this in favor of checking the return value from `put`.", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This will also return the ref that was removed. 
It would be better to call `Snapshot ref = updatedRefs.remove(name)` and then run the validations on the returned ref.\r\n\r\nIf validation fails, the operation is no longer valid, so don't worry about changing state and then failing.", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Same as above.", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Looks like this needs to validate the return value from `put` or else this will replace branches.", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Do you think we should also have a `replaceBranch(String target, String source)` that is like `rename`, but will replace the target branch?", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": 100, "type": "inline"}, {"author": "rdblue", "body": "Should this be `updateTag`? Or is it really replacing?", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": 145, "type": "inline"}, {"author": "rdblue", "body": "I think this should test whether the metadata has changed:\r\n\r\n```java\r\n if (updated.changes().isEmpty()) {\r\n return;\r\n }\r\n```", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": 50, "type": "inline"}, {"author": "rdblue", "body": "Can you add a `TODO` comment to move the operations in `SetSnapshotOperation` to this class? I don't think that needs to be done as part of this commit, but we should do it eventually.\r\n\r\nThat operation handles `setCurrentSnapshot`, `rollbackToTime`, and `rollbackTo`, which are all operations that this class could handle by updating the set of refs. 
Also, `rollbackTo` could have a branch version, `rollback(String name, long snapshotId)` that would validate the ref is a branch and validate that t", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": 31, "type": "inline"}, {"author": "rdblue", "body": "Indentation in this file is off. Can you please fix it?", "path": "core/src/test/java/org/apache/iceberg/TestSnapshotManager.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Normally, I think this sort of problem would be `IllegalArgumentException`. That's what we typically use for arguments to a table change that are incompatible, not `ValidationException`. ", "path": "core/src/test/java/org/apache/iceberg/TestSnapshotManager.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Ah oops, yeah we just need 1. I'll just do the put and validate. ", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Good point, this will simplify it ", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I think updateTag is better. I was going off our previous threads where discussed replaceTag, but really what this is doing is just taking an existing tag and updating the snapshot Id it's pointing at.\r\n\r\nWhen I think about it some more, assignTag makes even more sense to me. Let me know your thoughts @jackye1995 @rdblue ", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": 145, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Thanks for pointing to the spec, will keep that in mind and update.", "path": "core/src/main/java/org/apache/iceberg/MetadataUpdate.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Yup! 
I'm in agreement there, all the snapshot operations can be applied within the context of refs as well. I'll add a class level comment ", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": 31, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Currently since UpdateSnapshotReferenceOperation is contained within a transaction, we need to commit the transaction table operations even if there are no changes (to advance the hasLastOpCommited flag). This is because we have started a transaction and the table operations last operation has not committed; when the overall transaction goes to commit, it validates if the last operation has committed and fails if it's not the case. Since this is transaction table operations and we're not updati", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": 50, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Actually, take back what I said. assign does not really imply the original tag reference changed in terms of where it's pointing. People may interpret it as a create. So I will use updateTag, and replaceBranch", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": 145, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Thanks, missed that when updating the branch we weren't updating any of the properties. After going through https://docs.google.com/document/d/1D0R3G0slssEhggH5XnIzMwsUIP-c385Qp2sjv5E7e6E/edit#heading=h.eo4x0coo8esy and discussing with @rdblue got some more clarity on what the goal of applyTo is. \r\n\r\nI have now made the SetSnapshotRef update encapsulate only the changes. Ref name, snapshot id and type will always be maintained on the update, but retention properties can be null. 
At the time of a", "path": "core/src/main/java/org/apache/iceberg/MetadataUpdate.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I'm hesitant on this since it feels a little overengineered. But having a builder for the Metadata update seemed like the best way to iteratively update the produced change in the Builder.SetBranch itself as we determine what's different between branches, and preserve immutability. We could also just add setters on the update but that seemed to go against the whole principle? @rdblue thoughts ", "path": "core/src/main/java/org/apache/iceberg/MetadataUpdate.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Need to add a test for rename", "path": "core/src/test/java/org/apache/iceberg/TestSnapshotManager.java", "line": 667, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Sorry, want to verify I am getting the semantics of this. Do you mean that target will be updated to point to the head of source? \r\n\r\nOverall, I feel in the API we currently have good coverage on the metadata operations users would want to do with branching/tagging and we can add new APIs in subsequent changes (I'm more hesitant to commit to defining more APIs to avoid any confusion in semantics). But I'm not super opinionated on this! \r\n\r\nThoughts? @rdblue ", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": 100, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Ah oops. I need to actually build the ref and set it.", "path": "core/src/main/java/org/apache/iceberg/MetadataUpdate.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "There are still some more tests I should add (for example failure cases like doing branch operations on tags and vice versa, validating the null checks etc). 
Don't want to block on review, if there's anything fundamental ", "path": null, "line": null, "type": "review_body"}, {"author": "amogh-jahagirdar", "body": "Doing a second pass, I think having setters is fine. The MetadataUpdate does not really need to be immutable.", "path": "core/src/main/java/org/apache/iceberg/MetadataUpdate.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I need to add checks to prevent removing/renaming the main branch.", "path": "api/src/main/java/org/apache/iceberg/ManageSnapshots.java", "line": 111, "type": "inline"}, {"author": "rdblue", "body": "Ah, I see. It makes sense that this needs to ensure the commit happens because of that check. Let's leave it as it is then.", "path": "core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java", "line": 50, "type": "inline"}, {"author": "rdblue", "body": "I don't think we need \"NumberOf\"; isn't that omitted elsewhere?", "path": "api/src/main/java/org/apache/iceberg/ManageSnapshots.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this should be idempotent. If there is no ref, then that's okay.", "path": "core/src/main/java/org/apache/iceberg/MetadataUpdate.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I'd prefer just checking `currentRef != null` here, rather than assigning a boolean from a comparison.", "path": "core/src/main/java/org/apache/iceberg/MetadataUpdate.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Nit: missing space between `if` statements.", "path": "core/src/main/java/org/apache/iceberg/MetadataUpdate.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think these should be set if `minSnapshotsToKeep` is set. If the ref is a branch, then the builder should reject this. 
And if this is a new ref, then it should only be set if it is explicit (non-null).", "path": "core/src/main/java/org/apache/iceberg/MetadataUpdate.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "These objects are immutable, so there should be no setters.", "path": "core/src/main/java/org/apache/iceberg/MetadataUpdate.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think this is going to serialize properly because `SnapshotRefType.toString` will be upper case. Can you convert to a String for storing in this class?", "path": "core/src/main/java/org/apache/iceberg/MetadataUpdate.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Please remember to format code! There are several places that are missing require whitespace.", "path": "core/src/main/java/org/apache/iceberg/TableMetadata.java", "line": 1028, "type": "inline"}, {"author": "rdblue", "body": "I don't think this needs 2 constructors.", "path": "core/src/main/java/org/apache/iceberg/MetadataUpdate.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What about `setRef` instead of checking that `setTag` was called with a tag? Is there a significant difference between how they are handled?", "path": "core/src/main/java/org/apache/iceberg/TableMetadata.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this logic is incorrect. These properties should be set if they were explicitly set for the ref. This expects the ref to already have the required changes, so it is sufficient to test whether the incoming `ref` has a non-null `minSnapshotsToKeep`. And since the action for a non-null value is to set the value, you can just pass all of these through.", "path": "core/src/main/java/org/apache/iceberg/TableMetadata.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think we need this check. 
Changes that come through should be idempotent, or should use requirements to validate that they can be applied. If there's a need for a requirement we can look into adding one. But I think it should be fine if this just goes ahead with the update. We already check in requirements that the referenced snapshot has not changed.", "path": "core/src/main/java/org/apache/iceberg/MetadataUpdate.java", "line": null, "type": "inline"}], "15006": [{"author": "amogh-jahagirdar", "body": "cleaning this up, since it's private, I'm pretty sure I'd be able to just track table operations here rather than the table.....", "path": "core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Probably keep a boolean in case we detect a duplicate. That way we don't have to pay the price of grouping by referenced file every time to detect possible duplicates; only if we detect it at the time of adding it, we can do the dedupe/merge", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "looks like some unit tests mock table and don't implement `HasTableOperations`? ", "path": "core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "We also could just keep a mapping specific for duplicates. That shrinks down how much work we need to do because instead of trying to group by every referenced data file in case of duplicates, we just go through the duplicates set. 
It's maybe a little more memory but if we consider that we expect duplicates to generally be rare it feels like a better solution", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "```suggestion\n List<DeleteFile> newDVs = Collections.synchronizedList(Lists.newArrayList());\n```", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "Is this during a single write? Are we worried about compaction cases, or is it just a safety net?", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "I wonder, let's say DV1 and DV2 both set positions in the vector, is this going to be an illegal state because merging will simply union them? ", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "I believe if something happens on the delete we would still want to call `dvFileWriter.close()`? But since we need to call close before we access the result, I guess the best we can do is to double check whether we were able to close or not and then try closing in finally?", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Yeah I was changing the algorithm to be more like what I mentioned above and realized that it would throw a `ConcurrentModificationException`; since we're now parallelizing across impacted referenced files, it's now a concurrent map. Let me know what you think. 
", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Primarily a safety net, because this check is only for the case where duplicate DVs are added in a single commit. \r\n\r\nThe way I think about it is if someone is adding DVs in a single commit that are duplicate AND span multiple sequence numbers, then something is probably wrong (it already is just from the duplicate perspective, it's doubly wrong if the seq numbers are different for the duplicates) and from a merging perspective we'll really have no basis for inferring what the right sequence num", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "make this more readable ", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Hm, I think in the context of merging within the same commit, that it's not illegal.\r\nIf DV1 has positions 1, 2, and 3 set\r\nAnd DV2 for some reason has positions 3, 4 set\r\n\r\nIt may indicate that a writer in a single commit is doing duplicate work, but logically from within a commit perspective doesn't seem wrong to merge.\r\n\r\nSince this is all within the same commit, it feels reasonable to me to just produce 1, 2, 3, and 4. \r\n\r\nI think where this would make a difference is if we wanted to do merg", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Good catch, will fix this up. ", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Yup we should! I hit this when I added an eq. 
delete test, should be fixed now", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I would group all the merged DVs and then write out the puffin, right now in case of duplicates we'd be eagerly producing a tiny puffin per data file that originally had duplicates", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I take back what I said, I'll leave this as is now unless there's some strong opinions here, it's a classic tradeoff between parallelism and file sizes.\r\n\r\nIf we work from the assumption that there won't be too many data files with duplicate DVs, then there won't be too many small puffins created. From a metadata perspective it doesn't really matter because it's the same number of entries in delete manifests after merging, but this is mainly just from a storage I/O cost perspective. But again, ", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "The worst case from a storage I/O perspective is where there's an unclustered delete and many data files which are split, and DVs are produced for some random set of positions in every data file. But in that case there's almost certainly going to be a distributed engine at play, and in that case, I think we could do an optimization in the spark integration before handing off DVs to the commit path (basically a distributed fixup would be required in that case). ", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Sure, good idea, I agree there should be some telemetry on when this happens. I'll add it above though, after the writing actually occurs. 
", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": 1109, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "because we may be merging duplicates, we don't update the summary for delete files until after we dedupe and are just about to write the new manifests", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": 272, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "This approach keeps track of duplicate DVs as they are added for commit. The logic was already tracking newDVRefs below so as soon as a delete file is added which has a referenced file already in newDVRef, this map can be populated.\r\n\r\nAnother option is to just track a boolean if there are _any_ duplicates, then at the end we could loop over all the DVs and group by referenced file to figure out the duplicates. This option doesn't seem great when we expect the average case for a data file to no", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "This fix surfaced an issue in some of the TestPositionDeletesTable tests where we were setting the wrong data file for a delete file; we'd just add a DV for the same data file, and then it'd get merged with the new logic, and break some of the later assertions.", "path": "spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/source/TestPositionDeletesTable.java", "line": 458, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "This test had to be fixed after the recent changes because the file paths for data file B and B2 were set to the same before, so the DVs for both referenced the same file (but that probably wasn't the intention of these tests) so it was a duplicate. After this change we'd merge the DVs in the commit, and then it'd actually get treated as a dangling delete and fail some of the assertions.\r\n\r\nSince these tests are just testing the eq. 
delete case we could just simplify it by removing the usage of ", "path": "spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/actions/TestRemoveDanglingDeleteAction.java", "line": 373, "type": "inline"}, {"author": "RussellSpitzer", "body": "Are we saving that much by keeping this in an optionally populated map? I feel like we could just have \"newDVRefs\" be Map<String DataFileName DeleteVector>\r\n\r\nThat would increase our memory usage by the name of every Data File Name, but I feel like that can't be that much since we are already storing all the Datafile objects completely ... Just feel like things are easier that way...", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Little weird that we are now writing delete vectors on the driver?", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Changed because we'll now dedupe these?", "path": "spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestPositionDeletesTable.java", "line": 412, "type": "inline"}, {"author": "RussellSpitzer", "body": "This is just for freeing up references?", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Writes new merged deletes for all datafiles which have duplicate DVs and returns them?\r\n\r\nJust a little worried we have a function here which does IO but it isn't quite clear in the method name", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I think the Spec and tuple checks are also very cautious :) Theoretically I imagine this only happens if I have duplicate file paths in the table, not sure how else this could happen", "path": 
"core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I would throw in a test here for >2 delete files, just because I believe the code actually already covers this", "path": "core/src/test/java/org/apache/iceberg/TestRowDelta.java", "line": 1893, "type": "inline"}, {"author": "RussellSpitzer", "body": "Probably doesn't matter but in the real world these could also potentially have existing overlapping deletes\r\nIe \r\nTask 1 has existing DV and merges a few new Deletes\r\nTask 2 has existing DV and merges a few new deletes\r\n\r\nI think the logic is fine though", "path": "core/src/test/java/org/apache/iceberg/TestRowDelta.java", "line": 1903, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "You're right, I'll make this clearer that it's actually doing an I/O.", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Yeah, and also I think the original intent of the test was to have this be a delete file for dataFileB to begin with", "path": "spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestPositionDeletesTable.java", "line": 412, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I believe we need to clear this state in case of retries", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "It certainly is simpler that way but I think it's more than just the data file name because if we have newDVRefs be a map tracking all DVs for a data file then we'd need to track a DeleteFileSet per data file (which we expect to have 1 in most cases), and we also already track the deletes just as part of regular tracking. So then the average case memory for delete files goes from O(delete files) to O(2 * delete files). 
Unless we further split the delete file tracking? i.e. separate the tracking ", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "These tests are failing with the fix to core, right? I think we generally try to update one Spark version per commit unless there are failures that would leave main in a broken state.", "path": "spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestPositionDeletesTable.java", "line": 412, "type": "inline"}, {"author": "RussellSpitzer", "body": "Why would it be 2 * Deletefiles? We would just use the value set as the list of all delete files?", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I guess my thought here is, if we are going to spend the time to clean it up we may as well just do String, DeleteFiles map. If we were just going to throw an error I would keep newDVRefs as a set.", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why change the logic here? It seems like all this is actually doing is changing the `newDVRefs.add` call to `addDV` and removing the summary update.", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Yeah I just meant that it'd be 2 * delete files if we had both deleteFilesBySpec and dvRefs as a map. We could split the tracking so that deleteFilesBySpec only tracks eq. deletes and v2 position deletes, and dvRefs is only V3 DVs for a given data file. Then it's just O(deletes), but then separating the tracking also means some changes in other places. 
\r\n\r\n>I guess my thought here, is if we are ", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I meant we would change \r\n\r\n private final Set<String> newDVRefs = Sets.newHashSet(); \r\n to \r\n private final Map<String, DeleteFiles> \r\n \r\n It shouldn't duplicate the objects if we add to both of these simultaneously, so just 2 times the references, one by spec and one by targetFileName\r\n \r\n I think I am convinced we should fix this rather than throw an error", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Yeah that's right, it's failing with the changes to core in this PR because the test was setting up duplicates (unintentionally as far as I can tell) and now we'd be merging them and breaking some of the assertions in the test.", "path": "spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestPositionDeletesTable.java", "line": 412, "type": "inline"}, {"author": "rdblue", "body": "I agree with Russell here. I think it would be better to keep track of the DVs in a different way.\r\n\r\nNow that there may be duplicate DVs that get rewritten, we can no longer trust `newDeleteFilesBySpec` because it could contain DVs that will be replaced with compacted ones. I think what I would do is a bit simpler: I would keep a multimap of `DeleteFile` by referenced data file (location). Then in `newDeleteFilesAsManifests`, you would loop through each referenced data file and process the `Del", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This should support reading any positional DeleteFile and produce a new DV right? 
Not just a DV?", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Looks like this produces a different Puffin file for each compacted DV, but I think it would be better to produce a single Puffin file with all of the DVs. What was the intent behind creating a file per DV?", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Yeah see my comment here https://github.com/apache/iceberg/pull/15006#discussion_r2680201703\r\n\r\nI was trying to avoid further complication of trying to coalesce all the DVs since they may have been produced with different specs, and the writer is bound to a spec so after merging, we have to do another grouping before writing. \r\n\r\nIt's definitely doable though, I just wasn't sure it was worth the complication since we expect it to be rare.\r\n\r\n", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Fair points, I can just change newDVRefs to a multimap. Are we sure we would want to just leave deleteFilesBySpec as is? I guess it's not much of a memory concern since as @RussellSpitzer said it's just references but are we worried about the state of that map having potential duplicates?", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "From a performance standpoint though, I don't think there's too much to be concerned about doing another grouping since typically there'd only be a single spec or a handful of specs. It was mainly just avoiding more implementation complexity. 
Though this is another pass over all the DVs when doing that grouping.", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "The table APIs currently prevent adding position deletes to a v3 table, so we'd never exercise it but I think it's good to have this logic also include position deletes as a better safeguard in case of inadvertent changes or newer APis etc.", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Yeah, at the time it felt more readable for me, but with the new set of changes I went back to something closer to what we had before to minimize the changeset here", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "cc @rdblue @RussellSpitzer made this a multimap. I originally wanted this to be a fileScopedDeletes (e.g. also have v2 file scoped deletes) so that in case of any mistakes in implementation we could merge the v2 position deletes with the DV but I think the fundamental challenge is that we don't have a good way in the. Iceberg core modules to read the position deletes. All that code is in the data module. 
I think we'd have to do some questionable things with InternalData if we really cared about ", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Done, I just changed this to 3 duplicate DVs", "path": "core/src/test/java/org/apache/iceberg/TestRowDelta.java", "line": 1893, "type": "inline"}, {"author": "RussellSpitzer", "body": "We are just throwing an error if we have a position delete too then right?", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "~Don't we need to pass newFile into the DeleteFileSet?~", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "ah nvm I get it, too early", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Filter for duplicates first, otherwise we spend a lot of context shifts here for nothing.\r\n\r\nIe\r\n\r\nduplicates = Entries.filter(dvs.size >= 2)\r\nThreads.foreach(duplicates)", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I also have a slight personal bias towards making this function take an arg which is just the list of entries with duplicates so that there is clearer behavior from the method signature, but that's just me.\r\n\r\n```\r\nmergeAndWriteDvs(List<DeleteFileSet> dvsToMerge)\r\n```", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Not sure if we need the MergedDVContent Class here?\r\n```\r\nfor each (duplicates) {\r\n newDv = merge(duplicates)\r\n newDeleteFilesBySpec.remove(duplicates)\r\n }\r\n```\r\n?", "path": 
"core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I probably missed this, but is there a reason why we wouldn't just write a single puffin file? I thought this was going to be pretty small (hopefully)", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I hate English, I feel \"merged DVs\" refers to the DVs that were merged?\r\n\r\nMaybe\r\n\r\nMerged DVs were created from duplicate deletion vectors but the duplicates have been left as Orphan files. Writers should merge DVs prior to committing\r\n\r\nAlthough this raises the question of why we aren't just deleting the duplicates?", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Fair point, I did the filtering internally because I was trying to avoid 2 passes over all the DV referenced files but if we consider duplicates as rare, there generally wouldn't be a second pass. ", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "If you mean write a single puffin for all specs, I don't think that's possible because the output file factory is scoped to a spec for the location provider internally. \r\n\r\nThough, since the final locations don't really matter in terms of the spec, we could provide a dummy unpartitioned spec in this specific case? Then we can put everything in a single file. 
I just didn't want to change too much there, but this would simplify a lot of code around having to group the merged DVs by spec", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I can edit the language around this, yeah I can see how it's confusing\r\n\r\n>Although this raises the question of why we aren't just deleting the duplicates?\r\n\r\nI was a bit wary about doing any more unnecessary I/Os on the commit path, want to limit the chance of failures. Though we could do a best effort with try/catch?", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "We'd have to make newDeleteFilesBySpec a concurrent map, which I can do but I think there was hesitation around making state more complex. The content class also is helpful for keeping track of the spec for a given file, which we end up needing for the output file factory. Though if we use a dummy unpartitioned spec for the output file factory just for this case, then we can reduce complexity here as well. 
\r\nWe will need the partition tuple as well for a given merged file, since at the very end ", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Right, this is invoked on every add(DeleteFile) call https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java#L293 ", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Actually more than I/O, I'm a little unsure that we can even delete the duplicates from storage without doing a mini reachability analysis of the puffin to figure out what other DVs (which may not be duplicates) have blobs in it, or we'd have to rewrite the puffin.\r\n\r\ne.g. the writer may have produced a puffin , puffinFile1 with DV1, DV2, DV3 and another puffin puffinFile2 DV4, DV2-Duplicate, DV5. Even after we merge, we can't just delete puffin file 1 and 2 in it's entirety. We'd have to rewrit", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Discussed offline, since it's generally going to be 1 partition spec, we're OK here, and we can keep the existing behavior.cc @rdblue \r\n\r\nI did double check, so while the OutputFileFactory requires a spec to be passed in the DVFileWriter isn't really bound to a spec (as expected). 
The DVFileWriter uses a newLocation() API which doesn't care about the spec or any tuples (again as expected).\r\n\r\nSo I think the dummy partition spec idea I had would work if we want to further simplify all this groupi", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Ugh never mind, that sounds terrible\r\n", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "it might make sense to remove the same method from `BaseDeleteLoader`", "path": "core/src/main/java/org/apache/iceberg/io/IOUtil.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "nit: both variables here can probably be just inlined", "path": "core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n \"Cannot merge duplicate added DVs when data sequence numbers are different, \"\r\n```", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "same as above", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "looks like these can all be final", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "nit: {} can be removed", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "nit: newline after }", "path": "core/src/test/java/org/apache/iceberg/TestRowDelta.java", "line": 2337, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n {\r\n```", "path": "core/src/test/java/org/apache/iceberg/TestRowDelta.java", "line": null, "type": "inline"}, {"author": 
"nastra", "body": "just a nit but this could be expressed more fluently via `assertThat(positions).allSatisfy(pos -> assertThat(index.isDeleted(pos)).as(\"Expected position %s to be deleted\", pos).isTrue());`", "path": "core/src/test/java/org/apache/iceberg/TestRowDelta.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "This LGTM overall, thanks @amogh-jahagirdar, just left a few smaller comments", "path": null, "line": null, "type": "review_body"}, {"author": "amogh-jahagirdar", "body": "I agree, I went ahead and made this change ", "path": "core/src/main/java/org/apache/iceberg/io/IOUtil.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "This is moved to an IOUtil API, which we now use in the merging DV logic", "path": "data/src/main/java/org/apache/iceberg/data/BaseDeleteLoader.java", "line": 283, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I'm double checking if we really need this. The DVFileWriter has some level of state tracking within it and we may be able to just leverage those cleanly", "path": "core/src/main/java/org/apache/iceberg/deletes/BaseDVFileWriter.java", "line": 82, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I remember why I did it this way, so alternatively after merging the position delete index we could just iterate over each position in the bitmap and call the existing delete API, but that felt worse than just adding an API which leverages the fact that bitmaps could do bulk merges. 
That being said, I'm not currently leveraging the fact that the writer itself merges the indices for a given path.\r\n\r\nAlternatively we could initialize the writer, iterate over <referenced file, duplicate set>, and lo", "path": "core/src/main/java/org/apache/iceberg/deletes/BaseDVFileWriter.java", "line": 82, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Looked at this with fresh eyes, I went ahead with the dummy unpartitioned spec approach, so we don't need to group by spec for the puffin writer. Now we'll just do a single file.", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this change is correct, but I want to note that in the future we could avoid failing by merging DVs as long as that is allowed by the operation being committed.", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": 858, "type": "inline"}, {"author": "rdblue", "body": "That sounds reasonable to me, and it may be a good idea not to delete v2 deletes that come in because we don't want this to be used for conversion.\r\n\r\nNow that I'm looking at this behavior again, I'm curious about why we use a `DeleteFileSet`. It seems strange to me that we would automatically dedup `DeleteFile` instances passed in through the API. Why not fail in that case because it is a caller bug? I can't think of a reason for committing the same DV twice.", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't like that the `DeleteFile` instances are still kept in two places, and that both places will deduplicate using sets but different logic once you account for the map keys. 
I know that because we are not deduplicating v2 deletes, we need a place for them to be tracked, but that doesn't mean we need to store the `DeleteFile` instances for DVs twice.\r\n\r\nThe reason why I don't like the double storage is that it doesn't handle some strange cases. For instance, what if a `DeleteFile` is added f", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I recommend avoiding methods like this that modify the instance state of snapshot producer. That should only happen in cases where we need to cache results, and cached results should be held in separate instance state. For example, this is how `newDataFilesBySpec` and `cachedNewDataManifests` work. The new data files are kept organized by spec, but that state is not rewritten. Instead, `cachedNewDataManifests` is used in case of retry.\r\n\r\nModifying the instance state like this hides how the clas", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't recommend using a synchronized list because that will have needless contention on a lock. Other places like `ManifestFilterManager` use a known task size (map size) to allocate an array that each task produces output to using an index from `Tasks.range(map.size())`.", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Yeah, Puffin files aren't specific to a partition spec. Just the `DeleteFile` entries are. It should just be one Puffin file.", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I'm also not a fan of updating instance state in multiple places. 
This removes DVs here, but adds them in `writeMergedDVs`", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this method should return a `List<DeleteFile>` that represents all of the DVs after merging has been done. It should probably also use a cache to ensure that multiple calls produce the same list and reuse as much work as possible, although this could be done in a follow up and I could be convinced to cache at a different level (like delete manifests).\r\n\r\nThen the caller should take the list produced here, combine it with the v2 `DeleteFile` list, group by spec, and write delete manifests", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Currently, this is called in tasks submitted to a threadpool, but I wonder if there is a better model for parallelizing DV reads. We know that most DVs that are being merged are going to be in different Puffin files, so each DV read is going to be a separate IO to storage. If that's the case, why not use more memory and run `Deletes.readDV` in a threadpool for all DV files, then merge them in memory just before writing into Puffin? That way all of the reads are parallelized rather than sequentia", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think there is a guarantee about the underlying type of `partition` (it should be `StructLike`, I think). If that's the case, then `Objects.equals` is not going to work all of the time. If you want to compare partitions then you'd need a struct comparator for the struct produced by the partition spec. 
Honestly, I think that's a bit overkill if you know that the referenced data file matched though.", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This seems a bit heavy to me. I'd combine this with whatever cache mechanism you end up using.", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Since we're not tracking by DeleteFileSet at the time of adding, we treat every addition as a new delete, even potential duplicates (unless we want to do a look back in the list on every addDeleteFile, but I'm very against that since it's an O(deletes-added^2) operation effectively at that point for a commit).\r\n\r\nIf we look at how `hasNewDeleteFiles` is actually used, I don't think this is really consequential. hasNewDeleteFiles is true and there's a cached state we use the flag as an indication", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": 277, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Yeah had an old PR out for this https://github.com/apache/iceberg/pull/11693/files#diff-410ff1b47d9a44a2fd5dbd103cad9463d82c8f4f51aa1be63b8b403123ab6e0e (probably a bad PR title since by definition for the operation if the positions are disjoint, it's not conflicting)", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": 858, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "@rdblue These are 2 disjoint fields, one for a list of v2 deletes and a multimap for DVs.\r\nThe map is a `LinkedHashMap` because we have a bunch of tests which have expectations on the exact orders of entries in a manifest. The previous change didn't require anything because we worked with the deleteFilesBySpec, and inherently preserved the order. 
\r\n\r\nI personally think our tests should probably get away from expecting a certain order in manifests, and just assert the contents (or at least have v", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "@rdblue let me know how you feel about the `DeleteFileSet.of(positionandEqualityDeletes)`. \r\nI know we were kind of against de-duping but I think the fact that the two fields are disjoint now avoids that partition spec case you mentioned. I'm a bit worried that not deduping before producing the manifests is a regression compared to the previous behavior. And there's a good argument that if we can do it correctly, relatively cheaply, it's better to do it to avoid any bad metadata (similar to why ", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Great point, yeah we already know the size ", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I was able to remove this, but I think I'll save caching for a follow-up if that's OK? I mainly want to aim for making sure we're just correct on this one.", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "@rdblue let me know if you feel strongly about this check. While it is `StructLike` and doesn't guarantee an equals implementation, the way I look at it is the following:\r\n\r\n1. Generally it'll be PartitionData which does do a type and value by value comparison.\r\n2. 
Even if the implementation changes from under us and ends up being another StructLike which doesn't override equals, then it's a reference equality which will worst case be a false positive and just fail the commit.\r\n\r\nAnother rationa", "path": "core/src/main/java/org/apache/iceberg/DVUtil.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "> It may indicate that a writer in a single commit is doing duplicate work\r\n\r\nI was mostly thinking for a pov of a left anti join kind of scenario where the executors processing splits had overlapping data (or incorrect spliting) though i am not sure its possible in spark way of doing deletes.", "path": "core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "can we keep the partition values (instead of always as an unpartitioned)? I know it would write multiple Puffin files. ", "path": "core/src/main/java/org/apache/iceberg/DVUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I'd prefer not to use incorrect comparisons, even if we think they are correct most of the time. 
If you want to compare these, then I'd just create a correct comparator using [`forType(StructType)`](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/Comparators.java#L54).\r\n\r\nWe usually prefer to reuse those, so you could also pass it in or change the method call a bit.\r\n\r\nThe problem with not using a correct comparator is that although _you_ understand and are", "path": "core/src/main/java/org/apache/iceberg/DVUtil.java", "line": null, "type": "inline"}], "1870": [{"author": "jackye1995", "body": "nit: the variable should be `deletedRecords`", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "nit: Following the convention of hive and hadoop catalog, the default name should probably be \"jdbc\"", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "nit: why not just select *?", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "we should be able to implement `SupportsNamespaces`, do you plan to do that in another PR?", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "How do we incorporate the database concept in all SQL databases? Looks like we are just creating this table in the default database. My though on this is we can have a database named `iceberg` (maybe configurable), and inside it there can be a `namesapce` table to store namespace info, and this table (maybe with a different name like `tables`) to store all table information. Thoughts?", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "1. 
I only see the hsqldb used for testing, what are the uses of mysql and postgres?\r\n2. should mark as `testCompile` instead\r\n3. versions should go to `versions.props`\r\n4. have you checked if the license is okay for hsqldb?", "path": "build.gradle", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "if it's only for testing, [sqlite](https://mvnrepository.com/artifact/org.xerial/sqlite-jdbc) is probably the safest choice", "path": "build.gradle", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "I think it is sufficient to do `Preconditions.checkNotNull`, is there any benefit for doing this complicated check?\r\n\r\n", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "I think I prefer `DriverManager`, because `getConnection(String url, Properties info)` is much more flexible. Many JDBC connectors need more than username and password, for example AWS RDS needs `verifyServerCertificate` and `useSSL`. \r\n\r\nI think instead of individual config fields, JDBC catalog can expose a config prefix `jdbccatalog.property.`, and all configs under this prefix would be added to properties and initialize a connection. For example, user name and password would become configs `j
One is sufficient.", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Exceptions should not be discarded. Instead, this should create a new exception with the caught exception as a cause.\r\n\r\nAlso, we're moving from `RuntimeIOException` to Java's `UncheckedIOException`. So please use that instead.", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "The `Configurable` interface that this implements is part of `org.apache.hadoop.conf`. Though I somewhat still agree. In my implementation, I don't have it in this package.", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "I agree on using `DriverManager`. Modern JDBC implementations will automatically register themselves and make themselves findable via the appropriate `META-INF/services` file. And yes, many different JDBC implementations / ways that people have stood up their relational databases require a lot more information.", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "The hadoop `Configurable` should have nothing to do with the location of this `JdbcCatalog`. It is only used by `HadoopFileIO`. Btw, I just noticed this is in the core module. Should it instead be in its own module?", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Is this List allocation necessary? 
It seems like it's discarded when results is reassigned from `tableDao.getAll(namespace)` below.", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Nit: More unnecessary white space.", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Would it make more sense to just store the `Namespace` directly?", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcNamespace.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "How about `org.apache.iceberg.jdbc`? I'm okay with `catalogs` as well.\r\n\r\nAs for the module, I'm okay putting this in core for now. Let's see how big it ends up being and we can move it when we know more. I think that if it has a small enough dependency footprint, we can keep it here like we talked about in the recent community sync. If it is small, then it is much easier for people to have it included in our runtime builds by default.", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Minor: Rather than breaking in the middle of a method call, the ternary operator provides good places to break lines:\r\n\r\n```java\r\nthis.fileIO = fileIOImpl == null ?\r\n new HadoopFileIO(hadoopConf) :\r\n CatalogUtil.loadFileIO(fileIOImpl, properties, hadoopConf);\r\n```", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "We typically pass a connection string via [`uri`](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/CatalogProperties.java#L31).\r\n\r\nIt may also be a good idea to have a way to avoid exposing credentials. 
We typically allow registering a `Supplier<String>` that is called, although here it may make sense to use `Function<Map<String, String>, String>` so that the catalog config can be used.", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Looks like this doesn't yet implement an atomic swap of the old metadata location for the new one, so table updates would be unsafe because concurrent writers would clobber each other's commits. @kbendick, did your solution tackle that problem yet?", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcTableDao.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "It doesn't look like the dbutils classes are very useful here compared to simple prepared statements. Unless I'm missing something, I'd rather not rely on this dependency if we can add slightly more code here to use the JDBC API directly.", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcTableDao.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think all you would need to do is to change the SQL slightly:\r\n\r\n```sql\r\nUPDATE iceberg_tables\r\nSET metadata_location = ?, previous_metadata_location = ?\r\nWHERE table_namespace = ? AND table_name = ? AND metadata_location = ?\r\n```\r\n\r\nThat adds a predicate to ensure that the table namespace and name combination will only be changed if the metadata_location is the expected one. If that affects 0 rows, then the metadata location was changed concurrently. If it affects 1 row, the commit was succes", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcTableDao.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "For structure, I would prefer to have all of the SQL statements as constants up at the top of the file, like the `CREATE TABLE` command. 
That makes it easy to read them independent of use here.", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcTableDao.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Any SQL exception? What possible exceptions does this include?", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcTableDao.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "You may want to also have a prefix-based lookup using LIKE:\r\n\r\n```java\r\nString namespacePrefix = namespace.toString() + \".%\";\r\n```\r\n```sql\r\nSELECT * FROM iceberg_namespaces WHERE catalog_name = ? AND namespace LIKE ?\r\n```", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcNamespaceDao.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "For the initial version, let's not support namespace metadata. We can add that later in a separate PR where we consider more options for storage.", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcNamespace.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Is this `PRIMARY KEY` and `FOREIGN KEY` syntax portable? I'm wondering if we should omit them to have more generic SQL.\r\n\r\nOf course, that would require some way to ensure that table creation is atomic because the primary key constraint wouldn't be enforced for the table. That would probably require a transaction to check whether a table exists and insert it to be portable. Some databases support `ON CONFLICT DO NOTHING` or `INSERT IGNORE` but it doesn't look like the syntax is portable.\r\n\r\nIt m", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcTableDao.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This doesn't look quite right because the exception is too generic to simply ignore and move on.\r\n\r\nAlso, no need to add a format argument for the exception. 
SLF4J loggers will detect that the last argument is an exception and add it properly without a placeholder for it.", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcTableOperations.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This doesn't need to be added unless Iceberg actually contains something that is copyrighted by the H2 database engine. The LICENSE is updated if Iceberg adds code based on another project or if Iceberg bundles another project into an Iceberg artifact (like the runtime Jars). Since this is just used as a test dependency, there is no need.", "path": "LICENSE", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Should this be `testRuntime` because the only compile dependency is for JDBC?\r\n\r\nWhy switch databases?", "path": "build.gradle", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think the problem is that we need that primary key to ensure that there is only one entry for a given table identifier. We don't want two concurrent creates to collide and create two different tables.", "path": "core/src/main/java/org/apache/iceberg/hadoop/JdbcTableDao.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "Thanks for the work! I have a doubt around the use of `fs` in `JdbcCatalog`, see comments for more details. Also, I see quite a few places commented out, can you actually remove them if they are not useful anymore?", "path": null, "line": null, "type": "review_body"}, {"author": "jackye1995", "body": "I think `testCompile` is good enough, if we want to upgrade gradle we should upgrade all syntax at the same time.", "path": "build.gradle", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "you need to also define a no-arg constructor for `initialize` for dynamic loading purpose. And in that case, you still need to initialize warehouse location and fileIO. 
So it is probably better to move all constructor logic to `initialize`.", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "I don't understand why is `fs` used here. It seems like you have a `JdbcNamespace` that stores all the information, but also creates a file path for a namespace. Why are you doing this? I don't think the JdbcCatalog should have any dependency on Hadoop file system.", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This adds quite a bit of complexity to the JDBC implementation because it requires a separate table for namespace and stores both namespace name and metadata as JSON objects. I think that ensuring consistency between the two tables adds a lot of unnecessary complexity. What happens if a table is added to a namespace that as it is concurrently deleted?\r\n\r\nI think a much simpler implementation is to omit the namespace table and determine whether a namespace exists based on whether there are any ta", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcNamespace.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "We would normally omit the \"is\" from a method name if it is clear that the return value is a boolean. I think that `exists` meets that requirement because it is natural to say \"if table exists\" instead of \"if table is exists\".", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcTable.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Rather than using JSON to encode the namespace, I suggest converting it to a String using `Joiner.on(\".\")`.\r\n\r\nThere's a trade-off to simplifying the problem by doing it that way: both ```a.`b.c` ``` and `a.b.c` end up as `a.b.c`. That has two effects:\r\n1. Namespaces that \"collide\" like that are considered the same namespace\r\n2. 
When listing namespace `a` the result is `[b]` and not `[b.c, b]`\r\n\r\nI think that those are fine. For the collisions, I think it is rare for users to _want_ namespaces t", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcTable.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Nit: no need for a blank line at the start of a method.", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcTable.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why are there two metadata locations passed here?", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcTable.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This connection is passed into the `TableOperations` when creating a table, which means that it is potentially shared across threads. I don't think that connections are thread-safe, so that is a problem. For Hive, we implemented a connection pool that you should be able to reuse by extending `org.apache.iceberg.hive.ClientPool`. We may want to move that class into core so it can be used by both.", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Style: just one exception is caught, so the variable name should be singular, like `throwable`. Also, we tend to refer to the exception as simply `e` in most catch blocks.", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "No need to wrap in an `IOException`. I think we should create an unchecked exception used to wrap `SQLException`.", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Also, if we can catch more specific subclasses, that would be ideal. 
Here's a list of them: https://docs.oracle.com/javase/7/docs/api/java/sql/SQLException.html", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I'm not sure that I think this class is valuable. It is nice that it keeps track of the SQL statements and exposes some high-level methods, like `update`. The problem is that this provides many of the same operations as catalog, just under different names and with a slightly different API. For example, `getAll(Namespace)` is basically the same thing as `Catalog.listTables`.\r\n\r\nThis class is also used in inconsistent ways. In `Catalog.delete` and in `TableOperations`, it is used for a single tabl", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcTable.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Are these needed? I think that autocommit can be used since all of the operations should just require a single command.", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcTable.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "One more thing: it would be helpful if all of the methods in this class that correspond to methods in `Catalog` were named the same way as the ones in `Catalog`. For example, `getAll` is really `listTables` and `updateTableName` is really `renameTable`. Using names consistently will make it easier for people to understand what this is doing.", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcTable.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This shouldn't be needed because only Hadoop tables use version hints.", "path": "core/src/test/java/org/apache/iceberg/jdbc/TestJdbcCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think that there need to be tests for the JDBC operations that are exposed by `JdbcTable` directly. 
Those should check cases that are required, like creating a table fails if it already exists, updating a table when metadata is out of date catches the error, etc. Then we can add functional tests to ensure that the catalog calls those correctly.", "path": "core/src/test/java/org/apache/iceberg/jdbc/TestJdbcCatalog.java", "line": 70, "type": "inline"}, {"author": "rdblue", "body": "Why extend `UncheckedIOException`? This class is supposed to be the equivalent for `SQLException`, which is not an `IOException`.", "path": "api/src/main/java/org/apache/iceberg/exceptions/UncheckedSQLException.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "There is no need to wrap an exception that isn't a `SQLException` is there?", "path": "api/src/main/java/org/apache/iceberg/exceptions/UncheckedSQLException.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think this is needed.", "path": "api/src/main/java/org/apache/iceberg/exceptions/UncheckedSQLException.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think this is needed.", "path": "api/src/main/java/org/apache/iceberg/exceptions/UncheckedSQLException.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "`DriverManager.getConnection` will [throw `SQLTimeoutException` or a generic `SQLException`](https://docs.oracle.com/javase/8/docs/api/java/sql/DriverManager.html#getConnection-java.lang.String-java.util.Properties-). Those are the only two exception classes that need to be handled. 
Considering there isn't much that can be done for either one, I think this should simply wrap any `SQLException` in `UncheckedSQLException` with a little context, like the connect URI.", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcClientPool.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "The second argument for the `ClientPool` constructor is an exception that indicates a connection has gone bad and should be recreated using `reconnect`. I don't think that any `SQLException` should trigger reconnecting because there are a lot of different subclasses that have nothing to do with connection issues.\r\n\r\nIn Hive, we use `TTransportException`, which indicates a connection failure rather than an error from the remote service. It looks like the equivalent for JDBC might be `SQLNonTransi", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcClientPool.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think that Iceberg uses `Properties` anywhere, so this should probably accept `Map<String, String>` and pass that to JDBC as properties if necessary.\r\n\r\nAlso, config properties don't need config context like `iceberg.jdbc` because that context is already dependent on how the catalog is configured. For example, Spark catalogs will use `spark.sql.catalog.catalog_name.uri` for the connection URI already. So the property keys here should just be the standard ones defined in [`CatalogProperti", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcClientPool.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "No need to prefix with `this` when calling a method or reading a field. 
We only add it to distinguish between setting instance fields (`this.x = ...`) and local variables (`x = ...`).", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcClientPool.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Nit: continuation indents should be 2 indents, which is 4 spaces, not 8.", "path": "core/src/test/java/org/apache/iceberg/jdbc/TestJdbcTableConcurrency.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This test should also be moved if the original class was moved.", "path": "hive-metastore/src/test/java/org/apache/iceberg/hive/TestClientPool.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Nit: unnecessary whitespace change.", "path": "spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Style: in other places, we use different naming conventions:\r\n\r\n* `FileIO` instances are typically called `io`\r\n* `Configuration` instances are typically called `conf`\r\n* A pool would typically be a plural named for the pooled objects, like `connections`", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think we should suppress this. Instead, can you update the code to avoid it?", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Is this required by the database or is it just a convention? 
I don't think we should check for multiple table names.", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think these exceptions should be handled in `initializeCatalogTables`, not in this method.", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why doesn't this store the table name that was found? Does it rely on case insensitive SQL behavior later?", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This will result in `toString` called on the result of `levels()`, which is a `String[]`. That isn't correct. I think you want to pass `SLASH.join(table.namespace().levels())` into this `join` instead.", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Style: please add newlines after control flow blocks like `if`, `for`, etc.", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcCatalog.java", "line": 177, "type": "inline"}, {"author": "rdblue", "body": "Could this use `try-with-resources` instead of calling close later?", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This is not a correct return. If `SQLException` was thrown, then this should throw `UncheckedSQLException` with context about the operation, like `Cannot list tables in namespace: %s`.", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think that the result set and prepared statement are tied to the connection that was used to create the statement. 
That means that as long as this method is using either the statement or the result set, the connection should not be reused by another thread.\r\n\r\nTo hold the connection, the whole query will need to go in the `run` block. Here's my version of this method with that change:\r\n\r\n```java\r\n @Override\r\n public List<TableIdentifier> listTables(Namespace namespace) {\r\n if (!this.names", "path": "core/src/main/java/org/apache/iceberg/jdbc/JdbcCatalog.java", "line": null, "type": "inline"}], "9008": [{"author": "rdblue", "body": "I don't understand why there are so many changes here. Couldn't this leave `now` in micros and just convert nanos to micros here?", "path": "api/src/main/java/org/apache/iceberg/expressions/ExpressionUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Should this logic move into `DateTimeUtil` now that there are two versions? I think we've avoided that up until now to to avoid modifying this code, but if it is changing anyway we could at least standardize.", "path": "api/src/main/java/org/apache/iceberg/expressions/Literals.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Does `LocalDateTime.parse` with `DateTimeFormatter.ISO_LOCAL_DATE_TIME` support nanoseconds?", "path": "api/src/main/java/org/apache/iceberg/expressions/Literals.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Conversion to date needs to be updated as well.", "path": "api/src/main/java/org/apache/iceberg/expressions/Literals.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think that the original lines need to change. We just need to add nanoseconds. Otherwise the change is lager than it needs to be.", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can you add when this will be removed? 
I think that would be in the 2.0 release.", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why not use our own unit enum? Using ChronoUnit creates odd situations where you don't know in the code what unit you're going to get, even though there are only 2 possibilities. This also ties the Iceberg API to a Java API that may change and the cost of introducing our own enum is small.", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can you add a test for nanos? I think we also need identity tests.", "path": "api/src/test/java/org/apache/iceberg/TestPartitionPaths.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Don't we need similar tests for nanoseconds?", "path": "api/src/test/java/org/apache/iceberg/expressions/TestStringLiteralConversions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can you fix the line wrapping in this comment?", "path": "api/src/test/java/org/apache/iceberg/types/TestConversions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Yes, I think it should be here.", "path": "api/src/main/java/org/apache/iceberg/expressions/Literals.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "@epgif, why does the addition of nanosecond precision require switching `now` throughout this file from microseconds to nanoseconds? 
It seems to me that this could remain as it was.", "path": "api/src/main/java/org/apache/iceberg/expressions/ExpressionUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Not using deprecated methods is a fair reason to push back, but I think it makes more sense to deprecate in the next PR so that we can evaluate the addition in isolation.", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I prefer not exposing ChronoUnit from the Iceberg API.", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Be specific. We can remove it in 2.0 so we should.", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "The conversion logic should be in `DateTimeUtil` rather than embedded here. There may be methods for doing this already for microseconds. The important thing is that we want to centralize the logic in `DateTimeUtil` if we are changing this section. 
We hadn't done it yet because this code is so old and we saw no reason to modify it.", "path": "api/src/main/java/org/apache/iceberg/expressions/Literals.java", "line": null, "type": "inline"}, {"author": "Fokko", "body": "Left some style suggestions, apart from that it looks good to me \ud83d\udc4d ", "path": null, "line": null, "type": "review_body"}, {"author": "Fokko", "body": "I'd prefer to keep the logic of this check inside of the `satisfiesOrderOf` function:\r\n```suggestion\r\n ChronoUnit otherTimestamp = ((Timestamps) other).getResultTypeUnit();\r\n if (otherTimestamp == ChronoUnit.MICROS) {\r\n return Timestamps.DAY_FROM_MICROS.satisfiesOrderOf(other);\r\n } else if (otherTimestamp == ChronoUnit.NANOS) {\r\n return Timestamps.DAY_FROM_NANOS.satisfiesOrderOf(other);\r\n } else {\r\n throw new UnsupportedOperationException(\"Unsupported tim", "path": "api/src/main/java/org/apache/iceberg/transforms/Days.java", "line": null, "type": "inline"}, {"author": "Fokko", "body": "Style nit: Instead of throwing the exception three times, it is possible to just do it once after the outer case statement.", "path": "api/src/main/java/org/apache/iceberg/transforms/Timestamps.java", "line": null, "type": "inline"}, {"author": "Fokko", "body": "Same here, we could just throw it once at the end of the method.", "path": "api/src/main/java/org/apache/iceberg/transforms/TransformUtil.java", "line": null, "type": "inline"}, {"author": "Fokko", "body": "```suggestion\r\n } else if (type.typeId() == Type.TypeID.DATE) {\r\n```", "path": "api/src/main/java/org/apache/iceberg/transforms/Transforms.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't quite understand the need to make this change. The `TIMESTAMP` matcher already accepts 9 digits and then converts to micros, ignoring nanosecond values. 
This PR now distinguishes between `TIMESTAMP` (max precision 6) and `TIMESTAMP_NS` (precision 7 to 9) but in the end, the values are parsed and the nanosecond component is discarded using `nanosToMicros`.\r\n\r\nWhy go to this trouble? Can't these values be parsed into a microsecond timestamp and then sanitized without making changes?", "path": "api/src/main/java/org/apache/iceberg/expressions/ExpressionUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Style: don't use `final` for local variables.", "path": "api/src/main/java/org/apache/iceberg/expressions/Literals.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "The decision to use the same `TimestampType` class and use `unit` makes this logic more complicated than necessary.\r\n\r\nThe idea behind having a Java enum, `Type.TypeID`, was that in most cases you can use a switch statement and very simple logic within that switch. The reason why `TimestampType` was parameterized by a boolean for `adjustToUTC` is that the logic in most places is identical for `timestamp` and `timestamptz` because they have the same internal representation.\r\n\r\nFor `timestamp_ns` ", "path": "api/src/main/java/org/apache/iceberg/expressions/Literals.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Looks like this should be a constant.", "path": "api/src/main/java/org/apache/iceberg/util/DateTimeUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this is adding too many conversion methods and relying on the caller to use its own OffsetDateTime or LocalDateTime logic (I'm looking at `StringLiteral.to(TimetsampType)`).\r\n\r\nI think this should use the established pattern similar to `isoTimestamptzToMicros` and just add `isoTimestamptzToNanos`.", "path": "api/src/main/java/org/apache/iceberg/util/DateTimeUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Is this needed? 
If not, just expose `isoTimestampToNanos`.", "path": "api/src/main/java/org/apache/iceberg/util/DateTimeUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Another issue with using `Unit` is that this is splitting conversion logic between `DateTimeUtil` and here. It should all move to `DateTimeUtil` and this should just call `isoTimestamp[tz]ToNanos` or isoTimestamp[tz]ToMicros`.", "path": "api/src/main/java/org/apache/iceberg/expressions/Literals.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This is a problem because we don't know how to interpret the value.\r\n\r\nBefore, the value was always interpreted as micros, which is probably the right way to update this if we want to preserve behavior:\r\n```java\r\n return (Literal<T>) new TimestampLiteral(Unit.MICROS, value()).to(type);\r\n```\r\n\r\nThis is going to cause another problem, eventually, because v3 is going to introduce type promotion from long to timestamp. The current plan is to always interpret long values as millis because that is ", "path": "api/src/main/java/org/apache/iceberg/expressions/Literals.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Is this example in the spec?", "path": "api/src/test/java/org/apache/iceberg/transforms/TestBucketing.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What is this testing? It uses micros.", "path": "api/src/test/java/org/apache/iceberg/expressions/TestStringLiteralConversions.java", "line": 136, "type": "inline"}, {"author": "rdblue", "body": "Test that Avro produces a value within 1 micro of this", "path": "api/src/test/java/org/apache/iceberg/expressions/TestStringLiteralConversions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Is this code path used? The `TIMESTAMP` and `TIMESTAMPTZ` patterns match `\\d{1,9}` and are checked first. 
I think those will likely match anything that these would match before checking the nanosecond patterns.", "path": "api/src/main/java/org/apache/iceberg/expressions/ExpressionUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "You can remove this comment because `now` is not in milliseconds. It is in micros. Otherwise this wouldn't work. See lines 256 and 349.", "path": "api/src/main/java/org/apache/iceberg/expressions/ExpressionUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can this be a method in `DateTimeUtil`? I think that's a good place to put it.", "path": "api/src/main/java/org/apache/iceberg/expressions/Literals.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this is going to be a problem. We will very likely allow type promotion from `timestamp` to `timestamp_ns`. If a table were bucketed by a timestamp column (which is allowed by the spec) then existing partition values would no longer be correct. For other type promotion cases, the spec aligns the hash function (see Appendix A). For example, `bucket(int)` is defined as `i -> bucket((long) i)` so that the values match.\r\n\r\nI don't think that we can do that in this case because the hash funct", "path": "api/src/main/java/org/apache/iceberg/transforms/Bucket.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Is it safe to divide or do we need to use `Math.floorDiv`? I think we use that function in other places.", "path": "api/src/main/java/org/apache/iceberg/expressions/ExpressionUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Is it possible to avoid exposing `ChronoUnit`?\r\n\r\nAlso, Iceberg method names should almost never use `get`. 
Either use a more specific verb or omit `get` because it doesn't add anything.", "path": "api/src/main/java/org/apache/iceberg/transforms/Timestamps.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Was this change needed?", "path": "api/src/main/java/org/apache/iceberg/transforms/Timestamps.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why refactor and remove `name`?", "path": "api/src/main/java/org/apache/iceberg/transforms/Timestamps.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Looks like this is incorrect. The value is `timestampMicros`.", "path": "api/src/main/java/org/apache/iceberg/transforms/TransformUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "When would a `TimestampType` ever have `Type.TypeID.TIMESTAMP_NANO`? Wouldn't that only happen for `TimestampNanoType`?", "path": "api/src/main/java/org/apache/iceberg/transforms/Timestamps.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Typo: Nanoo", "path": "api/src/test/java/org/apache/iceberg/expressions/TestMiscLiteralConversions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why reorder these types? It would be fewer changes if you didn't move this.", "path": "api/src/test/java/org/apache/iceberg/expressions/TestMiscLiteralConversions.java", "line": 102, "type": "inline"}, {"author": "rdblue", "body": "Let's avoid string comparison here.", "path": "api/src/main/java/org/apache/iceberg/transforms/SortOrderVisitor.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "it's a good idea to add new tests for every single method that is being added here. 
These tests should go into `TestDateTimeUtil`", "path": "api/src/main/java/org/apache/iceberg/util/DateTimeUtil.java", "line": 95, "type": "inline"}, {"author": "nastra", "body": "I think it would be good to add a test that exercises this path here", "path": "api/src/main/java/org/apache/iceberg/transforms/Days.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "nit: I think it would be good to have a unit test that exercises this path", "path": "api/src/main/java/org/apache/iceberg/transforms/Hours.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "is there a particular reason to switch to string comparison here and in other places?", "path": "api/src/main/java/org/apache/iceberg/transforms/SortOrderVisitor.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "why not use `String.format` here?", "path": "api/src/main/java/org/apache/iceberg/transforms/Transforms.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "should have a test", "path": "api/src/main/java/org/apache/iceberg/transforms/Years.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "it would be good to have tests for all of these error conditions", "path": "api/src/main/java/org/apache/iceberg/transforms/Timestamps.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n public void applyBadSourceType() {\r\n```", "path": "api/src/test/java/org/apache/iceberg/transforms/TestTimestamps.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n public void applyBadResultType() {\r\n```", "path": "api/src/test/java/org/apache/iceberg/transforms/TestTimestamps.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "please also update the other test method names accordingly, as the codebase typically doesn't use `_` in method names", "path": "api/src/test/java/org/apache/iceberg/transforms/TestTimestamps.java", "line": null, "type": "inline"}, 
{"author": "nastra", "body": "nit: I think it would be good to have a space before/after `->`", "path": "api/src/test/java/org/apache/iceberg/transforms/TestTimestamps.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I think this can be inlined, so we don't need to define the `static final` here. This should also be reflected in the other tests", "path": "api/src/test/java/org/apache/iceberg/transforms/TestYears.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n \"Unsupported source/result type units: \" + type + \" -> \" + resultTypeUnit);\r\n```", "path": "api/src/main/java/org/apache/iceberg/transforms/Timestamps.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n Duration duration() {\r\n```", "path": "api/src/main/java/org/apache/iceberg/transforms/Timestamps.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I wonder if this should be using `Math.toIntExact(DateTimeUtil.convertNanos(timestampUnits, resultTypeUnit.unit))`", "path": "api/src/main/java/org/apache/iceberg/transforms/Timestamps.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "similarly to my other comment, I don't think this `static final` is needed", "path": "api/src/test/java/org/apache/iceberg/transforms/TestDays.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "this should be typically at the top of the class", "path": "api/src/main/java/org/apache/iceberg/util/DateTimeUtil.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "the `else` path doesn't seem to be tested", "path": "api/src/main/java/org/apache/iceberg/util/DateTimeUtil.java", "line": 255, "type": "inline"}, {"author": "findepi", "body": "@nastra just curious, can this be checked statically, perhaps with a tool like checkstyle?", "path": "api/src/test/java/org/apache/iceberg/transforms/TestTimestamps.java", "line": null, "type": "inline"}], "2591": [{"author": 
"kbendick", "body": "Since we are rewriting files in groups (if I'm not mistaken), would it make more sense to refer to this as a `groupId`?", "path": "api/src/main/java/org/apache/iceberg/actions/RewriteStrategy.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I think this is heading in the right direction. We should spend some time simplifying the `execute` method. I think we can split it into a number of smaller methods.", "path": null, "line": null, "type": "review_body"}, {"author": "aokolnychyi", "body": "Shall we move it to core for now? I think it is a bit too early to expose it. ", "path": "api/src/main/java/org/apache/iceberg/actions/RewriteStrategy.java", "line": 30, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: `setID` ", "path": "api/src/main/java/org/apache/iceberg/actions/RewriteStrategy.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Not needed?", "path": "api/src/main/java/org/apache/iceberg/io/CloseableIterable.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Can we group all abstract method together? I think they should also be below the constructor. 
", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: make final and move above mutable vars?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: extra space, do we need an explicit `this` in `this.options()`?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Seems like we don't close this `Iterable` anywhere?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Can we split this method into smaller ones? I think planning file groups could definitely be in another method.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: I think it is a bit more readable to have `.stream()` on the same line with `=` and then each subsequent action on a separate line.\r\n\r\n```\r\n... filesByPartition = Streams.stream(files)\r\n .collect(...);\r\n```\r\n", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Would it make sense to put this into a separate method to simplify the collector statement?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Here as well. 
It may be a bit easier to read if Collectors.toMap was on a single separate line.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Is this used anywhere?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: `Maps.newHashMap`?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: shall we define temp vars for UUID.randomUUID() and FileGroupInfo?\r\n\r\n```\r\nString xxx = UUID.randomUUID().toString();\r\nFileGroupInfo yyy = new FileGroupInfo(xxx, myJobIndex, myPartIndex, e.getKey());\r\nreturn Pair.of(yyy, tasks);\r\n```", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Is there a better prefix we can give? `my` -> `current` or something?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I'd consider putting this into a separate method and calling it from map.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Can we move `totalGroups` closer to the place it is used?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "`completedRewrite` -> `completedRewrites`?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Typo? 
`committerService`?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Can we put the commit closure into a separate method?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I am not sure I understand the second branch of this condition. Suppose I have 10 partitions and allow only 2 commits at most. If I managed to compact 9 partitions and one failed, we should still commit the remaining 4?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Wait, I think that's handled below.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Shouldn't it be `>=`, though? 
We should be able to commit if we have enough groups, no?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Shall we stop other rewrites if this happens?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Should this happen only if partial progress enabled?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Or we actually assume this only happens if all rewrites have completed?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: `Collectors.toMap(Pair::first, Pair::second)`", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "A helper error message?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Let's move this directly after the constructor together with other abstract methods.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Would it make sense to move this to core where we have other results?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: final?", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSpark3.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", 
"body": "`BaseRewriteDataFilesSpark3Action`?", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSpark3.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: wrong comment?", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/rewrite/Spark3BinPackStrategy.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I think we should clone the original session and disable AQE to avoid any surprises. ", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/rewrite/Spark3BinPackStrategy.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I think we should annotate each job with proper Spark UI.\r\n\r\n```\r\nRewriting files (bin-pack, partition '$partition', jobIndex/jobs, partitionIndex/partitions) in db.tbl\r\n```", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/rewrite/Spark3BinPackStrategy.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: What about moving `throw` to `default` in `switch`?", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSpark3.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Sure I was mostly copying the current usage in RewriteManager\r\n", "path": "api/src/main/java/org/apache/iceberg/actions/RewriteStrategy.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "One of the things I was having trouble with here is that we have iceberg.actions and spark.iceberg.actions so I can't keep this package private and use what I think is the correct package for the implementations", "path": "api/src/main/java/org/apache/iceberg/actions/RewriteStrategy.java", "line": 30, "type": "inline"}, {"author": "RussellSpitzer", "body": "Of course, trying hard to get everything in :)", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/rewrite/Spark3BinPackStrategy.java", "line": null, "type": "inline"}, {"author": 
"RussellSpitzer", "body": "Yep was planning on having grouping and filtering separate just haven't got to it yet", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Not yet, I was going put this in the job description as well, ie Partition X : # Y of Total\r\n", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "This is only in the case that partialProgress is not enabled, in that case we either have 1 commit (this section of the code) so there is no need to stop rewrites since they have all finished", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "This should happen if the maxGroup size is not equally divisible, whether or not partial progress is enabled. If PartialProgress is enabled it will just be the remainder of the last group. If it isn't enabled this will be the entire set of all groups.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Maybe I should just switch PartialProgress to a completely different code path\r\nmay be easier to understand", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Shouldn't the result be propagated during the commit? Also, what if a group fails? 
What should be in the result map?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Does it mean we give 10 minutes for the committer thread to gracefully shut down?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I think we don't properly clean up references in `FileScanTaskSetManager` and `FileRewriteCoordinator`. We could invalidate `FileRewriteCoordinator` in a finally block just above and we could make `commit` in `FileRewriteCoordinator` return a set of new files.", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/rewrite/Spark3BinPackStrategy.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Yeah not sure how this happened", "path": "api/src/main/java/org/apache/iceberg/io/CloseableIterable.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I pulled out this and the group into a single function, I think that helps, but take a look in the new version", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Yep, this is the amount of time we give the service after all rewrites have completed (Tasks.run blocks) for all commits to finish.\r\n\r\nSo basically once all rewrite jobs have completed the remaining commits have 10 minutes to execute", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I'm not sure, but I think we could always move it when we start the other implementations. 
I think we'll probably be extracting some of the other functions as well", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I can change this but is it a Base class if it's the leaf implementation? I know we have that on the other extensions so I'll do it here too.", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSpark3.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Hmm I think that's true, it does mean i have to handle cleanup slightly differently in the action itself. The case where we don't have partial progress enabled but do have multiple file groups being committed can lead to a situation where several sets of files that have been successfully rewritten but since the commit is invalid we need to clear them out.\r\n\r\nIt's late so we can probably talk about this tomorrow after I slept a bit.", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/rewrite/Spark3BinPackStrategy.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "changed to completedRewriteIds", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Note, job desc is set in base action", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/rewrite/Spark3BinPackStrategy.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "should we make the constant from `PartitionSpecParser` public? 
then we don't have define it here as it is a more general constant than spark", "path": "spark/src/main/java/org/apache/iceberg/spark/SparkWriteOptions.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "swapping everything to groupID", "path": "api/src/main/java/org/apache/iceberg/actions/RewriteStrategy.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I need to actually delete this, I had an implementation for the spark writer using this property but it did not work. I'll remove this for now", "path": "spark/src/main/java/org/apache/iceberg/spark/SparkWriteOptions.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Yeah, I know. That's why I am wondering whether we can move it to core for now.", "path": "api/src/main/java/org/apache/iceberg/actions/RewriteStrategy.java", "line": 30, "type": "inline"}, {"author": "aokolnychyi", "body": "No longer needed?", "path": "spark/src/main/java/org/apache/iceberg/spark/SparkWriteOptions.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: extra line", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Does this have to fail everything? I'd probably log an error or something.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I think it would be simpler if you grouped the logic in `files()`, `filterAndGroupFiles()` and closing of resources into one method. 
You could call it something like `planFileGroups`.\r\n\r\n```\r\n private Map<StructLike, List<List<FileScanTask>>> planFileGroups(RewriteStrategy strategy) {\r\n CloseableIterable<FileScanTask> fileScanTasks = table.newScan()\r\n .filter(filter)\r\n .ignoreResiduals()\r\n .planFiles();\r\n\r\n try {\r\n Map<StructLike, List<FileScanTask>> filesByParti", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Then it would be just a single line in `execute` that would call `planFileGroups`.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Sure, we could just log an error and keep going", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "extra line", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/rewrite/Spark3BinPackStrategy.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I think we currently don't remove partitions that end up getting 0 file groups after filtering. 
", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Maybe, we should replace the second stream API part?\r\n\r\n```\r\n private Map<StructLike, List<List<FileScanTask>>> planFileGroups(RewriteStrategy strategy) {\r\n CloseableIterable<FileScanTask> fileScanTasks = table.newScan()\r\n .filter(filter)\r\n .ignoreResiduals()\r\n .planFiles();\r\n\r\n try {\r\n Map<StructLike, List<FileScanTask>> filesByPartition = Streams.stream(fileScanTasks)\r\n .collect(Collectors.groupingBy(task -> task.file().partition()));\r\n\r\n Map<Stru", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I think by creating `toJobStream(fileGroupsByPartition)`, you should be able encapsulate the index vars above.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Isn't `partialProgressEnabled` always true in this case?\r\n ", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Also moving that new private commitOrClean below the abstract methods", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Yep left over from the unified execute, i'll clean that up", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Could be >=?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": 
"Is this needed?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "typo: `interrupted`?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I think this is a good change, implementing", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Also moving execute up to be with the other public overrides", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "no longer needed, part of the more complicated single execute method.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Yep sorry didn't push that fix yet, I noticed this on my last run through", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: I guess this will be renamed into `xxxRewrites` following the rename of the option.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: I think `);` can be on the previous line where `collect` is.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: can be just `int`?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": 
"aokolnychyi", "body": "nit: I think the name of the var does not match the method name. We should probably align them.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: `doExecuteWithPartialProgress`?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I think we are missing the set of group ids here.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: I think the order of methods in this class can be a little bit improved. For example, this one is not referenced by any of the adjacent methods so I don't know where it belongs or how it is used unless I open this in an IDE.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: arg formatting", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Should `numFilesPerPartition` be `numGroupsPerPartition`?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: `xxx-DATA-FILES`", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "This method uses `infoListPair.first()` and `infoListPair.second()` quite a bit and I keep going back to see what each of those mean. 
Also, `infoListPair` isn't very descriptive and `Pair<FileGroupInfo, List<FileScanTask>>` is kind of bulky.\r\n\r\nWhile looking for a slightly more descriptive name, I saw it is also called a group in other places. What about creating a private helper class instead of using `Pair`? Also, what about creating temp vars that refer to the info and files? That should shor", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "These vars are passed around in a lot of methods. What about creating a context object like we have in a few other places in Iceberg?\r\n\r\nI think something like this would simplify the code:\r\n\r\n```\r\n private static class ExecutionContext {\r\n private final Map<StructLike, Integer> numGroupsByPartition;\r\n private final int totalGroupsCount;\r\n private final Map<StructLike, Integer> partitionIndexMap;\r\n private final AtomicInteger groupIndex;\r\n\r\n public ExecutionContext(Map<StructLike", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "sounds good to me, going to use the name RewriteExecutionContext just to emphasize ... maybe we can drop rewrite though ... 
execution context just felt to generic to me", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Whatever name you think is appropriate.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Moved it closer to the execute methods", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Too bad checkstyle can't check the logging string formats too :(", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "changed this to also use new container classes, so none of these args exist now", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: just `FileGroup`? 
Should be shorter and still descriptive enough.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Should it return the interface?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: `xxxRewrite` -> `xxxRewrites` or `xxxRewriteIDs` or `rewrittenGroupIDs`?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: What about putting vars on different lines?\r\n\r\n```\r\n Set<DataFile> addedFiles = withJobGroupInfo(\r\n newJobGroupInfo(\"REWRITE-DATA-FILES\", desc),\r\n () -> strategy.rewriteFiles(groupID, fileGroupForRewrite.files()));\r\n```", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: looks like this name does not match what we have in `doExecuteWithPartialProgress`.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: Shall this be placed into a separate method? Can we also format it similar to what we do in `ThreadPools`?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: I think it would fit on one line if it was called `fileGroup`.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Shall we overload `foreach` to also accept `Stream`? 
We support a number of things there.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: What about formatting like [here](https://github.com/apache/iceberg/blob/master/aws/src/main/java/org/apache/iceberg/aws/s3/S3OutputStream.java#L94)?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "If we had a method for constructing a rewrite service, we could use it here too.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Shall we advise folks to consider enabling partial progress like we discussed?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: Shall we move this up into a var like we have in `BaseRewriteDataFilesSpark3Action`?", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/rewrite/Spark3BinPackStrategy.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Do we have to wrap this into a runtime exception or is it just enough to rethrow? We add an error message in doExecute.", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/rewrite/Spark3BinPackStrategy.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Who is responsible for cleaning up resources? It seems we do that in two places: in strategy and in the action. 
I think we will have to eventually either delegate more to the strategy interface or create a rewriter interface that would be responsible for rewriting, committing, aborting.\r\n\r\nRight now both Spark 3 action as well as as Spark 3 bin packing strategy interact with the task set manger and commit coordinator. Ideally, I'd have one just one entity doing that.", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/rewrite/Spark3BinPackStrategy.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: `fileGroup`?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: `})` on another line?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: arg formatting", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: `currentGlobalIndex`, `currentPartitionIndex`?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: tableUUID(Table table)", "path": "spark3/src/main/java/org/apache/iceberg/spark/FileScanTaskSetManager.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: `fetchSetIDs`?", "path": "spark3/src/main/java/org/apache/iceberg/spark/FileScanTaskSetManager.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I think we can ignore this for now and wait for more feedback.", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/rewrite/Spark3BinPackStrategy.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Sure? 
I don't think it matters as long as it follows the parent.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I need to add that to the \"commit\" portion. Here I'm not sure it's the right decision. This is where we actually had an error writing (or reading) and not it committing.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Added additional message here on commit failure with a ValidationException", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I didn't like how it looks like\r\n```java\r\n})\r\n.run\r\n```\r\nBut I can switch it back", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "switched to those - current so the \"new FileGroup\" fits on one line", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Currently the difficulty here is with our commit process, in both cases only the Action knows whether or not it has enough information to do the commit. We could put the commit api into the Strategy as well like how i've setup the action at the moment but that means changing the strategy API again and including more group tracking into the strategy.\r\n\r\nFor example, in this case the Strategy has the rewriteCoord + manager but other implementations may not require this. 
A Strategy without a cache ", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/rewrite/Spark3BinPackStrategy.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I think we probably would wrap this whole thing into a new api, and make it a transient member of a strategy or something like that? Otherwise I think we have serializability errors ...", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/rewrite/Spark3BinPackStrategy.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "It's annoying because \"rewriteFiles\" will be strategy specific and framework specific but all the other methods should just be framework specific", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/rewrite/Spark3BinPackStrategy.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "There are two situations we have to be careful with.\r\n\r\nFirst, the result file size may be a little bit bigger than our target file size. We don't want to cut another file with just a couple of MBs. It is better to write a slightly bigger file, especially because we have the max file size threshold.\r\n\r\nSecond, I have seen a number of use cases when the file size after compaction is substantially larger than what we anticipated. I suspect it is related to not being able to apply a specific encodi", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/Spark3BinPackStrategy.java", "line": 72, "type": "inline"}, {"author": "RussellSpitzer", "body": "At least on Parquet, differences in compression and encoding seem to be issues here. 
@aokolnychyi has more info but one of the hypothesis was that smaller files used dictionary encoding while larger files did not.\r\n\r\nMost of the experience with this is from production use-cases with users with large numbers of small files.", "path": "core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "Overall looks good to me. I would like to revisit the RewriteStrategy idea a bit. Because we are basically going to rewrite and remove all the delete files in this action along the way, this is what I see as the method for running major compaction.\r\n\r\nTherefore, I think we should have more configurations related delete files. For example, even if a scan task does not reach a minimum length requirement, if there are many delete files associated, we can still include that to cleanup all the delete", "path": null, "line": null, "type": "review_body"}, {"author": "jackye1995", "body": "nit: this method is not used anywhere else and only 3 lines, I think we can just put it in the planFIleGroups method", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "nit: unnecessary newline", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "nit: info()", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "Why not add some configurable retry for each file group?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": 202, "type": "inline"}, {"author": "jackye1995", "body": "specId is never assigned", "path": "core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java", "line": null, 
"type": "inline"}, {"author": "jackye1995", "body": "for all java doc to reference variables, you need `{@link #xxxxx}`, such as `{@link #MIN_INPUT_FILES}`", "path": "core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "We are setting the bin pack strategy as the default strategy, but there is not really a way to overwrite it. So I think either we should call this action `BinPackDataFilesSpark3Action`, or we should load this by strategy name from options. \r\n\r\nAnd either way, this does not sound like a `default` to me anymore, should we just call this method `strategy()`?", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSpark3Action.java", "line": 35, "type": "inline"}, {"author": "RussellSpitzer", "body": "Yeah this is a bit behind some of our other discussions on this, In my local copy this no longer exists. @rdblue , @aokolnychyi and I discussed passing strategies, especially those with more complicated args which is why RewriteDataFiles now has binPack() as a method.\r\n\r\nIn my local version with binPack(), sort() and sort(SortOrder sortOrder) call on methods here of either\r\n```java\r\nbinPackStrategy()\r\n//or\r\nsortStrategy() \r\n```\r\nAnd then modify the resultant strategy with their options.", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSpark3Action.java", "line": 35, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: extra line", "path": "core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I think the first sentence should be little bit more clear that we estimate the tasks correctly and we hope the generated file (not task) will be less than or equal to our target file size...", "path": "core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java", "line": null, "type": "inline"}, {"author": 
"aokolnychyi", "body": "We have seen this quite a lot when tiny Parquet files are compacted into larger ones as it changes the encoding on many columns. In most cases, the actual file size is bigger than what we estimated. \r\n\r\nI am not sure about Avro. Cases where the estimation is precise enough should work as expected. The main cases we are trying to avoid is splitting 514 MB into 512 and 2 MB files and writing 1 GB files when the target is 512 MB.\r\n\r\nThe ideal solution is to know how much the remaining rows are goin", "path": "core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: Do we need the explicit `this` here?", "path": "core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I am afraid we cannot do much in that case", "path": "core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: shall we move `);` to the line above?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: formatting", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Could be simply `groupStream` now?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I think a comment would help other folks to understand 10 mins. 
I suppose we leave 10 mins once the rewrite is done for remaining commits.\r\n\r\nWe could also check the return value of `awaitTermination` and log a warning if we hit the timeout.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: We should add spaces at the end of these lines", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I think we better move this validation to `TableScanUtil` and also validate other args such as `lookback`.", "path": "spark3/src/main/java/org/apache/iceberg/spark/source/SparkFilesScan.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Thanks, I didn't know that!", "path": "core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Rewrote the documentation here to hopefully make it more clear", "path": "core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "This is resolved with our \"writeMaxFileSize\" parameter which changes our write target file size", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/Spark3BinPackStrategy.java", "line": 72, "type": "inline"}, {"author": "RussellSpitzer", "body": "just trying to sneak my preferred indentation in :)", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "My feeling on this was that if the rewrite fails we are basically having a very serious issue since we are just trying to do a read.write operation in the underlying framework. 
If we want to do another attempt (and partial progress is enabled) a user can just rerun the operation and it would ignore completed binpacks. That said I don't have a strong opinion here but I would lean towards less configuration options if possible.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": 202, "type": "inline"}, {"author": "RussellSpitzer", "body": "I think I could do warn, but \"info\" implies expected to me, and this is not an expected behavior. Imho", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "This is the style used in our other executor creation?\r\nI'm not sure what the preferred formatting is here - for example from ThreadPools.java\r\n\r\n``` private static final ExecutorService WORKER_POOL = MoreExecutors.getExitingExecutorService(\r\n (ThreadPoolExecutor) Executors.newFixedThreadPool(\r\n WORKER_THREAD_POOL_SIZE,\r\n new ThreadFactoryBuilder()\r\n .setDaemon(true)\r\n .setNameFormat(\"iceberg-worker-pool-%d\")\r\n .build()));\r\n```", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I think it is converging, I had mostly cosmetic things. It would be great if more folks could take a look.\r\n\r\nThe only part I am not sure is point 3 in my earlier [comment](https://github.com/apache/iceberg/pull/2591#issuecomment-858912406).", "path": null, "line": null, "type": "review_body"}, {"author": "aokolnychyi", "body": "nit: `... by the rewrite action or strategy`?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: I kept going back to the definition to find out what `e.getKey()` and `e.getValue()` means. 
Maybe, it will be a bit easier to read if we created separate vars with proper names.\r\n\r\nI'd also add a var for `new FileGroupInfo` as that line is quite long.\r\n\r\n```\r\nreturn fileGroupsByPartition.entrySet().stream()\r\n .flatMap(e -> {\r\n StructLike partition = e.getKey();\r\n List<List<FileScanTask>> groups = e.getValue();\r\n return groups.stream().map(tasks -> {\r\n ...\r\n F", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: extra space after `=`", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "super nit: I'd also put `ThreadFactoryBuilder` on a new line like you did in `rewriteService`.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: we sometimes use `xxxID` and sometimes `xxxId` in this file.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: If we called `completedRewriteIDs` as `rewrittenIDs` like in `doExecute`, the condition would fit on one line.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Does this mean groups that failed will not appear in the result of the action?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: `Ignoring a failure during ....`?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": 
"aokolnychyi", "body": "nit: I think this can be called `fileGroup` or `group`.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: I'd consider the construction of `FileGroupRewriteResult` and adding it to the result map as a single logical operation. So if we are to add an empty line, I'd add it after `offer` call instead of this one.\r\n\r\n```\r\nrewrittenIDs.offer(groupID);\r\n\r\nFileGroupRewriteResult fileGroupResult = ...\r\nresults.put(groupID, ...);\r\n```", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Can we move `withJobGroupInfo` to the line above and keep `newJobGroupInfo` on this one?\r\n\r\n```\r\n Set<DataFile> addedFiles = withJobGroupInfo(\r\n newJobGroupInfo(\"REWRITE-DATA-FILES\", ...),\r\n () -> ...);\r\n```", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: double `the`?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: Can fit on one line?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Looks like this is missing spaces at the end of files.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Do we need to catch it to immediately rethrow?", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Missing 
spaces at the end of lines.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Do we have to locally sort data during bin-packing?", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/Spark3BinPackStrategy.java", "line": 77, "type": "inline"}, {"author": "aokolnychyi", "body": "The current one makes sense.", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Changed this around a bit. Now we only handle FileGroups within this logic", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Remove this because we already added it in another pr", "path": "core/src/main/java/org/apache/iceberg/util/TableScanUtil.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "We are pulling this because we removed \"groupID\" state and put it into the strategy implementations", "path": "core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java", "line": 63, "type": "inline"}, {"author": "RussellSpitzer", "body": "True we don't need that", "path": "spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I didn't think we had this implemented yet? 
But I can add in a sortWithinPartitions here", "path": "spark3/src/main/java/org/apache/iceberg/spark/actions/Spark3BinPackStrategy.java", "line": 77, "type": "inline"}, {"author": "aokolnychyi", "body": "Do we need to override `equals` and `hashCode` as this is used in the result map?", "path": "core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesFileGroupInfo.java", "line": 25, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: redundant `this`?", "path": "core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesFileGroupRewriteResult.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: redundant `this`?", "path": "core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesFileGroupRewriteResult.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "What if we import `FileGroupInfo` and `FileGroupRewriteResult` directly? Then this can fit on one line.", "path": "core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesResult.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "This will be also way shorter with direct imports.", "path": "core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesResult.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I think `xxxUtil` usually means the class is stateless. Shall we come up with a different name?", "path": "core/src/main/java/org/apache/iceberg/actions/RewriteDataFilesCommitUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "It is odd to have a class that is a `Util`. Most of the time that is used to name classes that carry static methods, rather than helpers. 
Is there a better name for this?", "path": "core/src/main/java/org/apache/iceberg/actions/RewriteDataFilesCommitUtil.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: I think having `collect` on a new line would be cleaner.", "path": "core/src/main/java/org/apache/iceberg/actions/RewriteDataFilesCommitUtil.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: I think having `collect` on a new line would be cleaner.", "path": "core/src/main/java/org/apache/iceberg/actions/RewriteDataFilesCommitUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I find this use of streams awkward. I doubt performance is a problem, but it iterates over the groups twice and creates a bunch of sub-streams. I usually find it more straightforward to write a simple for loop.", "path": "core/src/main/java/org/apache/iceberg/actions/RewriteDataFilesCommitUtil.java", "line": null, "type": "inline"}], "11144": [{"author": "stevenzwu", "body": "should this be marked as `@Internal` or even package private?", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: `newRetainLast` -> `numSnapshots`", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: `retainLast` -> `minNumSnapshots`", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "the name of `minAge` seems incorrect to me. it is more like snapshot max retention window. 
so maybe `maxSnapshotAge` is more accurate?", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "should we design it like `RemoveSnapshots`? one benefit is to use the default `ThreadPools.getWorkerPool()/getDeleteWorkerPool()` to reuse thread pools in the JVM.\r\n\r\n```\r\n public ExpireSnapshots executeDeleteWith(ExecutorService executorService) \r\n\r\n public ExpireSnapshots planWith(ExecutorService executorService) \r\n```", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "these thread pool size defaults are hardcoded. We should probably use the default `ThreadPools` as mentioned in another comment below, which have reasonable defaults\r\n\r\n```\r\n public static final ConfigEntry<Integer> WORKER_THREAD_POOL_SIZE =\r\n new ConfigEntry<>(\r\n \"iceberg.worker.num-threads\",\r\n \"ICEBERG_WORKER_NUM_THREADS\",\r\n Math.max(2, Runtime.getRuntime().availableProcessors()),\r\n Integer::parseUnsignedInt);\r\n\r\n /**\r\n * Sets the size of the delete", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "typo: scheduler -> schedule", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "typo: scheduler -> schedule", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "is `scheduleOnInterval` more informative?", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": 
"inline"}, {"author": "stevenzwu", "body": "in v2 sink PR, we have changed this to `uidSuffix` to match how the framework names its internal operators (like writer, committer).\r\n\r\nhttps://github.com/apache/iceberg/pull/10179", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "is this an `id` or `index`?", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "since these are package private methods, not sure if `@Internal` is needed.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "also, do we just make these methods protected scope only?", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: this is not a setter. args can be named as `maintenanceTaskId` and `maintenanceTaskName`", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "it is uncommon to have a large number of args for `build()` method. But I don't have a concrete suggestion. ", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: should this be named as `forTable(...)`? 
Iceberg commonly used `Builder forSth` style.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "same comment on `uidSuffix`", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "seems that this is used by both monitor source and trigger manager. do they always have the same rate limit configs?", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "extract line 225-240 to a separate method to reduce the complexity of this method", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: maybe `lockCheckDelay` to be consistent with the other name of `lockCheckDelayMs`?", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: `{@link MonitorSource}` can be changed to `TableMaintenance#builder(...)`, which was user facing", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "why is it called `WindowClosingWatermarkStrategy`? didn't quite see the `window` part. this looks like a form of punctuated watermark strategy. 
\r\n\r\nhttps://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/event-time/generating_watermarks/#writing-a-punctuated-watermarkgenerator", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I don't know if we need to put this class under the `stream` sub package. I am also not sure if we need the `operator` sub package. \r\n\r\nshould we just keep everything under `maintenance` package? this way, we can also probably get rid of the `@Internal` tags for some classes that can be made package private.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "can this be just `filter`? also to improve readability, we can probably use lambda here and remove the `TaskFilter` class.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: remove the comment. seems redundant with the comment on line 242. if we keep it, it is probably also better placed before line 250", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "this should be unique across all supported maintenance tasks. `delete-stream` doesn't seem appropriate here. is this maintenance task the only one that generates file deletes? 
it probably should be sth like `expire-snapshots-stream` or `expire-snapshots-file-deletes-stream`?", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/ExpireSnapshotsProcessor.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "we should use the batch delete API from `SupportsBulkOperations`", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/AsyncDeleteFiles.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "do we still need async with the batch/bulk delete API (as suggested in another comment)?", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "we should probably at least log how many files are emitted for deletion. \r\n\r\nideally, the number of files emitted for deletion should actually be part of the `TaskResult`", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/ExpireSnapshotsProcessor.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Do we want to allow the users to create their own maintenance tasks?", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "`retainLast` is used in Spark and the generic implementation. I would probably stick to the \"known\" name ", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "`retainLast` is used in Spark and the generic implementation. 
I would probably stick to the \"known\" name", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Are you suggesting to do everything in the operator instead of separating out the delete to another operator?\r\n\r\nThe Spark implementation even split the expired file calculations to multiple operators for performance reasons. In the long run we might go down that road... WDYT?", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Doesn't really make sense to check the table more frequently than the scheduling rate limit. So I think they should be the same.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "I would like to separate out the public and the internal objects at package level.\r\nMy plan is that everything in \"stream\" package is public, and everything in \"operator\" package is private/internal. 
I think that is easier to understand for the users.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "If we want to use the batch delete feature then we need state handling too to store the not yet deleted files.\r\nAlso theoretically not all FileIO implements the `SupportsBulkOperations` interface.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/AsyncDeleteFiles.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Changed one of the `MonitorSource` instances, but kept the other.\r\nIf you have better suggestion for the javadoc, please suggest.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Updated to set the system default sizes", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "renamed to `append`", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "Let's keep this package private first so that it is easier to evolve the class especially during early stages.\r\n\r\nwhen there is real need in the future, we can always make it public then. ", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I didn't mean change the method name, just the arg name. 
here is an example from iceberg-core\r\n```\r\nExpireSnapshots retainLast(int numSnapshots);\r\n```", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "Also see my comment above for `retainLast(int numSnapshots);`. \r\n\r\nHere is an example from `RemoveSnapshots`. the variable name is called `defaultMinNumSnapshots`. `Integer retainLast` is not very descriptive.\r\n```\r\n @Override\r\n public ExpireSnapshots retainLast(int numSnapshots) {\r\n Preconditions.checkArgument(\r\n 1 <= numSnapshots,\r\n \"Number of snapshots to retain must be at least 1, cannot be: %s\",\r\n numSnapshots);\r\n this.defaultMinNumSnapshots = numSnapshots;\r\n ", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "> Are you suggesting to do everything in the operator instead of separating out the delete to another operator?\r\n\r\nNope, that is not what I meant.\r\n\r\nI was wondering if `ExpireSnapshotsProcessor` and `AsyncDeleteFiles` should use the default shared thread pools from `ThreadPools.getWorkerPool()/getDeleteWorkerPool()`, instead of creating new pools.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "what does `main` mean here?", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I don't know if users would interpret `stream` sub package as public APIs. It is better to use proper Java class scope for that purpose. 
public classes are public and non-public classes can be package private.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "Most `FileIO` implementations support the bulk deletes. We should definitely leverage the bulks deletion when `FileIO` supports it. It makes a **huge** difference.\r\n\r\n<img width=\"387\" alt=\"image\" src=\"https://github.com/user-attachments/assets/f192454c-40b5-4dd5-a556-a3cac92adc80\">\r\n\r\nSpark also does the check on bulk support. See `DeleteOrphanFilesSparkAction`.\r\n```\r\n if (deleteFunc == null && table.io() instanceof SupportsBulkOperations) {\r\n deleteFiles((SupportsBulkOperations) table.i", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/AsyncDeleteFiles.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I am wondering if we still need async if leveraging bulk deletions, which are available for all major implementations Azure, GCP, Hadoop, S3.\r\n\r\nI don't know we need to store state. during the checkpoint, the operator can flush all pending files and submit a synchronous bulk deletion call. This is like the file flushing for `IcebergStreamWriter#prepareSnapshotPreBarrier`.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/AsyncDeleteFiles.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "I don't like the idea of shared pools. If some users create multiple maintenance flows in a single job, they will end up using the same pool, and will block each-other. 
", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "If the FileIO doesn't support bulk deletes, we need to do the async to finish deleting in a timely manner.\r\n\r\nDo we create different job graph for different FileIO, or just don't support FileIO if it doesn't support bulk deletes?\r\n\r\nIt is usually not a good practice to block a checkpoint with long running tasks. I'm a bit concerned because of the delay caused by executing the deletes on checkpoint.\r\n\r\nBTW HadoopFileIO does the bulk delete with a thread pool behind the scenes - so it does the same", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/AsyncDeleteFiles.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "There is a possibility to inherit the suffix and the slotSharingGroup from the TableMaintenance.Builder. It could be overwritten on task-by-task basis.\r\n\r\nThe `main` here is the value inherited from the `TableMaintenance.Builder`", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Let's mark it experimental then. I definitely would like to provide a way for the users to extend the maintenance tasks.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "I think our practice to put high number of files to a single package, and mixing the public API with the implementation details makes the development of the Iceberg connector often confusing. 
It is hard to understand what is a public API, and what should only be used internally.\r\nAlso there will be a very high number of operators needed for the implementation of the other maintenance tasks.\r\n\r\nI agree that the stream might not be the best name, maybe `api` would be better, but I would try to separate", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Also, if you have multiple slots on a single TaskManager, then these pools are shared between the subtasks. Which is again not something we want", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "this is where I meant if we should use the shared thread pool here.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/AsyncDeleteFiles.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "this can be marked as `@Internal` or package private if we can agree to remove the sub packages", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/ExpireSnapshotsProcessor.java", "line": 46, "type": "inline"}, {"author": "stevenzwu", "body": "maybe we should add Javadoc to the `ExpireSnapshots` class that expired files are always deleted", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/ExpireSnapshotsProcessor.java", "line": 103, "type": "inline"}, {"author": "stevenzwu", "body": "error level here?\r\n\r\nnit: error msg typically starts with sth like `Failed to ` or `Cannot `", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/ExpireSnapshotsProcessor.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "`TaskResult` has ` List<Exception> exceptions`. 
wondering what scenario would we have a list of exceptions to propagate?", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/ExpireSnapshotsProcessor.java", "line": 116, "type": "inline"}, {"author": "stevenzwu", "body": "I am wondering if we need this timeout. FileIO typically has a timeout config internally. If we are configuring another timeout here, we may have inconsistency. E.g., a lower value here would cause premature declaration of failure. should we just rely on the underneath FileIO timeouts.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "maybe add some comment to explain multiplier choice of `1.5`. more common choice is `2`.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "It is unclear what `ES` means. please use full name. also we can probably set the operator name and uid the same. right now, name is set to this constant. at least, it should include the `uidSuffix`.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: `ignore the results` can be clarified. maybe sth like this\r\n```\r\nIgnore the async file deletion boolean result and return the DataStream<TaskResult> directly\r\n```", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "parallelism is non-primitive `Integer` and can be null. 
probably only call `setParallelism` if not null.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: this seems more like an operator name, not maintenance task name. again, would be simpler to just set the operator uid and name the same. that is what we do in the v2 sink", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: more direct. either `The snapshots newer than this age will be retained` or `The snapshots older than this age will be removed`", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "remove the `@return` line. seems obvious to document it. this applies to all builder methods", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: should be `The minimum number of ... to retain`.\r\n\r\nalso use import for the linked class?", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: maybe `deleteRetryNum` to be more clear it is retry. kind of like `COMMIT_NUM_RETRIES` from `TableProperties`", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "this assertion condition doesn't seem correct. it should assert on the input arg `mainParallelism`.\r\n\r\nalso should `mainParallelism` be non-primitive `Integer`? 
if yes, we should only need to assert two conditions of `mainParallelism == null || mainParallelism > 0`.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "wondering if we need both `uidSuffix` and `mainUidSuffix`. can we just use `mainUidSuffix`?", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I am not saying we shouldn't. but it is usually good to keep them private first so that we are free to evolve the class. Maybe wait until the need is clear.\r\n\r\nUse Spark as an example. the `BaseSparkAction` is package private.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "> It is usually not a good practice to block a checkpoint with long running tasks. I'm a bit concerned because of the delay caused by executing the deletes on checkpoint.\r\n\r\nI see no difference with `IcebergStreamWriter#prepareSnapshotPreBarrier`, which does the file flush and upload in the synchronous part of checkpoint.\r\n\r\n> I agree that the other FileIO implementations could be faster with the bulk delete\r\n\r\nBulk deletes are not only a lot faster. 
they can also avoid potential throttling from", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/AsyncDeleteFiles.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "> It is hard to understand what is a public API, and what should only be used internally.\r\n\r\nIdeally, this should be achieved via Java scoping (public, package private, private etc.)\r\n\r\n> I agree that the stream might not be the best name, maybe api would be better, but I would try to separate out the classes used by the user from the classes only used by the implementation.\r\n\r\nThis is not the tradition/style that Iceberg connectors (Spark, Flink) have been following. If we are going to have `api`", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: is this more clear?\r\n\r\n```\r\nUse this for standalone maintenance job. It creates a monitor source that detect table changes and build the maintenance pipelines afterwards.\r\n```", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "is `1 ms` a good default?", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "is `30s` a good default? 
is that based on the estimated average of task run time?", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: global -> default?", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "again, might be good to make name and uid the same as we did for v2 sink", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "do we need `global` with the `forceNonParallel` below?", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "can we use `GenericAppenderHelper` for data insertions? We can avoid the initialization of Flink mini cluster and Flink SQL executions. \r\n\r\nLooking at the tests, we don't even actually need an Iceberg table here. just some plain tmp directory would probably be good enough. don't even need table loader and table creation.\r\n\r\ntrying to make tests faster :)", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestAsyncDeleteFiles.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "where is the `wait` part?", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestAsyncDeleteFiles.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I found `deleteDummyFileAndWait` a bit confusing. maybe just `deleteFileAndWait`?", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestAsyncDeleteFiles.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "what does `wrong` mean here? 
with invalid file scheme in the URI? if yes, let's call it `testInvalidURIScheme`.", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestAsyncDeleteFiles.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "is `AsyncDeleteFiles.FAILED_PREDICATE` only used here? if yes, why it is defined in the `AsyncDeleteFiles` class?", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestAsyncDeleteFiles.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I don't quite understand this while loop with the await on the same condition as the while-loop", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestAsyncDeleteFiles.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: remove unnecessary empty line", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestExpireSnapshotsProcessor.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "not sure if we hav discussed this before. `SerializableTable` is included in the `Trigger` record. what does the downstream operators use it for? `MaintenanceTaskBuilder#append` always require `TableLoader`.\r\n\r\n This can potentially be a little problem with Iceberg version upgrade with unaligned checkpoints, if Iceberg changed the table serialization. I guess we need to advise users to use savepoint for Iceberg version upgrade?", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestExpireSnapshotsProcessor.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "Is this change just piggyback?", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestLockRemover.java", "line": 146, "type": "inline"}, {"author": "stevenzwu", "body": "what does `for convenience reasons` mean here? 
if it just means that it is refreshed inside, I feel we can remove this part.", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/ScheduledBuilderTestBase.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "does `checkSideEffect` mean `waitForCondition`?", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/ScheduledBuilderTestBase.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: this comment is probably better placed right before line 89", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/TestExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I probably missed sth here. we had a successful run at line 130. why both counters are zero. there is no expired files deleted?", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/TestExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "uidPrefix -> uidSuffix", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/OperatorTestBase.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "should this class be called `MaintenanceTaskTestBase`?", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/ScheduledBuilderTestBase.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "tableLoader is not closed", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/TestTableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "testForTable?", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/TestTableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "why is `PROCESSED` a static class member? 
can't it be a local variable for each method", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/TestTableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "why not use `GenericAppenderHelper`, which has been extensively used for the purpose of inserting records to a table?", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/TestTableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "why don't we add lock assertion in other tests and remove this method?", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/TestTableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "similarly, can metrics assertion be added to one of earlier methods?", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/TestTableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "do we really need the flexibility of `uidSuffix` overwrite?", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/TestTableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "not clear about what this comment trying to explain. what is `something`? 
does `scheduler` mean trigger manager?", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/TestTableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "same comment for `something` and `scheduled`.", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/TestTableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: `ForMonitor` -> `ForMonitorSource`", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/TestTableMaintenance.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "This API is a single `abstract DataStream<TaskResult> append(DataStream<Trigger> sourceStream);` method.\r\nI don't expect big changes here.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Why would be better to have it in protected scope? I don't think we would like to expose them at all.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "My current implementation contains 50 classes - I would prefer to separate them in some way", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "This is the javadoc for the prepareSnapshotPreBarrier:\r\n```\r\nImportant: This method should not be used for any actual state snapshot logic, because it will inherently be within the synchronous part of the operator's checkpoint. 
If heavy work is done within this method, it will affect latency and downstream checkpoint alignments.\r\n```\r\n\r\nDeleting high number of files could be quite heavy even for bulk deletes.\r\n\r\nSo I switched to the bulk delete implementation and added a `deleteBatchSize` to lim", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/AsyncDeleteFiles.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Added this:\r\n```\r\n/** Deletes expired snapshots and the corresponding files. */\r\n```", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/ExpireSnapshotsProcessor.java", "line": 103, "type": "inline"}, {"author": "pvary", "body": "Other maintenance tasks might have multiple errors from multiple operators/subtasks", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/ExpireSnapshotsProcessor.java", "line": 116, "type": "inline"}, {"author": "pvary", "body": "AsyncDelete used it.\r\nOutdated", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "AsyncDelete used it.\r\nOutdated", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "The name of the executor is shown on the Flink UI.\r\nHaving a full name there will cause a cluttered UI. 
I suggest to revisit the exact naming once we see the whole job graph", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Done - updated the actual comment to reflect the Asyc -> batch delete changes", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Let me think about this - The question is what should be the default parallelism for the entire TableMaintenance job, if the parallelism is not set....", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "renamed to `DELETE_FILES_OPERATOR_NAME` and used it to generate the uid", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Done - even though I do not agree", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Outdated by Async -> Batch update", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "We need to have different ways to clean the MonitorSource state and the maintenance task state.\r\nAlso the user might want to clean the state of a specific maintenance task, but might want to keep the state of another maintenance task.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Good catch - remained from a different config.\r\nSet it to 1 min.\r\nWDYT?", "path": 
"flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "I would not set it based on the estimated average task run time, as if the actual run time is longer only by a bit, then we will wait for 2 times the required time.\r\n\r\n30s seems reasonable for the JDBC lock manager", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "We need the `tableLoader` for the tests", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestAsyncDeleteFiles.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Outdated because of the Async -> Bulk delete change", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestAsyncDeleteFiles.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "removed the global", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Outdated because of the Async -> Bulk delete change", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestAsyncDeleteFiles.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Outdated because of the Async -> Bulk delete change", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestAsyncDeleteFiles.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Outdated because of the Async -> Bulk delete change", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestAsyncDeleteFiles.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "We need a Table which is earlier than the element timestamp.\r\n\r\n> I guess we need to advise users to use 
savepoint for Iceberg version upgrade?\r\n\r\nOr simply drop the state of the tablemanager", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestExpireSnapshotsProcessor.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Yeah... sorry for the confusion", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestLockRemover.java", "line": 146, "type": "inline"}, {"author": "pvary", "body": "I hate unexpected side-effects, and it felt prudent to highlight this here", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/ScheduledBuilderTestBase.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "The failure happened before the delete file operation. The counters are not affected. The failure counter will increased by `LockManager` down the pipeline", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/TestExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "This is the same for several other existing test cases. I don't want to complicate the test with unnecessary cleanup. The `HadoopTableLoader.close()` is empty BTW", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/TestTableMaintenance.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "The `PROCESSED` is set in the `DummyMaintenanceTask.map` method", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/TestTableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I thought these getters are needed by child classes extended from this one. then the protected scope is needed?\r\n\r\nBut as discussed in another comment, if this class is not exposed as public. 
then package private would be fine too.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "not just `append`. there are a lot of public methods in this class.\r\n\r\nIf we have any doubt if users would implement extensions from this class, we can delay the decision until real ask came forward. it is trivial to make a private class public. But once a class is public, it is more difficult to change/evolve the contract.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "sure. it is fine to have sub packages. we will just continue to use the `@Internal` annotation as we have been doing.\r\n\r\nbut I don't know if we have a convention to use `api` subpackage in connectors like Flink, Kafka, Spark modules.", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/TableMaintenance.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "hmm. I am still not quite following. Each operator subtask emits a `TaskResult`. Each `TaskResult` should only contain one exception, right?\r\n\r\nI didn't see the exceptions are used by downstream. if `success` boolean flag good enough for downstream, maybe we can remove the exceptions from `TaskResult` as stack trace can be non-trivial.\r\n\r\nBTW, `TaskResult` is not marked as `Serializable`. ", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/ExpireSnapshotsProcessor.java", "line": 116, "type": "inline"}, {"author": "stevenzwu", "body": "sharing thread pool is not necessarily a bad thing. it can limit the concurrent I/O. E.g., we may not want to have too many threads perform scan planing, which can be memory intensive.\r\n\r\nDeletes have low memory footprint. 
Hence it is probably less of a concern to have separate pools. but probably good to keep an eye on the number of http connections", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/ExpireSnapshots.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "there is really no fundamental difference whether we use `prepareSnapshotPreBarrier` or not. if we use the bulk delete API, the check barrier can come right after a bulk delete request with or without `prepareSnapshotPreBarrier`. The main thing is that batch delete requests have higher latency. \r\n\r\nAsync model can increase the throughput with single file deletion. but it is overall inefficient and more expensive. It also can't avoid the potential throttling problem.\r\n\r\nThe change seems not pushe", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/AsyncDeleteFiles.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Moved to `GenericAppenderHelper` - this caused a bit of refactor in older tests as well", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/TestTableMaintenance.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "I prefer to separate out testing the different features", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/TestTableMaintenance.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "If we want to keep the MonitorSource state, but drop some or one of the maintenance tasks state, then we need a different uid", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/TestTableMaintenance.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Updated the comment", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/TestTableMaintenance.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Updated the 
comment", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/stream/TestTableMaintenance.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "The public methods are public API anyways...\r\nThese are the ones which will be called by the users when they are scheduling the MaintenanceTasks...", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/stream/MaintenanceTaskBuilder.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "When we do compaction, then we have multiple subtasks running parallel. If more than one of them fails, then we will have multiple exception messages to report, but we don't want to fail the job (especially in PostCommitTopology).", "path": "flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/ExpireSnapshotsProcessor.java", "line": 116, "type": "inline"}, {"author": "pvary", "body": "Ok.. found a way to get the `tableLoader`.\r\nMoved all tests to us the `GenericAppenderHelper`", "path": "flink/v1.20/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestAsyncDeleteFiles.java", "line": null, "type": "inline"}], "1587": [{"author": "rymurr", "body": "not sure if this is the best way to get hold of a directory to write tables into. Anyone have any suggestions?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "The nessie specific tests all modify spark settings and reset the settings at the end. This is to interfere as little as possible w/ the 'normal' iceberg path.", "path": "spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceNessieTables.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "We identify Nessie as the core catalog/source when there are specific parameters available on the classpath or hadoop config. 
The idea here is to be fully backwards compatible w/ Hive and Hadoop catalogs.", "path": "spark2/src/main/java/org/apache/iceberg/spark/source/IcebergSource.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "All Nessie tests are run in their own branch to not interfere with parallel test execution", "path": "spark3/src/test/java/org/apache/iceberg/spark/SparkCatalogTestBase.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "The concept of a namespace is implicit in Nessie and are therefore not managed through the normal `SupportsNamespaces` interface. We skip tests of this interface when the catalog is a `NessieCatalog`.", "path": "spark3/src/test/java/org/apache/iceberg/spark/sql/TestNamespaceSQL.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "We do not extend `SupportsNamespaces` as a Nessie object store supports the concept of namespaces implicitly. A Nessie namespace can be arbitrarily deep but is not explicitly created or stored. Similar to empty folders in git.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What's happening in the Nessie plugin?", "path": "build.gradle", "line": 1064, "type": "inline"}, {"author": "rdblue", "body": "From the comments in the Iceberg sync, it sounds like this is running a stand-alone Nessie server? Is that something we could handle like the current Hive MetaStore tests, where each test suite creates a new metastore and tears it down after the suite runs?", "path": "build.gradle", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Should be fine, but I think the trade-off is that you won't be able to list namespaces in a namespace. 
It will be harder to find the namespaces themselves.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "In general, I would discourage depending so heavily on Hadoop `Configuration`. Spark and Flink have a way to pass catalog-specific options, which is the best way to configure catalogs.\r\n\r\nThere is some discussion about this in #1640. I think that catalogs should primarily depend on config passed in a string map, and should only use Hadoop `Configuration` when dependencies (like `HadoopFileIO` or `HiveClient`) require it.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Looks like this will return all tables underneath the given namespace, even if they are nested in other namespaces?\r\n\r\nI haven't tested this in spark, does it work as expected?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Probably shouldn't use `RuntimeException` here. How about `NoSuchNamespaceException`?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Style: Most Iceberg error messages use the form `Cannot <some action>: <reason> (<workaround>)`. Consistency here tends to make at least Iceberg errors more readable and easy to consume.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this should throw `NoSuchTableException` if the existing metadata is not null because the table was deleted under the reference. 
You'll probably want to follow the same behavior as the Hive catalog.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Doesn't look like the format here is quite correct. Missing a space?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": 97, "type": "inline"}, {"author": "rdblue", "body": "Is this right for `NotFoundException`? Iceberg will retry failed commits.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "You can use the `DynFields` helpers to do this a bit more easily.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "We prefer using `AssertHelpers.assertThrows` so that state after the exception was thrown can be validated. For example, testing `catalog.createTable(invalid)` would not only check `ValidationException` but also verify that the table was not created.", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestParsedTableIdentifier.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This is probably an area to revisit. Right now, this is written to have minimal changes between 2.4.x and 3.0.x, but I think we will probably want to route all loading from here through a catalog. That will allow us to delegate all of this to Nessie or Hive the same way.", "path": "spark2/src/main/java/org/apache/iceberg/spark/source/IcebergSource.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Please have a look at #1640, I'd like to standardize how we do this. 
I do like using `type = nessie`, so we may want to have a lookup that points to the `NessieCatalog` implementation.", "path": "spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why was this needed?", "path": "spark3/src/test/java/org/apache/iceberg/spark/sql/TestCreateTableAsSelect.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "There are a lot of tests that need this. Should we separate the test cases into different suites?", "path": "spark3/src/test/java/org/apache/iceberg/spark/sql/TestNamespaceSQL.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "Quarkus (which is the underlying http framework) behaviour is slightly counterintuitive in that it doesn't offer an option to start Nessie like you can start the hive metastore. Hence we start it once per module and test suites are responsible for cleanup", "path": "build.gradle", "line": null, "type": "inline"}, {"author": "rymurr", "body": "It uses `quarkusAppRunnerConfig` dependencies to discover the Nessie Quarkus server and its dependencies then uses that to start a server. Some of the operations to discover all runtime dependencies are non-trivial and require a full gradle dependency graph, hence why its non-trivial to do in a test suite. I believe the primary reason for all this is to facilitate easily building graalvm native images.\r\n\r\nSee https://github.com/projectnessie/nessie/tree/main/tools/apprunner-gradle-plugin for the", "path": "build.gradle", "line": 1064, "type": "inline"}, {"author": "rymurr", "body": "I will take another pass at this today, I can see totally valid reasons to support listing namespaces if they have tables in them. 
The problem as I see it comes from creating or deleting namespaces, and storing namespace metadata.\r\n\r\n* create/delete: in Nessie (similar to git) a namespace would be created implicitly with the first table in that namespace tree and deleted with the last table in that namespace tree. Separate crerate/delete options in nessie are either no-ops or require a dummy to ", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "You are correct, it will return everythiing in and below `namespace`. What is the contract supposed to be? Only tables in this namespace? ", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "good eye, the first char of the `applicationId` is a newline. I've put no space between `commit` and `%s` to not have extra trailing whitespace in message.\r\n\r\nAlso note that the handling of commit messages in nessie is still fairly primitive. This should get replaced by a structured object in the near future.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": 97, "type": "inline"}, {"author": "rymurr", "body": "good eye, cleaned up exception message and handled throwing better", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "The way I was running in the test made it get deleted on the backend nessie server but not in the cached spark context I will clean this up as part of the Spark rework", "path": "spark3/src/test/java/org/apache/iceberg/spark/sql/TestCreateTableAsSelect.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "Sure, the hadoop catalog is also skipped for most of these. 
Makes sense to have separate tests", "path": "spark3/src/test/java/org/apache/iceberg/spark/sql/TestNamespaceSQL.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "I have cleaned this up a bit and tried to follow the pattern you suggested in #1640", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "Just checked and the contract is `Return all the identifiers under this namespace.` I took this to mean everything under this and all sub namespaces. If that was not the intention of the method I will fix the predicate.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "Having another look we could add valid impls for `namespaceExists` and `listNamespaces` and do no-op or throw for the others. Then the clients can still navigate namespaces. Thoughts?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "For Spark writes, we pass the application ID in through `snapshot.summary()`: https://github.com/apache/iceberg/blob/9af545ed56343b2fa09966166f8da3e7a24100d7/spark3/src/main/java/org/apache/iceberg/spark/source/SparkBatchWrite.java#L153", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "Ahh, thanks for that. Much cleaner this way. ", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "#1640 is in. It uses a no-arg constructor and adds an `initialize(String name, Map<String, String> config)` method to initialize and configure the catalog. I think you should be able to update this now.\r\n\r\nI'm hoping that this removes the need to make Spark and Flink depend on the new Nessie and Glue modules. 
We should make sure we have a test suite we can include here that uses Flink and Spark.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think that any configuration should come from the Hadoop `Configuration` unless it is used for a Hadoop component, like `HadoopFileIO`. Can you initialize this from the catalog config passed to `initialize`?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This error message could easily be incorrect because it doesn't use `CONF_NESSIE_REF` directly. It assumes the caller did.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Here as well, I don't think this should pull config from `Configuration`.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Looks like `reference` must never be null, correct?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think it would be helpful for `get` to have a better name for uses like this. What about `findReference` or `loadReference`?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "How about passing `ImmutableMap.of()` instead of `new HashMap<>()`? That avoids unnecessary object creation. Better yet, what about a version of this that doesn't need to pass a map if there isn't one?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "This is fixed now, apologies. 
I fixed the constructor but didn't remove the comment.\r\n\r\nI agree that the Catalog portion of Spark3 should work fine now w/o explicitly adding Nessie (or Glue etc). I believe we still need to update the `IcebergSource` to handle custom (Iceberg) catalogs right?\r\n\r\nIs the intention to add the new catalogs to the Iceberg shaded jar? ", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Is this assuming that the `NessieNotFoundException` is referring to the ref because the table was loaded just above? Or is that always used for a ref?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can't this refresh and complete the operation?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "In this case, just remove the purge. We do that in our catalog as well because we never delete data as a result of a user action. We garbage collect it later.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What is this referring to?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Util methods seem to be mixed in. I think it may help readability if these were at the bottom, or were static methods in a `NessieUtil` class.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "Hey @rdblue these parameters are initialised in the `initialize` method (I have moved the method up to near the constructor as it was hidden at the bottom of the class). The initialisation uses the passed `options` where possible and falls back to `Configuration` if not found. 
This is to make it compatible with Spark2/3 `IcebergSource`. However I am happy to remove this once `IcebergSource` supports custom catalogs (which I hope to tackle next if its not already being worked on)", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Is there a more specific name for this? It isn't clear what `catalog.getHash()` should be.\r\n\r\nAlso, style nit: we avoid using `get` where a more specific verb would add value.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Should we create a trait just for listing namespaces that are implicit?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": 216, "type": "inline"}, {"author": "rdblue", "body": "Ignore my comments above, since it looks like you've already added this. Can you merge this with `init` and the constructors?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Nessie URL? In other places, we configure the connection using `uri`.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This is the default ref, right? Tables can override it. I think that would be a better name if you can use this catalog to load other refs.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "If an update doesn't create a snapshot, then this method will return the app ID that committed the last snapshot. That may not be correct. Should we create a class in core to hold this information instead? 
Then we could set it somewhere in Spark and Flink so you'd always have identifiers without needing to resort to reflection?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": 132, "type": "inline"}, {"author": "rdblue", "body": "What about doing this delete in a `finally` block if `threw` is false? That's usually a better way than catching `Throwable` and wrapping it in a `RuntimeException`.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "re-jigged this and above a little bit to make it clear that the hadoop config is only used as a fallback. Hopefully that is more clear. As stated before I hope to remove `Configuration` for anything but IO in a further PR.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This doesn't seem correct to me. I try to maintain single-table state by catalog, so that all uses of a table stay in sync. I think it would make sense to do the same with refs. If you update a branch by refreshing or committing any table, it should also refresh everything that is related to stay in sync. Otherwise, you're left with the problem of not knowing whether two tables with the same ref are in sync.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "correct. The only way to return from `get` is with a valid reference, otherwise an exception would be thrown. 
Would you prefer an explicit null check here?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "agreed, fixed", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "agreed, fixed", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "I've deleted the builder as its superfluous after #1640", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "You are correct `NessieNotFoundException` refers to the `ref`. If the table were deleted it would be a conflict exception. This is similar to comparing the error modes of git if you committed to a non-existent ref compared to a merge conflict in the case of changing files in the repo", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "I am not 100% certain that `RuntimeException` is the best avenue here, its definitely an unexpected error and in a sense unrecoverable. There are no exceptions referenced in the javadoc so perhaps a log line and returning `false` is more appropriate?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "correct, now refreshes and tries again", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "We don't strictly need to refresh immediately after an operation. This generates an extra call to the backend which typically isn't required. We have to do it because tests do require it. 
Tests tend to switch branches, perform multiple actions and make several conflicting changes in short order in the same jvm so need explicit refresh. Since the api call isn't expensive we have left the refresh in until a better strategy (or better tests) are devised", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "good idea, fixed. Static methods have been moved to a util class and private methods have been moved to the bottom with `Override` methods grouped above", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "changed to `currentHash`. Thoughts? The method returns the current hash as the catalog understands it.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "How do you mean? A new interface that no-ops create, load, drop from `SupportsNamespaces`?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": 216, "type": "inline"}, {"author": "rymurr", "body": "removed the builder and any URL refs I found", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "Ahh, good point. Do you mean setting some static state somewhere which holds the current app id? I don't love setting static state, is there a way to tell if the snapshot is the 'correct' snapshot we are looking for?\r\n\r\nWe were just talking today about how to better handle Nessie commit info, perhaps something we could discuss tomorrow on the sync call? 
cc @jacques-n", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": 132, "type": "inline"}, {"author": "rymurr", "body": "agreed, fixed", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "Cool, that makes sense to me. Have removed the comment. ", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "We try to use simpler error messages and avoid referring to specific people, like \"we\" or \"you\". A good rule of thumb is \"Cannot [some action]: [problem[ [(suggestion to fix)]\" or \"Invalid [something]: [problem]\". How about \"Invalid table name: # is not allowed (reference by timestamp is not supported)\"?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/ParsedTableIdentifier.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can we use a simpler verb, like `parse` or `parseTableIdentifier`? It's wordy to use \"get\" and then a past tense verb.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/ParsedTableIdentifier.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I'm wondering if there is a more specific name for this class. Maybe something like `TableReference` because it has both an identifier and a ref? Or maybe `NessieIdentifier`? `ParsedTableIdentifier` doesn't really tell me what is different about this as opposed to `TableIdentifier`.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/ParsedTableIdentifier.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Just an interface that omits those. 
All this needs is to list namespaces, not do anything else.\r\n\r\nI guess we can take a closer look if anything else needs this.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": 216, "type": "inline"}, {"author": "rdblue", "body": "Maybe we should pass this in through catalog properties?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": 132, "type": "inline"}, {"author": "rdblue", "body": "Table no longer exists? Or the ref no longer exists?\r\n\r\nAlso, `NotFoundException` is for files that don't exist. Tables should use `NoSuchTableException`", "path": "nessie/src/main/java/org/apache/iceberg/nessie/UpdateableReference.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Most tests use `@Rule TemporaryFolder temp` so that JUnit handles temp lifecycle. I'd recommend doing that here, too.", "path": "nessie/src/test/java/org/apache/iceberg/nessie/BaseTestIceberg.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This looks like it should have a more specific name because it returns the metadata location for a table.", "path": "nessie/src/test/java/org/apache/iceberg/nessie/BaseTestIceberg.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Looks like this could just always pass `hash`.", "path": "nessie/src/test/java/org/apache/iceberg/nessie/BaseTestIceberg.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I prefer to separate tests into distinct cases. It looks like this combines `testRename` with `testListTables`. Combining test cases into a single method causes failures to prevent other tests from running, which makes the whole suite harder to work with.\r\n\r\nIt is also a lot easier to spot missing test cases and add new ones when tests are separate. 
I'd recommend taking a look at most of these test suites.\r\n\r\nThat said, I think that you'll be the primary reviewers here so in the end it is mostly", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What is this suppressing?", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieTable.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can you use `AssertHelpers.assertThrows` instead? That way, you can add assertions after the exception to check other things, like that the table has not been modified.", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieTable.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "We've moved the other implementations to create a FileIO at the catalog level and pass it into TableOperations. You may want to do the same. Also, you'll probably want to update to use the same logic so that the implementation can be overridden dynamically.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Does this need properties? The only thing that is used is an optional ref. That could be passed by itself rather than as a map.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/ParsedTableIdentifier.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "We typically use a continuation indent of 2 indents / 4 spaces, rather than aligning with the previous method call.", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieTable.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "agreed, fixed", "path": "nessie/src/main/java/org/apache/iceberg/nessie/ParsedTableIdentifier.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "I chose `TableReference` in the end. I don't love it but its way better than `ParsedTableIdentifier`. 
I don't know how to concisely state that its a `TableIdentifier` tied to a specific `Reference`. I have updated the javadoc which I hope clarifies it", "path": "nessie/src/main/java/org/apache/iceberg/nessie/ParsedTableIdentifier.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "fixed the comment and simplified the exception handling", "path": "nessie/src/main/java/org/apache/iceberg/nessie/UpdateableReference.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "agreed! Bizarre that wasn't used in the first place :-)", "path": "nessie/src/test/java/org/apache/iceberg/nessie/BaseTestIceberg.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "fixed, contents is an internal nessie name for the objects stored in the Nessie object database. Changed it to content held for iceberg", "path": "nessie/src/test/java/org/apache/iceberg/nessie/BaseTestIceberg.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": ":-) i think you are correct", "path": "nessie/src/test/java/org/apache/iceberg/nessie/BaseTestIceberg.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "I prefer splitting as well. I have split into 3 tests, its a little more cluttered this way but let me know what you think, happy to go either way.", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestCatalog.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "absolutely nothing! It got carried over from the nessie repo (which has militant checkstyle rules) and I forgot to remove it. Gone now.", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieTable.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "thanks for the pointer, copied the Hive Catalog now.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "correct, even further the ref isn't really needed. 
Thanks for pointing that out, much cleaner now.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/ParsedTableIdentifier.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "> Is the intention to add the new catalogs to the Iceberg shaded jar?\r\n\r\nI think it depends. If a catalog pulls in a ton of dependencies and requires updating a lot of the shaded Jar's documentation, then it comes at a high cost. On the other hand, if it uses existing bundled libraries or libraries that can be pulled from the Spark runtime, then it would be easier.\r\n\r\n> I believe we still need to update the IcebergSource to handle custom (Iceberg) catalogs right?\r\n\r\nYes, we will need to come up ", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "We support the Hive warehouse property because that's how to set it up in Hive. Since we are introducing new configuration for Nessie, I'd really rather not make it so people can depend on using the Hadoop config. Then we will never be able to get rid of it.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why remove it in a follow-up? 
I'd be concerned about not remembering and then needing to break behavior later.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Yeah, I like logging and returning false.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think we should update our SparkCatalog to pass information into catalogs.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": 132, "type": "inline"}, {"author": "rdblue", "body": "@rymurr, do we need to request a refresh for all other tables that use this ref?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "rymurr", "body": "Ok, removed. The consequence of this is the custom catalog work for `IcebergSource` has to be done before the next release if we want a valid/usable nessie in the release", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}], "4871": [{"author": "rdblue", "body": "I think this should match the style of the other JSON parsers, which don't do this work twice. Here, you're using a switch statement on the type to validate, and then using a switch statement on the type to extract the value. 
Instead, I think this should have one method that keeps the logic for each type in the same place.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This needs to check `isNull` as well.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This needs to validate the UUID and also return a `java.util.UUID` value.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": 107, "type": "inline"}, {"author": "rdblue", "body": "What is the value of `0x`? I think I'd rather just remove it than have all this extra handling for it.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This needs to validate the length of the byte array.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Iceberg code should not use `spliterator`. Can you find another way?", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "According to the spec, the JSON node should be an object with two fields: `keys` and `values`. I think it would be much easier to validate that the node is an object and then read the fields, rather than trying to convert to a list. This needs to respect the names, not the order.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think you're trying to copy the error messages from `JsonUtil`, but removed the wrong `%s`. The user value goes after the error message and should not be embedded in it. 
The `Cannot parse %s` in `JsonUtil` tells the reader which field was being parsed, like `Cannot parse snapshot-id to a long value: null`.\r\n\r\nThis should be `\"Cannot parse default as a %s value: %s\", type, defaultValue`.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Actually, as long as this is in a type branch, you should just embed the type string: `\"Cannot parse default as a boolean value: %s\", defaultValue`", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this should check `isFloatingPointNumber`", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this also needs to validate that the decimal's scale matches the expected scale. That must always match or else it should throw an exception.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this should validate that the string's length is the length of a UUID string.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can you produce a better error message for when the length is invalid?", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why not use `ByteBuffer.wrap` here?", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "You can move `type.asListType().elementType()` out of the loop.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "It may also be shorter to do it this 
way:\r\n\r\n```java\r\nType elementType = type.asListType().elementType();\r\nreturn Lists.newArrayList(Iterables.transform(arrayNode, e -> DefaultValueParser.fromJson(elementType, e)));\r\n```", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think you should check that the size of these array nodes matches.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "It shouldn't be necessary to copy these into lists. Instead, you can iterate over them simultaneously after checking that the size is the same:\r\n\r\n```java\r\n ImmutableMap.Builder<Object, Object> mapBuilder = ImmutableMap.builder();\r\n\r\n Iterator<JsonNode> keyIter = keys.iterator();\r\n Type keyType = type.asMapType().keyType();\r\n Iterator<JsonNode> valueIter = values.iterator();\r\n Type valueType = type.asMapType().valueType();\r\n\r\n while (keyIter.hasNext()) {\r\n mapBuilder.put(fromJson(keyTyp", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This should return a `StructLike`:\r\n\r\n```java\r\n StructType struct = type.asStructType();\r\n StructLike defaultRecord = GenericRecord.create(struct);\r\n\r\n List<NestedField> fields = struct.fields();\r\n for (int pos = 0; pos < fields.size(); pos += 1) {\r\n NestedField field = fields.get(pos);\r\n String idString = String.valueOf(field.fieldId());\r\n if (defaultValue.has(idString)) {\r\n defaultRecord.set(pos, fromJson(field.type(), defaultValue.get(idString));\r\n }\r\n }\r\n\r\n return defa", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Shouldn't this throw an exception if the type is not supported?", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": 
"inline"}, {"author": "rdblue", "body": "Here, I think we need to handle the child default values. If we make this independent of the child's default value, then there is no way to distinguish between an explicit null default and a missing default after this returns.\r\n\r\nWhen the default is missing and the child field has a default, this should fill in the child's default value.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Rather than using `Literal`, could you just refactor to add the conversions to `DateTimeUtil` like the to string conversions? That way we have both in the util.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This should not produce a string and then parse it. Instead, it should update the conversion above to go directly.", "path": "core/src/main/java/org/apache/iceberg/util/DateTimeUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This is an incorrect and unsafe use of `array()`:\r\n* The buffer may not be backed by `byte[]`\r\n* The buffer may use an `arrayOffset()`\r\n* The buffer may have a non-zero position\r\n* The buffer may have a limit before the end of the array\r\n\r\nYou can use `ByteBuffers.toByteArray(value)` if you need `byte[]`, although that may copy in some cases. I think it should be fine here since most of the time the backing array can be used directly because it will be created by `fromJson`.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can you rename this `fromJson`? 
And also add the variations of the method that accept String.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Like the other parsers, this method should be passed a `JsonGenerator` that handles creating the JSON string.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This should call generator methods instead of returning `List<Object>`. Same with map and struct.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": 362, "type": "inline"}, {"author": "rdblue", "body": "This should deconstruct a `StructLike`, not a map.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "The parser should produce and accept strings, rather than doing it here in tests.", "path": "core/src/test/java/org/apache/iceberg/TestDefaultValuesParsingAndUnParsing.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can you add test cases for nested types? 
One of each (list, map, struct) that contains a struct would be good.", "path": "core/src/test/java/org/apache/iceberg/TestDefaultValuesParsingAndUnParsing.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this should also have a few tests for cases that are caught above, like maps with different length key and value lists, binary and fixed values that are not the right length, UUID values that are not actually UUIDs, etc.", "path": "core/src/test/java/org/apache/iceberg/TestDefaultValuesParsingAndUnParsing.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Since the field can't actually carry a default value right now, I think we can put this off until the next PR.\r\n\r\nFor the next step, I think this should add the API changes as package-private so we can add handling for child defaults in the same package. We can move the parser and make more things public as we make progress.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Yes, please update the spec as well.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't feel strongly about this. Either way is fine, but the first comment about the error message should be fixed.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "@rzhang10, yes, as I said it should be fine \"most of the time\".\r\n\r\nThe problem is that you have no guarantee about the byte buffer that is passed in. Hence this is unsafe.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "@rzhang10, please look at the other parsers and match what they do. 
You should be using a `JsonGenerator`.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "No, as I said this should match the other parser classes and use a `JsonGenerator`.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": 362, "type": "inline"}, {"author": "rdblue", "body": "What if the default value is null?", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": 259, "type": "inline"}, {"author": "rdblue", "body": "Looks like this uses unchecked casts. Can you also add validation that the incoming values are the expected type?\r\n\r\n```java\r\n case BOOLEAN:\r\n Preconditions.checkArgument(javaDefaultValue instanceof Boolean, \"Invalid default %s value: %s\", type, javaDefaultValue);\r\n generator.writeBoolean((Boolean) javaDefaultValue);\r\n break;\r\n ...\r\n```\r\n\r\nNote that this also casts the the class that was checked, so it uses the object type, `Boolean` rather than primitive `boolean`.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Minor: Can you rename `javaDefaultValue` to `defaultValue`? Adding `java` is not very helpful because all of these are Java values. And the choice of which classes are used is actually the set of Iceberg internal representations.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Since there are two separate functions for the \"from string\" functions, I think this should match.", "path": "core/src/main/java/org/apache/iceberg/util/DateTimeUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "It doesn't look like the literals were updated to use these functions. 
Could you finish that refactor?", "path": "core/src/main/java/org/apache/iceberg/util/DateTimeUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Could you name these `microsToIsoTime` and `isoTimeToMicros`? I think the shorter names are more clear.", "path": "core/src/main/java/org/apache/iceberg/util/DateTimeUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This is only required to be `CharSequence`, so you should cast to `CharSequence` and call `toString`.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Minor: this isn't the element default, it is one element in the default list. I think this should just be `element`.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "You can get rid of an extra list allocation by writing keys as you iterate:\r\n\r\n```java\r\n Map<Object, Object> defaultMap = (Map<Object, Object>) javaDefaultValue;\r\n Type keyType = type.asMapType().keyType();\r\n Type valueType = type.asMapType().valueType();\r\n\r\n generator.writeStartObject();\r\n generator.writeArrayFieldStart(\"keys\");\r\n List<Object> values = Lists.newArrayListWithExpectedSize(defaultMap.size());\r\n for (Map.Entry<Object, Object> entry : defaultMap.entrySet()) {\r\n toJson(keyT", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": 377, "type": "inline"}, {"author": "rdblue", "body": "Strings that are part of the serialization format should be constants at the top, like the other parsers use.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I'd probably put these tests in the parser test.", "path": "core/src/test/java/org/apache/iceberg/TestInvalidDefaultValues.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": 
"Style: In Iceberg, there should be an empty newline between a control flow block and the following statement.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": 265, "type": "inline"}, {"author": "rdblue", "body": "Style (minor): the format arguments should be on the same line as the format string unless they need to be wrapped, not one per line.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Nit: this still has \"Java\" in the variable name.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This should be `UnsupportedOperationException`", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This conversion isn't safe. Can you update this to use `timestampFromMicros` and `timestamptzFromMicros` instead? The problem here is that this doesn't use `floorMod` and `floorDiv` so it handles negative values incorrectly.", "path": "core/src/main/java/org/apache/iceberg/util/DateTimeUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "@rzhang10, please refactor the literals to use these methods where possible. 
We don't want two separate implementations.", "path": "core/src/main/java/org/apache/iceberg/util/DateTimeUtil.java", "line": 89, "type": "inline"}, {"author": "rdblue", "body": "Instead of having a parameterized test suite with just one test, can you refactor this and embed the list of types you want to test?\r\n\r\nThen you can combine all of the default values tests into a single suite.", "path": "core/src/test/java/org/apache/iceberg/TestDefaultValuesParsingAndUnParsing.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Sorry if it wasn't clear, but double needs to check `isFloatingPointNumber` as well.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This error message can be more specific: the scale doesn't match.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This should be returned rather than running the conversion twice.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "In Iceberg, you should never suppress an exception that was caught. 
You can add context, but always wrap the exception.", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Actually, it isn't:\r\n\r\n```java\r\n public static UUID fromString(String var0) {\r\n String[] var1 = var0.split(\"-\");\r\n if (var1.length != 5) {\r\n throw new IllegalArgumentException(\"Invalid UUID string: \" + var0);\r\n } else {\r\n for(int var2 = 0; var2 < 5; ++var2) {\r\n var1[var2] = \"0x\" + var1[var2];\r\n }\r\n\r\n long var6 = Long.decode(var1[0]);\r\n var6 <<= 16;\r\n var6 |= Long.decode(var1[1]);\r\n ", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can you rephrase this error message? We try to make messages helpful, but short. Instead, this should be \"Cannot parse default %s value, incorrect length: %s\"", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This should also be `UnsupportedOperationException`", "path": "core/src/main/java/org/apache/iceberg/DefaultValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "When using `assertThrows`, you should always test the exception message.", "path": "core/src/test/java/org/apache/iceberg/TestInvalidDefaultValues.java", "line": null, "type": "inline"}], "6428": [{"author": "nastra", "body": "just curious, would it be possible to extend the existing `CatalogTests` class as it already defines/tests a lot of common behavior", "path": "snowflake/src/test/java/org/apache/iceberg/snowflake/SnowflakeCatalogTest.java", "line": 39, "type": "inline"}, {"author": "nastra", "body": "Thanks for the PR @dennishuo, great to have this. I did a more thorough review and my comments are inlined. 
", "path": null, "line": null, "type": "review_body"}, {"author": "nastra", "body": "the version of this and `commons-dbutils` above should go into `versions.props`", "path": "build.gradle", "line": null, "type": "inline"}, {"author": "nastra", "body": "do we have a more specific type for the config? `Object` seems a little bit too generic", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": 59, "type": "inline"}, {"author": "nastra", "body": "should we do a null check here in case the client wasn't initialized in the first place?\r\n\r\nNote that you might also want to use a `CloseableGroup closeableGroup` to close additional resources, such as `fileIO`", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "So you might end up having the following code in `initialize(...)`:\r\n\r\n```\r\nthis.closeableGroup = new CloseableGroup();\r\n....\r\ncloseableGroup.addCloseable(snowflakeClient);\r\n...\r\ncloseableGroup.addCloseable(fileIO);\r\n...\r\ncloseableGroup.setSuppressCloseFailure(true);\r\n```\r\n\r\nand in `close()` you would simply call\r\n```\r\nif (null != closeableGroup) {\r\n closeableGroup.close();\r\n}\r\n```", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "it seems a bit weird to check whether `name` / `snowflakeClient` / `fileIO` have/haven't been set. Can you please elaborate why this is necessary? 
Typically the `initialize()` should be called exactly once and its purpose is to initialize everything that's necessary", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I think what would make sense in such a case is to have an alternative `initialize(String name, SnowflakeClient snowflakeClient, FileIO fileIO, Map<String, String> properties)` method. The default `initialize(String name, Map<String, String> properties)` method would then call this alternative one. We have a similar pattern in the `NessieCatalog`.\r\nThen I think you could remove the `setSnowflakeClient` / `setFileIO` methods and always call one of the `initialize` methods", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "should this check potentially go into the `snowflakeClient`?", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": 70, "type": "inline"}, {"author": "nastra", "body": "same question as above: should this maybe go into the client?", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I don't think this behavior here is correct. If Namespace properties are not supported, then this should probably just return an empty immutable map", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "this check shouldn't be done here. 
All it should be doing in this method is returning a new `SnowflakeTableOperations` instance", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "should this throw an exception rather than returning null maybe?", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "are we overriding close() only to remove the exception from the signature? If so, then I think it makes more sense to change it to `SnowflakeClient extends AutoCloseable`. With the `close()` method of `AutoCloseable` one can decide to not throw an exception.", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "is this class required? would it maybe make more sense to move this into an existing one (`SnowflakeCatalog` for example)?", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeResources.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "nit: maybe init this only in the constructor", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/JdbcSnowflakeClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "should this have a null check?", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/JdbcSnowflakeClient.java", "line": 376, "type": "inline"}, {"author": "nastra", "body": "should this have a `Preconditions.checkArgument(null != conn, \"...\")`?", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/JdbcSnowflakeClient.java", "line": 130, "type": "inline"}, {"author": "nastra", "body": "what's the reason for having this class here? I was thinking whether it would make sense to directly convert this to a `TableIdentifier` via the `ResultSetHandler` in this class? 
Wouldn't that save us from having to convert from `SnowflakeTable` to `TableIdentifier`?", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/entities/SnowflakeTable.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "the same applies for `SnowflakeSchema`. Couldn't we directly convert/use `Namespace` via the `ResultSetHandler`?", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/entities/SnowflakeTable.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "as mentioned in comments on `SnowflakeCatalog`, we should probably be calling one of the `initialize` methods here as that is typically how a catalog is being initialized in Iceberg", "path": "snowflake/src/test/java/org/apache/iceberg/snowflake/SnowflakeCatalogTest.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "given that this is new code being added, it would be great if we could switch all assertions to AssertJ as that makes it later easier if we decide to move from Junit4 to Junit5. This check in particular would then be something like `Assertions.assertThatThrownBy(() -> catalog.listNamespaces(Namespace.of(dbName))).isInstanceOf(RuntimeException.class).hasMessage(...)`. Note that it's generally good practice to also assert on the message to make sure the right exception with the right message is th
Also please add a `.hasMessage(..)` check to make sure we're getting the right message here.", "path": "snowflake/src/test/java/org/apache/iceberg/snowflake/SnowflakeCatalogTest.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "same as above. Also please add a `.hasMessage(..)` check to make sure we're getting the right message here.", "path": "snowflake/src/test/java/org/apache/iceberg/snowflake/SnowflakeCatalogTest.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "same as above. Also please add a `.hasMessage(..)` check to make sure we're getting the right message here.", "path": "snowflake/src/test/java/org/apache/iceberg/snowflake/SnowflakeCatalogTest.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "`assertEquals(expected, actual)` is the signature, so I think the params should be turned around :)\r\nHowever, it would be great to switch this (and potentially other places above) to `Assertions.assertThat(table.location().isEqualTo(...)` as that is more fluent to read", "path": "snowflake/src/test/java/org/apache/iceberg/snowflake/SnowflakeCatalogTest.java", "line": null, "type": "inline"}, {"author": "danielcweeks", "body": "Please add Javadoc to describe the name collisions between the snowflake and iceberg terminology and why this identifier is used as opposed to the Iceberg TableIdentifier. 
", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/entities/SnowflakeIdentifier.java", "line": null, "type": "inline"}, {"author": "danielcweeks", "body": "This should probably default to ResolvingFileIO", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": null, "type": "inline"}, {"author": "danielcweeks", "body": "You might want to consider moving the conversion from snowflake identifier to that class so you can just call `snowflakeIdentifier.asIcebergIdentifier()` rather than building it here (and everywhere else the conversion is necessary).", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": null, "type": "inline"}, {"author": "danielcweeks", "body": "There's no need to log what was attempted if it's not supported. Just log something like `Snowflake catalog does not support dropping tables.` ", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": null, "type": "inline"}, {"author": "danielcweeks", "body": "Same here, just log that the operation is not supported.", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": null, "type": "inline"}, {"author": "danielcweeks", "body": "Does this need to be public?", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": null, "type": "inline"}, {"author": "danielcweeks", "body": "I don't believe this needs to be public", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeClient.java", "line": null, "type": "inline"}, {"author": "danielcweeks", "body": "Each table should be loaded with its own FileIO instance", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": null, "type": "inline"}, {"author": "danielcweeks", "body": "`LOG` isn't used", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/JdbcSnowflakeClient.java", "line":
null, "type": "inline"}, {"author": "danielcweeks", "body": "I believe this behavior deviates from other implementations. This appears to be listing everything under the current level rather than listing the current level. I believe list tables should only return results for the snowflake schema (not database or root).", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": 76, "type": "inline"}, {"author": "danielcweeks", "body": "I don't think this needs to be public", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/NamespaceHelpers.java", "line": null, "type": "inline"}, {"author": "danielcweeks", "body": "This doesn't appear to be used anywhere", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeTableOperations.java", "line": null, "type": "inline"}, {"author": "danielcweeks", "body": "This doesn't appear to be used anywhere", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeTableOperations.java", "line": null, "type": "inline"}, {"author": "danielcweeks", "body": "I don't feel like we need to describe the difference between this and the JDBC Catalog. Just a clear explanation of how this catalog operates. ", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/JdbcSnowflakeClient.java", "line": null, "type": "inline"}, {"author": "danielcweeks", "body": "I don't think this needs to be public", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/JdbcSnowflakeClient.java", "line": null, "type": "inline"}, {"author": "danielcweeks", "body": "You might want to drop the `entities` package as it doesn't provide a lot of logical separation and results in having to expose some of these internals as `public`, which they really don't need to be.", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/entities/SnowflakeIdentifier.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Yeah, this is the correct behavior. 
It's a little weird, but nice to avoid Hadoop requirements.", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": 59, "type": "inline"}, {"author": "rdblue", "body": "What is used from `iceberg-aws`?", "path": "build.gradle", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't see classes that extend this. Can it be private instead of protected?", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/entities/SnowflakeIdentifier.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "If the static factory methods are used, why not just set this as an instance field passed by those methods?", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/entities/SnowflakeIdentifier.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Iceberg does not use `get` in method names. Either it isn't helpful and can be omitted, or it should be replaced with a more specific verb. It's arguable that no one will touch this code besides Snowflake so it is okay, but I'd still recommend conforming to the project style.", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/entities/SnowflakeIdentifier.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Style: `get` in method name.", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/NamespaceHelpers.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Style: there should be an empty line between control flow blocks and the following statement.", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/entities/SnowflakeTableMetadata.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Isn't there already an in-memory FileIO? 
If not, can we add this in a more generic location so others can use it?", "path": "snowflake/src/test/java/org/apache/iceberg/snowflake/InMemoryFileIO.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can these handlers be static since they don't change based on context?", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/entities/SnowflakeIdentifier.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Does Snowflake allow tables to exist directly in a database or account root? If not, then those identifiers should not be supported here. This is not a recursive listing.", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "We don't typically debug individual calls like this. What is the value here?", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Snowflake schemas do not have metadata associated with them?\r\n\r\nAlso, a namespace exists if this returns without throwing an exception. Is that the intended behavior?", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I'm not sure that I agree here. You can't assume that tables share a catalog-level FileIO, but the tables don't necessarily need a separate instance. 
We reuse the FileIO in other catalogs if the config doesn't change based on the table.", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Style: `get` in a method name.", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeClient.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Before we can include this in runtime Jars, we need to validate the file diff between the resulting Jars, update LICENSE/NOTICE files, and also take a look at the dependency tree to make sure it isn't too broad.", "path": "spark/v3.1/build.gradle", "line": 217, "type": "inline"}, {"author": "rdblue", "body": "Iceberg avoids runtime dependencies on Apache Commons. What is this used for? Can we remove it easily?", "path": "versions.props", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Looks like line wrapping was auto-formatted.", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Missing empty lines in this method.", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeTableOperations.java", "line": 88, "type": "inline"}, {"author": "rdblue", "body": "This should include the catalog name, but the Iceberg identifier won't have it.", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeTableOperations.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this is needed by `tableName`.", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeTableOperations.java", "line": null, "type": "inline"}, {"author": "danielcweeks", "body": "Reusing the FileIO will cause problems in certain cases though, so I feel like we shouldn't introduce those here even if there are other catalogs that currently use that approach. 
For example, if you want to use S3FileIO (or any other native FileIO), then you'll be limited to a single bucket. Once the FileIO is initialized, it will only be able to load the table metadata if it was the same as what the catalog was initialized with. \r\n\r\nIn Snowflake you can register multiple buckets and arbitra", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "what does this test exactly?", "path": "snowflake/src/test/java/org/apache/iceberg/snowflake/SnowflakeCatalogTest.java", "line": 257, "type": "inline"}, {"author": "nastra", "body": "nit: probably better to move this further up", "path": "snowflake/src/main/java/org/apache/iceberg/snowflake/NamespaceHelpers.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I think it would still be good to add some assertion here as this looks weird without any assertion", "path": "snowflake/src/test/java/org/apache/iceberg/snowflake/SnowflakeCatalogTest.java", "line": 257, "type": "inline"}], "12346": [{"author": "szehon-ho", "body": "Hey , really sorry for delay, let's work on it this week. Some early comments", "path": null, "line": null, "type": "review_body"}, {"author": "szehon-ho", "body": "Why do we need interop between Geography and Geometry? I assume we have just have a class for each.", "path": "api/src/main/java/org/apache/iceberg/Geography.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Lets split out predicate pruning support. I feel 80% of the changes is to support those, let's focus in this pr to get the API right.", "path": "api/src/main/java/org/apache/iceberg/expressions/BoundLiteralPredicate.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "I am not comfortable leaking the JTS API from Iceberg API, as this forces all the consumer to now depend on JTS. Can we instead encapsulate this dep in iceberg-core? 
Going to check with other Iceberg PMC's on this.\r\n\r\nAt very least this should be implementation.", "path": "build.gradle", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Nit: use the constant for \"type\"", "path": "core/src/main/java/org/apache/iceberg/SchemaParser.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Nit: use switch statement like rest of the code.", "path": "core/src/main/java/org/apache/iceberg/SchemaParser.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Nit: \"crs\" seems used frequently enough to warrant a constant.", "path": "core/src/main/java/org/apache/iceberg/SchemaParser.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "can we put a comment how we arrived at 80?", "path": "api/src/main/java/org/apache/iceberg/types/TypeUtil.java", "line": 544, "type": "inline"}, {"author": "szehon-ho", "body": "Why cant we put the startsWith in the regex, like DECIMAL and FIXED", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "I guess here, let's specifically not leak the Geography in the API, we should just have a wrapper class as well.", "path": "api/src/main/java/org/apache/iceberg/expressions/Literal.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Can we have a base class, and two implementing classes then? its confusing as it is.", "path": "api/src/main/java/org/apache/iceberg/Geography.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "another option is to allow null for algorithm right? 
and the code can default it , as per the spec?", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Ok I meant code comment :) ", "path": "api/src/main/java/org/apache/iceberg/types/TypeUtil.java", "line": 544, "type": "inline"}, {"author": "szehon-ho", "body": "Thanks @Kontinuation for removing the api dependency on jts from iceberg-api, this looks a lot closer to me", "path": null, "line": null, "type": "review_body"}, {"author": "szehon-ho", "body": "Precondition, check null and throw exception?", "path": "api/src/main/java/org/apache/iceberg/types/EdgeInterpolationAlgorithm.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Nit: can remove redundant 'name'", "path": "api/src/main/java/org/apache/iceberg/types/EdgeInterpolationAlgorithm.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "why not , new GeometryType(\"\")", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "why not new GeographyType(\"\")", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "this should be in its own test", "path": "api/src/test/java/org/apache/iceberg/types/TestTypes.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "nit: these (and following) can be replaced by the constants?", "path": "core/src/main/java/org/apache/iceberg/SchemaParser.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "maybe can annotate NotNull?", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": 596, "type": "inline"}, {"author": "szehon-ho", "body": "Precondition that crs is not null?", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "nit: make a constant to avoid instantiating it 
every time?", "path": "core/src/main/java/org/apache/iceberg/util/GeometryUtil.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Got it, nvm it may not be worth it", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": 596, "type": "inline"}, {"author": "rdblue", "body": "Style: newlines are added after control flow blocks before the next statement.", "path": "api/src/main/java/org/apache/iceberg/transforms/Identity.java", "line": 103, "type": "inline"}, {"author": "rdblue", "body": "Minor: We usually only include the lower-case version when it isn't a simple translation from the name.", "path": "api/src/main/java/org/apache/iceberg/types/EdgeInterpolationAlgorithm.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "So we consider CRS to be case sensitive?\r\n\r\nThe capture group of a case insensitive regular expression returns the matched characters without modification. So if CRS is case sensitive, you can still use a regular expression, like this:\r\n\r\n```java\r\n Pattern pattern = Pattern.compile(\"geometry\\\\(([^)]*)\\\\)\", Pattern.CASE_INSENSITIVE);\r\n```\r\n\r\nThis test will pass:\r\n```java\r\n Matcher m = pattern.matcher(\"GEOMETRY(aBc)\");\r\n assertThat(m.matches()).isTrue();\r\n assertThat(m.group(1)).isEqua", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think the intent with the first capture group is to avoid consuming trailing spaces between the group and the comma, but the current pattern would fail when there are spaces in the CRS name, like `EPSG: 4326`. 
We may want to support names with spaces because they are unambiguous, and we don't have anything that disallows spaces in those names in the specs.\r\n\r\nInstead of matching non-space characters, I recommend matching any character other than comma using a non-greedy `+` by adding `?`: `[^,", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think that this should validate the algorithm and pass the algorithm enum symbol instead of a string. That avoids being able to compile with a typo (because `EdgeInterpolationAlgorithm.VINCENTI` is not a symbol) instead of allowing code paths that contain `\"vincenti\"`.", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Style: In Iceberg, we try to have direct and clear error messages by cutting out sentence structure. Here, we would normally use `\"Invalid CRS: null\"`.", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Please use a valid CRS rather than an empty string. I think that this defaults to CRS84, which seems like a reasonable string to use.", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "The empty string should not be allowed. That is an invalid CRS because it has no information. Null is fine to support here if you expect to have situations in which callers will find it convenient to pass null to mean \"use the default\".", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "There aren't any other types with default type parameters, so this is new. However, `get` is never used for any other parameterized type. 
Since `get` normally implies that there is only one (the other types using `get` are singletons) I think it would be better to use a more descriptive name, like `default()` or `crs84()`. I think I prefer using `crs84()` to be clear.", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Iceberg does not generally use nullability annotations.", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": 596, "type": "inline"}, {"author": "rdblue", "body": "Using empty string for CRS would produce `geometry()` instead of `geometry`. I would also not use `geometry(crs84)` when that is equivalent to `geometry`. So whatever we store for the default CRS (null?) this should translate the default to `geometry`.", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": 623, "type": "inline"}, {"author": "rdblue", "body": "Similar to the comments above: the method to get a default type should not be `get` and an empty string should not be used to carry information. Use either `null` or `CRS84`. I prefer `null`.", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Is there a better name than `of` that is more descriptive? I think that `forCRS` would work here, but it is a bit confusing when the algorithm is added (`forCRS(\"srid:1234\", EdgeInterpolationAlgorithm:ANDOYER)`. Maybe `of` is the best choice.", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": 634, "type": "inline"}, {"author": "rdblue", "body": "I think this should be removed from the API and the logic should be moved into the parsing code.", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Looks like this will produce `\"geography(, )\"` for the default type. When the type is the default, it should produce just `geography`. 
When algorithm is not set, it should produce `geography(crs)` and when algorithm is set it should produce `geography(crs, algorithm)`. When the CRS is defaulted and algorithm is not, it should use whatever standard name we decide on for CRS84.", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can these be collapsed with `testVariantUnsupported` in a `ParameterizedTest`?", "path": "api/src/test/java/org/apache/iceberg/TestPartitionSpecValidation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can you also make sure that `identity` is tested like this?", "path": "api/src/test/java/org/apache/iceberg/TestPartitionSpecValidation.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This is a long name to use as a prefix. What about `EdgeAlgorithm` or `GeoAlgorithm` instead?\r\n\r\nAlso, I'm not sure that \"interpolation\" is the right word for this because the algorithm calculates points on the edge but does not define the edge. The CRS defines the interpolation for the edge, right?", "path": "api/src/main/java/org/apache/iceberg/types/EdgeInterpolationAlgorithm.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Is this valuable? These tests are the same for any string.", "path": "api/src/test/java/org/apache/iceberg/types/TestSerializableTypes.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "In this suite, these cases would be added to `fromPrimitiveString` and `fromTypeName`.", "path": "api/src/test/java/org/apache/iceberg/types/TestTypes.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why is this needed?", "path": "build.gradle", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Primitives are encoded as type strings. 
This method should not change and `GeometryType.toString` should produce a string that can be parsed by `Types.fromTypeName`.", "path": "core/src/main/java/org/apache/iceberg/SchemaParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This should not change.", "path": "core/src/main/java/org/apache/iceberg/SchemaParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "@szehon-ho, is the WKT representation worth adding a library to translate between WKT and WKB? Maybe this should do the same thing as binary like we do for variant?", "path": "core/src/main/java/org/apache/iceberg/SingleValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I see, this is used for translation between WKT and WKB.\r\n\r\nIf we choose to do that translation (we could change from a WKT representation or could not allow defaults for geo fields) then we will need to update license documentation or make support pluggable. Updating documentation involves researching the code that is pulled in by JTS and reviewing the license and notice for every new transitive dependency.", "path": "build.gradle", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Style: Even for private methods, we avoid using `get` in method names. It is either not adding value (can be omitted) or should be replaced with a more specific verb. I would probably omit here and name this `dimension(Geometry)`.", "path": "core/src/main/java/org/apache/iceberg/util/GeometryUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Style: empty lines following control flow blocks.", "path": "core/src/main/java/org/apache/iceberg/util/GeometryUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why does this use `HadoopTableTestBase`? 
I would avoid using `HadoopCatalog` or considering something that works with `HadoopCatalog` to be working everywhere.", "path": "core/src/test/java/org/apache/iceberg/TestGeospatialTable.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This is tested because `supportsVariantType` at the top is true. I would use the same pattern for spatial types, adding them to the list in `DataTest` and disabling the tests with a `supports` method. That way we can slowly add support and enable the tests (as we are doing for unknown, variant, and timestamp(9)).", "path": "core/src/test/java/org/apache/iceberg/TestSchemaParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Good catch with this test. @aihuaxu we will need to update this for variant and other types as well.", "path": "core/src/test/java/org/apache/iceberg/TestSchemaUpdate.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What are the `.*` patterns matching?", "path": "core/src/test/java/org/apache/iceberg/TestSingleValueParser.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "I think the spec says 'ogc:crs84' is default, and 'srid' and 'projjson' is for extensions, so its ok as a default value for me. Wdyt @jiayuasu ?", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think it would be easier to allow null to indicate the default. That's easier to detect for `toString` to show `geometry` instead of `geometry()`.", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "+1 for `OGC:CRS84`.", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Sorry, I meant is it valuable to have a `srid` case and a `projjson` case. 
Both are strings and if we validate that any string is preserved by serialization and deserialization, I think that is sufficient.", "path": "api/src/test/java/org/apache/iceberg/types/TestSerializableTypes.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I missed that in the spec.\r\n\r\n@szehon-ho, why are we using objects for these when we also support parsing them as we do for `decimal` and `fixed`? Should we update the spec?", "path": "core/src/main/java/org/apache/iceberg/SchemaParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "If there aren't real-world scenarios for default values, then we should update the spec to ensure that the default is always null.", "path": "core/src/main/java/org/apache/iceberg/SingleValueParser.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Does this test validate catalog behavior? It seems to me that this is a test that the API can be used with geo types, right? If that needs to be done, then it can be in a `TestGeospatialTable` suite. But that suite should use `InMemoryCatalog` or something rather than `HadoopCatalog` so that we don't create more tech debt and dependencies on `HadoopCatalog`.", "path": "core/src/test/java/org/apache/iceberg/TestGeospatialTable.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why is `algorithmName.isEmpty` valid here? I think it should be an error if the type is passed as `\"geography(srid:1234,)\"`. If there is a non-null string then it must be a valid algorithm name. 
When there is no comma, the second capture group is null, so removing this `isEmpty` check will fix it.", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Empty is not a valid CRS, so I think this is the wrong check.", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think it would be fine to move the `toLowerCase` into the enum. We just don't need to parameterize the enum with a constant that is the lower case name.", "path": "api/src/main/java/org/apache/iceberg/types/Types.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why is this valid? It seems like a problem to me. We don't allow `long()`.", "path": "api/src/test/java/org/apache/iceberg/types/TestTypes.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think a better place for this to be tested is in `TestSchema.testUnsupportedTypes`. There is already a good set of cases for nesting and other types.", "path": "core/src/test/java/org/apache/iceberg/TestTableMetadata.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I'd rather not refactor this out since it is a method added with just one statement. This ends up being a larger change and we plan to remove the `supports*` methods after all of the child tests have support.", "path": "core/src/test/java/org/apache/iceberg/data/DataTest.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Noted, will take a look at this", "path": "core/src/main/java/org/apache/iceberg/SingleValueParser.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "I'm ok with that; initially I thought it's a bit complex as the crs can be projjson content. 
But now it is a table property, so should work.", "path": "core/src/main/java/org/apache/iceberg/SchemaParser.java", "line": null, "type": "inline"}], "1893": [{"author": "rdblue", "body": "My IDE warns that this is a suspicious call because it gets a `FunctionDefinition` from a map keyed by `BuiltInFunctionDefinition`. To fix it, I think the map should be `Map<FunctionDefinition, Operation>`.", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": 83, "type": "inline"}, {"author": "rdblue", "body": "Nit: extra newline.", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": 66, "type": "inline"}, {"author": "rdblue", "body": "How does `FieldReferenceExpression.getName()` reference nested fields?", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "`Class` is parameterized, so this should be `Class<?>`", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "`ValueLiteralExpression` allows the value to be null, in which case `get` here will throw an exception. How is this avoided? Does the parser reject `col = null` expressions?\r\n\r\n@openinx may be able to help here, too.", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Actually, a few lines down there is an assertion that the value isn't null. This looks like a bug to me.", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think it would be cleaner to have a method to extract the column name and literal value as a pair, rather than passing the function to create an expression in here. 
Handling the expression type in the caller, but then also handling it here doesn't provide very good separation of concerns.", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This should not discard anything that is not a `ValueLiteralExpression`. Instead, if there is a non-literal this should either throw `IllegalArgumentException` or return `Optional.empty` to signal that the expression cannot be converted.", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this should be converted to `List<ValueLiteralExpression>` to simplify value conversion.", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Null values can't be ignored. This should either return `Optional.empty` or throw `IllegalArgumentException` if there is a null value.", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "There are several calls to `getResolvedChildren().get(0)`. 
I think that should be converted to a method that validates there is only one child and also validates the type:\r\n\r\n```java\r\n private <T extends ResolvedExpression> Optional<T> getOnlyChild(CallExpression call, Class<T> expectedChildClass) {\r\n List<ResolvedExpression> children = call.getResolvedChildren();\r\n if (children.size() != 1) {\r\n return Optional.empty();\r\n }\r\n\r\n ResolvedExpression child = children.get(0);\r\n i", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This can be `child.map(Expressions::not)`.", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "The literal value needs to be converted to Iceberg's internal representation before being passed to create an expression. Flink will return `LocalDate`, `LocalTime`, `LocalDateTime`, etc. just in the `getValueAs` method. And it isn't clear whether the value stored in the literal is the correct representation for other types as well.\r\n\r\n@openinx, could you help recommend how to do the conversion here?", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Tests should be broken into individual methods that are each a test case. 
To share code, use `@Before` and `@After` and different test suites.", "path": "flink/src/test/java/org/apache/iceberg/flink/TestFlinkTableSource.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This class needs an extensive test suite that checks the conversion from expected Flink expressions, not just a test for the source.\r\n\r\nThe conversion needs to cover at least these cases:\r\n* Equals with null\r\n* Not equals with null\r\n* In with null\r\n* Not in with null\r\n* Equals with NaN\r\n* Not equals with NaN\r\n* In with NaN\r\n* Not in with NaN\r\n* All inequalities with null\r\n* All inequalities with NaN\r\n* All expressions with a non-null and non-NaN value (preferably one string and one numeric)\r\n* E", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": 45, "type": "inline"}, {"author": "rdblue", "body": "Can you also add a test case that listens for a `ScanEvent` and validates that the expression was correctly passed to Iceberg?", "path": "flink/src/test/java/org/apache/iceberg/flink/TestFlinkTableSource.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think it would be good to fix this case rather than making the assumption that Flink won't push the `= null` filter. Handling null will be good for maintainability.", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Okay, sounds fine that Flink doesn't currently support predicate pushdown on nested fields. @openinx, any plans to change this?", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "Q: is it possible that we have literals on both the left and right side? For example, someone may write the SQL: `SELECT * FROM test where 1 > 0`, so we cannot assume that the left MUST be a field and the right MUST be a `value` in this `else` block. 
\r\n\r\nWe have done similar work in our own branch before; you could see the PR: https://github.com/generic-datalake/iceberg-poc/pull/2/files#diff-86160616589acf1dd526b10b73418a46fe60f9e5e5ab6946a4ea3c8f019542f5R123", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "Is this sentence correct? Should it transform the `&&` to `||`? ", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "I agree with @rdblue that we should handle null in this function rather than assuming flink won't push down the `null`. ", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "I think it's better to add a unit test to address this case. ", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "Q: the `call.getResolvedChildren().get(0)` MUST be a `FieldReferenceExpression` ? I would recommend using the similar [toReference](https://github.com/generic-datalake/iceberg-poc/pull/2/files#diff-86160616589acf1dd526b10b73418a46fe60f9e5e5ab6946a4ea3c8f019542f5R181) method to check whether it's indeed a `FieldReferenceExpression`; that's safer.", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "Yes, flink does not support nested field push down now. Will need to file an issue to address it in the apache flink repo.", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "Pls check its class type before casting to `FieldReferenceExpression` directly. It's safer to cast if we're sure that it's indeed the expected class. 
", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "nit: the `{` and `}` could be removed.", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "Flink's `LIKE`, `BETWEEN`, `NOT_BETWEEN` could also be pushed down, pls see https://github.com/generic-datalake/iceberg-poc/pull/2/files#diff-86160616589acf1dd526b10b73418a46fe60f9e5e5ab6946a4ea3c8f019542f5R78-R80.", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": 63, "type": "inline"}, {"author": "openinx", "body": "I would suggest having a stricter assertion to validate the pushed filter's internal information, see https://github.com/generic-datalake/iceberg-poc/pull/2/files#diff-5d18d1ff127d1dc70a9a15bbe941f2b6f9d28b3015924f601ac1f722914099dbR96", "path": "flink/src/test/java/org/apache/iceberg/flink/TestFlinkTableSource.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I agree. It is best to validate both child types in each case and only convert if they match what is expected.", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "Here we could remove the `flinkExpression == null` and just keep the \r\n```java\r\n if (!(flinkExpression instanceof CallExpression)) {\r\n return Optional.empty();\r\n }\r\n```\r\n\r\nBecause `null` will always meet the `!(flinkExpression instanceof CallExpression)` condition.", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "How about changing this to check whether the `explainBetween` contains a more detailed string `FilterPushDown,the filters :ref(name=\"id\") >= 1,ref(name=\"id\") <= 2]]], fields=[id, data])` ? 
`explainBetween.contains(\"FilterPushDown\")` is not so accurate for me. ", "path": "flink/src/test/java/org/apache/iceberg/flink/TestFlinkTableSource.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "OK, makes sense to me! ", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": 63, "type": "inline"}, {"author": "openinx", "body": "`getOnlyChild`? The method name is confusing. How about using `singleton`? ", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "Pls consider the two cases: \r\n\r\n```java\r\nCase.1 : a < 1; \r\nCase.2: 1 < a;\r\n```\r\n\r\nHere `convertBinaryExpress` will parse the `tuple2` as `<a, 1>`, and the `function` will be `lessThan`. But actually case 2 is totally different from case 1, because its meaning is: a is `greaterThan` 1. \r\n\r\nThat's why we introduced a reversed function in [here](https://github.com/generic-datalake/iceberg-poc/pull/2/files#diff-86160616589acf1dd526b10b73418a46fe60f9e5e5ab6946a4ea3c8f019542f5", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "If we're sure that flink won't enter the `IN` block, then I think we should remove this block. Pls add a comment saying `IN` will convert to multiple `OR`.", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "I'd prefer to have a basic `args` check here (rather than introducing another similar `sql(String query, Object... args)` method): \r\n\r\n```java\r\nString query = args.length > 0 ? 
String.format(sql, args) : sql;\r\nTableResult tableResult = getTableEnv().executeSql(query);\r\n```", "path": "flink/src/test/java/org/apache/iceberg/flink/FlinkTestBase.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "> the BETWEEN, NOT_BETWEEN,IN expression will be auto convert by flink\r\n\r\n-> `The BETWEEN, NOT_BETWEEN,IN expression will be converted by flink automatically`. \r\n\r\n> so we do not add the convert here\r\n\r\n-> so we do not add the conversion here. \r\n\r\n> be convert to\r\n\r\n-> be converted to ", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "I think there's a bug here. Assume the case: `a != NaN`, the `handleNaN` will return an iceberg expression: `Expressions.isNaN(name)`. That's incorrect ? ", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "Pls add a unit test for this if possible. ", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "Will the flink convert the `NOT_EQ` to be `NOT ( EQ )` ? If sure, then we don't have to handle the `NOT_EQ` ? ", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": 117, "type": "inline"}, {"author": "openinx", "body": "It's good to move this class to the `test` packages; I saw that no production files would call this. ", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkUtil.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "It's better to use `ImmutableList.of()` as the default `filters` because of the comment [here](https://github.com/apache/iceberg/pull/1893/files#r544779804), though there's a large probability that when `isFilterPushedDown()` returns true the filters should always be non-nullable. 
", "path": "flink/src/main/java/org/apache/iceberg/flink/IcebergTableSource.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "nit: it's simpler to rewrite this as: \r\n\r\n```java\r\n explain += String.format(\", FilterPushDown,the filters :%s\", Joiner.on(\",\").join(filters));\r\n```", "path": "flink/src/main/java/org/apache/iceberg/flink/IcebergTableSource.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "nit: how about rewriting those as: \r\n\r\n```java\r\n List<org.apache.iceberg.expressions.Expression> expressions = Lists.newArrayList();\r\n for (Expression predicate : predicates) {\r\n FlinkFilters.convert(predicate).ifPresent(expressions::add);\r\n }\r\n```", "path": "flink/src/main/java/org/apache/iceberg/flink/IcebergTableSource.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "It's good to assert the optional is present before getting its value: \r\n\r\n```java\r\n Optional<org.apache.iceberg.expressions.Expression> expression = FlinkFilters.convert(expr);\r\n Assert.assertTrue(expression.isPresent());\r\n```", "path": "flink/src/test/java/org/apache/iceberg/flink/TestFlinkFilters.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "Pls add `NaN` cases for both equals and notEquals. ", "path": "flink/src/test/java/org/apache/iceberg/flink/TestFlinkFilters.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "Nit: better to check `ifPresent` before `get` the value from `Optional`. ", "path": "flink/src/test/java/org/apache/iceberg/flink/TestFlinkFilters.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "Sounds like it should be renamed as `FlinkFiltersUtil`", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkUtil.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "These key-value pairs could be removed, right? 
", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "Nit: It's better to rename it as `parseFieldAndLiteral`? The method does not actually convert expressions from flink expr to iceberg expr? ", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "For flink's `Binary` data type, its default java type is `byte[]`, while iceberg's `BinaryLiteral` will use `ByteBuffer`. So we will need to convert it to ByteBuffer? ", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": 206, "type": "inline"}, {"author": "openinx", "body": "Pls see the iceberg expression literal types here: https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/types/Type.java#L30", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": 206, "type": "inline"}, {"author": "openinx", "body": "If the literal type of flink is BINARY or VARBINARY, then its java data type is `byte[]`, while in iceberg we will use `ByteBuffer` to do the literal comparison. So I think we need to convert it to ByteBuffer by `ByteBuffer.wrap((byte[])o)`. Otherwise the iceberg literal comparison will throw a cast failure. \r\n\r\nPls see the java data type for iceberg type here: https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/types/Type.java#L30\r\n\r\nBy the way, I think y", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": 213, "type": "inline"}, {"author": "openinx", "body": "OK, then we have to keep the `NOT_EQ` here. 
", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": 117, "type": "inline"}, {"author": "openinx", "body": "The `DescribeExpressionVisitor` does not actually format the flink filters to string; instead it's formatting the iceberg `Expression` to string. So even if we want to introduce a flink utility, it's not a good idea to put the iceberg visitor in this class. ", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkUtil.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "I'd prefer to abstract the common methods from flink and spark to a common utility. How about introducing an `ExpressionsUtil` under package `org.apache.iceberg.expressions`? In that way, we could remove the `Spark3Util#describe` methods. ", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkUtil.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "In this `matchLiteral` testing method, we will provide both flink's java data type and iceberg's java data type (for asserting). I think it's not a good way to test those literals, because if people don't quite understand which specific java data type an iceberg type maps to, they may provide the wrong iceberg java type to assert. The `BINARY` type is an example. \r\n", "path": "flink/src/test/java/org/apache/iceberg/flink/TestFlinkFilters.java", "line": 91, "type": "inline"}, {"author": "openinx", "body": "I think the correct way to validate the flink-iceberg type mapping is: for each data type, provide an EQUAL expression, use `FlinkFilters` to convert it to iceberg filters, and finally call `BoundLiteralPredicate#test` to see that the EQUALS returns the expected `true`; if yes, then the data type mapping MUST be correct.", "path": "flink/src/test/java/org/apache/iceberg/flink/TestFlinkFilters.java", "line": 91, "type": "inline"}, {"author": "openinx", "body": "It's good to have test cases for all other data types except `Integer`. 
I guess we may need a few abstractions so that the test code won't be duplicated too much. ", "path": "flink/src/test/java/org/apache/iceberg/flink/TestFlinkFilters.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "Asserting that the explain contains `FilterPushDown` does not have much meaning because all explained strings will contain that word (see `IcebergTableSource#explainSource`). The key part to assert is the last part: \r\n\r\n```java\r\nString.format(\", FilterPushDown,the filters :%s\", Joiner.on(\",\").join(filters));\r\n```\r\n\r\nI mean we need to assert the `Joiner.on(\",\").join(filters)` part (for all the following cases).", "path": "flink/src/test/java/org/apache/iceberg/flink/TestFlinkTableSource.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "Nit: we could just use `assertEquals` for two lists; we don't have to convert them to arrays and then use `assertArrayEquals`? ", "path": "flink/src/test/java/org/apache/iceberg/flink/TestFlinkTableSource.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "nit: If we use the statically imported `assertEquals` before, then we shouldn't use `Assert.assertEquals` here? Make them unified. ", "path": "flink/src/test/java/org/apache/iceberg/flink/TestFlinkTableSource.java", "line": null, "type": "inline"}, {"author": "openinx", "body": "The patch almost looks good to me now; I found one bug and left a few comments. Pls resolve those comments. Really thanks for @zhangjun0x01's patient work. ", "path": null, "line": null, "type": "review_body"}, {"author": "openinx", "body": "OK, that makes sense. ", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": 213, "type": "inline"}, {"author": "openinx", "body": "Looks good to me now. @rdblue would you like to have another check if you have time ? 
I will keep this open for one or two days, then I plan to merge this :-)", "path": null, "line": null, "type": "review_body"}, {"author": "rdblue", "body": "I don't think that `Expression` needs to be fully-qualified.", "path": "api/src/main/java/org/apache/iceberg/expressions/ExpressionsUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Could you also add Javadoc to this method?", "path": "api/src/main/java/org/apache/iceberg/expressions/ExpressionsUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Nit: \"convert\" should be capitalized because it begins a sentence.", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Nits: missing a space in the list and the next sentence isn't capitalized.", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think that this should pass `null` to `toReference`. It works because `toReference` does an `instanceof` check, but it isn't obvious that `null` is expected there.\r\n\r\nI think it would be better to use `flatMap` to run `toReference` if the `FieldReferenceExpression` is defined.", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think that this method name is enough to see what's going on from reading it. How about a name like `childAs` or `onlyChildAs`? 
That makes it clear that the call's child will be returned and that there should be only one.", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "If this is going to access children from this list directly using `get(1)` then it should also check that there are exactly 2 children.", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": 189, "type": "inline"}, {"author": "rdblue", "body": "Actually, since the type is already known there is no need to use `toReference` at all. This can call name directly:\r\n\r\n```java\r\n return onlyChildAs(call, FieldReferenceExpression.class)\r\n .map(FieldReferenceExpression::getName)\r\n .map(Expressions::isNull);\r\n```", "path": "flink/src/main/java/org/apache/iceberg/flink/FlinkFilters.java", "line": null, "type": "inline"}], "5268": [{"author": "danielcweeks", "body": "The original idea behind the MetricsContext interface was to provide a way to delegate to different implementations for creating telemetry primitives (like Counter, Timer, Guage, etc.).\n\nYou can imagine implementations like:\n`HadoopMetricsContext` (this exists, but is a wierd implementation because of how it interacts with FileSystem)\n`DefaultMetricsContext` (provides native counters/timers like you've created)\n`MicroMeterMetricsContext` (delegates the creation to a micrometer implementation)\n\nI", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanMetricsContext.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "that makes perfect sense @danielcweeks, thanks. I'll adjust the implementation accordingly", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanMetricsContext.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What about using just `start` and `stop`? 
Isn't it clear that these are timer methods?", "path": "api/src/main/java/org/apache/iceberg/metrics/Timer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "A lot of other frameworks implement `AutoCloseable` so that you can use try-with-resources to stop the timer:\r\n\r\n```java\r\n List<DataFile> files;\r\n try (Timer t = context.timer(\"manfiest-read\", TimeUnit.MICROSECONDS).start()) {\r\n this.files = ManifestFiles.read(...);\r\n }\r\n```", "path": "api/src/main/java/org/apache/iceberg/metrics/Timer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What about using `time` as the verb? I think `record` makes sense for adding intervals timed in some other way, but I would expect `time` to be used when passing a `Runnable` or `Callable`.", "path": "api/src/main/java/org/apache/iceberg/metrics/Timer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "An alternative that I tend to like is to have a custom `Task` interface rather than using `Callable` that allows you to specify the exception type:\r\n\r\n```java\r\n public interface TimedTask<R, E extends Exception> {\r\n R call() throws E;\r\n }\r\n```\r\n\r\nThat way, you can customize the exception type and don't need an outer `try` block if it only throws `RuntimeException`.", "path": "api/src/main/java/org/apache/iceberg/metrics/Timer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this is definitely going to be called in parallel, since we want to measure the time it takes to scan an individual manifest file, which happens in the worker pool.\r\n\r\nTo do that, you may want `startTimer` to return an individual `TimedExecution` or something, similar to how `Iterable.iterator` returns an `Iterator` with its own state. 
When a `TimedExecution` has `stopTimer` called on it, it could then call `record`, which would work just fine because `record` is an `AtomicLong`.", "path": "api/src/main/java/org/apache/iceberg/metrics/DefaultTimer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This will never be empty, right? Why not just print the value of counter directly?", "path": "api/src/main/java/org/apache/iceberg/metrics/LongCounter.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "When setting instance fields, we use `this.` to show that it is modifying state outside of the local context.", "path": "api/src/main/java/org/apache/iceberg/metrics/LongCounter.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Should we use `updateAndGet` with `Math.addExact` to avoid overflows?", "path": "api/src/main/java/org/apache/iceberg/metrics/LongCounter.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Does the `onCompletionRunnable` need to happen in a `try-finally` block in case `iterable.close()` throws?\r\n\r\nMight be something you want to add a test for.", "path": "api/src/main/java/org/apache/iceberg/io/CloseableIterable.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Style / Non-blocking: I'm a big fan of using Guava's `MoreObjects.toStringHelper` like seen here:\r\n\r\nhttps://github.com/apache/iceberg/blob/90fe0edf1a671095e587a53adeb31cc84d01fb89/core/src/main/java/org/apache/iceberg/rest/responses/LoadTableResponse.java#L72-L78\r\n\r\nIt more or less gives you the same result but is a lot less mental overhead in my opinion. 
Up to you if you use it or not though.", "path": "api/src/main/java/org/apache/iceberg/metrics/DefaultTimer.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "agreed, that looks much nicer", "path": "api/src/main/java/org/apache/iceberg/metrics/DefaultTimer.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "makes sense, updated", "path": "api/src/main/java/org/apache/iceberg/metrics/LongCounter.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "makes sense, updated", "path": "api/src/main/java/org/apache/iceberg/metrics/LongCounter.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "good point, I updated this to just start() / stop()", "path": "api/src/main/java/org/apache/iceberg/metrics/Timer.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "good point, that makes perfect sense to stop the timer when close is called", "path": "api/src/main/java/org/apache/iceberg/metrics/Timer.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I was broadly following how Micrometer defines their [Timer](https://www.javadoc.io/doc/io.micrometer/micrometer-core/latest/io/micrometer/core/instrument/Timer.html) interface and I had the exact same thought as you mentioned here. \r\nThat being said, `time()` makes also sense to me as a verb for measuring a Runnable/Callable/Supplier, but I think `record(long amount, TimeUnit unit)` / `record(Duration duration)` reads nicer than `time(long amount, TimeUnit unit)` / `time(Duration duration)` and", "path": "api/src/main/java/org/apache/iceberg/metrics/Timer.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I like that idea. Would you like me to add this as part of this PR or rather later?", "path": "api/src/main/java/org/apache/iceberg/metrics/Timer.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "you're absolutely right about this. That's also how Micrometer handles timing state. 
I'll update accordingly", "path": "api/src/main/java/org/apache/iceberg/metrics/DefaultTimer.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I think it would also make sense to have something like a `TimedRunnable` / `TimedSupplier`", "path": "api/src/main/java/org/apache/iceberg/metrics/Timer.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "good catch, that makes definitely sense. Updated and added a test", "path": "api/src/main/java/org/apache/iceberg/io/CloseableIterable.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "good catch, that makes sense. Updated the code to reflect that", "path": "api/src/main/java/org/apache/iceberg/metrics/LongCounter.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "Given that we're tracking state, using `AutoCloseable` (bf2178c818508152dce7a3b910a998a5fba9d2b5) makes the API less readable imo. I've done it in a separate commit, so we can still decide whether we want to have it or not", "path": "api/src/main/java/org/apache/iceberg/metrics/Timer.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I left the `AutoCloseable` changes in the commit mentioned above but left it out from https://github.com/apache/iceberg/pull/5286 for now. 
Let me know if you'd like to include it in https://github.com/apache/iceberg/pull/5286", "path": "api/src/main/java/org/apache/iceberg/metrics/Timer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I agree with using `record` for direct additions, but I don't think I would make the names the same across something that is timed by the timer and a time that is provided by the caller.", "path": "api/src/main/java/org/apache/iceberg/metrics/Timer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Style: we add an empty newline between control flow blocks and the following statement.", "path": "api/src/main/java/org/apache/iceberg/metrics/DefaultMetricsContext.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I'd probably replace what's here with two variations, one that returns a value and one that has no return. Then you've covered all of the cases and can use method references to call with a Runnable, Supplier, or Callable.", "path": "api/src/main/java/org/apache/iceberg/metrics/Timer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "When I think of sampling in metrics, I think of randomly deciding whether to time or use a no-op, to avoid overhead in tight loops. It looks like the intent here is to make this something that reports when stopped, so a different name is probably more clear (I had no idea why this had sampling).\r\n\r\nHow about calling this `Timed` or something?", "path": "api/src/main/java/org/apache/iceberg/metrics/Timer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What is the purpose of having this as well as `Sample`? 
If they do the same thing with different ways of stopping the timed period, then can't one interface provide both `close` and `stop`?", "path": "api/src/main/java/org/apache/iceberg/metrics/Timer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this object should be responsible for tracking its own timer. It's an awkward API to need to pass the right timer back in.", "path": "api/src/main/java/org/apache/iceberg/metrics/Timer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this is probably unnecessary given that there's already a `start` method.", "path": "api/src/main/java/org/apache/iceberg/metrics/Timer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "As I mentioned above, I think that timing the execution of something is a different operation than recording an interval that was timed externally.", "path": "api/src/main/java/org/apache/iceberg/metrics/Timer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this needs to be updated. Now that it is using a `Sample` to hold the stopwatch, there's no need to fail if it is called concurrently.", "path": "api/src/main/java/org/apache/iceberg/metrics/DefaultTimer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This stopwatch should be owned by the `Sample` and not shared.", "path": "api/src/main/java/org/apache/iceberg/metrics/DefaultTimer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This class should own the stopwatch, so that there is no way for this to happen. When a sample is created, the stopwatch is also created and started. 
The only state that we care about is if `stop` is called multiple times on the same `Sample`.", "path": "api/src/main/java/org/apache/iceberg/metrics/DefaultTimer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What is `deletedDataFiles` reporting?", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanReport.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This is tracking scan planning, not a scan. Values like this should be `totalPlanningDuration`.\r\n\r\nWe also typically use `long` to return durations in milliseconds. Is there a reason to use `Duration` instead? Are millis not granular enough?", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanReport.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "When would a scan add data files?", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanReport.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I'm assuming that this is distinct from `matchingDataManifests` because this is the number that were actually read if the scan planning iterable was closed early?", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanReport.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Is the intent to keep adding new metrics to this?", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanReport.java", "line": 30, "type": "inline"}, {"author": "rdblue", "body": "I don't think this needs the `Count` suffix. 
It's fairly clear that this is going to be a counter.", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanReport.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think we want a metric for total data manifests as well so that we can track the % of data manifests that were scanned.", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanReport.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why would a count not default to 0?", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanReport.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Similar to the comment above, this would be `startScanPlanning`.\r\n\r\nI'm also not sure that the `ScanReporter` itself should be a timer. That mixes more responsibility than necessary. I would probably just add a timer for total planning time to the scan metrics.", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanReporter.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why is a reporter tied to a single \"current\" `ScanMetrics`? Why not make this interface produce `ScanMetrics` and accept `ScanReport`?", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanReporter.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I would probably remove customization for now. We can add this later.", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanReporter.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think the problem is that this is tracking state and creating a new object. The object itself tracks state so the caller doesn't need to.", "path": "api/src/main/java/org/apache/iceberg/metrics/Timer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Indentation is off. 
It should be 2 spaces per indent and 2 indents for continuation indents.", "path": "api/src/test/java/org/apache/iceberg/io/TestCloseableIterable.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Does this need to be exposed?", "path": "core/src/main/java/org/apache/iceberg/BaseScan.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think you can go ahead and remove the old methods. These are internal constructors.", "path": "core/src/main/java/org/apache/iceberg/BaseTableScan.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Since this is passed in from the table, we don't need to support dynamic loading. If there's a use case, we can add it later.", "path": "core/src/main/java/org/apache/iceberg/CatalogUtil.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Tying the lifecycle of scan metrics to the scan reporter means that all of the scans created through this chain will have the same metrics collector, which is a correctness problem.\r\n\r\nIn the Java library, the `TableScan` interface uses a refinement pattern. Each `TableScan` is independent and you get a new table scan that has additional options when you call the refinement methods, like `select` or `filter`. For example,\r\n\r\n```java\r\nTableScan fullTableScan = table.newScan();\r\nTableScan yesterda", "path": "core/src/main/java/org/apache/iceberg/BaseTableScan.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Rather than changing so many method signatures, why not add `ScanReporter` to the `TableScanContext`?", "path": "core/src/main/java/org/apache/iceberg/IncrementalDataTableScan.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Style: missing newline.", "path": "core/src/main/java/org/apache/iceberg/ManifestGroup.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think I see what you're doing with these counters now. 
These end up tracking the total number of data or delete files in manifests that are scanned, right?", "path": "core/src/main/java/org/apache/iceberg/ManifestGroup.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think a more reliable place to gather these by wrapping the iterator returned by `planFiles`. That way you don't have to worry about future changes to code further filtering the iterator that you're wrapping here.", "path": "core/src/main/java/org/apache/iceberg/ManifestGroup.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Nit: unnecessary whitespace change.", "path": "core/src/main/java/org/apache/iceberg/ManifestGroup.java", "line": 175, "type": "inline"}, {"author": "rdblue", "body": "Rather than changing the filtering, why not make a new Iterable that updates a counter? Then you'd have less custom code:\r\n\r\n```java\r\n class <T> Iterable<T> count(Counter counter, Iterable<T> iterable) {\r\n return () -> new Iterator<T> {\r\n Iterator<T> iter = iterable.iterator();\r\n\r\n public boolean hasNext() {\r\n return iter.hasNext();\r\n }\r\n\r\n public T next() {\r\n T next = iter.next();\r\n counter.increment();\r\n return next;\r\n }\r\n }\r\n }\r\n\r\n ", "path": "core/src/main/java/org/apache/iceberg/ManifestGroup.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "`Timed` sounds good to me. 
Updated", "path": "api/src/main/java/org/apache/iceberg/metrics/Timer.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "the only reason I figured this might be a good idea is the IDE complaining about `Timer#start()` being used without a `try-with-resources` statement everywhere, but I guess the API is less readable so it's probably better to make `Timed` extend `AutoCloseable`", "path": "api/src/main/java/org/apache/iceberg/metrics/Timer.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "agreed, fixed", "path": "api/src/main/java/org/apache/iceberg/metrics/Timer.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "ok renamed those 3 methods to `time(...)`", "path": "api/src/main/java/org/apache/iceberg/metrics/Timer.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "-1 means here that this wasn't tracked/available, but 0 means that this was tracked and was actually 0", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanReport.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "ok removed it", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanReporter.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "subclasses need access to this and checkstyle is complaining that this needs to be exposed via a method for subclasses", "path": "core/src/main/java/org/apache/iceberg/BaseScan.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "ok removed it", "path": "core/src/main/java/org/apache/iceberg/CatalogUtil.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I haven't considered that, but that makes perfect sense. 
Updated the code", "path": "core/src/main/java/org/apache/iceberg/IncrementalDataTableScan.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "moved the `ScanReporter` into the `TableScanContext`, so this change isn't required anymore", "path": "core/src/main/java/org/apache/iceberg/BaseTableScan.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "moved the `ScanReporter` into the `TableScanContext`, so this change isn't required anymore", "path": "core/src/main/java/org/apache/iceberg/BaseScan.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "yes correct. The naming is probably bad in `ScanMetrics`, so I'm completely open to suggestions on naming and also on what metrics to track and where to retrieve them from. The initial idea was that there are just **some metrics** gathered and reported", "path": "core/src/main/java/org/apache/iceberg/ManifestGroup.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "that would obviously also work and I'm completely open to suggestions. 
The only downside of this approach is that we're creating a new iterator", "path": "core/src/main/java/org/apache/iceberg/ManifestGroup.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "the only reason to use `Duration` here is because of the flexibility of converting to other time units and out-of-the-box toString() readability", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanReport.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "yes correct", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanReport.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "yes, the idea is to add more stuff eventually to this class (depending on the things we'd like to track)", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanReport.java", "line": 30, "type": "inline"}, {"author": "nastra", "body": "this is tracked as part of `totalDataManifestsRead()`", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanReport.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "makes sense, updated. The intention was not to make the `ScanReporter` a timer. We're tracking all metrics (including a Timer) inside `ScanMetrics`", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanReporter.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "thanks for bringing this up. I haven't thought about that. 
I'll rewrite the implementation and add some tests to make sure that each independent table scan produce their own metrics/report", "path": "core/src/main/java/org/apache/iceberg/BaseTableScan.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I double-checked and the old implementation was behaving correctly, because a new `ScanMetrics` instance was created when `ScanReporter#startScan()` was called.\r\nHowever, I updated the `ScanReporter` implementation to how you suggested, as that is cleaner", "path": "core/src/main/java/org/apache/iceberg/BaseTableScan.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "There is no `startTimer` method anymore. If `stopwatch` is null, that indicates that `stop` was called more than once.", "path": "api/src/main/java/org/apache/iceberg/metrics/DefaultTimer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why is `totalTime` recorded in nanos instead of `defaultTimeUnit`? If we are going to record nanos anyway, why not remove passing the default unit and use nanos everywhere?", "path": "api/src/main/java/org/apache/iceberg/metrics/DefaultTimer.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "ah yeah I need to adjust the error message to say `stop() called multiple times`", "path": "api/src/main/java/org/apache/iceberg/metrics/DefaultTimer.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "oversight after cleaning up the implementation :( will push a fix", "path": "api/src/main/java/org/apache/iceberg/metrics/DefaultTimer.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I merged the CloseableIterable fix and the addition of Timer. Can you rebase? 
Thanks!", "path": null, "line": null, "type": "review_body"}, {"author": "nastra", "body": "added that functionality", "path": "core/src/main/java/org/apache/iceberg/ManifestGroup.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why use two info messages rather than one to report the result? I'd expect something like `\"Completed scan planning: %s\", report`", "path": "api/src/main/java/org/apache/iceberg/metrics/LoggingScanReporter.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What is this suppressing?", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanReport.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Following up on [this comment](https://github.com/apache/iceberg/pull/5268#discussion_r922334117), I don't think that this should produce -1.\r\n\r\nThe reason why `count` returns an `Optional` is that implementations, like Hadoop counters, may not be able to return the final value. Those counters aren't appropriate here, because we need to get the value to build a scan report.\r\n\r\nThere are a couple options to fix. One is to log an error and not send the scan report, so this method would return null", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanReport.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "We typically form error messages by telling you what's wrong, not what the requirement is. That would be `\"Invalid scan report: null\"`\r\n\r\nAnother convention that we use is that we avoid class and method names in error messages. 
The person reading the error rarely has the context to know what the variable is called, and variable names tend to get out of sync with strings (as you could see from the error message about `startTimer()` not getting called).", "path": "api/src/main/java/org/apache/iceberg/metrics/LoggingScanReporter.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think we should change some of these metrics to be more easily understood by people that see them. To do that, I'd start by establishing some consistent terms that we should use. Here's what I'm thinking:\r\n* Use \"manifest\" to refer to a manifest file\r\n* Use \"file\" to refer to a content file\r\n* Use \"data\" and \"delete\" to specify whether a manifest/file contains data or deletes\r\n* Use \"filtered\" to count what was rejected by a filter\r\n* Use \"scanned\" to count what is actually read\r\n* Use \"result", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanReport.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```\r\n> Task :iceberg-api:checkstyleMain\r\n[ant:checkstyle] [ERROR] /home/nastra/Development/workspace/iceberg/api/src/main/java/org/apache/iceberg/metrics/ScanReport.java:141:41: 'tableName' hides a field. [HiddenField]\r\n[ant:checkstyle] [ERROR] /home/nastra/Development/workspace/iceberg/api/src/main/java/org/apache/iceberg/metrics/ScanReport.java:146:40: 'snapshotId' hides a field. [HiddenField]\r\n[ant:checkstyle] [ERROR] /home/nastra/Development/workspace/iceberg/api/src/main/java/org/apache/ice", "path": "api/src/main/java/org/apache/iceberg/metrics/ScanReport.java", "line": null, "type": "inline"}], "6622": [{"author": "huaxingao", "body": "I removed the [null check](https://github.com/apache/iceberg/blob/97d657cc4485da9f10b74d48cc3c4c52cadffcef/api/src/main/java/org/apache/iceberg/expressions/BoundAggregate.java#L124) and did it in `update`. e.g. 
[MaxAggregate.update]( https://github.com/apache/iceberg/blob/349daf83882655d13d591807fe7b9ded268ec5d4/api/src/main/java/org/apache/iceberg/expressions/MaxAggregate.java#L62)\r\n", "path": "api/src/main/java/org/apache/iceberg/expressions/BoundAggregate.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Sorry maybe a silly question but how does this work with time travel? Here we're reading the summary from the latest table snapshot but if an aggregation is done on a historical snapshot then we may skip the check below unintentionally. ", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Thanks for your comment! I think I need to check `readConf.snapshotId()` first to get the time travel snapshot. If there is no time travel snapshot, then get `table.currentSnapshot()`. I will fix this.", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "nit on log message:\r\n\r\n\"Group by aggregation push down is not supported yet\"", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "same as below, imo\r\n\r\n\"Cannot push down aggregates when row level deletes exist\"", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "typo: statics -> statistics", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Instead of .toString().equals() or .toString().contains() I feel this may be a case where ```instanceof``` may be clearer.\r\n\r\nFor example\r\n\r\n```\r\nif (mode instanceof 
None) {\r\n return false;\r\n}\r\nelse if (mode instance of Counts) {\r\n...\r\n}\r\netc \r\n```", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Agree. Addressed this and the above comments.", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "[doubt] can we parallelize reading the manifests here ? ", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "[minor] can use PropertyUtil#propertyAsInt", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "You're right that we need to subtract the null values, since NaN and null are included:\r\n\r\n> Map from column id to number of values in the column (including null and NaN values)\r\n\r\nI think I was also wrong to add the NaN values in, because they are included in both v1 and v2. So we should remove the addition of NaN count.", "path": "api/src/main/java/org/apache/iceberg/expressions/CountNonNull.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This doesn't seem correct to me. What if it is an optional struct with some number of nulls? I think we may want to return `null` if we don't have a count for the column, which would be the default behavior of `safeGet` right?", "path": "api/src/main/java/org/apache/iceberg/expressions/CountNonNull.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Was this needed for correctness? I'm not sure I understand why you'd need to move it. The `NullSafeAggregator` was intended to avoid needing to handle null in the `update` methods. 
It also used to keep `isNull` set correctly so that any aggregate that is null would stop calling `eval` and updating.", "path": "api/src/main/java/org/apache/iceberg/expressions/BoundAggregate.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Iceberg uses properties that are words separated by `-`. The reason why the two above use camelCase instead is that they are standard options used by Spark. I think for this we should use `aggregate-push-down-enabled`.", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkReadOptions.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "We typically separate words with `-` rather than `_`. I think this should also match the other property. How about `spark.sql.iceberg.aggregate-push-down-enabled`?", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "It's a little strange that this class is generic and holds a set of rows, but then assumes that those rows are aggregates. Can you refactor this to be purely generic?", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkLocalScan.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can you make this exception more specific? What might be thrown here?", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This shouldn't swallow the exception by only printing the message. Instead, this should pass the exception to the logger so that it gets printed with the full stack trace, suppressed exceptions, and causes.", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This isn't much information to help. 
Can this show which modes don't support aggregate pushdown?", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This is going to read all table metadata, which could be really large. Instead, I think this should use scan planning to get the data files. That will allow this to apply filters and skip a lot of data, and it would also parallelize manifest scanning using a `ParallelIterator`. You'd need to request stats, or else the tasks will be returned without them copied.", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think it would be good to check whether the aggregates are non-null and only return if they are valid. Otherwise, this could return different results depending on whether stats are present in the file metadata. To avoid that, we can just detect whether we have a result and abort pushdown if we don't.", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "There's a wrapper that we can use so we don't need to convert to Scala, `StructInternalRow`. That would also handle conversion to Spark representations, like `UTF8String`.", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "If one data file evaluates `null`, I think we still want to evaluate the rest of the data files. 
For example,\r\n```\r\nCREATE TABLE test (id LONG, data INT) USING iceberg PARTITIONED BY (id);\r\n\r\nINSERT INTO TABLE test VALUES (1, null), (1, null), (2, 33), (2, 44), (3, 55), (3, 66);\r\n\r\nSELECT max(data) FROM test;\r\n```\r\nFor `max(data)`, the first data file evaluates null, I think we still want to evaluate the rest of the data files to get the max value `66` for `max(data)`.", "path": "api/src/main/java/org/apache/iceberg/expressions/BoundAggregate.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Iceberg does not use `get` in method names. There's probably a better name here, like `readSnapshot()`.", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Log messages should be more direct. In this case, the main information is that the aggregate pushdown is skipped. The reason why is secondary, but important. Rather than making this a statement that needs to be interpreted (\"X is not possible\" -> \"Iceberg didn't do X\"), this should be \"Skipped aggregate pushdown: detected row level deletes\".\r\n\r\nIn addition, I think that there are cases where you'd still want an answer from metadata. First, there may not be any matching delete files, so it could ", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Error message: don't use \"yet\". 
It is simply not supported.\r\n\r\nIt should be possible to do this, but I understand skipping it in the first PR.", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Removed the NaN count.", "path": "api/src/main/java/org/apache/iceberg/expressions/CountNonNull.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "How about just checking the BoundAggregate in the beginning and not push down if any of the aggregates are on complex type? In this way we don't need to create `AggregateEvaluator` and evaluate any of the aggregates.", "path": "api/src/main/java/org/apache/iceberg/expressions/CountNonNull.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "What about a short Javadoc explaining the purpose of the class?", "path": "api/src/main/java/org/apache/iceberg/expressions/AggregateEvaluator.java", "line": 35, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: What about a direct import for `Aggregator` to shorten the lines?\r\n\r\n```\r\nImmutableList.Builder<Aggregator<?>> aggregatorsBuilder = ImmutableList.builder();\r\n```", "path": "api/src/main/java/org/apache/iceberg/expressions/AggregateEvaluator.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: What about an empty line prior to the for loop for separate this block?", "path": "api/src/main/java/org/apache/iceberg/expressions/AggregateEvaluator.java", "line": 63, "type": "inline"}, {"author": "aokolnychyi", "body": "Do we think this will be used in the future? I did not find where it is used right now.", "path": "api/src/main/java/org/apache/iceberg/expressions/AggregateEvaluator.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I am not sure I understand the purpose of `isNull` in this class then. 
Looks like we init it and never change?", "path": "api/src/main/java/org/apache/iceberg/expressions/BoundAggregate.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I haven't seen the tests yet but I think we want to cover all of [these](https://spark.apache.org/docs/3.3.0/sql-ref-null-semantics.html#builtin-aggregate-expressions-) scenarios.", "path": "api/src/main/java/org/apache/iceberg/expressions/BoundAggregate.java", "line": 114, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: `aggregatePushDown` -> `aggregatePushDownEnabled`?", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkReadConf.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Do we need this temp var? Why not return directly like in other methods of this class?\r\n\r\n```\r\nreturn confParser\r\n .booleanConf()\r\n .option(SparkReadOptions.AGGREGATE_PUSH_DOWN_ENABLED)\r\n .sessionConf(SparkSQLProperties.AGGREGATE_PUSH_DOWN_ENABLED)\r\n .defaultValue(SparkSQLProperties.AGGREGATE_PUSH_DOWN_ENABLED_DEFAULT)\r\n .parse();\r\n```", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkReadConf.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Hm, is it a good idea to throw an exception? Would returning null be better?", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkAggregates.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I'll check how we handle it later. No need to change it.", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkAggregates.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "nit: `pushDownAggregate` -> `canPushDownAggregation`?", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Why do we have this condition? 
Is it to prevent such pushdown for metadata tables?", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": 251, "type": "inline"}, {"author": "aokolnychyi", "body": "I also support the idea of checking if any matching tasks have deletes and using that instead of relying on generic snapshot metadata.", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "I'd consider adding another method to `SparkAggregates` to convert an entire `Aggregation`. That way, we will be able to simplify this block.", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "aokolnychyi", "body": "Shall we add tests for NaN, positive and negative infinity?", "path": "spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/sql/TestAggregatePushDown.java", "line": 43, "type": "inline"}, {"author": "huaxingao", "body": "Changed to use scan planning to get the data files. Please take a look to see if it's OK. ", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Thanks for your comment. This has been changed to use scan planning for parallel reading", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Thanks. This has been changed to check the deletes in tasks.", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Fixed. Thanks", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkReadOptions.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Fixed. 
Thanks", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Fixed. After the change it's something like this\r\n```\r\ntesthive.default.table [max(data), min(data), count(data)]\r\n```", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkLocalScan.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Fixed. Thanks", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Fixed. Thanks", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Have fixed this to include the Metrics modes in the log", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "I have added the null check, but after a second thought, I removed it.\r\n\r\nIf one of the column has all null values, then the max or min is also null. We probably still want to push down the aggregate.\r\n\r\nI have checked the Metrics mode to disable push down if the mode doesn't have stats. I have also disabled push down for complex types. I am wondering if it's safe without the null check here. If not, I will put back.\r\n", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Changed to `StructInternalRow`. 
Thanks", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Fixed, thanks", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "I have changed the code to check the deletes in the tasks and abort the push down if deletes are present.\r\n\r\nI also agree it may be better to introduce another setting to get an approximate number if there are deletes. Probably we can do this in a follow up PR. ", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Fixed. thanks", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Added. Thanks", "path": "api/src/main/java/org/apache/iceberg/expressions/AggregateEvaluator.java", "line": 35, "type": "inline"}, {"author": "huaxingao", "body": "Fixed. Thanks", "path": "api/src/main/java/org/apache/iceberg/expressions/AggregateEvaluator.java", "line": 63, "type": "inline"}, {"author": "huaxingao", "body": "I think this is for future usage @rdblue ", "path": "api/src/main/java/org/apache/iceberg/expressions/AggregateEvaluator.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Fixed. Thanks", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkReadConf.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Fixed. Thanks", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkReadConf.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Changed. 
Thanks", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Changed. Thanks", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This should be how data is returned by the evaluator, right? How would Spark get the result if not by calling this method?", "path": "api/src/main/java/org/apache/iceberg/expressions/AggregateEvaluator.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Oh, I see. My original PR had `public StructLike result()` rather than two methods. I think that was correct. Everything should get the result as `StructLike` and wrap to adapt to other in-memory object models.\r\n\r\nhttps://github.com/apache/iceberg/pull/6405/files#diff-e81b895957e7995cf3a00dd20966277fd18a2ae6c3090d28d3cdce033fbfeb56R83-R87", "path": "api/src/main/java/org/apache/iceberg/expressions/AggregateEvaluator.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I see. In that case, I think we need to change `isNull` to `hasValue` and return a boolean from `update(R)`.\r\n\r\nThe intent here was to signal when there is not enough information to produce a value. When there isn't, then the result value should be `null`, and we can skip pulling values out of rows or data files because we don't have enough information.\r\n\r\nFor example, if we are processing 3 Parquet files and 1 Avro file, the Avro file may not have a max value. Rather than giving a partial max f", "path": "api/src/main/java/org/apache/iceberg/expressions/BoundAggregate.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "@huaxingao, I think the fix is to have a flag in the aggregator that can return whether or not the value is valid. 
That's what I wanted to use `null` for here, but you're right that there are cases where the aggregate is value and that value is null because there are no non-null values.\r\n\r\nIf we keep track of `isValid` in each aggregator, then the `AggregateEvaluator` can have a similar method to return whether all aggregates are valid. The we would just abort the aggregation if any value is not", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I'm good ignoring structs, maps, and lists for now. But we _can_ probably handle cases where there are value counts for primitive fields nested within stucts.\r\n\r\nTo fix this, I don't think that we need to check for struct, list, or map types. Instead, we can simply detect that we don't have the value and null counts (which are not produced for those types) and set the `isValid` flag to signal that we don't know.", "path": "api/src/main/java/org/apache/iceberg/expressions/CountNonNull.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "I changed `isNull` to `hasValue`. I have also added a flag `canPushDown` in `BoundAggregate` to indicate if this aggregate can be pushed down. I think I need a way to differentiate the `null`: is the `null` due to stats not available (e.g. complex type) or due to the value is null, so I added this flag.", "path": "api/src/main/java/org/apache/iceberg/expressions/BoundAggregate.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "OK. Thanks", "path": "api/src/main/java/org/apache/iceberg/expressions/CountNonNull.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Sorry I initially didn't know this `StructInternalRow` wrapper class, so I broke your method into two and used `Object[]` directly, and then I forgot I have changed your original method. 
I changed it back.", "path": "api/src/main/java/org/apache/iceberg/expressions/AggregateEvaluator.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Aggregate expressions can't be mutable. The `AggregateEvaluator` is the mutable state that gets created for every aggregation.\r\n\r\nSorry about not being clear, I think the problem was that I referred to \"aggregator\" when I meant `AggregateEvaluator`.", "path": "api/src/main/java/org/apache/iceberg/expressions/BoundAggregate.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why is this creating a `TableScanContext`? You should be able to use `table.newBatchScan()` and do this entirely with the public interfaces.", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "If this was needed to call `setColStats` then we should add a method to set that, rather than exposing the reporter on `BaseTable` and not using the public API.", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Reminder: Iceberg does not use the `get` prefix in method names.", "path": "core/src/main/java/org/apache/iceberg/BaseTable.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "For all cases, this should not be called if the value is null. Handling null values is the purpose of `NullSafeAggregator`.", "path": "api/src/main/java/org/apache/iceberg/expressions/CountAggregate.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Rather than setting `canPushDown` in the aggregator, I think that this needs to distinguish between a `null` value and a missing value. To do that, there are a few options, but I think the cleanest is to add a `hasValue(DataFile file)` method to check before calling `eval`. 
If that returns false, then `hasValue` is set to `false` and no more aggregation is done. (No need for a similar update for rows, since there is always a value.)\r\n\r\nIn addition, this method should only call `update` when valu", "path": "api/src/main/java/org/apache/iceberg/expressions/BoundAggregate.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Missing whitespace.", "path": "api/src/main/java/org/apache/iceberg/expressions/AggregateEvaluator.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This should always call all aggregators. Each aggregator will stop aggregation independently if there is a missing value, then it is the responsibility of the caller to determine whether the aggregation can be used by checking if any aggregate has no value. The caller may still choose to get the aggregates for which there are values, so this should always return what it can. If there is no value, it should return `null`.", "path": "api/src/main/java/org/apache/iceberg/expressions/AggregateEvaluator.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This should not be needed.", "path": "core/src/main/java/org/apache/iceberg/TableScanContext.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "@huaxingao, I think that this class should be tested with the various cases for `DataFile`. We want tests in core for core functionality, not just integration tests with Spark.", "path": "api/src/main/java/org/apache/iceberg/expressions/AggregateEvaluator.java", "line": 35, "type": "inline"}, {"author": "rdblue", "body": "We can't assume that the aggregate is invalid just because it isn't supported. 
Instead, this should state that the aggregate is unsupported.", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkAggregates.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Info message about why aggregate pushdown was skipped?", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": 207, "type": "inline"}, {"author": "rdblue", "body": "Nit: end punctuation in log messages should be omitted.", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Style: missing whitespace between `if` blocks.", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Nit: I'd separate the `return true` from the other statements.", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": 247, "type": "inline"}, {"author": "rdblue", "body": "Nit: ending punctuation in a log message.", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "How about `\"Skipped aggregate pushdown: Cannot produce min or max from truncated values for column %s\"`?", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "How about \"Skipped aggregate pushdown: Cannot produce min or max from count for column %s\"?", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": 293, "type": "inline"}, {"author": "rdblue", "body": "How about \"Skipped aggregate pushdown: No metrics for column %s\"?", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": 289, "type": 
"inline"}, {"author": "rdblue", "body": "I think it would be slightly better to create the scan in the aggregation methods. Then this could be `if (localScan != null) { return localScan }` which is a bit more generic and may be reused in the future.", "path": "spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think this is quite correct. A null value is a valid value if it comes from a struct, but it should be skipped rather than passing it into `update` because min/max/count skip null values.", "path": "api/src/main/java/org/apache/iceberg/expressions/BoundAggregate.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This looks good to me.", "path": "api/src/main/java/org/apache/iceberg/expressions/BoundAggregate.java", "line": 140, "type": "inline"}, {"author": "rdblue", "body": "Since there is the case where `count < 0` in `countFor`, I think that this should return `file.recordCount() >= 0`. 
If -1 was used for the data file's record count then there is no value.", "path": "api/src/main/java/org/apache/iceberg/expressions/CountStar.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "For the value to be valid, it must be expected to be non-null and should actually be included in upper bounds.\r\n\r\nThat is:\r\n* If the column is all null values (`safeGet(file.valueCounts(), fieldId) != null && safeGet(file.valueCounts(), fieldId) == safeGet(file.nullValueCounts(), fieldId)`) then this returns `true` because `null` is valid\r\n* If the column has at least one non-null value, then `file.upperBounds()` must contain `fieldId`.", "path": "api/src/main/java/org/apache/iceberg/expressions/MaxAggregate.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this will work correctly when `file.upperBounds()` does not contain the `fieldId` because all values are null.", "path": "api/src/main/java/org/apache/iceberg/expressions/MaxAggregate.java", "line": 56, "type": "inline"}, {"author": "rdblue", "body": "This is similar to the case for max, but should check that there's a value in `file.lowerBounds()`.", "path": "api/src/main/java/org/apache/iceberg/expressions/MinAggregate.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Fixed. Thanks", "path": "api/src/main/java/org/apache/iceberg/expressions/CountStar.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "Fixed. Thanks!", "path": "api/src/main/java/org/apache/iceberg/expressions/MaxAggregate.java", "line": null, "type": "inline"}, {"author": "huaxingao", "body": "I initially wanted to push down `NaN`, but I think it's better not to. 
I changed to not push down if there are `NaN`", "path": "spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/sql/TestAggregatePushDown.java", "line": 620, "type": "inline"}, {"author": "rdblue", "body": "I think this should be `&&`, not `||` to be careful about what is accepted.\r\n\r\nThe upper bounds must have an entry for the field. But that entry could be `null` if the null count and the value count are equal.\r\n\r\nTo be safe, I think this should be:\r\n\r\n```java\r\n protected boolean hasValue(DataFile file) {\r\n boolean hasBound = file.upperBounds().containsKey(fieldId);\r\n Long valueCount = safeGet(file.valueCounts(), fieldId);\r\n Long nullCount = safeGet(file.valueCounts(), fieldId);\r\n bo", "path": "api/src/main/java/org/apache/iceberg/expressions/MaxAggregate.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This logic should be similar to to what I recommended in the max aggregate.", "path": "api/src/main/java/org/apache/iceberg/expressions/MinAggregate.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "It would help to name these files to assist the reader. It looks like this file is missing stats for the `some_nulls` column. 
If that's the intent of this file for testing, then I'd name it something like `MISSING_SOME_NULLS_STATS`.", "path": "api/src/test/java/org/apache/iceberg/expressions/TestAggregateEvaluator.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why not add count(*) here as well?", "path": "api/src/test/java/org/apache/iceberg/expressions/TestAggregateEvaluator.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "The results look correct for this test.", "path": "api/src/test/java/org/apache/iceberg/expressions/TestAggregateEvaluator.java", "line": 99, "type": "inline"}, {"author": "rdblue", "body": "Isn't this just testing the fact that the stats are missing and this can't be calculated?\r\n\r\nI think it would be more interesting to test valid count(*) and count(some_nulls) cases where the stats exist.", "path": "api/src/test/java/org/apache/iceberg/expressions/TestAggregateEvaluator.java", "line": 139, "type": "inline"}, {"author": "rdblue", "body": "In fact, I think it would make sense for all of the tests to support count(*)", "path": "api/src/test/java/org/apache/iceberg/expressions/TestAggregateEvaluator.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think that this should be public. 
If you need to have stats returned, then that option needs to be exposed through a public interface, not on the implementation class.", "path": "core/src/main/java/org/apache/iceberg/DataTableScan.java", "line": null, "type": "inline"}], "13400": [{"author": "singhpk234", "body": "To be removed, need to remodel the tests correctly, but presently the E2E machinery works with scan planning and with context aware parsers !", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "might need to pass the auth session based on context passed in the loadTable", "path": "core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Yeah, more specifically I think this should be a `tableSession` whose parent session is the `contextualSession` ", "path": "core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "i just realized, i sent the tableClient to this function which already has done `client.withAuthSession(tableSession)` we should be good then :) ! let me check if i can validate it by a test ", "path": "core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I'd just inline these where they're actually used for improved readability", "path": "core/src/main/java/org/apache/iceberg/RestTableScan.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I don't follow, in this case shouldn't `useSnapshotSchema` always be true? When time travelling to a specific snapshot we always use the snapshot schema", "path": "core/src/main/java/org/apache/iceberg/RestTableScan.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I feel like this should reside in the REST module? 
", "path": "core/src/main/java/org/apache/iceberg/RestTableScan.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Same as above, I feel like this should reside in the REST module", "path": "core/src/main/java/org/apache/iceberg/ScanTasksIterable.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Shouldn't `close` cancel any ongoing work? ", "path": "core/src/main/java/org/apache/iceberg/ScanTasksIterable.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Can it not also be a cancelled status here as well? https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L3443", "path": "core/src/main/java/org/apache/iceberg/RestTableScan.java", "line": 168, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Same as below, can this be in the REST module?", "path": "core/src/main/java/org/apache/iceberg/RestTable.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I'm still going through things but I do believe this works at least from a correctness perspective. I still would need to give some more thought as to cancellation, client/server backpressure and how this would fit in for engines which immediately start task consumption/execution during planning (like Trino)", "path": null, "line": null, "type": "review_body"}, {"author": "amogh-jahagirdar", "body": "Should this be a queue? 
we're always removing from the front", "path": "core/src/main/java/org/apache/iceberg/ScanTasksIterable.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "This is done because TableScanContext is protected and we need to access it, not sure if we should make it public, thought it ?", "path": "core/src/main/java/org/apache/iceberg/RestTable.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "I think, when snapshot is the current snapshot we don't want to use the snapshot schema use table schema so useSnapshotSchema returns false otherwise it returns true, which is what we have in the spec - https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L4410-L4411", "path": "core/src/main/java/org/apache/iceberg/RestTableScan.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "I think its mostly related to the fact that cancel endpoint is not something we have implemented ? i can add it for this but i think couple questions that cross my mind are : \r\n1. Should i throw ? \r\n2. What if we cancelled something may be due to the LIMIT being reached i.e we got all the intended number of tasks, in this case should we just stop processing ? 
can we have that context \r\nideally in the plan call it should not be the case, let me think more", "path": "core/src/main/java/org/apache/iceberg/RestTableScan.java", "line": 168, "type": "inline"}, {"author": "singhpk234", "body": "Yes it should IMHO, we should issue a cancel, should i just go ahead and implement the cancel ?", "path": "core/src/main/java/org/apache/iceberg/ScanTasksIterable.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "plz ref [here](https://github.com/apache/iceberg/pull/13400#discussion_r2283592442)", "path": "core/src/main/java/org/apache/iceberg/RestTableScan.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "presently this is a table prop as well client side config override combination which triggers scan planning flow, \r\n\r\nwondering if we need some special prop in server side response from the table to indicate the client to do server side planning ? ", "path": "core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Is this `planningFinished` state ever updated anywhere? It looks like in every possible branch we'll either exit due to a success or throw. The one exception is if it's repeatedly in submitted. But effectively as its written, this is just a while(true)...", "path": "core/src/main/java/org/apache/iceberg/RESTTableScan.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Whether or not server side planning is supported is a matter of checking the config response no? I don't think it's something that would be defined per table so I don't think table properties is right. I think the right conditions is client side config + server config response contains scan-planning.\r\n\r\nI do agree we'll also need something from a server to indicate that a client *must* do client side planning but I think that's something we can take in a follow on. 
Once the spec is defined for t", "path": "core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "sorry missed this, yeah I think we should implement the close->cancel logic upfront. It'll be important for any engines which are enforcing limits; they would then stop the task generation/requests and free up resources from the server", "path": "core/src/main/java/org/apache/iceberg/ScanTasksIterable.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "Discussed this offline but adding it here for completion: \r\nIt's supposed to be a while(true) loop essentially polling the server until the planning is completed because the `planFiles()` api has been called; this variable is essentially added to suppress the warning. Per feedback I will change this to while(true) with a timeout and add suppress warning annotations. ", "path": "core/src/main/java/org/apache/iceberg/RESTTableScan.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "Agree, then I will just refactor to check the `/config` response for whether the server supports this and then the client side intention for now, will come back to it later !", "path": "core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "Introduced since iceberg 1.8.0 via https://github.com/apache/iceberg/pull/11756 \r\nnot sure if we should be doing another release to correct this, since post 1.10 the rest request and parsers are now public, a server with java SDK 1.10 might be able to use it to be ready with the server side implementation? \r\n\r\ncc @amogh-jahagirdar thoughts ? 
", "path": ".palantir/revapi.yml", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "created a separate PR for this fix: https://github.com/apache/iceberg/pull/14120/files", "path": ".palantir/revapi.yml", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I think this change is right; after all, it's possible that the delete file references field is defined but is an empty list for a given task.", "path": "core/src/main/java/org/apache/iceberg/rest/RESTFileScanTaskParser.java", "line": 90, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "`restTableFor`? or just `restTable`, since we limit the use of get in the project", "path": "core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I feel like we should double check we're getting the expected behavior out of metadata tables in this change. The way it's currently implemented we're never going to pass the RESTTable through to MetadataTableUtils.createMetadataTableInstance. Of course in general for security use cases typically nobody is going to have access to this, but this should still be implemented in a way that's compatible with supporting metadata queries when remote planning is enabled.\n\nI understand we're going to want
We can always refactor later but I do want to make sure there's no incorrect behavior in case server side planning is supported and a client does a metadata table scan. ", "path": "core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java", "line": 491, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Nit: remove these comments", "path": "core/src/main/java/org/apache/iceberg/ScanTasksIterable.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I'm not really a fan of hardcoded table identifiers which exhibit a behavior in `RESTCatalogAdapter`, especially considering we know people use this class as a primitive when building simple catalogs out there. \n\nI see how it's convenient for writing tests but I feel like we should have a different mechanism since this change really is just for the tests we're adding.\n\nIs it not possible to do something like RESTCatalogAdapter only does planning with depth when there are some N (some hardcoded i
", "path": "core/src/main/java/org/apache/iceberg/RESTTable.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Test harness: We should make sure the task contents are what we expect, the exact data/delete files....", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "This probably repeats enough across the tests that it's worth putting in a helper:\n\nrestTableScanFor(Table table) which asserts the instance of the table and the scan type and returns the casted RESTTableScan.", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Same for these tests, I feel like these tests could be stronger in asserting the contents of the actual files rather than simple expectations on the existence of delete files.", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "Does this need to be public? ", "path": "core/src/main/java/org/apache/iceberg/RESTTableScan.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I don't quite follow, why do we need to pass in a callback for cancellation? We're already passing the client into the iterable, could we not just cancel a plan inside the close of the iterable?", "path": "core/src/main/java/org/apache/iceberg/RESTTableScan.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "Yes we can, I did it in this way so that the cancelPlan logic can be re-used, mainly because there are cases like (timeout reached) etc where we want to additionally cancel; this way we just have one place where the cancel, i.e. RESTTableScan, is defined (in case we wanna error out / log etc) and the iterator simply calls the callback. 
please let me know your thoughts considering the above.", "path": "core/src/main/java/org/apache/iceberg/RESTTableScan.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "This is the ignore style for swallowing the exception here; the rationale is that if we get an exception here then it's fine to ignore rather than fail?\r\n```\r\n } catch (Exception e) {\r\n // Plan might have already completed or failed, which is acceptable\r\n return false;\r\n }\r\n ```", "path": "core/src/main/java/org/apache/iceberg/rest/RESTTableScan.java", "line": 261, "type": "inline"}, {"author": "singhpk234", "body": "Metadata tables would be tricky (maybe we can offload some metadata tables like FILES?) but metadata tables like `all_manifests` etc require knowing the manifests, which in Remote Planning we expect the client not to care about; this can't be resolved with the same RestTable as we don't have the metadata for it. \r\nif the expectation is the client has the credentials to read the table (since it can read the manifests directly) would it be better to then cut off RestTable from the loop ? \r\n\r\nexperimenting", "path": "core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java", "line": 491, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "While I do like the simplicity of the current implementation, I feel like it's a bit over simplistic in that we're wrapping parallel iterables around parallel iterables, which I feel is a bit problematic for bounding consumption and memory usage (each level in the pagination will multiply how much memory is allocated across all the queues). 
In practice I wouldn't expect such crazy depths, but it'd be better if we can make this function just like client side planning where tasks are constrained to", "path": "core/src/main/java/org/apache/iceberg/RESTTableScan.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "I agree, I pushed an api change to make ParallelIterable expose taking in more iterables post creation and then added a reference counter, which makes finishing adding, and hence closing the iterator, easier to reason about; please let me know if you prefer reviewing that change in a separate PR https://github.com/apache/iceberg/pull/13400/commits/1622b115639ed456a7bdb9ac30da529fe4b7dd6d \r\n\r\n", "path": "core/src/main/java/org/apache/iceberg/RESTTableScan.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I'm still going through this change in more detail but I also poked around locally to see if there were any alternatives to changing ParallelIterable itself (which I still think is a valid option but worth considering other options as well imo). \r\n\r\nOne approach that seems to be OK and reasonably simple is using Java ForkJoinPool which handles any [recursive plan task expansion](https://github.com/amogh-jahagirdar/iceberg/blob/separate-pool-with-queues/core/src/main/java/org/apache/iceberg/REST", "path": null, "line": null, "type": "review_body"}, {"author": "amogh-jahagirdar", "body": "One possible issue in the fork join pool approach I linked is that even if we share the fork join pool across planning invocations, it's still technically a separate threadpool from the worker pool. In an environment like Trino where the coordinator is handling multiple queries this may lead to resource utilization issues across queries which are mixed across remote planning vs not. 
\r\n\r\nAnother issue which I think is relatively minor in this context is that fork join pool really is meant for CPU bou", "path": null, "line": null, "type": "review_body"}, {"author": "amogh-jahagirdar", "body": "@singhpk234 I published a PR to your branch on how we may want to do this https://github.com/singhpk234/iceberg/pull/271 , comments in the code. Let me know if this makes sense, we can probably publish it separately", "path": "core/src/test/java/org/apache/iceberg/rest/RESTCatalogAdapter.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "this should probably be moved to `RESTCatalogProperties`", "path": "core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I think it would be good to pull these changes out into a separate PR", "path": "core/src/main/java/org/apache/iceberg/util/ParallelIterable.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "nit: protected?", "path": "core/src/test/java/org/apache/iceberg/catalog/CatalogTests.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "I agree, though I haven't moved this into a separate PR because we are still debating whether we need:\r\n1. Approach implemented in this PR\r\n2. Approach @amogh-jahagirdar mentioned https://github.com/apache/iceberg/pull/13400#pullrequestreview-3341995863\r\n3. Approach @danielcweeks suggested which I implemented here https://github.com/apache/iceberg/pull/14444\r\n\r\nseems like we are now inclined towards 3, what are your thoughts considering all 3 approaches ? 
", "path": "core/src/main/java/org/apache/iceberg/util/ParallelIterable.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "maybe just `server-scan-planning-enabled` or `rest-scan-planning-enabled` as having `rest` and `server` in the name more or less means the same thing?", "path": "core/src/main/java/org/apache/iceberg/rest/RESTCatalogProperties.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "looks like we just need a rebase since this was already fixed", "path": "core/src/main/java/org/apache/iceberg/rest/TableScanResponseParser.java", "line": 109, "type": "inline"}, {"author": "nastra", "body": "I believe this is already verified inside restTableScanFor", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I would have assumed that we just use whatever the default is that is defined in `RESTCatalogAdapter`? Any other test that needs other planning behavior would configure it for that particular test, but the default IMO should be the default of `RESTCatalogAdapter`", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I'm not fully following why the response wouldn't roundtrip here. 
Isn't that why we have the serializers and deserializers configured for a given response class?", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I think this should be removed as mentioned in https://github.com/apache/iceberg/pull/13400/files#r2538769276", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "What does `Replaces magic table name-based behavior routing` actually mean and where does that happen?", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "given that `tableScan` is never used anywhere, maybe we should remove that parameter?", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I don't think our builders across the codebase typically follow this pattern. We typically pass configured parameters through a private constructor, so maybe we should update this as well to align it?", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "why is this one set to 100 and the other to 1000? 
Should they both be aligned?", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "should this be called `synchronous()` to align with `asynchronous()`?", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "are we expecting all of these new methods to be overridden by subclasses of this test class?", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "maybe the append should be outside of this method, since the naming suggests that it only creates the table but doesn't do anything else?", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "nit: I'm aware that most tests in this class have a `test` prefix, but I feel like we could as well just omit that prefix from the newly added tests, to slightly improve readability, since that prefix doesn't add any value, wdyt?", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I don't think any other subclass of `CatalogTests` use these, so maybe they should be moved to the REST tests?", "path": "core/src/test/java/org/apache/iceberg/catalog/CatalogTests.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "`.path()` is deprecated. 
Can we switch all places to use `.location()`?", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I would have assumed that this always just contains FILE_A or is that not the case?", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "same question as above", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "nit: in other tests we just add this to the method signature instead of catching", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "The previous implementation here used to have hard coded table names that triggered planning behavior, my PR fixed that. I think @singhpk234 is just including that context in this comment but it's not really useful and it's confusing to someone who's reading this and doesn't have the history, so I'd remove it.\r\n", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "this test and the one above basically do the exact same thing but are named differently", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "maybe we should just keep `testCancelPlanMethodAvailability()` and remove the other two test methods?", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I feel like we're overcomplicating this quite a bit. Are we not able to just inline override the behavior in the adapter just once for all these tests, and then just setup the tests on the basis of how the behavior is overriden? 
I don't think we need to set this _per_ test in different ways.\r\n\r\nAlso, to make sure we are indeed exercising the paths we expect, it'd be nice to verify the actual http calls made if that's reasonable. ", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I don't see a test where `cancelPlan()` would return true, so maybe that would be good to add", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "See https://github.com/singhpk234/iceberg/pull/271/files#diff-37f64b622d5c3e0e890d94773e84ef49ceb3c1e9b0c03da313db9266b0e75afbR189", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I would remove this comment @singhpk234, it doesn't really add too much value and is confusing to a reader. But for context @nastra, before my PR, the way testing was done was just based on hard coded table names in RESTCatalogAdapter that would trigger certain planning behaviors. ", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I think we should only call this endpoint when it's actually available? `Endpoint.check(endpoints, V1_CANCEL_TABLE_SCAN_PLAN)`. 
Also we probably need to check the `V1_FETCH_TABLE_SCAN_PLAN` and `V1_FETCH_TABLE_SCAN_PLAN_TASKS` endpoints as well", "path": "core/src/main/java/org/apache/iceberg/rest/RESTTableScan.java", "line": 269, "type": "inline"}, {"author": "nastra", "body": "also if we really want to test whether the endpoint is available I think we would need to have a valid planId so that the underlying `V1_CANCEL_TABLE_SCAN_PLAN` endpoint is actually called", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "this seems like it should go into `TestResourcePaths`", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "it seems that this behavior is already tested elsewhere?", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "should this use `restTableFor` where that check is already performed?", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "nit: should probably be wrapped in a try-with-resources block", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "minor: what about just having a for loop and iterating over all available metadata table types? We do something similar in https://github.com/apache/iceberg/blob/19330fa19f833f481063704c1e0122e068289259/core/src/test/java/org/apache/iceberg/TestTableUtil.java#L103", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "this sounds like something that should be tested in a `TestScanTaskIterable`. 
Same for `testIteratorCloseTriggersCancel`", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "should this be using `restTableFor`?", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "this can also be changed to use AssertJ:\r\n```\r\nFileScanTask taskWithDeletes =\r\n assertThat(tasks.stream().filter(task -> !task.deletes().isEmpty()).findFirst())\r\n .isPresent()\r\n .actual()\r\n .get();\r\n```", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "same as above", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "same as above", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "same as above", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "can we have a test with an active plan where cancelling succeeds and returns true?", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "should this be calling the util methods that set this up?", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "`restTableFor`?", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n assertThat(tasks2).allMatch(task -> task.deletes().isEmpty());\r\n```", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": 
"```suggestion\r\n assertThat(tasks3).map(task -> task.file().path())\r\n```", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "you don't really need to call `stream()` here. Instead you could do this:\r\n```\r\nassertThat(tasks2)\r\n .map(task -> task.file().path())\r\n .containsExactlyInAnyOrder(FILE_A.path(), FILE_B.path());\r\n```", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "I removed it; it was basically there to skip serde of the response because it required setting parserContext on the mapper. Unfortunately we initialize a MAPPER inside TestRESTCatalog which skips the entire handling we have for parserContext inside the HTTPClient, so what I did was mimic what we are doing there and set an injectable value to provide the mapper the context to deserialize using the ParserContext, though the ideal would be testing stuff E2E using the existing mapper", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "+1, I forgot to remove this comment ", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "It's helpful for subclasses, and consistent with the dataFileObject imho, would be nice to keep it here", "path": "core/src/test/java/org/apache/iceberg/catalog/CatalogTests.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "I agree with you; within the current iteration I just wanted to avoid rewriting it based on one config (hence I was checking for a setter, for example I need both async / sync modes). Let me think a bit more on how to factor in the feedback !", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": 
"singhpk234", "body": "I see, so call cancel only when it's supported, though V1_FETCH_TABLE_SCAN_PLAN and V1_FETCH_TABLE_SCAN_PLAN_TASKS are called on a need-to-know basis, like the server is giving responses based on which the client is calling the equivalent APIs. I do have an uber level check here : \r\n\r\n```\r\n if (endpoints.contains(Endpoint.V1_SUBMIT_TABLE_SCAN_PLAN) && restServerPlanningEnabled) {\r\n return new RESTTable(\r\n ops,\r\n fullTableName(finalIdentifier),\r\n metricsReporter", "path": "core/src/main/java/org/apache/iceberg/rest/RESTTableScan.java", "line": 269, "type": "inline"}, {"author": "singhpk234", "body": "I see, though should we reconsider this in a separate PR focused on this interface in RESTCatalogAdapter? This just implements the interface for now.", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "we use the `test` prefix almost everywhere even for classes like `TestRESTUtil` \ud83d\ude05, but I do see how this can be redundant since we annotated it with `@Test`; TBH I am open to both, please let me know what you recommend", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "> I don't think we need to set this per test in different ways \r\n\r\nI thought about this more, I think we do need to test both async and sync, which themselves are two behaviours. One thing I can do is decide `async` based on the tableScan, which is a parameter; maybe I can write an if/else on the number of tasks, or file paths, like if it contains a delete or something like this, wdyt ? 
\r\n\r\npresently I have exposed a setter and a test case creates a planning behaviour and modifies the adapter", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "for newly added tests I always recommend people to omit the `test` prefix as it doesn't add any value. Also it just makes the naming longer than needed", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "Added IllegalState handling with the supported endpoint. Please let me know if this addresses the feedback ", "path": "core/src/main/java/org/apache/iceberg/rest/RESTTableScan.java", "line": 269, "type": "inline"}, {"author": "singhpk234", "body": "enabled via : https://github.com/apache/iceberg/pull/14661", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "This is not supported by client side planning itself as well, gonna take a long time ", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "singhpk234", "body": "To be enabled by : https://github.com/apache/iceberg/pull/14660", "path": "core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java", "line": null, "type": "inline"}, {"author": "amogh-jahagirdar", "body": "I still don't follow why this and the builder below are required? Are we not able to override the planning behavior just once in a manner which will force the client to exercise all the paths? ", "path": "core/src/test/java/org/apache/iceberg/rest/RESTCatalogAdapter.java", "line": 610, "type": "inline"}, {"author": "singhpk234", "body": "> Are we not able to override the planning behavior just once in a manner which will force the client to exercise all the paths\r\n\r\nIt is possible ! 
but having a setter simplifies the test, especially since we have tests with time travel and scans with multiple which would make assertions lengthy.\r\n\r\nApproach 1 (check on table name prefix or some property, not preferred ?): \r\n```java\r\n default boolean shouldPlanTableScanAsync(TableScan tableScan) {\r\n return tableScan.table.name.startsWith(\"a", "path": "core/src/test/java/org/apache/iceberg/rest/RESTCatalogAdapter.java", "line": 610, "type": "inline"}, {"author": "nastra", "body": "nit: `restScanPlanningEnabled` to align with the property name itself", "path": "core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n public void fetchScanTasksPath() {\r\n```\r\nto align with the naming of the other server-side planning test names", "path": "core/src/test/java/org/apache/iceberg/rest/TestResourcePaths.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "nit: you could also use the existing `withPrefix` / `withoutPrefix` variables", "path": "core/src/test/java/org/apache/iceberg/rest/TestResourcePaths.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "isn't that the fetchPlanningResult or cancel path?", "path": "core/src/test/java/org/apache/iceberg/rest/TestResourcePaths.java", "line": 264, "type": "inline"}, {"author": "nastra", "body": "isn't that the `fetchScanTasksPath`?", "path": "core/src/test/java/org/apache/iceberg/rest/TestResourcePaths.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "ideally this should be part of the `rest` package and made package-private, but I'm guessing we're having a conflict with the visibility of `TableScanContext`?", "path": "core/src/main/java/org/apache/iceberg/RESTTable.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I'm guessing we have the same visibility issue as with RESTTable here?", "path": 
"core/src/main/java/org/apache/iceberg/RESTTableScan.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "what is this path used for? I think this can be removed", "path": "core/src/main/java/org/apache/iceberg/RESTTable.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "just a nit but we might as well just store the actual headers by calling get on the supplier once in the constructor and then pass them without having to call get everywhere", "path": "core/src/main/java/org/apache/iceberg/RESTTableScan.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "does this need to be volatile? Mainly I'm asking because I don't see any synchronization on this variable. \r\n\r\nI see the variable has been mostly introduced for cancelPlan. Maybe cancelPlan should instead take the planId as a parameter and then we can avoid having this variable, wdyt?", "path": "core/src/main/java/org/apache/iceberg/RESTTableScan.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I believe this should use `Endpoint.check()` to be consistent with other error messaging around unsupported endpoints. Also can we add a test that verifies the error messages on the 3 endpoints when we run with a server that doesn't support them?", "path": "core/src/main/java/org/apache/iceberg/RESTTableScan.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "this should also use `Endpoint.check()`. Also can we have a test that verifies that the error is properly thrown on a server that doesn't support this endpoint?\r\n\r\nI actually wonder if it would make sense to add this check much earlier (in RESTTableScan) before we actually go down and start processing plan tasks, wdyt?", "path": "core/src/main/java/org/apache/iceberg/ScanTasksIterable.java", "line": null, "type": "inline"}], "9852": [{"author": "szehon-ho", "body": "Thanks for combining the earlier PRs. It's clearer now. \r\n\r\nSome preliminary comments. 
\r\n\r\n@nastra please take a look as well to see if there are any major concerns. ", "path": null, "line": null, "type": "review_body"}, {"author": "szehon-ho", "body": "Nit: use a javadoc link here", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveViewOperations.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Variable name and comments use 'table'. Should be more generic, like 'entity'", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreOperations.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "I like this code re-use, later we can move code up here like refreshMetadata(), now that we have the common interface.", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreOperations.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Why not just move the CommitStatus up to BaseMetastoreOperations?", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Check if null?", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "This looks like a bug, something seems not wired.", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveViewOperations.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "What is the point of this method? 
It just wraps another method.", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "I think we need to store and check previousFiles in ViewMetadata , just like we are doing for TableMetadata here.\r\n\r\nIn Hive, unless we enable to provide an atomic getAndSet like in the mode provided by [HIVE-26882](https://issues.apache.org/jira/browse/HIVE-26882)/ https://github.com/apache/iceberg/pull/6570 , it is always possible that between commit + check, there is an intermediate commit that gets in, and the check would then fail with UNKNOWN status. (I believe it is still the correct be", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Nit: let's put this near newTableOps()", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "unnecessary change", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "nit: let's move it to the HiveOperationsBase, so it is nearer the validateTableIsIceberg? (It is used by HiveCatalog and HiveViewOperations, just like validateTableIsIceberg is used by HiveCatalog and HiveTableOperations , so may make more sense there)", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveViewOperations.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Seems it belongs in HiveViewOperations or somewhere not tied with tables? 
", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Nit: call it tableTypeProp to reduce confusion with table.getTableType()?", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveViewOperations.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "what about keepHiveStats?", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveOperationsBase.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "why do we need to do this? I feel we should keep existing logic as much as possible.", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveOperationsBase.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "This is strange, the method is only used by listViews, what is the point of parameterizing? (Unless we are going to re-use some of the components, see following comments)\r\n", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "This may be a good idea, but I feel it's a bit early now, we dont even do this for regular tables.\r\n\r\nCan't we do this in a subsequent pr for both tables and views?", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Would be nice to make the listTables() method call this?", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Do we need this? 
We don't do it in renameTables.", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "We need to catch InvalidObjectException, like the case of tables.", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Could we have a common method for renameTable and renameView in this class? I feel a lot of code is exactly the same, and we may reduce errors that way.", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "In general, my comments in this method are that, we should keep the existing logic as much as possible, as it is quite delicate and we don't want to re-introduce bugs that have been fixed over the many releases.", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveOperationsBase.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "I don't see any code change by moving it actually, is there a change on your end? (as it moves just up the hierarchy and is still visible to sub classes)", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "From https://github.com/apache/iceberg/issues/9433, it looks like we need to delete the metadata file?", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": 270, "type": "inline"}, {"author": "szehon-ho", "body": "OK I am just not sure on whether this optimization makes sense, ie 100. Is it to avoid OOM on HMS side? (as anyway we have to materialize all the tables on our side) Hence initial suggestion to make a new pr for this. 
\r\n @pvary any thoughts on whether it's worth it?", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "I would just pass in 'entity name' to be more readable.", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "We do not keep all of the tables in memory. We only keep the `TableIndentifier`s in memory which is a much smaller amount of data than the whole Hive `Table` object.\r\n\r\nI have seen issues where there were many tables in a single database and it caused memory issues on HMS, and on the HMSClient side as well.\r\n", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "we shouldn't be breaking existing APIs. I understand that one might want to move the method to a different place for better usage across other places, but we need to do this in a way that doesn't break existing APIs.\r\n\r\nThis could be achieved e.g. 
by still keeping `BaseMetastoreCatalog::fullTableName` which would internally call `CatalogUtil.fullTableName`", "path": ".palantir/revapi.yml", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n public static void dropViewMetadata(FileIO io, ViewMetadata metadata) {\r\n```", "path": "core/src/main/java/org/apache/iceberg/CatalogUtil.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "given that this method is being publicly exposed, I think there should be tests that verify different inputs", "path": "core/src/main/java/org/apache/iceberg/CatalogUtil.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "this doesn't have to change because we need to keep `BaseMetastoreCatalog.fullTableName` so that we don't break the API", "path": "core/src/main/java/org/apache/iceberg/MetadataTableUtils.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n this.fullTableName = fullTableName(catalogName, tableIdentifier);\r\n```", "path": "core/src/main/java/org/apache/iceberg/inmemory/InMemoryCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n \"Failed to load view metadata for view: {}\",\r\n```", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n LOG.info(\"Skipping drop, view does not exist: {}\", identifier, e);\r\n```", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I don't think this error message is accurate, especially if the namespace has more than 1 level", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I think there should be a single method `private void checkNamespaceIsValid(Namespace namespace)` that throws an error when it's 
an empty namespace or a namespace with more than one level. \r\n\r\nThen, all methods in the catalog can just call `checkNamespaceIsValid(Namespace namespace)`.\r\n\r\nCan you please open a separate PR to handle this? ", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I agree. I don't see any reason to parameterize this if it's only used for views", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "yes, this is correct to have this here and there are tests that verify this behavior. This also needs to be fixed for tables (probably in a separate PR)", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "`originalTo` looks weird. Why not name it `to`?", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n TableIdentifier from, TableIdentifier to, String entityType) {\r\n```", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "also rather than passing a String `entityType` I'd consider introducing an enum: `enum ContentType { TABLE, VIEW;}`", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "why is the error msg for table and view different? I think this should also mention `Cannot rename %s to %s. Table does not exist...`", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "why rename this to `fromEntity`? 
I think leaving it as `table` is fine", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "is this some new validation code? it doesn't look like this has been previously done when renaming a table?", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n protected ViewOperations newViewOps(TableIdentifier identifier) {\r\n```", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I'm not sure if we really want to do this just for Hive. I'm not really familiar with how Hive deals exactly with this internally. Is there maybe a better way to solve this for views?", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "why does this require `BaseMetadata` as a parameter? I'm not convinced that this is the right refactoring. For views this method just passes uuid + schema. ", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveOperationsBase.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I'm not convinced using `BaseMetadata` here is the best approach just to pass properties or the metadata location(s)", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I don't think we can break this API / removing this class without deprecating it first. See https://iceberg.apache.org/contribute/#deprecation-notices", "path": ".palantir/revapi.yml", "line": null, "type": "inline"}, {"author": "nastra", "body": "I'm not convinced that introducing this abstraction makes stuff easier. 
In most cases the methods in Hive just pass 1-2 things like schema / properties around", "path": "core/src/main/java/org/apache/iceberg/BaseMetadata.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n Supplier<List<String>> metadataLocationsSupplier) {\r\n```", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreOperations.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "`entityName` is fairly abstract here, what does it refer to?", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreOperations.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "we're only effectively passing properties in the implementation, so why does `BaseMetadata` help here?", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveOperationsBase.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "this effectively only uses the properties, so no need to pass `metadata`?", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveOperationsBase.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "also `hiveLockEnabled()` check table properties which are only ever set on a table but not on a view", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveOperationsBase.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "why do we pass `BaseMetadata` here if this method is only applicable for tables? 
`TableProperties.ENGINE_HIVE_ENABLED` won't be set on a view", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveOperationsBase.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "same as above", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveOperationsBase.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "@nastra we had discussed this a bit, with @pvary as well , you can take a look at the original pr https://github.com/apache/iceberg/pull/8907/files without the common metadata. The main motivation is that HiveCatalog/ HiveTableOperations code is a bit hard to get right in terms of failure handling/locking, and cloning it is not that great. The common metadata is the minimum to allow that code to be shared.", "path": "core/src/main/java/org/apache/iceberg/BaseMetadata.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I understand that we want to reduce code duplication and it's a perfectly valid reason to think about having a base class, but https://github.com/apache/iceberg/pull/9852#discussion_r1516394903 is a bit concerning. What will be the behavior if e.g. `engine.hive.lock-enabled` or `engine.hive.enabled` deviates from its default value for a table but there's no option to control this on a view? Maybe I'm missing something obvious but it would be great to clarify how the final behavior in such cases ", "path": "core/src/main/java/org/apache/iceberg/BaseMetadata.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "OK good catch about 'engine.hive.enabled', its a problem on the other class (HiveOperationsBase).\r\n\r\n@nk1506 can we remove this code please ? 
For example, HiveOperationsBase default for hiveEngineEnabled() should be false, and only HiveTableOperations should override this.", "path": "core/src/main/java/org/apache/iceberg/BaseMetadata.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "It probably should not matter, as we will never store data in views, but it is definitely confusing. cc @pvary to verify", "path": "core/src/main/java/org/apache/iceberg/BaseMetadata.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "For engine.hive.lock-enabled, I think it is actually the desired behavior to do this on Table and Views. It will invoke a more optimized hive locking scheme, again @pvary being the author here. I've always thought the property name is confusing: https://github.com/apache/iceberg/pull/6570/files#r1143949780", "path": "core/src/main/java/org/apache/iceberg/BaseMetadata.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "The `engine.hive.lock.enabed` should be available for views. The `engine.hive.enabled` is used for setting the storage handler class, which is irrelevant for views", "path": "core/src/main/java/org/apache/iceberg/BaseMetadata.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I understand that for tables you have to configure more things and my comment is not about tables, but for views there's no reason to introduce this abstraction just to be able to pass the schema of the view. 
", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveOperationsBase.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "@nk1506 can we make a public enum in the BaseMetastoreEnum one and deprecate this one then?\r\n\r\nIts not perfect to have it public now but I dont think we can avoid it because Java cannot double-inherit from two base classes.", "path": ".palantir/revapi.yml", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Hm, it seems even in the lock mode, we get a lock around the time we commit and the time we checkCommitStatus(). So I wonder if the tables logic for checking the history (for multiple locations) is for the new mode (non-lock mode). @pvary do you know the context ? \r\n\r\nI think ViewMetadata already has this model (history() and ViewHistoryEntry()). It seems it should be set in general then. @nastra is there any reason we dont do this in other catalogs? I think this is a bit hacky way to do t", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "what do you think to pushdown the switch to a method in HiveOperationsBase called validateTable()? (pass in content). The switch statement there should choose the right string for the error message. ", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Why do we have to fetch the table again to throw an errro?", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "I guess we can explitily make this method take in everything that it needs to set, will that be better? (instead of BaseMetadata, pass in Schema, UUID, snapshot, etc). 
There is a non-trivial number of params though.", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveOperationsBase.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Not sure I follow, what happens when we don't have this? Is it just the error message is wrong? I feel we can mostly re-use the existing exception (maybe can add the type).", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveOperationsBase.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "There could be cases when we have a HMS commit failure, but the actual commit is successful, we just don't receive the answer. We try to check the current history for our commit. If we find it, then we say that the commit was successful independently of the previous HMS response", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "@nk1506 I understand that `engine.hive.lock.enabed` will be available for views. My point is that there's no reason to check `if (metadata.properties().get(TableProperties.HIVE_LOCK_ENABLED) != null)` on a view, because a view won't have `TableProperties.HIVE_LOCK_ENABLED` set in its properties", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveOperationsBase.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "yes this is the wrong place to do this validation. 
You're already doing this validation in https://github.com/apache/iceberg/pull/9852/files#diff-5ecfc223b311b523a12be7482b6235318fdd7535c371f899daf7e09eacddad77R322", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n default void setSchema(Schema schema, Map<String, String> parameters) {\r\n```", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveOperationsBase.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "HI, I checked with @RussellSpitzer who authored this originally for Hive. It's a real issue for Hive tables, because frequent commits lead to situations where we lose contact with HMS or HMS goes down but the commit goes through. Then it takes awhile to reconnect, and in the meantime there's other commits. It seems it was frequent for Hive tables, especially where we commit quite often (streaming for example).\r\n\r\nAs this mechanism is not in ViewMetadata, let's forget it then, given that we do", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I don't think we really need these methods", "path": "core/src/main/java/org/apache/iceberg/view/ViewMetadata.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "`TestHiveViewCommits` doesn't extend this class, so I don't think this is needed", "path": "core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "Yes, we need these @nk1506 , the logic should be exactly the same, except for the comparison of metadata in checkCommitStatus. 
In HiveTableOperations we compare against all metadata paths in the history, in HiveViewOperations we compare only the expected one.", "path": "core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "@nk1506 @nastra Do you guys think it's really necessary? Rather than just changing the message to 'table or view already exists'", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "imho, it is too much, I see now we have ViewAwareTableBuilder, TableAwareViewBuilder classes (not to mention extra call) just for the correct error message.", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "I may have missed the discussion, did we decide not to use common base class? Now we will have to review this code much more thoroughly in that case.", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveViewOperations.java", "line": 118, "type": "inline"}, {"author": "szehon-ho", "body": "How do we know it's a view? Can't it just be 'table or view already exists'.", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveViewOperations.java", "line": null, "type": "inline"}, {"author": "szehon-ho", "body": "In this case should be updateHiveView", "path": "hive-metastore/src/main/java/org/apache/iceberg/hive/HiveViewOperations.java", "line": null, "type": "inline"}], "8909": [{"author": "ajantha-bhat", "body": "Nessie throws `AlreadyExistsException` for a view in this case. Do we need to wrap it up with `NoSuchTableException` for a table in this case to generalize like other catalogs? 
", "path": "core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "Currently Nessie's `IcebergView` model stores only one dialect and representation. I didn't change Nessie side to hold multiple representation because it was needed for _global state_. Now that _global state_ is removed in Nessie, only view metadata location is enough. It will have current version id and all the representation for that version. ", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "Similar to Table's `TestBranchVisibility`", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestBranchVisibilityForView.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "similar to `TestNessieTable.java`", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieView.java", "line": 53, "type": "inline"}, {"author": "nastra", "body": "nit: should this say `content` maybe?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/UpdateableReference.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "minor: does this need to be public?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java", "line": 38, "type": "inline"}, {"author": "nastra", "body": "nit: line can be removed so that all final fields are grouped together", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "is this a message that we can typically rely on? 
what if the error msg changes?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "why does this have to be an `AtomicBoolean`?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I think it would be good to make this error msg clearer", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "any particular reason why `NessieUtil.removeCatalogName(..)` is only called on the `to` ref and not on the `from` ref?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n private List<TableIdentifier> listContent(Namespace namespace, Content.Type type) {\r\n```", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": 160, "type": "inline"}, {"author": "nastra", "body": "I think something like this would be better here:\r\n```\r\nString errorMsg = isView ? 
\"...<view error msg>..\" : \"...<table error msg>...\"\r\nthrow new AlreadyExistsException(errorMsg)\r\n```", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "same question: Is this an error msg that can be relied upon?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "rather than passing a boolean flag, maybe just pass the content type?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "it seems weird to only store one and not all of the representations", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "why does this need to be `AtomicBoolean`?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "adutra", "body": "The code here is mirroring the existing code for `renameTable`, where a call to `NessieUtil.removeCatalogName` exists since the beginning of the Nessie module:\r\n\r\nhttps://github.com/apache/iceberg/blob/87143d53c05de308332860be12a450a8c7afb95f/nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java#L167\r\n\r\nI don't see any test exercising this, so indeed it would be good to revisit this logic. That said, since it's not new to this PR, I'd be in favor of doing so in a follow-up PR.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "adutra", "body": "I think the plural is fine here: in English, \"content\" can be countable or uncountable, depending on the context. 
Here, it's countable, and so can appear in the plural.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": 160, "type": "inline"}, {"author": "adutra", "body": "Maybe this method could be merged with `table` since they are very similar:\r\n\r\n```java\r\n public <C extends IcebergContent> C asContent(\r\n TableIdentifier tableIdentifier, Class<? extends C> targetClass) {\r\n try {\r\n ContentKey key = NessieUtil.toKey(tableIdentifier);\r\n Content c = withReference(api.getContent().key(key)).get().get(key);\r\n return c != null ? c.unwrap(targetClass).orElse(null) : null;\r\n } catch (NessieNotFoundException e) {\r\n return null;\r\n }\r\n }", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "adutra", "body": "This catch block is not correct. We should never catch `NessieBadRequestException` in fact, as this denotes a misbehaving client.\r\n\r\nWhat is happening here is that `validateContentForRename` is not working as designed: see my next comment for a solution.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "adutra", "body": "This is incorrect. `isView` doesn't mean that `existingToContent` is necessarily a view. We have tests that exercise the case where we rename a view to a table. 
In such cases, `existingToContent` will be a table.\r\n\r\nHere is a correct version:\r\n\r\n```java\r\n IcebergContent existingToContent = asContent(to, IcebergContent.class);\r\n if (existingToContent != null) {\r\n Content.Type type = existingToContent.getType();\r\n if (type == Content.Type.ICEBERG_VIEW) {\r\n throw new AlreadyE", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "adutra", "body": "Here too, I think this catch block is a code smell, but this one is trickier to solve.\r\n\r\nThe issue is that in the test called `createTableViaTransactionThatAlreadyExistsAsView`, we create a transaction based on current hash (let's call it c0), then we commit something outside the transaction (let's call it c1), then we attempt to commit the transaction.\r\n\r\nAs a result, the transaction/client is trying to commit using the expected hash of c1, but it should really use c0 as the expected commit ha", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "Either Nessie can store all the representations or None I guess. But storing representations is not really required as global state is removed in Nessie, only view metadata location is enough. Lets see what others think. \r\n\r\ncc: @dimas-b, @snazy ", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "I think it can be plural as mentioned here.\r\nhttps://github.com/apache/iceberg/pull/8909#discussion_r1375260224", "path": "nessie/src/main/java/org/apache/iceberg/nessie/UpdateableReference.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "Just like how `NessieTableOperations` is public, I am keeping it public thinking some engine integration needs this to be public if they have custom logic. 
\r\n\r\nNot just Nessie, `TableOperations` of all the catalogs are public\r\n", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java", "line": 38, "type": "inline"}, {"author": "ajantha-bhat", "body": "> NessieUtil.removeCatalogName exists since the beginning of the Nessie module:\r\n\r\nTrue. I checked the usage of `from`, it is used to prepare key for Nessie commit. So, it should not have catalog name. I will just update the code to unify it. ", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "good catch. ", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "In my understanding, commits will be retried on failure for `NessieConflictException` and it will call `doRefresh()` for retry which will again use the new reference?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "Because I need to set the failure from `NessieUtil.handleExceptionsForCommits` also. I cannot return a boolean from that method as it throws exception.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "same as above.\r\n\r\n> Because I need to set the failure from `NessieUtil.handleExceptionsForCommits` also. I cannot return a boolean from that method as it throws exception.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "I think the proper way to fix it is to check if any different kind of content exist for the key while committing this key in `doCommit()` of `NessieTableOperation` and `NessieViewOperation`. 
But that is an extra round trip with server and might degrade commit performance slightly? ", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "adutra", "body": "1. Conflicts will _not_ be retried for `NessieReferenceConflictException` since we throw specialized errors, see here: https://github.com/apache/iceberg/blob/89b3a2111968f0aed79ec0ce1c87e2035e0d7f7c/nessie/src/main/java/org/apache/iceberg/nessie/NessieUtil.java#L188-L192\r\n2. My main point was: we should get rid of this `catch (NessieBadRequestException ex)` block as, in my opinion, it denotes a code smell.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "adutra", "body": "> I think the proper way to fix it is to check if any different kind of content exist for the key while committing this key in doCommit() of NessieTableOperation and NessieViewOperation. But that is an extra round trip with server and might degrade commit performance slightly?\r\n\r\nThis solution is viable but would imho be less efficient since, as you mentioned, it involves an extra roundtrip. \r\n\r\nI personally think that if `NessieTableOperation` could \"memorize\" the commit hash of that last time ", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "That particular testcase is tricky because `createTransaction()` will call `refresh()` which already has code to check if any other content exist with the same name. But during `commitTransaction()` it will not call `refresh()` so that code is never hit and we need to check again if any new content created with same key concurrently. \r\n\r\nSo, I have added that logic now instead of catching `NessieBadRequestException` later on.\r\n\r\nI am not sure about `memorize` commit hash solution. 
Last time (som", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "Even Hive view support has the same problem\r\nhttps://github.com/apache/iceberg/pull/8907", "path": "core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java", "line": null, "type": "inline"}, {"author": "adutra", "body": "This API endpoint does not throw any errors when the content does not exist. It only throws if the reference does not exist. So the comment is not accurate. Also, it seems weird to ignore this error, even if the commit attempt below will probably throw the same error again.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "adutra", "body": "Checking for the content here imho defeats the purpose of calling `doRefresh` only when necessary. I think I still prefer to store the hash of the last refresh, then use it for commit. @dimas-b wdyt?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "adutra", "body": "This won't be sufficient. If the content exists, it could be anything, including another table, or a namespace, or something else. In which case the commit will fail with `NessieBadRequestException` again :-(", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "adutra", "body": "Nit: `schema` -> `SCHEMA`", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieView.java", "line": null, "type": "inline"}, {"author": "adutra", "body": "Nit: `altered` -> `ALTERED`", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieView.java", "line": null, "type": "inline"}, {"author": "adutra", "body": "Nit: the message says \"must not be equal\" but the assertion is `isEqualTo`. 
Also, there is no more global state in Nessie, so I'm not sure if this assertion is useful?", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieView.java", "line": null, "type": "inline"}, {"author": "adutra", "body": "Nit: the test is actually updating an existing view.", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieView.java", "line": null, "type": "inline"}, {"author": "adutra", "body": "Nit: change `renamedTableName` to `renamedViewName`.", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieView.java", "line": null, "type": "inline"}, {"author": "adutra", "body": "Nit: why not simply `testRename` ?", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieView.java", "line": 219, "type": "inline"}, {"author": "adutra", "body": "Nit: honestly the message is incomprehensible in English :-)\r\n\r\nHow about:\r\n\r\n```\r\nCannot rename view 'view' on reference 'Something' to 'renamed_view' on reference 'iceberg-view-test' : source and target references must be the same.\r\n```", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieView.java", "line": null, "type": "inline"}, {"author": "adutra", "body": "Nit: maybe add after:\r\n\r\n```java\r\nAssertions.assertThat(catalog.dropView(VIEW_IDENTIFIER)).isFalse();\r\n```", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieView.java", "line": 318, "type": "inline"}, {"author": "adutra", "body": "This test is identical to the previous one, apart from the fact that it uses a different `TableIdentifier` instance. 
Maybe remove?", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieView.java", "line": null, "type": "inline"}, {"author": "adutra", "body": "Maybe add another view here, then call `listViews` again and test that we have 2 views now.", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieView.java", "line": 331, "type": "inline"}, {"author": "adutra", "body": "```suggestion\r\n Assertions.assertThat(catalog.listViews(VIEW_IDENTIFIER.namespace()))\r\n .singleElement()\r\n .isEqualTo(VIEW_IDENTIFIER);\r\n```", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieView.java", "line": null, "type": "inline"}, {"author": "adutra", "body": "There is no need to create this extra catalog.", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieView.java", "line": null, "type": "inline"}, {"author": "adutra", "body": "This seems problematic to me. It would be better to engage with core devs and check whether the metadata is allowed to have extra properties. @nastra wdyt?\r\n\r\nAt first glance, the assertion seems indeed a bit restrictive:\r\n\r\n```java\r\n assertThat(((BaseView) renamed).operations().current())\r\n .usingRecursiveComparison()\r\n .ignoringFieldsOfTypes(Schema.class)\r\n .isEqualTo(original);\r\n```\r\n\r\nMaybe change the semantics to \"contains all the original fields but may also contain", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieViewCatalog.java", "line": 151, "type": "inline"}, {"author": "ajantha-bhat", "body": "So far, the REST and in-memory catalogs follow one pattern (NoSuchTableException) and all other catalogs follow another pattern (AlreadyExistsException).\r\n\r\nI tried making Nessie follow the REST catalog, but it breaks other testcases. I will check more. ", "path": "core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "was copied from `TestNessieTable`. I will fix it in this testcase. 
Refactoring `TestNessieTable` can be done in a follow up to keep the PR scope small. ", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieView.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "This test uses `ImmutableTableReference` for identifier. Simple rename is tested from `TestNessieViewCatalog`", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieView.java", "line": 219, "type": "inline"}, {"author": "ajantha-bhat", "body": "removed `AtomicBoolean` as it can be simplified. ", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "removed `AtomicBoolean` as it can be simplified. ", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "removed `AtomicBoolean` as it can be simplified. ", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieUtil.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "\r\n`replaceTableViaTransactionThatAlreadyExistsAsView`\r\n\t`NessieTableOperations.doRefresh() `--> throws `AlreadyExistsException`. But expecting `NoSuchViewException`.\r\n\r\nIf I fix it, another test case (createOrReplaceTableViaTransactionThatAlreadyExistsAsView) fails.\r\nBecause from the same place `doRefresh()` we are expecting two different kind of exceptions. I think test case need to be modified instead of unifying code.", "path": "core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "True. I mentioned this above.\r\n\r\n> Either Nessie can store all the representations or None I guess\r\n\r\nI think we don't need to store any representation if we don't need those for metadata in Nessie project. So, I am waiting for @snazy's reply. 
", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I actually think that the way Nessie currently handles this error case is correct by saying that a view with the same name already exists when you try to create a table. To achieve the same for REST is actually slightly more difficult. There's also a TODO in the test a few lines above where I wanted to improve the error reporting for these 2 particular cases that were adjusted in this test. That being said, I'm +1 on these 2 changes in the test", "path": "core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n throw new UnsupportedOperationException(\"Cannot rename content type: \" + type);\r\n```", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "or maybe `Cannot perform rename for content type...`", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "why not mention the actual identifier in the message here?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "nit: maybe rename to `contentType` instead of using the `String` suffix, since you're already using the same var name in `dropContent()`", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n private String buildCommitMsg(ViewMetadata base, ViewMetadata metadata, String viewName) {\r\n```", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "mainly because the `view` part is already 
implicit through the method parameters", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "rather than having two cases for a create/replace, you could just use the `operation` from above: `return String.format(\"Iceberg view %sd with name %s\", operation, viewName);`", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I think this is the last major point that needs to be adressed before this PR can go in", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n \"Cannot commit: Reference hash is out of date. Update the reference '%s' and try again\",\r\n```", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieUtil.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "does this need to be moved to `NessieUtil` so that it can be used from Trino? 
(similar to https://github.com/apache/iceberg/pull/7893)", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "unnecessary newline", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n static String viewMetadataLocation(NessieCatalog catalog, TableIdentifier identifier) {\r\n```", "path": "nessie/src/test/java/org/apache/iceberg/nessie/BaseTestIceberg.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "could also be done in a single line: `return ((BaseView) catalog.loadView(tableIdentifier)).operations().current().metadataFileLocation();`", "path": "nessie/src/test/java/org/apache/iceberg/nessie/BaseTestIceberg.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n // Updating view with out-of-date hash. We expect this to succeed because of retry despite the\r\n```", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestBranchVisibilityForView.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n \"Cannot rename ICEBERG_VIEW 'view' on reference 'Something' to 'rename_view_name' on reference 'iceberg-view-test': source and target references must be the same.\");\r\n```", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieView.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "there is an unnecessary space. 
The same goes for the error message for `ICEBERG_TABLE`", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieView.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I think that would be ok to change the semantics to `contains all the original fields but may also contain some extra ones (in case a catalog impl decides to add new properties)`", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieViewCatalog.java", "line": 151, "type": "inline"}, {"author": "nastra", "body": "this should be addressed by https://github.com/apache/iceberg/pull/9012/commits/c63be6534a0cfc595905056309b469c6561d414b", "path": "core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "@nastra: Awesome. Thanks for fixing it for REST and in-memory catalog. That should help in having a unified behaviour. ", "path": "core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "No need to change this as discussed in https://github.com/apache/iceberg/pull/8909#discussion_r1387607715", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "Not just that. The original property might be modified during rename for Nessie. \r\nLike \"nessie.commit.id\" is different for the original and the renamed one. 
\r\n\r\nShall I ignore the properties comparison as a whole?", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieViewCatalog.java", "line": 151, "type": "inline"}, {"author": "nastra", "body": "I would suggest doing such bigger refactorings as part of a separate PR, otherwise it will be hard to review", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java", "line": 38, "type": "inline"}, {"author": "ajantha-bhat", "body": "> I don't understand how this related to \"global state\" - those things are not related to each other at all.\r\n\r\nThe `IcebergTable` model used to only store the table metadata location. But then it started storing other info like schema id, snapshot id etc to support global state. The `IcebergView` model developed was also based on the design of `IcebergTable` (which was based on global state). Otherwise just the view metadata location would have been enough. That's how I thought it was designed. \r\n\r\n> Nessie should really no", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "Release not yet done (only merged to master)", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieViewCatalog.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "These are used in the test code (`TestNessieView`, `TestNessieTable`)\r\n\r\nIt was moved from `TestNessieTable`. So, it can also be used for Views. ", "path": "nessie/src/test/java/org/apache/iceberg/nessie/BaseTestIceberg.java", "line": 308, "type": "inline"}, {"author": "ajantha-bhat", "body": "The critical code is already extracted to `NessieIcebergClient`. \r\n\r\n`NessieTableOperations` and `NessieViewOperations` have only two methods, `doRefresh` and `doCommit`, which operate on different objects: one is `TableMetadata` and one is `ViewMetadata`. 
", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java", "line": 38, "type": "inline"}, {"author": "ajantha-bhat", "body": "While committing the Table, we found that view existing with the same key. Hence we are failing the commit. \r\n\r\nOther catalogs behaves like this. Common testcase in `ViewCatalogTests` will fail If I remove this check. ", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "yes this is correct behavior here IMO", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "Aah, `SqlText` is not nullable. So, I cannot construct without it. I will raise a PR at Nessie side to make it Nullable.\r\n\r\nhttps://github.com/projectnessie/nessie/blob/fff9154da6a19cf289c09f2996b75f3cd15e0ca0/api/model/src/main/java/org/projectnessie/model/IcebergView.java#L56", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "> This test class duplicates the test cases - most are not specific to views.\r\n\r\nI will remove `TestBranchVisibilityForView`", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestBranchVisibilityForView.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "This particular testcase `createOrReplaceViewThatAlreadyExistsAsTable` from `ViewCatalogTests` expects this. (similar to other catalogs)\r\n\r\nBelow ISE, I can change to `AlreadyExistsException`", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "`createTableViaTransactionThatAlreadyExistsAsView` from `ViewCatalogTests` will fail without this check. It throws `NessieBadRequestException` without this check. 
\r\n\r\ndoRefresh() won't be called when Iceberg calls `commitTransaction` from the above testcase. Hence, it needs this check. ", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "I checked again. `NessieTableOperations` and `NessieViewOperations` are based on two different classes, `BaseMetastoreTableOperations` and `BaseViewOperations`. I cannot make a base class extending these two. \r\n\r\nAlso, having a common interface is not very useful here as both view and table cannot implement the same interface. \r\n\r\nLastly, I checked about a static util class for common functionalities. The code looks similar but it is a lot different as one uses a view and one uses a table (So one uses table ", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java", "line": 38, "type": "inline"}, {"author": "ajantha-bhat", "body": "added a dummy sqlText and dummy dialect as we have concluded that we don't have to modify anything on the Nessie side. ", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "extracted the common code to `commitContent`", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "this is being addressed by https://github.com/apache/iceberg/pull/9012", "path": "core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "this is being addressed by https://github.com/apache/iceberg/pull/9012", "path": "core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "This is similar to `updateTableMetadataWithNessieSpecificProperties` present in this file. Trino will also use this method. 
Trino will not use `NessieViewOperations` as it has to have its own class. ", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieUtil.java", "line": 207, "type": "inline"}, {"author": "ajantha-bhat", "body": "updated to throw an exception that the content type is not matching.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "When Trino supported the Nessie catalog, we moved `loadTableMetadata` from `NessieTableOperations` to `NessieUtil` to avoid code duplication in Trino as it has its own `NessieTableOperations` impl.\r\n\r\nSimilarly, when views are supported for Trino, we need this function in Trino. So, keeping it here can help for the same purpose. ", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieUtil.java", "line": 207, "type": "inline"}, {"author": "ajantha-bhat", "body": "Above I got a comment from Alex that we should not catch `NessieBadRequestException` in the client code. Hence, I was doing it for all the cases. \r\n\r\nNow, I am catching `NessieBadRequestException` and checking the content existence only during this error case.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "Above I got a comment from Alex that we should not catch `NessieBadRequestException` in the client code. Hence, I was doing it for all the cases. \r\n\r\nNow, I am catching `NessieBadRequestException` and checking the content existence only during this error case.", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java", "line": null, "type": "inline"}, {"author": "ajantha-bhat", "body": "two classes can't implement the same interface. 
But in this case, I am not finding what common function they need to implement because `doRefresh()` and `doCommit()` work on different kinds of objects in the two classes, and the two classes override `doRefresh()` and `doCommit()` from different base classes. One is `BaseMetastoreTableOperations` and one is `BaseViewOperations`. \r\n\r\nSo, I am not clear on how to do this. \r\nFeel free to add a commit on top of this PR. ", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java", "line": 38, "type": "inline"}, {"author": "nastra", "body": "nit: should this maybe mention the content type here?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "it seems a bit weird that there are `commitTable()` / `commitView()` methods, but all the other view/table-related methods are unified to `xyzContent()`. Wouldn't it be better to just leave `listTables()` (and others) and introduce `listViews()` (and similar)?", "path": "nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "nit: can this be moved to the line above?", "path": "nessie/src/test/java/org/apache/iceberg/nessie/BaseTestIceberg.java", "line": 319, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n Schema schema = new Schema(required(2, \"age\", Types.IntegerType.get()));\r\n```", "path": "nessie/src/test/java/org/apache/iceberg/nessie/BaseTestIceberg.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n Schema schema = new Schema(required(1, \"id\", LongType.get()));\r\n```", "path": "nessie/src/test/java/org/apache/iceberg/nessie/BaseTestIceberg.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "```suggestion\r\n @TempDir private Path temp;\r\n```", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieViewCatalog.java", "line": 
null, "type": "inline"}, {"author": "nastra", "body": "@ajantha-bhat can you please follow-up on this in a separate PR?", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieViewCatalog.java", "line": 151, "type": "inline"}, {"author": "nastra", "body": "do these need a `@Test` annotation to be executed?", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieViewCatalog.java", "line": 191, "type": "inline"}, {"author": "nastra", "body": "the current version will always be non-null, so you might rather want to test that the version id is what you're expecting", "path": "nessie/src/test/java/org/apache/iceberg/nessie/TestNessieView.java", "line": null, "type": "inline"}], "4588": [{"author": "wypoon", "body": "This class is not strictly necessary. I could simply use an AtomicLong in its place. (I do not think, however, that the counter needs to be thread-safe, as each task will have its own counter.)", "path": "core/src/main/java/org/apache/iceberg/deletes/DeleteCounter.java", "line": 22, "type": "inline"}, {"author": "wypoon", "body": "I do not know why the `ConstantColumnVector` is constructed with a hardcoded integer type. I observe that this code here can get called with `ConstantVectorHolder`s containing a constant of other primitive types. If `copy()` is called on an `InternalRow` in the `ColumnarBatch` containing this `ColumnVector`, then we get a `ClassCastException`. 
In the situations where I call `copy()`, the `ConstantColumnVector` contains either a boolean or an integer, so I only fix it for those possibilities.", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnVectorWithFilter.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "@flyrain this code was added by you; can you please explain?", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnVectorWithFilter.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "I copied some code from class `IcebergArrowColumnVector`, and didn't change its construction logic. But I agree with you that we shouldn't hard-code the int type.", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnVectorWithFilter.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I'm not sure these trace messages will be very valuable, do you think we need them for the metric code or were they just temporary for testing?", "path": "core/src/main/java/org/apache/iceberg/deletes/Deletes.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "nit: Start with a capital letter", "path": "data/src/main/java/org/apache/iceberg/data/DeleteFilter.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "This probably can be reverted?", "path": "data/src/main/java/org/apache/iceberg/data/DeleteFilter.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Probably remove these messages?", "path": "data/src/main/java/org/apache/iceberg/data/DeleteFilter.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Should start with a Capital", "path": "data/src/main/java/org/apache/iceberg/data/DeleteFilter.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Start with a capital ", "path": 
"spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/BatchDataReader.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Capital? But maybe remove this one, not sure it's adding a lot of value since you could just add the line in the `deleteFilter` code", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/BatchDataReader.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I would probably just break up the ternary operator below and log when creating the Delete filter?", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/BatchDataReader.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Capital here, but i'm not sure this is a helpful debug statement either? Probably add a line in the DeleteFilter Constructor instead?", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Not sure how these changes are related to the PR", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchQueryScan.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Probably separate this into another PR", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkCopyOnWriteScan.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "should be number of splits now?", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "How about scanTask.files().size?", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Capital first", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/metrics/NumDeletes.java", "line": null, "type": "inline"}, 
{"author": "RussellSpitzer", "body": "Should probably extend CustomSumMetric\r\n\r\nhttps://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/connector/metric/CustomSumMetric.html", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/metrics/NumDeletes.java", "line": 24, "type": "inline"}, {"author": "RussellSpitzer", "body": "Also should be CustomSumMetric\r\nhttps://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/connector/metric/CustomSumMetric.html", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/metrics/NumSplits.java", "line": 24, "type": "inline"}, {"author": "RussellSpitzer", "body": "nit: final", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/metrics/TaskNumDeletes.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "This feels like we are subverting the intent of this test which is to use the DF reader method to actually read the table. I think we'll need another approach for counting deletes or possibly just explicitly state how many deletes there are in tests that need to check it. 
At least until we have a method of reading a table with a user readable marker for deleted rows like `_isDeleted`", "path": "spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeletes.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "I could replace \r\n```\r\n CloseableIterable<CombinedScanTask> tasks = TableScanUtil.planTasks(\r\n table.newScan().planFiles(),\r\n TableProperties.METADATA_SPLIT_SIZE_DEFAULT,\r\n TableProperties.SPLIT_LOOKBACK_DEFAULT,\r\n TableProperties.SPLIT_OPEN_FILE_COST_DEFAULT);\r\n```\r\nwith\r\n```\r\n SparkScanBuilder scanBuilder = new SparkScanBuilder(spark, table, CaseInsensitiveStringMap.empty());\r\n scanBuilder.pruneColumns(sparkSchema);\r\n SparkScan scan = (SparkScan) scanBui", "path": "spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeletes.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "I use a lowercase \"number\" to be consistent with other (standard) metrics appearing in the Spark UI (e.g., \"number of output rows\"). It would look odd otherwise.", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/metrics/NumDeletes.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "I am aware of `CustomSumMetric` but there is no benefit in extending it instead of `CustomMetric`, since I would still override its `aggregateTaskMetrics` method as that returns the `String` representation of the long instead of a locale-formatted string as I do (and as other longs are shown in the Spark UI). 
So if there are 1,000 deletes, it should show as \"1,000\" and not \"1000\".", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/metrics/NumDeletes.java", "line": 24, "type": "inline"}, {"author": "RussellSpitzer", "body": "That is the path it takes at this moment, but that's the issue here, we want this test to check the output of\r\n``` \r\nDataset<Row> df = spark.read()\r\n .format(\"iceberg\")\r\n .load(TableIdentifier.of(\"default\", name).toString())\r\n .selectExpr(columns);\r\n```\r\n\r\nHaving this copy the implementation details decouples the test from what actually may be happening when we change the code in the future", "path": "spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeletes.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "So we aren't using CustomSumMetric because its string output is different than we want? I think that's ok but seems like we should be changing CustomSumMetric in Spark instead of having our own summing class that does the same thing but outputs the number differently.", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/metrics/NumDeletes.java", "line": 24, "type": "inline"}, {"author": "wypoon", "body": "Here is another option:\r\nI'll leave `TestSparkReaderDeletes` alone, and instead create a new class that also extends `DeleteReadTests` (or even `TestSparkReaderDeletes`) and override `rowSet` there. It is perhaps also not necessary to run all the tests in `DeleteReadTests` or `TestSparkReaderDeletes` in this new class.
I had also increased the parameter space of `TestSparkReaderDeletes` in order to cover all 3 file formats, but I can dial it back and distribute the increased coverage to the new ", "path": "spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeletes.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "I can open a Spark PR to change `CustomSumMetric`, but for now I'd just extend `CustomMetric`.\r\nThe thing is, Spark instantiates your `CustomMetric` class and using reflection, calls its `aggregateTaskMetrics` method to get the value to display in the UI. However, even if you extend `CustomSumMetric` and don't override `aggregateTaskMetrics`, Spark still needs to instantiate your `CustomMetric` class, so you can't get away from putting your `CustomMetric` class in the classpath of the Spark Hist", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/metrics/NumDeletes.java", "line": 24, "type": "inline"}, {"author": "kbendick", "body": "Given that we have defined the `DeleteCounter` interface ourselves, does it make sense to define a default `NullDeleteCounter` that is a no-op?\r\n\r\nThere are a number of places where we check `if (counter != null)`.", "path": "core/src/main/java/org/apache/iceberg/deletes/BitmapPositionDeleteIndex.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Also, is there any concern that this makes calls to `isDeleted` not idempotent?", "path": "core/src/main/java/org/apache/iceberg/deletes/BitmapPositionDeleteIndex.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "There are existing tests that have no need for counting deletes, that call some static methods in `Deletes`. I kept those forms of the methods and had them call new forms that take a `DeleteCounter`, passing a null `DeleteCounter`. That is the reason why a `DeleteCounter` field might be null. 
The way the code is now, if we decide to do without `DeleteCounter` and simply use an `AtomicLong`, then it can be updated very easily. I do understand your suggestion for a `NullDeleteCounter`.\r\n\r\nYou rais", "path": "core/src/main/java/org/apache/iceberg/deletes/BitmapPositionDeleteIndex.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "I have removed these trace messages.", "path": "core/src/main/java/org/apache/iceberg/deletes/Deletes.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "Added a note to the javadoc for `PositionDeleteIndex#isDeleted`.", "path": "core/src/main/java/org/apache/iceberg/deletes/BitmapPositionDeleteIndex.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "that's fine then, I forgot Spark did that", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/metrics/NumDeletes.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Should there be a \"parquet\", false here as well?", "path": "spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeleteCount.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Instead of doing this would it be possible for us to use a custom SparkListener to just grab the status at the end of the job and check those values? I think it is a safer approach than reimplementing the read path here.\r\n\r\nWe could also just modify the SQLAppStatusListener? or something like that", "path": "spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeleteCount.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "I could add that if you think it is beneficial. I was trying to reduce overlap with `TestSparkReaderDeletes`. For testing the delete count, I think the parameters I have added cover all 3 file formats as well as both vectorized and non-vectorized, so I think the existing code paths are covered (to my knowledge). 
At present, I don't think adding `{\"parquet\", false}` exercises any additional code path for the delete tracking.", "path": "spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeleteCount.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Shouldn't we be able to determine the number of deleted rows here by checking the number of entries in the bitmap? So shouldn't we be able to just count in the \"delete methods\"? Also do we have to pass through the counter here? Seems like we could just get this value out of the index when we are done building it instead of passing through the counter into the index?", "path": "core/src/main/java/org/apache/iceberg/deletes/BitmapPositionDeleteIndex.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "Let me look into that approach.", "path": "spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeleteCount.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "You are right that one can get the number of deleted positions in the `PositionDeleteIndex` by `roaring64Bitmap.getLongCardinality()`.\r\nHowever, having one counter keeps the bookkeeping simple and easy to reason about.\r\nThere are multiple paths through the code that can be taken, where deletes may be applied, and the unifying thing here is that the same counter is passed into all these places.\r\nIt is more difficult to reason through the counting if I have to account for the `PositionDeleteIndex`", "path": "core/src/main/java/org/apache/iceberg/deletes/BitmapPositionDeleteIndex.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I think having \"isDeleted\" have a side effect ruins the bookkeeping simplicity for me. I'd much rather if we were to do this counter approach we would use the \"delete\" methods rather than the \"isDeleted\" method.
The delete method already has a side effect and we would expect it to be the place where the count would happen as opposed to \"isDeleted\" which we have to add a JavaDoc note to for illustrating the side effect behavior. \r\n\r\nI still am not really sold on passing through the counter here, it ", "path": "core/src/main/java/org/apache/iceberg/deletes/BitmapPositionDeleteIndex.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "Since it has been almost 4 weeks since I touched this (it seems that's how long the last update has gone unreviewed), I had forgotten the details of what I had investigated and had done. I went through the code paths again, and I see now that there is just one place where `PositionDeleteIndex#delete()` is called and where I need to track the delete count in the `PositionDeleteIndex`. So your suggestion is very feasible and not messy as I feared!\r\nThank you for the suggestion. I'll update the PR ", "path": "core/src/main/java/org/apache/iceberg/deletes/BitmapPositionDeleteIndex.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "The default is true, which means that when the `SQLAppStatusListener` receives a `SparkListenerSQLExecutionEnd` event, it asks its `ElementTrackingStore` to asynchronously aggregate the metrics. This async call sometimes results in the metric we need not being available. To make the tests more predictable, ensure that a synchronous call to aggregate metrics is used.", "path": "spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeletes.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "Wait for the `SparkListenerSQLExecutionEnd` to be received. 
If our `SparkListener` has received the event, the `SQLAppStatusListener` should have received it too and called its `ElementTrackingStore` to aggregate the metrics.", "path": "spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeletes.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "@kbendick I considered using an `Optional<DeleteCounter>` instead of a `DeleteCounter` that could be null, but it's six of one and half a dozen of the other. In the end, I adopted your suggestion to have a no-op `DeleteCounter`.", "path": "core/src/main/java/org/apache/iceberg/deletes/BitmapPositionDeleteIndex.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "This is no longer applicable as I have reverted splitting out testing delete count into a separate test class.", "path": "spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeleteCount.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "This is a code path that was not previously exercised by the unit tests.\r\nWe add a configuration that allows us to control the `streamFilterThreshold` and cause this code path to be taken.", "path": "data/src/main/java/org/apache/iceberg/data/DeleteFilter.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Can we extract all of these set/streamFilter changes out into another PR?", "path": "data/src/main/java/org/apache/iceberg/data/DeleteFilter.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "I will revert the last commit from this PR. 
If this PR can be merged without that, I'll open a separate PR once it is merged, consisting of that change.", "path": "data/src/main/java/org/apache/iceberg/data/DeleteFilter.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "we can skip the cast here if we just keep the type on 102", "path": "core/src/main/java/org/apache/iceberg/deletes/Deletes.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Instead of passing through the counter here couldn't we just do\r\ncounter.increment(deleteRowPositions.deleteCount)?", "path": "data/src/main/java/org/apache/iceberg/data/DeleteFilter.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I think this may be ok to include in the interface itself, I can't think of an implementation which couldn't track this?", "path": "core/src/main/java/org/apache/iceberg/deletes/BitmapPositionDeleteIndex.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Not sure why we need this class, Is there a case when we really don't want to count? I would imagine perf wise the difference between counting and not counting is pretty small?", "path": "core/src/main/java/org/apache/iceberg/deletes/NonCounter.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Do we need these polymorphisms? Or is this just to keep apis? ", "path": "core/src/main/java/org/apache/iceberg/deletes/Deletes.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Do we need to pass through the counter here? Seems like this should just be an internal member of delete filter that is created with the delete filter? Or is this just because we have no other way of getting the counter information back out of the applied delete filter?", "path": "data/src/main/java/org/apache/iceberg/data/DeleteFilter.java", "line": 84, "type": "inline"}, {"author": "RussellSpitzer", "body": "Is this just for legacy compatibility?
Do we have any users of this here? as a protected method I would think we would be free to drop it if we wanted to.", "path": "data/src/main/java/org/apache/iceberg/data/DeleteFilter.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Is this just a refactoring?", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Instead of the apis here, could we just modify \"extractDeleteCount()\" \r\n\r\nto instead be something like \"\r\n```java \r\nlong lastMetricCount(String metricName) {\r\n// Get executionID of last run thing\r\n// Find metricName\r\n// return value for MetricName\r\n}\r\n```\r\n\r\nThis way we can avoid some state in the job?", "path": "spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeletes.java", "line": 167, "type": "inline"}, {"author": "RussellSpitzer", "body": "Oh this is great, Looking into the apis you dug up I think there may be an even easier way to do this. Sorry I didn't realize you could do this before\r\n\r\nLooking into sharedState you can do something like\r\n```scala\r\n// Some operation\r\nscala> spark.range(1, 100).withColumn(\"x\", col(\"id\")).withColumnRenamed(\"id\", \"y\").writeTo(\"local.default.test\").append\r\n\r\n// UI Metrics\r\nscala> spark.sharedState.statusStore.executionsList.last.metrics\r\nres41: Seq[org.apache.spark.sql.execution.ui.SQLPlanMetric] =", "path": "spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeletes.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Also can you please add this to the Num Splits PR so we can have a few tests in there as well? 
This will be really useful for all of our future metric additions", "path": "spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeletes.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Left some more general structural comments. I think ideally I want to push that state up as high as we can in the stack. Anything which doesn't actually need to receive a counter in its constructor should avoid it imho.\r\n\r\nI think we would also really benefit from @flyrain doing a full review on this as well especially now that we have the \"markDelete\" pathway as well. I assume for that we probably will just skip counting deletes since we don't really care.", "path": null, "line": null, "type": "review_body"}, {"author": "wypoon", "body": "@kbendick suggested a no-op version of the `DeleteCounter` instead of using null. There are legacy uses where counting is not needed, but I tend to agree that it doesn't hurt to use a single `DeleteCounter` throughout and that `NonCounter` is probably not needed.", "path": "core/src/main/java/org/apache/iceberg/deletes/NonCounter.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "Ok, I can put `numberOfPositionsDeleted()` in the `PositionDeleteIndex` interface.", "path": "core/src/main/java/org/apache/iceberg/deletes/BitmapPositionDeleteIndex.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "If we add `numberOfPositionsDeleted()` in the `PositionDeleteIndex` interface, then we don't need the cast either.", "path": "core/src/main/java/org/apache/iceberg/deletes/Deletes.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "Yes, they are there to support existing API calls.", "path": "core/src/main/java/org/apache/iceberg/deletes/Deletes.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "Yes, we need to pass the `DeleteCounter` in.
The `DeleteCounter` is created in either `RowDataReader` or `BatchDataReader` and in their `open(FileScanTask)` method, an instance of their `SparkDeleteFilter` static nested class is created for that `FileScanTask` with the `DeleteCounter`. We need a single counter in either `RowDataReader` or `BatchDataReader` which is used to aggregate the delete count over all the `FileScanTask`s.", "path": "data/src/main/java/org/apache/iceberg/data/DeleteFilter.java", "line": 84, "type": "inline"}, {"author": "wypoon", "body": "This constructor without the `DeleteCounter` is called by `GenericDeleteFilter` and `FlinkDeleteFilter`.", "path": "data/src/main/java/org/apache/iceberg/data/DeleteFilter.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "Actually, this eventually calls `Deletes.toPositionIndex(CloseableIterable<Long>, DeleteCounter)` where we do\r\n```\r\n PositionDeleteIndex positionDeleteIndex = new BitmapPositionDeleteIndex();\r\n deletes.forEach(positionDeleteIndex::delete);\r\n counter.increment(((BitmapPositionDeleteIndex) positionDeleteIndex).numberOfPositionsDeleted());\r\n return positionDeleteIndex;\r\n```\r\nThat method is called by other callers besides this. I think it is best to do the `counter.increment(posi", "path": "data/src/main/java/org/apache/iceberg/data/DeleteFilter.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "Yes and no. There doesn't seem to be a natural way to get the `DeleteCounter` to `ColumnarBatchReader`. In this method, the `rowIdMapping` int array is updated by applying the equality deletes in the `DeleteFilter` `deletes`. As each row is tested, if it is deleted, we need to increment the `DeleteCounter`. The `DeleteFilter` has the `DeleteCounter`, so I moved the logic completely over to `DeleteFilter`. 
The alternative is to expose `DeleteFilter`'s `DeleteCounter` so that it can be used here, ", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "Hmm, turns out there is a `TestSparkParquetReadMetadataColumns.CustomizedPositionDeleteIndex` that implements `PositionDeleteIndex`. I'll leave the interface alone then, since I don't want to change unrelated tests.", "path": "core/src/main/java/org/apache/iceberg/deletes/BitmapPositionDeleteIndex.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "I removed the `deleteCount` field and the `setDeleteCount(long)` method, and the `deleteCount()` method now simply calls a `lastExecutedMetricValue(...)` method along the lines you suggest.", "path": "spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeletes.java", "line": 167, "type": "inline"}, {"author": "flyrain", "body": "It makes sense to put it in the interface. The change of `CustomizedPositionDeleteIndex` will be minor. \r\nMinor suggestion: `numberOfPositionsDeleted` -> `numberOfPositionDeletes`", "path": "core/src/main/java/org/apache/iceberg/deletes/BitmapPositionDeleteIndex.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Why do we need a debug log here? ", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "How about extending from CustomSumMetric? 
So that you don't have to reimplement this method.", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/metrics/NumDeletes.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Same here, to extend from CustomSumMetric?", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/metrics/NumSplits.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "How about moving `numSplits` to class `BaseDataReader`? This logic can be shared by all subclass of BaseDataReader in that case.", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "I'd also suggest to move this to `BaseDataReader`.", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "I'd prefer not to move this logic to `DeleteFilter` since it is more relevant here. We may expose DeleteFilter's DeleteCounter as you suggested.", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "This may have perf impact since it is per-row basis. I don't think it is big though, but would recommend to use jmh to micro benchmark it.", "path": "core/src/main/java/org/apache/iceberg/deletes/Deletes.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "+1 for the new class.", "path": "core/src/main/java/org/apache/iceberg/deletes/DeleteCounter.java", "line": 22, "type": "inline"}, {"author": "flyrain", "body": "The same perf concern here.", "path": "core/src/main/java/org/apache/iceberg/deletes/Deletes.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Can we make it formal method Java doc? 
Or we may not need this comment.", "path": "core/src/main/java/org/apache/iceberg/deletes/Deletes.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "This is not the right way. positionDeleteIndex has all pos deletes for the data file, but not necessarily all deletes are used in a read. We need to do the same thing as batch eq delete (line 223 in DeleteFilter.java)", "path": "core/src/main/java/org/apache/iceberg/deletes/Deletes.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Looking a bit more, we don't need this due to https://github.com/apache/iceberg/pull/4588/files#r905365603.", "path": "core/src/main/java/org/apache/iceberg/deletes/BitmapPositionDeleteIndex.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "@wypoon thanks for working on this. I believe this is pretty nice to have. Left some comments. Sorry for the delay. ", "path": null, "line": null, "type": "review_body"}, {"author": "wypoon", "body": "I already explained this in response to the same suggestion by @RussellSpitzer. `CustomSumMetric#aggregateTaskMetrics` does not format the String. We want the metric value to appear the same way as other built-in metrics.
E.g., 1000 should appear as \"1,000\" in the US locale, not \"1000\".", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/metrics/NumDeletes.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "See above.", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/metrics/NumSplits.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "I ran `IcebergSourceParquetPosDeleteBenchmark` before and after this change, and the numbers for `readIceberg` show hardly any difference.", "path": "core/src/main/java/org/apache/iceberg/deletes/Deletes.java", "line": null, "type": "inline"}, {"author": "wypoon", "body": "Update: The `DeleteCounter` is now created in `BaseReader` and passed to the instance of `BaseReader.SparkDeleteFilter` constructed in the `open(FileScanTask)` method `RowDataReader`/`BatchDataReader`. The argument still stands: we need a single counter in the reader which is used to aggregate the delete count over all the `FileScanTask`s.", "path": "data/src/main/java/org/apache/iceberg/data/DeleteFilter.java", "line": 84, "type": "inline"}, {"author": "wypoon", "body": "Thanks for pointing this out. 
I have fixed this.", "path": "core/src/main/java/org/apache/iceberg/deletes/Deletes.java", "line": null, "type": "inline"}], "3231": [{"author": "jackye1995", "body": "looks like we should make these variables in a shared util class, instead of importing this from the output stream to the input stream", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmOutputStream.java", "line": null, "type": "inline"}, {"author": "jackye1995", "body": "nit: newline after control statement", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Can we make it final?", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "It is not used anywhere. Do we need it?", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Do we need it to be public since the class itself is public? Or we can have a builder.", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Nit: make it final.", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Oh, there is a TODO item referring it in the method read, never mind.", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "These fields can be final.", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmOutputStream.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Thanks @ggershinsky for the patch. Can we add a unit test for these two classes? 
", "path": null, "line": null, "type": "review_body"}, {"author": "flyrain", "body": "Should we set the length if field plaintextLength is -1? Looks like this method is valid only after `newStream()` is called.\r\n```\r\nif(plaintextLength == -1) {\r\nthis.newStream();\r\n}\r\n```", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputFile.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Minor suggestion: Would it be a bit easier to understand if message is \"Should read `toLoad` data, but only get `loaded` data\"?", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "The message could be\r\n```\r\n\"The seeking position \" + newPos + \" has reached or exceeded the max stream size \" + plainStreamSize.\r\n```", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Looks like we don't have to implement it, the abstract class `InputStream` has the same logic.", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Can we add message like, \"Failed to create file: %s\", targetFile.location()?", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmOutputFile.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Can we add message like, \"Failed to create or overwrite file: %s\", targetFile.location()?", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmOutputFile.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Make it a final static field?", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmOutputStream.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "It can be a final field.", "path": 
"core/src/main/java/org/apache/iceberg/encryption/AesGcmOutputStream.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "How about a message like this?\r\n```\r\n\"Wrong length of encrypted output: expected \" + (positionInBuffer + GCM_TAG_LENGTH)) + \" but \" + encrypted;\r\n```", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmOutputStream.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "This can go to a common place as well since both classes need it. We may have a util class, e.g. `GcmUtil`", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmOutputStream.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Add message for easier debugging?\r\n```\r\nnew IOException(\"Failed to get an instance of GCM cipher\", e)\r\n```", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "They can be in the same line.\r\nIIUC, this means the stream length is not long enough. 
Can we say something like \"The stream length should be at least \" + AesGcmOutputStream.PREFIX_LENGTH?", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Both plainStreamSize and fileAadPrefix could be final.", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "Can we rename it to lastCipherBlockSize to make it more readable?", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "flyrain", "body": "final as well.", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "I like this, could I ask for just 4 more test cases\r\n\r\nEmpty File (no bits)\r\nFile that aligns perfectly with encryption Chunk Size\r\nFile that is exactly one byte to larger than the aligned and one that is one byte smaller than the aligned file. (we probably hit this unaligned version with the testFileSize below but just to make sure)", "path": "core/src/test/java/org/apache/iceberg/encryption/TestGcmStreams.java", "line": 243, "type": "inline"}, {"author": "RussellSpitzer", "body": "When do we close the stream in this case? 
Or do we just make and close it in that case?", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputFile.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Aren't we also in danger of the read being less than prefix bytes?", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Instead of \"file with wrong magic string\" -> \"Cannot read unencrypted file, missing header containing: ...\"?", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Constant for this?", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Can we make the math here a little more explicit, all of the -1 and such make things a little hard for me to read. We discussed this a bit before offline, I mainly just don't want to have to think about why we have a +1 here or why we have a -1 somewhere else etc ...\r\n\r\nI think we were discussing a something like \"number of full blocks\", \"remainder\" etc ...", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Maybe i'm over thinking this, but shouldn't plainStreamSize and position both be ints? Just wondering why we need the safety check here", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Precondition?", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "This is part of the code I think is kind of complicated because of the special casing of the last block logic. 
Let's try to make this as clear as possible. \r\n\r\nDo we actually need to keep these numbers here? I was thinking about this again, could we instead do something like\r\n```java\r\nMath.min(cipherBlockSize, remaining)\r\n```", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "i think we could also probably call this \"cypherBytesToLoad\"? ", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Not sure I understand this? remainingInBlock is unencrypted bytes and remaining is encrypted right?", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "Could we skip the break and just have it be \r\n```java\r\nif (finishTheBlock && !endOfStream)\r\n```", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "This should also be remaining == 0 right?", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "RussellSpitzer", "body": "@jackye1995 + @rdblue Could you please take a look as well? Since this will define whole file encryption I was hoping we could solidify this before we do the major version bump.", "path": null, "line": null, "type": "review_body"}, {"author": "rdblue", "body": "Why not implement `EncryptedInputFile`?", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputFile.java", "line": 25, "type": "inline"}, {"author": "rdblue", "body": "Where does this come from? 
Is it introduced by Iceberg or from another standard?", "path": "core/src/main/java/org/apache/iceberg/encryption/Ciphers.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Where is this encryption format documented so I can take a look and compare it to the spec?", "path": "core/src/main/java/org/apache/iceberg/encryption/Ciphers.java", "line": 70, "type": "inline"}, {"author": "rdblue", "body": "Should this be `decryptBlock` instead?", "path": "core/src/main/java/org/apache/iceberg/encryption/Ciphers.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think this is specific enough. What is happening here? What is the endianness?", "path": "core/src/main/java/org/apache/iceberg/encryption/Ciphers.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "You're right. This would be the result of the `decrypt` method.", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputFile.java", "line": 25, "type": "inline"}, {"author": "rdblue", "body": "We should be able to know the length without opening the stream, right?", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputFile.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Can we call this a header? We use `prefix` in several places and I think it would be more clear to call it a header since it is only at the start of the file.", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "It would be good to have some of these calculations as util methods. This could be `blockOffset(blockIndex)`.", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Using `remaining` is a bit confusing because its meaning isn't clear. It could be remaining in the encrypted stream or remaining bytes to read, or something else. 
I'd recommend being more specific and using `bytesToRead`.", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I thought this would be the block size, but it looks like the encrypted block size includes the nonce and tag. I think that's a bit confusing and we should make sure the behavior is specified in the spec.\r\n\r\nSince we're planning on making the block size constant, that works well. We can have a constant for this so it's easy.", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "If the unencrypted file length is 0, we should return an in-memory input stream with 0 bytes instead of a decrypting one.", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "If we know from the file size that the content is 0 bytes, we don't need to use an encrypted stream. I think we should fail if the header isn't there and something tries to read the file, but if there are no encrypted blocks then we can just return a simpler stream.\r\n\r\nThat will get rid of the `emptyCipherStream` cases and simplify this implementation.", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This conclusion isn't correct. While unlikely to happen, the underlying stream is allowed to return fewer than the requested bytes. That isn't an error case and it doesn't indicate that the stream hit EOF.\r\n\r\nIf you want to ensure that you get all the bytes requested, then use `IOUtil.readFully`. 
That will throw `EOFException` if there aren't enough bytes.", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think that this can be removed when the block size is set to a constant.", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think most Iceberg streams throw `EOFException` instead of returning -1.", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think this should be an `IllegalArgumentException`. Can you use a precondition?", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Nit: name should be `isLastBlock`", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This isn't an exceptional case. If you need to load the entire buffer, use `IOUtils.readFully`.", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I find the naming a bit confusing because it is inconsistent. I think it is good to use `plain` and `cipher` to distinguish but we should do that consistently. In this case, I would call this `currentCipherBlockSize`.\r\n\r\nI think it would also be good to break this up so you're not combining encrypted block reads and decrypts with the logic to copy data from a plaintext buffer. 
Can you add a `nextCipherBlock` or `nextPlainBlock` method that handles this?", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "It looks to me like this is going to read and decrypt the current block every time `read` is called. Am I wrong? Why not keep the current decrypted content in a buffer that persists across calls and consume bytes as needed from that buffer?", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why use big endian when other places use little endian?", "path": "core/src/main/java/org/apache/iceberg/encryption/Ciphers.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Shouldn't this be EOFException?", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Is this needed? Can we just leave it as it was or is it now failing?", "path": ".palantir/revapi.yml", "line": 808, "type": "inline"}, {"author": "rdblue", "body": "`static final` variables should use names that are ALL_CAPS.", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmOutputStream.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Why is this called? It isn't used anywhere.", "path": "core/src/main/java/org/apache/iceberg/encryption/AesGcmInputFile.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I thought we had decided on a static block size?", "path": "core/src/main/java/org/apache/iceberg/encryption/Ciphers.java", "line": 44, "type": "inline"}, {"author": "rdblue", "body": "Does the call to `getInstance` in the constructor guarantee that the `Cipher` instance is not shared? I'm guessing that it must if it has state (mode, key, etc.) 
but I don't see it guaranteed anywhere.", "path": "core/src/main/java/org/apache/iceberg/encryption/Ciphers.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Should this check that the number of bytes returned by `doFinal` is correct?", "path": "core/src/main/java/org/apache/iceberg/encryption/Ciphers.java", "line": null, "type": "inline"}], "10484": [{"author": "stevenzwu", "body": "Can we avoid the requirement of `equals` and `hashCode`?\r\n\r\nNote that `metadataFileLocation` can be null. There were also discussion on potentially removing the metadata json file requirement from the spec.\r\n\r\n```\r\n private String metadataFileLocation(Table table) {\r\n if (table instanceof HasTableOperations) {\r\n TableOperations ops = ((HasTableOperations) table).operations();\r\n return ops.current().metadataFileLocation();\r\n } else if (table instanceof BaseMetadataTable) {\r\n ", "path": "core/src/main/java/org/apache/iceberg/SerializableTable.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "It is probably best not to use tag for this purpose. `Tag` is a named snapshot id. doesn't seem to be the right fit. \r\n\r\nAlternative would be table property. I would love to hear others' inputs on the lock implementation.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: `TableMaintenanceMetrics`?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/MetricConstants.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "what is taskId for? 
it doesn't seem to be exposed to outside.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/Trigger.java", "line": 35, "type": "inline"}, {"author": "stevenzwu", "body": "nit: may indicate where it is milli or micro in the arg names", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerEvaluator.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "why is this a `KeyedProcessFunction` after `keyBy`? I thought it will always be parallelism of 1. ", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 60, "type": "inline"}, {"author": "stevenzwu", "body": "do we need `List<TableChange>` here? `TableChange` is mergeable. a single instance would be enough?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: `init` name is confusing. it is not initialization of this function/operator", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "The only requirement is `atomicity`", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "I'm undecided between 2 solutions:\r\n1. The trigger contains the ID of the Maintenance Task (implemented in this PR)\r\n2. 
Using side outputs to separate out the triggers for the Maintenance Tasks\r\n\r\nWith the 1st approach the flow looks like this:\r\n- TriggerManager (1 instance)\r\n- Watermark generator (1 instance)\r\n- Splitter to start the specific tasks (1 instance)\r\n- Maintenance Tasks (n instance)\r\n\r\nWith the 2nd approach the flow looks like this:\r\n- TriggerManager (1 instance)\r\n- Watermark genera", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/Trigger.java", "line": 35, "type": "inline"}, {"author": "pvary", "body": "State handling and timers need keyed operators... This is awkward, but we haven't find a better solution.\r\nMaybe another idea?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 60, "type": "inline"}, {"author": "pvary", "body": "This is the `TableChange` per scheduled Maintenance Task since the last Trigger of the given task.\r\n\r\n", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Any better ideas?\r\nThe method initializes the state if it is not a job restart, and starts the recovery procedure to clean up locks and prevent concurrent Maintenance Task runs on recovery.\r\n\r\nI was struggling to find a good name, so I'm open to any suggestions \ud83d\ude04 ", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "Looking at the `PropertiesUpdate` code, it only provides `set` (not `putIfAbsent`). so maybe it doesn't provide the atomicity.\r\n\r\nMy early comment regarding using tag for lock still holds true though. it doesn't feel very natural. I understand the benefit of using Iceberg construct for no additional dependency. 
But tag-based locks also have higher overhead (metadata JSON file rewrite) compared to other standard solution like SQL database. Iceberg commit's atomicity guarantee in the end also com", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "also how are we going to deal with orphaned locks, where `unlock` didn't happen?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "what if it is an error (like OutOfMemoryError)?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I saw `ProcessFunction#Context` also has `timerService()` available. `getRuntimeContext()` also seems to work for `ProcessFunction`", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 60, "type": "inline"}, {"author": "stevenzwu", "body": "oh. you move the init logic here (instead of `initializeState`) due to the need of `currentProcessingTime()`?\r\n\r\nnit: maybe a bit cleaner to move the method to the first line in this method, e.g. `current` can be calculated inside the init method.\r\n```\r\ninit(out, ctx.timerService())\r\n```", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I don't fully understand the 2 options. 
we can postpone this discussion to later PRs when the taskId is actually used.\r\n\r\nin the 2nd option, why does watermark generator needs to be n instances?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/Trigger.java", "line": 35, "type": "inline"}, {"author": "stevenzwu", "body": "still not quite following. I thought TriggerManager accumulates `TableChange` and decide if it is time to actually trigger/schedule the maintenance task. should the accumulation be done via `List<TableChange>` or on-the-fly merged single `TableChange`?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "circling back on the `TableChange` interface from last PR. this style of constructor is a bit hard to remember what each of those 5 numbers are for. a builder pattern may be a little easier. in this case, it could be `TableChange.builder().commitNumber(1).build`", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: is this helper method beneficial? I would think the original form is a bit easier to read/understand", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "not sure if this hiding helps with code understanding. it is easier to understand if the evaluator is explicitly defined", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "why unlock for each of these methods? where is it locked? 
ideally, it is easier to read if lock and unlock happens in the same scope/method.", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": 578, "type": "inline"}, {"author": "stevenzwu", "body": "how is the `clearLocks` set by the caller? or what scenario it should be true?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "what if lock was created/acquired by another process?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "why don't we put name inside the `TriggerEvaluator` class?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 106, "type": "inline"}, {"author": "stevenzwu", "body": "this is essentially a delayed cleanup, because timer service is not available in this context?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "where is recoveryLock unlocked?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 329, "type": "inline"}, {"author": "pvary", "body": "I've tried this and failed for some reason. Let me double check this and come back with the results.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 60, "type": "inline"}, {"author": "pvary", "body": "We're planning to have multiple Maintenance Tasks in the PostCommitTopology. But we are planning to have a single `TriggerManager`, so it is easier to prevent concurrent runs. This single `TriggerManager` needs to handle the scheduling of every Maintenance Task. 
Every Maintenance Task could have a different scheduling requirement, like DataFileCompaction on every 10 commits, ExpireSnapshots on every hour, and DeleteOrphanFiles on every day. For this, we need to keep track of the accumulated `TableC", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "> tag-based locks also have higher overhead (metadata JSON file rewrite) compared to other standard solution like SQL database\r\n\r\nAgreed. That's why the interface. We can provide other implementations later. Maybe even a REST catalog based solution, or HMS based solution above the JDBC based one. I envision this as a first solution without additional external dependency which could be used with any existing Iceberg implementation.\r\n\r\n> also how are we going to deal with orphaned locks, where unlo
`Trigger`", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": 578, "type": "inline"}, {"author": "pvary", "body": "All of the publicly available parameters of these operators will be set by the `TableMaintenance.Builder`.\r\n\r\nThat said, this config is probably not needed anymore. The original intent was to enable the user to remove stale locks on job restart, but the new recovery solution covers this possibility.\r\n\r\nLet me think a bit more on this, but I might end up removing this configuration.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "We don't support this scenario ATM.\r\n\r\nWe can make it work with named locks, where we can identify the owner of the currently held lock by some means, but then we need to solve the prioritizing and starving of these concurrent Maintenance Jobs. I decided against solving that issue ATM.\r\n\r\nAnd this is the feature where we will need the `cleanup` config again to remove all of the stale locks for the given table... \ud83d\ude04 ", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "These are the specific Maintenance Tasks scheduled by the user. 
It could be one or more of the following:\r\n- DataFileRewrite\r\n- ManifestFileRewrite\r\n- ExpireSnapshots\r\n- DeleteOrphanFiles\r\n\r\nWe can add more later, if we need/want.\r\n\r\nAlso, the user might decide to schedule a given type multiple times with different configurations, like RewriteDataFiles with only small file compaction on every 10 commits, but a full table rewrite on every day.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 106, "type": "inline"}, {"author": "pvary", "body": "In the last operator of the Table Maintenance Stream, where we join the results of all of the Maintenance Tasks, and remove the lock when the Watermark has arrived from all of them. This makes sure that any recovered and still running task is finished at this time.\r\n\r\nWe essentially use the Watermark as a marker that every record is processed in all of the Maintenance Tasks.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 329, "type": "inline"}, {"author": "pvary", "body": "Another idea:\r\n- Is there any guarantee in Flink that the operators of the same subtasks, in the same slot sharing group are run on the same JVM?\r\n\r\nBased on this graph: https://nightlies.apache.org/flink/flink-docs-master/fig/slot_sharing.svg, which I have found on this page: https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/flink-architecture/#task-slots-and-resources this seems to be the case.\r\n\r\nIf we can make sure that the `TriggerManager` and the `LockRemover` are in the ", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "So the issue is with the timers:\r\n```\r\njava.lang.UnsupportedOperationException: Setting timers is only supported on a keyed streams.\r\n```\r\n\r\nThe method is there, but we can not 
set the timer from a non-keyed `ProcessFunction`", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 60, "type": "inline"}, {"author": "pvary", "body": "Removed, added a utility method for comparison as it is used only in the tests ATM", "path": "core/src/main/java/org/apache/iceberg/SerializableTable.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Created a `JVMBasedLockFactory` which uses `Semaphore`s as locks.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Added `Ms` to the names", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerEvaluator.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Added the builder.", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Refactored a bit these methods", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Refactored a bit these methods", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Removed the `clearLocks` for now. We might want to get it back when/if we start supporting concurrent compaction jobs, but it is not important as of now. The recovery will create it's own lock anyway, and cleanup will remove the lock in the end", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: JVM seems redundant as this is Java code. 
maybe `SemaphoreLockFactory` or `InMemorySemaphoreLockFactory`?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/JVMBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "thanks for verifying. please add the context to the class javadoc, which is empty currently.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 60, "type": "inline"}, {"author": "stevenzwu", "body": "I am not sure if we are talking about the same thing. I was saying moving the `String taskName` into the `TriggerEvaluator` class. this way, we only need `List<TriggerEvaluator> evaluators`. Then we don't need the check on `taskNames.size() == evaluators.size()`", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 106, "type": "inline"}, {"author": "stevenzwu", "body": "nit: add the time unit `1 ms`. same for the error msg below", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 107, "type": "inline"}, {"author": "stevenzwu", "body": "nit: maybe add the context as code comment?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: `shouldCleanUp` seems more accurate. `isCleanUp` makes more sense in the `Trigger` class", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "just global operator doesn't guarantee the colocation. we probably should call out the Flink slot sharing group. 
some brief example would be helpful.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/JVMBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "oh. I got it now. the list has one `TableChange` accumulation for each maintenance task", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "should we avoid `TagBasedLockFactory` to start with? I am still a bit concerned about using Tag as distributed lock mechanism just semantically.\r\n\r\n`SemaphoreLock` is a simple and good implementation to provide. It does have some limitations. E.g., we probably can't specify the slot sharing group to control the colocation for Flink SQL jobs?\r\n\r\nAnother potential implementation is using k8s ConfigMap/Etcd. but we probably don't want to pull in the k8s Java SDK dep in Flink module.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "do we need `deleteFileBytes`?\r\n\r\nI am also wondering if we need to separate out position delete and equality deletes for trigger evaluator.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerEvaluator.java", "line": 65, "type": "inline"}, {"author": "stevenzwu", "body": "nit: can we use variable naming pattern similar to existing code? Here are some from `PartitionsTable` class. 
\r\n\r\n```\r\n private int dataFileCount;\r\n private long dataFileSizeInBytes;\r\n private long posDeleteRecordCount;\r\n private int posDeleteFileCount;\r\n private long eqDeleteRecordCount;\r\n private int eqDeleteFileCount;\r\n```", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerEvaluator.java", "line": 63, "type": "inline"}, {"author": "stevenzwu", "body": "I am not sure these constants improve readability. e.g., what is the relationship (comparison) btw EVENT_TIME and EVENT_TIME_2? it would be easier to read the literals directly.\r\n\r\nsimilar comment for dummy name.", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/ConstantsForTests.java", "line": 22, "type": "inline"}, {"author": "stevenzwu", "body": "do we need to check if `nextTime` is larger than the `current` time here?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 203, "type": "inline"}, {"author": "stevenzwu", "body": "nit: move the variable definitions right before the usage for readability", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "we don't check the boolean return value from `tryLock`. is that correct?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 329, "type": "inline"}, {"author": "stevenzwu", "body": "nit: It is probably more common to keep these lists as class members. 
and in the `snapshotState` method, just update the state variable with the list.\r\n\r\nfor this use case (low-frequency state read/write), it doesn't matter performance wise (either way).", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "might be a bit more readable to do the `mod` here (than in the `nextTrigger` method)", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "for each `Trigger` record, we are loading the table from catalog. why don't we let downstream operator load the table directly? i.e. downstream operator gets the `TableLoader` and we avoid including `SerializableTable` to the `Trigger` object?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 257, "type": "inline"}, {"author": "stevenzwu", "body": "in case of nothing to trigger, why do we need to reset the cursor/position to 0?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 250, "type": "inline"}, {"author": "pvary", "body": "If the slot sharing group is not defined, then the operator is in the `default` slot sharing group. This means that if the slot sharing group is not set, then we are most probably ok.\r\nOTOH, I have yet to find the exact code which schedules the Flink subtasks, so I am still not sure that the user can guarantee the co-location of the `TriggerManager` and the `LockRemover`.\r\n\r\nTag based locking is working even if the Flink co-location was not successful. 
So I still feel that it has its own uses.\r\n", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "The job should be restarted, and the cleanup process will handle the remaining lock", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "I will need to send the list of names to the `LockRemover` as well - so I can log the results, collect the metrics using the same lists. I don't want to send the evaluators to the `LockRemover`, so I think it is better to keep a separate list here for consistency's sake.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 106, "type": "inline"}, {"author": "pvary", "body": "Yeah\r\n- If we succeed, then we were able to create the lock and this is fine\r\n- If we fail, then we already have a lock on the table (maybe from a previous failed recovery), and this is fine as well", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 329, "type": "inline"}, {"author": "pvary", "body": "It is hard to understand the requirement that the `TriggerManager` and the `LockRemover` operators should be on the same JVM. I would like to see this highlighted as much as possible - that is why I added the JVM to the name", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/JVMBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "We will drive the slot sharing groups for these operators, so if being in the same slot sharing group is fine, then we will be fine. Also we will drive how the operators are added to the streams, so we are fine there as well. 
So I don't think we need to go into the details here.\r\n\r\nI'm more concerned if being in the same slot sharing group and being a global operator doesn't guarantee co-location. ", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/JVMBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "I'm fine with revisiting the content of the `TableChange` in another PR.\r\nMaybe when we went through most of the Maintenance Tasks we will have a better understanding of what the most relevant metrics for the evaluators are", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerEvaluator.java", "line": 65, "type": "inline"}, {"author": "pvary", "body": "I would do it once when we have the final list of variables in the `TableChange`. I just finished rewriting the `TableChange` and added the builder with the old names. I would like to rewrite them to the new names when we have the final list.\r\n\r\nSince this is a very limited/localized change we could easily postpone it for later", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerEvaluator.java", "line": 63, "type": "inline"}, {"author": "pvary", "body": "In upcoming tests we will have WATERMARK for EVENT_TIME, WATERMARK for EVENT_TIME_2, where it will be important that the watermark and the event time are aligned.\r\n\r\nAlso if you take a look at several tests, it is important that the source name is part of the metrics name. So we need a constant for them minimally in the test itself. 
Since multiple tests will use these constants, I think it is better to define them once, and reuse them in multiple tests.", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/ConstantsForTests.java", "line": 22, "type": "inline"}, {"author": "pvary", "body": "That would lead to race conditions when the `processElement` and the `onTimer` method calls \"should\" happen at the same time. If `nextTime` is set, we can be sure that the `onTimer` will be called when the time is right. The check happens there, and we can avoid the race condition.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 203, "type": "inline"}, {"author": "pvary", "body": "They are used in multiple places (different cases in the if condition). So that is why I added it to the beginning of the method. Also the `currentProcessingTime` is advancing even during the execution, so we can't recreate it all the time. AFAIK this is good practice to define the variables used in multiple places at the beginning of the method.\r\n\r\nI have moved the `init` to the top based on your request, which moved the type declarations from the top.\r\n\r\nDo we have a clear guide how to organize ou", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Moved to the `snapshotState` based solution. 
It makes the code more readable as updating list states is awkward.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "This is needed, so we can differentiate the 2 situations\r\n- We start now\r\n- We reached the last element, and we need to start from the beginning", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Most of the tasks need Table anyways. If we do not do it here, we will have a `SnapshotTable` operator in:\r\n- RewriteDataFiles\r\n- RewriteManifestFiles\r\n- DeleteOrphanFiles\r\n\r\nWe do not need the table for:\r\n- ExpireSnapshots - as we need to load the table anyways\r\n\r\nI'm open to both solutions. I opted for reading the table here to simplify the specific Maintenance Tasks later.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 257, "type": "inline"}, {"author": "pvary", "body": "If we have multiple Maintenance Tasks scheduled, then I want to keep the scheduling fair.\r\nSo if we find a task to trigger, then we run it, but after it is finished, we start from the given position to prevent \"starvation\" of the tasks.\r\nWhen there is nothing to trigger, we start from the beginning, as the order of the tasks might be important (RewriteDataFiles first, and then RewriteManifestFiles later - in this case we don't want to rewrite manifest files which are removed by the RewriteDataFi", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 250, "type": "inline"}, {"author": "pvary", "body": "Added the javadoc - could you please review?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 60, "type": "inline"}, 
{"author": "stevenzwu", "body": "> When there is nothing to trigger, we start from the beginning, as the order of the tasks might be important (RewriteDataFiles first, and then RewriteManifestFiles later - in this case we don't want to rewrite manifest files which are removed by the RewriteDataFiles anyways)\r\n\r\nmight make sense to add as code comment?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 250, "type": "inline"}, {"author": "stevenzwu", "body": "nit: `1 ms` with a space", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "if it is hard to guarantee colocation, I would suggest we remove the JVM lock for now. Otherwise users may misuse it without knowing the problem. basically not expose the JVM lock until we can guarantee colocation.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "should we load the table once in the factory constructor? lock object only needs to refresh table state in the tryLock and unlock methods.\r\n\r\nalso it doesn't seem that `tableLoader` is closed.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "`refs` is retrieved both here and in the `isHeld` method. `table.refresh()` is also done in the `isHeld` method. should we remove line 76 and 77?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: remove `someone`. I don't know if we need to print the refs keySet. 
it would imply all tags are locks", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "if table is empty, is there any need to run maintenance. wondering if returning false makes more sense here. is the purpose to avoid wait loop on `tryLock` to be true.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "can we add those as code comment?", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": 578, "type": "inline"}, {"author": "stevenzwu", "body": "nit: Iceberg error msg tends to be in the style of `Invalid something: reason`. so here could be `Invalid evaluators: null or empty`", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "do we need the local variable of `nextTime`? 
read `nextEvaluationTime` directly?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "add the context as code comment as it is not immediately obvious?\r\n\r\nif tryLock failed, it is fine to send the `Trigger.cleanUp` event?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 329, "type": "inline"}, {"author": "stevenzwu", "body": "where is tableLoader closed?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 161, "type": "inline"}, {"author": "stevenzwu", "body": "is the assumption here that there is only one timer scheduled at the most?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 280, "type": "inline"}, {"author": "stevenzwu", "body": "didn't quite understand the race condition part", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 203, "type": "inline"}, {"author": "stevenzwu", "body": "where do we need to differentiate the 2 situations?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "reset `isCleanUp` to false when recovery is done? `isCleanUp` is only used once by the `init` method. why not just use the `inited` boolean?\r\n\r\nmaybe `isCleanUp` naming doesn't help either. seems more like `shouldRestoreTasks`?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "maybe add a comment like `Recovered tasks in progress. 
Skip trigger check`?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 231, "type": "inline"}, {"author": "stevenzwu", "body": "we could have multiple timers scheduled for this scenario if the recoveryLock is not released in multiple `checkAndFire` calls?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 230, "type": "inline"}, {"author": "stevenzwu", "body": "ok. let's revisit this once we see later parts", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/Trigger.java", "line": 35, "type": "inline"}, {"author": "pvary", "body": "I think it is much easier to follow the logic around the locking if we don't complicate it even more.\r\nFor example we should rethink the logic around the recovery lock to be sure that the cleanup logic is working if there are no commits in the table.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Also this happens only very-very rarely", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "`nextEvaluationTime` might be changed by the `checkAndFire` method, and I prefer to print the `nextTime` in the logs", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Yes. 
We make sure that only one timer is scheduled to prevent race conditions and such.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 280, "type": "inline"}, {"author": "pvary", "body": "If we trigger the `checkAndFire` when the `nextTime` is larger than the `current`, then we will have a trigger by this new `checkAndFire`, and also the timer could be called too. It is better to avoid this situation altogether by relying on the timer to trigger when it is needed.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 203, "type": "inline"}, {"author": "pvary", "body": "Added a comment like this:\r\n```\r\nThe result of the 'tryLock' is ignored as an already existing lock prevents collisions as well\r\n```\r\n\r\n> if tryLock failed, it is fine to send the Trigger.cleanUp event?\r\n\r\nConsider the situation that we tried to recover a failed job, created the recovery lock, but the job failed again before finishing the lock cleanup. 
In this case if we try to recover the failed job again, then we will have a lock, but no `Trigger.cleanUp` in the pipeline, so we need to fire a ", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 329, "type": "inline"}, {"author": "pvary", "body": "Removed the JVM lock altogether", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/JVMBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Added open/close to the LockFactory", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "removed refresh, and also refs.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Removed the suggested parts", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "We come here only by 2 routes:\r\n- `processElement` - in this case `nextTime`/`nextEvaluationTime` is null - no previous timer scheduled\r\n- `onTimer` - in this case the previous timer has fired, so no previous timer scheduled", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 230, "type": "inline"}, {"author": "pvary", "body": "As discussed on the other comment, we need this, so the scheduling is fair", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "let's call out its intended scope/limitation of the tag based locking: mainly only working with single job scenario. 
while we are clarifying that, maybe also call out multiple Flink jobs running compaction is not recommended.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "alright fair enough", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: this error msg doesn't seem correct. it just failed to acquire lock. it doesn't actually `stop` anything.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "wondering if retry is safe here. Could the lock/unlock action be initiated by multiple threads/operator subtasks?\r\n\r\nhere is one scenario I am thinking about.\r\n\r\n- t1: operator 1 calls unlock, it actually removed the tag by the catalog service/backend, but response failed to get back due to network issue\r\n- t2: operator 2 calls lock, and created the tag successfully.\r\n- t3: operator 1 retry the unlock and removed the tag\r\n\r\nIf the above scenario couldn't happen, let's call out the constraints in t", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "also how do we deal with orphaned locks where unlock failed (e.g. 
due to network issue with catalog service)", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "`nextTime` is only used in the `else` case without running `checkAndFire`", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "oh. I didn't mean to \"run the checkAndFire when the nextTime is larger then the current\". anyway, I guess it is not important in this else section.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 203, "type": "inline"}, {"author": "stevenzwu", "body": "nit: for consistency, should it be `recover lock` in the log msg?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "what other operator (besides TriggerManager here) subtask can acquire lock?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I was saying perform the mod operation here to make sure `startsFrom` is always a valid position. 
This way, if we log this value, it will show the correct value.\r\n\r\n```\r\nstartsFrom = (taskToStart + 1) % evaluators.size();\r\n```\r\n\r\ninstead of in `nextTrigger` method\r\n```\r\nint normalizedStartingPos = startPos % evaluators.size();\r\n```\r\n\r\nThis should affect fairness, right?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: for consistency, maybe `acquire the recovery lock`", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "I saw 4 places where `schedule()` was called in `TriggerManager`\r\n\r\n1. init: `if (shouldRestoreTasks)`\r\n2. checkAndFire: `if (recoveryLock.isHeld())`\r\n3. checkAndFire: `if (lock.tryLock())` and `else` (these two are mutually exclusive)\r\n\r\n1 and 2 can co-exist. 1 and 3 also seem like they can happen together.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 280, "type": "inline"}, {"author": "stevenzwu", "body": "it is probably cleaner to do lines 296-306 in `initializeState` method in the `if (context.isRestored())` block", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 309, "type": "inline"}, {"author": "pvary", "body": "We need changes in `TriggerManager` (handling cleanup) and the `Lock` (handling lock priority) implementation/requirements as well to handle multiple Flink jobs running compaction.\r\n\r\nBased on this I added the comment to `TriggerManager` instead:\r\n```\r\nThe current implementation only handles conflicts within a single job. 
Users should avoid scheduling\r\nmaintenance for the same table in different Flink jobs.\r\n```", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Would this message be more user friendly:\r\n```\r\nConcurrent lock created. Is there a concurrent maintenance job running?\r\n```", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "> wondering if retry is safe here. Could the lock/unblock action initiated by multiple threads/operator subtasks?\r\n\r\nThis is a valid issue. We need to write a more complicated locking with lockIds, and checks like we have in `MetastoreLock` and `LockRequest`. The main idea is to have something like `__flink_maitenance_<RND>` for the lock.\r\n- For checking the lock, we check if there is any tag starting with `__flink_maitenance_`\r\n- For creating the lock, we create a tag with a `__flink_maitenance", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "> also how do we deal with orphaned locks where unlock failed (e.g. due to network issue with catalog service)\r\n\r\nWe could reuse the recovery lock mechanism for this situation: After a given period, if we can't acquire the lock, then we can create a recovery lock, and send a `Trigger.cleanup`. 
If the recovery lock is not removed after a time period we could just throw an exception.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "`startsFrom == 0` means that we have finished this check loop\r\n`startsFrom == evaluators.size()` means that we need to continue this check loop from the beginning", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Yeah, fixed the comment", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "changed the comment and also changed `Trigger.cleanUp` to `Trigger.recovery`", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "That's a good point. Added the check to the `init` `shouldRestoreTasks` as well.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 280, "type": "inline"}, {"author": "pvary", "body": "I leave this open, so you can double check the fix", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 280, "type": "inline"}, {"author": "pvary", "body": "We initialize the states in the `open` method. The `initializeState` is called after that.\r\nThat is why I needed to move the state restoration to here.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 309, "type": "inline"}, {"author": "pvary", "body": "> > wondering if retry is safe here. 
Could the lock/unblock action initiated by multiple threads/operator subtasks?\r\n> \r\n> This is a valid issue. We need to write a more complicated locking with lockIds, and checks like we have in `MetastoreLock` and `LockRequest`. The main idea is to have something like `__flink_maitenance_<RND>` for the lock.\r\n> \r\n> * For checking the lock, we check if there is any tag starting with `__flink_maitenance_`\r\n> * For creating the lock, we create a tag with a `__fl", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "> > also how do we deal with orphaned locks where unlock failed (e.g. due to network issue with catalog service)\r\n> \r\n> We could reuse the recovery lock mechanism for this situation: After a given period, if we can't acquire the lock, then we can create a recovery lock, and send a `Trigger.cleanup`. If the recovery lock is not removed after a time period we could just throw an exception.\r\n\r\nThe `unlock` will retry, and if it has failed `CHANGE_ATTEMPTS` times, we throw an exception. If we face ma", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: import the class", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "let's add comment to explain that there could never be two locks created at the same time. 
`tryLock` is only called by single-parallelism `TriggerManager`, which always checks if there is any existing lock or not before creating a new tag/lock.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "we should probably throw an exception if multiple locks/tags exist for the prefix. because this is the assumption of how the random suffix works, let's enforce it here.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "`initializeState` happens before `open` method. we probably should move all state-related code to the `initializeState`", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 309, "type": "inline"}, {"author": "stevenzwu", "body": "maybe add some info log for both if and else case.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 178, "type": "inline"}, {"author": "stevenzwu", "body": "> `startsFrom == 0` means that we are finished this check loop `startsFrom == evaluators.size()` means that we need to continue this check loop from the beginning\r\n\r\nwhere do we use that distinction?", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: maybe `\"Failed to acquire lock. Delaying task to {}\", current + lockCheckDelayMs`", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "> Added the check to the init shouldRestoreTasks as well\r\n\r\nI am not seeing the change. 
not sure if I missed sth or not.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 280, "type": "inline"}, {"author": "stevenzwu", "body": "can we use `CollectingMetricsReporter` from Flink?", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/MetricsReporterFactoryForTests.java", "line": 39, "type": "inline"}, {"author": "stevenzwu", "body": "`TestLockFactoryBase` extends from `OperatorTestBase`. is operator needed to test `TagBasedLockFactory`?", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "do we need this? Does timeout/expire depend on the processElement action?", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": 195, "type": "inline"}, {"author": "stevenzwu", "body": "can you add a comment to explain why we have 2 triggers in this case?", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": 252, "type": "inline"}, {"author": "stevenzwu", "body": "typo? 
log -> lock", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "nit: is `locked` more accurate than `tableLock`?", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "can you add a comment to explain this if section?", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": 342, "type": "inline"}, {"author": "stevenzwu", "body": "nit: maybe `// Simulate the action of removing lock and recoveryLock by downstream lock cleaner when it received recovery trigger`", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "do we need to another table change event here?", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "why would both tasks triggered in this case?", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": 423, "type": "inline"}, {"author": "stevenzwu", "body": "this function name is difficult to see what is asserted. a couple of possible alternatives\r\n1) make it more specific like `assertRateLimitedMetric(int expected)`\r\n2) still one metric a time, but a more general form `assertMetric(String name, int expected)`", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": 491, "type": "inline"}, {"author": "stevenzwu", "body": "a semantic question. in this case, would we call it `concurrent run`? 
also would be be `blocked` or `triggered`?", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": null, "type": "inline"}, {"author": "stevenzwu", "body": "maybe add one more step of advancing the time so that next check will still make the task as blocked?", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": 532, "type": "inline"}, {"author": "pvary", "body": "Added a comment to the `Lock` methods, so other implementations could build on this assumption.", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "We can't access the keyed state in the `initializeState` method:\r\n```\r\njava.lang.NullPointerException: Keyed state 'triggerManagerNextTriggerTime' with type VALUE can only be used on a 'keyed stream', i.e., after a 'keyBy()' operation.\r\n\r\n\tat org.apache.flink.util.Preconditions.checkNotNull(Preconditions.java:76)\r\n\tat org.apache.flink.streaming.api.operators.StreamingRuntimeContext.checkPreconditionsAndGetKeyedStateStore(StreamingRuntimeContext.java:238)\r\n\tat org.apache.flink.streaming.api.opera", "path": "flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/TriggerManager.java", "line": 309, "type": "inline"}, {"author": "pvary", "body": "Any idea, on how to get the current `MetricsReporter` in the tests?", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/MetricsReporterFactoryForTests.java", "line": 39, "type": "inline"}, {"author": "pvary", "body": "`TableLoader` is needed for the `TagBasedLockFactory`. Also SQL is used to create the table, and add a snapshot to it. 
There could be other ways to get the desired setup, but I thought that reusing the `OperatorTestBase` add the least mental complexity.", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTagBasedLockFactory.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "Yes.\r\nThis test is for testing the `TriggerEvaluator` timeout handling.\r\nWe need a known time for a successful trigger (`newTime`) for a previous trigger, and later we can check if the `TriggerEvaluator.timeout` is handled correctly.", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": 195, "type": "inline"}, {"author": "pvary", "body": "Added:\r\n```\r\n// At this point the output contains the recovery trigger and the real trigger\r\n```", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": 252, "type": "inline"}, {"author": "pvary", "body": "Added:\r\n```\r\n// Simulate the action of the recovered maintenance task lock removal when it finishes\r\n```", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": 342, "type": "inline"}, {"author": "pvary", "body": "Good catch!\r\nNot needed.\r\nRemoved.", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": null, "type": "inline"}, {"author": "pvary", "body": "We have 2 tasks scheduled in this test:\r\n- The firsts task is triggered on ever 2nd commit - this should be triggered on the 4th commit\r\n- The second task is triggered on every 4th commit - this should be triggered on the 4th commit", "path": "flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestTriggerManager.java", "line": 423, "type": "inline"}], "3543": [{"author": "rdblue", "body": "I don't think that we want this to be logged unless there is a 
problem. An info log here could really flood the logs.", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Nit: We normally wrap arguments starting at the parameter list start, or move all arguments off of the first line.\r\n\r\n```java\r\n CachingCatalog(Catalog catalog...\r\n long expirationInterval)\r\n\r\nOR\r\n\r\n CachingCatalog(\r\n Catalog catalog, ...)\r\n```", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Is there a better name for the variable than \"key\"? It's a table identifier, right? Can we use \"ident\" in place of key for readability?", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "This is just a notification?\r\n\r\nDoesn't need to be info. Maybe don't log it at all?", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "The `hasMetadataTableName` check is duplicated in `onTableExpiration`. I think I prefer it there instead of here. You may as well move the `expirationEnabled` there as well. Or maybe this should check the cause and only call `onTableExpiration` if that is why the record is being removed?", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Nit: newline between control flow and return.", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": 121, "type": "inline"}, {"author": "rdblue", "body": "This is probably fine.", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What about having a single property? 
We could use the expiration interval to disable the cache by setting it to 0.", "path": "core/src/main/java/org/apache/iceberg/CatalogProperties.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Is this needed?", "path": "core/src/test/java/org/apache/iceberg/CachingCatalogTestHelper.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Thanks for the review @rdblue. Will update accordingly. Left an explanation about the strange (and rather ugly) `CachingCatalogTestHelper` but I think I can work around it and remove it.", "path": null, "line": null, "type": "review_body"}, {"author": "kbendick", "body": "Yeah this is so left over from debugging (and from testing with Spark). This one really isn't needed and I'll remove it entirely.", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Good catch. Will be much simpler if it's all in the `onTableExpiration` function.", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Will remove the comment then. Thanks for the input.", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "I personally don't care for this either.\r\n\r\nThe reason to have this is that the style checker complains that the `@VisibleForTesting` annotation is used for things that aren't package-private. It actually errors out.\r\n\r\nThe methods can't be package private otherwise as `TestCachingCatalog` is in org.apache.iceberg.hadoop but `CachingCatalog` is in org.apache.iceberg.\r\n\r\nIt was _much_ cleaner without this. 
I can possibly try to move the tests over to org.apache.iceberg instead.\r\n\r\nI originally ha", "path": "core/src/test/java/org/apache/iceberg/CachingCatalogTestHelper.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I'm okay moving this to debug.", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "What do you think about checking the cause and using that instead of `expirationEnabled`?", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Users should probably not set this unless they want a specific value. If they don't know, they should omit it.", "path": "core/src/main/java/org/apache/iceberg/CatalogProperties.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Another option is to make instance variables package private and then have a subclass for testing that exposes the methods publicly. Then you have a whole class that is visible for testing. That would also remove test code from the production class.", "path": "core/src/test/java/org/apache/iceberg/CachingCatalogTestHelper.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "I was thinking about that too. Checking expirationEnabled alone isn't necessarily sufficient depending on the behavior we want.\r\n\r\nThe causes are as follows for those who don't know:\r\n\r\n- `EXPLICIT` - Things like `invalidate`, `remove`, etc. 
\r\n- `REPLACED` - New entry overwriting an old one - Doesn't seem possible looking at the code.\r\n- `COLLECTED` - for us, this would occur due to configuring `softValues`.\r\n- `EXPIRED` - the reason that would come from `expireAfterAccess etc - what `expiration", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Fixed in a few places.", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "I'm ok with that. To be clear, this would thus turn it on by default. I'm ok with that as user's can always disable it.", "path": "core/src/main/java/org/apache/iceberg/CatalogProperties.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "I like that idea. I'm working on another PR today, so I'm going to leave as is for the moment and come back to this later but your idea is much cleaner. \ud83d\udc4d ", "path": "core/src/test/java/org/apache/iceberg/CachingCatalogTestHelper.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Looks like `EXPIRED` allows us to avoid checking `expirationEnabled` so let's plan to do that.\r\n\r\nFor the others, I think that we could use this as a way to invalidate the cached entries that were dependent on the table. But, that should already be done as you noted for `EXPLICIT`. It is probably a good idea to eventually move this over to happen through the same code path here eventually, but it should be okay to leave it as-is for now.", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Yes, I think caching should be on by default. That's the current default for `cache-enabled` anyway.\r\n\r\nOne thing I missed originally is that the `cache-enabled` property is already used in released versions. So we should continue to support disabling the cache using it. 
We should also disable the cache if the expiration is 0.", "path": "core/src/main/java/org/apache/iceberg/CatalogProperties.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I'm +1 for using a single property to control whether expiration is enabled and how long the expiration time is. This also makes other places in the code simpler in terms of checking, since you then only have to check `cache-enabled` & `cache.expiration-interval-ms`. ", "path": "core/src/main/java/org/apache/iceberg/CatalogProperties.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "personally I think 15 mins is a very very long time. Intuitively I would have assumed something around < 2 minutes as the default value. So even though users can configure this as they like, I think that we should have good/reasonable default values, but maybe there's a particular reason why we think 15 mins is a good default?", "path": "core/src/main/java/org/apache/iceberg/CatalogProperties.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I guess if we remove `cache.expiration-enabled` and only use `cache.expiration-interval-ms` I believe it could still make sense to have a good default that is maybe > 0 but <= 2 mins. wdyt?", "path": "core/src/main/java/org/apache/iceberg/CatalogProperties.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "I would be in favor of removing `CachingCatalogTestHelper` and inline the assertion checks within the tests", "path": "core/src/test/java/org/apache/iceberg/CachingCatalogTestHelper.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "cleanup is already being called by `catalog.tableFromCacheQuietly(identifier).isPresent()`. 
Also I don't think test code should be calling stuff on the cache directly", "path": "core/src/test/java/org/apache/iceberg/CachingCatalogTestHelper.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "having methods in the Cache return Optionals makes testing this more difficult than necessary imo. It also makes it more difficult to debug when the test actually fails. I would suggest something like `Assertions.assertThat(catalog.cache().asMap()).containsKey(identifier);` as that will show you all the entries in the cache when this check fails, thus making our lives easier :)", "path": "core/src/test/java/org/apache/iceberg/CachingCatalogTestHelper.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "maybe replace this with `Assertions.assertThat(catalog.cache().asMap()).doesNotContainKey(tableIdent);`. Then you don't need to have a specific error message, since this code will actually show you the entries in the cache when the test fails, thus it will be quite obvious why this check failed", "path": "core/src/test/java/org/apache/iceberg/CachingCatalogTestHelper.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "maybe `tableCache.policy().expireAfterAccess().get().ageOf(identifier)`?", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "you could try something like `Assertions.assertThat(catalog).extracting(\"icebergCatalog\").isInstanceOf(CachingCatalog.class);`", "path": "spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkCatalogCacheExpiration.java", "line": null, "type": "inline"}, {"author": "nastra", "body": "so basically your checks could become something like 
this:\n```\nAssertions.assertThat(catalog).extracting(\"icebergCatalog\").isInstanceOf(CachingCatalog.class);\nAssertions.assertThat(catalog).extracting(\"icebergCatalog\").extracting(\"expirationIntervalMillis\").isEqualTo(0);\nAssertions.assertThat(catalog).extracting(\"icebergCatalog\").extracting(\"expirationEnabled\").isEqualTo(true);\nAssertions.assertThat(catalog).extracting(\"cacheEnabled\").isEqualTo(true)\n```", "path": "spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkCatalogCacheExpiration.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I agree. 15 minutes is too long. I was thinking more like 30s.", "path": "core/src/main/java/org/apache/iceberg/CatalogProperties.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "There should be no need to pass both the boolean and the expiration interval. Just pass 0 if expiration is not enabled. Also, I think there's an extra `l` in the expiration variable name.", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Yeah that's how I had it previously, but the `@VisibleForTesting` annotations caused compilation to fail due to style checking errors.\r\n\r\nI'll definitely remove this class and then make a subclass like Ryan mentioned. This class came in sort of last minute to get around the compilation failures from using `@VisibleForTesting` and admittedly I do not like this class at all.", "path": "core/src/test/java/org/apache/iceberg/CachingCatalogTestHelper.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Cool. 15 minutes was more of just a placeholder. 
It would be a pain to specify 15 minutes in ms anyway.\r\n\r\nI'll go with 30s.", "path": "core/src/main/java/org/apache/iceberg/CatalogProperties.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Yeah if this is just a testing method then it doesn't need the overhead of the option.", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "That's a great suggestion. Thank you. The assertion messages in this case aren't really adding much.\r\n\r\nAfter I added that `CachingCatalogTestHelper` class last minute, things got admittedly a little ugly. Appreciate the input!", "path": "core/src/test/java/org/apache/iceberg/CachingCatalogTestHelper.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Oooo that's great. I was not aware of this `extracting` business. Thanks for sharing, Eduard!", "path": "spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkCatalogCacheExpiration.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Some of the methods that are visible for testing actually return optionals, such as `ageOf`.\r\n\r\nI'll figure out what to do about those as I've returned to this PR to clean up this class. \ud83d\udc4d ", "path": "core/src/test/java/org/apache/iceberg/CachingCatalogTestHelper.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "I updated this to use `EXPIRED` and left a note about considering having other expirations happen there. 
It does admittedly happen in a background thread, but left as a follow up task.", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "So the `asMap` call and subsequent iteration causes a cache access.\r\n\r\nI'll test it again, but otherwise, I believe that's why I'm using the `getQuietly` mehod.", "path": "core/src/test/java/org/apache/iceberg/CachingCatalogTestHelper.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Nevermind I am mistaken. Switching to this!", "path": "core/src/test/java/org/apache/iceberg/CachingCatalogTestHelper.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "This class has been removed!", "path": "core/src/test/java/org/apache/iceberg/CachingCatalogTestHelper.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "I'm going to continue to in-line most of these. I just wanted to get them out of the old class with the assertion helpers first and foremost.", "path": "core/src/test/java/org/apache/iceberg/hadoop/TestCachingCatalog.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Removed this.", "path": "core/src/test/java/org/apache/iceberg/CachingCatalogTestHelper.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Switched to this.", "path": "core/src/test/java/org/apache/iceberg/CachingCatalogTestHelper.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "This was moved into `TestableCachingCatalog`. The Option is still kept, only because that's what `ageOf` returns.\r\n\r\nI can move the `.get` call into it, but that's why `Optional<Duration>` is still present in the `TestableCachingCatalog`.", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Removed the extra flag. 
It's now just a zero to disable expiration.", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Javadoc should use `<p>` on lines between paragraphs so that it renders with separate paragraphs.", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Javadoc doesn't support markdown syntax, so if you want to use fixed-width type you have to wrap it like this: `{@code expirationIntervalMillis}`.", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "We don't need this, right?\r\n\r\nIf you want to keep it, let's change it to be a bit shorter: `Evicted {} from table cache ({})`. Remember that the people getting these messages probably don't know any variable or class names, so things like `TableCache` should be formatted in plain English, \"table cache\".", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "To me, and expiration interval of 0 basically means do not cache tables. I'd probably check this above and remove the cache entirely for an expiration interval that is 0.\r\n\r\nTo disable expiration entirely, I'd probably use some threshold, like 1 year. If the value is above that, then expiration is effectively disabled. I'm interested what other people think about the upper threshold, though.", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "It may be cleaner for this to be an inner class (not static).", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I think you can remove the TODO. 
You're right to think about whether we should do that if it is async, and I don't think it is a good idea. If you invalidate an entry, then I like removing all metadata tables synchronously.", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Modified? How?", "path": "core/src/test/java/org/apache/iceberg/util/FakeTicker.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I'm not sure I agree with this. I think that an `enabled` flag is compatible with other configuration. It is one simple config you can use to flip on or off a feature, without altering other properties. For example, if I were trying to debug something and I suspected caching, I'd go set this to disabled. It would annoy me at that point if Iceberg complained that I have caching disabled, but set a config property for it. So I'd probably override with `cache-enabled` and ignore that the interval i", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I could see this being an error case. 
Why would you sent the interval to a negative value?", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "Not sure I understand what is happening here.", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think this file needs to change, does it?", "path": "spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/SparkCatalogConfig.java", "line": null, "type": "inline"}, {"author": "rdblue", "body": "I don't think this needs to change.", "path": "spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/sql/TestTimestampWithoutZone.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "To allow for `-1` to mean off. Symmetry with how Spark\u2019s config is designed.\r\n\r\nWe don't need to keep that though. It makes instantiating a negative Duration too easy.", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "EDIT: I see now. Sorry misunderstood at first. 
So if `cache-enabled` is false, just ignore this and set it to zero / don't use it.", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "At one point it was an error if a user specified cache-enabled and had a non-zero cache.expiration-interval-ms.\r\n\r\nThat's somewhat what this business is about: https://github.com/apache/iceberg/pull/3543#discussion_r766267640\r\n\r\nI can remove this though since it's being defaulted to off for the moment anyway.", "path": "spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/sql/TestTimestampWithoutZone.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Not if we're defaulting to zero now (aka off).\r\n\r\nThis was to make the tests predictable. At least at one point it was an error to have a non-zero cache.expiration-interval-ms with `cache-enabled` as false.\r\n\r\nPart of the problem is that we continue to send those properties further down into the call, which is why I normalized them. Let me try removing this now since it's currently defaulted to zero.", "path": "spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/SparkCatalogConfig.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Oh my bad I started updating this in another branch but then got sick and didn't open that yet.\r\n\r\nI can move the changes from that branch here I guess.", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "Since we have `cache-enabled`, which specifically controls whether or not to use the caching catalog, I don't love the idea of having this time based config flip that value. I think `cache-enabled` should be respected if we have it.\r\n\r\nOpen to the high value idea though. I can see how zero could be interpreted as \"expire immediately\" aka don't cache. 
But I think it would be much more confusing for people if we allowed the expiration-interval to override the explicit existing boolean flag for `ca", "path": "core/src/main/java/org/apache/iceberg/CachingCatalog.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "I added these lines which take in a plain `Duration`: https://github.com/apache/iceberg/pull/3543/files/8f197962ca045e99bdcaacc5370854314ab361fa#diff-aab28f9cd9db09ced4aeb1b9ffab120458652bdb001c977a5c790677b4c76030R53-R56\r\n\r\nIt's not even core guava, but a guava test package that hasn't been updated in a long time. So if there's a way we can avoid pulling that in and provide proper attribution, that would be great.", "path": "core/src/test/java/org/apache/iceberg/util/FakeTicker.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "The problem is that the catalog configuration and a number of other things are still pulled from the `SparkSession` at later times, like for table clean up jobs etc.\r\n\r\nAnd the `options` map is passed through a lot more source code.\r\n\r\nSo if an option is moved on behalf of the user, the goal was to make it reflect what we set it in their spark config and in the `options` map.", "path": "spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java", "line": null, "type": "inline"}, {"author": "kbendick", "body": "At this point actually, it's not even really a copy. There are several similar `FakeTickers` and I've removed all of the methods that came from Guava.\r\n\r\n`read` is part of the interface and it has to be in nanos, hence the `AtomicLong` o
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment