@dkam
Last active November 30, 2025 12:36
Convert Wikidata into Parquet format
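# Requires GNU coreutils split (for --filter) and the duckdb CLI.
# The dump is a single JSON array with one entity per line. sed drops the
# opening "[" and closing "]" lines and strips each line's trailing comma,
# leaving newline-delimited JSON. split then pipes every 100,000 lines to its
# own duckdb process, which writes parquet/chunk_NNNN.parquet.
mkdir -p parquet  # COPY ... TO fails if the output directory does not exist
bzcat latest-all.json.bz2 | sed '1d;$d;s/,$//' | split -l 100000 --suffix-length=4 --numeric-suffixes=1 - --filter="
duckdb -c \"COPY (
SELECT * FROM read_json('/dev/stdin', union_by_name=true)
) TO 'parquet/\$FILE.parquet' (
FORMAT PARQUET,
COMPRESSION ZSTD
)\"
" chunk_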
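
To check what the sed stage does before pointing it at the full dump, here is a small sketch run on a made-up two-entity sample. The sample data and the variable names `sample` and `ndjson` are illustrative assumptions, not part of the gist; real dump lines are much larger but have the same shape.

```shell
# Tiny stand-in for the dump: a JSON array with one object per line.
sample='[
{"id":"Q1","type":"item"},
{"id":"Q2","type":"item"}
]'

# Same sed program as in the pipeline above:
#   1d      delete the first line ("[")
#   $d      delete the last line ("]")
#   s/,$//  strip the trailing comma from each remaining line
ndjson=$(printf '%s\n' "$sample" | sed '1d;$d;s/,$//')

# Each remaining line is now a standalone JSON object.
printf '%s\n' "$ndjson"
```

Because every output line is a complete JSON object, split can cut the stream at any line boundary and each chunk is still valid input for read_json.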