Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify the dst crs in convert? #160

Open
FlorisCalkoen opened this issue Mar 28, 2024 · 2 comments
Open

Specify the dst crs in convert? #160

FlorisCalkoen opened this issue Mar 28, 2024 · 2 comments

Comments

@FlorisCalkoen
Copy link

I don't think gpq currently contains a method to specify the target crs. Also I see that by default you use "OGC:CRS84", what is your rationale for that? Why not, for example, use "EPSG:4326"?

I'll add a little bit of context on my use case. So I just used gpq to convert a 'big' collection of parquet files to geoparquet by simply doing gpq convert non-geo.parquet valid-geo.parquet in a for loop. Further in my processing chain I load these geoparquet files using GeoPandas, but I ran into an issue because when the crs == "OGC:CRS84" it cannot be converted to epgs. Although it's expected behaviour I'm mostly just curious why you use "OGC:CRS84" instead of "EPSG:4326".

gdf = gpd.read_parquet("valid-geo.parquet")
print(gdf.crs.to_epsg()) # None
print(gdf.to_crs(4326).to_epsg()) # 4326

I'll probably change my routines from gdf.crs.to_epsg() to gdf.crs.to_string(), but I guess that several others rely on to_epsg() as well when using GeoPandas, so I thought it's worth opening a discussion point here.

@FlorisCalkoen
Copy link
Author

Following up, my guess is that dask_geopandas is also struggling to read GeoParquet files thave have been converted with gpq due to a similar issue/decision that has been made in the gpq crs conversion/specification. See example below:

storage_options = {"account_name": <account_name> "credential": <token>}
href = "<protocol>/<container>/<prefix>/valid-geo.parquet"
gdf = dask_geopandas.read_parquet(href, storage_options=storage_options)
Python traceback
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask/backends.py:136, in CreationDispatch.register_inplace.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    135 try:
--> 136     return func(*args, **kwargs)
    137 except Exception as e:

File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py:538, in read_parquet(path, columns, filters, categories, index, storage_options, engine, use_nullable_dtypes, dtype_backend, calculate_divisions, ignore_metadata_file, metadata_task_size, split_row_groups, blocksize, aggregate_files, parquet_file_extension, filesystem, **kwargs)
    536     blocksize = None
--> 538 read_metadata_result = engine.read_metadata(
    539     fs,
    540     paths,
    541     categories=categories,
    542     index=index,
    543     use_nullable_dtypes=use_nullable_dtypes,
    544     dtype_backend=dtype_backend,
    545     gather_statistics=calculate_divisions,
    546     filters=filters,
    547     split_row_groups=split_row_groups,
    548     blocksize=blocksize,
    549     aggregate_files=aggregate_files,
    550     ignore_metadata_file=ignore_metadata_file,
    551     metadata_task_size=metadata_task_size,
    552     parquet_file_extension=parquet_file_extension,
    553     dataset=dataset_options,
    554     read=read_options,
    555     **other_options,
    556 )
    558 # In the future, we may want to give the engine the
    559 # option to return a dedicated element for `common_kwargs`.
    560 # However, to avoid breaking the API, we just embed this
    561 # data in the first element of `parts` for now.
    562 # The logic below is inteded to handle backward and forward
    563 # compatibility with a user-defined engine.

File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask_geopandas/io/parquet.py:57, in GeoArrowEngine.read_metadata(cls, fs, paths, **kwargs)
     55 @classmethod
     56 def read_metadata(cls, fs, paths, **kwargs):
---> 57     meta, stats, parts, index = super().read_metadata(fs, paths, **kwargs)
     59     gather_spatial_partitions = kwargs.pop("gather_spatial_partitions", True)

File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:549, in ArrowDatasetEngine.read_metadata(cls, fs, paths, categories, index, use_nullable_dtypes, dtype_backend, gather_statistics, filters, split_row_groups, blocksize, aggregate_files, ignore_metadata_file, metadata_task_size, parquet_file_extension, **kwargs)
    548 # Stage 2: Generate output `meta`
--> 549 meta = cls._create_dd_meta(dataset_info)
    551 # Stage 3: Generate parts and stats

File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask_geopandas/io/parquet.py:103, in GeoArrowEngine._create_dd_meta(cls, dataset_info, use_nullable_dtypes)
     99         raise ValueError(
    100             "No dataset parts discovered. Use dask.dataframe.read_parquet "
    101             "to read it as an empty DataFrame"
    102         )
--> 103 meta = cls._update_meta(meta, schema)
    104 return meta

File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask_geopandas/io/parquet.py:77, in GeoArrowEngine._update_meta(cls, meta, schema)
     74 """
     75 Convert meta to a GeoDataFrame and update with potential GEO metadata
     76 """
---> 77 return _update_meta_to_geodataframe(meta, schema.metadata)

File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask_geopandas/io/arrow.py:36, in _update_meta_to_geodataframe(meta, schema_metadata)
     35 geometry_column_name = geo_meta["primary_column"]
---> 36 crs = geo_meta["columns"][geometry_column_name]["crs"]
     37 geometry_columns = geo_meta["columns"]

KeyError: 'crs'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[36], line 6
      3 with fsspec.open(href, mode="rb", **storage_options) as f:
      4     gpd.read_parquet(f)
----> 6 dask_geopandas.read_parquet(href, storage_options=storage_options)

File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask_geopandas/io/parquet.py:112, in read_parquet(*args, **kwargs)
    111 def read_parquet(*args, **kwargs):
--> 112     result = dd.read_parquet(*args, engine=GeoArrowEngine, **kwargs)
    113     # check if spatial partitioning information was stored
    114     spatial_partitions = result._meta.attrs.get("spatial_partitions", None)

File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask/backends.py:138, in CreationDispatch.register_inplace.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    136     return func(*args, **kwargs)
    137 except Exception as e:
--> 138     raise type(e)(
    139         f"An error occurred while calling the {funcname(func)} "
    140         f"method registered to the {self.backend} backend.\n"
    141         f"Original Message: {e}"
    142     ) from e

KeyError: "An error occurred while calling the read_parquet method registered to the pandas backend.\nOriginal Message: 'crs'"

@tschaub
Copy link
Member

tschaub commented Apr 3, 2024

It looks like you are running into an issue with dask-geopandas. The crs is optional in a GeoParquet geometry column. It looks like dask-geopandas assumes it will be present here https://github.com/geopandas/dask-geopandas/blob/3489a1cbafbeda3c0d4493133112969268e58d66/dask_geopandas/io/arrow.py#L36

I think this is the same issue geopandas/dask-geopandas#270

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants