Scalable schema-naive ingestion #49
Conversation
@kylebarron thinking about this again ... Looking at how the data actually gets laid out with these structures, I think there could be some real benefits to using Map even as the final storage layout. With Assets as a Struct, the nested Asset Struct attributes (href, rel, type, etc.) end up with a column per Asset attribute per Asset type, while as a Map(String, Struct) there would be a single column for all hrefs, a single column for all rels, etc. This should compress much better than the Struct representation -- it should be particularly effective for any assets with proj:* definitions. For the case you put forward of getting just the red href, it wouldn't need to read the entire wide red Asset; it would just pull from that href column. If you were getting a lot of hrefs (and particularly hrefs from multiple assets), any benefit from a Struct might actually shift in favor of a Map. I'm going to benchmark some Parquet file sizes and query performance on a basic table with just an id and the assets as either Map or Struct to see whether any of my suspicions hold true.
@bitner while you're experimenting, I'll bring in the idea of moving assets up to the top-level, in the same way we did with
I think a lot of access patterns for stac-geoparquet are only going to want to grab/query on one asset, not all of them, so anything that enables access to only one asset would be a win.
@gadomski that actually goes completely counter to where I was thinking; having separated keys, or a struct, really makes the schema management soooooo much more complicated. The interesting thing....
I'm just coming from the STAC perspective, where for a given Collection you usually have the same set of asset keys for every item (maybe missing one, which could be nullable). |
In general, converting STAC to GeoParquet runs into schema inference issues, because GeoParquet needs a strict schema while STAC can have a much looser schema, or a schema that changes per row.
The current Arrow-based conversion approach uses one of two alternative methods:
Instead, in chatting with @bitner, we realized that we could improve on these two approaches by leveraging the knowledge that we're working with STAC spec objects. As long as the user knows which extensions are included in a collection, stac-geoparquet can pre-define the maximal Arrow schema defined by the STAC Item specification. This allows for minimal work by the end user while enabling streaming conversions of JSON data into GeoParquet.
To avoid the user needing to know the full set of asset names, we define assets under a `Map` type, which has pros and cons as noted in radiantearth/stac-geoparquet-spec#7. In particular, with a Map type it's not possible to statically infer the asset key names from the Parquet schema, and it's also not possible to access data from only a single asset without downloading data for every asset. E.g. if you wanted to know the `red` asset's href, you'd have to download the hrefs for all assets, while a Struct type would allow you to access only the red href column.

But converting first into a Map-based GeoParquet file, as we do in this PR, could make for an efficient ingestion process, because it would allow us to quickly find the full set of asset names.
So this scalable STAC ingestion would become a two-step process:
The second part would become much, much easier by happening after the first step, instead of trying to start directly from JSON files.
Change list
`PartialSchema`). Note that this requires a certain amount of complexity, because the schema for how we want data to reside in memory is not necessarily the same as the schema used for parsing input dicts. This heavily uses `pyarrow.unify_schemas` to be able to work with partial schemas (for the core spec and for each extension).

This continues the discussion started in radiantearth/stac-geoparquet-spec#7.