Can the goals of the protocol be achieved at the I/O library layer? #7
Replies: 2 comments 2 replies
@sharkinsspatial Thanks for taking the time to read through this, even if only briefly, and writing up your thoughts. I appreciate the detailed comments! This is definitely an important conversation to have. I don't want to discount the value of improving I/O client-side; I think there would continue to be value there even if CCRP becomes a thing. But let me see if I can respond to your points in a way that adequately conveys why I think client-side I/O optimizations can only get us so far.
I hope these responses help you see why I think we need to do more than just optimize client I/O. Let me know if I can unpack any of this more or if you have more thoughts. I really do appreciate having to defend this idea, because I'm still not sure if it is really a good one or just a mirage, so make me work for it. ❤️
Another proposed advantage of CCRP: it inherently supports hrefs that index into an array. With a dataset like a grid-aligned temporal stack of imagery or other data, you could have a STAC collection asset point to the dataset as a whole, and then items in the collection (individual images) could each have an asset pointing to the corresponding time slice in the stack. That presents an interface allowing the equivalent of searching into a zarr array, not just finding the array itself.
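As a purely hypothetical illustration of that pattern: a STAC item's asset could carry an href that addresses one time slice of the shared stack (the host, path, and `?t=` query syntax below are invented for the sketch, not part of any CCRP or STAC spec):

```json
{
  "type": "Feature",
  "id": "scene-2024-06-01",
  "assets": {
    "data": {
      "href": "https://ccrp.example.com/stacks/temperature.zarr/t2m?t=42",
      "title": "Time slice 42 of the shared temporal stack",
      "roles": ["data"]
    }
  }
}
```

The collection-level asset would point at the whole stack, while each item's asset href indexes into it.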
@jkeifer Very excited to see all the fast progress on this. I hadn't been tracking this work, but I was just referred to your blog PR on CNG cloudnativegeo/website-cloudnativegeo.org#88. I don't think I've had time to fully digest the concept, but a few things jumped out at me initially that might warrant some wider discussion. Again, these are after a single read-through of the very thorough blog post, so take my thoughts with a grain of salt 😆.
This problem seems more general than chunks. In my mind it is an I/O optimization problem: we want to dispatch adjacent HTTP range requests in a way that balances HTTP request overhead against optimal data transfer size. Though it is not exposed in an obvious way, `fsspec` has already pioneered this concept as an I/O library optimization through its use of `cat_ranges` and its configuration options, which allow merging adjacent requests within a specified overall size range. See https://developer.nvidia.com/blog/optimizing-access-to-parquet-data-with-fsspec/ for more details on optimizing Parquet reads. @kylebarron and I have had several discussions about implementing middleware for `obstore` which uses heuristics for "smart" adjacent range coalescing. The `obstore` middleware approach is attractive because it allows anyone to "bring their own" optimization configuration tuned for their specific use case, rather than trying to build a single solution for everyone.

The protocol docs describe CCRP as "A byte broker - returns raw, unprocessed chunks". I'm unclear on how the larger coalesced block of bytes returned by the protocol will interact with codec pipelines on the client side 🤔. I'll have to defer to others with more knowledge in this area, but my limited understanding is that some compression schemes are only valid when operating on the bytes of the originally compressed chunk and would not function on the larger coalesced stream. Whose responsibility will it be to split this larger coalesced "chunk" into its original constituents for codec processing? Will there need to be companion client implementations that handle disassembling data returned from the broker?
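To make the coalescing idea concrete, here is a minimal sketch of the kind of heuristic in question, plus the splitting step that the codec question implies someone must own. The function names, thresholds, and signatures are illustrative only; they are not fsspec's or obstore's actual API:

```python
def coalesce_ranges(ranges, max_gap=1024, max_block=8 * 1024 * 1024):
    """Merge adjacent byte ranges when the gap between them is small,
    capping the merged block size to bound transfer overhead.

    ranges: list of (start, end) tuples, end exclusive.
    Returns a list of merged (start, end) blocks to fetch in one request each.
    """
    if not ranges:
        return []
    ordered = sorted(ranges)
    merged = []
    start, end = ordered[0]
    for s, e in ordered[1:]:
        # Merge only if the wasted gap is small and the block stays bounded.
        if s - end <= max_gap and e - start <= max_block:
            end = max(end, e)
        else:
            merged.append((start, end))
            start, end = s, e
    merged.append((start, end))
    return merged


def split_block(block, ranges, block_start):
    """Slice one coalesced byte block back into per-chunk payloads so each
    original chunk can be decompressed by its own codec pipeline."""
    return [block[s - block_start:e - block_start] for s, e in ranges]
```

Whether `split_block` lives in the broker, in a companion client library, or in the I/O layer is exactly the open question above; the split itself is cheap, but someone has to know the original chunk boundaries.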
The biggest reason so many users embrace object storage for data storage and delivery is a simple one - "laziness" 😆. I personally would much rather delegate the responsibility of uptime, maintenance, and hyperscaling to the cloud providers. Any additional layer we introduce between object storage and the client is server infrastructure we need to pay for, maintain, and scale. The ability to defer scaling to the underlying object storage implementation is probably the biggest reason for the explosive growth of cloud distributions of scientific data.
In short I agree that the problems that CCRP is trying to address are widespread and painful, but I wonder if focusing on I/O client optimizations would allow us to address a wider range of these problematic cases with a simpler approach.