PoC: New dataframe read source interface #1864
Draft
Jolanrensen wants to merge 14 commits into
…test and production code for improved API unification and flexibility.
… in converters/parsers
… parseToDataFrameReadSource parser option.
#450
@zaleslaw
WIP and proof-of-concept.

Drafting and exploring what a new `DataFrame.read()` could be and do (together with Claude). It's named `readSource()` for now.

In its current state you can give it anything, and it figures out the rest (be that an ArrowReader, a URL, a String, or an Excel sheet). Extra options can be provided when needed.

`DataRow.readSource()` also works.

It also comes with a `DataFrameSchema.readSource()`, if you need just the types (overridden by jdbc), and maybe something like `CodeString.read()`, if you just need the generated interfaces (overridden by openapi-generator).

I'm also thinking about what a unified system like this could bring to the rest of dataframe. It will be very easy, for instance, to hook it into our parsers or converters! Currently the only format we can parse/convert is JSON Strings -> DataFrame, but this could open up any conversion to DataFrame.
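To picture the "give it anything and it figures out the rest" idea, here is a minimal sketch of one way such dispatch could work: a list of handlers, each declaring whether it accepts a given source. Every name in it (`SourceHandler`, `CsvStringHandler`, `UrlHandler`, and this standalone `readSource`) is hypothetical and made up for illustration. It is not the code in this PR, and rows are plain maps rather than a real `DataFrame`.

```kotlin
import java.net.URI
import java.net.URL

// Hypothetical sketch: none of these names come from the PR.
// A handler states whether it accepts a source and how to read it.
interface SourceHandler {
    val name: String
    fun accepts(source: Any): Boolean
    fun read(source: Any): List<Map<String, String>>
}

// Accepts Strings that contain a comma and reads them as header + rows.
object CsvStringHandler : SourceHandler {
    override val name = "csv-string"
    override fun accepts(source: Any): Boolean = source is String && ',' in source
    override fun read(source: Any): List<Map<String, String>> {
        val lines = (source as String).trim().lines()
        val header = lines.first().split(',')
        return lines.drop(1).map { line -> header.zip(line.split(',')).toMap() }
    }
}

// Accepts URLs. A real handler would fetch and sniff the content;
// this stub only records the URL and performs no network access.
object UrlHandler : SourceHandler {
    override val name = "url"
    override fun accepts(source: Any): Boolean = source is URL
    override fun read(source: Any): List<Map<String, String>> =
        listOf(mapOf("url" to source.toString()))
}

// Order matters: the first handler that accepts the source wins.
val handlers: List<SourceHandler> = listOf(UrlHandler, CsvStringHandler)

fun readSource(source: Any): List<Map<String, String>> {
    val handler = handlers.firstOrNull { it.accepts(source) }
        ?: throw IllegalArgumentException("No handler accepts ${source.javaClass.simpleName}")
    return handler.read(source)
}

fun main() {
    println(readSource("a,b\n1,2"))
    println(readSource(URI.create("https://example.com/data.csv").toURL()))
}
```

In a design like this, the same handler list could in principle back row-level or schema-only reads too, which is roughly the kind of reuse described above.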
I prototyped it in our `convert` operation, meaning you can now convert any supported type to `DataRow`, `DataFrame`, or `DataFrameSchema` :)

I also tried to implement it for `parse`, since JSON parsing was already there. This appears to be a bit trickier, though. There are a lot of edge cases where, for instance, `"[a b c]"` can successfully be parsed as CSV, causing all sorts of issues later on.

I did manage to make this pass all tests so far, though, by making "parsing to dataframe read source" optional (false by default), enabling it only where needed, and adding some extra checks for String input of CSV and JSON.
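The `"[a b c]"` problem can be illustrated with a small sketch. A lenient CSV check treats almost any text as a one-column table, so some stricter test plus an opt-in flag is needed before a String is turned into a table. The flag name `parseToDataFrameReadSource` is taken from this PR's commit message. Everything else here (`SketchParserOptions`, `looksLikeCsvNaive`, `looksLikeCsvStrict`, `parseCell`, and the specific checks) is assumed for illustration and is not the PR's actual logic.

```kotlin
// Hypothetical sketch; the checks below are illustrative, not the PR's code.

// Stand-in options class: table parsing is off unless explicitly enabled.
data class SketchParserOptions(val parseToDataFrameReadSource: Boolean = false)

sealed interface Parsed
data class AsString(val value: String) : Parsed
data class AsTable(val header: List<String>, val rows: List<List<String>>) : Parsed

// Naive check: nearly any non-blank text is a valid single-column CSV.
fun looksLikeCsvNaive(text: String): Boolean = text.isNotBlank()

// Stricter check: a header plus at least one data row, more than one
// column, and the same column count on every line.
fun looksLikeCsvStrict(text: String, delimiter: Char = ','): Boolean {
    val lines = text.trim().lines().filter { it.isNotBlank() }
    if (lines.size < 2) return false
    val widths = lines.map { it.split(delimiter).size }
    return widths.first() > 1 && widths.all { it == widths.first() }
}

// Only produce a table when the caller opted in AND the strict check passes;
// otherwise the value stays a plain String.
fun parseCell(text: String, options: SketchParserOptions = SketchParserOptions()): Parsed {
    if (!options.parseToDataFrameReadSource || !looksLikeCsvStrict(text)) return AsString(text)
    val lines = text.trim().lines().filter { it.isNotBlank() }
    return AsTable(lines.first().split(','), lines.drop(1).map { it.split(',') })
}

fun main() {
    val optIn = SketchParserOptions(parseToDataFrameReadSource = true)
    println(looksLikeCsvNaive("[a b c]"))   // the naive check accepts it
    println(parseCell("[a b c]", optIn))    // the strict check keeps it a String
    println(parseCell("a,b\n1,2"))          // default options: no table parsing
    println(parseCell("a,b\n1,2", optIn))   // opted in and well-formed: a table
}
```

Keeping the flag off by default means existing `parse` behavior is unchanged unless a caller asks for it, which matches the "false by default, enabled only where needed" approach described above.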