All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
- Upgrade Dgraph client to 24.1.1 while moving to official Dgraph Java client (`io.dgraph:dgraph4j:24.1.1`) with shaded dependencies (pull #270)
- Detect and partition sparse regions of uids (pull #224)
- Estimator `maxLeaseId` renamed to `maxUid`, as used with option `dgraph.partitioner.uidRange.estimator` (pull #221).
- Upgraded gson and requests dependencies (pull #225).
- Work with `maxUid` values that cannot be parsed (pull #216).
- Handle `maxUid` values larger than `Long.MaxValue` (pull #216).
- Handle Dgraph data type "default" as plain strings (pull #223).
- Supports the full unsigned long (64-bit) value range of Dgraph uids, mapped into signed longs (pull #222).
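The unsigned-to-signed mapping above can be sketched as follows. This is a minimal illustration, not the connector's actual code; the `UidMapping` class and its method names are hypothetical. The idea is to reinterpret the bits: uids above `Long.MaxValue` wrap to negative longs, and `Long`'s unsigned helpers convert between the two representations.

```java
// Hedged sketch: Dgraph uids are unsigned 64-bit integers, while Java/Spark
// longs are signed. Reinterpreting the bit pattern maps the full unsigned
// range onto signed longs. Class and method names are illustrative only.
public class UidMapping {
    // Parse a Dgraph uid string such as "0xffffffffffffffff" into a signed long carrier.
    public static long parseUid(String hex) {
        String digits = hex.startsWith("0x") ? hex.substring(2) : hex;
        return Long.parseUnsignedLong(digits, 16);
    }

    // Render the signed carrier back as the unsigned uid string.
    public static String formatUid(long uid) {
        return "0x" + Long.toUnsignedString(uid, 16);
    }

    public static void main(String[] args) {
        // The largest uid, 2^64 - 1, is carried as -1L but round-trips losslessly.
        long uid = parseUid("0xffffffffffffffff");
        System.out.println(uid);            // prints -1
        System.out.println(formatUid(uid)); // prints 0xffffffffffffffff
    }
}
```

Note that ordering is not preserved by this mapping (large uids compare as negative), which is why range partitioning over such values needs unsigned comparison helpers like `Long.compareUnsigned`.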
- Moved to shaded Java Dgraph client (`uk.co.gresearch.dgraph:dgraph4j-shaded:21.12.0-0`).
- Moved Java Dgraph client to 21.12.0.
- Support latest Dgraph release 21.12.0 (pull #147)
- Moved Java Dgraph client to 21.03.1.
- Support latest Dgraph release 21.03.0 (pull #101)
- Adds support for reading string predicates with language tags like `<http://www.w3.org/2000/01/rdf-schema#label@en>` (issue #63). This works with any source and mode except the node source in wide mode. Note that reading into GraphFrames is based on wide mode, so only untagged language strings can be read there. Filter pushdown is not supported for multi-language predicates yet (issue #68).
- Adds a readable exception and suggests next steps when gRPC fails with the `RESOURCE_EXHAUSTED` code.
- A missing `maxLeaseId` in the cluster state response defaults to `1000L` to avoid an exception.
- Improves predicate partitioning on projection pushdown so that full partitions are created.
- Fixes bug where predicate value filters were not pushed down to Dgraph correctly, causing incorrect results (issue #82)
- Fixes bug in reading `geo` and `password` data types.
- Tests against Dgraph 20.03, 20.07 and 20.11.
- Moved Java Dgraph client to 20.11.0.
- Upgraded all dependencies to latest versions.
- Optionally reads all partitions within the same transaction. This guarantees a consistent snapshot of the graph (issue #6). However, concurrent mutations reduce the lifetime of such a transaction and cause an exception once its lifespan is exceeded.
- Add Python API that mirrors the Scala API. The README.md fully documents how to load Dgraph data in PySpark.
- Fixed dependency conflicts between connector dependencies and Spark by shading the Java Dgraph client and all its dependencies.
- Refactored connector API, renamed `spark.read.dgraph*` methods to `spark.read.dgraph.*`.
- Moved `triples`, `edges` and `nodes` sources from package `uk.co.gresearch.spark.dgraph.connector` to `uk.co.gresearch.spark.dgraph`.
- Moved Java Dgraph client to 20.03.1 and Dgraph test cluster to 20.07.0.
- Add Spark filter pushdown and projection pushdown to improve efficiency when loading only subgraphs. Filters like `.where($"revenue".isNotNull)` and projections like `` .select($"subject", $"`dgraph.type`", $"revenue") `` will be pushed to Dgraph and only the relevant graph data will be read (issue #7).
- Improve performance of `PredicatePartitioner` for multiple predicates per partition. Restores the default number of predicates per partition of `1000` from before 0.3.0 (issue #22).
- The `PredicatePartitioner` combined with `UidRangePartitioner` is the default partitioner now.
- Add stream-like reading of partitions from Dgraph. Partitions are split into smaller chunks, which lets Spark read Dgraph partitions of any size.
- Add Dgraph metrics to measure throughput, visible on the Spark UI Stages page and through `SparkListener`.
- Move Google Guava dependency version to 24.1.1-jre due to a known security vulnerability fixed in 24.1.1.
- Load data from Dgraph cluster as GraphFrames `GraphFrame`.
- Use exact uid cardinality for uid range partitioning. Combined with predicate partitioning, large predicates get split into more partitions than small predicates (issue #2).
- Improve performance of `PredicatePartitioner` for a single predicate per partition (`dgraph.partitioner.predicate.predicatesPerPartition=1`). This becomes the new default for this partitioner.
- Move to Spark 3.0.0 release (was 3.0.0-preview2).
- Dgraph groups with no predicates caused a `NullPointerException`.
- Predicate names need to be escaped in Dgraph queries.
- Load nodes from Dgraph cluster as wide nodes (fully typed property columns).
- Added `dgraph.type` and `dgraph.graphql.schema` predicates to be loaded from Dgraph cluster.
Initial release of the project
- Load data from Dgraph cluster as triples (as strings or fully typed), edges or nodes `DataFrame`s.
- Load data from Dgraph cluster as Apache Spark GraphX `Graph`.
- Partitioning by Dgraph Group, Alpha node, predicates and uids.