pipelines

Apache beam and python pipelines for data ingestion

The main purpose of ingestion.py is to create a simple pipeline capable of using the csv file in data storage and processing it with Python and the data stream to insert the data into BigQuery.

It is necessary to first check how clean the data is, for this reason the data_quality.py pipeline helps to detect the null fields that might exist in the data source and report them in an error lookup table in a large query, so that the first step is to verify the csv with data_quality.py and after finishing a validation process continue inserting the data in BigQuery with ingestion.py

Step One: Check for Null Values

This is an example of how the data_quality.py pipeline could run on the Google cloud console:

user@cloudshell: python data_quality.py --project=main-ember-314713 --region=us-central1 --runner=DataflowRunner --stagein_location=gs://databucket1001/test -- temp_location=gs://databucket1001/test --input gs://databucket1001/f_medium_sessions.csv --output lake.errors_null --save_main_session

Second Step: Pipeline to Insert data into BigQuery

this is an example how the pipeline ingestion.py could be executed in Google Gloud console:

python ingestion.py --project=main-ember-314713 --region=us-central1 --runner=DataflowRunner --stagein_location=gs://databucket1001/test --temp_location=gs://databucket1001/test --input gs://databucket1001/f_event_sessions.csv --output lake.f_event_sessions --schema "profileId:STRING,dt:DATE,medium:STRING,eventCategory:STRING,landinpagepath:STRING,country:STRING,sessions:NUMERIC,users:NUMERIC" --dictionary "profileId,dt,medium,eventCategory,landinpagepath,country,sessions,users"

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
LICENSE		LICENSE
README.md		README.md
adjust.ipynb		adjust.ipynb
data_quality.py		data_quality.py
ingestion.py		ingestion.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pipelines

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

pipelines

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages