I worked on a project that required multiple data ingestions into an on-premise data lake. Until then, the client built every new ingestion as a completely new development, re-implementing the validations and processing defined by the users. This approach caused two main problems: repeated, poorly scalable code and very long development times. To solve them, I proposed and implemented an ingestion engine developed in PySpark. The component was built under the object-oriented programming paradigm, following good development practices (clean code and SOLID principles). The engine solved both problems the client faced in new developments: the code became reusable and optimised, and development times were reduced by 90%.
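To give an idea of the structure, below is a minimal sketch of what such a configuration-driven ingestion engine could look like. The class and step names (`IngestionStep`, `DropDuplicates`, `RequireColumns`, `IngestionEngine`) are hypothetical illustrations, not the project's actual code: each reusable rule implements a single-responsibility interface, and the engine composes them (open/closed principle).

```python
from abc import ABC, abstractmethod
from typing import List

from pyspark.sql import DataFrame, SparkSession


class IngestionStep(ABC):
    """One reusable validation or processing rule (hypothetical interface)."""

    @abstractmethod
    def apply(self, df: DataFrame) -> DataFrame:
        ...


class DropDuplicates(IngestionStep):
    """Example processing step: remove duplicate rows."""

    def apply(self, df: DataFrame) -> DataFrame:
        return df.dropDuplicates()


class RequireColumns(IngestionStep):
    """Example validation step: fail fast if mandatory columns are missing."""

    def __init__(self, columns: List[str]):
        self.columns = columns

    def apply(self, df: DataFrame) -> DataFrame:
        missing = set(self.columns) - set(df.columns)
        if missing:
            raise ValueError(f"Missing required columns: {missing}")
        return df


class IngestionEngine:
    """Runs a configurable pipeline of steps and writes the result to Hive."""

    def __init__(self, spark: SparkSession, steps: List[IngestionStep]):
        self.spark = spark
        self.steps = steps

    def run(self, source_path: str, target_table: str) -> None:
        df = self.spark.read.parquet(source_path)
        for step in self.steps:
            df = step.apply(df)
        df.write.mode("overwrite").saveAsTable(target_table)
```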
Achievements:
- Reusable and optimised code: the validations and processing previously re-implemented in every new ingestion were encapsulated in the component so they can be reused in new ingestions.
- Development times reduced by 90%, since a new ingestion only needs to configure and use the engine (see the usage sketch after this list).
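As a hypothetical usage example building on the sketch above, a new ingestion would only declare its source, target, and the steps it needs; the table name and path are placeholders:

```python
from pyspark.sql import SparkSession

# A new ingestion: pure configuration, no new validation/processing logic.
spark = (
    SparkSession.builder.appName("customers_ingest")
    .enableHiveSupport()
    .getOrCreate()
)

engine = IngestionEngine(
    spark,
    steps=[
        RequireColumns(["customer_id", "created_at"]),  # reused validation
        DropDuplicates(),                               # reused processing
    ],
)
engine.run(source_path="/data/raw/customers", target_table="curated.customers")
```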
Tech Stack:
Python, PySpark, Hadoop, Hive, OOP, SOLID, Clean Code, Git (Bitbucket), Jenkins, Bash