I worked on a project that required multiple data ingestions into an on-premise data lake. Until then, the client built every new ingestion as a completely new development, re-implementing the validations and processing defined by the users. This approach caused two main problems: repeated, poorly scalable code and very long development times. To solve them, I proposed and implemented an ingestion engine developed in PySpark. The component was built under the object-oriented programming paradigm, following good development practices (clean code and SOLID principles). The engine solved both problems the client faced in new developments: the code became reusable and optimised, and development times were reduced by 90%.
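To give an idea of the structure, below is a minimal sketch of what such a configuration-driven ingestion engine could look like. The class and step names (`IngestionStep`, `DropDuplicates`, `RequireColumns`, `IngestionEngine`) are hypothetical illustrations, not the project's actual code: each reusable rule implements a single-responsibility interface, and the engine composes them (open/closed principle).

```python
from abc import ABC, abstractmethod
from typing import List

from pyspark.sql import DataFrame, SparkSession


class IngestionStep(ABC):
    """One reusable validation or processing rule (hypothetical interface)."""

    @abstractmethod
    def apply(self, df: DataFrame) -> DataFrame:
        ...


class DropDuplicates(IngestionStep):
    """Example processing step: remove duplicate rows."""

    def apply(self, df: DataFrame) -> DataFrame:
        return df.dropDuplicates()


class RequireColumns(IngestionStep):
    """Example validation step: fail fast if mandatory columns are missing."""

    def __init__(self, columns: List[str]):
        self.columns = columns

    def apply(self, df: DataFrame) -> DataFrame:
        missing = set(self.columns) - set(df.columns)
        if missing:
            raise ValueError(f"Missing required columns: {missing}")
        return df


class IngestionEngine:
    """Runs a configurable pipeline of steps and writes the result to Hive."""

    def __init__(self, spark: SparkSession, steps: List[IngestionStep]):
        self.spark = spark
        self.steps = steps

    def run(self, source_path: str, target_table: str) -> None:
        df = self.spark.read.parquet(source_path)
        for step in self.steps:
            df = step.apply(df)
        df.write.mode("overwrite").saveAsTable(target_table)
```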
Achievements:
- Reusable and optimised code: the validations and processing previously re-implemented in every new ingestion were encapsulated in the component so they can be reused in new ingestions.
- Development times reduced by 90%, since a new ingestion only needs to configure and use the engine (see the usage sketch after this list).
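As a hypothetical usage example building on the sketch above, a new ingestion would only declare its source, target, and the steps it needs; the table name and path are placeholders:

```python
from pyspark.sql import SparkSession

# A new ingestion: pure configuration, no new validation/processing logic.
spark = (
    SparkSession.builder.appName("customers_ingest")
    .enableHiveSupport()
    .getOrCreate()
)

engine = IngestionEngine(
    spark,
    steps=[
        RequireColumns(["customer_id", "created_at"]),  # reused validation
        DropDuplicates(),                               # reused processing
    ],
)
engine.run(source_path="/data/raw/customers", target_table="curated.customers")
```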
Tech Stack:
Python, PySpark, Hadoop, Hive, OOP, SOLID, Clean Code, Git (Bitbucket), Jenkins, Bash