Collaborative Data Engineering with Git

DataCamp · Intermediate · Data Engineering · 1y ago

Key Takeaways

The video discusses collaborative data engineering with Git, covering topics such as version control, branching, and merging, as well as best practices for documentation and conflict resolution. Tools used include Git, GitHub, Airflow, and VS Code.

Full Transcript

Hello everyone, and thank you for joining today's session. My name is Ree and I'll be your moderator today. We're about to get started; we're just waiting so everyone has a chance to join. If you haven't done so already, please register for this session: you can scan the QR code on screen, and there's a link in the chat and in the YouTube video description. Registering means we can send you the recording and the resources as soon as they're up in our resource center. If you're not already subscribed to the DataCamp channel, please subscribe and drop a like on this video to let us know you've arrived safely. For today's session, Collaborative Data Engineering with Git, there is a little bit of setup required if you want to code along with us. All of it is documented in the resources document, which is pinned in the chat and linked in the video description. We've got a GitHub repo, and there's a README file that explains everything. Please do ask questions during the session; we'll be covering them during the last 10 minutes. And if you're coding along live and have any issues or want to clarify anything, let us know in the chat and we can help you out with that as well.
If you do want to code along with us, everything you need is in the resources document that is pinned in the chat and linked in the video description, so please check that out now. Brilliant. I think that is everything from me, so now I'll hand you over to your host for today's session, Richie. Richie, please take it away.
Hi there, data scamps and data champs. This is Richie. Let's kick off straight away. Git is the world's most popular version control tool, so if you write any sort of code, it's an essential skill to have. It's useful for working individually: Git is essentially infinite undo when you're working on your own. But the fun really starts when you try to collaborate with others. In a work environment, being part of a team is a very natural thing, so you're going to be collaborating with your colleagues on creating software or data-related code. You need to learn how to collaborate with Git, and that gets slightly more complicated. That's what we're going to cover today. We're going to be practicing on a data-engineering-style repository and trying to solve some problems, imagining that we were working with others. No data engineering knowledge is required, but it makes for a very fun example.
All right, I've got a fantastic guest for you today. Amanda Crawford Adamo is both a software engineer and a data engineer. She's got a decade of hands-on experience using Git and other version control tools, and she's applied these across both software development projects and data engineering projects. Amanda was previously a senior data engineer at Dropbox and a software engineer at Microsoft, so she's seen both sides of the coin, and she's also the instructor of DataCamp's Advanced Git course. With that, over to you, Amanda.
Thank you, Richie, and welcome everyone.
I'm so excited to talk about...
Lost Amanda. Hold on a second. Let's bring you on. Just having a few technical issues here.
All right, I'm here. Okay, good. So welcome everyone. I am so excited to talk about Git in the context of working in a team and an organization. Like Richie said, Git is one of the tools you hear about when you want to version-control what you're building. You can use it personally, but it's really powerful when you are working on a team. (Oh, we can hear you. Okay, got it. Cool.) When you're working on a team with a ton of different developers, there are a ton of changes, and there's a story that needs to be told. (Oh, I can hear Amanda. I'm going to close the chat.) But it can be very hard to keep track of those changes, and it's really important in the data engineering setting to capture them. So today we're going to talk about collaborative data engineering, and we're going to go through a code-along using Git, and I'll teach you a tip or a trick to write your history in a way that tells a story.
My name is Amanda Crawford Adamo. You heard my titles: I've been a senior data engineer at Dropbox and a software engineer at Microsoft, and I've also had some experience in testing data systems. So I'm here sharing what I know with you all. Let's get started.
What we want to think about today, our first goal, is how we collaborate as data engineers, or future data engineers, with data analysts and our stakeholders when we're building warehouses and data products and features, right?
The second goal for me is to show you tools so that you can create a transparent development history that is shared among your partners: your data engineering and data analyst partners, your finance stakeholders, your marketing stakeholders. We want to use Git to break down silos. We see that a lot of data teams work in silos; there's research happening on that constantly, showing that data teams have a high chance of building silos into the way they work through their development workflow. That can mean we have miscommunication about the data features we're building as data engineers, and what we build may be different from what's expected. What we want to do is try to reduce this and reduce the cost of it. McKinsey, Airtable, and Forrester did a few studies and found that $3.1 trillion is lost annually to data silos. It's not just a technical issue; it's a business risk. That's a ton of money, and 70% of companies are struggling with it. You can see teams spending up to 12 hours a week trying to find out why reports are not reflecting what business stakeholders expect, and trying to determine whether a data set, analysis, or dashboard can be trusted. So it can become costly and prevent businesses from moving at the speed they need to make good decisions and provide value for their customers. Our mission is to use Git to bridge part of that gap and make collaboration a little bit more seamless.
It won't solve the whole problem, but it's the small things that contribute to breaking down that silo: making sure that you are sharing the data, the changes, and the way you are transforming and processing that data, so that your stakeholders and your partners, such as data analysts and scientists, trust you and the data you're producing through your development work. Some of the problems that were noted in the studies from McKinsey, Forrester, and Airtable are more specific to data engineering teams. As much as we develop and code, we can't necessarily say that data engineering tools, code bases, and products are the same as traditional software products or product-facing software. We have this interconnected environment where pipeline dependencies directly affect our downstream reports. You can't assume that a change will stay invisible and that folks won't notice until later. Sometimes that happens, but when you make a change, more than likely it's going to affect the data analyst if they're running some experiments, or one of the finance reports, where they see an unexpected decrease or increase in revenue numbers. Say you create a pipeline and you change the update time: that can directly change someone's report if they're pulling at a certain time and you didn't communicate it. And schema changes can break existing analyses. With a transactional system, you can add a column or two and you'll very rarely see your app break if it's built well.
But in the world of data engineering, if you add a column and someone's using a select-all query (SELECT *), and that's funneling into some visualization tool to create a dashboard, you might see that dashboard fail because it says, "Hey, you have a column that I wasn't expecting." So it's very persnickety. When you think about data engineering and how everything you change has an immediate impact, you have to think about how to mitigate that impact on your users. Performance optimizations can cause significant changes to query times if we don't test well, and that can cause churn for other teams. You can see reports delayed, and delay can be costly, especially if you're on the finance side of data engineering, where I come from. When you are finding these issues, data engineers are often making fixes as fast as they can, because that data needs to be updated so that each and every day business leaders know where the company stands. When you make those fixes, from my experience and from the reports that are out there, that's where you end up losing information, because you don't document it well in a pull request, and the record doesn't capture why you made that change in the heat of the moment. You have folks who may ask, "Why did this happen? I see that it happened, I see that there was a change. But why? Who do I talk to?" That's where we can say we have a documentation crisis on top of a human communication crisis. It's really hard to be specific about all the changes you made in a data system, because a lot of the time you're making changes that affect disparate systems.
You may say that you're fixing null values, but where in the pipeline are you fixing them? Are you changing the database schema to allow, or no longer allow, null values? Are you pulling that value in from a point-of-sale system like Stripe? Just saying "fix null values" or "update pipelines" or "add a column" may not give the whole picture when you write that in your commit message. Today we'll talk about writing your documentation at the commit level. This is a bit different, because a lot of the time we say, "Make sure your PR has all this information." To be honest, you can lose that information by the time you get to creating a PR. So why not use the commit to document well in those intermediate phases, so that you capture the information you have right then and there and don't have to move between different windows? You can do it all in one place. When you use structured commits, issues get resolved quite a bit faster. You can use git blame to immediately identify who made the change and the information attached to it. It's built into a lot of IDEs, which will show you the blame and the full commit message. When we use Git and commits this way, we create living documentation. There are a lot of tools that are trying to help data engineering teams add documentation, and they are really good, but I think what's just as powerful is using Git to constantly tell the story of why your commit is here and why it changed something. You don't have to use an external tool to do that, and if you want to use an external tool, you can also build around the commit history.
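As a rough illustration of commit-level documentation, the sketch below assembles a structured message from a type, a scope, a summary, and two body lines. The helper name `build_commit_message`, the "Data impact" and "Schema" labels, and the sample wording are all assumptions made for this example; the talk does not prescribe an exact template.

```python
def build_commit_message(kind, scope, summary, data_impact, schema_note):
    """Compose a conventional-commit style header plus a data-impact body.

    The body labels are illustrative, not a standard.
    """
    header = f"{kind}({scope}): {summary}"
    body = [
        f"Data impact: {data_impact}",
        f"Schema: {schema_note}",
    ]
    # A blank line separates the header from the body, as Git expects.
    return header + "\n\n" + "\n".join(body)


# Hypothetical example: a more informative version of "fix null values".
message = build_commit_message(
    kind="fix",
    scope="pipeline",
    summary="handle null discount values from point-of-sale extract",
    data_impact="revenue dashboards unchanged",
    schema_note="no schema updates in this commit",
)
print(message)
```

The resulting text could be passed to git commit, and the header line is what most tools show next to a blamed line, so it is worth keeping that line specific.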
But you want to use the commits because each one records an important change in the history of your project's evolution. And the branches that you create, you can think of them as lines of thought. For instance, if you're going to work on a feature, you create a feature branch and you say, "This feature is to add or update the discount logic." You also want to preserve the collaboration context. Sometimes you may do some merging between branches, and you want to make sure you record that there was a branch that had some logic, that I applied my changes over it, and that I am then merging that branch into our main branch. You want to preserve that context because the sequence of changes can tell you: here is where this logic or data feature was, and here is how we changed it to what you see now. That matters. For instance, as a data engineer you may have made a change, and the discount logic you added or changed may have caused the revenue numbers to go up or down. The analyst or the data scientist may see a fluctuation in the numbers, and what they can do is see that it used to compute the discount this way, now it's computing it in that manner, and that's why we see the change. I need to know where we used to be and where we are now, and I also want to know who made that change. Another benefit is searchability: if you have future teammates who are getting adjusted to the codebase, it's good for them to understand the history of certain data features.
When you're managing, or are part of, the team that builds a data product or a data warehouse, it's good for them to know, "We tried this and it didn't work for this team or this stakeholder, so maybe we shouldn't do that again." I don't have to search Jira or GitHub issues or all these different systems; it's all here in one place. I can use that information, find it as needed, and follow up on it. So it helps reduce onboarding time, both initial onboarding and onboarding into very specific areas of your data system and your data product features.
When we think about branching for data engineering teams, we can follow several different strategies. This is one that I propose. It follows the Git Flow semantics, but you can also adjust it for trunk-based development or whatever your company follows. The gist is that you create feature branches when you are introducing or updating data features. That's usually going to be done by a data engineer, but sometimes a data analyst may make some changes as well. Data analysts who are creating analyses or a dashboard can create a branch with a prefix of analysis, followed by something more specific about what they're working on. And when you are in the moment and have to fix issues and provide a hotfix, you follow the usual strategy of prefixing the branch name with hotfix. I think that when you're in a hotfix state, it's even more important to be very descriptive in your commit messages. That is where there was a conflict somewhere in the business process, in between the pipelines and the dashboards and the metrics that you have, right?
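The branch prefixes described here (feature, analysis, hotfix) can be checked mechanically. The following is a minimal sketch: the three prefixes come from the talk, while the slash separator, the lowercase hyphenated slug rule, the function name `branch_role`, and the sample branch names are assumptions for illustration.

```python
import re

# Prefixes named in the talk; the slug format after the slash is an assumption.
BRANCH_PATTERN = re.compile(r"^(feature|analysis|hotfix)/[a-z0-9]+(-[a-z0-9]+)*$")


def branch_role(name):
    """Return the branch's prefix if it follows the convention, else None."""
    match = BRANCH_PATTERN.match(name)
    return match.group(1) if match else None


# Hypothetical branch names used only to exercise the check.
candidates = [
    "feature/discount-calculation",
    "analysis/q3-report",
    "hotfix/null-discounts",
    "my-branch",
]
for candidate in candidates:
    print(candidate, "->", branch_role(candidate))
```

A check like this could run in a pre-push hook or CI job, though whether that is worthwhile depends on the team.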
So you need to be very clear in documenting that in your Git commit, so that you're telling a story for posterity, for future you and future engineers, and also for the data analysts who use this data and have to explain it to the stakeholders. Okay. The other thing is that if your commit history is built in a way that tells stories, you want to make sure that when you merge to main, you do it in a way that carries that information over and preserves the context. What I propose is that you use a recursive merge: git merge with the --no-ff flag, which means no fast-forward. When you run git merge, it will try to fast-forward by default. If it fast-forwards, Git simply moves main forward to the tip of your branch: no merge commit is recorded, so the history won't preserve which branch those commits came from. That undercuts the work you've done. You've written nice commit messages that explain the evolution of certain components in your data project, and now when you merge, the history doesn't tell the full story. That's why I say it's important to use the recursive merge, which you might also hear called a three-way merge, so that it preserves all of that information. You can show when those features were integrated, and you can roll back more precisely.
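To make the difference concrete, here is a toy model of a commit graph in plain Python. It is not how Git is implemented, and the commit messages and branch name are hypothetical. It shows one point worth stating explicitly: after a fast-forward the feature commits and their messages are still reachable from main, but there is no merge commit marking where a branch was integrated, whereas a --no-ff merge adds a commit with two parents.

```python
def commit(message, parents=()):
    """A toy commit: a message plus references to its parent commits."""
    return {"message": message, "parents": list(parents)}


def history(tip):
    """Return every commit message reachable from `tip`."""
    seen, messages, stack = set(), [], [tip]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        messages.append(node["message"])
        stack.extend(node["parents"])
    return messages


base = commit("chore: initial sales pipeline")
feature_tip = commit(
    "feat(discount): add discount calculation logic", parents=[base]
)

# Fast-forward: main simply moves to the feature tip; no new commit is created.
main_fast_forward = feature_tip

# --no-ff: a merge commit with two parents records where the branch joined main.
main_no_ff = commit(
    "Merge branch 'feature/discount-calculation'", parents=[base, feature_tip]
)

print(len(main_fast_forward["parents"]))  # a single parent: linear history
print(len(main_no_ff["parents"]))         # two parents: the branch stays visible
print(history(main_no_ff))
```

In this model, reverting the whole feature after a --no-ff merge means targeting one merge commit, which is one way to read the "roll back more precisely" point.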
It gives you much more flexibility and power in how you manage your project, but also in how you communicate your project, because it's a living product. I know a lot of folks may say, "Hey, this comes with a lot of conflicts," and yes, it does. But when you know how to merge well and how to resolve things easily, that won't be a blocker for you. You can learn more in my course about using advanced features to merge with recursive merge and to handle conflicts or prevent them as much as possible. Okay. I'm going to give folks a few minutes, but we're going to hop into the code-along and put some of these ideas into practice. We'll do a simulation of a data engineer and a data analyst working together in parallel, documenting their work, and resolving conflicts if we have time. We're going to do this on a fake version of a sales pipeline. I'll give folks about two minutes to get themselves set up, and I will set myself up so that I can switch over to the editor.
Okay, while people are getting set up, we do have one question from the audience, if you want to take that. And by the way, sorry for talking at the start there. I had a very odd situation where apparently everyone could hear me but everything was frozen on my end, so I had no idea what was going on. Anyway, let's go to an audience question. We have a question from Afne saying, "I heard GitKraken is a nifty visual Git management tool. Has anyone used this?" Amanda, I don't know whether you've come across GitKraken or whether you have any recommendations for Git tooling.
Yeah, I used that years ago and I loved it. If you have the budget for it, use it. But I think that Visual Studio is just as good as well.
I think it's important to understand Git outside of the tooling before you look for tooling, because it can get a little complex when you have a layer of abstraction over it and you don't know all the changes it's making. Once you know that, I think you'll pick your tool. Visual Studio is pretty good, but GitKraken was actually really good. I just haven't used it in a while.
Okay. Yeah, certainly having some sort of visual tool can help you see what's going on rather than just looking at terminal output, so I like the idea. Thank you for the recommendations and choices. All right, that's the only question for now, so I'm going to disappear again and we can continue on.
All right. I am going to switch my screen here; let me just do that in one second. Let's go back to the slideshow. Okay. Now that I have that set up (I have multiple screens, so sorry for that), I am ready. First and foremost, before we get started, I hope that everyone was able to take a look at the resources document that Reese posted in the chat. You will need to go to this GitHub repository and clone it locally. I do have it public for now, but I know it's going to be very easy for folks to try to merge into this repository. So my ask is that you fork it first and then download it. You can fork the repository if you go to GitHub and press this Fork button to create your own copy. It'll create an exact copy under your own GitHub username. Once you've done that, there are a few prerequisites to run this project. It's a very small data-engineering-style repo. It runs the pipelines on Airflow, so you'll need to have Docker installed and you'll need to have the Astro CLI installed.
That's a command-line tool created by Astronomer, and it makes it very easy to run Airflow locally in Docker. You also need to have Git and Python installed. Here's some additional information for checking whether you have everything set up and how to download what's missing. Once you have that set up, you start the Airflow instance by running astro dev start. The Airflow UI should pop up automatically, but if it doesn't, Airflow will be on port 8080 and you can see the UI there. Okay. Going back to our slides; give me one second here. We have a retail sales pipeline. Hypothetically speaking, we have a point-of-sale system that we ingest from using Airflow, Airflow publishes the data to a data warehouse, and the analyst pulls their report from the data warehouse. We don't have that fully set up in this GitHub repository, but this is the scenario we're supposed to be operating in as a data engineer. We have our roles. The data engineer is responsible for the ingestion from the point-of-sale system and any transformations. The analyst is going to need a discount analysis, and we need to work with them to add that. So we're going to collaborate with each other without breaking each other's work. I've set up some stuff already, but I'll go through the steps, and then we'll pick up where I left off, for the sake of time. The engineer is going to create a pipeline. We created a feature branch with a descriptive name and a semantic commit message. But what's important here is that commit message.
If you take a look, we're going to use conventional commit semantics. There's more about them in the resources and in the slides. We're saying that we have a feature, and what we're building is a pipeline, and then we describe the change: add initial sales data extraction. So when someone reads this commit, they're going to see that it's a feature commit, that it was working on creating a pipeline, and that it refers to sales. Okay. The analyst will document the requirements. They may create that in Jira or GitHub, and it's going to explain the business context. Usually it'll also tell you some of the requirements: when this needs to be delivered and how they expect this discount information to behave. And they're going to document the impact; they're going to tell you what the impact is. It's important for you to take note of that. They're creating a traceable paper trail when they do this. What you're going to do is take that information and go ahead and implement it. So if you have the repository downloaded, what I have here is a... okay.
All right, Amanda, just before we jump into the code, would you be able to up the text size ever so slightly? Is that all right? Yeah, perfect. Thank you very much.
Thank you for letting me know. I wonder why it's not allowing it. Let me try this again. That is odd. Hmm. Okay, let me try another way. Sorry, it looks like it's not allowing me to do that. What I can do is switch to just sharing the code screen for now. Let's see if that helps.
Yeah, I think that might be a better solution. Then we can switch between them, potentially.
Yes. Okay. All right. Is that better? Let's see if I make it smaller.
Yeah, if you could drag it out so it's a little bit more rectangular, that would be perfect. Yes, that's perfect. Thank you.
All right. So we have this repository. If you aren't familiar with Airflow, there are resources where you can learn how to use it. It's a great tool to have in your knowledge toolkit. When you create new DAGs, or pipelines, for Airflow, you create them in the DAGs folder. So we'll be working here, and we have the sales pipeline.py file. This is the file we've created and are going to work on together. It's meant to pull records from the point-of-sale system, but because this is just a tutorial demo, it's not pulling from any system. It does allow you to run this locally, and you can tailor and tinker with it as you please. We're going to add a calculate discount function, but before we do that we want to create a feature branch. One second. In here we'll say git checkout, and we're going to create a new branch, and we're going to prefix it with feature, and I am going to name it discount calculation. I want to be very specific about what I'm working on. When you do that, it's going to say that it switched to a new branch, and that's where we are going to start making changes. In a real environment you'll have much more complex pipelines, but here I'm just going to add a discount change. I'm going to say return, and I'm going to put some fake code here, returning the rows inserted. That's about it. We're doing a hypothetical here. And it is going to say quality checks. All right.
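The demo only stubs out the discount step with placeholder code. For readers who want something concrete to commit on the feature branch, here is one possible shape for such a function. Everything in it is an assumption for illustration: the name `calculate_discounts`, the `amount`, `discount`, and `net_amount` fields, the flat 10% rate, and the choice to treat a missing amount as zero are not taken from the repository.

```python
def calculate_discounts(rows, rate=0.10):
    """Return new rows with a discount and a net amount added.

    A missing or null amount is treated as 0.0 (an illustrative choice).
    """
    result = []
    for row in rows:
        amount = row.get("amount") or 0.0
        discount = round(amount * rate, 2)
        result.append(
            {**row, "discount": discount, "net_amount": round(amount - discount, 2)}
        )
    return result


# Hypothetical sample rows, including one with a null amount.
sample_rows = [
    {"order_id": 1, "amount": 100.0},
    {"order_id": 2, "amount": None},
]
discounted = calculate_discounts(sample_rows)
print(discounted)
```

Because the function returns new dictionaries rather than mutating its input, it can be reviewed and tested on its own before anyone wires it into the pipeline.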
Let's see where I went wrong. Huh, that's interesting. I will worry about that later. So let's say that I just updated the calculate discount function. It could be that I handle null values or that I'm bringing in this field. I'm going to commit this, but in a way that explains a little bit about what I'm adding. Because this is small, I'm going to say git commit. (Thank you.) Okay. So I'm going to say git commit -m; that's just to commit with a message. I'm going to say feature, and then in parentheses I'm going to say discount. So here I'm saying that this is a feature and that it concerns the discount logic. I'm also going to say a little bit more: add discount calculation logic. And I want to do another thing here. I want to state the data impact, which I can pull from what the data analyst specified in our GitHub issue or our Jira ticket. Or I could say that this addition will not impact any revenue dashboards just yet. It's important to say that it won't impact any dashboards. You can also say: no schema updates with this commit, but a schema update may be needed later. When you do that, you put in information so that when the data analysts look at this, they can say, "Okay, it's not going to impact the revenue dashboard, because this change is only adding it to the pipeline, and I'm certain that there's no schema update here." What I also did here is go a step further and tell whoever may be reading this, i.e.
the data analyst or the data scientist, that there still may be a schema update later. Later on they may look back and say, "Okay, they said there may be a schema update; let's see if there was one so we can use this." Okay. Now that that's done... oh, I forgot to do the git add first. Sorry about that. I'm going to go back and do that here. Now that we have that committed, we are setting ourselves up to communicate what's next. Let me go back to sharing the full screen that I had before. Okay. All right. Now I've added this transformation, but I want to note that I did not add it to the pipeline. We'll do something later to add it to the pipeline. But I did add the function, so we've added this transformation and the analyst can review it at will. Now the analyst may say, "Hey, I see you added the transformation, but I don't see it in the pipeline." Sometimes analysts may say, "I can do it for you," while the data engineer had a plan to do that later, and that's okay. But what happens is that when you want to use the recursive merge, you might get yourself in a pickle where there is a change to a certain area by two or more developers, and we need to figure out how to resolve that. So what I'm going to do, and you can follow along here, is git checkout, and I am going to say that I'm doing an analysis report and I want to look at Q3. I want to branch off of the data engineer's branch. Oh, I did not do that right. But essentially, I see that they added the calculate discount function. Okay.
But they didn't add that logic to the pipeline. So I'm going to add it, so that we transform the sales and then calculate the discounts. The way I do that is to add another Airflow task. I'll create one here called calculate discount, and it will be a PythonOperator. The task ID I usually set to the same name as the variable. And I'm going to call... oh, I can't see my git screen... it's going to be this calculate discounts function. Now that I have added that, I'm actually going to make a change here. Okay. So now that I've added this transformation function, I'm going to call it after the sales transformation, and then I'm going to add it to Airflow. If we go to Airflow and look at this DAG, it is going to update. We see that we have extract sales, transform sales, discount, load sales. Okay.

Then, as an analyst, I will do some additional work. Okay, you can see... perfect. Good. I'm going to do some additional work in some Python notebooks, and I'm going to say that it's returning 500 rows. All right. So I'm going to do git add, then another git commit. I know that I was only supposed to work on analysis, but as an analyst I saw something I wanted to help out with. So for the feature, I'm going to say add discount... I have a block here, give me one second, it prevents me from seeing. Okay, better. Add discount transformation to sales pipeline. And we'll assume that if you calculate discounts, you're going to load them to a column. So this will load, let's say, a hypothetical discount column in the finance data warehouse. Okay, so that's telling the impact as well. Now I'm going to go ahead and load there. I see that we are a little bit past time.
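The transcript does not show the body of the calculate discounts function, so the following is only a sketch of what such a transformation might look like, including the null handling mentioned earlier. The field names `amount`, `discount_rate` and `discount`, and the `default_rate` parameter, are hypothetical. In an Airflow DAG, a function like this is what would be handed to a PythonOperator through its `python_callable` argument.

```python
def calculate_discounts(rows, default_rate=0.0):
    """Return new rows with a 'discount' column added.

    A missing or None discount_rate falls back to default_rate.
    A missing or None amount produces a None discount instead of raising.
    """
    result = []
    for row in rows:
        rate = row.get("discount_rate")
        if rate is None:
            rate = default_rate
        amount = row.get("amount")
        discount = None if amount is None else round(amount * rate, 2)
        result.append({**row, "discount": discount})
    return result


# Made-up sample rows, for illustration only.
sales = [
    {"order_id": 1, "amount": 100.0, "discount_rate": 0.1},
    {"order_id": 2, "amount": 50.0, "discount_rate": None},
    {"order_id": 3, "amount": None, "discount_rate": 0.2},
]
discounted = calculate_discounts(sales)
```

Returning new dictionaries rather than mutating the input keeps the step safe to re-run, which matters when a pipeline task is retried.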
Richie, is it okay if I continue, or should I let folks continue following along using the repo? I'm very sorry that we're over time.

All right. No, we can continue going for another five minutes or so, and we'll do audience questions after that. If it's going to be longer than that, then let's try and wrap it up in the next five minutes, and then we can go to audience questions.

Okay. I think that it might go a little bit longer. So I have this here for when we want to merge both pipelines. Let's say that the data engineer submits their logic, and the analyst created a branch off of that data engineer's branch and added some logic. When they do that, they will get a conflict. What's most important here is that you'll use VS Code to resolve that conflict, you'll create a PR, and things like that. I know that will take more than five minutes, so sorry about that. But what's important is that when you find that you have two points of contention, you need to add a little bit more comment. You can say that this integrates the pipeline discount: the data engineer created the transformation, and the data analyst went ahead and added that transformation to the pipeline itself. So you add as much information as you can. This here is a snapshot of what you can add. In reality I'd add a little bit more: the Jira ticket or GitHub issue number, and points of contact for other folks who are not listed as a developer attached to this commit but who may know more about it. And then you can tag it using git tag.

Some of the takeaways that I want to point out, which help with preserving the history and using git history to explain your changes, why the changes were made, and who the relevant parties are: use semantic commit messages. There are a lot of libraries out there, like Commitizen; that's one useful tool to make sure that your commits are structured in a way that tells that story. Try to preserve history by using a recursive merge strategy, and don't shy away from it. We didn't get to go through resolving that conflict, but it would be as simple as using a tool like GitKraken or VS Code. You also want to use descriptive branch names. Tag those important milestones: once a few commits fully finish a feature, tag it and say that we implemented the sales discount logic. And treat it as shared knowledge, as if you were telling a story or writing a chat message to someone who asked how to change something. That is one of the easiest ways to start breaking down a knowledge silo and bridge the gap with our counterparts. Sometimes we share the codebase, but we share this data product and these assets together. What I want to do is make sure that everyone is as informed and up to date as I can make them. Okay, I have some resources here that go over some of the conventions, and over the different merge strategies that you can use, if you want to learn more.
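To make the idea of a semantic commit message concrete, here is a small sketch that checks whether a commit header follows a "type(scope): summary" shape. It is not how Commitizen itself works; the pattern and the function name `parse_commit_header` are assumptions chosen for illustration.

```python
import re

# Lowercase type, optional scope in parentheses, then ": " and a non-empty summary.
HEADER_PATTERN = re.compile(
    r"^(?P<type>[a-z]+)(\((?P<scope>[a-z0-9_-]+)\))?: (?P<summary>\S.*)$"
)


def parse_commit_header(header):
    """Return a dict with type, scope and summary, or None if the header does not match."""
    match = HEADER_PATTERN.match(header)
    if match is None:
        return None
    return match.groupdict()


structured = parse_commit_header("feat(discount): add discount calculation logic")
unstructured = parse_commit_header("updated stuff")
```

A check like this could run before a commit is accepted, so that unstructured messages are caught while the author still remembers why the change was made.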
So git merge with a recursive merge is one of the tools, but you can also use rebase if you need to update your commits, or if you want to do some very particular squashing, merging commits together to explain something in a more condensed fashion. That's what you would learn in that course. But if you have any questions, I'm looking forward to answering them. I just want to say thank you so much for coming to listen to me and to speak with me through the chat, and thank you so much for your time today. Feel free to reach out on LinkedIn if you have more questions.

Wonderful. Thank you so much for that, Amanda. It's such powerful stuff. But yeah, trying to remember all the commands and the flow gets tricky, so it was helpful to see all this today. We've got time for a couple of questions from the audience. Before that, I've just got a question for you. I really like your idea of having commit messages as a kind of history of what you've been doing. It's like documentation for what you've been working on. How does that fit into broader software documentation? Quite often you're going to be using other tools as well to document what's going on.

Yeah. I'm going to stop screen sharing for a bit. We have a lot of tools that Git, or version control management, is built into. For instance, I had VS Code set up with the git blame tool toggled on. As you hover over the code, it'll tell you the latest commit that is relevant to each line of code. So as you are reading, you can say, okay, I see why this is here, especially when you get into some complex transformation logic.
But the other thing is that AI is now being integrated into everyone's workflow at a lot of companies, and it'll be really good for AI to actually understand the evolution, so that it can propose code that could optimize a certain part of your logic, or inform you about what happened. That is really useful for training the models. I think the ultimate goal is that you have a meta database underneath it, for future tools that are being created, whether through AI or some of the man-made processes that we have. We're building up this repository of information on the project, and when you do it in a structured way, you'll be able to analyze that data using an analytics tool. You can analyze the metadata of your project. You can build around that: I want to pull out all the feature commits, look for a keyword like sales, and then put it together and craft it like a timeline. I hope I answered your question, but those are a few points on how you can use the commits with different tools that are out there.

Okay. Yeah. So maybe it matters less which tools you're using and more that you are tracking the state of the project and its history. If you want to do project analytics to try and improve your workflow for future projects, or decide where you need to spend more or less time, then having all that information is going to be very helpful for the project managers, I think.
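The timeline idea described above (pull out the feature commits, filter on a keyword, order them by date) can be sketched in a few lines of Python. The input is assumed to be one "date|subject" line per commit, which is the shape produced by `git log --date=short --pretty=format:"%ad|%s"`. The sample log entries and their dates are invented for this example, and `feature_timeline` is a hypothetical helper.

```python
def feature_timeline(log_lines, keyword):
    """Return (date, subject) pairs for 'feat' commits mentioning keyword, oldest first."""
    entries = []
    for line in log_lines:
        date, _, subject = line.partition("|")
        if subject.startswith("feat") and keyword.lower() in subject.lower():
            entries.append((date, subject))
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    return sorted(entries)


# Invented sample data standing in for real git log output.
sample_log = [
    "2024-03-04|feat(discount): add discount transformation to sales pipeline",
    "2024-03-01|feat(discount): add discount calculation logic",
    "2024-03-02|docs: update readme",
    "2024-02-27|feat(sales): add transform sales task",
]
timeline = feature_timeline(sample_log, "sales")
for date, subject in timeline:
    print(date, subject)
```

This only works because the commit headers are structured; free-form messages would need far more guesswork to classify.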
All right, there's a question from Param asking: would it be wrong to just use git checkout <source branch> <file name>? This was towards the end; I can't remember the exact context.

I think this was when we were switching to the analyst branch. We can check out by file name as well, that's fine, but I would say that it's almost similar to cherry-picking. You can be very specific by checking out one single file. If you know that another developer, the data engineer, was working on multiple different things in that branch, then I would use that to specifically check out the file that I'm going to use and change. But you don't want to use it if you know that the file has dependencies on other files. Otherwise you'll have a very long command-line statement and get pretty lost. So it depends on the scenario.

So just to be clear on this, is the idea the difference between checking out a whole branch and checking out individual files from it?

Yes. Yep.

Okay. And in general you want the whole branch just to keep things consistent, because otherwise I guess you've got bits of different branches floating all over the place. Is that right?

No, I think it depends on the scenario. Excuse me. If you are branching off someone's working branch and you know that they have a lot of different changes, then going the route of naming the specific file you want to check out is useful. Otherwise, use the full branch. Excuse me.

Okay. All right. Since we've got two minutes left: it seems like there are lots of different ways of working together and collaborating. Are there any common mistakes you see? What do people do wrong most often?

Yeah, I'm proposing writing the history at the commit level.
And I see that a lot of folks get it wrong where they just say, I'll wait until the PR. I think that's where the knowledge is lost: you have a PR with a lot of changes that updates multiple different pieces, and you're not explaining each one of those changes and the effect that it has on each piece of your system. People probably say that the PR is the most important part of managing your project's evolution and history. But I'm saying that no, the commit is just as important. If anything, it could be a little bit more important than the PR, because you're capturing your line of thought. I also think that a lot of people get it wrong when they shy away from a recursive merge or a three-way merge strategy out of wanting to avoid handling conflicts. The trade-off is that you're losing that history; you're losing that branch context. But if you are able to use additional Git concepts to manage that conflict risk, you're able to really get the value out of it. My course goes over using interactive rebase to condense commits down before you do that recursive merge, so that it's manageable to work through those conflicts.

Okay. Yeah, I can certainly see how there's going to be a lot of desire to just put off documenting things properly at the commit level and say, okay, I'm going to write it up properly when I get to doing the pull request. But actually, you want to be on top of things. "Don't put off your documentation" needs to be repeated a lot, I think. All right. Wonderful. With that, we are at time. Before everyone dashes off, I just want to say, if you enjoyed today's session, tomorrow we've got a session on using AI agents to resolve GitHub issues. So you can be building an agent.
So if you care about automating GitHub workflows, then please do come back tomorrow. And on Thursday, we've got a session on the DataCamp quarterly roadmap, where you can find out what's just been released and what's coming soon at DataCamp. So please do come back for those sessions. Use the QR code at the top left. All right. Thank you so much, Amanda. That was great stuff. Oh, I forgot to say one more thing. DataCamp Radar: if you've not signed up for that, it's on June 26. It's one of our biggest events of the year, so please do sign up for DataCamp Radar. So, yeah. Sorry. Thank you, Amanda. That was fantastic stuff. I really enjoyed it. I hope you...

Thank you all. Thank you for having me.

Excellent. And thank you to everyone who asked a question. Thank you to everyone who showed up

Original Description

Resources (including link to GitHub repo + slides): https://bit.ly/4eg3rxv Register for this session to get the recording and resources sent to you! https://www.datacamp.com/webinars/collaborative-data-engineering-with-git Collaboration is at the heart of successful data engineering, and Git has become an essential tool for managing complex, multi-developer workflows. Yet many data practitioners are still getting up to speed with how to apply version control best practices in the context of data pipelines and analytics infrastructure. Mastering Git not only improves collaboration, but also ensures reproducibility, traceability, and long-term maintainability of data projects. In this hands-on code-along webinar, Amanda Crawford-Adamo, an experienced data engineer, will walk you through collaborative data engineering practices using Git. You’ll learn how to manage data workflows with version control, explore advanced Git techniques like branching strategies and repository organization, and dive into a real-world case study about collaborating on a data pipeline. This session is perfect for data and software engineers looking to improve the way they build, share, and scale data infrastructure.

This video teaches collaborative data engineering with Git, covering best practices for version control, branching, and merging, as well as documentation and conflict resolution. By following these practices, data engineers can improve collaboration, reduce errors, and increase efficiency.

Key Takeaways

Create a Git repository for collaborative data engineering
Use feature branches for data features
Document changes with commit messages
Merge branches using recursive merge strategy
Use Airflow for data pipeline management
Integrate AI into workflow for project analysis
Use VS Code for conflict resolution and documentation
Build a meta database for future tools and analysis
Analyze project metadata for insights and decision-making

💡 Using Git for collaborative data engineering can improve collaboration, reduce errors, and increase efficiency by providing a clear history of changes and allowing for easy conflict resolution.
