Quick git for beginners… yeah, that’s me!

Using a collaborative git workflow

Since this was originally written for biostatistics students at the University of Miami, some of the steps will be referring to options as they exist using RStudio as the IDE. While I use RStudio and VS Code, slight adaptations to these instructions will need to be made by you in order to fit your IDE’s layout and options. 😊

When collaborating with a team (or even when creating and using personal data science project), these steps can help avoid potential conflicts and other issues that arise while using git. Please keep in mind that this list isn’t expansive nor does it get into nuanced details. I’ll leave those technical bits for instructionals that are better suited to handle complicated matters. What this post is, though, is a starter guide sprinkled with bits of anecdotal suggestions that I hope can keep you focused on the task at hand: being an awesome data scientist!

Navigate to your project in the IDE of your choice (RStudio, Posit Cloud, etc.)
Check that you are on the main branch by inspecting the git pane in RStudio or typing git branch into the terminal. For peace of mind, and to save complexity, the remainder of this post will not be referencing terminal commands. To learn more, see the git documentation for fantastic instructions.
Pull from the repository into the main branch. This ensures that you have the latest updates to the main version of the project. Sometimes you may not have been informed of changes added to main since your last time using the project. Pulling into main at the start and before working on anything can help avoid issues. Picture a tree– the main branch is the trunk and there are one or more offshoot branches that make the tree. Our git tree (the project) is slightly different though in that sometimes our branches merge into the trunk as the tree ages making the tree better; some branches are bad or we find that they serve no greater purpose and need to be pruned so the tree doesn’t die. Keep the tree strong and don’t hack away at the main branch.
IF you are creating a new feature, create a new branch. For example, if I were to write a new script to clean a dataset, I could create a new branch named data-clean-kg to tell other members what I’m doing. I like to add my initials at the end of the branch name to help my team members know who did what. It’s a convention I use, but adopt your own. Keep in mind that it’s best to make a branch for one feature, work on that feature, then continue these steps. It’s not always recommended to make a branch and work on multiple aspects of the project on the same branch. Back to the tree analogy, if too much weight rests on one branch (too many features relying on one branch), when the branch breaks so does the tree. Remember, these are suggestions and not hard-and-fast rules.
One feature per branch. This deserves its own line. Again, a feature doesn’t mean one script file. A feature is a new or edited aspect of the current project. Sometimes a new file must be sourced into another, so you can’t help but rendering changes to more than one piece of the overall project. What I’m referring to is this: in BST692 at the University of Miami, you’ll be constructing many different types of machine learning models… don’t write your logistic model on the same branch as your LASSO and KNN and random forest. Make sense? Instead make a branch lasso-kg or knn-kg or get-logistic-metrics-kg.
stage & commit your changes and add a short but descriptive message. I tend to write commit messages like these: initial commit: data cleaning script or edit: added new dataset and joined with original or feat: added Shiny dropdown (with “feat” being feature). Short and descriptive messages are helpful and your future self (or team members) will thank you if the project needs to roll back to a working state. Things break, mistakes happen, but documenting along the way will quickly get your project up and running again. So, write some code, stage and commit with a message, and get into the zen coding flow state.
push your commit to the project repository. Get into the habit of pushing your commits from your branch to the project repository. This will not add them to the main branch if you are working on a remote branch… as you should be if you completed Step 4. Do this before lunch, do it before that meeting, do it while you find your brain drift away from what you’re working on, but do it before you leave the office for the day! The beauty of using a tool like GitHub is that I’ll be upset if I brick my computer and lose my hard drive, but not as upset if my monthslong machine learning project can be restored onto another machine from the cloud.
Open a pull request in GitHub and add team members to review. Those reviewers will see your changes, have the ability to approve or add comments or both, and provide constructive feedback. This is done openly so others can see what’s happening within the project. For example, I request reviews from the main authors when working on the rUM package and we provide feedback if a new edit or feature is good, needs improvement, or is a bad idea. Finally, when another collaborator approves…
YOU MADE THE BRANCH, YOU MERGE THE CHANGES & DELETE THE BRANCH! You should not merge changes that you did not personally create and you definitely should not delete a branch you did not create. This is very important for collaborative workflows. I’ve had my proverbial hand slapped for approving a pull request, merging it into the main project, and deleting the branch… all from someone else’s work. This should always be regarded as a big “No, don’t do it!!!” The feature or branch’s author should be the only one to literally push the buttons to merge (again, once approved) and delete the branch.

I hope this little guide was helpful and provides insight on how to work as a team. I’m a beginner and always find ways to improve my workflow. I tend to test things on my own before doing them in the collaborative setting. Working with other people affords the opportunity to communicate effectively and efficiently. Use those same concepts when working with git: write commit messages that aren’t update done and rather edit: added forest plot for subset predictors. Lastly, this post will probably be updated as I find better ways. If you have some, please share: kylegrealis@icloud.com

Happy coding!

~Kyle