AI models are trained on public data. This is a fairly well-known point, but did you realize that companies like Microsoft, which owns GitHub, have been using your private data to train AI models as well?
It’s true. LLMs are fed huge datasets from various sources (i.e., the internet). For a long time, people have been asking where that data comes from, and AI companies have been giving vague responses.
Well, GitHub’s data policies allow them to use any publicly available repo for training AI models. It’s likely not the end of the world, but it just doesn’t sit well with me.
Add to that the fact that they’ve had a ton of outages recently and are going to be gating features behind a paywall sooner or later, and I just decided to move away.
The whole point of a remote Git server is to have a backup of your code that you can access anytime and to allow collaboration. I’m exploring Forgejo as a private Git server option, but for now, since this site is built on Astro, it was easier to use GitLab.
The nice thing about GitLab is that they have a generous free tier and their privacy documentation says they won’t use your data to train LLMs. That might change, but for now we’re good.
Also, GitLab lets you set up a private server and still use their services, so it’s kind of the best of both worlds if you want to go that route.
Either way, I’m moving away from GitHub. Eventually I’d like to have a home lab setup so I can just manage my data privately, but until I get the knowledge and materials to do it, I’ll use the new GitLab setup.