The day it broke away and became centralized was when we had a PR + mandatory "Required actions" to merge to main.
https://reticulum.network/manual/git.html#mirroring-reposito...
Just set up a Kubernetes deployment and you’re set.
But as others mention, GitHub’s primary strength is collaboration. If you want decentralized, solve this by creating a decentralized collaboration tool on top of fossil and/or git.
For example, how to do pull requests and code reviews?
Gosh, it's hard figuring out what changes Lorne made if only we had a system to merge those changes. Enter git
Gosh it's hard figuring out what packages Rachel had to make this work. Enter rubygems/pip/npm
Gosh it's hard figuring out how to sync these changes across a network. Enter github
Gosh it's hard figuring out how to get those packages working on my operating system. Enter docker
Gosh centralizing our distributed version control software system onto one website is getting really unreliable. Enter fossil(?????)
If we go any further having one computer per business with a sign-up sheet is starting to sound pretty fucking attractive.
being a host for git repositories has never been its core competency. neither has its groupware offering.
does it even serve OSS well? a very interesting criterion is, "Have mature or adopted end-user-facing OSS recently merged a large PR from an unallied contributor?" The answer is an overwhelming no. This is why there is so much innovation in this space.
Proudly self-hosting Forgejo since then.
> Our team is currently experiencing an unexpectedly high volume of tickets which has resulted in longer response times than we prefer. We acknowledge the long wait and apologize for the experience.
> Sometimes our abuse detecting systems highlight accounts that need to be manually reviewed. We've cleared the restrictions from your account…
Fully self-hosted IMO can be an overcorrection. The issue isn’t “relying on other people”—it’s relying on GitHub, when they’ve made it clear they don’t care about uptime and they don’t care about support turn-around-time.
It would be a pain as I'd have to set up a few integrations again, but github is far lower down the risk scale than the vast majority of SAAS providers
I hope people here are aware that you can push your repo somewhere else if wanted.
Git is a distributed system, there isn't even a server, only other git repo instances that are remote.
Is it true that official service status pages are updated automatically?
Depends. Typically no because there’s an art to crafting the actual message around impact… but sometimes yes it is automated
If the first they hear of an outage is when user requests start to fail, then that's a failure in their monitoring as well.
But effective monitoring is harder than people assume.
Isn't that what monitoring actually is? The issue seems to be in their testing, not monitoring.
There are synthetic tests, where you can generate API request calls or even simulate an entire user journey. These allow you to control the user agent and the payloads, so you know that any errors that come back are actual errors. These are triggered by the observability platform (think like running a cron-job) and thus you're not tied to user activity to see when problems arise.
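To make the synthetic test idea concrete, here is a minimal Python sketch of a single probe. It is only an illustration under assumptions: the names `run_probe`, `http_get_status`, `ProbeResult` and the `synthetic-probe/0.1` user agent are made up for this example, and a real observability platform would schedule this for you rather than have you write it.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    ok: bool          # True when the status looks healthy
    status: int       # HTTP status, or 0 when no response was received
    latency_s: float  # wall-clock time spent on the request


def http_get_status(url: str, user_agent: str = "synthetic-probe/0.1",
                    timeout: float = 5.0) -> int:
    """Issue a GET with a user agent we control and return the HTTP status."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses arrive as exceptions; the code is still the status.
        return exc.code


def run_probe(url: str,
              fetch: Callable[[str], int] = http_get_status,
              clock: Callable[[], float] = time.monotonic) -> ProbeResult:
    """Run one synthetic check; network failures are reported as status 0."""
    start = clock()
    try:
        status = fetch(url)
    except OSError:  # URLError and socket timeouts are OSError subclasses
        status = 0
    latency = clock() - start
    return ProbeResult(ok=200 <= status < 400, status=status, latency_s=latency)
```

Because the probe owns the request, a failing result is a real signal rather than a user typo, and it can be run from a scheduler at whatever interval suits the endpoint.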
There are other metrics outside of HTTP response codes too. Think like free RAM, CPU usage, disk space, etc. This is just naming some obvious ones because these types of metrics are generally bespoke to the type of application you're monitoring. And with these types of monitors, you'd not just have an alert when things have failed, but ideally have alerts when an irregular trend is showing that things are likely to fail too. This latter type of monitor helps you get ahead of the problem before it becomes customer facing.
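As a rough illustration of alerting on a trend rather than on a failure, the sketch below fits a straight line to recent disk-usage samples and estimates how long until the disk is full. The function names, the 100% capacity and the 24-hour horizon are assumptions for the example; real systems often use more robust forecasting than a plain least-squares line.

```python
from typing import Optional, Sequence, Tuple


def hours_until_full(samples: Sequence[Tuple[float, float]],
                     capacity: float = 100.0) -> Optional[float]:
    """samples are (hours, percent_used) pairs, oldest first.

    Returns the estimated hours from the last sample until usage reaches
    capacity, or None when usage is flat, shrinking, or under-sampled.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        return None
    # Ordinary least-squares slope: percent of disk consumed per hour.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / denom
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - samples[-1][0])


def should_warn(samples: Sequence[Tuple[float, float]],
                horizon_hours: float = 24.0) -> bool:
    """Warn when the fitted trend reaches capacity within the horizon."""
    eta = hours_until_full(samples)
    return eta is not None and eta <= horizon_hours
```

The point of the design is that the alert fires while the service is still healthy, which is what lets you act before customers notice.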
Then you have more traditional stuff like logs. This will also be bespoke to the application. But you'd expect errors in logs to get surfaced quickly. Assuming Github have good hygiene in what's being logged.
Tie that up with APMs, RUM, and other goodies like that and you'll have diagnostics to investigate issues when they appear.
(this is just a super high level view of observability too)
You should not alert on cpu, ram, etc
It doesn't "need" that. That's just how most people set it up because it’s an easy sane default that allows for network jitter without inexperienced engineers thinking about different conditions triggering different types of responses.
If you’re measuring internal APIs from an observability solution that has nodes already inside your network enclave, then there is a strong argument for alerting early.
> You should not alert on cpu, ram, etc
That’s not true to say as an absolute statement. As a generalisation, it heavily depends on the system you're monitoring and how it behaves under pressure.
But in any case, I wasn’t suggesting CPU alerts were the end goal. I said:
> these types of metrics are generally bespoke to the type of application you're monitoring.
Ie you’ll use metrics but those metrics will be highly specific.
The CPU examples were an illustration as to what a “metric” is (it might seem obvious but not everyone is an expert) but the point was HTTP response codes aren't the only types of metrics one should be capturing and watching.
If your requests are fast and cheap, you can probe frequently relative to your goals, but often that's not really possible (think, long SQL queries, or scheduling a container/pod). There you need several datapoints, or possibly fewer augmented with other signals.
Talking about long SQL queries, I quite like throwing CPU alerts on database servers. They'll be a low priority alert (ie no out of hours "pagers") so just something that goes into a slack channel. But they're a good indicator of when developers have poorly optimized SQL, or the DB schema is poorly defined (eg missing indexes), or the DB server itself is poorly sized.
This wouldn't be something you'd expect to need in production and definitely not something you'd rely on as a notice of a production outage. But it is an example of one of those 1% occasions where a CPU alert does add value to the overall observability of the application.
But this also ties into your excellent point about how you'd use CPU and other data points to build a picture of what's happening in your application.
idle CPU is often wasted CPU
Who says public status page equals internal monitoring.
They likely know faster than you. Whether they post it publicly is a different issue (hint: SLA penalties, news impacting stock etc)
Are you sure you’re replying to the right comment?
For context, the parent comment you replied to started with status page.
Then are you talking about internal leaks or just guessing? Otherwise besides what's public how do you know they don't know?
Someone then replied about how it takes a bunch of HTTP response errors for problems to be alerted and thus I commented that application observability would consist of more than just waiting for users to hit errors.
Maybe the Github Actions infrastructure isn't run like that.
edit: my oncall rotation notified on all 500s, 24/7, not just rates - https://news.ycombinator.com/item?id=48279262
Recently there was this: https://news.ycombinator.com/item?id=47252971 "10% of Firefox crashes are caused by bitflips"
Which makes me think a small number of random issues, which happen even though nothing is broken, is normal everywhere. Especially once you move things around on a network, there's potential for a lot more random errors.
It does require constant tuning and adjustment though.
This is why data hoarders who have NASes with lots of space insist on running their servers with ECC RAM despite it being significantly more expensive. With ECC, bit flips, for all intents and purposes, cannot happen: the RAM itself detects and corrects them.
I wouldn't expect bit flips to be a significant contributor to enterprise problems.
If your network goes down because of a DDOS, or part of your system overheating, that's an internal issue you had control over.
If a bit flips because of cosmic radiation, you can't really do anything about that, and it's utterly unpredictable. That's "random" to me.
I know all of Gmail, every GCE service I can think of, every AWS service I can think of, Amazon.com, Netflix, and Github all do not page on just a single 500.
I know none of those are particularly "high performance" though. Curious where your experience is coming from.
I had a fairly long tenure, where I maintained multiple key services in critical online payments flow. Authentication, authorization, core business and risk data, as well as some cross-cutting control plane stuff, etc. You needed one or more of our services to take a payment, serve any request from the employee dashboard - pretty much everything hit our services. The entire company ground to a halt without my team.
We paged for every single 500. In instances where a particular class of 500 was spurious or not worth fixing, we would leave it acked or mark it as noise. But typically we'd just put in a fix as soon as possible so we didn't page.
Our graceful shutdown and traffic shaping stack was great, but occasionally we'd get a few pages during deploys or failovers.
Oncall was typically not bad, but when it did get bad it was terrible. I've been involved in huge outages that cost hundreds of millions of dollars. Usually it was the fault of multiple teams having compounding runaway failures rather than one service or bug in particular.
It's inexcusable to have a customer's payments not go through. We engineered around resilience. We had strict five nines SLAs and p99 targets and evaluated our adherence with even the smallest partial outage. Hundreds of other services depended on ours, and downstream impacts were huge, so we had to keep a tight ship.
We didn't have "business hours"-only paging either as our platform was available globally, including a heavy install base in Asia.
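For a sense of scale on the five nines target mentioned above, here is the arithmetic as a small Python sketch. The function name is made up for illustration and it assumes a plain 365-day year with no maintenance-window exclusions, which a real SLA may define differently.

```python
def downtime_budget_seconds(nines: int, period_days: float = 365.0) -> float:
    """Allowed downtime for an availability target with the given number of nines.

    nines=5 means 99.999% availability, so 1 part in 10**5 may be downtime.
    """
    period_s = period_days * 24 * 60 * 60
    return period_s / (10 ** nines)


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        minutes = downtime_budget_seconds(n) / 60
        print(f"{n} nines: {minutes:.2f} minutes of downtime per year")
```

Five nines works out to a little over five minutes per year, which is why even the smallest partial outage gets counted against the budget.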
Assuming the existence of some kind of network (with zero guarantee of 100% reliability), how does this work in practice? Is each 500 treated as an event that needs investigation, even if the result of that would end up as 'a router dropped something from an internal buffer but the transaction as a whole was re-tried by a parent so the service itself recovered'?
Even if it's "DB in datacenter I tried to save to was hit by meteor" event, you can cater for this not to result in 500 (ie - DB unreachable, retry in a couple of minutes); the question is if you want to.
If my DB health check endpoint is returning 500s for N consecutive checks over M minutes, yeah, please wake me up at 3am!
If one user hit a weird edge case in form validation and got a one-off 500, please don't! We can fix that on Monday.
Not always easy to distinguish those clearly or configure those business hours rules, but for my team at https://heyoncall.com/ that is the goal -- otherwise your team burns out fast. Waking up someone at 3am has a real cost, so you better be sure it's worth it.
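One way to read the "N consecutive checks over M minutes" rule is sketched below in Python. This is an interpretation, not a description of any product: the class name `StreakPager` and its parameters are hypothetical, and it assumes a page should fire only when the failure streak is both long enough and has lasted long enough.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreakPager:
    min_failures: int      # N: consecutive failing checks required
    min_duration_s: float  # M, in seconds: how long the streak must have lasted
    _streak: int = 0
    _streak_start: Optional[float] = None

    def observe(self, now_s: float, status: int) -> bool:
        """Record one health check result; return True when someone should be paged."""
        if status < 500:
            # Any non-5xx response ends the streak.
            self._streak = 0
            self._streak_start = None
            return False
        if self._streak == 0:
            self._streak_start = now_s
        self._streak += 1
        return (self._streak >= self.min_failures
                and now_s - self._streak_start >= self.min_duration_s)
```

With `StreakPager(min_failures=3, min_duration_s=120)`, a single 500 never pages, while three failing checks spread over two minutes do. One-off errors can still be logged and looked at on Monday.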
As others have said, follow-the-sun type models do exist, usually staffed by people in their normal working hours (EMEA, Americas, APAC) but this means you've still got to cover the weekend and public holidays (which there are a lot of when you factor in plenty of different countries).
Where you need a quick response you can have a core ops/noc team that looks at things with lower thresholds and shorter windows, and their job is to do the initial triage and then page the appropriate team earlier than they would have been alerted by their own alert thresholds/monitoring.
Actually clicking the button to change the status on a public status page is a whole different topic that becomes very political in certain companies.
But if it is synthetic queries sent from the monitoring platform, then you control the user agent, payload, and endpoints. So any failed requests are a symptom of a misconfiguration and/or failure that should be investigated. Albeit not necessarily as a P1 priority.
I'm sure you're not in ops. Or in a dev org of a service with decent request rates.
What you're asking for is a service to fail silently. There's no way for a service with a decent request rate to have zero 500s. Not when it still sees development.
A 50 year old bank API? Maybe...
Is it more so that managers who aren't using the service have something to link to, a pretty bar to look at, and feel like they are "doing something"? Or is it more of a way to save you from having to confirm what you already suspect to be true? E.g. "Huh. Me and Jim are seeing problems. How about you Tom? Oh wait, crud. The service page is confirming it's down now. Never mind! Who wants coffee?!"
No, it's not. Official updates = potential SLA penalties. Always requires approval.
There's a threshold. It shows only once 1000 users complain.
/i
Can you sue companies for inducing such anxiety?
but I suppose that there might be some terms and conditions for using github (ahem Microsoft) saying that you probably cannot sue them for something like this.
It really depends upon the severity of situation (imo)
For example, if a person had any heart condition and they got so stressed because of an error at github (which to be fair, I can understand the stress part, imagine losing some part of your software because it was on github and the amount of direct damage to livelihood if your income depended on it)
and I think that the judge might have to be in just the right technical know-spot as well and someone who can understand the situation from programmer's perspective hopefully.
Then I can see a case being made.
once again not a lawyer but an interesting question, would love reading other replies to your comment.
also for what it's worth, you can sue any company for X, Y or Z. The question worth asking is whether you can win such a lawsuit.
Personally I believe it might be hard but not impossible, though for all practical purposes it might as well be. The only answer can probably be found in court; I am just guessing at this point.
I vibe coded a script that interacts with both Gitlab and Github via their APIs and I've been using it pretty heavily since this morning. I crossed the streams! Goodness, I didn't know it would be _this_ bad!
spooky action at a distance
- So many super-heroes/super-villains
... You're off the hook this time./s
We can't be blocked here. Seems silly that we settled on this, but for many years GitHub had been reliable enough. Things have been sliding down the pan as of late.
Been burned too many times on that one.
Move to EC2.
Darn AWS is down.
Alright, run it on a Mac Mini in your basement. Ahh darn, your ISP is having issues. Good thing you have a backup 5G hotspot.
Ohh no, the power is out.
Eventually you have to trust someone else.
GitHub is a tragedy of the Commons. Too many people are using it, and Microsoft isn't willing to handle it correctly.
Feels like a very good business opportunity. Minimum 50k yearly contracts, GitHub with actual uptime. GitPro ?
Aggregate risk is too high.
This is supposed to be Hacker News! Who is coming up with a startup to fill the gap !
You should never entirely depend on a third party service to run your tests, either.
make test
Should work without CI
On my repo the jobs do not get scheduled on the PRs at all, so I assume that separation wouldn't help for today's issue.
Wait until they charge you for self-hosted runners.
Oh wait. They already tried.
You can now hire me as an overpriced consultant instead of paying Microsoft.
The latest language models have enabled this sort of thing for me. I can integrate a mini Jenkins into every project within a 5-10 minute prompting session. This sort of code isn't hard. It's just tedious, and the LLMs absolutely rock at boring repetitive stuff. Having a win32 service start up successfully on the very first try is something I haven't experienced until 2026.
I agree in a hosted+shared SQL scenario you have to be a little bit more careful with all of this. Arguably, you should have a separate schema management phase in these cases.
But if you are just using SQLite embedded in the service, you can use the user_version pragma to track schema version and perform deterministic migrations (assuming a user didn't manually jack with the file in-between).
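A minimal sketch of the user_version approach using Python's standard sqlite3 module. The `jobs` table and the two migration statements are invented for illustration; the pattern is only that each migration step and its version bump commit together, so a crash mid-upgrade leaves the file at a known version.

```python
import sqlite3

# Ordered list of schema steps; index i upgrades version i to version i + 1.
# The table and column are hypothetical examples.
MIGRATIONS = [
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    "ALTER TABLE jobs ADD COLUMN status TEXT NOT NULL DEFAULT 'queued'",
]


def migrate(conn: sqlite3.Connection) -> int:
    """Apply any pending migrations and return the resulting schema version."""
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    if current > len(MIGRATIONS):
        raise RuntimeError("database schema is newer than this build")
    for version in range(current, len(MIGRATIONS)):
        conn.execute("BEGIN")
        try:
            conn.execute(MIGRATIONS[version])
            # PRAGMA does not accept bound parameters; the value is an int we control.
            conn.execute(f"PRAGMA user_version = {version + 1}")
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")
            raise
    return conn.execute("PRAGMA user_version").fetchone()[0]


def open_db(path: str) -> sqlite3.Connection:
    """Open the embedded database and bring its schema up to date."""
    # isolation_level=None leaves transaction control to the explicit BEGIN/COMMIT above.
    conn = sqlite3.connect(path, isolation_level=None)
    migrate(conn)
    return conn
```

Calling `open_db` at service start is enough: a fresh file runs every step, an up-to-date file runs none, and a file from a newer build is rejected instead of being silently downgraded.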
"Update something in the cloud" <- What do you mean?
That only works on extremely simple setups and has risks. If you have only a single server, you can stall it. Now, how to roll back?
https://www.reddit.com/r/GithubCopilot/comments/1toa9tf/mode...
So why are Actions so unreliable anyway? Occam's Razor would probably suggest the domain is inherently complex/difficult; but other providers show that reliability is possible. What would Occam's Razor suggest next? Poor management..?
You’d need at least some hash of sources + test results, and check that it matches that (in CI).
And you’d still deal with environment differences.
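The "hash of sources + test results" idea could look roughly like the Python sketch below. Everything here is hypothetical (the function names, the `*.py` pattern, the one-line attestation format), and it does nothing about the environment differences mentioned above: it only proves that the recorded pass refers to exactly the sources CI sees.

```python
import hashlib
from pathlib import Path
from typing import Sequence


def source_digest(root: str, patterns: Sequence[str] = ("*.py",)) -> str:
    """SHA-256 over the relative path and content of every matching file."""
    base = Path(root)
    files = sorted({p for pat in patterns for p in base.rglob(pat) if p.is_file()})
    digest = hashlib.sha256()
    for path in files:
        rel = path.relative_to(base).as_posix().encode()
        data = path.read_bytes()
        # Length prefixes keep (name, content) boundaries unambiguous.
        digest.update(len(rel).to_bytes(8, "big") + rel)
        digest.update(len(data).to_bytes(8, "big") + data)
    return digest.hexdigest()


def attestation(root: str, tests_passed: bool) -> str:
    """What the developer records locally after running the test suite."""
    return f"{source_digest(root)} {'pass' if tests_passed else 'fail'}"


def ci_accepts(root: str, recorded: str) -> bool:
    """CI recomputes the digest and only accepts a matching 'pass' record."""
    return recorded == attestation(root, True)
```

A stale record fails the check as soon as any tracked file changes, so the only thing CI has to do is hash the tree, which is far cheaper than rerunning the suite.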
Reasonable concern. In ~10 years of indie development, I haven't forgotten to run tests before pushing to main, ever. So setting up and maintaining complicated machinery to solve a problem that could happen (but never has) doesn't justify taking focus off other more important things, namely building.
The benefit probably increases with team size (I'm a team of 1, so I appreciate the luxury of being able to dodge CI/CD entirely).
Say a disaster happens and someone pushes to main without running tests, 9 times out of 10 it will be of ~zero consequence (either the code works first time, it was a cosmetic change that hardly affected users etc).
I know there are horror stories and CI/CD would have prevented some of those, but IME they're just not that common nor severe for small operations, and even when they happen, only a small subset are irreversible/unfixable.
Basically, what you are suggesting is that everyone advertises their tests/builds run on slack? Also when two devs merge their changes, who compiles/tests the master branch?
For small teams it could be as simple as everyone agreeing to ensure tests pass on main before pushing to prod.
Anyway. Forgejo's response to it: https://floss.social/@forgejo/116494295922963052
(Ofc, in a sensible universe, we just brush that off to a JS/Firefox glitch or my ISP.)
And yet, here I am. My code is not compiling, my AI isn't vibing, nonetheless I can't work! Two more hours before I can get off!
For Git, all you technically need is ssh access and some backup strategy for your server. It would be bare bones but workable. And there are of course plenty of OSS things that are a lot nicer than that.
I'm still using gh and gh actions and we are mostly below the freemium layer with that. But it is kind of slow and honestly a dedicated vm plus some high CPU/memory workers we can spin up on a need to have basis might be a lot faster. With GH outages becoming more common, my hand might be forced a bit.
In recent weeks, I've spun up listmonk (mailing list solution), matrix (as a slack alternative), and a few other things specific to our software stack. A github alternative would be more of the same. We don't need a lot.
The main objection is that with more moving parts to worry about, the workload for me also increases. Things need updating, monitoring, backups, alerting (and responding to alerts), etc. That sucks up my time and that is scarce.
Another reason for self hosting these days is that with agentic AI tools, self hosted things are a lot easier to integrate into agentic systems. If it is self hosted, you don't have to worry about API limitations, rate limitations, walled gardens, etc. All the traditional SAAS silos are becoming a problem from that point of view. The more locked down it is, the bigger the motive for moving away from it. That's why we ditched Slack for Matrix. Slack is hopelessly locked down and tedious to deal with. Matrix is super easy for this.
Technically Dropbox is just rsync.
Also https://xkcd.com/1319/ but for maintenance.
I don't think vibecoding at Github has much to do with it.
That makes sense. Thank you!
I don’t buy the excuse. I want to hitch my wagon to those “mysteriously lucky” competitors. (And have. And haven’t had similar issues to Github, since.)
Tough to say as this is all speculative, though.
Think critically.
agentic "ai" is going great
That being said, there was a noticeable trend starting around 2022.[2] They’ve also been doing a big migration to Azure. It’s likely a combination of things.
1: https://www.cnbc.com/2025/04/29/satya-nadella-says-as-much-a...
When I dug in to the latest outages, they were almost all in small, newer features like all the AI stuff. The actual core GitHub platform seems much more stable than the unofficial uptime trackers suggest.
For instance, the UI at setups such as https://git.devuan.org/Daemonratte/gtk2-ng is quite ok-ish, in my opinion. Granted, it is mostly copy/paste from github but that still is about 1000000x better than sourceforge's interface - and gitlab's UI too (I just hate gitlab's UI, they seem to love complexity and a billion features only 0.000001% ever need; GitHub, with all its faults, is for the most part really simple - not everywhere, e. g. GitHub wiki setup sucks, but by and large I think it is simple overall).
No, it's not like "act", because it uses the standard Github runner. The difference is that the control plane is an emulation of api.github.com, and because of this we can do all kinds of nice things:
Caching in ~0 ms. Pause on failure, so you can let your AI agent fix it and retry without pushing.
Is what it boils down to.
> codex "Fix this pipeline, use `act` to verify your changes"
I have tried to use act many times, and many times I've failed.
P.S. pause on failure is also helpful for humans, but I'm trying to be realistic about where the future of programming is going...
I like that it exists, but what a freaking mess that it's necessary and so difficult to do.
I started playing with proxmox VMs and containers in them (docker and tart) to see if I can build some local infrastructure to properly solve this…
The jobs run in containers.
source: voices in my head. Not affiliated with MSFT.. anymore.
We're now considering Buildkite (apparently they have a GH actions migration tool) or self hosting something (GitLab CI, maybe even Jenkins), as it looks like that would've kept ticking over since we're still seeing webhooks being triggered today during the downtime.
I used to use Cirrus CI as an alternative to GitHub Actions and am looking for a new alternative. I wonder if Depot could fit in the same way for my needs. I need to run builds and tests in Windows, Linux and macOS.
Hope you don't mind the public ask, it seems useful for others.
If we're using depot runners, and want to use them directly, or move off of github actions being the controller for when things run: what do you suggest?
Trigger the workflows directly on depot via CLI?
We’d need more details around what you’re seeing. It is true that if auth across GitHub is broken then we can’t copy your actions out to be used by Depot CI. However, we have a solution in the works for that as well.
In short, Depot CI, our own engine and control plane, is not dependent on the upstream actions control plane. But it still has to listen for commit events to know if/when to run jobs on things like PRs. This too is being removed in the future.
https://www.blacksmith.sh/ and https://runs-on.com/
They also say that they're much cheaper than github
It is relatively easy to scale a collection of simple things to extreme and exhibit complex behavior together. It is a lot harder to scale something complex to extreme. But too many times the latter is the default - designed wrong from the ground up and stuck in scaling hell.
If Google owned GitHub would they be better positioned to scale?
I much prefer Woodpecker CI, which is an open source fork of Drone.io. It supports multiple Git backends like GitHub, Gitea, Forgejo, Gitlab, Bitbucket. It supports running jobs locally, on Docker, and on Kubernetes. And there's autoscalers built in for AWS, Hetzner, Linode, Vultr, and Scaleway. There's a bunch of 3rd party plugins (https://woodpecker-ci.org/plugins) for custom integrations. The UX is also very simple, with OAuth used not only for authentication/authorization but also setting up & accessing repos. The system architecture is great, with separate components that run stateless connected to a database, and a custom plugin is any program that takes environment variables and does stdio. The config file is a good balance of ugly YAML and convenience syntax like shell-style parameter expansion variables.
It probably takes less than 15 minutes to install, set up, and run WoodpeckerCI for a small team, so it's not a big investment to try out or host. With the autoscaling plugins it lets you scale your workload up to whatever size. Honestly you could run it on a laptop since it's written in Go.
(to clarify for beginners: the config file docs are found in a section called "workflow syntax" (https://woodpecker-ci.org/docs/usage/workflow-syntax) and variable parameter expansion is buried deep in an environment variables page called "string operations" (https://woodpecker-ci.org/docs/usage/environment#string-oper...). poorly organized docs aside, the system itself works well)
Jesus, that's both horrible and seems within reach.
The external page linked above goes the other extreme and considers it a bad status whenever any individual service is degraded.
In reality the majority of people only use 3 or 4 of the core services the majority of the time but since there's no "core services" SLA/uptime the usability of github for the majority of people is slightly obfuscated.
From their FAQs[0]:
> Codeberg's mission is to promote free/libre software. Keeping software private is obviously not our primary use case, but we acknowledge that private repositories are useful or necessary at times.
Perhaps not 100% physically shared infra but there are references to architecture overlap such as "The GitHub-hosted runner application is a fork of the Azure Pipelines Agent."
https://docs.github.com/en/actions/concepts/runners/github-h...
https://github.com/actions/runner-images#about
A few threads where blips have affected both services.
https://news.ycombinator.com/item?id=42781922
Setting it all up would have been tediously annoying eight months ago (Buildkite requires setting up GitHub webhooks for each repo).
Last week I just had codex set up everything, ephemeral vm runners and all, using a couple of low-spec refurb mac minis, Buildkite’s API, a short-lived API token, and migrate my repositories one by one.
So far so good, it’ll pay for itself within two to three months, and following today’s outage I suggested at work that we experiment with the same set up.
They’re considering it.
GitHub was, once upon a time, quite stable. Things have changed: more features, more usage, and automated agents.
"Microsoft’s GitHub was positioned to win the AI coding race. Outages got in the way" - https://www.cnbc.com/2026/05/22/microsoft-was-positioned-to-...
Even though it's selfhosted and we don't have a dedicated infrastructure team, I don't remember it ever being down in the last 12 years I have been working here.
Something’s wrong when my own infrastructure is more reliable than Microsoft’s.
EDIT: sorry i meant this rant at the one complaining for the free service not for the paid customers (which is unacceptable)
Technically this one was earlier but the other one has more traction.
"Well. It's got a nine in it"
"What percentage??"
"Nine"
Thanks for pointing out that nobody is using that thing
- GitHub
- Hiring budgets
- RAM (/personal computing in general)
- Electricity
- Media/Content
- Truth
Reminds me of the occasional “JavaScript developer tries to vibe debug a Linux kernel issue” comments we get here.
The open source contribution model as we once knew it is dead; you're not going to accept patches from random agents. The risk is way too high. And you can see that increasingly "AI Slop" makes it difficult to be a maintainer of any semblance of a popular repo.
So what's the value? A durable place to store work? hah.
Discovery? That part of Github has always been shitty.
So that leaves.. Github Actions? The thing that is down every other day and has been the subject of a few ~rug pulls~/attempted price hikes that are almost surely coming back?
This is a conservative estimate assuming linear growth, the actual number is likely going to be higher. Much higher.
It's not too hard to grow 14X YoY if you start from a hundred customers. If you have hundreds of millions? Yeah, not so easy.
I like being able to vote with my (team's) wallet and I'm tired of staying out of convenience
Or maybe it's before the GitHub internal devs are online and deploying changes.
We have already seen this in the last few weeks, but now this has become a meme that keeps on giving. GitHub down! GitHub up again. GitHub Down! GitHub ... ...
Perfect timing that we post https://www.jxd.dev/writing/building-plain just as this latest incident started.
I've done some hacky shit in CI scripts, but none made me more mad than that one.
With all the recent negativity – how are they not even TRYING to fix the damn thing?
Self hosted Gitlab with self hosted (or AWS) runners running your pipelines.. We only use Github as a mirror for our public repositories.
I am trying to refrain from my "off topic" rants... but this Microsoft GitHub abuse is generating so much hate, given their dominant market position, that it is hard.
This is why we don't use Github Actions, kids.
Seriously, it's a proprietary build service that puts the keys to the kingdom in someone else's control. Just: No!
Print this status page to PDF so you've got it handy next time someone castigates you for not using Github Actions, folks.
Today it was caused by friendly fire: the automatic suspension of the GitHub Actions bot, which is now a "Ghost" user. Since there is no CEO of GitHub to contact, we are just going to see more of this [1].
You might need to push a critical change soon, but now you cannot. You won't get any of these issues if you self hosted as I said 6 years ago...[2]
[0] https://www.githubstatus.com/incidents/g6ffrm0rfvz9
I don't want to delve into it any further - but something about it seems incongruous. It's not spam it's submarine marketing.
Apologies for the spam!
I'm guessing related to this? The blog post is dated 11 days ago but I just noticed a blue banner on my actions page today.
Which certainly made me shit myself, briefly.