this post was submitted on 14 Dec 2023
196 points (98.0% liked)
Asklemmy
I broke the home page of a big tech (FAANG) company.
I added a call to an API created by another team. I did an initial test with 2% of production traffic + 50% of employee traffic, and it worked fine. After a day or two, I rolled out to 100% of users, and it broke the home page. It was broken for around 3 minutes until the deployment oncall found the killswitch I put in the code and turned it off. They noticed the issue quicker than I did.
What I didn't realise was that only some of the methods of this class had Memcache caching. The method I was calling did not. It turns out it was running a database query on a DB with a single shard and only 4 replicas that wasn't designed for production traffic. As soon as my code rolled out to 100% of users, the DBs immediately fell over from tens of thousands of simultaneous connections.
Always use feature flags for risky work! It would have been broken for a lot longer if I hadn't added one and they'd had to redeploy the site. The site was continuously pushed all day, but building and deploying could take 45+ mins.
This reminds me of the old saying: everyone has a test environment. Some people are lucky enough to have a separate production environment, too.
I work on a SOC team and we're really trying to hammer the usage of feature flags into our devs.
What are feature flags?
Feature flags are just checks that let you enable or disable code paths at runtime. For example, say you're rewriting the profile page for your app. Instead of just replacing the old code with the new code, you'd do something like:
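A minimal sketch of that kind of check, in Python. The flag name `new_profile_page`, the in-memory `FLAGS` dict, and both render functions are made up for illustration; a real system would read flags from a config service or database so they can be flipped without a deploy.

```python
# Hypothetical in-memory flag store. Real systems read this from a
# config service or database so it can change at runtime.
FLAGS = {"new_profile_page": False}


def is_enabled(flag_name: str) -> bool:
    """Return True if the flag is on; unknown flags default to off."""
    return FLAGS.get(flag_name, False)


def render_old_profile(user: str) -> str:
    # Placeholder for the existing, known-good code path.
    return f"old profile for {user}"


def render_new_profile(user: str) -> str:
    # Placeholder for the rewritten code path.
    return f"new profile for {user}"


def render_profile(user: str) -> str:
    # The old path stays in the codebase until the new one is proven.
    if is_enabled("new_profile_page"):
        return render_new_profile(user)
    return render_old_profile(user)
```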
Then you'd have some UI to enable or disable the flag. If anything goes wrong with the new page after launch, flip the flag and it'll switch back to the old version without having to modify the code or redeploy the site.
Fancier gating systems let you do things like roll out to a subset of users (eg a percentage of all users, or to 50% of a particular country, 20% of people that use the site in English, etc) and also let you create a control group in order to compare metrics between users in the test group and users in the control group.
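One common way to do a percentage rollout is to hash each user into a stable bucket and compare it against the rollout percentage. This is only a sketch under that assumption, not how any particular company's system works; `bucket` and `in_rollout` are hypothetical names.

```python
import hashlib


def bucket(user_id: str, flag_name: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def in_rollout(user_id: str, flag_name: str, percent: int) -> bool:
    """True if the user falls inside the first `percent` buckets."""
    return bucket(user_id, flag_name) < percent
```

Hashing the flag name together with the user ID keeps a user in the same bucket across requests, so raising the percentage only ever adds users, and different flags get independent buckets.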
Larger companies all have custom in-house systems for this, but I'm sure there are some libraries that make it easy too.
At my workplace, we don't have any Git feature branches. Instead, all changes are merged directly to trunk/master, and new features are all gated using feature flags.
Wow that's so effing smart!
Everything Dan said and more. They're sometimes also called canaries, although that's not quite the same thing. There's been a ton of times where services have been down for hours instead of minutes because a dev never built in a feature flag.
Canaries, as in the ones used in mines?
That's where the term derives from, yes.
What language? PHP, Python?