A little story from personal experience.
Most people think of feature flags as boolean on/off switches, maybe per user on/off switches.
If one is testing shades of color for a "Buy Now!" button, that may be OK. For more complex tests, my experience is that there are not a lot of users who tolerate experiments. Our solution was to represent feature flags as thresholds. We assigned a decimal number in [0.0, 1.0) to each user (we called it courage) and a decimal number in [0.0, 1.0] to each feature flag (we called it threshold). That way we not only enabled more experimental features for the most experiment-tolerant users, but these were the same users every time, so we could observe interactions between experimental features too. Also, deploying a feature was as simple as raising its threshold up to 1.0. User courage was 0.95 initially and could be updated manually. We tried to regenerate it daily based on surveys, but without much success.
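A minimal sketch of my reading of this scheme (the names and ranges are from the comment; the direction of the comparison is an assumption, inferred from "raising its threshold up to 1.0" enabling a feature for everyone and a later reply that courage 0 is maximum courage):

```python
def is_enabled(user_courage: float, flag_threshold: float) -> bool:
    """Threshold-style flag check (a sketch, not the original code).

    user_courage is in [0.0, 1.0), where 0.0 is the most experiment-tolerant
    user; flag_threshold is in [0.0, 1.0]. Raising a flag's threshold to 1.0
    enables it for every user, since every courage value is strictly below 1.0.
    """
    return user_courage < flag_threshold
```

With the default courage of 0.95, a user only sees flags whose threshold is above 0.95, i.e. features that are nearly fully rolled out; only the most experiment-tolerant users see low-threshold experiments.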
Interesting. At one place I worked, employees were excluded from experiments (they had to enable the flag personally to see them) by default. At one point, we had so many experiments that literally nobody except employees were using "that version" of the software. Everyone else was using some slightly different version (if you counted each feature as a version), and there were thousands and thousands of versions in total.
We ended up creating just ~100 versions of our app (~100 experiment buckets), and then you could join a bucket. Teams could even reserve sets of buckets for exclusive experimentation purposes. We also ended up reserving a set of buckets that always got the control group.
You've approached it a different way, and probably a more sustainable way. It's interesting. How do you deal with the bias from your 'more courageous' people?
>> How do you deal with the bias from your 'more courageous' people?
That's a great question. We had no general solution for that. We tried to survey people, but results were inconclusive, not statistically significant.
I mean that "courageous" people are more likely to take risks and accept new features and thus probably more likely to be attracted to novelty (see: Novelty Effect) and require longer experiments to understand the actual impact.
> At one point, we had so many experiments that literally nobody except employees were using "that version" of the software. Everyone else was using some slightly different version
Was this at Spotify by any chance? :)
No.
> User courage was 0.95 initially and could be updated manually. We tried to regenerate it daily based on surveys, but without much success.
Based on this ending, the courage bit sounds clever but is misguided. It adds complexity in a whole other variable, yet you have no way of measuring it or even do a good assessment.
I thought you were going to describe how you updated courage based on the statistical usage of new features vs. old ones when users were exposed to them: people who keep using the product when it changes have more courage, so they see more changes more often. But surveying for courage (or how easily people deal with change) is probably the worst way to assess it.
But even then, I don't know what purpose that would serve, because now you've destroyed your A/B test by selecting a very specific subpopulation, so your experiment/feature results won't be good. I'm assuming a product-experimentation approach here, not just "does it work or not" flags.
Mostly functional changes, like deploying a new parser which may not support all the old files. There were users who would contact customer support in a panic, stating that their life was ruined by this change, and there were users who'd just like that fixed by next quarter.
What’s important is if it worked for you and your audience.
There’s no standard requiring something to work for everyone, and it being less value if it isn’t.
Sounds like an overengineered solution to something that can be solved as simply as with a checkbox "I would like access to experimental features" in the UI.
I'd go with that option too. I don't think users want to be surprised with being experimented on. Some users could take it worse than others.
I respectfully disagree. It depends on the number and severity of experiments. Comparing two decimals is really no harder than checking a boolean; it's still a single "if". I don't see much over-engineering here.
Getting one or a few new features is one thing, getting too many might be too much.
Some granularity and agency for the user is valuable. Maybe let them pick everything as a whole or a few features at a time.
What does "tolerating experiments" mean? If they can tell it's an experiment, then isn't your change bad?
Do you mean "tolerate change"? But then you still eventually roll out the change to everyone anyway...
Or do you mean that users would see a different color for the "buy now" button every day?
From a purely statistical point of view, if you select users which "tolerate" your change before you measure how many users "like" your change, you can make up any outcome you want.
I think you might be mixing things up a bit.
the tolerance score wouldn't be tied to a specific change. it's an estimate of how tolerant a person is of changes generally.
it's not that different from asking people if they want to be part of a beta testers group or if they would be open to being surveyed by market researchers.
targeting like that usually doesn't have a significant impact on the results of individual experiments.
If only people who like changes like your change, should you really go ahead?
Plus you don't know what that correlates to. Maybe being "tolerant of changes" correlates with being particularly computer-savvy, and you're rolling out changes that are difficult to navigate. Maybe it correlates to people who use your site only for a single task, it would appear they don't mind changes across the platform, but they don't see them. Maybe it correlates with people who hate your site now, and are happy you're changing it (but still hate it).
You can't use a selected subset that is not obviously uncorrelated from your target variable. This is selection bias as a service.
I suspect it’s because some users will actually be pioneers and early adopters vs believing they are.
This kind of threshold adds some flexibility into the subjectivity of finding the best cohort to test a feature with.
Where the best cohort to test with is the one that agrees with you...
You can call this measure "courage" but that is not actually what you are measuring. What you measure is not that different from agreement.
I didn’t use the word courage, still I understand what you’re saying.
adontz did above, that's what they called this user-tolerance-for-experiments metric. I didn't mean to imply you would too, apologies.
Oh, no need to apologize at all.
I could have clarified as well that I was leaning more towards the user-tolerance... or as I like to call it user-guess that this feature might be OK with them :)
Another thing I like about granular and flexible feature flag management is you can really dial in and learn from which features get used by whom, actually.... instead of building things that will collect dust.
This seems like it would skew the data significantly for certain use-cases.
Unless you're feature flagging to test infra backing an expensive feature (in which case, in a load-balancer / containerised world, bucketing is going to be a much better approach than anything at the application level), you most likely want to collect data on acceptance of a feature. By skewing it toward a more accepting audience, you're getting less data on the userbase that you're more likely to lose. It's like avoiding polling swing states in an election.
From your naming, I would have done the opposite :) Start with courage 0.05 and show experiments whenever it is greater than the threshold. To enable a feature for everybody, you lower the threshold to 0.
How did you measure "experiment tolerance"?
Yeah, naming was bad. Courage 0 is maximum courage.
>> How did you measure "experiment tolerance"?
Feedback from CS mostly. No formal method. We tried to survey clients to calculate courage metric, but failed to come up with anything useful.
This seems really complex, specifically in the area where I find product, CS & marketing least likely to want it: targeting and controlling their audience. It sounds like a cool thought experiment, fun and challenging to implement, but not really practical or useful.
If you have a huge userbase and deploy very frequently FFs are great for experiments, but for the rest of us they're primarily a way to decouple deploys from releases. They help with the disconnect between "Marketing wants to make every release a big event; Engineering wants to make it a non-event". I also find treating FFs as different from client toggles is very important for lifecycle management and proper use.
More than the binary nature, I think the bigger challenge is that FFs are almost always viewed as a one-way path, "Off -> On -> Out". But what if you need to turn them off and then back on again? That can be very hard to do properly if a feature is more than UI: it might cause data to be created or updated that the old code then clobbers, or issues between subsystems, like microservices that aren't as "pure" as you thought.
Yes, it's not a good solution. Targeting was missing, good catch. I've just shared an unusual experience to inspire further experimenting.
Speaking as an open-source feature flag 'vendor' (https://github.com/flipt-io/flipt), the OpenFeature organization has been a joy to work with. They are very welcoming of new contributors (e.g., implementing a provider SDK in a new language).
If you're interested in this space I'd recommend lurking in their CNCF Slack Channel https://cloud-native.slack.com/archives/C0344AANLA1 or joining the bi-weekly community calls https://community.cncf.io/openfeature/.
A coding world with more standardization will be a better world.
This week I came across "Standardized Interface for SQL Database Drivers" (https://github.com/halvardssm/stdext/pull/6), for example, and https://github.com/WICG/proposals/issues too.
It's huge work to get everybody on the same page (my previous example, for instance, hasn't seen much engagement: https://github.com/nodejs/node/issues/55419), but when it's done, and done right, it's a huge win for developers.
PHP PSR, RFC & co are the way.
I was thinking of PSR interfaces when I was reading this!
I don’t get it. Why is this needed above and beyond the standard ways of configuring deployed services?
Do you mean feature flags? These enable you to change configuration at runtime. Ex: A/B testing, changing a behavior for a subset of users, or disabling a feature when you want to (particularly useful when you are doing Trunk Based Development and don't want to expose a beta feature to everyone, for example).
But why do you need an external service for that? Isn’t that basically a single DB table with a name and an on/off value for each flag (or maybe an integer for multiple options)?
In its simplest incarnation, yes, it could be just a single DB table with boolean flags.
However there are a lot of connected needs that most real world-usages run into:
- Per-user toggles of configuration values
- Per-user dynamic evaluation based on a set of rules
- Change history, to see what the flag value was at time of an incident
- A/B testing of features and associated setting of tracking parameters
- Should be controllable by e.g. a marketing/product manager and not only software engineers
That can quickly grow into something where it's a lot easier to reach for an existing, well-thought-out solution rather than trying to home-grow it.
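For concreteness, the "single DB table" starting point might look like this (an illustrative sketch; the schema and flag names are my own, not from any vendor):

```python
import sqlite3

# Minimal single-table feature flag store: one row per flag.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feature_flags (name TEXT PRIMARY KEY, enabled INTEGER NOT NULL)"
)
conn.execute(
    "INSERT INTO feature_flags VALUES ('new-checkout', 1), ('beta-parser', 0)"
)

def is_enabled(name: str, default: bool = False) -> bool:
    # Fall back to a default when the flag doesn't exist, so a missing row
    # can't accidentally turn a feature on (or crash the check).
    row = conn.execute(
        "SELECT enabled FROM feature_flags WHERE name = ?", (name,)
    ).fetchone()
    return bool(row[0]) if row is not None else default
```

This covers none of the needs listed above (per-user rules, change history, non-engineer access), which is exactly where the managed tools earn their keep.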
In a microservice world, do you want to track features in each service, or have a single source of truth using flagd? I prefer a central source of truth.
In a microservice world obviously you'd have a feature-flag service. But you still have a build/buy consideration.
Great summary. The more parties involved with more and more configurations the more management of details is needed.
No I mean an entire framework and set of software components for doing feature flags.
If you have a single database then maybe you can (and should?) just start with a basic, single-table approach, but as you grow in size and complexity, FF management can become a challenge, with reporting gaps and feature release management. I usually see two characteristics with the former approach: growth in the number of FFs over time, and a messy Excel report of what they are, what they do, and whether anyone still hits the old code. This might be fine for a while, or forever, but it often gets painful.
Clearly you haven't worked at an org that uses something like this extensively (LaunchDarkly for example.)
Are there any big feature flag SaaS vendors that support this? Like LaunchDarkly, Flagsmith, Unleash etc?
Hey there - one of the Flagsmith founders here - yes we are supporting it, building adapters for our SDKs and I'm on the CNCF project governance board.
We've got the core functionality pretty much down now, and so there's some more interesting/challenging components to think about now like Event Tracking (https://github.com/open-feature/spec/issues/276) and the Remote Evaluation Protocol (https://github.com/open-feature/protocol)
Hey there! Andrew here, Community Manager for OpenFeature and DevRel lead at DevCycle. We (DevCycle) have worked hard to ensure an OpenFeature Provider is available for every language supported by OpenFeature and for which we have an SDK (https://docs.devcycle.com/integrations/openfeature)
Yes there are, as I am part of the openFeature community, I have to point you to https://openfeature.dev/ecosystem where you'll see all kinds of providers which are supported (some officially, some by the community)
LaunchDarkly has a mix of OpenFeature providers they wrote, and quite reasonable community-contributed ones, depending on language. They are also very actively engaged with OF in meetings, discussions, etc.
(We are a big LD user at work.)
Looks like an interesting project. Really cute logo. :)
How much does the flagd sidecar cost? Seems like that could be a lot of overhead for this one bit of functionality.
Nice! Some time ago I made a small PoC with usage on the frontend (Next.js), backend (JS), and a flag provider (flagd + a flag API that serves JSON flags from a DB).
Cool stuff
https://github.com/grmkris/openfeature-flagd-hono-nextjs
I am looking to maybe support this in
https://github.com/vhodges/ittybittyfeaturechecker
probably via https://openfeature.dev/specification/appendix-c (I don't have time to maintain a bunch of providers).
We are evaluating new solutions at work and OpenFeature is something we're interested in. (I did the home grown solution that's in use by one product line)
I can see that this might be very useful, since it is more of an application-configuration specification that goes far beyond simple flags. In the end, a common provider that works securely across all services and clients is probably the real problem.
I hope that OpenFeature changes the feature flagging space the same way that OpenTelemetry impacted the o11y space, we are overdue for this (in my biased opinion)
As someone who’s been thinking about feature toggles and continuous delivery often lately, OpenFeature has been helpful.
Kudos to the team!
java version embeds lombok symbols lol
Forgive my ignorance, but what should it be doing instead?
Lombok is a very divisive framework in Java, with strong opinions on both sides.
Given that, it's a bold choice to include Lombok in a library that other developers will pull into their stack - it's likely to make this a non-starter from those in the 'no' camp.
As Lombok is just compiler sugar, when building an SDK for other developers, it's probably less alienating to just write the boilerplate that Lombok saves you from.
Lombok is a compile-time dependency. Consumers of a library using lombok don't need to depend on lombok, so I don't see why it would matter?
The symbols remain in the final library, necessitating either class exclusions within the scope of a JAR you don't control (which is a terrible idea) or the addition of a dependency which is irrelevant, inert, and has no place in your codebase.
It is embarrassing for a library to ship ABI-visible symbols from Lombok.
Where is the tldr? Anyone familiar…what does this do and why do we care about it being standards based?
This is a “standard” SDK for feature flags, allowing you to avoid vendor lock-in.
i.e., using feature flag SaaS ABC but want to try out XYZ? If you're using ABC's own SDK, you have to refactor your codebase.
I appreciate that you can use the OpenFeature SDK with environment variables, and move into a SaaS (or custom) solution when you’re ready.
https://openfeature.dev/docs/reference/intro/
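To make that migration path concrete, here's a hedged sketch of the pattern in plain Python (this is not the real OpenFeature SDK; the class and method names are illustrative): resolve flags from environment variables behind a small provider interface that you could later back with a vendor instead of touching every call site.

```python
import os

class EnvVarFlagProvider:
    """Illustrative stand-in for a flag provider (not the actual OpenFeature
    API): resolves boolean flags from environment variables such as
    FLAG_NEW_CHECKOUT=true, falling back to a default when unset."""

    def get_boolean_value(self, flag_key: str, default: bool) -> bool:
        # "new-checkout" -> "FLAG_NEW_CHECKOUT"
        env_name = "FLAG_" + flag_key.upper().replace("-", "_")
        raw = os.environ.get(env_name)
        if raw is None:
            return default
        return raw.strip().lower() in ("1", "true", "on", "yes")
```

Swapping to a SaaS backend later then means replacing the provider behind the interface, not refactoring the flag checks scattered through the codebase.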
the laziness on this site never ceases to amaze
and the use of "we" to somehow give the impression that this person speaks for everyone
Martin Fowler about feature flags: https://martinfowler.com/articles/feature-toggles.html
No, that is Pete Hodgson on martinfowler.com. Most articles on martinfowler.com haven't been written by Martin Fowler himself in years. It's best thought of as a publishing venue for Thoughtworks.
Pete is a great guy, also on the OpenFeature governance board :)