- Fix Fixed a timing issue where Gremlin attacks would incorrectly end up `HaltFailed` when multiple rollback events happened in a short time. Rollbacks already successfully clean up impact, but this fix ensures the result is not `HaltFailed`.
- Fix Removed the recommended package `kernel-modules-extra` from Gremlin rpms, as this would often trigger an upgrade of the user's kernel when installing Gremlin.
- Info Updated dependencies for security patches.
- New Introducing experiment GPU, which will attempt to exhaust all compute resources for the system's GPUs.
- Info Updated dependencies
- New Checks for required kernel modules during installation, fails and provides suggestions if modules aren't found.
- Info Updated dependencies for security patches.
- New Support for container attacks on Linux versions older than 4.6 where the cgroup namespace isn't present.
- Info Updated dependencies for security patches.
- Fix Disallowed running more than one daemon at a time, which caused `Connection Refused` errors.
- Info Updated dependencies
- New Dependency Discovery now uses open connections in combination with DNS records to discover only hostnames for which the Gremlin Agent has seen an active connection. Users should no longer see dependencies discovered for services that only make DNS queries to a particular hostname, but never actually connect to them.
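
The filtering described in the Dependency Discovery note above can be pictured with a short sketch. It is illustrative only and is not the Gremlin Agent's implementation: the function name `discovered_dependencies` and the input shapes (a hostname-to-addresses mapping and a list of remote addresses) are assumptions made for this example.

```python
from ipaddress import ip_address


def discovered_dependencies(dns_records, open_connections):
    """Return hostnames that were both resolved and actually connected to.

    dns_records: dict mapping hostname -> iterable of IP address strings
                 (hypothetical shape, chosen for this sketch)
    open_connections: iterable of remote IP address strings
    """
    connected = {ip_address(ip) for ip in open_connections}
    return sorted(
        host
        for host, addrs in dns_records.items()
        if any(ip_address(addr) in connected for addr in addrs)
    )


dns_records = {
    "db.internal.example": ["10.0.0.5"],
    "metrics.example": ["10.0.0.9"],  # resolved, but never connected to
}
open_connections = ["10.0.0.5", "10.0.0.7"]
print(discovered_dependencies(dns_records, open_connections))
# ['db.internal.example']
```

A hostname that was only queried over DNS (`metrics.example` here) is dropped because none of its addresses appears among the open connections.
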
- New When running either the `docker-linux`, `containerd-linux`, or `crio-linux` container driver, Disk experiments against containers now correctly impact the target's volume resources. Both ephemeral storage and external volumes are supported. See the Disk experiment for more details.
- New During impact, the Disk experiment no longer keeps the files it produces hidden. This allows external monitoring systems like the kubelet to correctly measure changes in disk usage. See Ephemeral storage consumption management.
- Fix Fixed a bug where the Gremlin Agent would not find a target container's cgroup when running on cgroup1 with subsystems defined at different paths.
- Fix Fixed a bug where the Gremlin Agent could not parse cgroup data about a target container when both a cgroup2 hierarchy and a cgroup1 named hierarchy were present. This is common on systems like BottleRocket when the Admin Container is enabled.
- Info Updated dependencies
- Fix Fixed a bug released in 2.51.0 where the Gremlin Agent's Shutdown attack would fail to shut down a target container. This bug was specific to agents configured with the `docker-runc`, `containerd-runc`, and `crio-runc` drivers. Agents configured with the default container drivers were unaffected.
- Info Updated dependencies
- Fix Fixed a bug released in 2.51.0 where CPU attacks against containers would ignore the `--percent` argument, defaulting instead to `100%`.
- Info Updated dependencies
Note: this version contains important fixes for container targeting in both the 2.50.0 and 2.51.0 releases.
- Fix Fixed a bug released in 2.50.0 which resulted in the Gremlin agent reporting false rollback failures against container targets. The Gremlin agent could sometimes get stuck trying to roll back experiments that had already been cleaned up, leading to excessive log messages and a failure to receive new experiments.
- Fix Fixed a bug released in 2.51.0 where network experiments against containers may attempt to target devices present on the host. Such experiments fail before they fully initialize and no impact to the host devices is made.
- Fix Fixed a bug released in 2.51.0 where network experiments against containers fail to detect if conflicting network traffic shaping is present. Experiments fail to initialize as a result. No impact is made to conflicting traffic shaping rules.
- Fix Fixed a bug released in 2.51.0 where memory experiments against containers use the host's total memory when calculating desired consumption, leading to experiments that may consume more than the desired amount when the `--percent` argument is passed.
- Info Improved logging around actions the Gremlin agent takes to rollback active experiments.
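
To see why the basis of the percentage matters in the memory fix above, consider the following sketch. It is not Gremlin's code: the function name and the choice of the container's memory limit as the basis are assumptions made for illustration.

```python
def desired_consumption_bytes(percent, host_total_bytes, container_limit_bytes=None):
    """Bytes to consume for a --percent style target.

    Assumption for this sketch: when the target container has a memory
    limit, the percentage applies to that limit rather than to the host's
    total memory.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    base = host_total_bytes
    if container_limit_bytes is not None:
        base = min(container_limit_bytes, host_total_bytes)
    return base * percent // 100


GiB = 1024 ** 3
# Container limited to 2 GiB on a 16 GiB host, 50 percent requested.
print(desired_consumption_bytes(50, 16 * GiB, 2 * GiB) // GiB)  # 1
# Same request computed against the host total, as in the buggy behavior.
print(desired_consumption_bytes(50, 16 * GiB) // GiB)  # 8
```

With the host total as the basis, the same 50 percent request asks for 8 GiB, far beyond what the 2 GiB container was meant to absorb.
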
Note: this version was removed from Docker Hub after identifying bugs that impacted network and memory container attacks. A patch version will be released as a replacement.
- New Blackhole experiments now automatically add the route to the Gremlin service to the exclusion rules so that connection is maintained during the attack. This can be disabled if needed via `--no-derived-exclusion-rules`.
- Info Updated dependencies
- Fix Fixed bugs with the Gremlin Agent's session renewal process, cleaning up spurious `Unauthorized` errors and preventing rare instances where local session storage can get corrupted.
- Fix The `*-linux` container drivers now validate whether the Gremlin Agent was installed in the host's PID namespace (e.g. using the `gremlin.hostPID=true` helm chart argument).
- Info Updated dependencies
- New The Gremlin Agent no longer has the `collect_processes` option. Setting this value to `true` is now ignored. Dependency discovery features are now controlled only by `collect_dns`, which is `true` by default.
- Fix Upon receiving a shutdown signal (e.g. `SIGTERM`), the Gremlin Agent will wait for any running attacks to finish halting before shutting down. This fixes an issue where attacks would end up `Failed` or `LostCommunication` instead of `ClientAborted` when the Gremlin Agent was terminated during such attacks.
- Info Updated dependencies
- New The Gremlin Agent now reports the DNS servers that are used by the host on which it is installed. This data is used to select a random DNS server to impact for our new `Redundancy: DNS` reliability test, available from the Well-Architected Cloud Test Suite.
- Fix Enabled targeting of container paths for the IO attack
- Fix Improved the read I/O attack to bypass the page cache and read directly from disk
- Fix Enhanced logging
- Fix Added missing capabilities on host installations for running container attacks: `SYS_ADMIN`, `SYS_RESOURCE`, `CAP_SYS_CHROOT`
- Fix Suppressed dependency discovery log events that were too noisy. Events would occur when processing DNS traffic from containers that have since exited.
- Fix Improved logging in situations where `gremlind` cannot open or parse configuration and certificate files.
- Info Updated dependencies
- Fix Critical Fix a bug introduced in 2.44.0 where Blackhole attacks failed to clean up impact on ingress traffic during a `Halt` or `ClientAborted` event. All users on affected versions are advised to upgrade as soon as possible to avoid any impact left behind from Gremlin attacks.
- Fix Print details about eligible container drivers that failed to load due to missing requirements.
- Fix Resolve to IPv4 addresses over IPv6 addresses during cert expiry experiments.
- Fix Print full error when unable to inspect a network device.
- New Add support for targeting by zone and region based on the Kubernetes labels `topology.kubernetes.io/zone` and `topology.kubernetes.io/region`
- New Introducing experiment Process Exhaustion, a way to consume processes to identify limits within the target system.
- New New container drivers are available: `docker-linux`, `containerd-linux`, `crio-linux`, which spawn attacks with significantly reduced CPU and IO system usage. Attacks against container processes no longer require direct integration with `runc`. These drivers can be enabled by removing volumeMounts from the Gremlin daemonset for `/run/docker/runtime-runc/moby`, `/run/containerd/runc/k8s.io`, and `/run/runc` respectively.
- Fix Rolling back network attacks no longer considers missing network devices as a critical error. This accounts for failure modes where the network device is torn down externally.
- Fix Better detection around pre-existing ingress rules which conflict with Gremlin blackhole attacks. This can happen with network integrations such as Cilium and Kata, or any networking integration which applies some level of traffic shaping on ingress network traffic. Gremlin now skips impact when conflicts are detected and prints a warning to the attack log.
- New During a rollback, the `gremlind` process sends a `SIGTERM` to the associated attack process before proceeding to clean up any remaining impact.
- Info Removed `--attacker` and `--target` arguments from `gremlin rollback-container`; the target container can still be supplied as the first argument (e.g. `gremlin rollback-container $TARGET_ID`).
- Info Improved logging in `daemon.log` when attacks are rolled back.
- Info Raised the TCP connect timeout for API requests that transition attacks between stages from 1 second to 5 seconds.
- Info Enabled DNS collection by default, disabled process collection by default.
- Fix Addressed an issue where rollback would fail when no teardown was required.
- Fix Fixed a regression introduced in 2.38.0 that prevented automatic rollback of attacks when the Gremlin agent loses connection with its control plane.
- Info Removed dependency on the system `pgrep` utility during Process Killer attacks. Gremlin now identifies processes directly.
- Info A warning is now emitted when `/proc/sysrq-trigger` is mounted in installations of the `gremlin/gremlin` container image and a shutdown attack is run. Install the Gremlin agent container into the host's PID namespace instead to initiate a host-level shutdown.
- Info Updated dependencies.
- New Added a new DNS-based dependency collection feature. Learn more about this feature here.
- New Added `CAP_NET_RAW` capability for systemd installs
- Fix Print full error on rollback failures.
- Info Updated dependencies.
- New Better error messages for `no container driver` errors that can occur during container attacks if the underlying container runtime becomes unreachable. Error messages now include the failures received from each container runtime for which a connection was attempted.
- Fix Fixed a bug where Gremlin would sometimes choose the wrong container driver when multiple container runtimes are present, resulting in failed attacks that indicate the targeted container no longer exists.
- Fix Removed the file decompression steps that were introduced in 2.39.0 due to the memory overhead this optimization introduced. A future release will optimize container attack provisioning to a more significant degree.
- Fix Fixed an incomplete error message when the `gremlind` process receives API errors from AWS IMDSv2 endpoints.
- Info Updated dependencies.
- New File system resources for Gremlin container attacks are decompressed on startup of the `gremlind` agent, which reduces `gremlind`'s CPU usage at attack time.
- New Provided Gremlin has access to a valid AWS credentials chain, it now interprets AWS ARN values in `GREMLIN_TEAM_ID`, `GREMLIN_TEAM_SECRET`, `GREMLIN_TEAM_CERTIFICATE_OR_FILE`, `GREMLIN_TEAM_PRIVATE_KEY_OR_FILE`. Gremlin supports ARN values from AWS Secrets Manager or AWS Systems Manager Parameter Store. Gremlin can optionally be supplied with `GREMLIN_IAM_ROLE` to specify a role to assume for the strict purpose of fetching secret values.
- Fix More context is added to various error messages
- Fix Fixed a regression introduced in 2.37.0 where attacks with invalid arguments would end up `Lost Communication` instead of `Failed`
- Info Updated dependencies
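
As a rough illustration of telling the two supported ARN sources apart, the sketch below classifies a configuration value by its ARN service segment. It is an assumption-laden example, not Gremlin's parsing logic: the function name `classify_secret_arn` and the sample ARNs are made up, and only the general AWS ARN layout (`arn:partition:service:region:account-id:resource`) is relied upon.

```python
def classify_secret_arn(value):
    """Return 'secretsmanager', 'ssm', or None for any other value.

    Hypothetical helper for illustration; it inspects only the ARN text
    and makes no AWS API calls.
    """
    parts = value.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        return None  # a literal value rather than an ARN
    service, resource = parts[2], parts[5]
    if service == "secretsmanager" and resource.startswith("secret:"):
        return "secretsmanager"
    if service == "ssm" and resource.startswith("parameter/"):
        return "ssm"
    return None


# Sample values are invented for this example.
print(classify_secret_arn(
    "arn:aws:secretsmanager:us-east-1:123456789012:secret:team-secret-AbCdEf"
))  # secretsmanager
print(classify_secret_arn(
    "arn:aws:ssm:us-east-1:123456789012:parameter/gremlin/team-id"
))  # ssm
print(classify_secret_arn("plain-team-id"))  # None
```
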
- Fix Fixed a bug where Gremlin would prevent sending arbitrary signals to PID 1. Now, only `SIGKILL` is prevented, which is unsupported against PID 1 on Linux.
- Info Security patches for: CVE-2023-29403, CVE-2023-44487, CVE-2023-39325, CVE-2023-29406, CVE-2023-39318
- New Attacks can sometimes fail to notify the Gremlin Control Plane when its connection is impacted by the attack itself. The Gremlin agent now tolerates these failures more often and attempts to resend failed notifications. This fixes attacks that end up in the `HaltFailed` stage that would otherwise finish in the `Successful` stage.
- Fix Fixed an issue where Certificate Expiry attacks against containers would fail when Gremlin was configured with `SSL_CERT_FILE`
- Fix Fixed an issue where important errors from container attacks were not properly forwarded to the Gremlin control plane, leaving execution outputs from failed attacks without helpful troubleshooting information.
- New Improved the output of the Gremlin Agent validation routine that happens on startup. When validation fails, details about the failure are written to `daemon.log`
- Info Updated dependencies
- Fix Fixed an issue where attacks were incorrectly labeled `HaltFailed` when Gremlin fails to notify `api.gremlin.com` during teardown of the network impact.
- Fix Fixed a class of issues where Gremlin would not retry requests that failed with transient network errors. This sometimes led to failing container attacks that should otherwise succeed.
- New For users running Gremlin in the Docker container runtime, rollbacks against container targets no longer require provisioning a second container instance, which results in faster rollbacks.
- New Gremlin provides more context to errors stemming from failed http requests to `api.gremlin.com`.
- Info Updated dependencies
- Fix Fixed an issue where the Gremlin agent would ignore changes to the `identifier` field in `config.yaml` if a valid session has already been generated and is not yet expired. On startup, the Gremlin agent will now correctly regenerate a session using the intended `identifier` value if it detects that its existing session belongs to a different value for `identifier`.
- Fix Fixed an issue with Certificate Expiry experiments against container targets, where the attack process would not have sufficient Linux capabilities (missing DAC_READ_SEARCH). This fix requires helm chart release `0.11.0` (See #86), however all other attacks will continue to work correctly without this chart update.
- Fix Updated Certificate Expiry experiments to discover IPv4-mapped IPv6 addresses (e.g., `::FFFF:192.168.1.1`) when a CIDR is specified.
- Fix Fixed a regression introduced in 2.22.1 where the Process Killer experiment would incorrectly interpret the `interval` argument as milliseconds, instead of seconds as intended.
- Info Updated dependencies
- New Running Certificate Expiry experiments against CIDR values (e.g., `10.0.0.0/24`) will make several attempts to find an active IP address in use by the target system for evaluating certificate expiration characteristics within the duration specified by the argument `--length`.
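
The IPv4-mapped IPv6 case mentioned above can be sketched with Python's standard `ipaddress` module. This is only an illustration of the matching idea, not Gremlin's implementation; the helper name `in_cidr` is an assumption made for this example.

```python
from ipaddress import ip_address, ip_network


def in_cidr(address, cidr):
    """Report whether address falls inside cidr.

    IPv4-mapped IPv6 addresses (e.g. ::FFFF:192.168.1.1) are unwrapped to
    their IPv4 form before the comparison.
    """
    addr = ip_address(address)
    net = ip_network(cidr, strict=False)
    # Only IPv6Address objects carry ipv4_mapped; it is None when the
    # address is not an IPv4-mapped one.
    mapped = getattr(addr, "ipv4_mapped", None)
    if mapped is not None:
        addr = mapped
    return addr.version == net.version and addr in net


print(in_cidr("::FFFF:192.168.1.1", "192.168.1.0/24"))  # True
print(in_cidr("::FFFF:10.0.0.1", "192.168.1.0/24"))     # False
print(in_cidr("192.168.1.7", "192.168.1.0/24"))         # True
```
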
- New When installed directly on the host and launched with SystemD, Gremlin agent now runs with ambient capabilities (capabilities(7)). File capabilities are no longer set on `/usr/bin/gremlin` or `/usr/sbin/gremlind`.
- New When installed directly on the host, the suid bit is no longer set for installed binaries `/usr/bin/gremlin` and `/usr/sbin/gremlind`. Additionally, these binaries are no longer owned by the `gremlin` linux user, but owned by `root` instead.
- Info To install Gremlin with file capabilities and `gremlin` Linux user ownership in accordance with previous Gremlin versions, set the appropriate `GREMLIN_INSTALL_` configuration variables at install time: `GREMLIN_INSTALL_USER=gremlin GREMLIN_INSTALL_GROUP=gremlin GREMLIN_INSTALL_BIN_MODE=6111 GREMLIN_INSTALL_BIN_CAPABILITIES=1 sudo -E yum install gremlin gremlind`. See Customize Gremlin's Linux User and Group.
- New Previously, `gremlind` would emit snapshots of process and socket data to Gremlin's control plane over 2 minute intervals. This release significantly reduces network overhead for this data as `gremlind` now batches up process data over 15 minute intervals, deduplicating all network and process data detected over this interval.
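
The batching and deduplication described above might look roughly like the following sketch. It is not how `gremlind` is implemented: the `Observation` tuple, the `BatchCollector` class, and the caller-supplied clock are all assumptions chosen to keep the example small and deterministic.

```python
from collections import namedtuple

# Hypothetical record shape for one observed process/connection pair.
Observation = namedtuple("Observation", "process remote_host remote_port")


class BatchCollector:
    """Collect observations and emit one deduplicated batch per interval."""

    def __init__(self, interval_seconds=15 * 60):
        self.interval_seconds = interval_seconds
        self._seen = set()
        self._window_start = None

    def record(self, observation, now):
        """Record an observation at time `now` (seconds).

        Returns the previous window's deduplicated batch when the interval
        has elapsed, otherwise None.
        """
        batch = None
        if self._window_start is None:
            self._window_start = now
        elif now - self._window_start >= self.interval_seconds:
            batch = sorted(self._seen)
            self._seen = set()
            self._window_start = now
        self._seen.add(observation)
        return batch


collector = BatchCollector()
conn = Observation("nginx", "10.0.0.5", 443)
collector.record(conn, now=0)
collector.record(conn, now=120)         # duplicate within the same window
print(collector.record(conn, now=900))  # one entry despite two sightings
```

Repeated sightings of the same connection inside a window collapse to a single entry, which is where the reduction in uploaded data comes from.
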
- New Gremlin now uploads discovered process data at a slower rate, reducing network overhead.
- Fix Fixed a regression released in 2.31.0 where the gremlin agent would set the Host header to an incorrect value for outgoing requests to the Gremlin control plane. This can lead to authentication failures for some intermediate web proxies that use this host header for authorizing requests.
- Fix Errors related to spawning subprocesses now have more detailed information useful for troubleshooting.
- Fix IO Errors related to Gremlin container attacks now have more detailed information useful for troubleshooting.
- Fix Gremlin provisions fewer file resources for its attack sidecar processes, reducing the time it takes to launch container attacks.
- Fix For hostnames supplied to network attacks, Gremlin delegates DNS queries to the operating system. When this query fails, Gremlin now attempts to resolve the name completely within the running process in an attempt to overcome operating system failures. This allows Gremlin network attacks to continue in the face of failed DNS processing.
- Fix Fixed a comment in Gremlin's `config.yaml` which incorrectly stated that `collect_processes` was disabled by default.
- Fix Fixed an out-of-memory error caused by a 3rd party library during process collection.
- Fix Fixed a regression introduced in 2.29.0 where containers for each attack execution incorrectly bind-mounted the file system of every other attack container running on the host. Given enough attack executions running at the same time, a new attack execution container receives a `no space left on device` error when attempting such mounts, despite space available. Gremlin no longer makes such mounts.
- Fix When running the gremlin/gremlin container image, attack containers no longer run in the `hostPath` mount `/var/lib/gremlin`. This would produce `permission denied` errors on systems where this file system is mounted with the `noexec` flag, such as GKE COS.
- New The Certificate Expiry attack's `ipaddress` argument now correctly processes CIDR values (e.g. `10.0.0.0/24`). When passed, Gremlin will attempt to find an active IP Address in use by the target system and use it for evaluating certificate expiration characteristics.
- New The gremlin/gremlin Docker Hub image now contains the `strace` utility as a convenience for operators that cannot install this utility from the internet.
- New The Blackhole attack skips impact on ingress traffic when it detects third-party ingress traffic manipulation rules, such as those installed by a CNI like cilium. This allows egress impact to be applied without failing the attack with errors like `Exclusivity flag on, cannot modify`.
- Info Updated dependencies.
- Fix Gremlin's calls to getaddrinfo now fall back to TCP when a nameserver replies with a truncated answer. For more info, see musl libc 1.2.4.
- Info Updated dependencies.
- Fix Gremlin now tears down the TCP connection pool with `api.gremlin.com` after successive timeout failures.
- Fix Gremlin includes the name of the targeted network interface in execution log events related to applying network impact.
- Info Updated dependencies.
- Fix Fixed an issue where Gremlin would not report back to the control plane the detailed error that occurred during a failed attack. Users encountering this bug may see `http: 415: 415` in their execution log.
- Fix Fixed an issue where `gremlin check api` would incorrectly report connection failures, including an error message of `http 403`.
- Fix Fixed several instances where errors were suppressed from http interactions made by the Gremlin agent. All failed http interactions now show the method and path of the attempted call, along with descriptive error messages.
- Fix Fixed an issue where Gremlin would run requested attack executions in a way that was detached from the original attack request. This leads to the original attack request ending in a `LostCommunication` stage, while the detached attacks continue to run.
- Fix Corrected the `ExecStartPre` option in the `gremlind.service` file which resulted in nuisance errors.
- Info Updated dependencies.
- Fix Fixed a bug introduced in 2.31.0 where `gremlin init` would fail unless the environment variable `GREMLIN_TRANSPORT=direct` was set.
- Info Added support for tag values to be any simple YAML datatype (boolean, integer, float, string). Previously only strings were supported.
- Info Updated dependencies.
- New Gremlin can now target container and Kubernetes targets, even when those targets lack network access to `api.gremlin.com`.
- New All network traffic from Gremlin attack processes is now routed through `/var/lib/gremlin/gremlin.sock`. To disable this behavior, provide the following environment variable to the Gremlin agent: `GREMLIN_TRANSPORT=direct`
- Fix Fixed an issue that prevented Gremlin from ingesting Azure Tags.
- Fix Fixed an issue that made Gremlin validation unreliable.
- Info Updated dependencies.
- Fix Addressed an issue where Gremlin agents enabled with `GREMLIN_TEAM_SECRET` would fail to start when also configured with `GREMLIN_TRANSPORT=domain-socket`
- New Gremlin's version command now prints more build information.
- New Gremlin can now target container and Kubernetes targets, even when those targets lack network access to `api.gremlin.com`. See more information at Preview: Gremlin in Kubernetes Restricted Networks
- Fix Addressed performance issues that were seen with `gremlind` when `collect_processes=true` which would lead to high CPU usage and agents becoming `IDLE`. Symptoms occurred on systems running many processes and active network connections (over 1K of each).
- New Various metrics around data collection have been added to the output of `gremlin check daemon` for benchmarking purposes.
- New A warning is now supplied in execution logs when the `device` argument specifies a device that does not exist.
- Fix Fixed a regression in 2.30.0 in which network attacks running in a container without targeting a specific network interface failed to have any impact.
- New Improved the strategy for selecting the target network interfaces.
- New Multiple network interface attacks are now supported. Details are available in Network device selection.
- New IP address and network interface data is collected to improve distributed network attacks.
- Info Updated dependencies.
- New Gremlin Container attacks no longer create a new Linux mount namespace for the attack. Instead, `gremlin` attack processes now run in the namespace of the `gremlind` agent. For Kubernetes environments running AppArmor, this release requires a helm chart update.
- Info Updated dependencies.
- Fix Fix a bug in collect_certs when the target dropped the network connection before completing the TLS setup.
- Info Updated help URLs.
- Info Updated dependencies.
- New Add support for containerd builds that do not provide versioning metadata.
- Info Updated dependencies.
- Fix Fix a bug that prevented collect_certs from working when run against a container.
- Info Updated dependencies.
- New Add a short argument (-n) for the not_less_than option.
- Info Updated dependencies.
- Fix Fixed an issue affecting Docker CRI on cgroupv2; Gremlin previously failed to roll back network attacks if the target container was killed during the attack.
- New Gremlin now supports OpenShift 4.9+ and CRI-O 1.22+
- Fix Fixed an issue affecting containerd and CRI-O on cgroupv2; Gremlin previously failed to roll back network attacks if the target container was killed during the attack.
- Fix Fixed an issue where Gremlin was not resolving internal hostnames in some instances.
- New Introduce Certificate Expiry test for Reliability Management.
- Info Updated dependencies.
- New Agent interactions with AWS APIs now use IMDSv2.
- Fix Fixed a bug where Gremlin would not properly launch attacks that resolve to a large number of IP addresses / blocks.
- Info All Gremlin container drivers now work with `cgroup2`-enabled kernels.
- Info Updated dependencies.
- Info Process Collection is now automatically enabled. Process Collection gathers information about the processes running on Linux machines where the Gremlin Agent is installed to detect system dependencies. To disable Process Collection, see Disable Process Collection.
- Fix Fixed a bug where Gremlin's dependency discovery features would not work when IPv6 was disabled.
- Fix Fixed a bug where Gremlin would not properly include swap in free memory calculations, leading to incorrect attack results.
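
For readers wondering what including swap in a free memory figure involves, here is a small sketch that reads values in the `/proc/meminfo` format. Which fields Gremlin actually combines is not stated in these notes; summing `MemAvailable` and `SwapFree` is an assumption made for this example, as are the function names and the sample numbers.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo style text into a dict of integer values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[key.strip()] = int(fields[0])
    return values


def free_kib_including_swap(text):
    """Free memory in KiB with unused swap counted as available.

    Assumption for this sketch: free memory is MemAvailable + SwapFree.
    """
    info = parse_meminfo(text)
    return info["MemAvailable"] + info["SwapFree"]


# Sample numbers invented for this example.
sample = """MemTotal:       16384000 kB
MemAvailable:    8192000 kB
SwapTotal:       2048000 kB
SwapFree:        1024000 kB
"""
print(free_kib_including_swap(sample))  # 9216000
```

On a real Linux host the same function could be fed the contents of `/proc/meminfo` instead of the sample string.
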