
Remove max_number_of_messages for SQS+S3-based inputs #12101

Open
strawgate opened this issue Dec 13, 2024 · 12 comments · May be fixed by #13199

@strawgate
Contributor

With elastic/beats#40699, the max_number_of_messages setting is ignored for SQS inputs as of 8.15, yet it is still exposed in integration settings. It remains relevant only for agents older than 8.15.

The new setting is number_of_workers, which the SQS-based input shares with the S3 bucket polling input in integration settings.

Unfortunately, the default value for number_of_workers is just 1. Users of an SQS-based S3 input who upgrade to 8.15 agents will find that they've gone from five workers (or whatever value they had customized it to) to just a single worker processing objects.

We should consider one or more of the following:

  1. Using the max_number_of_messages value as the number_of_workers value when queue_url is present, to maintain backwards compatibility, while indicating that the setting will be removed in a future version (see the template sketch below).
  2. Indicating that max_number_of_messages is no longer applicable as of 8.15 and that number_of_workers should be used instead when consuming from an SQS queue.
  3. Bumping the number_of_workers default value to 5.
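
For option 1, a minimal sketch of what the fallback could look like in an integration's aws-s3.yml.hbs template (hypothetical; variable names assume the template sees both settings, and whether templates should encode this at all is part of the question):

{{#if queue_url}}
{{#if number_of_workers}}
number_of_workers: {{ number_of_workers }}
{{else}}
{{#if max_number_of_messages}}
{{! deprecated fallback: preserve the pre-8.15 concurrency intent }}
number_of_workers: {{ max_number_of_messages }}
{{/if}}
{{/if}}
{{/if}}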
@flexitrev
Contributor

@faec What do you think we should do here, since you implemented the SQS changes? I think maintaining backwards compatibility is probably the best solution, but it means incurring a little tech debt.

@faec

faec commented Feb 24, 2025

Unfortunately, the default value for number_of_workers is just 1

I assume you're referring here to the defaults in the integrations? As far as I know, the default in the input itself is 5 workers. Does this mean that integrations still set an explicit default of 1 for number_of_workers, even for SQS queues where the flag was previously unsupported? Or is there a miscommunication somewhere?

I think maintaining backwards compatibility is probably the best solution, but it means incurring a little tech debt.

I don't think there is anything to be gained by trying to recover backwards compatibility by emulating max_number_of_messages or similar. The reason is that the refactor fundamentally changed the meaning of "workers" in a way that makes them no longer analogous to any configuration of the old input. Removing it entirely from the configuration struct was an intentional tradeoff of the design, based on the fact that the concrete behavior of the new worker count has no equivalent in the old model.

In the new model, one "worker" means one actual worker goroutine that continually downloads and processes objects (which is probably how most users intuitively understand "number of workers"), whereas before, it meant a goroutine that would download one object and then halt all activity until the pipeline reported that that object was fully ingested upstream (meaning that "workers" were often idle even when there was still work available).

Because of this, any intentional changes users made to max_number_of_messages were calibrated around a worker model that no longer exists, and they will not get comparable performance with the new code. In particular, it was not uncommon for users with performance issues to set max_number_of_messages to 500+, which was a (reasonably effective) workaround for the limitations of the old worker model, but is never advantageous with the new code and could flood the system with too many parallel downloads.

Since there's no way for us to infer the actual behavior a user was targeting based on the old max_number_of_messages flag, I don't think we should try. I think it could make sense to report a warning if configurations try to use it, but I think we should fully adopt the new default of 5 workers, and make sure messaging in the integration configs focuses on appropriate ranges for that flag.
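
For concreteness, a minimal sketch of a tuned 8.15+ stream configuration under the new model (the queue URL and values are illustrative, not recommendations):

- type: aws-s3
  queue_url: https://sqs.us-east-1.amazonaws.com/123456789012/example-queue
  number_of_workers: 5  # the new default; each worker is a goroutine that continually downloads and processes objects
  # max_number_of_messages: 500  # old-model workaround; ignored as of 8.15, do not carry it forward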

@flexitrev
Contributor

flexitrev commented Feb 27, 2025

Thanks for that detail @faec. That effectively rules out option 1, so option 2 is the best solution.

https://www.elastic.co/guide/en/beats/libbeat/8.17/release-notes-8.16.0.html
Adding something to the info log noting the deprecation of the max_number_of_messages parameter, when it is present in the configuration and the agent version is >=8.16, seems like the best way to correctly set expectations within the right context.

Additionally we should make sure the owners of the relevant integrations are aware of the change. Many relevant labels have been applied to this issue already.

@leandrojmp
Contributor

leandrojmp commented Mar 19, 2025

Hello @strawgate, are there any known issues with the agent ignoring the value of number_of_workers?

I'm having some performance issues, and after some troubleshooting I found that on 8.17.3 I need to use the [S3] Number of Workers setting instead of the [SQS] Maximum Concurrent SQS Messages setting. However, the agent ignores the value of the number of workers setting and just uses the default of 5, which is not enough to drain the queue without generating a delay in ingestion.

This is with Fleet-managed agents.

@strawgate
Contributor Author

strawgate commented Mar 19, 2025

Can you let me know the specific integration you're using?

Can you inspect the running Elastic Agent config (you can also grab it from a diagnostic generated in Fleet) and share the block of configuration that the agent believes it is running?

We may need to move this to the forum or its own issue, as this one is about documentation.

@leandrojmp
Contributor

leandrojmp commented Mar 19, 2025

@strawgate I'm using the Cloudflare Logpush integration, the Firewall Events and HTTP Requests datasets.

I have multiple inputs with the same configuration, only the queue changes; I'm sharing one of them here:

  - id: aws-s3-cloudflare-f56bb89c-97bd-441e-ae4b-071d5296cd1d
    name: cloudflare-REDACT
    revision: 6
    type: aws-s3
    use_output: c20b778c-0220-40f4-a8ca-03b974ada63b
    meta:
      package:
        name: cloudflare_logpush
        version: 1.35.0
    data_stream:
      namespace: cloudflare
    package_policy_id: f56bb89c-97bd-441e-ae4b-071d5296cd1d
    streams:
      - id: >-
          aws-s3-cloudflare_logpush.firewall_event-f56bb89c-97bd-441e-ae4b-071d5296cd1d
        data_stream:
          dataset: cloudflare_logpush.firewall_event
          type: logs
        queue_url: >-
          https://sqs.us-east-1.amazonaws.com/REDACTED
        visibility_timeout: 300s
        api_timeout: 120s
        max_number_of_messages: 70
        file_selectors:
          - regex: fw/
        access_key_id: '${SECRET_3}'
        secret_access_key: '${SECRET_4}'
        tags:
          - collect_sqs_logs
          - cloudflare-firewall
          - cloudflare-fw-REDACT
        processors:
          - drop_event:
              when:
                or:
                  - contains:
                      message: REDACTED
      - id: >-
          aws-s3-cloudflare_logpush.http_request-f56bb89c-97bd-441e-ae4b-071d5296cd1d
        data_stream:
          dataset: cloudflare_logpush.http_request
          type: logs
        queue_url: >-
          https://sqs.us-east-1.amazonaws.com/REDACTED
        visibility_timeout: 300s
        api_timeout: 120s
        max_number_of_messages: 70
        file_selectors:
          - regex: http/
        access_key_id: '${SECRET_3}'
        secret_access_key: '${SECRET_4}'
        tags:
          - collect_sqs_logs
          - cloudflare-http
          - cloudflare-http-REDACT

As you can see, max_number_of_messages is there with the value we set, but not number_of_workers, which is set in the integration.

[screenshot: integration settings showing the configured number of workers]

Inspecting the input template of the integration here, it seems that the number_of_workers setting only gets the value defined by the user if S3 collection of logs is enabled (polling directly from the buckets):

{{#if collect_s3_logs}}
{{! shared S3 bucket polling options }}
{{#if number_of_workers }}
number_of_workers: {{ number_of_workers }}
{{/if}}

Which I assume is populated by this setting in the UI:
[screenshot: the Number of Workers setting in the integration UI]

But since I'm using SQS, I cannot enable this setting, as the UI will complain that the bucket ARN and other settings are missing.

This seems to be an issue in the integration, so we may need to open another one.
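
To illustrate my reading of the template above: with SQS collection, collect_s3_logs is false, so the block is skipped entirely and the rendered stream config ends up like this (illustrative, trimmed):

queue_url: https://sqs.us-east-1.amazonaws.com/REDACTED
max_number_of_messages: 70
# number_of_workers is never rendered, so the input falls back to its built-in default of 5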

In the Elastic Agent logs I have this, with the default value of 5.

[screenshot: Elastic Agent log line showing number_of_workers at the default value of 5]

@leandrojmp
Contributor

@strawgate should I reference the above comment in a new issue?

@strawgate
Contributor Author

@strawgate should I reference the above comment in a new issue?

Yes please. I've already tagged someone internally to follow up: @flexitrev

@leandrojmp
Contributor

leandrojmp commented Mar 19, 2025

@strawgate no problem! I will create the issue now.

edit: issue #13179

@flexitrev
Contributor

flexitrev commented Mar 20, 2025

Here's the list of integrations that have number_of_workers set within an S3 bucket conditional:

  • aws
  • amazon_security_lake
  • aws_bedrock
  • aws_logs
  • canva
  • carbon_black_cloud
  • f5_bigip
  • imperva_cloud_waf
  • jamf_protect
  • sentinel_one_cloud_funnel
  • servicenow
  • sublime_security
  • symantec_endpoint_security
  • tanium
  • trellix_edr_cloud

There is one integration/data_stream that only uses SQS and lacks the workers setting: crowdstrike/fdr.

We need to make a bulk edit to all of these integrations, similar to the changes in #13181:

  • Update aws-s3.yml.hbs to place number_of_workers outside of the S3 conditional (see the template sketch after this list)
  • Update the number_of_workers description to indicate it is applicable to both SQS and S3
  • Add a deprecation notice to max_number_of_messages
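
A minimal sketch of what that template change could look like, based on the conditional quoted earlier in this thread (the exact deprecation wording would be up to each integration's owners):

{{! before: only rendered when S3 bucket polling is enabled }}
{{#if collect_s3_logs}}
{{#if number_of_workers}}
number_of_workers: {{ number_of_workers }}
{{/if}}
{{/if}}

{{! after: shared by the SQS and S3 bucket polling modes }}
{{#if number_of_workers}}
number_of_workers: {{ number_of_workers }}
{{/if}}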

@leandrojmp
Contributor

leandrojmp commented Mar 20, 2025

@flexitrev I think that the main aws package also needs to be adjusted, at least the cloudtrail dataset.

In the cloudtrail dataset the number_of_workers setting is also nested under a conditional.

There are other datasets in the main aws package that have the same issue, like apigateway_logs, waf, vpcflow, and others; they can use either SQS or S3 polling.

@leandrojmp
Contributor

leandrojmp commented Mar 20, 2025

There is one integration/data_stream that only uses SQS and lacks the workers setting: crowdstrike/fdr.

I can make a PR to solve this one.

PR for Crowdstrike FDR: #13196

@flexitrev flexitrev linked a pull request Mar 20, 2025 that will close this issue