
Remove max_number_of_messages for SQS+S3-based inputs #12101

Open
strawgate opened this issue Dec 13, 2024 · 12 comments · May be fixed by #13199

@strawgate
Contributor

With elastic/beats#40699, the max_number_of_messages setting is ignored for SQS inputs as of 8.15, yet it is still exposed in integration settings. It remains relevant only for agents older than 8.15.

The new setting is number_of_workers, which the SQS-based input shares with the S3 bucket polling input in integration settings.

Unfortunately, the default value for number_of_workers is just 1. Users of an SQS-based S3 input who upgrade to 8.15 agents will find that they've gone from five workers (or whatever value they had customized it to) to just a single worker processing objects.

We should consider one or more of the following:

  1. Using the max_number_of_messages value as the number_of_workers value when queue_url is present, to maintain backwards compatibility, while indicating that the setting will be removed in a future version (see the template sketch below).
  2. Indicating that max_number_of_messages is no longer applicable as of 8.15 and that number_of_workers should be used instead when consuming from an SQS queue.
  3. Bumping the number_of_workers default value to 5.
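
For option 1, a minimal sketch of what the fallback could look like in an integration's aws-s3.yml.hbs template (hypothetical; variable names assume the template sees both settings, and whether templates should encode this at all is part of the question):

{{#if queue_url}}
{{#if number_of_workers}}
number_of_workers: {{ number_of_workers }}
{{else}}
{{#if max_number_of_messages}}
{{! deprecated fallback: preserve the pre-8.15 concurrency intent }}
number_of_workers: {{ max_number_of_messages }}
{{/if}}
{{/if}}
{{/if}}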
@flexitrev
Contributor

@faec What do you think we should do here, since you implemented the SQS changes? I think maintaining backwards compatibility is probably the best solution, but it means incurring a little tech debt.

@faec

faec commented Feb 24, 2025

Unfortunately, the default value for number_of_workers is just 1

I assume you're referring here to the defaults in the integrations? As far as I know, the default in the input itself is 5 workers. Does this mean that integrations still set an explicit default of 1 for number_of_workers, even for SQS queues where the flag was previously unsupported? Or is there a miscommunication somewhere?

I think maintaining backwards compatibility is probably the best solution, but it means incurring a little tech debt.

I don't think there is anything to be gained by trying to recover backwards compatibility by emulating max_number_of_messages or similar. The reason is that the refactor fundamentally changed the meaning of "workers" in a way that makes them no longer analogous to any configuration of the old input. Removing it entirely from the configuration struct was an intentional tradeoff of the design, based on the fact that the concrete behavior of the new worker count has no equivalent in the old model.

In the new model, one "worker" means one actual worker goroutine that continually downloads and processes objects (which is probably how most users intuitively understand "number of workers"), whereas before, it meant a goroutine that would download one object and then halt all activity until the pipeline reported that that object was fully ingested upstream (meaning that "workers" were often idle even when there was still work available).

Because of this, any intentional changes users made to max_number_of_messages were calibrated around a worker model that no longer exists, and they will not get comparable performance with the new code. In particular, it was not uncommon for users with performance issues to set max_number_of_messages to 500+, which was a (reasonably effective) workaround for the limitations of the old worker model, but is never advantageous with the new code and could flood the system with too many parallel downloads.

Since there's no way for us to infer the actual behavior a user was targeting based on the old max_number_of_messages flag, I don't think we should try. I think it could make sense to report a warning if configurations try to use it, but I think we should fully adopt the new default of 5 workers, and make sure messaging in the integration configs focuses on appropriate ranges for that flag.
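
For concreteness, a minimal sketch of a tuned 8.15+ stream configuration under the new model (the queue URL and values are illustrative, not recommendations):

- type: aws-s3
  queue_url: https://sqs.us-east-1.amazonaws.com/123456789012/example-queue
  number_of_workers: 5  # the new default; each worker is a goroutine that continually downloads and processes objects
  # max_number_of_messages: 500  # old-model workaround; ignored as of 8.15, do not carry it forward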

@flexitrev
Contributor

flexitrev commented Feb 27, 2025

Thanks for that detail @faec. That effectively rules out option 1, so option 2 is the best solution.

https://www.elastic.co/guide/en/beats/libbeat/8.17/release-notes-8.16.0.html
Adding something to the info log noting the deprecation of the max_number_of_messages parameter, when it is present in the configuration and the agent version is >=8.16, seems like the best way to correctly set expectations within the right context.

Additionally we should make sure the owners of the relevant integrations are aware of the change. Many relevant labels have been applied to this issue already.

@leandrojmp
Contributor

leandrojmp commented Mar 19, 2025

Hello @strawgate, are there any known issues with the agent ignoring the value of number_of_workers?

I'm having some performance issues, and after some troubleshooting I found that on 8.17.3 I need to use the [S3] Number of Workers setting instead of the [SQS] Maximum Concurrent SQS Messages setting. However, the agent ignores the value of the number of workers setting and just uses the default of 5, which is not enough to drain the queue without generating a delay in ingestion.

This is with Fleet-managed agents.

@strawgate
Contributor Author

strawgate commented Mar 19, 2025

Can you let me know the specific integration you're using?

Can you inspect the running Elastic Agent config (you can also grab it from a diagnostic generated in Fleet) and share the block of configuration that the agent believes it is running?

We may need to move this to the forum or its own issue, as this one is about documentation.

@leandrojmp
Contributor

leandrojmp commented Mar 19, 2025

@strawgate I'm using the Cloudflare Logpush integration, the Firewall Events and HTTP Requests datasets.

I have multiple inputs with the same configuration, only the queue changes; I'm sharing one of them here:

  - id: aws-s3-cloudflare-f56bb89c-97bd-441e-ae4b-071d5296cd1d
    name: cloudflare-REDACT
    revision: 6
    type: aws-s3
    use_output: c20b778c-0220-40f4-a8ca-03b974ada63b
    meta:
      package:
        name: cloudflare_logpush
        version: 1.35.0
    data_stream:
      namespace: cloudflare
    package_policy_id: f56bb89c-97bd-441e-ae4b-071d5296cd1d
    streams:
      - id: >-
          aws-s3-cloudflare_logpush.firewall_event-f56bb89c-97bd-441e-ae4b-071d5296cd1d
        data_stream:
          dataset: cloudflare_logpush.firewall_event
          type: logs
        queue_url: >-
          https://sqs.us-east-1.amazonaws.com/REDACTED
        visibility_timeout: 300s
        api_timeout: 120s
        max_number_of_messages: 70
        file_selectors:
          - regex: fw/
        access_key_id: '${SECRET_3}'
        secret_access_key: '${SECRET_4}'
        tags:
          - collect_sqs_logs
          - cloudflare-firewall
          - cloudflare-fw-REDACT
        processors:
          - drop_event:
              when:
                or:
                  - contains:
                      message: REDACTED
      - id: >-
          aws-s3-cloudflare_logpush.http_request-f56bb89c-97bd-441e-ae4b-071d5296cd1d
        data_stream:
          dataset: cloudflare_logpush.http_request
          type: logs
        queue_url: >-
          https://sqs.us-east-1.amazonaws.com/REDACTED
        visibility_timeout: 300s
        api_timeout: 120s
        max_number_of_messages: 70
        file_selectors:
          - regex: http/
        access_key_id: '${SECRET_3}'
        secret_access_key: '${SECRET_4}'
        tags:
          - collect_sqs_logs
          - cloudflare-http
          - cloudflare-http-REDACT

As you can see, max_number_of_messages is there with the value we set, but not number_of_workers, which is set in the integration.

[screenshot: integration settings showing the configured number of workers]

Inspecting the input template of the integration here, it seems that the number_of_workers setting only gets the value defined by the user if S3 collection of logs is enabled (polling directly from the buckets):

{{#if collect_s3_logs}}
{{! shared S3 bucket polling options }}
{{#if number_of_workers }}
number_of_workers: {{ number_of_workers }}
{{/if}}

Which I assume is populated by this setting in the UI:
[screenshot: the Number of Workers setting in the integration UI]

But since I'm using SQS, I cannot enable this setting, as the UI will complain that the bucket ARN and other settings are missing.

This seems to be an issue in the integration, so we may need to open another one.
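
To illustrate my reading of the template above: with SQS collection, collect_s3_logs is false, so the block is skipped entirely and the rendered stream config ends up like this (illustrative, trimmed):

queue_url: https://sqs.us-east-1.amazonaws.com/REDACTED
max_number_of_messages: 70
# number_of_workers is never rendered, so the input falls back to its built-in default of 5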

In the Elastic Agent logs I have this, with the default value of 5.

[screenshot: Elastic Agent log line showing number_of_workers at the default value of 5]

@leandrojmp
Contributor

@strawgate should I reference the above comment in a new issue?

@strawgate
Contributor Author

@strawgate should I reference the above comment in a new issue?

Yes please. I've already tagged someone internally to follow up: @flexitrev

@leandrojmp
Contributor

leandrojmp commented Mar 19, 2025

@strawgate no problem! I will create the issue now.

edit: issue #13179

@flexitrev
Contributor

flexitrev commented Mar 20, 2025

Here's the list of integrations that have number_of_workers set within an S3 bucket conditional:

  • aws
  • amazon_security_lake
  • aws_bedrock
  • aws_logs
  • canva
  • carbon_black_cloud
  • f5_bigip
  • imperva_cloud_waf
  • jamf_protect
  • sentinel_one_cloud_funnel
  • servicenow
  • sublime_security
  • symantec_endpoint_security
  • tanium
  • trellix_edr_cloud

There is one integration/data_stream that only uses SQS and lacks the workers setting: crowdstrike/fdr.

We need to make a bulk edit to all of these integrations, similar to the changes in #13181:

  • Update aws-s3.yml.hbs to place number_of_workers outside of the S3 conditional (see the template sketch after this list)
  • Update the number_of_workers description to indicate it is applicable to both SQS and S3
  • Add a deprecation notice to max_number_of_messages
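
A minimal sketch of what that template change could look like, based on the conditional quoted earlier in this thread (the exact deprecation wording would be up to each integration's owners):

{{! before: only rendered when S3 bucket polling is enabled }}
{{#if collect_s3_logs}}
{{#if number_of_workers}}
number_of_workers: {{ number_of_workers }}
{{/if}}
{{/if}}

{{! after: shared by the SQS and S3 bucket polling modes }}
{{#if number_of_workers}}
number_of_workers: {{ number_of_workers }}
{{/if}}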

@leandrojmp
Contributor

leandrojmp commented Mar 20, 2025

@flexitrev I think that the main aws package also needs to be adjusted, at least the cloudtrail dataset.

In the cloudtrail dataset the number_of_workers setting is also nested under a conditional.

There are other datasets in the main aws package that have the same issue, like apigateway_logs, waf, vpcflow, and others; they can use either SQS or S3 polling.

@leandrojmp
Contributor

leandrojmp commented Mar 20, 2025

There is one integration/data_stream that only uses SQS and lacks the workers setting: crowdstrike/fdr.

I can make a PR to solve this one.

PR for Crowdstrike FDR: #13196

@flexitrev flexitrev linked a pull request Mar 20, 2025 that will close this issue