Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MdcPatternConverter - update output format for log process tool friendly and smaller size #3435

Closed
amosshi opened this issue Feb 4, 2025 · 11 comments
Assignees
Labels
enhancement Additions or updates to features layouts Affects one or more Layout plugins

Comments

@amosshi
Copy link

amosshi commented Feb 4, 2025

Warning!
It is highly recommended to discuss feature requests in the mailing lists first.

Background

When a log string contains key=value format, in modern log search tool, like Splunk or ELK:

  • The key will become a searchable field
  • The value become the value of the corresponding key, and we can do visualization easily

The Problem

The current output for the MdcPatternConverter is in a similar format to the java.util.Hashtable.toString(), like {key1=value1, key2=value2, key3=value3}

If we have the config bellow:

  • %notEmpty{ %MDC}

log4j2 timestamp=%date{ISO8601} log.level=%level{length=5} log.logger=%logger %notEmpty{ %MDC} %highlight{%message%notEmpty{ throwableMessage="%throwable{short.message}"}}%n

The MDC part sample output will look like

{TID=8C6976E5B54104, company=TEST-COMPANY-1, companyName=Test Company One, hashUserId=8C6976E5B, ipAddress=127.0.0.1, sessionId=45F0AB50, request=11}

Where

  • { and } are not useful for the log search tools (Splunk, ELK, or other tools)
  • , are in fact not needed

The Feature Request

We need an option, named like mdcflatformat=true, to indicate MdcPatternConverter will generate a flat format of:

  • key1=value1 key2=value2 key3=value3

And, then the value contains a space ( ) , the value will be double quoted:

  • key1=value1 key2=value2 key3=value3 key4="value 4"

We will config as

  • %notEmpty{ %MDC}{mdcflatformat}
    • Where mdcflatformat is the new flag we introduced

The sample output above will become

TID=8C6976E5B54104 company=TEST-COMPANY-1 companyName="Test Company One" hashUserId=8C6976E5B ipAddress=127.0.0.1 sessionId=45F0AB50 request=11

Benefits

  • The format will be more log search tool friendly
  • The log generated will be smaller, because we removed the characters of {, , and }
    • Small is important, because in a big system, say we have 1 billion transactions per week, every less character in log will save cost

Action Items

When the community accept this feature request, we can provide PR and test case.

@amosshi amosshi changed the title MdcPatternConverter - change output format for logger process tool friendly MdcPatternConverter - update output format for log process tool friendly and smaller size Feb 4, 2025
@ppkarwasz
Copy link
Contributor

ppkarwasz commented Feb 5, 2025

@amosshi,

Background

When a log string contains key=value format, in modern log search tool, like Splunk or ELK:

  • The key will become a searchable field
  • The value become the value of the corresponding key, and we can do visualization easily

Benefits

  • The format will be more log search tool friendly

Sorry, but I do not believe this is a valid concern. The PatternLayout is not a structured layout and most patterns do not give something that can be parsed by the Grok processor, which is basically a regular expression parser. For example:

  • The %d{yyyy-MM-dd'T'HH:mm:ss} %enc{%m}{CRLF}%n%replace{%ex}{\r?\n}{\n } pattern can be parsed unambiguously into a timestamp, message and exception.
  • The %d %t %m%n pattern can not be parsed unambiguously, since you don't know which space character is part of the thread name and which one is part of the message. Besides, if you couple it with the ELK parser, you application will have a CRLF injection vulnerability.

If you want the output of Log4j Core to be machine-readable, use a structured layout such as JSON Template Layout or RFC 5424 Layout and parse the output using the appropriate parser.

Note: the behavior you are proposing might be useful to emulate Logback's %kvp pattern.

@ppkarwasz ppkarwasz added waiting-for-user More information is needed from the user and removed waiting-for-maintainer labels Feb 5, 2025
@vy
Copy link
Member

vy commented Feb 6, 2025

When a log string contains key=value format, in modern log search tool, like Splunk or ELK:

@amosshi, it sounds counter intuitive to me to serialize the entire log event into one big string instead of using individual fields wherever necessary. As a matter of fact, both ELK (via its Elastic Common Schema) and Splunk (in its logging best-practices page) recommend using structured logging, which is best delivered by JSON Template Layout. Would you mind helping us to understand your motivation from deviating these standards? Could using a structured layout instead of Pattern Layout be an option for you?

@amosshi
Copy link
Author

amosshi commented Feb 9, 2025

Hi @vy,

The reason do not want to adopt JSON Template Layout in production yet is

Either work with one of the default JSON layouts or a customized JSON layout , JSON log size will be 20% - 100% bigger than the Pattern Layout depending on which additional fields we added to log and how application code logs.

So bigger log size may translate to millions of additional dollars of Splunk license cost, depending on x/y/z reasons.

@github-actions github-actions bot added waiting-for-maintainer and removed waiting-for-user More information is needed from the user labels Feb 9, 2025
@ppkarwasz
Copy link
Contributor

  • Splunk license cost: splunk is charged by log size in current case.

Looking at your example you implement some sort of tracing:

TID=8C6976E5B54104 company=TEST-COMPANY-1 companyName="Test Company One" hashUserId=8C6976E5B ipAddress=127.0.0.1 sessionId=45F0AB50 request=11

Since most of those properties (company, companyName, etc.) are constant for the whole HTTP request, have you considered adding them as attributes of the span instead of printing them for each log event? This seems like a better strategy to reduce Splunk licensing costs.

@ppkarwasz ppkarwasz added waiting-for-user More information is needed from the user and removed waiting-for-maintainer labels Feb 9, 2025
@amosshi
Copy link
Author

amosshi commented Feb 9, 2025

HI @ppkarwasz,

The term "structured layout" here, means to "key=value" format.

When the log message is in key=value format, Both Splunk and ELK can recognize the keys automatically as a searchable filed. So we can do filter, reporting, dashboard based on the key.

Sample: we may have a Pattern Layout log looks like bellow, and the row bellow can be considered as "Structured" somehow, because Splunk/ELK can extract the keys automatically

timestamp=202501082308311 level=W logger=com.mycomany.MyApp company=abcenterprise message="the user cannot be found" targetUserHash=e0d123e5f316bef7 {dbschema=abcschema, traceid=1234567, contextid=qwertyu, sessionid=asdfghj}

Where

  • {dbschema=abcschema, traceid=1234567, contextid=qwertyu, sessionid=asdfghj} is from the MdcPatternConverter

It will be more perfect if MdcPatternConverter can generate a format like bellow, where the {, ,, } characters are removed:

  • dbschema=abcschema traceid=1234567 contextid=qwertyu sessionid=asdfghj

Benefits

  • It will make Splunk/ELK parse easier
  • It will generate (a little) smaller log - so will save disk/network/cpu, use less energy and make the world greener
  • Smaller log size will also save Splunk license cost - could make application developers get more bonus

@github-actions github-actions bot added waiting-for-maintainer and removed waiting-for-user More information is needed from the user labels Feb 9, 2025
@ppkarwasz
Copy link
Contributor

ppkarwasz commented Feb 9, 2025

The term "structured layout" here, means to "key=value" format.

When the log message is in key=value format, Both Splunk and ELK can recognize the keys automatically as a searchable filed. So we can do filter, reporting, dashboard based on the key.

We use the term "structured layout" for formats that can be deterministically deserialized. The simple key=value used by MdcPatternConverter is not one of them and the way you parse your log files will expose you to log injections:

  • If a user puts user login=root as login, you'll end up with login=user login=root in your logs.
  • If a user puts u a=a b=b c=c..., you'll end up with login=u a=a b=b c=c... which will cause Splunk/ELK to index all those keys and increase the size of their indexes.

If you are looking for a structured layout that resembles the Pattern layout, try the RFC5424 layout. The STRUCTURED-DATA part of the message is similar to what you are looking for, except keys and values are properly escaped:

[mdc@32473 dbschema="abcschema" traceid="1234567" contextid="qwertyu" user="Complex\"entry \\"]

@amosshi
Copy link
Author

amosshi commented Feb 9, 2025

@ppkarwasz Thanks a lot for the feedback

@vy
Copy link
Member

vy commented Feb 10, 2025

@amosshi, note that you can implement your custom MDC converter for Pattern Layout too.

@vy vy added enhancement Additions or updates to features layouts Affects one or more Layout plugins and removed waiting-for-maintainer labels Feb 10, 2025
@vy vy self-assigned this Feb 10, 2025
@amosshi
Copy link
Author

amosshi commented Feb 25, 2025

@vy

  • I will create the custom MDC converter for Pattern Layout in SAP BizX codebase and verify
  • The code will be put back into log4j2 if community accepts

@ppkarwasz
Copy link
Contributor

Let us know, when you make your first release, so we can advertise the plugin in our documentation.

@vy
Copy link
Member

vy commented Feb 27, 2025

Closing this discussion as resolved since Pattern Layout is not intended for structured logging and a solution is provided for the user.

@vy vy closed this as not planned Won't fix, can't repro, duplicate, stale Feb 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Additions or updates to features layouts Affects one or more Layout plugins
Projects
None yet
Development

No branches or pull requests

3 participants