[BUG] HttpCollectImpl XML parsing assumes UTF-8 #2852

pjfanning · 2024-12-02T12:27:51Z

Is there an existing issue for this?

I have searched the existing issues

Current Behavior

hertzbeat/hertzbeat-collector/hertzbeat-collector-basic/src/main/java/org/apache/hertzbeat/collector/collect/http/HttpCollectImpl.java

Line 251 in 1fe70a6

    
           Document document = db.parse(new ByteArrayInputStream(resp.getBytes(StandardCharsets.UTF_8)));

If you already have a String, there is no need to convert it to a byte array; the extra copy is a waste of memory.

DocumentBuilder has a parse(InputSource) method.
https://docs.oracle.com/javase/8/docs/api/javax/xml/parsers/DocumentBuilder.html#parse-org.xml.sax.InputSource-

An InputSource can be constructed to wrap a StringReader that wraps the String.
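
A minimal sketch of that approach, assuming a default `DocumentBuilderFactory` configuration. The class name `XmlFromString` and its `parse` helper are hypothetical and are not part of HertzBeat:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration only.
final class XmlFromString {
    static Document parse(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // The Reader hands already-decoded characters to the parser,
        // so no intermediate byte[] copy of the response is created.
        return db.parse(new InputSource(new StringReader(xml)));
    }
}
```

In `HttpCollectImpl` this would presumably amount to replacing the `ByteArrayInputStream` argument with `new InputSource(new StringReader(resp))`.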

Expected Behavior

Don't convert Strings to byte arrays unnecessarily: it wastes memory and can cause parse issues. Imagine the XML has an XML declaration whose encoding is not UTF-8. If you pass the String to the parser as a character stream, the parser ignores the declared encoding. If you convert to a byte array, the parser uses the declared encoding, but the code has explicitly encoded the bytes as UTF-8, so the two encodings may not match.
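
The mismatch can be reproduced with a short sketch. It assumes the JDK's built-in parser and a response whose declaration says `ISO-8859-1`; the class name `EncodingMismatchDemo` and its method are hypothetical:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class for illustration only.
final class EncodingMismatchDemo {
    // Mirrors the pattern in the issue: the String is re-encoded as UTF-8,
    // but the parser decodes the bytes using the encoding named in the
    // XML declaration.
    static String rootTextViaUtf8Bytes(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        Document document = db.parse(new ByteArrayInputStream(bytes));
        return document.getDocumentElement().getTextContent();
    }
}
```

For an input such as `<?xml version="1.0" encoding="ISO-8859-1"?><root>café</root>`, the two UTF-8 bytes of `é` are each decoded as a separate ISO-8859-1 character, so the returned text is `cafÃ©` rather than `café`.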

Steps To Reproduce

No response

Environment

HertzBeat version(s): latest

Debug logs

No response

Anything else?

No response

pjfanning · 2024-12-02T12:32:54Z

The underlying issue is more that you convert to a String in the first place.

hertzbeat/hertzbeat-collector/hertzbeat-collector-basic/src/main/java/org/apache/hertzbeat/collector/collect/http/HttpCollectImpl.java

Lines 135 to 139 in 1fe70a6

    
           // todo This code converts an InputStream directly to a String. For large data in Prometheus exporters, 
           // this could create large objects, potentially impacting JVM memory space significantly. 
           // Option 1: Parse using InputStream, but this requires significant code changes; 
           // Option 2: Manually trigger garbage collection, similar to how it's done in Dubbo for large inputs. 
           String resp = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);

Using an InputStream or a byte array should be more efficient than a String. It would definitely not be worse.
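
A rough sketch of the stream-based alternative, assuming the response body is available as an `InputStream` (with Apache HttpClient that would typically come from `response.getEntity().getContent()`). The class name `XmlFromStream` is hypothetical, and the sketch does not take any charset from the HTTP `Content-Type` header into account:

```java
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration only.
final class XmlFromStream {
    static Document parse(InputStream body)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Given raw bytes, the parser detects the encoding itself
        // (byte order mark or XML declaration), so the body is never
        // materialized as an intermediate String.
        return db.parse(body);
    }
}
```

Because the bytes reach the parser untouched, a non-UTF-8 declaration is honored instead of being contradicted by a hard-coded charset.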

pjfanning added the bug label Dec 2, 2024

github-project-automation bot moved this to To do in Apache HertzBeat (Incubating) Dec 2, 2024

github-project-automation bot added this to Apache HertzBeat (Incubating) and hertzbeat-v1.0 Dec 2, 2024
