E2E integration tests are flaky #25423

hiltontj · 2024-10-03T00:08:38Z

Some of the integration tests fail intermittently for strange reasons. This may be due to how a port is selected for the influxdb3 serve binary that the test harness spins up.

There is a function used to select a random available port:

influxdb/influxdb3/tests/server/main.rs

Lines 305 to 319 in 7d37bbb

    /// Get an available bind address on localhost
    ///
    /// This binds a [`TcpListener`] to 127.0.0.1:0, which will randomly
    /// select an available port, and produces the resulting local address.
    /// The [`TcpListener`] is dropped at the end of the function, thus
    /// freeing the port for use by the caller.
    fn get_local_bind_addr() -> SocketAddr {
        let ip = std::net::Ipv4Addr::new(127, 0, 0, 1);
        let port = 0;
        let addr = SocketAddrV4::new(ip, port);
        TcpListener::bind(addr)
            .expect("bind to a socket address")
            .local_addr()
            .expect("get local address")
    }

However, the TcpListener is dropped before the address is passed in to spawn the server. It has to be, otherwise the server would not be able to bind that address and would fail to start. This leaves a window in which another process or integration test could take over that port before the binary is started here:

influxdb/influxdb3/tests/server/main.rs

Line 136 in 7d37bbb

    let server_process = command.spawn().expect("spawn the influxdb3 server process");
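The window can be illustrated with the standard library alone. The following is a minimal sketch, not code from the test harness: the function names are hypothetical, and `bind_after_release` stands in for "another process or integration test" that grabs the port between the drop and the spawn.

```rust
use std::io;
use std::net::{SocketAddr, TcpListener};

/// Same approach as `get_local_bind_addr`: bind to port 0, read the assigned
/// address, and drop the listener when the expression finishes.
fn pick_then_release() -> SocketAddr {
    TcpListener::bind("127.0.0.1:0")
        .expect("bind to a socket address")
        .local_addr()
        .expect("get local address")
}

/// While `held` is alive, a second bind to the same address is attempted.
/// `held` is only dropped after the second bind has returned.
fn bind_while_held() -> io::Result<TcpListener> {
    let held = TcpListener::bind("127.0.0.1:0")?;
    let addr = held.local_addr()?;
    TcpListener::bind(addr)
}

/// Hypothetical competitor: binds the address after it has been released,
/// which is what the spawned server also has to do.
fn bind_after_release() -> io::Result<TcpListener> {
    let addr = pick_then_release();
    TcpListener::bind(addr)
}
```

On common platforms `bind_while_held()` is expected to return an error and `bind_after_release()` is expected to succeed. Nothing reserves the port once the listener is gone, so whoever binds first wins.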

Here are some examples of failures that seem rather odd:


hiltontj · 2024-10-03T12:46:35Z

One option would be to forgo spawning the influxdb3 serve command as a separate binary, and instead call the code that runs the service directly, as is done in this function:

influxdb/influxdb3/src/commands/serve.rs

Line 320 in 7d37bbb

    pub async fn command(config: Config) -> Result<()> {

This would require some refactoring to make sure that the test harness is starting things exactly as is done for the actual running binary, but would allow us to pass in a bound TcpListener/SocketAddr directly, and not have the issue described above.
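As a rough sketch of that shape, using only std threads and blocking sockets rather than the real async server: `serve_one` is a hypothetical stand-in for the service entry point, and `start_test_server` and `request` are hypothetical harness helpers. The point is only that the listener is never dropped between choosing the port and serving on it.

```rust
use std::io::{self, Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread;

/// Hypothetical stand-in for the server: it receives an already-bound
/// listener, answers one connection with "ok", and then returns.
fn serve_one(listener: TcpListener) -> io::Result<()> {
    let (mut stream, _peer) = listener.accept()?;
    stream.write_all(b"ok")?;
    Ok(()) // the stream is dropped here, which closes the connection
}

/// Hypothetical harness helper: bind to port 0, keep the listener alive,
/// and hand it to the server running on a background thread.
fn start_test_server() -> (SocketAddr, thread::JoinHandle<io::Result<()>>) {
    let listener = TcpListener::bind("127.0.0.1:0").expect("bind to a socket address");
    let addr = listener.local_addr().expect("get local address");
    let handle = thread::spawn(move || serve_one(listener));
    (addr, handle)
}

/// Connect to the server and read its whole response.
fn request(addr: SocketAddr) -> io::Result<String> {
    let mut stream = TcpStream::connect(addr)?;
    let mut body = String::new();
    stream.read_to_string(&mut body)?;
    Ok(body)
}
```

Because the socket is already listening when `start_test_server` returns, a client can connect right away without polling for the server to come up.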

One problem I see with this: because we generate IDs for, e.g., databases and tables using static atomics, multiple test harnesses running in a single test could clash over IDs.
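A simplified illustration of the concern, with all names hypothetical and not taken from the influxdb3 code: a static atomic is shared by everything in the process, so two in-process servers draw from one counter instead of each starting from its own.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical process-wide ID source, standing in for the static atomics
/// mentioned above.
static NEXT_DB_ID: AtomicU32 = AtomicU32::new(0);

/// Hand out the next ID from the shared counter.
fn next_db_id() -> u32 {
    NEXT_DB_ID.fetch_add(1, Ordering::SeqCst)
}

/// Hypothetical in-process server that creates one database on startup.
struct InProcessServer {
    first_db_id: u32,
}

impl InProcessServer {
    fn start() -> Self {
        Self { first_db_id: next_db_id() }
    }
}
```

Starting two `InProcessServer` values in the same test gives the second one a different first ID than the first, whereas two separately spawned binaries would each begin from the same initial value. How exactly this would surface in the real ID handling is not shown here.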

hiltontj · 2024-10-03T12:58:06Z

Another option would be to have the binary log the port it is listening on, and scrape it from STDOUT in the test harness code.
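A minimal sketch of the scraping side, assuming the server were started on port 0 and printed a line containing `address=` followed by the socket address. That marker and the helper names are assumptions made for illustration, not the actual influxdb3 log format.

```rust
use std::io::BufRead;
use std::net::SocketAddr;

/// Hypothetical marker that precedes the address in the log line.
const MARKER: &str = "address=";

/// Extract the socket address from a single log line, if it has one.
fn parse_listen_addr(line: &str) -> Option<SocketAddr> {
    let (_, rest) = line.split_once(MARKER)?;
    rest.split_whitespace().next()?.parse().ok()
}

/// Read lines until one carries a parseable address. Stops at the first
/// read error or at end of input.
fn scrape_listen_addr<R: BufRead>(reader: R) -> Option<SocketAddr> {
    reader
        .lines()
        .map_while(Result::ok)
        .find_map(|line| parse_listen_addr(&line))
}
```

In the harness, the reader would be a `std::io::BufReader` wrapped around the child's stdout, with the command configured using `Stdio::piped()` before it is spawned. The harness would need to keep draining the pipe afterwards so the server does not block on a full buffer.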

hiltontj · 2024-10-03T13:00:09Z

Another approach would be to add an option to the influxdb3 serve command to write its port to a file, or to notify some other service of the port it is listening on, and then gather that info from the test harness code after it has spawned the command.
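The harness side of the file variant could look roughly like this. It is a sketch under the assumption that the server writes its port as plain decimal text to a path the harness chose. No such option exists in the text above, and `wait_for_port_file` is a hypothetical helper.

```rust
use std::fs;
use std::path::Path;
use std::thread;
use std::time::{Duration, Instant};

/// Poll `path` until it holds a valid port number or `timeout` elapses.
/// A missing file or partially written contents are simply retried.
fn wait_for_port_file(path: &Path, timeout: Duration) -> Option<u16> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Ok(contents) = fs::read_to_string(path) {
            if let Ok(port) = contents.trim().parse::<u16>() {
                return Some(port);
            }
        }
        if Instant::now() >= deadline {
            return None;
        }
        thread::sleep(Duration::from_millis(20));
    }
}
```

Each test would need its own file path, for example one inside a per-test temporary directory, so that parallel tests do not read each other's ports.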

pauldix · 2024-10-03T13:29:46Z

I think I like the log option

hiltontj added the v3 label Oct 3, 2024
