
Consistency check keeps removing and inserting flows on a switch during the first N runs #124

Closed
italovalcy opened this issue Apr 22, 2021 · 6 comments · Fixed by #125

@italovalcy
Contributor

Hi,

The consistency check keeps inserting and removing flows on a switch, even in the simplest scenario: one switch, with only the OF_LLDP flow. This happens during the first N runs (in my tests, the first 5), after which it becomes "stable."

How to reproduce:

  1. Have kytos running:

docker pull kytos/nightly:latest
docker run -d --name kytos --privileged kytos/nightly:latest /usr/bin/tail -f /dev/null
docker exec -it kytos bash
for cada in storehouse of_core flow_manager topology of_lldp pathfinder mef_eline maintenance; do python3 -m pip install -e git+https://github.com/kytos/$cada#egg=kytos-$cada; done
sed -i 's/CONSISTENCY_INTERVAL = 60/CONSISTENCY_INTERVAL = 3/g' /var/lib/kytos/napps/kytos/flow_manager/settings.py
kytosd -E

  2. Run a Mininet single topology:

mn --topo single,2 --controller=remote,ip=127.0.0.1

  3. Check the duration of the flows on the Mininet console:

sh while true; do ovs-ofctl dump-flows s1; sleep 1; done

Expected output: the "duration" of the LLDP flows should only ever increase.

Actual output: the flows are recreated during the first N runs and only then get "stable", i.e., the duration starts increasing:

mininet> sh while true; do ovs-ofctl dump-flows s1; sleep 1; done
 cookie=0x0, duration=0.053s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=1.061s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=2.069s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=0.069s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=1.074s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=2.080s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=0.077s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=1.083s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=2.091s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=0.093s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=1.100s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=2.105s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=0.100s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=1.105s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=2.111s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=0.103s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=1.116s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=2.127s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=3.140s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=4.151s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=5.162s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=6.174s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=7.186s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=8.197s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=9.209s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535

From Kytos logs:

2021-04-22 13:04:48,985 - INFO [kytos.napps.kytos/flow_manager] (flow_manager) Flows loaded.
2021-04-22 13:04:48,989 - INFO [kytos.napps.kytos/flow_manager] (flow_manager) A consistency problem was detected in switch 00:00:00:00:00:00:00:01.
2021-04-22 13:04:49,000 - INFO [kytos.napps.kytos/flow_manager] (flow_manager) Flow forwarded to switch 00:00:00:00:00:00:00:01 to be installed.
2021-04-22 13:04:52,007 - INFO [kytos.napps.kytos/flow_manager] (flow_manager) Flows loaded.
2021-04-22 13:04:52,011 - INFO [kytos.napps.kytos/flow_manager] (flow_manager) A consistency problem was detected in switch 00:00:00:00:00:00:00:01.
2021-04-22 13:04:52,022 - INFO [kytos.napps.kytos/flow_manager] (flow_manager) Flow forwarded to switch 00:00:00:00:00:00:00:01 to be installed.
2021-04-22 13:04:55,028 - INFO [kytos.napps.kytos/flow_manager] (flow_manager) Flows loaded.

This behavior is incorrect: it means the network traffic will be unstable (flapping) every time the controller is restarted.

@italovalcy
Contributor Author

Hi folks, just some additional information:

@cmagnobarbosa
Member

Hi @italovalcy, I'll check why that happens.

@cmagnobarbosa cmagnobarbosa self-assigned this Apr 22, 2021
@italovalcy
Contributor Author

Hi @cmagnobarbosa and folks,

I think I narrowed down why this is happening. The consistency routine, as implemented today, depends on the FLOW_STATS reply to check the stored flows against the switch's flow table. FLOW_STATS is driven by of_core and has nothing to do with the flow_manager consistency check, which means the two can run at totally different frequencies. Suppose flow_stats runs less frequently than the consistency check: the consistency check then compares the stored flows against an empty array for the switch's flow table and sends a FLOW_MOD to install the apparently missing flows (they are not actually missing; of_core simply had not seen them yet at that point).
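To make the race concrete, here is a minimal, self-contained Python sketch; the intervals, flow contents, and helper are illustrative, not the real NApp code:

```python
# Illustrative sketch of the race (not the real NApp code): of_core polls
# stats every STATS_INTERVAL, while flow_manager checks consistency every
# CONSISTENCY_INTERVAL against of_core's last-known view of the switch.
STATS_INTERVAL = 60        # of_core.settings.STATS_INTERVAL
CONSISTENCY_INTERVAL = 3   # flow_manager.settings.CONSISTENCY_INTERVAL

switch_flows = []  # of_core's view: empty until the first FLOW_STATS reply
stored_flows = [{"priority": 1000, "match": {"dl_vlan": 3799}}]  # LLDP flow

def missing_flows(stored, seen):
    """Flows the consistency check would (wrongly) try to re-install."""
    return [f for f in stored if f not in seen]

# Every consistency run before the first stats reply at t=60 sees an empty
# flow table and re-sends a FLOW_MOD for the "missing" LLDP flow.
spurious_runs = sum(
    1 for t in range(0, STATS_INTERVAL, CONSISTENCY_INTERVAL)
    if missing_flows(stored_flows, switch_flows)
)
print(spurious_runs)  # 20 spurious re-installs before the first stats arrive
```

With the settings used in the reproduction above (CONSISTENCY_INTERVAL lowered to 3), twenty consistency runs fire before the first stats reply ever arrives.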

In fact, the FLOW_STATS is run according to the setting of_core.settings.STATS_INTERVAL and consistency check is run based on flow_manager.settings.CONSISTENCY_INTERVAL.

One way to work around this issue is to set of_core.settings.STATS_INTERVAL to something lower than CONSISTENCY_INTERVAL (keeping in mind that the consistency check could still run right after or right before flow_stats)...
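Another defensive workaround would be to skip the check while no fresh stats exist. A minimal sketch, assuming a hypothetical `last_stats_at` timestamp that flow_manager does not track today:

```python
import time

STATS_INTERVAL = 60  # of_core.settings.STATS_INTERVAL

def check_switch_consistency(stored_flows, switch_flows, last_stats_at):
    """Sketch of a staleness guard (hypothetical, not the NApp's real code).

    `last_stats_at` is an assumed timestamp of the last FLOW_STATS reply;
    returns the flows that genuinely look missing from the switch.
    """
    if last_stats_at is None:
        return []  # no FLOW_STATS reply yet: comparing would be meaningless
    if time.time() - last_stats_at > 2 * STATS_INTERVAL:
        return []  # stats are stale; wait for the next poll instead
    return [f for f in stored_flows if f not in switch_flows]
```

The guard trades a slower reaction for never acting on a view of the switch that of_core has not actually populated yet.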

In terms of a proposal for this issue, my suggestions would be:

  • Instead of the flow_manager consistency check having its own frequency and execution loop, it should be notified by of_core whenever new flow_stats are available, and only then check consistency. Another option would be for flow_manager to request the flow_stats itself instead of of_core (I prefer the first, because I suppose of_core needs that information as well). Having an event in of_core to notify whoever is interested in fresh flow stats could benefit other NApps too: they could run asynchronously instead of polling of_core for the most recent flow statistics.
  • As soon as the switch connects (on kytos/of_core.handshake.completed), flow_manager should request a flow_stats update and run the consistency check over it.
  • Remove the setting STATS_INTERVAL from flow_manager, since it is not used in any way.
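The first bullet can be sketched with a toy publish/subscribe bus; the bus class and event name below are illustrative stand-ins for Kytos's real event system:

```python
# Toy sketch of the event-driven proposal (not Kytos's actual event API):
# of_core publishes a "fresh flow stats" event, flow_manager reacts to it,
# so the consistency check only ever runs against a just-received snapshot.
from collections import defaultdict

class EventBus:
    def __init__(self):
        self.listeners = defaultdict(list)

    def subscribe(self, name, handler):
        self.listeners[name].append(handler)

    def publish(self, name, **content):
        for handler in self.listeners[name]:
            handler(**content)

bus = EventBus()
installed = []  # FLOW_MODs the consistency check would send

def on_flow_stats(dpid, flows, stored):
    """flow_manager side: check consistency only on fresh stats."""
    for flow in stored:
        if flow not in flows:
            installed.append((dpid, flow))  # would forward a FLOW_MOD

bus.subscribe("kytos/of_core.flow_stats.received", on_flow_stats)

# of_core side: after parsing a FLOW_STATS reply, notify listeners.
bus.publish("kytos/of_core.flow_stats.received",
            dpid="00:00:00:00:00:00:00:01",
            flows=[{"dl_vlan": 3799}],
            stored=[{"dl_vlan": 3799}, {"dl_vlan": 101}])
print(installed)  # only the genuinely missing flow is re-installed
```

Because the check is driven by the stats event itself, the "compare against an empty table" window from the polling design disappears by construction.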

@cmagnobarbosa
Member

Hello @italovalcy, I agree with you. I'll implement a new event in the of_core NApp (kytos/of_core#122) to notify interested NApps when new flow statistics are available, making it possible to improve and fix the consistency check.

@italovalcy
Contributor Author

Hi folks,

Another issue is the way flow_manager persists the flows: it keeps a list of flows together with the flow_mod command (add, delete, delete_strict). This is dangerous because flow_manager then has to guarantee the order in which the commands are run, and it looks like that is not the case (see below).

I've added some debug messages to understand how the consistency check works:

diff --git a/main.py b/main.py
index c737f85..ab86ff4 100644
--- a/main.py
+++ b/main.py
@@ -88,6 +98,8 @@ class Main(KytosNApp):
         """Check consistency of installed flows for a specific switch."""
         dpid = switch.dpid

+        log.info("ITALO-DEBUG check_switch_consistency dpid=%s stored_flows=%s switch_flowtable=%s" % (dpid, self.stored_flows.get(dpid, None), switch.flows))
+
         # Flows stored in storehouse
         stored_flows = self.stored_flows[dpid]['flow_list']
@@ -117,6 +129,7 @@ class Main(KytosNApp):
         """Check consistency of installed flows for a specific switch."""
         dpid = switch.dpid

+        log.info("ITALO-DEBUG check_storehouse_consistency dpid=%s stored_flows=%s switch_flowtable=%s" % (dpid, self.stored_flows.get(dpid, None), switch.flows))
         for installed_flow in switch.flows:
             if dpid not in self.stored_flows:
                 log.info('A consistency problem was detected in '

Then one of the messages shows the following information:

2021-04-23 17:13:12,159 - INFO [kytos.napps.kytos/flow_manager] (Thread-178) ITALO-DEBUG check_switch_consistency dpid=00:00:00:00:00:00:00:01 stored_flows={'flow_list': [{'command': 'add', 'flow': {'priority': 1000, 'table_id': 0, 'match': {'dl_type': 35020, 'dl_vlan': 3799}, 'actions': [{'action_type': 'output', 'port': 4294967293}]}}, {'command': 'delete', 'flow': {'cookie': 12280154957751673672, 'cookie_mask': 18446744073709551615}}, {'command': 'add', 'flow': {'match': {'in_port': 1, 'dl_vlan': 101}, 'cookie': 12280154957751673672, 'actions': [{'action_type': 'set_vlan', 'vlan_id': 101}, {'action_type': 'output', 'port': 2}]}}, {'command': 'add', 'flow': {'match': {'in_port': 2, 'dl_vlan': 101}, 'cookie': 12280154957751673672, 'actions': [{'action_type': 'set_vlan', 'vlan_id': 101}, {'action_type': 'output', 'port': 1}]}}]} switch_flowtable=[<napps.kytos.of_core.v0x04.flow.Flow object at 0x7fb51427d438>]

As you can see, "stored_flows" holds the command list [add, delete, add, add]. Those commands/flows were created by the following sequence of events: start kytos (which installs the LLDP flow), create an intra-switch EVC (the two add flows), and then disable the EVC. The order in stored_flows does not match the sequence of events. Indeed, after disabling the EVC, the flows are recreated:

mininet> sh while true; do ovs-ofctl dump-flows s1; echo; sleep 1; done
...
 cookie=0xaa6bdfdefb8a6348, duration=47.632s, table=0, n_packets=0, n_bytes=0, in_port="s1-eth1",dl_vlan=101 actions=mod_vlan_vid:101,output:"s1-eth2"
 cookie=0xaa6bdfdefb8a6348, duration=47.619s, table=0, n_packets=0, n_bytes=0, in_port="s1-eth2",dl_vlan=101 actions=mod_vlan_vid:101,output:"s1-eth1"
 cookie=0x0, duration=81.748s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535

 cookie=0x0, duration=82.752s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535

 cookie=0xaa6bdfdefb8a6348, duration=0.702s, table=0, n_packets=0, n_bytes=0, in_port="s1-eth1",dl_vlan=101 actions=mod_vlan_vid:101,output:"s1-eth2"
 cookie=0xaa6bdfdefb8a6348, duration=0.688s, table=0, n_packets=0, n_bytes=0, in_port="s1-eth2",dl_vlan=101 actions=mod_vlan_vid:101,output:"s1-eth1"
 cookie=0x0, duration=83.756s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
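The ordering hazard can be sketched as follows: reducing the stored command list to a final flow set depends entirely on replay order. The cookies and matches below are illustrative, not the NApp's real data:

```python
# Sketch of why replay order of the persisted [add, delete, add, add]
# command list matters (illustrative data, not flow_manager's real code).
def reduce_commands(commands):
    """Apply add/delete commands in sequence, yielding the final flow set."""
    flows = []
    for entry in commands:
        if entry["command"] == "add":
            flows.append(entry["flow"])
        elif entry["command"] == "delete":
            cookie = entry["flow"].get("cookie")
            flows = [f for f in flows if f.get("cookie") != cookie]
    return flows

stored = [
    {"command": "add", "flow": {"match": {"dl_vlan": 3799}}},              # LLDP
    {"command": "delete", "flow": {"cookie": 0xAA}},                       # disable EVC
    {"command": "add", "flow": {"cookie": 0xAA, "match": {"in_port": 1}}}, # EVC leg 1
    {"command": "add", "flow": {"cookie": 0xAA, "match": {"in_port": 2}}}, # EVC leg 2
]

# With the delete stored BEFORE the adds (as in the log above), the EVC
# flows survive the replay, so the consistency check re-installs them.
print(len(reduce_commands(stored)))  # 3: LLDP plus both "deleted" EVC flows
```

Replaying the same four commands in the order the events actually happened (adds first, delete last) would leave only the LLDP flow, which is the state the operator expects.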

@italovalcy
Contributor Author

@cmagnobarbosa I was digging deeper into this and realized it is related to something else: the EVC was archived through a PATCH request rather than through the DELETE method. So please, forget about the comment above. I will open a new issue in mef_eline for this.
