Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eventually a removed flow is recreated when Kytos is under high usage (possible race condition) #128

Open
italovalcy opened this issue May 11, 2021 · 1 comment
Labels

Comments

@italovalcy
Copy link
Contributor

Hi folks,

The consistency check routine eventually recreates a flow when Kytos is under high usage (many requests in a short period of time), which indicates a possible race condition.

More specifically, the following end-to-end tests sometimes pass and sometimes fail:

_______________________________ TestE2EMefEline.test_070_repeat_creation_and_removal_of_ten_circuits_ten_times ________________________________

self = <tests.test_e2e_10_mef_eline.TestE2EMefEline instance at 0x7f757b5bd5f0>

    def test_070_repeat_creation_and_removal_of_ten_circuits_ten_times(self):
        for x in range(0, 10):
            evcs = {}
            for i in range(400, 410):
                payload = {
                    "name": "evc_%s" % i,
                    "enabled": True,
                    "dynamic_backup_path": True,
                    "uni_a": {
                        "interface_id": "00:00:00:00:00:00:00:01:1",
                        "tag": {"tag_type": 1, "value": i}
                    },
                    "uni_z": {
                        "interface_id": "00:00:00:00:00:00:00:02:1",
                        "tag": {"tag_type": 1, "value": i}
                    }
                }
                api_url = KYTOS_API + '/mef_eline/v2/evc/'
                response = requests.post(api_url, data=json.dumps(payload), headers={'Content-type': 'application/json'})
                assert response.status_code == 201
                data = response.json()
                assert 'circuit_id' in data
                evcs[i] = data['circuit_id']

            time.sleep(10)

            # make sure the evcs are active and the flows were created
            s1, s2 = self.net.net.get('s1', 's2')
            flows_s1 = s1.dpctl('dump-flows')
            flows_s2 = s2.dpctl('dump-flows')
            for vid in evcs:
                evc_id = evcs[vid]
                api_url = KYTOS_API + '/mef_eline/v2/evc/' + evc_id
                response = requests.get(api_url)
                assert response.status_code == 200
                evc = response.json()
                #should be active
                assert evc["active"] is True
                # search for the vlan id
                assert "dl_vlan=%s" % vid in flows_s1
                assert "dl_vlan=%s" % vid in flows_s2
                # search for the cookie, should have two flows
                assert len(re.findall(evc['id'], flows_s1, flags=re.IGNORECASE)) == 2
                assert len(re.findall(evc['id'], flows_s2, flags=re.IGNORECASE)) == 2

            # Delete the circuits
            for vid in evcs:
                evc_id = evcs[vid]
                api_url = KYTOS_API + '/mef_eline/v2/evc/' + evc_id
                response = requests.delete(api_url)
                assert response.status_code == 200

            time.sleep(10)

            # make sure the circuits were deleted
            api_url = KYTOS_API + '/mef_eline/v2/evc/'
            response = requests.get(api_url)
            assert response.status_code == 200
            assert response.json() == {}
            flows_s1 = s1.dpctl('dump-flows')
            flows_s2 = s2.dpctl('dump-flows')
>           assert len(flows_s1.split('\r\n ')) == 1, "round=%d - should have only 1 flow but had: \n%s" % (x, flows_s1)
E           AssertionError: round=8 - should have only 1 flow but had:
E              cookie=0xaa28bc57b4ba8f4a, duration=8.110s, table=0, n_packets=0, n_bytes=0, in_port="s1-eth1",dl_vlan=405 actions=strip_vlan,mod_vlan_vid:3461,output:"s1-eth3"
E              cookie=0xaa28bc57b4ba8f4a, duration=8.079s, table=0, n_packets=0, n_bytes=0, in_port="s1-eth3",dl_vlan=3461 actions=strip_vlan,mod_vlan_vid:405,output:"s1-eth1"
E              cookie=0x0, duration=220.814s, table=0, n_packets=144, n_bytes=6048, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
E
E           assert 3 == 1
E            +  where 3 = len([' cookie=0xaa28bc57b4ba8f4a, duration=8.110s, table=0, n_packets=0, n_bytes=0, in_port="s1-eth1",dl_vlan=405 actions=...20.814s, table=0, n_packets=144, n_bytes=6048, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535\r\n'])
E            +    where [' cookie=0xaa28bc57b4ba8f4a, duration=8.110s, table=0, n_packets=0, n_bytes=0, in_port="s1-eth1",dl_vlan=405 actions=...20.814s, table=0, n_packets=144, n_bytes=6048, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535\r\n'] = <built-in method split of str object at 0x564093f85fd0>('\r\n ')
E            +      where <built-in method split of str object at 0x564093f85fd0> = ' cookie=0xaa28bc57b4ba8f4a, duration=8.110s, table=0, n_packets=0, n_bytes=0, in_port="s1-eth1",dl_vlan=405 actions=s...220.814s, table=0, n_packets=144, n_bytes=6048, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535\r\n'.split

As noted, the test basically creates 10 EVCs, waits a couple of seconds, makes sure the flows were created, removes the 10 evcs, waits a couple of seconds, makes sure the flows were removed, and so on. We repeat this procedure 10 times.

This test was run using the AmLight Kytos image (https://hub.docker.com/r/amlight/kytos), which is basically a master branch daily build image, with some PRs applied (e.g., #117 and more: https://github.com/amlight/kytos-docker/tree/master/patches). Also, we change the default STATS_INTERVAL setting to 3s and leave the CONSISTENCY_INTERVAL=0 (which means the consistency_check will run every 3s). The LINK_UP_TIMER is also changed to 1s.

This problem seems to be a race-condition situation because while the consistency check is running, other requests to remove EVCs are received for different switches and each thread seems to change the self.stored_flows in a uncontrolled manner (the shared resource).

I've enabled some logging and this is how it looks like (logs for the above test, cookie 0xaa28bc57b4ba8f4a == 12261257070396477258 ==> evc ID 28bc57b4ba8f4a):

May 11 01:35:28 da6a340a8416 kytos.napps.kytos/mef_eline:INFO main:246:  Removing EVC(28bc57b4ba8f4a, evc_405)

May 11 01:35:28 da6a340a8416 kytos.napps.kytos/flow_manager:INFO main:375:  _install_flows switches=[Switch('00:00:00:00:00:00:00:01')] command=delete flows={'flows': [{'cookie': 12261257070396477258, 'cookie_mask': 18446744073709551615}]}

May 11 01:35:29 da6a340a8416 kytos.napps.kytos/flow_manager:INFO main:375:  _install_flows switches=[Switch('00:00:00:00:00:00:00:02')] command=delete flows={'flows': [{'cookie': 12261257070396477258, 'cookie_mask': 18446744073709551615}]}

May 11 01:35:31 da6a340a8416 kytos.napps.kytos/flow_manager:INFO main:375:  _install_flows switches=[Switch('00:00:00:00:00:00:00:01')] command=add flows={'flows': [{'match': {'in_port': 1, 'dl_vlan': 405}, 'cookie': 12261257070396477258, 'actions': [{'action_type': 'pop_vlan'}, {'action_type': 'push_vlan', 'tag_type': 's'}, {'action_type': 'set_vlan', 'vlan_id': 3461}, {'action_type': 'output', 'port': 3}]}]}

May 11 01:35:31 da6a340a8416 kytos.napps.kytos/flow_manager:INFO main:375:  _install_flows switches=[Switch('00:00:00:00:00:00:00:01')] command=add flows={'flows': [{'match': {'in_port': 3, 'dl_vlan': 3461}, 'cookie': 12261257070396477258, 'actions': [{'action_type': 'pop_vlan'}, {'action_type': 'push_vlan', 'tag_type': 'c'}, {'action_type': 'set_vlan', 'vlan_id': 405}, {'action_type': 'output', 'port': 1}]}]}

In between those events, I could see that the function _store_changed_flows() sometimes overwrite a content previously changed. Attached the full log.
syslog-2021-05-10.log.gz

@italovalcy
Copy link
Contributor Author

Hi folks,

Arturo (@ArturoQuintana) and I were testing Kytos/flow_manager and we found another way to demonstrate the racing condition: create 100 flows with parallel requests.

How to reproduce:

  1. Run Kytos with the most update version and Mininet (any topology)
  2. Request 100 flows with parallel requests:
for i in $(seq 100 200); do curl -X POST -H 'Content-type: application/json' http://127.0.0.1:8181/api/kytos/flow_manager/v2/flows/00:00:00:00:00:00:00:01 -d "{\"flows\": [{\"priority\": 100, \"match\": {\"in_port\": 1, \"dl_vlan\": $i}, \"actions\": [{\"action_type\": \"output\", \"port\": 2}]}]}" & true; done

Expected result: 100 flows should be created

Actual results: some of the flows are lost... some of them are removed by the consistency check, just a few remain on the flow_manager storehouse. Example:

mininet> sh ovs-ofctl dump-flows s1 | wc -l
94
mininet> sh curl http://gitlab-runner.ampath.net:18181/api/kytos/storehouse/v1/kytos.flow.persistence/
["5ef0843843034743858a1bb485453ab4"]
mininet> sh curl http://gitlab-runner.ampath.net:18181/api/kytos/storehouse/v1/kytos.flow.persistence/5ef0843843034743858a1bb485453ab4 | jq -r
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15181  100 15181    0     0  1235k      0 --:--:-- --:--:-- --:--:-- 1235k
{
  "flow_persistence": {
    "00:00:00:00:00:00:00:01": {
      "flow_list": [
        {
          "command": "add",
          "flow": {
            "actions": [
              {
                "action_type": "output",
                "port": 4294967293
              }
            ],
            "match": {
              "dl_type": 35020,
              "dl_vlan": 3799
            },
            "priority": 1000,
            "table_id": 0
          }
        },
        {
          "command": "add",
          "flow": {
            "actions": [
              {
                "action_type": "output",
                "port": 2
              }
            ],
            "match": {
              "dl_vlan": 107,
              "in_port": 1
            },
            "priority": 100
          }
        },
...
        {
          "command": "delete_strict",
          "flow": {
            "actions": [
              {
                "action_type": "output",
                "port": 2
              }
            ],
            "cookie": 0,
            "hard_timeout": 0,
            "id": "02d16e433ace04b2b2e42c92f02f7361",
            "idle_timeout": 0,
            "match": {
              "dl_vlan": 135,
              "in_port": 1
            },
            "priority": 100,
            "stats": {
              "byte_count": 0,
              "duration_nsec": 986000000,
              "duration_sec": 0,
              "packet_count": 0
            },
            "switch": "00:00:00:00:00:00:00:01",
            "table_id": 0
          }
        },
        {
          "command": "delete_strict",
          "flow": {
            "actions": [
              {
                "action_type": "output",
                "port": 2
              }
            ],
            "cookie": 0,
            "hard_timeout": 0,
            "id": "b48e58e183537226f3eef48d88da9b97",
            "idle_timeout": 0,
            "match": {
              "dl_vlan": 134,
              "in_port": 1
            },
            "priority": 100,
            "stats": {
              "byte_count": 0,
              "duration_nsec": 981000000,
              "duration_sec": 0,
              "packet_count": 0
            },
            "switch": "00:00:00:00:00:00:00:01",
            "table_id": 0
          }
        },
        {
          "command": "delete_strict",
          "flow": {
            "actions": [
              {
                "action_type": "output",
                "port": 2
              }
            ],
            "cookie": 0,
            "hard_timeout": 0,
            "id": "abab36e16d5734ad41be7916093fa919",
            "idle_timeout": 0,
            "match": {
              "dl_vlan": 142,
              "in_port": 1
            },
            "priority": 100,
            "stats": {
              "byte_count": 0,
              "duration_nsec": 964000000,
              "duration_sec": 0,
              "packet_count": 0
            },
            "switch": "00:00:00:00:00:00:00:01",
            "table_id": 0
          }
        },
        {
          "command": "delete_strict",
          "flow": {
            "actions": [
              {
                "action_type": "output",
                "port": 2
              }
            ],
            "cookie": 0,
            "hard_timeout": 0,
            "id": "4ba2b0b7a10ea9646d597dc86aae071f",
            "idle_timeout": 0,
            "match": {
              "dl_vlan": 149,
              "in_port": 1
            },
            "priority": 100,
            "stats": {
              "byte_count": 0,
              "duration_nsec": 961000000,
              "duration_sec": 0,
              "packet_count": 0
            },
            "switch": "00:00:00:00:00:00:00:01",
            "table_id": 0
          }
        },
        {
          "command": "delete_strict",
          "flow": {
            "actions": [
              {
                "action_type": "output",
                "port": 2
              }
            ],
            "cookie": 0,
            "hard_timeout": 0,
            "id": "264becad26820b07dedff014f9c206bc",
            "idle_timeout": 0,
            "match": {
              "dl_vlan": 145,
              "in_port": 1
            },
            "priority": 100,
            "stats": {
              "byte_count": 0,
              "duration_nsec": 957000000,
              "duration_sec": 0,
              "packet_count": 0
            },
            "switch": "00:00:00:00:00:00:00:01",
            "table_id": 0
          }
        },
        {
          "command": "delete_strict",
          "flow": {
            "actions": [
              {
                "action_type": "output",
                "port": 2
              }
            ],
            "cookie": 0,
            "hard_timeout": 0,
            "id": "6558340bf51f0760d8fbbd14132d13be",
            "idle_timeout": 0,
            "match": {
              "dl_vlan": 175,
              "in_port": 1
            },
            "priority": 100,
            "stats": {
              "byte_count": 0,
              "duration_nsec": 909000000,
              "duration_sec": 0,
              "packet_count": 0
            },
            "switch": "00:00:00:00:00:00:00:01",
            "table_id": 0
          }
        },
        {
          "command": "delete_strict",
          "flow": {
            "actions": [
              {
                "action_type": "output",
                "port": 2
              }
            ],
            "cookie": 0,
            "hard_timeout": 0,
            "id": "a293736e85ad0a914a6ec0250fd2c8e4",
            "idle_timeout": 0,
            "match": {
              "dl_vlan": 171,
              "in_port": 1
            },
            "priority": 100,
            "stats": {
              "byte_count": 0,
              "duration_nsec": 906000000,
              "duration_sec": 0,
              "packet_count": 0
            },
            "switch": "00:00:00:00:00:00:00:01",
            "table_id": 0
          }
        },
        {
          "command": "delete_strict",
          "flow": {
            "actions": [
              {
                "action_type": "output",
                "port": 2
              }
            ],
            "cookie": 0,
            "hard_timeout": 0,
            "id": "99c5b57b96af0a94e99e9247fd0ebc33",
            "idle_timeout": 0,
            "match": {
              "dl_vlan": 198,
              "in_port": 1
            },
            "priority": 100,
            "stats": {
              "byte_count": 0,
              "duration_nsec": 870000000,
              "duration_sec": 0,
              "packet_count": 0
            },
            "switch": "00:00:00:00:00:00:00:01",
            "table_id": 0
          }
        },
        {
          "command": "delete_strict",
          "flow": {
            "actions": [
              {
                "action_type": "output",
                "port": 2
              }
            ],
            "cookie": 0,
            "hard_timeout": 0,
            "id": "6e49986ef369a1a13dbb2c8027c1f28e",
            "idle_timeout": 0,
            "match": {
              "dl_vlan": 197,
              "in_port": 1
            },
            "priority": 100,
            "stats": {
              "byte_count": 0,
              "duration_nsec": 867000000,
              "duration_sec": 0,
              "packet_count": 0
            },
            "switch": "00:00:00:00:00:00:00:01",
            "table_id": 0
          }
        }
      ]
    },
    "00:00:00:00:00:00:00:02": {
      "flow_list": [
        {
          "command": "add",
          "flow": {
            "actions": [
              {
                "action_type": "output",
                "port": 4294967293
              }
            ],
            "match": {
              "dl_type": 35020,
              "dl_vlan": 3799
            },
            "priority": 1000,
            "table_id": 0
          }
        }
      ]
    },
    "00:00:00:00:00:00:00:03": {
      "flow_list": [
        {
          "command": "add",
          "flow": {
            "actions": [
              {
                "action_type": "output",
                "port": 4294967293
              }
            ],
            "match": {
              "dl_type": 35020,
              "dl_vlan": 3799
            },
            "priority": 1000,
            "table_id": 0
          }
        }
      ]
    }
  }
}
mininet>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants