Make creation of extraction opportunities faster. #96

Open
wants to merge 2 commits into
base: master

Contributor

@aravij aravij commented Nov 12, 2020

Changed the algorithm that creates extraction opportunities to speed it up.

The algorithm now works as follows:

  1. For each statement, calculate the location of the next similar statement. We store a list of steps (int): adding a step to a statement's index gives the index of the next similar statement. Some statements may have no next similar statement.
  2. Create the initial statement ranges. Split the sequence of statements into a sorted sequence of non-overlapping statement ranges with no gaps between them. An initial range is a range in which each statement, except the first one, is similar to the previous one. Having split all statements this way, we add these ranges to the extraction opportunities. They correspond to the opportunities created during step one.
  3. Collect all similarity gaps: statements whose next similar statement does not immediately follow them.
  4. For each such gap:
    1. Identify the ranges of statements to which the first and the second statement belong.
    2. Merge those two ranges and everything between them into a single range.
    3. If the previous opportunity, created while handling a gap of the same size, starts from the same statement as the newly created one, overwrite that opportunity with the new one. Otherwise, append the new one to those already created. This step is needed because some gaps may overlap, i.e. the range of the second statement of the first gap equals the range of the first statement of the second gap. When that happens, both gaps should belong to the same opportunity, because the previous version of the algorithm would pass through them at once, as they are of the same size. We detect overlapping gaps by the fact that the second opportunity is larger than the first one but starts from the same statement. So, if the newly created opportunity starts from the same statement as the one created while handling a gap of the same size, we simply overwrite that opportunity with the new one, as it contains both gaps.
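
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from this pull request: the function names, the representation of statement semantics as sets of names, the similarity test (a non-empty intersection), the processing of gaps in order of increasing size, and the skipping of gaps that already lie inside one range are all assumptions made for the example.

```python
from bisect import bisect_right
from typing import List, Optional, Set, Tuple

Range = Tuple[int, int]  # inclusive (first, last) statement indices


def next_similar_steps(semantics: List[Set[str]]) -> List[Optional[int]]:
    """Step 1: distance from each statement to its next similar statement."""
    steps: List[Optional[int]] = []
    for i, current in enumerate(semantics):
        step = None
        for j in range(i + 1, len(semantics)):
            if current & semantics[j]:  # assumed similarity test: shared names
                step = j - i
                break
        steps.append(step)
    return steps


def initial_ranges(steps: List[Optional[int]]) -> List[Range]:
    """Step 2: gapless ranges where each statement is similar to the previous one."""
    ranges: List[Range] = []
    start = 0
    for i in range(1, len(steps)):
        if steps[i - 1] != 1:  # statement i is not similar to statement i - 1
            ranges.append((start, i - 1))
            start = i
    if steps:
        ranges.append((start, len(steps) - 1))
    return ranges


def range_index(ranges: List[Range], statement: int) -> int:
    """Index of the range that contains the given statement."""
    starts = [first for first, _ in ranges]
    return bisect_right(starts, statement) - 1


def create_extraction_opportunities(semantics: List[Set[str]]) -> List[Range]:
    steps = next_similar_steps(semantics)
    ranges = initial_ranges(steps)
    opportunities = list(ranges)

    # Step 3: gaps as (size, first statement), handled by increasing size (assumed order).
    gaps = sorted((s, i) for i, s in enumerate(steps) if s is not None and s > 1)

    last_gap_size = None
    for size, first in gaps:  # Step 4
        low = range_index(ranges, first)
        high = range_index(ranges, first + size)
        if low == high:
            continue  # assumption: a gap already inside one range adds nothing new
        merged = (ranges[low][0], ranges[high][1])
        ranges[low:high + 1] = [merged]
        if last_gap_size == size and opportunities[-1][0] == merged[0]:
            opportunities[-1] = merged  # overlapping gaps of the same size
        else:
            opportunities.append(merged)
        last_gap_size = size
    return opportunities
```

For example, with the hypothetical input `[{"a"}, {"x"}, {"a", "b"}, {"y"}, {"b"}]` there are two overlapping gaps of size 2 (statements 0 to 2 and 2 to 4). The first gap produces the range `(0, 2)`, and the second one overwrites it with `(0, 4)`, so only one opportunity is kept for that gap size.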

Applying the new version of the algorithm gives the following gain:

  • For the file InternalMetaDataParser with 1721 methods, the average speed-up of the create_extraction_opportunities step was 88.6%, or 0.0086 seconds. The total time saved on that step is 14.8 seconds. The total processing of this file with the SEMI algorithm takes 2.5 minutes.
  • For the file TomlParser with 87 methods, the average speed-up of the create_extraction_opportunities step was 68.3%, or 0.0052 seconds. The total time saved on that step is 0.45 seconds. The total processing of this file with the SEMI algorithm takes 7 seconds.
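
The totals above are consistent with multiplying the average saving by the number of methods. A small check, assuming the reported averages are per method (the dictionary and function names are made up for this example):

```python
# Figures reported above; "avg_saving_s" is assumed to be the average saving per method.
measurements = {
    "InternalMetaDataParser": {"methods": 1721, "avg_saving_s": 0.0086},
    "TomlParser": {"methods": 87, "avg_saving_s": 0.0052},
}


def total_saving(methods: int, avg_saving_s: float) -> float:
    """Total time saved on the step, in seconds."""
    return methods * avg_saving_s


for name, figures in measurements.items():
    print(f"{name}: {total_saving(**figures):.2f} s saved in total")
```

This gives roughly 14.8 seconds for InternalMetaDataParser and 0.45 seconds for TomlParser, matching the reported totals.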

The relative speed-up is quite good, while in absolute numbers it is quite irrelevant.

Further speed-up of the algorithm might be achieved by speeding up other steps and, maybe, the AST framework.
Here is a comparison of the time taken by create_extraction_opportunities with the time taken by the other steps.

| Step name | InternalMetaDataParser, old version | InternalMetaDataParser, new version | TomlParser, old version | TomlParser, new version |
| --- | --- | --- | --- | --- |
| Extract semantic | 3.4 ms | 3.4 ms | 4.2 ms | 4 ms |
| Create opportunities | 9.4 ms | 0.8 ms | 5.9 ms | 0.7 ms |
| Filter opportunities | 13 ms | 14 ms | 18.7 ms | 18 ms |
| Rank opportunities | 51 ms | 52 ms | 47.8 ms | 47 ms |

@aravij aravij self-assigned this Nov 12, 2020
@aravij aravij linked an issue Nov 12, 2020 that may be closed by this pull request
@lyriccoder
Member

@aravij Let's discuss it on Monday. We need to test it on a large number of files.

@lyriccoder
Member

With increased speed:
Elapsed: 7889 secs
I will measure the run without the increased speed soon.

@lyriccoder
Member

Without increased speed:
Elapsed: 8415

Member

@lyriccoder lyriccoder left a comment


seems it has become faster

Successfully merging this pull request may close these issues.

SEMI Baseline. Finding opportunities takes too much time