Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add instruction scheduler #19

Open
doe300 opened this issue Dec 14, 2017 · 5 comments
Open

Add instruction scheduler #19

doe300 opened this issue Dec 14, 2017 · 5 comments
Assignees
Labels
enhancement optimization related to an optimization step
Milestone

Comments

@doe300
Copy link
Owner

doe300 commented Dec 14, 2017

Not sure how to implement this, but here a few notes on an instruction scheduler:

Goal

Reorder instructions within a basic block to utilize the delay introduced by certain operations by inserting meaningful instructions minimizing the number of cycles spent waiting (via nop or on periphery registers).

Target features

  • remove delays introduced by accessing periphery (waiting on SFU/TMU/VPM)
  • remove delay introduced by read-after-write of physical register
  • allow queueing up to 4 memory load-requests per TMU and balance load on both TMUs (see Speed-up memory access #15 )
  • re-order instructions to maximize the possible candidates for instruction-combinations without violating the above distances

Implementation

  • create graph of dependencies between instructions
  • dependencies have weight determining the minimum distance between the two instructions to not require delay-slots.

See also the Wikipedia.

@nomaddo
Copy link
Collaborator

nomaddo commented Dec 14, 2017

Can you assign me to the issue?
Now I am trying to implement list scheduling. The algorithm is as follows:

  • Preparation: create DAG (Dependency Graph)
    DAG represents dependency of instructions
    Sending signals is also considered as dependency not to change the orders (same as other side effects)
  • Start scheduling for each basic block:
    • prepare cycle-counter and ready-queue: set of instructions that can be issued
      Aft first, ready-queue has instructions that can be issued without any executions
      (use the DAG to construct it)
    • pick up one instruction from ready-queue with some heuristics
      • if both add-op and mul-op instructions can be issued, fuse them
      • if both ALU instruction and sending signal can be issued, fuse them
      • convert or a, b, b to v8min a, b, b if another add-op instruction can be issued, and fuse them
      • if branch instruction can be issued, and at most three instructions can be issued, issue branch and fill delay slots by them
    • increment cycle-counter
    • if no instruction can be issued, increment the cycle-counter, and calculate ready-queue

This scheduling can be applied for one basic block, not for across basic blocks.
This implementation is relatively simple and understandable.

The problem is, the heuristics that now we adopt to choose an instruction from ready-queue.
Probably, we need to care memory latency and register-file or so on.
This must be constructed in trial and error...

@nomaddo nomaddo self-assigned this Dec 15, 2017
@doe300
Copy link
Owner Author

doe300 commented Dec 15, 2017

Preparation: create DAG (Dependency Graph)

You could re-use the vc4c::Graph class, if it suits your needs. It is already used for the basic-block dependency graph and the colored graph for register mapping. Also, the class DebugGraph provides an easy way to print the contents of the graph into graphviz file.

@nomaddo
Copy link
Collaborator

nomaddo commented Dec 19, 2017

Question:
I am wondering when VC4C translate SSA to non-SSA before register-allocation.
Come to think of it, Instruction Scheduling fits non-SSA, and equal to an assembly except that is it not register allocated.

It may be possible to implement the scheduler for SSA-form language, but a few improvable points may be remain.

@doe300
Copy link
Owner Author

doe300 commented Dec 19, 2017

The SSA from the intermediate language (LLVM-IR or SPIR-V) is relaxed very early (before the optimization steps are run) to resolve phi-instructions (function eliminatePhiNodes). So the internal representation of the optimization steps (up to the register-mapping) is SSA-ish (most of the instructions are SSA, but it is not guaranteed that a local is only written once).

@nomaddo nomaddo added this to the v0.1 milestone Mar 5, 2018
@doe300 doe300 added the optimization related to an optimization step label Apr 7, 2018
@doe300 doe300 self-assigned this Sep 15, 2018
doe300 added a commit that referenced this issue Sep 16, 2018
@doe300
Copy link
Owner Author

doe300 commented Sep 22, 2018

Intermediate status update:

  • Instruction scheduler implemented and mostly working
  • Performance improvements required
  • Some tweaking required what instructions to prefer(decrease pressure on VPM lock, combine more instructions)
  • More tests required

Some statistics (based on TestVC4C --test-emulator):

  • ✗ Takes 46s in total (instead of 36s without scheduler), scheduler itself takes 15s
  • ✗ Increases pressure on VPM lock (from 116k waits) to 139k waits
  • ✓ Greatly decreases pressure on waiting for TMU result (from 48k delay cycles) to 3k delay cycles
  • ✓/✗ Changes pressure on VPM DMA wait (from 162k/83k delay cycles) to 129k/99k delay cycles
  • ✗ Converts 20k less instructions (moves) from add ALU to mul ALU instructions and therefore does not combine them
  • ✓ Total emulation cycle count is reduced from 612k to 592k

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement optimization related to an optimization step
Projects
None yet
Development

No branches or pull requests

2 participants