Add instruction scheduler #19

doe300 · 2017-12-14T11:42:08Z

Not sure how to implement this, but here a few notes on an instruction scheduler:

Goal

Reorder instructions within a basic block to utilize the delay introduced by certain operations by inserting meaningful instructions minimizing the number of cycles spent waiting (via nop or on periphery registers).

Target features

remove delays introduced by accessing periphery (waiting on SFU/TMU/VPM)
remove delay introduced by read-after-write of physical register
allow queueing up to 4 memory load-requests per TMU and balance load on both TMUs (see Speed-up memory access #15 )
re-order instructions to maximize the possible candidates for instruction-combinations without violating the above distances

Implementation

create graph of dependencies between instructions
dependencies have weight determining the minimum distance between the two instructions to not require delay-slots.

See also the Wikipedia.

The text was updated successfully, but these errors were encountered:

nomaddo · 2017-12-14T19:59:47Z

Can you assign me to the issue?
Now I am trying to implement list scheduling. The algorithm is as follows:

Preparation: create DAG (Dependency Graph)
DAG represents dependency of instructions
Sending signals is also considered as dependency not to change the orders (same as other side effects)
Start scheduling for each basic block:
- prepare cycle-counter and ready-queue: set of instructions that can be issued
  Aft first, ready-queue has instructions that can be issued without any executions
  (use the DAG to construct it)
- pick up one instruction from ready-queue with some heuristics
  - if both add-op and mul-op instructions can be issued, fuse them
  - if both ALU instruction and sending signal can be issued, fuse them
  - convert or a, b, b to v8min a, b, b if another add-op instruction can be issued, and fuse them
  - if branch instruction can be issued, and at most three instructions can be issued, issue branch and fill delay slots by them
- increment cycle-counter
- if no instruction can be issued, increment the cycle-counter, and calculate ready-queue

This scheduling can be applied for one basic block, not for across basic blocks.
This implementation is relatively simple and understandable.

The problem is, the heuristics that now we adopt to choose an instruction from ready-queue.
Probably, we need to care memory latency and register-file or so on.
This must be constructed in trial and error...

doe300 · 2017-12-15T07:29:59Z

Preparation: create DAG (Dependency Graph)

You could re-use the vc4c::Graph class, if it suits your needs. It is already used for the basic-block dependency graph and the colored graph for register mapping. Also, the class DebugGraph provides an easy way to print the contents of the graph into graphviz file.

nomaddo · 2017-12-19T19:15:46Z

Question:
I am wondering when VC4C translate SSA to non-SSA before register-allocation.
Come to think of it, Instruction Scheduling fits non-SSA, and equal to an assembly except that is it not register allocated.

It may be possible to implement the scheduler for SSA-form language, but a few improvable points may be remain.

doe300 · 2017-12-19T19:54:21Z

The SSA from the intermediate language (LLVM-IR or SPIR-V) is relaxed very early (before the optimization steps are run) to resolve phi-instructions (function eliminatePhiNodes). So the internal representation of the optimization steps (up to the register-mapping) is SSA-ish (most of the instructions are SSA, but it is not guaranteed that a local is only written once).

see #19

doe300 · 2018-09-22T06:54:37Z

Intermediate status update:

Instruction scheduler implemented and mostly working
Performance improvements required
Some tweaking required what instructions to prefer(decrease pressure on VPM lock, combine more instructions)
More tests required

Some statistics (based on TestVC4C --test-emulator):

✗ Takes 46s in total (instead of 36s without scheduler), scheduler itself takes 15s
✗ Increases pressure on VPM lock (from 116k waits) to 139k waits
✓ Greatly decreases pressure on waiting for TMU result (from 48k delay cycles) to 3k delay cycles
✓/✗ Changes pressure on VPM DMA wait (from 162k/83k delay cycles) to 129k/99k delay cycles
✗ Converts 20k less instructions (moves) from add ALU to mul ALU instructions and therefore does not combine them
✓ Total emulation cycle count is reduced from 612k to 592k

doe300 added enhancement help wanted labels Dec 14, 2017

nomaddo self-assigned this Dec 15, 2017

nomaddo removed the help wanted label Dec 15, 2017

nomaddo added this to the v0.1 milestone Mar 5, 2018

doe300 added the optimization related to an optimization step label Apr 7, 2018

doe300 self-assigned this Sep 15, 2018

doe300 added a commit that referenced this issue Sep 16, 2018

Adds instruction scheduler (WIP)

8b64f54

see #19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add instruction scheduler #19

Add instruction scheduler #19

doe300 commented Dec 14, 2017 •

edited

Loading

nomaddo commented Dec 14, 2017 •

edited

Loading

doe300 commented Dec 15, 2017

nomaddo commented Dec 19, 2017 •

edited

Loading

doe300 commented Dec 19, 2017

doe300 commented Sep 22, 2018 •

edited

Loading

Add instruction scheduler #19

Add instruction scheduler #19

Comments

doe300 commented Dec 14, 2017 • edited Loading

Goal

Target features

Implementation

nomaddo commented Dec 14, 2017 • edited Loading

doe300 commented Dec 15, 2017

nomaddo commented Dec 19, 2017 • edited Loading

doe300 commented Dec 19, 2017

doe300 commented Sep 22, 2018 • edited Loading

doe300 commented Dec 14, 2017 •

edited

Loading

nomaddo commented Dec 14, 2017 •

edited

Loading

nomaddo commented Dec 19, 2017 •

edited

Loading

doe300 commented Sep 22, 2018 •

edited

Loading