development questions #90

Open
sthalik opened this issue Oct 8, 2018 · 7 comments

@sthalik
Contributor

sthalik commented Oct 8, 2018

Hey,

I haven't had an opportunity to ask a few of these questions:

  • Is the comment about cache misses in node.h still relevant? Does it make sense to implement a tagged allocator or some other per-thread page allocator? Is there some lower or upper bound heuristic for the node count?
  • Can the self-eye-fill code be simplified, e.g. replaced with a pattern database that compiles to C code? It's really ugly and complicated.
  • Do the DCNN changes regress the non-DCNN version, with DCNN disabled?
  • How does the DCNN compare to AlphaGo in a best-case scenario (unlimited tensor units)? The DCNN code was developed prior to the DeepMind papers, so does it use a different technique?
  • Is pondering still disabled with DCNN?
  • @lemonsqueeze are you the current maintainer?

(edited since initial posting)

@lemonsqueeze
Collaborator

lemonsqueeze commented Oct 12, 2018

Hi Stan,

  • Is the comment about cache misses in node.h still relevant? Does it make sense to implement a tagged allocator or some other per-thread page allocator? Is there some lower or upper bound heuristic for the node count?

You mean tree.h? I guess so; this part of the code hasn't changed in a long time. I haven't looked into cache issues closely but it's clear that performance drops a bit once threads / tree search enter the picture. Comparing the number of playouts/s between raw playouts and minimal tree search:

pachi -u t-unit/blank.t
./pachi --nodcnn --nopatterns --nojoseki -t =10000 threads=1,max_tree_size=40 < gtp/genmove.gtp
./pachi --nodcnn --nopatterns --nojoseki -t =10000 threads=4,max_tree_size=40 < gtp/genmove.gtp

[edited]
-> Getting about a 20% performance drop. Can probably improve on that.

  • Can the self-eye-fill code be simplified, e.g. replaced with a pattern database that compiles to C code? It's really ugly and complicated.

You mean the eyefill check in playout_moggy_permit()? It can probably be improved, but I'd be more concerned about is_bad_selfatari(), which is far uglier, more expensive, and used by other parts as well...

  • Do the DCNN changes regress the non-DCNN version, with DCNN disabled?

Not AFAIK, except for pondering, which is disabled by default. Play-testing welcome; I haven't checked in a while. For low playouts it actually does better now because of MM (#81).

  • How does the DCNN compare to AlphaGo in a best-case scenario (unlimited tensor units)? The DCNN code was developed prior to the DeepMind papers, so does it use a different technique?

Not sure I understand the question. What do you mean by 'unlimited tensor units'?

  • Is pondering still disabled with DCNN?

Yes

  • @lemonsqueeze are you the current maintainer?

Yes, time permitting.

@sthalik
Contributor Author

sthalik commented Oct 13, 2018

Hey @lemonsqueeze,

Thanks for taking the time to answer.

You mean tree.h? [...] I haven't looked into cache issues closely but it's clear that performance drops a bit once threads / tree search enter the picture.

Is it guaranteed that tree nodes are deallocated by the same thread that created them? If so, we could have a per-thread allocator. Given that nodes are all the same size, a bitmap would suffice for bookkeeping purposes.
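
Something along these lines (rough sketch with made-up names; only struct tree_node is assumed from tree.h, this is not actual Pachi code):

```c
/* Rough sketch of a per-thread, fixed-size node pool with bitmap bookkeeping.
 * All names here (node_pool, pool_alloc, ...) are made up for illustration;
 * struct tree_node is assumed to come from tree.h. */
#include <stdint.h>
#include <stddef.h>

#define POOL_NODES 4096                   /* nodes per pool page */

struct node_pool {
	struct tree_node *nodes;          /* POOL_NODES contiguous nodes */
	uint64_t used[POOL_NODES / 64];   /* 1 bit per node, zeroed at creation */
};

static struct tree_node *
pool_alloc(struct node_pool *p)
{
	for (int w = 0; w < POOL_NODES / 64; w++) {
		if (p->used[w] == ~0ULL)
			continue;                          /* this word is full */
		int bit = __builtin_ctzll(~p->used[w]);    /* first free slot */
		p->used[w] |= 1ULL << bit;
		return &p->nodes[w * 64 + bit];
	}
	return NULL;   /* page exhausted, caller grabs a new one */
}

static void
pool_free(struct node_pool *p, struct tree_node *n)
{
	size_t i = n - p->nodes;
	p->used[i / 64] &= ~(1ULL << (i % 64));
}
```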

Getting about a 20% performance drop. Can probably improve on that.

Try reverting my old commit that reduced padding (in effect about 10% of struct tree_node's size). That commit alone gave me a linear 11% boost on Sandy Bridge with 8 threads. The workstation I'm using right now doesn't have hyperthreading, so the loss isn't as pronounced.
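
To make the padding point concrete, here's a toy example (obviously not the real tree_node layout):

```c
/* Toy illustration of how field ordering affects struct size through padding;
 * this is NOT the actual struct tree_node layout. */
#include <stdio.h>
#include <stdint.h>

struct careless {          /* 32 bytes on x86-64: 10 of them are padding */
	uint32_t a;        /* 4 bytes + 4 bytes padding before b */
	uint64_t b;
	uint16_t c;        /* 2 bytes + 6 bytes padding before d */
	uint64_t d;
};

struct packed_by_hand {    /* 24 bytes: same fields, sorted by size */
	uint64_t b, d;
	uint32_t a;
	uint16_t c;        /* 2 bytes + 2 bytes tail padding */
};

int main(void)
{
	printf("%zu vs %zu bytes\n",
	       sizeof(struct careless), sizeof(struct packed_by_hand));
	return 0;
}
```

Fewer bytes per node should mean fewer cache lines touched per node, which is presumably where that boost came from.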

You mean the eyefill check in playout_moggy_permit()? It can probably be improved, but I'd be more concerned about is_bad_selfatari(), which is far uglier, more expensive, and used by other parts as well...

Both are hairy... For the latter, could there be a lookup table that rejects many cases without adding too much overhead of its own? Getting an understanding of tactics/selfatari.c sure takes a long time from just skimming the file. If it's worth trying at all, I could build a selfatari histogram from KGS games for 4x4 patterns or something.
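
Roughly the shape of what I have in mind (purely hypothetical names; only is_bad_selfatari() and the board types come from Pachi, and I've used the 3x3 neighborhood to keep the table small, a 4x4 window would need hashing):

```c
/* Sketch: cheaply skip "obviously fine" moves with a precomputed table keyed
 * on the local neighborhood, and only fall back to the expensive
 * is_bad_selfatari() for the rest.  neighborhood_key() and
 * selfatari_candidate[] are hypothetical. */
#include <stdbool.h>
#include <stdint.h>
#include "board.h"
#include "tactics/selfatari.h"

/* 8 neighbors * 2 bits (empty/black/white/edge) = 16-bit key */
extern uint16_t neighborhood_key(struct board *b, enum stone color, coord_t c);

/* Built offline from a KGS-game histogram: bit set means "this shape showed
 * up as a bad selfatari in the sample, run the full check". */
extern const uint8_t selfatari_candidate[(1 << 16) / 8];

static bool
maybe_bad_selfatari(struct board *b, enum stone color, coord_t c)
{
	uint16_t key = neighborhood_key(b, color, c);
	if (!(selfatari_candidate[key >> 3] & (1 << (key & 7))))
		return false;                      /* never seen as selfatari */
	return is_bad_selfatari(b, color, c);      /* full check only when needed */
}
```

It would be a heuristic filter rather than a proof, so the histogram would have to be conservative about which shapes get to skip the full check.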

How does the DCNN compare to AlphaGo in a best-case scenario (unlimited tensor units)? The DCNN code was developed prior to the DeepMind papers, so does it use a different technique?

Not sure I understand the question. What do you mean by 'unlimited tensor units'?

Sorry, I phrased it badly. Let's try again: is the machine learning architecture in Pachi's DCNN much worse than AlphaGo's in principle? What is the general principle behind the current code? I'm mostly familiar with classical symbolic programming, so this could be a great opportunity to get into machine learning.

Are there any particular entry points (files, functions) for familiarizing oneself with the ML bits?

@lemonsqueeze
Collaborator

Is it guaranteed that tree nodes are deallocated by the same thread that created them? If so, we could have a per-thread allocator. Given that nodes are all the same size, a bitmap would suffice for bookkeeping purposes.

IIRC nodes are only deallocated when the tree is pruned at the end of genmove, so that shouldn't be an issue.

Try reverting my old commit that reduced padding (in effect about 10% of struct tree_node's size). That commit alone gave me a linear 11% boost on Sandy Bridge with 8 threads. The workstation I'm using right now doesn't have hyperthreading, so the loss isn't as pronounced.

ok, will try. good to know.

Both are hairy... For the latter, could there be a lookup table that rejects many cases without adding too much overhead of its own? Getting an understanding of tactics/selfatari.c sure takes a long time from just skimming the file. If it's worth trying at all, I could build a selfatari histogram from KGS games for 4x4 patterns or something.

IMO performance isn't so much of an issue right now; playouts are pretty fast already. Improving the quality of playouts would be much better: they're very noisy right now, so it takes lots of them... Maybe it's time to try other approaches, like a learned playout policy or RL policy training...

Another area where you could help would be finding good parameters for MM (#93).

Sorry, I phrased it badly. Let's try again: is the machine learning architecture in Pachi's DCNN much worse than AlphaGo's in principle? What is the general principle behind the current code? I'm mostly familiar with classical symbolic programming, so this could be a great opportunity to get into machine learning.

Are there any particular entry points (files, functions) for familiarizing oneself with the ML bits?

Right now Pachi uses Detlef's 54% dcnn, which is similar to what the first AlphaGo used as its policy network (the one trained from human games, not the RL one). The big difference is that Pachi uses it for the root node only, whereas AlphaGo and others use it for every node. There's actually no ML dcnn code in Pachi itself. I believe Detlef trained it more or less as in Maddison, Huang & Silver's paper ("Move Evaluation in Go Using Deep Convolutional Neural Networks"). There have been many improvements since then, of course. If you plug in Leela-zero's dcnn it will probably be around 5d, even at the root node only.
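
Conceptually, root-only use boils down to something like this (illustrative types and names, not the actual uct/prior code):

```c
/* Conceptual sketch of "dcnn as root prior": query the network once for the
 * root position, then turn its move probabilities into virtual playouts on
 * the root's children, so UCT explores the net's favourites first.  Deeper
 * nodes get no dcnn input at all.  Types and names are illustrative only. */

struct child {
	int coord;                  /* move this child represents */
	float prior_playouts;       /* virtual visits */
	float prior_wins;           /* virtual wins */
	struct child *sibling;
};

#define DCNN_PRIOR_EQUIV 20.0f      /* how much weight the net's opinion gets */

static void
seed_root_priors(struct child *children, const float dcnn_prob[])
{
	for (struct child *n = children; n; n = n->sibling) {
		n->prior_playouts += DCNN_PRIOR_EQUIV;
		n->prior_wins     += DCNN_PRIOR_EQUIV * dcnn_prob[n->coord];
	}
}
```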

@sthalik
Contributor Author

sthalik commented Oct 14, 2018

Pachi uses it for the root node only, whereas AlphaGo and others use it for every node

Would it be better to consult the network for at least a couple dozen moves? Is this a TODO item, or is there deeper reasoning behind it? Do you prefer fast or heavy playouts?

Another area where you could help would be finding good parameters for MM (#93).

How many mid-range gaming GPUs would you need running 24/7, and for how long?

@lemonsqueeze
Collaborator

Would it be better to consult the network for at least a couple dozen moves? Is this a TODO item, or is there deeper reasoning behind it? Do you prefer fast or heavy playouts?

In theory, yes. I did experiments with that early on; the problem is that it introduces time and prior imbalance between the dcnn and non-dcnn nodes, but maybe there's a way to make it work.

The big TODO item here is a GPU-mode Pachi with dcnn at every node; this would make it about 2 stones stronger. IIRC from the DarkForest paper there are some issues with parallel dcnn to work around, but it shouldn't be too hard. I don't have a GPU so I'm not interested in that, but that's the way forward strength-wise. It would also make it possible to use a value network for evaluation instead of playouts, but at that point you might as well just use leela-zero =)
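
The parallel-dcnn part would presumably end up as some kind of shared batching queue, very roughly like this (sketch only, not based on the DarkForest code; run_network_batch() is a hypothetical GPU call):

```c
/* Very rough sketch of batched dcnn evaluation for multi-threaded search:
 * search threads enqueue leaf positions and block, a single GPU thread runs
 * the network on a full batch and wakes them up.  All names are illustrative;
 * a real version would also flush partial batches on a timeout. */
#include <pthread.h>

#define BATCH_SIZE 8
#define BOARD_PTS  361

struct dcnn_request {
	float planes[BOARD_PTS];   /* input features for one position */
	float result[BOARD_PTS];   /* move probabilities, filled by GPU thread */
	int   done;
};

extern void run_network_batch(struct dcnn_request *reqs[], int n);  /* hypothetical */

static struct dcnn_request *queue[BATCH_SIZE];
static int queued;
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t batch_ready = PTHREAD_COND_INITIALIZER;
static pthread_cond_t batch_done  = PTHREAD_COND_INITIALIZER;

/* Called by a search thread when it reaches a leaf that needs dcnn priors. */
static void
dcnn_evaluate_blocking(struct dcnn_request *req)
{
	pthread_mutex_lock(&mu);
	while (queued == BATCH_SIZE)          /* batch full, wait for GPU thread */
		pthread_cond_wait(&batch_done, &mu);
	req->done = 0;
	queue[queued++] = req;
	if (queued == BATCH_SIZE)
		pthread_cond_signal(&batch_ready);
	while (!req->done)
		pthread_cond_wait(&batch_done, &mu);
	pthread_mutex_unlock(&mu);
}

/* GPU thread: wait until a batch is full, evaluate it in one go, wake everyone. */
static void *
dcnn_batch_thread(void *arg)
{
	(void)arg;
	for (;;) {
		pthread_mutex_lock(&mu);
		while (queued < BATCH_SIZE)
			pthread_cond_wait(&batch_ready, &mu);
		run_network_batch(queue, queued);
		for (int i = 0; i < queued; i++)
			queue[i]->done = 1;
		queued = 0;
		pthread_cond_broadcast(&batch_done);
		pthread_mutex_unlock(&mu);
	}
	return NULL;
}
```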

@sthalik
Contributor Author

sthalik commented Jan 7, 2019

Hey,

Thanks for pointing me toward the leela-zero project. Apparently I've been living in the dark for the past few months 😃.

Using a CNN for each move can definitely change things when running a heavy multi-layer network on a GPU. Leela-zero still has problems with ladders and seki at low playout counts that Pachi never had, in my experience. Note that other programs are also using Leela's network, or Leela itself with networks trained on proprietary compute resources.

Cheers, and sorry for leaving this conversation unresolved for a bit.

@lemonsqueeze
Collaborator

Oh, I just realized there's an area that could use optimization: spatial pattern lookup in pattern.c.
Right now it's expensive: we hash and look up every point on the board (twice if there are no good local matches!).
genmove is 20% faster on my laptop if I comment out pattern_match_spatial().
Good incremental patterns would probably help; maybe it's time to resurrect board_spathash.
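
The incremental idea, roughly (sketch only; the old board_spathash code may well have worked differently):

```c
/* Sketch of incremental spatial hashing: instead of rehashing the whole
 * neighborhood of every point at pattern-matching time, keep a per-point
 * Zobrist-style hash that gets patched up whenever a nearby stone changes.
 * Names and sizes are illustrative; edge handling is omitted. */
#include <stdint.h>

#define SPAT_POINTS   24         /* points within the max spatial radius */
#define BOARD_POINTS  (21 * 21)  /* padded 19x19 board */

/* spat_zobrist[i][color]: random key for "neighbor at offset i has color" */
extern uint64_t spat_zobrist[SPAT_POINTS][4];
/* spat_offset[i]: board-index delta of the i-th point of the neighborhood */
extern int spat_offset[SPAT_POINTS];

static uint64_t spathash[BOARD_POINTS];   /* maintained incrementally */

/* Call from the board update code whenever the stone at coord c changes:
 * fix up the hash of every point whose neighborhood contains c. */
static void
spathash_update(int c, int old_color, int new_color)
{
	for (int i = 0; i < SPAT_POINTS; i++) {
		int center = c - spat_offset[i];   /* the point seeing c at offset i */
		spathash[center] ^= spat_zobrist[i][old_color];
		spathash[center] ^= spat_zobrist[i][new_color];
	}
}
```

Then pattern_match_spatial() could look up spathash[] directly instead of recomputing the neighborhood hash from scratch for every candidate point.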
