development questions #90

Open
sthalik opened this issue Oct 8, 2018 · 7 comments

@sthalik
Contributor

sthalik commented Oct 8, 2018

Hey,

I haven't had an opportunity to ask a few of these questions:

  • Is the comment about cache misses in node.h still relevant? Does it make sense to implement a tagged allocator or some other per-thread page allocator? Is there some lower or upper bound heuristic for the node count?
  • Can the self-eye-fill code be simplified, e.g. replaced with a pattern database that compiles to C code? It's really ugly and complicated.
  • Do the DCNN changes regress the non-DCNN version, with DCNN disabled?
  • How does the DCNN compare to AlphaGo in a best-case scenario (unlimited tensor units)? The DCNN code was developed prior to the DeepMind papers, so does it use a different technique?
  • Is pondering still disabled with DCNN?
  • @lemonsqueeze are you the current maintainer?

(edited since initial posting)

@lemonsqueeze
Collaborator

lemonsqueeze commented Oct 12, 2018

Hi Stan,

  • Is the comment about cache misses in node.h still relevant? Does it make sense to implement a tagged allocator or some other per-thread page allocator? Is there some lower or upper bound heuristic for the node count?

You mean tree.h? I guess so; this part of the code hasn't changed in a long time. I haven't looked into cache issues closely but it's clear that performance drops a bit once threads / tree search enter the picture. Comparing the number of playouts/s between raw playouts and minimal tree search:

pachi -u t-unit/blank.t
./pachi --nodcnn --nopatterns --nojoseki -t =10000 threads=1,max_tree_size=40 < gtp/genmove.gtp
./pachi --nodcnn --nopatterns --nojoseki -t =10000 threads=4,max_tree_size=40 < gtp/genmove.gtp

[edited]
-> Getting about a 20% performance drop. Can probably improve on that.

  • Can the self-eye-fill code be simplified, e.g. replaced with a pattern database that compiles to C code? It's really ugly and complicated.

You mean the eyefill check in playout_moggy_permit()? It can probably be improved, but I'd be more concerned about is_bad_selfatari(), which is far uglier, more expensive, and used by other parts as well...

  • Do the DCNN changes regress the non-DCNN version, with DCNN disabled?

Not AFAIK, except for pondering, which is disabled by default. Play-testing welcome; I haven't checked in a while. For low playouts it actually does better now because of MM (#81).

  • How does the DCNN compare to AlphaGo in a best-case scenario (unlimited tensor units)? The DCNN code was developed prior to the DeepMind papers, so does it use a different technique?

Not sure I understand the question. What do you mean by 'unlimited tensor units'?

  • Is pondering still disabled with DCNN?

Yes

  • @lemonsqueeze are you the current maintainer?

Yes, time permitting.

@sthalik
Contributor Author

sthalik commented Oct 13, 2018

Hey @lemonsqueeze,

Thanks for taking the time to answer.

You mean tree.h? [...] I haven't looked into cache issues closely but it's clear that performance drops a bit once threads / tree search enter the picture.

Is it guaranteed that tree nodes are deallocated by the same thread that created them? If so, we could have a per-thread allocator. Given that nodes are all the same size, a bitmap would suffice for bookkeeping purposes.
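
Something along these lines (rough sketch with made-up names; only struct tree_node is assumed from tree.h, this is not actual Pachi code):

```c
/* Rough sketch of a per-thread, fixed-size node pool with bitmap bookkeeping.
 * All names here (node_pool, pool_alloc, ...) are made up for illustration;
 * struct tree_node is assumed to come from tree.h. */
#include <stdint.h>
#include <stddef.h>

#define POOL_NODES 4096                   /* nodes per pool page */

struct node_pool {
	struct tree_node *nodes;          /* POOL_NODES contiguous nodes */
	uint64_t used[POOL_NODES / 64];   /* 1 bit per node, zeroed at creation */
};

static struct tree_node *
pool_alloc(struct node_pool *p)
{
	for (int w = 0; w < POOL_NODES / 64; w++) {
		if (p->used[w] == ~0ULL)
			continue;                          /* this word is full */
		int bit = __builtin_ctzll(~p->used[w]);    /* first free slot */
		p->used[w] |= 1ULL << bit;
		return &p->nodes[w * 64 + bit];
	}
	return NULL;   /* page exhausted, caller grabs a new one */
}

static void
pool_free(struct node_pool *p, struct tree_node *n)
{
	size_t i = n - p->nodes;
	p->used[i / 64] &= ~(1ULL << (i % 64));
}
```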

Getting about a 20% performance drop. Can probably improve on that.

Try reverting my old commit that reduced padding (in effect about 10% of struct tree_node's size). That commit alone gave me a linear 11% boost on Sandy Bridge with 8 threads. The workstation I'm using right now doesn't have hyperthreading, so the loss isn't as pronounced.
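
To make the padding point concrete, here's a toy example (obviously not the real tree_node layout):

```c
/* Toy illustration of how field ordering affects struct size through padding;
 * this is NOT the actual struct tree_node layout. */
#include <stdio.h>
#include <stdint.h>

struct careless {          /* 32 bytes on x86-64: 10 of them are padding */
	uint32_t a;        /* 4 bytes + 4 bytes padding before b */
	uint64_t b;
	uint16_t c;        /* 2 bytes + 6 bytes padding before d */
	uint64_t d;
};

struct packed_by_hand {    /* 24 bytes: same fields, sorted by size */
	uint64_t b, d;
	uint32_t a;
	uint16_t c;        /* 2 bytes + 2 bytes tail padding */
};

int main(void)
{
	printf("%zu vs %zu bytes\n",
	       sizeof(struct careless), sizeof(struct packed_by_hand));
	return 0;
}
```

Fewer bytes per node should mean fewer cache lines touched per node, which is presumably where that boost came from.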

You mean the eyefill check in playout_moggy_permit()? It can probably be improved, but I'd be more concerned about is_bad_selfatari(), which is far uglier, more expensive, and used by other parts as well...

Both are hairy... For the latter, could there be a lookup table that rejects many cases without adding too much overhead of its own? Getting an understanding of tactics/selfatari.c sure takes a long time from just skimming the file. If it's worth trying at all, I could build a selfatari histogram from KGS games for 4x4 patterns or something.
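
Roughly the shape of what I have in mind (purely hypothetical names; only is_bad_selfatari() and the board types come from Pachi, and I've used the 3x3 neighborhood to keep the table small, a 4x4 window would need hashing):

```c
/* Sketch: cheaply skip "obviously fine" moves with a precomputed table keyed
 * on the local neighborhood, and only fall back to the expensive
 * is_bad_selfatari() for the rest.  neighborhood_key() and
 * selfatari_candidate[] are hypothetical. */
#include <stdbool.h>
#include <stdint.h>
#include "board.h"
#include "tactics/selfatari.h"

/* 8 neighbors * 2 bits (empty/black/white/edge) = 16-bit key */
extern uint16_t neighborhood_key(struct board *b, enum stone color, coord_t c);

/* Built offline from a KGS-game histogram: bit set means "this shape showed
 * up as a bad selfatari in the sample, run the full check". */
extern const uint8_t selfatari_candidate[(1 << 16) / 8];

static bool
maybe_bad_selfatari(struct board *b, enum stone color, coord_t c)
{
	uint16_t key = neighborhood_key(b, color, c);
	if (!(selfatari_candidate[key >> 3] & (1 << (key & 7))))
		return false;                      /* never seen as selfatari */
	return is_bad_selfatari(b, color, c);      /* full check only when needed */
}
```

It would be a heuristic filter rather than a proof, so the histogram would have to be conservative about which shapes get to skip the full check.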

How does the DCNN compare to AlphaGo in a best-case scenario (unlimited tensor units)? The DCNN code was developed prior to the DeepMind papers, so does it use a different technique?

Not sure I understand the question. What do you mean by 'unlimited tensor units'?

Sorry, I phrased it badly. Let's try again: is the machine learning architecture in Pachi's DCNN much worse than AlphaGo's in principle? What is the general principle behind the current code? I'm mostly familiar with classical symbolic programming, so this could be a great opportunity to get into machine learning.

Are there any particular entry points (files, functions) for familiarizing oneself with the ML bits?

@lemonsqueeze
Collaborator

Is it guaranteed that tree nodes are deallocated by the same thread that created them? If so, we could have a per-thread allocator. Given that nodes are all the same size, a bitmap would suffice for bookkeeping purposes.

IIRC nodes are only deallocated when the tree is pruned at the end of genmove, so that shouldn't be an issue.

Try reverting my old commit that reduced padding (in effect about 10% of struct tree_node's size). That commit alone gave me a linear 11% boost on Sandy Bridge with 8 threads. The workstation I'm using right now doesn't have hyperthreading, so the loss isn't as pronounced.

ok, will try. good to know.

Both are hairy... For the latter, could there be a lookup table that rejects many cases without adding too much overhead of its own? Getting an understanding of tactics/selfatari.c sure takes a long time from just skimming the file. If it's worth trying at all, I could build a selfatari histogram from KGS games for 4x4 patterns or something.

IMO performance isn't so much of an issue right now; playouts are pretty fast already. Improving the quality of playouts would be much better: they're very noisy right now, so it takes lots of them... Maybe it's time to try other approaches, like a learned playout policy or RL policy training...

Another area where you could help would be finding good parameters for MM (#93).

Sorry, I phrased it badly. Let's try again: is the machine learning architecture in Pachi's DCNN much worse than AlphaGo's in principle? What is the general principle behind the current code? I'm mostly familiar with classical symbolic programming, so this could be a great opportunity to get into machine learning.

Are there any particular entry points (files, functions) for familiarizing oneself with the ML bits?

Right now Pachi uses Detlef's 54% dcnn, which is similar to what the first AlphaGo used as its policy network (the one trained from human games, not the RL one). The big difference is that Pachi uses it for the root node only, whereas AlphaGo and others use it for every node. There's actually no ML dcnn code in Pachi itself. I believe Detlef trained it more or less as in Maddison, Huang & Silver's paper ("Move Evaluation in Go Using Deep Convolutional Neural Networks"). There have been many improvements since then, of course. If you plug in Leela-zero's dcnn it will probably be around 5d, even at the root node only.
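
Conceptually, root-only use boils down to something like this (illustrative types and names, not the actual uct/prior code):

```c
/* Conceptual sketch of "dcnn as root prior": query the network once for the
 * root position, then turn its move probabilities into virtual playouts on
 * the root's children, so UCT explores the net's favourites first.  Deeper
 * nodes get no dcnn input at all.  Types and names are illustrative only. */

struct child {
	int coord;                  /* move this child represents */
	float prior_playouts;       /* virtual visits */
	float prior_wins;           /* virtual wins */
	struct child *sibling;
};

#define DCNN_PRIOR_EQUIV 20.0f      /* how much weight the net's opinion gets */

static void
seed_root_priors(struct child *children, const float dcnn_prob[])
{
	for (struct child *n = children; n; n = n->sibling) {
		n->prior_playouts += DCNN_PRIOR_EQUIV;
		n->prior_wins     += DCNN_PRIOR_EQUIV * dcnn_prob[n->coord];
	}
}
```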

@sthalik
Contributor Author

sthalik commented Oct 14, 2018

Pachi uses it for the root node only, whereas AlphaGo and others use it for every node

Would it be better to consult the network for at least a couple dozen moves? Is this a TODO item, or is there deeper reasoning behind it? Do you prefer fast or heavy playouts?

Another area where you could help would be finding good parameters for MM (#93).

How many mid-range gaming GPUs would you need running 24/7, and for how long?

@lemonsqueeze
Collaborator

Would it be better to consult the network for at least a couple dozen moves? Is this a TODO item, or is there deeper reasoning behind it? Do you prefer fast or heavy playouts?

In theory, yes. I did experiments with that early on; the problem is that it introduces time and prior imbalance between the dcnn and non-dcnn nodes, but maybe there's a way to make it work.

The big TODO item here is a GPU-mode Pachi with dcnn at every node; this would make it about 2 stones stronger. IIRC from the DarkForest paper there are some issues with parallel dcnn to work around, but it shouldn't be too hard. I don't have a GPU so I'm not interested in that, but that's the way forward strength-wise. It would also make it possible to use a value network for evaluation instead of playouts, but at that point you might as well just use leela-zero =)
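
The parallel-dcnn part would presumably end up as some kind of shared batching queue, very roughly like this (sketch only, not based on the DarkForest code; run_network_batch() is a hypothetical GPU call):

```c
/* Very rough sketch of batched dcnn evaluation for multi-threaded search:
 * search threads enqueue leaf positions and block, a single GPU thread runs
 * the network on a full batch and wakes them up.  All names are illustrative;
 * a real version would also flush partial batches on a timeout. */
#include <pthread.h>

#define BATCH_SIZE 8
#define BOARD_PTS  361

struct dcnn_request {
	float planes[BOARD_PTS];   /* input features for one position */
	float result[BOARD_PTS];   /* move probabilities, filled by GPU thread */
	int   done;
};

extern void run_network_batch(struct dcnn_request *reqs[], int n);  /* hypothetical */

static struct dcnn_request *queue[BATCH_SIZE];
static int queued;
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t batch_ready = PTHREAD_COND_INITIALIZER;
static pthread_cond_t batch_done  = PTHREAD_COND_INITIALIZER;

/* Called by a search thread when it reaches a leaf that needs dcnn priors. */
static void
dcnn_evaluate_blocking(struct dcnn_request *req)
{
	pthread_mutex_lock(&mu);
	while (queued == BATCH_SIZE)          /* batch full, wait for GPU thread */
		pthread_cond_wait(&batch_done, &mu);
	req->done = 0;
	queue[queued++] = req;
	if (queued == BATCH_SIZE)
		pthread_cond_signal(&batch_ready);
	while (!req->done)
		pthread_cond_wait(&batch_done, &mu);
	pthread_mutex_unlock(&mu);
}

/* GPU thread: wait until a batch is full, evaluate it in one go, wake everyone. */
static void *
dcnn_batch_thread(void *arg)
{
	(void)arg;
	for (;;) {
		pthread_mutex_lock(&mu);
		while (queued < BATCH_SIZE)
			pthread_cond_wait(&batch_ready, &mu);
		run_network_batch(queue, queued);
		for (int i = 0; i < queued; i++)
			queue[i]->done = 1;
		queued = 0;
		pthread_cond_broadcast(&batch_done);
		pthread_mutex_unlock(&mu);
	}
	return NULL;
}
```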

@sthalik
Contributor Author

sthalik commented Jan 7, 2019

Hey,

Thanks for pointing me toward the leela-zero project. Apparently I've been living in the dark for the past few months 😃.

Using a CNN for each move can definitely change things when running a heavy multi-layer network on a GPU. Leela-zero still has problems with ladders and seki at low playout counts that Pachi never had, in my experience. Note that other programs are also using Leela's network, or Leela itself with networks trained on proprietary compute resources.

Cheers, and sorry for leaving this conversation unresolved for a bit.

@lemonsqueeze
Collaborator

Oh, I just realized there's an area that could use optimization: spatial pattern lookup in pattern.c.
Right now it's expensive: we hash and look up every point on the board (twice if there are no good local matches!).
genmove is 20% faster on my laptop if I comment out pattern_match_spatial().
Good incremental patterns would probably help; maybe it's time to resurrect board_spathash.
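
The incremental idea, roughly (sketch only; the old board_spathash code may well have worked differently):

```c
/* Sketch of incremental spatial hashing: instead of rehashing the whole
 * neighborhood of every point at pattern-matching time, keep a per-point
 * Zobrist-style hash that gets patched up whenever a nearby stone changes.
 * Names and sizes are illustrative; edge handling is omitted. */
#include <stdint.h>

#define SPAT_POINTS   24         /* points within the max spatial radius */
#define BOARD_POINTS  (21 * 21)  /* padded 19x19 board */

/* spat_zobrist[i][color]: random key for "neighbor at offset i has color" */
extern uint64_t spat_zobrist[SPAT_POINTS][4];
/* spat_offset[i]: board-index delta of the i-th point of the neighborhood */
extern int spat_offset[SPAT_POINTS];

static uint64_t spathash[BOARD_POINTS];   /* maintained incrementally */

/* Call from the board update code whenever the stone at coord c changes:
 * fix up the hash of every point whose neighborhood contains c. */
static void
spathash_update(int c, int old_color, int new_color)
{
	for (int i = 0; i < SPAT_POINTS; i++) {
		int center = c - spat_offset[i];   /* the point seeing c at offset i */
		spathash[center] ^= spat_zobrist[i][old_color];
		spathash[center] ^= spat_zobrist[i][new_color];
	}
}
```

Then pattern_match_spatial() could look up spathash[] directly instead of recomputing the neighborhood hash from scratch for every candidate point.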
