manual benchmark to crunch datatypes #11
base: master
Conversation
Very promising!! I'm surprised by this one:
Linestrength values are very low but definitely not integers. I'm quite sure such a change breaks the spectrum. We can also hardcode the type change based on the column name, given that the number of supported databases remains limited. This may be more robust eventually. Your function becomes useful to have an idea of which type to use. Do you know how to check for accuracy? You may need to override the
@erwanp "int" was getting converted to:

```python
# test if column can be converted to an integer
asint = props[col].fillna(0).astype(np.int64)
result = (props[col] - asint)
result = result.sum()
# lines with the problem
if result > -0.01 and result < 0.01:
    IsInt = True
```

I just changed the bound of `result` to a much smaller value and I was able to get no assertion errors upon checking with `assert_frame_equal`:

```python
assert_frame_equal(df1, df0, check_exact=False, rtol=0.1e-50, atol=0.1e-50)
```

I would like to proceed with implementing this lossless crunching in the main codebase. The idea would be to crunch all the data types and also resolve the current issues we are facing with parsing quanta (like #280). I tried to pass the

These errors aside, how would you like me to implement this in the codebase? Would it be good to hardcode the values of the datatypes and just set the NaN values to -1, avoiding dropping the lines themselves, as you were suggesting in the discussion of #280?
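Instead of tightening the summed-residual bound, the integer test can be made exact: summing residuals lets positive and negative errors cancel, while an element-wise comparison cannot. A minimal sketch, where `is_integer_column` is an illustrative helper name (not part of the RADIS codebase):

```python
import numpy as np
import pandas as pd

# Hypothetical helper: True if every value of a float column is exactly
# integer-valued. NaN is filled with 0 for the test only, mirroring the
# fillna(0) in the notebook snippet above.
def is_integer_column(s: pd.Series) -> bool:
    filled = s.fillna(0)
    asint = filled.astype(np.int64)
    # element-wise exact comparison: opposite-sign residuals cannot cancel,
    # unlike the summed tolerance -0.01 < result < 0.01
    return bool((filled == asint).all())

df = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [0.5, 1.0, 2.0]})
print(is_integer_column(df["a"]))  # True
print(is_integer_column(df["b"]))  # False
```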
Hello! Unfortunately I'll have very limited time this week to provide any insightful answer; maybe @dcmvdbekerom can help?
Do you have more information on the errors you're facing?
@erwanp a few are just assertions that check the datatype of columns. I don't think this would be an issue. Other than that, I could conclude from the errors that it is mainly due to the generation of
Looking good!! Requested a few small changes.
The NaN issue should be solved when loading the database, but I'm not sure yet what's the best way to do this. Will let you know when I think of something!
```python
if not np.isfinite(props[col]).all():
    NAlist.append(col)
    props[col].fillna(-1, inplace=True)
```
I agree with Erwan that we should do this conversion at an earlier step, namely when loading the database. When that is implemented, at this point you can assume all data is indeed finite.
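A sketch of what that earlier step could look like (the function name and the -1 sentinel are assumptions for illustration, not the actual RADIS loader): sanitize non-finite values once, right after parsing, so the crunching step can rely on finite data:

```python
import numpy as np
import pandas as pd

# Hypothetical helper: replace inf/NaN with a sentinel (-1, as discussed in
# #280) in every numeric column, so later dtype crunching can assume all
# values are finite.
def fill_non_finite(df: pd.DataFrame, sentinel=-1) -> pd.DataFrame:
    out = df.copy()
    for col in out.select_dtypes(include=[np.number]).columns:
        if not np.isfinite(out[col]).all():
            out[col] = out[col].replace([np.inf, -np.inf], np.nan).fillna(sentinel)
    return out

df = pd.DataFrame({"v": [1.0, np.nan, np.inf]})
print(fill_non_finite(df)["v"].tolist())  # [1.0, -1.0, -1.0]
```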
```python
elif mx < 4294967295:
    props[col] = props[col].astype(np.uint32)
else:
    props[col] = props[col].astype(np.uint64)
```
For readability, I would write the limits as `(1 << 8) - 1`, `(1 << 16) - 1`, `(1 << 32) - 1`, etc.
I believe that the `- 1` part is not even necessary: since these are integers and you're using "less than" (without the "or equal to"), you could use `(1 << 8)`, `(1 << 16)`, and `(1 << 32)` as limits.
`a << b` means "bitshift integer `a` by `b` positions to the left", so `1 << b` is shorthand for `2**b`.
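The equivalence can be checked directly:

```python
import numpy as np

# 1 << b shifts 1 left by b bits, i.e. 2**b
assert (1 << 8) == 2**8 == 256
assert (1 << 32) - 1 == 4294967295  # the uint32 limit hardcoded in the notebook
assert np.iinfo(np.uint32).max == (1 << 32) - 1
# with a strict "<", the test "mx < (1 << 32)" accepts exactly the same
# integers as "mx <= (1 << 32) - 1", so the "- 1" can indeed be dropped
```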
```python
# Print new column type
print("dtype after: ", props[col].dtype)
print("******************************")
```
As Hajime Kawahara recently pointed out, conversion to float32 may be too restrictive for the line position (v0): https://radis-radiation.slack.com/archives/C01G3117DJS/p1625716717008500?thread_ts=1625633254.001800&cid=C01G3117DJS. For all other decimals float32 is probably fine.
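To illustrate the concern (the line-position value below is made up for illustration, not taken from HITRAN):

```python
import numpy as np

# float32 carries only ~7 significant decimal digits, so a wavenumber of
# order 1000 cm-1 with 6 decimals cannot be stored exactly; float64 keeps
# ~16 significant digits and preserves it.
v0 = 2325.123456  # illustrative line position in cm-1
print(np.float32(v0))  # rounded: the last decimals are lost
print(np.float64(v0))  # 2325.123456 preserved
```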
BTW, cool function: `np.finfo`. It shows the relative accuracy of a float dtype; it may be used to check whether our dtype reductions are valid (as we know the number of significant digits used in the HITRAN database).
https://stackoverflow.com/questions/39134956/accuracy-of-float32
Related: see the excellent f32/f64 benchmark in Exojax: HajimeKawahara/exojax#106
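For example, `np.finfo` exposes both the machine epsilon and the number of reliable decimal digits of a dtype, which could be compared against the significant digits in the database:

```python
import numpy as np

# machine epsilon: smallest relative spacing between representable floats
print(np.finfo(np.float32).eps)        # ~1.19e-07
print(np.finfo(np.float64).eps)        # ~2.22e-16
# approximate number of decimal digits the dtype preserves
print(np.finfo(np.float32).precision)  # 6
print(np.finfo(np.float64).precision)  # 15
```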
Brief Description
Contains a .ipynb with a function to crunch datatypes. It was able to crunch datatypes from 565.28 MB down to 349.77 MB, i.e. to about 61.875% of the original size. The compression is lossless.
Edit @erwanp: file visible here https://github.com/radis/radis-benchmark/blob/e512ecfcdd6d399f10d2fca19d9b0ab6ae475904/manual_benchmarks/test_datatypes_crunch.ipynb
Proposed solution for implementation in Radis
Pass the pandas data frame through this function before returning it.
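A minimal sketch of such a function (illustrative names, not the actual RADIS API or the notebook's exact implementation): downcast each numeric column to the smallest dtype that round-trips exactly, assuming all values are finite because NaN/inf have already been handled at load time:

```python
import numpy as np
import pandas as pd

# Hypothetical crunching pass: try progressively larger dtypes and keep the
# first one whose round-trip back to float64 is exactly lossless.
def crunch_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in out.select_dtypes(include=[np.number]).columns:
        original = out[col].to_numpy(dtype=np.float64)
        for dtype in (np.uint8, np.uint16, np.uint32, np.int32, np.float32):
            if np.issubdtype(dtype, np.integer):
                info = np.iinfo(dtype)
                if original.min() < info.min or original.max() > info.max:
                    continue  # value range does not fit this integer dtype
            candidate = original.astype(dtype)
            # keep the downcast only if it is exactly lossless
            if np.array_equal(candidate.astype(np.float64), original):
                out[col] = candidate
                break
    return out

df = pd.DataFrame({"small": [0.0, 255.0], "frac": [0.5, 1.5]})
print(crunch_dtypes(df).dtypes)  # small -> uint8, frac -> float32
```

Columns that fit no smaller dtype simply stay float64, so the pass can never lose information.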