sds performance peculiarities #284

Open
zhangyingmath opened this issue Jan 11, 2022 · 1 comment

zhangyingmath commented Jan 11, 2022

This is not necessarily a bug, but it really surprised me as a user.
Create a test dataset of 80 million rows with one float64 column:
ds = rt.Dataset({'f' : [1.0 for i in range(int(8e7))]})
Save to sds: 139ms. Load back: 220ms. Size on disk: 60KB.

Create a similar dataset, but with an int32 column:
Save to sds: 69ms. Load back: 116ms. Size on disk: 31KB.

Create a similar dataset, but with an int64 column:
ds = rt.Dataset({'f' : [1 for i in range(int(8e7))]})
Save to sds: 1.5s. Load back: 1.1s. Size on disk: 20MB.

Q: Why is int64 about 10 times slower than float64, and why is its file over 300 times larger?
For comparison, when I use h5 the performance is similar for float and int: save takes 460ms, load takes 350ms, and the size on disk is 2.6MB.
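The measurements above (save time, load time, size on disk) can be reproduced with a small harness along these lines. This is only a sketch: `measure_roundtrip` is a hypothetical helper, not part of riptable, and the commented riptable usage assumes `Dataset.save` and `rt.load_sds` accept a file path; that has not been checked against version 1.0.57.

```python
import os
import tempfile
import time


def measure_roundtrip(save, load, path):
    """Time save(path) and load(path), then report the file size.

    `save` and `load` are any callables that take a file path.
    Wall-clock time is measured with time.perf_counter().
    """
    start = time.perf_counter()
    save(path)
    save_seconds = time.perf_counter() - start

    start = time.perf_counter()
    loaded = load(path)
    load_seconds = time.perf_counter() - start

    return {
        "save_s": save_seconds,
        "load_s": load_seconds,
        "bytes_on_disk": os.path.getsize(path),
        "loaded": loaded,
    }


if __name__ == "__main__":
    # Stand-in payload so the sketch runs without riptable installed.
    payload = b"\x01" * 1_000_000

    def save_bytes(path):
        with open(path, "wb") as f:
            f.write(payload)

    def load_bytes(path):
        with open(path, "rb") as f:
            return f.read()

    with tempfile.TemporaryDirectory() as tmp:
        result = measure_roundtrip(save_bytes, load_bytes, os.path.join(tmp, "test.bin"))
        print(result["save_s"], result["load_s"], result["bytes_on_disk"])

    # Assumed riptable usage (not verified here):
    #   ds = rt.Dataset({'f': [1 for i in range(int(8e7))]})
    #   measure_roundtrip(ds.save, rt.load_sds, "test.sds")
```

Running each dtype through the same helper keeps the comparison consistent, since every case is timed and sized the same way.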

zhangyingmath changed the title from "why sds has big performance difference on float versus int?" to "why sds has big performance difference on int64 versus int32?" on Jan 11, 2022
zhangyingmath (author) commented

This is with rt version 1.0.57. I expanded the experiment above a bit. Here is a table:

N = int(80e6)

| value | dtype | format | save total time | save wall time | load total time | load wall time | size on disk | size in RAM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| const 1, repeat N times | np.int32 | sds | 48.5ms | 48.3ms | 223ms | 112ms | 31KB | 305MB |
| const 1, repeat N times | np.int64 | sds | 3s | 1.5s | 2s | 1s | 20MB | 610MB |
| const 1, repeat N times | np.float32 | sds | 97ms | 49ms | 225ms | 112ms | 31KB | 305MB |
| const 1, repeat N times | np.float64 | sds | 199ms | 100ms | 453ms | 227ms | 60KB | 610MB |
| const '111111', repeat N times | S6 | sds | 371ms | 186ms | 215ms | 107ms | 17KB | 458MB |
| range(N) | np.int32 | sds | 1s | 542ms | 630ms | 316ms | 207MB | 305MB |
| range(N) | np.int64 | sds | 3s | 1.6s | 1.8s | 923ms | 78MB | 610MB |
| range(N) | np.float32 | sds | 907ms | 460ms | 728ms | 364ms | 70MB | 305MB |
| range(N) | np.float64 | sds | 1.5s | 1.5s | 1.9s | 960ms | 59MB | 610MB |
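The in-memory sizes in the table follow directly from N times the element width. The sketch below reproduces them, assuming "MB" in the table means 2**20 bytes (the numbers only match under that assumption). The helper names `size_in_ram_mb` and `compression_ratio` are illustrative and not part of riptable.

```python
N = int(80e6)  # row count used in the issue

# Bytes per element for each dtype in the table.
# S6 is a fixed-width byte string of 6 characters.
ITEMSIZE = {"int32": 4, "int64": 8, "float32": 4, "float64": 8, "S6": 6}


def size_in_ram_mb(dtype, n=N):
    """Uncompressed array size, in units of 2**20 bytes."""
    return n * ITEMSIZE[dtype] / 2**20


def compression_ratio(ram_mb, disk_mb):
    """How many times smaller the file is than the in-memory array."""
    return ram_mb / disk_mb


for dtype in ITEMSIZE:
    print(f"{dtype:8s} {size_in_ram_mb(dtype):6.0f} MB")

# Constant int64 column in sds: 610MB in RAM against 20MB on disk.
print(compression_ratio(610, 20))
```

With the figures from the table, the constant int64 column compresses about 30x in sds, while the constant float64 column (610MB to 60KB) compresses by roughly four orders of magnitude, which is the gap the question is about.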

For comparison, here is a similar table for h5. The files were written with the hdf5 package using its default compression (blosc:lz4, compression level 6), as a dict of numpy arrays, and loaded with rt.load_h5, which calls the same underlying package.

| value | dtype | format | save total time | save wall time | load total time | load wall time | size on disk | size in RAM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| const 1, repeat N times | np.int32 | h5 | 233ms | 233ms | 180ms | 180ms | 1.4MB | 305MB |
| const 1, repeat N times | np.int64 | h5 | 480ms | 480ms | 404ms | 424ms | 2.6MB | 610MB |
| const 1, repeat N times | np.float32 | h5 | 232ms | 232ms | 185ms | 184ms | 1.4MB | 305MB |
| const 1, repeat N times | np.float64 | h5 | 525ms | 451ms | 357ms | 357ms | 2.6MB | 610MB |
| const '111111', repeat N times | S6 | h5 | 438ms | 438ms | 440ms | 440ms | 2.0MB | 458MB |
| range(N) | np.int32 | h5 | 233ms | 233ms | 180ms | 180ms | 2.8MB | 305MB |
| range(N) | np.int64 | h5 | 499ms | 500ms | 367ms | 367ms | 4.1MB | 610MB |
| range(N) | np.float32 | h5 | 236ms | 236ms | 171ms | 174ms | 3.0MB | 305MB |
| range(N) | np.float64 | h5 | 525ms | 527ms | 368ms | 374ms | 5.2MB | 610MB |
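To make the two tables easier to compare, the save wall times can be turned into sds/h5 ratios. The sketch below copies the timings from the tables and assumes every row of the second table is an h5 measurement. `parse_seconds` and `sds_over_h5` are hypothetical helpers written for this illustration.

```python
import re


def parse_seconds(text):
    """Convert a timing string such as '48.3ms' or '1.5s' to seconds."""
    match = re.fullmatch(r"([0-9.]+)(ms|s)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized timing: {text!r}")
    value, unit = match.groups()
    return float(value) * (1e-3 if unit == "ms" else 1.0)


# (value pattern, dtype): (sds save wall time, h5 save wall time),
# copied from the two tables in this issue.
SAVE_WALL = {
    ("const", "int32"): ("48.3ms", "233ms"),
    ("const", "int64"): ("1.5s", "480ms"),
    ("const", "float32"): ("49ms", "232ms"),
    ("const", "float64"): ("100ms", "451ms"),
    ("const", "S6"): ("186ms", "438ms"),
    ("range", "int32"): ("542ms", "233ms"),
    ("range", "int64"): ("1.6s", "500ms"),
    ("range", "float32"): ("460ms", "236ms"),
    ("range", "float64"): ("1.5s", "527ms"),
}


def sds_over_h5(key):
    """Ratio of sds to h5 save wall time; below 1 means sds was faster."""
    sds, h5 = SAVE_WALL[key]
    return parse_seconds(sds) / parse_seconds(h5)


for key in SAVE_WALL:
    print(key, f"{sds_over_h5(key):.2f}x")
```

A ratio below 1 means sds saved faster than h5 for that row, and a ratio above 1 means it was slower.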

zhangyingmath changed the title from "why sds has big performance difference on int64 versus int32?" to "sds performance peculiarities" on Jan 11, 2022