sds performance peculiarities #284

zhangyingmath · 2022-01-11T02:43:46Z

This is not necessarily a bug, but it really surprised me as a user.
Create a test dataset of 80 million rows with a single float64 column:
ds = rt.Dataset({'f' : [1.0 for i in range(int(8e7))]})
Save to sds: 139ms. Load back: 220ms. Size on disk: 60KB.

Create a similar dataset, but with an int32 column:
Save to sds: 69ms. Load back: 116ms. Size on disk: 31KB.

Create a similar dataset, but with an int64 column:
ds = rt.Dataset({'f' : [1 for i in range(int(8e7))]})
Save to sds: 1.5s. Load back: 1.1s. Size on disk: 20MB.

Q: Why is int64 about 10 times slower than float64 to save, and about 300 times larger on disk?
For comparison, when I use h5 the performance is similar regardless of float or int: save takes 460ms, load takes 350ms, and the size on disk is 2.6MB.
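
The procedure above (build a constant column, save it, load it back, record the timings and the size on disk) can be sketched as a small harness. This is only a rough illustration using the Python standard library: `array` plus `zlib` stand in for the real writer, and the names `measure` and `roundtrip` are hypothetical helpers, not part of riptable. It records both process CPU time and wall time, on the assumption that this is the distinction between "total" and "wall" time in such reports.

```python
import array
import os
import tempfile
import time
import zlib


def measure(fn):
    """Run fn() once; return (result, cpu_seconds, wall_seconds)."""
    cpu0, wall0 = time.process_time(), time.perf_counter()
    result = fn()
    return result, time.process_time() - cpu0, time.perf_counter() - wall0


def roundtrip(typecode, value, n, directory):
    """Save and reload a constant column; report timings and sizes."""
    column = array.array(typecode, [value]) * n  # n copies of value
    path = os.path.join(directory, f"const_{typecode}.bin")

    def save():
        # Stand-in writer (an assumption): zlib level 1, not the sds format.
        with open(path, "wb") as f:
            f.write(zlib.compress(column.tobytes(), 1))

    def load():
        with open(path, "rb") as f:
            out = array.array(typecode)
            out.frombytes(zlib.decompress(f.read()))
            return out

    _, save_cpu, save_wall = measure(save)
    loaded, load_cpu, load_wall = measure(load)
    assert loaded == column  # the round trip must be lossless
    return {
        "typecode": typecode,
        "rows": n,
        "save_cpu_s": save_cpu,
        "save_wall_s": save_wall,
        "load_cpu_s": load_cpu,
        "load_wall_s": load_wall,
        "disk_bytes": os.path.getsize(path),
        "ram_bytes": column.itemsize * n,
    }


if __name__ == "__main__":
    n = 1_000_000  # far smaller than the 80 million rows in the report
    with tempfile.TemporaryDirectory() as tmp:
        # 'i' = C int, 'q' = 8-byte integer, 'd' = 8-byte float
        for typecode, value in [("i", 1), ("q", 1), ("d", 1.0)]:
            print(roundtrip(typecode, value, n, tmp))
```

To reproduce the actual numbers, the `save` and `load` callables would presumably wrap the riptable calls instead (for example `ds.save(path)` and `rt.load_sds(path)`, assuming those are the calls behind "save to sds" and "load back").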

zhangyingmath · 2022-01-11T16:35:25Z

This is with rt version 1.0.57. I expanded the experiment above a bit; here is a table:

N = int(80e6)

| value | dtype | format | save total time | save wall time | load total time | load wall time | size on disk | size in RAM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| const 1, repeat N times | np.int32 | sds | 48.5ms | 48.3ms | 223ms | 112ms | 31KB | 305MB |
| const 1, repeat N times | np.int64 | sds | 3s | 1.5s | 2s | 1s | 20MB | 610MB |
| const 1, repeat N times | np.float32 | sds | 97ms | 49ms | 225ms | 112ms | 31KB | 305MB |
| const 1, repeat N times | np.float64 | sds | 199ms | 100ms | 453ms | 227ms | 60KB | 610MB |
| const '111111', repeat N times | S6 | sds | 371ms | 186ms | 215ms | 107ms | 17KB | 458MB |
| range(N) | np.int32 | sds | 1s | 542ms | 630ms | 316ms | 207MB | 305MB |
| range(N) | np.int64 | sds | 3s | 1.6s | 1.8s | 923ms | 78MB | 610MB |
| range(N) | np.float32 | sds | 907ms | 460ms | 728ms | 364ms | 70MB | 305MB |
| range(N) | np.float64 | sds | 1.5s | 1.5s | 1.9s | 960ms | 59MB | 610MB |
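
The in-memory sizes in the last column follow directly from N times the item size. A quick check, assuming "MB" in the table means MiB (2^20 bytes) and that the values are rounded to the nearest integer; `ITEM_BYTES` and `ram_mib` are names made up for this sketch:

```python
N = int(80e6)
MIB = 1024 ** 2  # assumption: the table's "MB" is really MiB

# Bytes per element for each dtype in the table.
ITEM_BYTES = {"int32": 4, "int64": 8, "float32": 4, "float64": 8, "S6": 6}


def ram_mib(dtype, n=N):
    """Uncompressed size of an n-element column, rounded to whole MiB."""
    return round(n * ITEM_BYTES[dtype] / MIB)


for dtype in ITEM_BYTES:
    print(f"{dtype:8s} {ram_mib(dtype)} MiB")
```

Under that assumption this gives 305 for the 4-byte dtypes, 610 for the 8-byte dtypes, and 458 for S6, matching the table.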

For comparison, here is a similar table for h5. The files were written with the hdf5 package using its default compression (blosc:lz4, compression level 6), as a dict of numpy arrays. They were loaded with rt.load_h5, which calls the same underlying package.

| value | dtype | format | save total time | save wall time | load total time | load wall time | size on disk | size in RAM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| const 1, repeat N times | np.int32 | h5 | 233ms | 233ms | 180ms | 180ms | 1.4MB | 305MB |
| const 1, repeat N times | np.int64 | h5 | 480ms | 480ms | 404ms | 424ms | 2.6MB | 610MB |
| const 1, repeat N times | np.float32 | h5 | 232ms | 232ms | 185ms | 184ms | 1.4MB | 305MB |
| const 1, repeat N times | np.float64 | h5 | 525ms | 451ms | 357ms | 357ms | 2.6MB | 610MB |
| const '111111', repeat N times | S6 | h5 | 438ms | 438ms | 440ms | 440ms | 2.0MB | 458MB |
| range(N) | np.int32 | h5 | 233ms | 233ms | 180ms | 180ms | 2.8MB | 305MB |
| range(N) | np.int64 | h5 | 499ms | 500ms | 367ms | 367ms | 4.1MB | 610MB |
| range(N) | np.float32 | h5 | 236ms | 236ms | 171ms | 174ms | 3.0MB | 305MB |
| range(N) | np.float64 | h5 | 525ms | 527ms | 368ms | 374ms | 5.2MB | 610MB |

zhangyingmath changed the title ~~why sds has big performance difference on float versus int?~~ why sds has big performance difference on int64 versus int32? Jan 11, 2022

zhangyingmath changed the title ~~why sds has big performance difference on int64 versus int32?~~ sds performance peculiarities Jan 11, 2022
