Sparse matrix format for count and binary data & setting dtype of tensors


#1

Hello,

I am working on a factor analysis model using greta on large count and binary matrices where most values are zero. I know it is possible to convert the R Matrix dgCMatrix format to numpy sparse arrays with reticulate, so I am wondering whether it would be straightforward to add support for sparse matrices in greta::as_data.
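For context, the conversion I mean is the one reticulate already performs automatically (a minimal sketch; it assumes scipy is installed in the active Python environment):

```r
library(Matrix)
library(reticulate)

# a 5000 x 20000 matrix at 1% density: small in compressed-column
# (dgCMatrix) form, but ~800 MB if stored densely as doubles
m <- rsparsematrix(5000, 20000, density = 0.01)

# reticulate maps a dgCMatrix to a scipy.sparse CSC matrix
m_py <- r_to_py(m)
```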

Also on the topic of large data: in the current version, is it possible to change the dtype of tensor objects from float64 to float32, or from int64 to int32/int16, to reduce the memory footprint of the data and of large parameter tensors?

Thanks a lot,

Vitalii


#2

Unfortunately that’s not possible yet. We’re hoping to add support for sparse greta arrays in the future, but it probably won’t be for a while.

TensorFlow doesn’t have great sparse matrix support (not as good as numpy, unfortunately), but it can do an efficient matrix multiplication between a dense and a sparse matrix. If this efficiency is important for your model, and you don’t mind digging a little deeper into greta’s internals, I could probably help you put together a custom greta operation to do this efficiently. What is the operation, exactly?
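The op I have in mind is tf.sparse.sparse_dense_matmul (tf.sparse_tensor_dense_matmul in older TensorFlow releases). It computes sparse %*% dense, so a dense-by-sparse product has to go through transposes. A rough sketch via the R tensorflow package, where idx, vals, and w are placeholders for your sparse data and dense parameters:

```r
library(tensorflow)

# the data as a D x N SparseTensor; indices are 0-based (row, col) pairs
sp_data <- tf$SparseTensor(
  indices = idx,                   # m x 2 integer matrix of non-zero positions
  values = vals,                   # the m non-zero values
  dense_shape = c(5000L, 20000L)   # D x N
)

# w %*% data, with w a dense K x D tensor:
# adjoint_a/adjoint_b compute t(data) %*% t(w) (an N x K result),
# which is then transposed back to K x N
res <- tf$transpose(
  tf$sparse$sparse_dense_matmul(sp_data, w,
                                adjoint_a = TRUE,
                                adjoint_b = TRUE)
)
```

Wiring that in would mean wrapping it as a custom greta operation, which is the part I could help with.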

You can switch to single precision easily enough with the precision argument to greta::model(). That only applies to floats, though; greta doesn’t explicitly handle integer greta arrays at the moment.
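For example, a minimal sketch:

```r
library(greta)

# some count data; greta stores data as floating point
x <- as_data(matrix(rpois(20, lambda = 2), 4, 5))

lambda <- lognormal(0, 1, dim = c(4, 5))
distribution(x) <- poisson(lambda)

# precision = "single" makes all tensors float32 instead of float64
m <- model(lambda, precision = "single")
```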


#3

Thanks for your explanation. I don’t mind digging deeper into greta’s internals, especially if that’s faster than rewriting my model directly in TensorFlow or PyTorch.
Without going into too much detail: I need to multiply a dense matrix of parameters by a sparse matrix of data, c %*% data, and then define a Poisson distribution over that sparse data matrix, distribution(data) = poisson(mu). This operation happens only once in the model, so my main reason for using a sparse format is the data size, ~5000 * 20000 (dimensions * data points), which can grow even larger. So, would it be possible to define a custom operation for sparse data, rather than sparse parameters?

Another way to address my memory problem would be to supply fewer data points in each TensorFlow batch (e.g. 20 batches of 1000 points). As far as I understand, the batch size is currently all of the data, to enable your way of doing MCMC. Would it be possible to change the batch size, at least for likelihood/posterior optimisation with opt()?