Concurrent read and write


#1

Dear all,

Could you tell me whether SciDB supports concurrent reads and writes?

I am currently using netCDF to store my large data sets from parallel Python. netCDF supports parallel reads, but not parallel writes.

Best regards,

Xujun Han


#2

SciDB supports …

( 1 ) Concurrent readers over the same data.

( 2 ) One writer at a time.

( 3 ) Readers can still be working while data is being written to a SciDB array, but the writer’s data is only visible to readers once the writer’s transaction is complete and committed.

In other words, we support concurrent readers with at most one writer.
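To make the three points above concrete, here is a minimal toy sketch in plain Python of "many readers, one writer, changes visible only at commit". It is not SciDB code and says nothing about how SciDB is implemented; `SnapshotArray`, `begin_write` and `pending` are hypothetical names invented for this illustration.

```python
import threading


class SnapshotArray:
    """Toy single-writer, many-reader store with commit-time visibility."""

    def __init__(self, values):
        self._committed = tuple(values)      # immutable snapshot that readers see
        self._write_lock = threading.Lock()  # admits one writer at a time

    def read(self):
        # Readers never wait for the writer: they get the last committed snapshot.
        return self._committed

    def begin_write(self):
        return _WriteTxn(self)


class _WriteTxn:
    def __init__(self, store):
        self._store = store

    def __enter__(self):
        self._store._write_lock.acquire()
        self.pending = list(self._store._committed)  # private working copy
        return self

    def __exit__(self, exc_type, exc, tb):
        try:
            if exc_type is None:
                # Commit: publish the new snapshot with a single reference swap.
                self._store._committed = tuple(self.pending)
            # On an error the working copy is simply dropped.
        finally:
            self._store._write_lock.release()
        return False  # do not swallow exceptions


arr = SnapshotArray([1, 2, 3])
with arr.begin_write() as txn:
    txn.pending.append(4)
    seen_during_write = arr.read()   # reader runs while the write is in progress
seen_after_commit = arr.read()       # reader runs after the commit
```

In this sketch a reader that runs during the write still gets the old contents, and the appended value only shows up for readers that start after the `with` block exits cleanly.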

I wasn’t aware that netCDF provided any transactional guarantees. Would you be so kind as to post a link to a location where this is described? Doing transactions right is not easy and my hat would be off to that team if they’ve achieved it.


#3

Dear Plumber,

Thanks for your reply. So it is not necessary for me to move to SciDB; it has the same restriction as netCDF.

From https://www.unidata.ucar.edu/software/netcdf/docs/netcdf/Limitations.html: “One writer and multiple readers may access data in a single dataset simultaneously, but there is no support for multiple concurrent writers.”

I don’t know what a transactional guarantee is, but concurrent reads work very well.


#4

[quote=“Hanxujun”]Dear Plumber,

Thanks for your reply. So it is not necessary for me to move to SciDB; it has the same restriction as netCDF.

From https://www.unidata.ucar.edu/software/netcdf/docs/netcdf/Limitations.html: “One writer and multiple readers may access data in a single dataset simultaneously, but there is no support for multiple concurrent writers.”

I don’t know what a transactional guarantee is, but concurrent reads work very well.[/quote]
The scope of a transaction is a single array. I guess a write operation on an array is atomic.


#5

Well … just to clear up what might develop into some confusion …

Transactional quality of service guarantees mean a lot more than concurrent readers / writers. They mean …

  1. If the change you’re making to your data files fails for some reason … such as the writer program crashes … transactional systems guarantee that it will leave the data unaffected. This is atomicity.
  2. If you have a series of small writes – appending data to the end of the data set, say – concurrent readers are completely unaware of these changes. In fact, concurrent readers may have different views of the data, depending on when they start, relative to write operations. This is transactional isolation.
  3. Once you’ve written the data, SciDB makes copies of it to ensure that if one copy disappears when a piece of your hardware blows up, we have a spare. That’s durability.
  4. And when you write to a SciDB array, we ensure that any data in the system complies with any of the rules you’ve said the array should obey. That’s consistency.
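For anyone working with plain files, the atomicity point in item 1 can be approximated by hand. The sketch below uses a common write-to-a-temporary-file-then-rename pattern from the Python standard library; it only illustrates the idea and is not how SciDB or netCDF work internally. `atomic_write` and `crashing_chunks` are hypothetical names made up for the example.

```python
import os
import tempfile


def atomic_write(path, chunks):
    """Replace the file at `path` with the given byte chunks, or leave it untouched."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)  # same directory, same filesystem
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
            tmp.flush()
            os.fsync(tmp.fileno())   # push the bytes to disk before publishing
        os.replace(tmp_path, path)   # the rename is the "commit" step
    except BaseException:
        # Failure before the rename: discard the partial file, keep the old data.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def crashing_chunks():
    """Simulated writer that dies part-way through (assumption for the demo)."""
    yield b"partial "
    raise RuntimeError("writer crashed mid-write")


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "data.bin")

atomic_write(target, [b"version 1"])
try:
    atomic_write(target, crashing_chunks())
except RuntimeError:
    pass

with open(target, "rb") as f:
    contents = f.read()              # still the data from the first, complete write
```

Note that this covers only atomicity for a single file. Isolation between readers, redundant copies, and rule checking (items 2 to 4) would each need their own machinery.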

The important point is that all of this happens without developers and users needing to do anything at all.

Now … after reading your note, I went and had a look at netCDF with specific attention being paid to the way this concurrent reader / writer is handled. I first headed here … unidata.ucar.edu/software/ne … brary.html … where it says this:

Then I went to have a look at this nc_sync() facility … unidata.ucar.edu/software/n … fsync.html … where it says;

So what this says (to me) is that if you’re really careful, you can implement one-writer / multiple-readers access to your data in netCDF files. But the actual implementation is entirely up to you. And for various technical reasons, I don’t think the underlying mechanism is especially viable for large numbers of concurrent readers. For example, each concurrent reader of a data set will require a distinct copy of the data in memory, and it’s not clear to me how a reader ought to determine when to call nc_sync, as that might be a pretty expensive operation.
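As a rough idea of what "up to you" can involve, the sketch below shows one small piece of such coordination: making sure only one writer process touches a data set at a time, using an advisory lock file. It assumes a Unix-like system (it relies on `fcntl.flock`), it does not involve the netCDF library at all, and it does nothing to tell readers when to re-sync. `writer_lock` and the lock file name are hypothetical.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def writer_lock(lock_path):
    """Hold an exclusive advisory lock; raise BlockingIOError if it is already held."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # LOCK_NB makes a second writer fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


lock_file = os.path.join(tempfile.mkdtemp(), "dataset.lock")

second_writer_refused = False
with writer_lock(lock_file):
    # A second would-be writer (here simulated in the same process) is turned away.
    try:
        with writer_lock(lock_file):
            pass
    except BlockingIOError:
        second_writer_refused = True

# Once the first writer is done, the lock can be taken again.
with writer_lock(lock_file):
    reacquired = True
```

The lock is advisory, so it only works if every writer program agrees to take it before opening the data file for writing.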

Mind you, none of these observations should be read as me pointing at netCDF and laughing. Scientific file formats are designed and implemented with very specific use-cases in mind, and they prioritize the interests of different kinds of users from the ones who generally use SciDB. If you’ve got the time and skill to implement your own concurrent access control in your own programs, and doing so is critical to your application (I can think of all kinds of reasons that it might be), then you should absolutely use those tools in the best way you can.

But most of SciDB’s users aren’t programmers. So they appreciate not having to deal with all these details. :wink:

Good luck with your netCDF work! I’m always curious to learn more about how people use netCDF.


#6

This is a good lesson for me. Now I understand the difference between netCDF and SciDB.
My code is not very complex, and I use the netCDF file to share data between different processors because of limited RAM.
There is also the netcdf4-python library, which, like SciDB, is quite simple to use.
I hope concurrent writes can be implemented in SciDB :smiley: .