Added stream to CSV #337
Could you please rebase onto master? Thanks 🎉
Hey @timofurrer, I rebased onto master.
Could that feature be tested to avoid regressions?
@claudep
I'm no maintainer, but if a maintainer asked you to rebase, that's a good sign the API is fine!
There won't be a final API, but we'll adjust the versioning according to API breaks, so we don't break your code.
@claudep @timofurrer
In order to do real streaming, the input and the processing need to be streamed as well. I recently implemented the following StreamingDataset, which (together with custom streaming formats) can work with an iterator (generator) as data. It lets me export large database tables with constant memory requirements.

    from collections import OrderedDict
    from itertools import chain

    from tablib.core import Dataset, InvalidDatasetIndex, Row


    class SubscriptableIterable:
        """Iterator that still allows peeking at its first item via [0]."""

        def __init__(self, iterable):
            iterator = iter(iterable)
            # Pull the first item eagerly so that [0] can return it,
            # then chain it back so iteration still starts with it.
            self._first_item = next(iterator)
            self._iter = chain([self._first_item], iterator)

        def __getitem__(self, index):
            # Only the first item is kept in memory.
            if index == 0:
                return self._first_item
            raise IndexError

        def __iter__(self):
            return self

        def __next__(self):
            return next(self._iter)


    class StreamingDataset(Dataset):
        def __init__(self, data=None, **kwargs):
            super().__init__(**kwargs)
            if data is not None:
                self.set_data(data)

        def set_data(self, data):
            # Rows are wrapped lazily; nothing is materialized here.
            self._data = SubscriptableIterable(map(Row, data))

        def __iter__(self):
            return self

        def __next__(self):
            return next(self._data)

        def __repr__(self):
            try:
                return "<%s streaming dataset>" % (self.title.lower())
            except AttributeError:
                return "<streaming dataset>"

        def _apply_formatters(self):
            if not self._formatters:
                yield from self._data
                return

            for row in self._data:
                for col, callback in self._formatters:
                    try:
                        if col is None:
                            # No column given: format every cell in the row.
                            for j, c in enumerate(row):
                                row[j] = callback(c)
                        else:
                            row[col] = callback(row[col])
                    except IndexError:
                        raise InvalidDatasetIndex
                yield row

        def _package(self, dicts=True, ordered=True):
            """Packages Dataset into lists of dictionaries for transmission."""
            if ordered:
                dict_pack = OrderedDict
            else:
                dict_pack = dict

            data = self._apply_formatters()

            if self.headers:
                if dicts:
                    data = (dict_pack(zip(self.headers, data_row)) for data_row in data)
                else:
                    data = chain([self.headers], data)

            return data
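To illustrate what "constant memory" means on the output side, here is a minimal standard-library sketch, not the custom streaming format mentioned in the comment above. The function name `stream_csv` is hypothetical; it assumes rows arrive as an iterator of sequences and yields CSV text one row at a time, so only a single row is buffered at any moment.

```python
import csv
import io
from itertools import chain


def stream_csv(rows, headers=None):
    """Lazily yield CSV text, one encoded row per iteration.

    `rows` may be any iterable, including a generator that reads from a
    database cursor. Hypothetical helper, not part of tablib.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer)

    if headers is not None:
        # Prepend the header row without materializing `rows`.
        rows = chain([headers], rows)

    for row in rows:
        writer.writerow(row)
        yield buffer.getvalue()
        # Reset the buffer so it never holds more than one row.
        buffer.seek(0)
        buffer.truncate(0)
```

A caller could pass each yielded chunk straight to a file or an HTTP response, for example `for chunk in stream_csv(cursor, headers=["id", "name"]): out.write(chunk)`, where `cursor` and `out` stand in for whatever source and sink are in use.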
If you want to stream large sets of data, I would recommend looking into petl, which is also a pure Python tabular data library.
Hello
Here's a proof of concept of what I talked about in #207.
This PR just adds the possibility to get either a stream or its content.
I didn't change the API; I just added `seek(0)` so that the file is ready to use. I use the same code here: https://github.com/django-import-export/django-import-export/pull/821/files#diff-9c335e3895cab39c4af9e510b328c2f0R157
It's a Django app for import/export from a DB.
To integrate well, we need to get a file-like object.
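The role of `seek(0)` can be shown with a small standard-library sketch. This is not the code from the PR; `export_csv_stream` is a hypothetical name, and the sketch only assumes that the export writes into an in-memory text stream. After writing, the stream position sits at the end, so a consumer calling `read()` would get an empty string unless the stream is rewound first.

```python
import csv
import io


def export_csv_stream(headers, rows):
    """Write CSV into an in-memory stream and return it ready for reading.

    Hypothetical helper used for illustration only.
    """
    stream = io.StringIO()
    writer = csv.writer(stream)
    writer.writerow(headers)
    writer.writerows(rows)
    # Rewind: without this, read() would start at the end and return "".
    stream.seek(0)
    return stream
```

Returning the stream rather than its string content lets the caller decide whether to call `read()` for the full content or hand the file-like object on to code that expects one.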