datafusion-python integration #3334
Comments
I have a question: how do I expose _rowid and _rowaddr? It seems that the datafusion API and duckdb don't support these pseudo columns.
For the duckdb integration you can create a dataset with default scan options. You can't filter on the column yet unfortunately because pyarrow and datafusion have interpreted unsigned integers slightly differently in the filtering language (Substrait) and so there is a DF change needed.
For datafusion, you choose whether you want these columns to appear when you create the table provider.
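A minimal sketch of what that choice could look like from Python. It assumes that pylance exposes a class named `FFILanceTableProvider` accepting `with_row_id` and `with_row_addr` keyword flags, and that the installed datafusion-python provides `SessionContext.register_table_provider`. The helper name `register_lance_table` and the table name `"ds"` are hypothetical and used only for illustration.

```python
def register_lance_table(ctx, dataset, name="ds", with_row_id=True, with_row_addr=True):
    """Register a Lance dataset as a table on a datafusion SessionContext.

    ctx is expected to be a datafusion.SessionContext and dataset a
    lance.LanceDataset. Returns a DataFrame for the registered table.
    """
    # Assumption: pylance ships FFILanceTableProvider with these keyword flags.
    # Imported lazily so the helper can be defined without lance installed.
    from lance import FFILanceTableProvider

    provider = FFILanceTableProvider(
        dataset, with_row_id=with_row_id, with_row_addr=with_row_addr
    )
    # Assumption: register_table_provider is the foreign-table-provider hook
    # in the installed datafusion-python version.
    ctx.register_table_provider(name, provider)
    return ctx.table(name)
```

If those assumptions hold, calling `register_lance_table(SessionContext(), lance.dataset(path))` and then running `ctx.sql("SELECT _rowid, _rowaddr, id FROM ds")` should resolve the pseudo columns; with both flags set to `False` they would not appear in the table schema.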
Can't filter on _rowid, or on any column? I tested the following unit test:

    import lance
    import pytest

    def test_duckdb_rowid(tmp_path):
        duckdb = pytest.importorskip("duckdb")
        # create_table_for_duckdb() is a helper defined elsewhere in the test module
        tbl = create_table_for_duckdb()
        ds = lance.write_dataset(tbl, str(tmp_path))
        # Reopen the dataset so every scan includes the _rowid column
        ds = lance.dataset(str(tmp_path), default_scan_options={"with_row_id": True})
        duckdb.query("SELECT id, meta, price FROM ds WHERE id==1000").to_df()  # error
        duckdb.query("SELECT _rowid, meta, price FROM ds WHERE id==1000").to_df()  # error
        duckdb.query("SELECT _rowid, id, meta, price FROM ds").to_df()  # error
        duckdb.query("SELECT id, meta, price FROM ds").to_df()  # OK
Yes, the with_row_id and with_row_addr flags will always work, but I think Spark's SupportsMetadataColumns interface is much better.
I created a PR for datafusion to illustrate my idea for _rowid support: apache/datafusion#14057
The datafusion-python project recently added support for "foreign table providers" in apache/datafusion-python#921.
We should be able to utilize this to create a foreign table provider from lance. This would make it very easy to query lance datasets using python and would be comparable to our duckdb integration.