Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clib.conversion: Remove the array_to_datetime function, which is no longer needed #3507

Open
wants to merge 28 commits into
base: main
Choose a base branch
from

Conversation

seisman
Copy link
Member

@seisman seisman commented Oct 11, 2024

After a series of PRs (especially #3760), the _to_numpy function takes the responsibility to convert any array-like objects into numpy arrays with recognized dtypes. So, if _to_numpy returns a np.object dtype string, it means the original array cannot to converted to a numpy array that can be passed to GMT.

After #3760, the array_to_datetime function is no longer used and can be removed.

@seisman seisman added the run/benchmark Trigger the benchmark workflow in PRs label Oct 11, 2024
@seisman seisman force-pushed the remove/array-to-datetime branch from e5cf4f1 to 705aa4a Compare October 13, 2024 14:57
@seisman seisman force-pushed the remove/array-to-datetime branch from 705aa4a to d7503d1 Compare October 13, 2024 15:31
@seisman seisman force-pushed the remove/array-to-datetime branch from d7503d1 to 20b9215 Compare October 13, 2024 15:47
@@ -1388,7 +1378,7 @@ def virtualfile_from_vectors(self, *vectors):
# Assumes that first 2 columns contains coordinates like longitude
# latitude, or datetime string types.
for col, array in enumerate(arrays[2:]):
if pd.api.types.is_string_dtype(array.dtype):
if array.dtype.type == np.str_:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pd.api.types.is_string_dtype was added in PR #684. Before PR #684, we used np.issubdtype(array.dtype, np.str_), which was added in #520. Any specific reason that we should use np.issubdtype(array.dtype, np.str_), not array.dtype.type == np.str_?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll need to check if this can handle pandas.StringDtype and pyarrow.StringArray (xref #2933).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both can be converted to the numpy string dtype by the vectors_to_arrays method, so in virtualfile_from_vectors, we only need to check np.str_:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import pyarrow as pa

In [4]: x = pd.Series(["abc", "defghi"], dtype="string")

In [5]: np.asarray(x)
Out[5]: array(['abc', 'defghi'], dtype=object)

In [6]: np.asarray(x, dtype=str)
Out[6]: array(['abc', 'defghi'], dtype='<U6')

In [7]: y = pa.array(["abc", "defghi"])

In [8]: np.asarray(y)
Out[8]: array(['abc', 'defghi'], dtype=object)

In [9]: np.asarray(y, dtype=str)
Out[9]: array(['abc', 'defghi'], dtype='<U6')

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main idea of this PR is to let vectors_to_arrays do the conversion work from any dtypes (including pd.StringDtype and pa.StringArray) into the numpy dtypes, so that the virtualfile_from_vectors only need to care about how to map numpy dtypes into GMT data types.

For any special dtypes that we know how to convert it to numpy dtype, we can maintain a mapping dictionary, just like what you did to support pyarrow's date32[day] and date64[ms] in #2845:

dtypes = {
"date32[day][pyarrow]": np.datetime64,
"date64[ms][pyarrow]": np.datetime64,
}

Copy link
Member Author

@seisman seisman Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In 83673cf, I've moved most of the doctests into a separate test file pygmt/tests/test_clib_vectors_to_arrays.py. A test test_vectors_to_arrays_pandas_string is added to check that the function can handle pd.StringDtype correctly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For any special dtypes that we know how to convert it to numpy dtype, we can maintain a mapping dictionary, just like what you did to support pyarrow's date32[day] and date64[ms] in #2845:

dtypes = {
"date32[day][pyarrow]": np.datetime64,
"date64[ms][pyarrow]": np.datetime64,
}

Based on the tests below, I think we should add the entry "string": np.str_ to the dictionary:

In [1]: import pandas as pd

In [2]: x = pd.Series(["abc", "12345"])

In [3]: x.dtype
Out[3]: dtype('O')

In [4]: str(x.dtype)
Out[4]: 'object'

In [5]: x = pd.Series(["abc", "12345"], dtype="string")

In [6]: x.dtype
Out[6]: string[python]

In [7]: str(x.dtype)
Out[7]: 'string'

In [8]: x = pd.Series(["abc", "12345"], dtype="string[pyarrow]")

In [9]: x.dtype
Out[9]: string[pyarrow]

In [10]: str(x.dtype)
Out[10]: 'string'

In [11]: import pyarrow as pa

In [12]: x = pa.array(["abc", "defghi"])

In [13]: x.type
Out[13]: DataType(string)

In [14]: str(x.type)
Out[14]: 'string'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In PR #2774 and #2774, we only checked if PyGMT supports pandas with the pyarrow backend, but didn't check if the original pyarrow arrays works. For example, for a pyarrow date32 array, we need to check array.type rather than array.dtype:

In [1]: import datetime

In [2]: import pyarrow as pa

In [3]: x = pa.array([datetime.date(2020, 1, 1), datetime.date(2021, 12, 31)])

In [4]: str(x.type)
Out[4]: 'date32[day]'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, for a pyarrow date32 array, we need to check array.type

Yes, raw pyarrow arrays of date32/date64 type are not supported yet. That's why it's marked 🚧 in #2800 (I was planning to modify array_to_datetime to handle it).

pygmt/clib/conversion.py Outdated Show resolved Hide resolved
@seisman seisman changed the title WIP: Refactor vectors_to_arrays and deprecate the array_to_datetime function clib.conversion: Deal with np.object dtype in vectors_to_arrays and deprecate the array_to_datetime function Oct 13, 2024
@seisman seisman added maintenance Boring but important stuff for the core devs needs review This PR has higher priority and needs review. labels Oct 13, 2024
@seisman seisman added this to the 0.14.0 milestone Oct 13, 2024
@seisman seisman marked this pull request as ready for review October 13, 2024 16:24
@seisman seisman force-pushed the remove/array-to-datetime branch from aa3ffc3 to 6338cde Compare October 14, 2024 10:47
@seisman seisman force-pushed the remove/array-to-datetime branch from a029191 to 54160bf Compare October 14, 2024 11:23
@seisman seisman removed the needs review This PR has higher priority and needs review. label Oct 15, 2024
@seisman seisman marked this pull request as draft October 16, 2024 03:21
@seisman seisman marked this pull request as ready for review October 17, 2024 09:45
@seisman seisman added the needs review This PR has higher priority and needs review. label Oct 28, 2024
@seisman seisman marked this pull request as ready for review October 28, 2024 04:33
@seisman seisman self-assigned this Oct 30, 2024
@seisman seisman removed the needs review This PR has higher priority and needs review. label Nov 5, 2024
pygmt/tests/test_clib_vectors_to_arrays.py Outdated Show resolved Hide resolved
pygmt/clib/conversion.py Outdated Show resolved Hide resolved
@seisman
Copy link
Member Author

seisman commented Nov 6, 2024

Will revisit changes in this PR after #3583, #3584, #3585.

@seisman seisman marked this pull request as draft November 6, 2024 02:40
@seisman seisman changed the title clib.conversion: Deal with np.object dtype in vectors_to_arrays and deprecate the array_to_datetime function clib.conversion: Remove the array_to_datetime function, which is not longer needed Nov 28, 2024
@weiji14 weiji14 changed the title clib.conversion: Remove the array_to_datetime function, which is not longer needed clib.conversion: Remove the array_to_datetime function, which is no longer needed Nov 28, 2024
@seisman seisman removed this from the 0.14.0 milestone Dec 18, 2024
@seisman seisman added this to the 0.15.0 milestone Jan 13, 2025
@seisman seisman added skip-changelog Skip adding Pull Request to changelog needs review This PR has higher priority and needs review. and removed run/benchmark Trigger the benchmark workflow in PRs labels Jan 13, 2025
Comment on lines -937 to -940
# For 1-D arrays, try to convert unknown object type to np.datetime64.
if ndim == 1 and array.dtype.type is np.object_:
with contextlib.suppress(ValueError):
array = array_to_datetime(array)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to the _to_numpy function in PR #3760.

If the datetime string is invalid.

Examples
--------
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These doctests are already covered by the tests in test_clib_to_numpy.py

@seisman seisman marked this pull request as ready for review January 13, 2025 05:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Boring but important stuff for the core devs needs review This PR has higher priority and needs review. skip-changelog Skip adding Pull Request to changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants