-
Notifications
You must be signed in to change notification settings - Fork 224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
clib.conversion: Remove the array_to_datetime function, which is no longer needed #3507
base: main
Are you sure you want to change the base?
Conversation
e5cf4f1
to
705aa4a
Compare
705aa4a
to
d7503d1
Compare
d7503d1
to
20b9215
Compare
pygmt/clib/session.py
Outdated
@@ -1388,7 +1378,7 @@ def virtualfile_from_vectors(self, *vectors): | |||
# Assumes that first 2 columns contains coordinates like longitude | |||
# latitude, or datetime string types. | |||
for col, array in enumerate(arrays[2:]): | |||
if pd.api.types.is_string_dtype(array.dtype): | |||
if array.dtype.type == np.str_: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'll need to check if this can handle pandas.StringDtype
and pyarrow.StringArray
(xref #2933).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both can be converted to the numpy string dtype by the vectors_to_arrays
method, so in virtualfile_from_vectors
, we only need to check np.str_
:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: import pyarrow as pa
In [4]: x = pd.Series(["abc", "defghi"], dtype="string")
In [5]: np.asarray(x)
Out[5]: array(['abc', 'defghi'], dtype=object)
In [6]: np.asarray(x, dtype=str)
Out[6]: array(['abc', 'defghi'], dtype='<U6')
In [7]: y = pa.array(["abc", "defghi"])
In [8]: np.asarray(y)
Out[8]: array(['abc', 'defghi'], dtype=object)
In [9]: np.asarray(y, dtype=str)
Out[9]: array(['abc', 'defghi'], dtype='<U6')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main idea of this PR is to let vectors_to_arrays
do the conversion work from any dtypes (including pd.StringDtype
and pa.StringArray
) into the numpy dtypes, so that the virtualfile_from_vectors
only need to care about how to map numpy dtypes into GMT data types.
For any special dtypes that we know how to convert it to numpy dtype, we can maintain a mapping dictionary, just like what you did to support pyarrow's date32[day] and date64[ms] in #2845:
pygmt/pygmt/clib/conversion.py
Lines 208 to 211 in c2e429c
dtypes = { | |
"date32[day][pyarrow]": np.datetime64, | |
"date64[ms][pyarrow]": np.datetime64, | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In 83673cf, I've moved most of the doctests into a separate test file pygmt/tests/test_clib_vectors_to_arrays.py
. A test test_vectors_to_arrays_pandas_string
is added to check that the function can handle pd.StringDtype
correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For any special dtypes that we know how to convert it to numpy dtype, we can maintain a mapping dictionary, just like what you did to support pyarrow's date32[day] and date64[ms] in #2845:
pygmt/pygmt/clib/conversion.py
Lines 208 to 211 in c2e429c
dtypes = { "date32[day][pyarrow]": np.datetime64, "date64[ms][pyarrow]": np.datetime64, }
Based on the tests below, I think we should add the entry "string": np.str_
to the dictionary:
In [1]: import pandas as pd
In [2]: x = pd.Series(["abc", "12345"])
In [3]: x.dtype
Out[3]: dtype('O')
In [4]: str(x.dtype)
Out[4]: 'object'
In [5]: x = pd.Series(["abc", "12345"], dtype="string")
In [6]: x.dtype
Out[6]: string[python]
In [7]: str(x.dtype)
Out[7]: 'string'
In [8]: x = pd.Series(["abc", "12345"], dtype="string[pyarrow]")
In [9]: x.dtype
Out[9]: string[pyarrow]
In [10]: str(x.dtype)
Out[10]: 'string'
In [11]: import pyarrow as pa
In [12]: x = pa.array(["abc", "defghi"])
In [13]: x.type
Out[13]: DataType(string)
In [14]: str(x.type)
Out[14]: 'string'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In PR #2774 and #2774, we only checked if PyGMT supports pandas with the pyarrow backend, but didn't check if the original pyarrow arrays works. For example, for a pyarrow date32
array, we need to check array.type
rather than array.dtype
:
In [1]: import datetime
In [2]: import pyarrow as pa
In [3]: x = pa.array([datetime.date(2020, 1, 1), datetime.date(2021, 12, 31)])
In [4]: str(x.type)
Out[4]: 'date32[day]'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, for a pyarrow
date32
array, we need to checkarray.type
Yes, raw pyarrow
arrays of date32/date64
type are not supported yet. That's why it's marked 🚧 in #2800 (I was planning to modify array_to_datetime
to handle it).
aa3ffc3
to
6338cde
Compare
a029191
to
54160bf
Compare
# For 1-D arrays, try to convert unknown object type to np.datetime64. | ||
if ndim == 1 and array.dtype.type is np.object_: | ||
with contextlib.suppress(ValueError): | ||
array = array_to_datetime(array) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to the _to_numpy
function in PR #3760.
If the datetime string is invalid. | ||
|
||
Examples | ||
-------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These doctests are already covered by the tests in test_clib_to_numpy.py
After a series of PRs (especially #3760), the
_to_numpy
function takes the responsibility to convert any array-like objects into numpy arrays with recognized dtypes. So, if_to_numpy
returns anp.object
dtype string, it means the original array cannot to converted to a numpy array that can be passed to GMT.After #3760, the
array_to_datetime
function is no longer used and can be removed.