10 minutes to xorbits.pandas#
This is a short introduction to xorbits.pandas, adapted from pandas' quickstart.
Customarily, we import and initialize as follows:
In [1]: import xorbits
In [2]: import xorbits.numpy as np
In [3]: import xorbits.pandas as pd
In [4]: xorbits.init()
Object creation#
Creating a Series by passing a list of values, letting it create a default integer index:
In [5]: s = pd.Series([1, 3, 5, np.nan, 6, 8])
In [6]: s
Out[6]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
Creating a DataFrame by passing an array, with a datetime index and labeled columns:
In [7]: dates = pd.date_range('20130101', periods=6)
In [8]: dates
Out[8]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
In [9]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
In [10]: df
Out[10]:
A B C D
2013-01-01 0.146749 0.509997 0.927547 0.938364
2013-01-02 0.343210 -0.010853 0.329029 -0.298380
2013-01-03 0.494675 0.998666 -0.472782 1.930903
2013-01-04 0.918839 -0.889056 -0.787589 0.125859
2013-01-05 -1.469828 -0.228817 0.491169 -0.117628
2013-01-06 1.387363 -0.062752 -1.168410 0.120743
Creating a DataFrame by passing a dict of objects that can be converted to a series-like structure:
In [11]: df2 = pd.DataFrame({'A': 1.,
....: 'B': pd.Timestamp('20130102'),
....: 'C': pd.Series(1, index=list(range(4)), dtype='float32'),
....: 'D': np.array([3] * 4, dtype='int32'),
....: 'E': 'foo'})
....:
In [12]: df2
Out[12]:
A B C D E
0 1.0 2013-01-02 1.0 3 foo
1 1.0 2013-01-02 1.0 3 foo
2 1.0 2013-01-02 1.0 3 foo
3 1.0 2013-01-02 1.0 3 foo
The columns of the resulting DataFrame have different dtypes.
In [13]: df2.dtypes
Out[13]:
A float64
B datetime64[s]
C float32
D int32
E object
dtype: object
Viewing data#
Here is how to view the top and bottom rows of the frame:
In [14]: df.head()
Out[14]:
A B C D
2013-01-01 0.146749 0.509997 0.927547 0.938364
2013-01-02 0.343210 -0.010853 0.329029 -0.298380
2013-01-03 0.494675 0.998666 -0.472782 1.930903
2013-01-04 0.918839 -0.889056 -0.787589 0.125859
2013-01-05 -1.469828 -0.228817 0.491169 -0.117628
In [15]: df.tail(3)
Out[15]:
A B C D
2013-01-04 0.918839 -0.889056 -0.787589 0.125859
2013-01-05 -1.469828 -0.228817 0.491169 -0.117628
2013-01-06 1.387363 -0.062752 -1.168410 0.120743
Display the index and columns:
In [16]: df.index
Out[16]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
In [17]: df.columns
Out[17]: Index(['A', 'B', 'C', 'D'], dtype='object')
DataFrame.to_numpy() gives an ndarray representation of the underlying data. Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between DataFrame and ndarray: ndarrays have one dtype for the entire ndarray, while DataFrames have one dtype per column. When you call DataFrame.to_numpy(), xorbits.pandas will find the ndarray dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.
For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn't require copying data.
In [18]: df.to_numpy()
Out[18]:
array([[ 0.14674894, 0.50999701, 0.92754719, 0.93836374],
[ 0.34321006, -0.01085331, 0.32902911, -0.29837999],
[ 0.49467495, 0.99866562, -0.47278159, 1.93090271],
[ 0.91883899, -0.88905568, -0.78758909, 0.12585859],
[-1.46982785, -0.22881734, 0.49116879, -0.11762769],
[ 1.38736338, -0.06275176, -1.16841013, 0.12074287]])
For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive.
In [19]: df2.to_numpy()
Out[19]:
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'foo']],
dtype=object)
Note
DataFrame.to_numpy() does not include the index or column labels in the output.
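To see what is meant by the note above, here is a minimal sketch using plain pandas, whose API xorbits.pandas mirrors (swap the import for xorbits.pandas and the behavior is the same). The frame and its labels are chosen only for illustration:

```python
import numpy as np
import pandas as pd

# A labeled frame of a single dtype converts to a plain 2-D float array.
df = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]}, index=['x', 'y'])
arr = df.to_numpy()

# The index ('x', 'y') and column labels ('A', 'B') are gone; only the
# values and their positions remain.
print(arr.shape)   # (2, 2)
print(arr.dtype)   # float64
```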
describe() shows a quick statistic summary of your data:
In [20]: df.describe()
Out[20]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.303501 0.052864 -0.113506 0.449977
std 0.975786 0.646142 0.817502 0.839919
min -1.469828 -0.889056 -1.168410 -0.298380
25% 0.195864 -0.187301 -0.708887 -0.058035
50% 0.418943 -0.036803 -0.071876 0.123301
75% 0.812798 0.379784 0.450634 0.735237
max 1.387363 0.998666 0.927547 1.930903
Sorting by an axis:
In [21]: df.sort_index(axis=1, ascending=False)
Out[21]:
D C B A
2013-01-01 0.938364 0.927547 0.509997 0.146749
2013-01-02 -0.298380 0.329029 -0.010853 0.343210
2013-01-03 1.930903 -0.472782 0.998666 0.494675
2013-01-04 0.125859 -0.787589 -0.889056 0.918839
2013-01-05 -0.117628 0.491169 -0.228817 -1.469828
2013-01-06 0.120743 -1.168410 -0.062752 1.387363
Sorting by values:
In [22]: df.sort_values(by='B')
Out[22]:
A B C D
2013-01-04 0.918839 -0.889056 -0.787589 0.125859
2013-01-05 -1.469828 -0.228817 0.491169 -0.117628
2013-01-06 1.387363 -0.062752 -1.168410 0.120743
2013-01-02 0.343210 -0.010853 0.329029 -0.298380
2013-01-01 0.146749 0.509997 0.927547 0.938364
2013-01-03 0.494675 0.998666 -0.472782 1.930903
Selection#
Note
While standard Python expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized xorbits.pandas data access methods: .at, .iat, .loc and .iloc.
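The four accessors differ only in how they address the data: .loc and .at are label-based, .iloc and .iat are position-based, and the .at/.iat pair are fast scalar variants. A minimal sketch using plain pandas (xorbits.pandas mirrors the same API); the frame and its labels here are hypothetical:

```python
import pandas as pd

# A small frame to compare the optimized accessors.
df = pd.DataFrame({'A': [10, 20], 'B': [30, 40]}, index=['r0', 'r1'])

by_label = df.loc['r1', 'B']    # label-based lookup
by_pos = df.iloc[1, 1]          # position-based lookup
fast_label = df.at['r1', 'B']   # fast scalar, label-based
fast_pos = df.iat[1, 1]         # fast scalar, position-based

# All four address the same element.
print(by_label, by_pos, fast_label, fast_pos)   # 40 40 40 40
```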
Getting#
Selecting a single column, which yields a Series, equivalent to df.A:
In [23]: df['A']
Out[23]:
2013-01-01 0.146749
2013-01-02 0.343210
2013-01-03 0.494675
2013-01-04 0.918839
2013-01-05 -1.469828
2013-01-06 1.387363
Freq: D, Name: A, dtype: float64
Selecting via [], which slices the rows:
In [24]: df[0:3]
Out[24]:
A B C D
2013-01-01 0.146749 0.509997 0.927547 0.938364
2013-01-02 0.343210 -0.010853 0.329029 -0.298380
2013-01-03 0.494675 0.998666 -0.472782 1.930903
In [25]: df['20130102':'20130104']
Out[25]:
A B C D
2013-01-02 0.343210 -0.010853 0.329029 -0.298380
2013-01-03 0.494675 0.998666 -0.472782 1.930903
2013-01-04 0.918839 -0.889056 -0.787589 0.125859
Selection by label#
For getting a cross section using a label:
In [26]: df.loc['20130101']
Out[26]:
A 0.146749
B 0.509997
C 0.927547
D 0.938364
Name: 2013-01-01 00:00:00, dtype: float64
Selecting on a multi-axis by label:
In [27]: df.loc[:, ['A', 'B']]
Out[27]:
A B
2013-01-01 0.146749 0.509997
2013-01-02 0.343210 -0.010853
2013-01-03 0.494675 0.998666
2013-01-04 0.918839 -0.889056
2013-01-05 -1.469828 -0.228817
2013-01-06 1.387363 -0.062752
Showing label slicing, both endpoints are included:
In [28]: df.loc['20130102':'20130104', ['A', 'B']]
Out[28]:
A B
2013-01-02 0.343210 -0.010853
2013-01-03 0.494675 0.998666
2013-01-04 0.918839 -0.889056
Reduction in the dimensions of the returned object:
In [29]: df.loc['20130102', ['A', 'B']]
Out[29]:
A 0.343210
B -0.010853
Name: 2013-01-02 00:00:00, dtype: float64
For getting a scalar value:
In [30]: df.loc['20130101', 'A']
Out[30]: 0.14674893840817868
For getting fast access to a scalar (equivalent to the prior method):
In [31]: df.at['20130101', 'A']
Out[31]: 0.14674893840817868
Selection by position#
Select via the position of the passed integers:
In [32]: df.iloc[3]
Out[32]:
A 0.918839
B -0.889056
C -0.787589
D 0.125859
Name: 2013-01-04 00:00:00, dtype: float64
By integer slices, acting similarly to Python:
In [33]: df.iloc[3:5, 0:2]
Out[33]:
A B
2013-01-04 0.918839 -0.889056
2013-01-05 -1.469828 -0.228817
By lists of integer position locations, similar to the Python style:
In [34]: df.iloc[[1, 2, 4], [0, 2]]
Out[34]:
A C
2013-01-02 0.343210 0.329029
2013-01-03 0.494675 -0.472782
2013-01-05 -1.469828 0.491169
For slicing rows explicitly:
In [35]: df.iloc[1:3, :]
Out[35]:
A B C D
2013-01-02 0.343210 -0.010853 0.329029 -0.298380
2013-01-03 0.494675 0.998666 -0.472782 1.930903
For slicing columns explicitly:
In [36]: df.iloc[:, 1:3]
Out[36]:
B C
2013-01-01 0.509997 0.927547
2013-01-02 -0.010853 0.329029
2013-01-03 0.998666 -0.472782
2013-01-04 -0.889056 -0.787589
2013-01-05 -0.228817 0.491169
2013-01-06 -0.062752 -1.168410
For getting a value explicitly:
In [37]: df.iloc[1, 1]
Out[37]: -0.010853306036420134
For getting fast access to a scalar (equivalent to the prior method):
In [38]: df.iat[1, 1]
Out[38]: -0.010853306036420134
Boolean indexing#
Using a single column’s values to select data.
In [39]: df[df['A'] > 0]
Out[39]:
A B C D
2013-01-01 0.146749 0.509997 0.927547 0.938364
2013-01-02 0.343210 -0.010853 0.329029 -0.298380
2013-01-03 0.494675 0.998666 -0.472782 1.930903
2013-01-04 0.918839 -0.889056 -0.787589 0.125859
2013-01-06 1.387363 -0.062752 -1.168410 0.120743
Selecting values from a DataFrame where a boolean condition is met.
In [40]: df[df > 0]
Out[40]:
A B C D
2013-01-01 0.146749 0.509997 0.927547 0.938364
2013-01-02 0.343210 NaN 0.329029 NaN
2013-01-03 0.494675 0.998666 NaN 1.930903
2013-01-04 0.918839 NaN NaN 0.125859
2013-01-05 NaN NaN 0.491169 NaN
2013-01-06 1.387363 NaN NaN 0.120743
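Boolean masks can also be combined, and membership tests are a common variant. A small sketch under assumed data, using plain pandas for a standalone check (the same calls work with xorbits.pandas):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, -2, 3, -4], 'B': ['x', 'y', 'x', 'z']})

# Combine conditions with & and |; the parentheses are required
# because of Python operator precedence.
both = df[(df['A'] > 0) & (df['B'] == 'x')]

# isin() selects rows whose values are members of a given list.
members = df[df['B'].isin(['x', 'z'])]
```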
Operations#
Stats#
Operations in general exclude missing data.
Performing a descriptive statistic:
In [41]: df.mean()
Out[41]:
A 0.303501
B 0.052864
C -0.113506
D 0.449977
dtype: float64
Same operation on the other axis:
In [42]: df.mean(1)
Out[42]:
2013-01-01 0.630664
2013-01-02 0.090751
2013-01-03 0.737865
2013-01-04 -0.157987
2013-01-05 -0.331276
2013-01-06 0.069236
Freq: D, dtype: float64
Operating with objects that have different dimensionality and need alignment. In addition, xorbits.pandas automatically broadcasts along the specified dimension.
In [43]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
In [44]: s
Out[44]:
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
In [45]: df.sub(s, axis='index')
Out[45]:
A B C D
2013-01-01 NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN
2013-01-03 -0.505325 -0.001334 -1.472782 0.930903
2013-01-04 -2.081161 -3.889056 -3.787589 -2.874141
2013-01-05 -6.469828 -5.228817 -4.508831 -5.117628
2013-01-06 NaN NaN NaN NaN
Apply#
Applying functions to the data:
In [46]: df.apply(lambda x: x.max() - x.min())
Out[46]:
A 2.857191
B 1.887721
C 2.095957
D 2.229283
dtype: float64
String Methods#
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them).
In [47]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
In [48]: s.str.lower()
Out[48]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
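Since pattern-matching in str defaults to regular expressions, anchors and character classes work directly. A brief sketch with assumed sample strings, using plain pandas (identical under xorbits.pandas):

```python
import pandas as pd

s = pd.Series(['Aaba', 'Baca', 'CABA', 'dog'])

# str.contains() interprets the pattern as a regex by default, so '^a'
# anchors the match at the start; case=False ignores letter case.
starts_with_a = s.str.contains('^a', case=False)
```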
Merge#
Concat#
xorbits.pandas provides various facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
Concatenating xorbits.pandas objects together with concat():
In [49]: df = pd.DataFrame(np.random.randn(10, 4))
In [50]: df
Out[50]:
0 1 2 3
0 0.621128 -0.337973 -0.745655 -0.177982
1 0.769179 0.499170 0.231917 0.177169
2 0.965615 -0.455256 0.653178 -0.437557
3 -1.062973 -0.000431 0.270017 -0.762508
4 0.064100 0.740777 0.000655 0.031409
5 -2.291755 -0.542511 0.197584 -1.943454
6 -0.235656 -1.288441 0.329932 -0.601470
7 -0.216890 0.912030 -0.023210 0.629549
8 2.180131 -0.856475 0.823501 -0.252162
9 -0.161037 -1.537540 -0.547591 -1.162559
# break it into pieces
In [51]: pieces = [df[:3], df[3:7], df[7:]]
In [52]: pd.concat(pieces)
Out[52]:
0 1 2 3
0 0.621128 -0.337973 -0.745655 -0.177982
1 0.769179 0.499170 0.231917 0.177169
2 0.965615 -0.455256 0.653178 -0.437557
3 -1.062973 -0.000431 0.270017 -0.762508
4 0.064100 0.740777 0.000655 0.031409
5 -2.291755 -0.542511 0.197584 -1.943454
6 -0.235656 -1.288441 0.329932 -0.601470
7 -0.216890 0.912030 -0.023210 0.629549
8 2.180131 -0.856475 0.823501 -0.252162
9 -0.161037 -1.537540 -0.547591 -1.162559
Join#
SQL-style merges.
In [53]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
In [54]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
In [55]: left
Out[55]:
key lval
0 foo 1
1 foo 2
In [56]: right
Out[56]:
key rval
0 foo 4
1 foo 5
In [57]: pd.merge(left, right, on='key')
Out[57]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
Another example:
In [58]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
In [59]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
In [60]: left
Out[60]:
key lval
0 foo 1
1 bar 2
In [61]: right
Out[61]:
key rval
0 foo 4
1 bar 5
In [62]: pd.merge(left, right, on='key')
Out[62]:
key lval rval
0 foo 1 4
1 bar 2 5
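merge() also accepts a how parameter controlling the join type, as in SQL. A sketch with hypothetical frames whose keys only partially overlap, using plain pandas (the same API is mirrored by xorbits.pandas):

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

# The default inner merge keeps only keys present in both frames;
# how='left' keeps every row of the left frame, filling missing
# matches with NaN.
inner = pd.merge(left, right, on='key')
left_join = pd.merge(left, right, on='key', how='left')
```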
Grouping#
By “group by” we are referring to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
In [63]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
....: 'foo', 'bar', 'foo', 'foo'],
....: 'B': ['one', 'one', 'two', 'three',
....: 'two', 'two', 'one', 'three'],
....: 'C': np.random.randn(8),
....: 'D': np.random.randn(8)})
....:
In [64]: df
Out[64]:
A B C D
0 foo one 1.068893 -0.035675
1 bar one 0.236103 -1.658646
2 foo two 0.025028 0.034265
3 bar three 1.103031 -0.222646
4 foo two -0.029850 -0.465858
5 bar two 0.781703 -0.167088
6 foo one -0.694919 -0.670437
7 foo three -0.573294 0.156498
Grouping and then applying the sum() function to the resulting groups.
In [65]: df.groupby('A').sum()
Out[65]:
B C D
A
bar onethreetwo 2.120836 -2.048380
foo onetwotwoonethree -0.204143 -0.981208
Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function.
In [66]: df.groupby(['A', 'B']).sum()
Out[66]:
C D
A B
bar one 0.236103 -1.658646
three 1.103031 -0.222646
two 0.781703 -0.167088
foo one 0.373974 -0.706113
three -0.573294 0.156498
two -0.004822 -0.431594
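sum() is a shortcut for a single reduction; agg() applies several reductions in one split-apply-combine pass. A minimal sketch with assumed data, written in plain pandas (xorbits.pandas mirrors the same groupby API):

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# agg() computes several statistics per group at once; the result has
# one column per aggregation function.
result = df.groupby('A')['C'].agg(['sum', 'mean'])
```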
Plotting#
We use the standard convention for referencing the matplotlib API:
In [67]: import matplotlib.pyplot as plt
In [68]: plt.close('all')
In [69]: ts = pd.Series(np.random.randn(1000),
....: index=pd.date_range('1/1/2000', periods=1000))
....:
In [70]: ts = ts.cumsum()
In [71]: ts.plot()
Out[71]: <Axes: >
On a DataFrame, the plot() method is a convenience to plot all of the columns with labels:
In [72]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
....: columns=['A', 'B', 'C', 'D'])
....:
In [73]: df = df.cumsum()
In [74]: plt.figure()
Out[74]: <Figure size 640x480 with 0 Axes>
In [75]: df.plot()
Out[75]: <Axes: >
In [76]: plt.legend(loc='best')
Out[76]: <matplotlib.legend.Legend at 0x7f5b49ca77c0>
Getting data in/out#
CSV#
Writing to a CSV file.
In [77]: df.to_csv('foo.csv')
Out[77]:
Empty DataFrame
Columns: []
Index: []
Reading from a CSV file.
In [78]: pd.read_csv('foo.csv')
Out[78]:
Unnamed: 0 A B C D
0 2000-01-01 -0.008475 1.510711 0.851956 -0.008841
1 2000-01-02 0.812659 2.787078 -0.506546 -0.762796
2 2000-01-03 1.360920 2.033011 -0.329717 0.540186
3 2000-01-04 2.226712 3.620119 -0.897679 0.081344
4 2000-01-05 4.397129 3.483270 -0.142677 -0.639094
.. ... ... ... ... ...
995 2002-09-22 -3.038724 -22.546684 -29.614903 36.044764
996 2002-09-23 -3.342731 -24.292884 -29.059415 35.648857
997 2002-09-24 -5.861932 -24.100191 -28.591981 37.085480
998 2002-09-25 -4.247949 -24.921452 -30.735368 35.761904
999 2002-09-26 -5.797131 -24.266469 -28.616970 34.602231
[1000 rows x 5 columns]
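As the "Unnamed: 0" column above shows, read_csv() does not restore the saved index by default; passing index_col=0 does. A minimal round-trip sketch using plain pandas and a temporary file chosen only for illustration (the same calls work with xorbits.pandas):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0]}, index=['x', 'y'])

# Write to a temporary CSV, then read it back, restoring the saved
# index column as the index instead of a plain data column.
path = os.path.join(tempfile.mkdtemp(), 'roundtrip.csv')
df.to_csv(path)
restored = pd.read_csv(path, index_col=0)
```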