10 minutes to xorbits.pandas
This is a short introduction to xorbits.pandas, adapted from pandas' own quickstart.
Customarily, we import and initialize as follows:
In [1]: import xorbits
In [2]: import xorbits.numpy as np
In [3]: import xorbits.pandas as pd
In [4]: xorbits.init()
Object creation
Creating a Series by passing a list of values, letting xorbits.pandas create a default integer index:
In [5]: s = pd.Series([1, 3, 5, np.nan, 6, 8])
In [6]: s
Out[6]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
Creating a DataFrame by passing an array, with a datetime index and labeled columns:
In [7]: dates = pd.date_range('20130101', periods=6)
In [8]: dates
Out[8]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
In [9]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
In [10]: df
Out[10]:
A B C D
2013-01-01 0.224446 -1.164657 1.515325 -0.553190
2013-01-02 -0.490051 0.065176 -0.364318 -0.729461
2013-01-03 -0.561802 -2.508903 -0.540759 -1.184117
2013-01-04 -1.266986 1.598020 -0.151260 -1.528504
2013-01-05 0.057634 0.830278 -0.654922 1.041385
2013-01-06 0.674938 0.021462 -1.439848 1.046305
Creating a DataFrame by passing a dict of objects that can be converted to something series-like:
In [11]: df2 = pd.DataFrame({'A': 1.,
....: 'B': pd.Timestamp('20130102'),
....: 'C': pd.Series(1, index=list(range(4)), dtype='float32'),
....: 'D': np.array([3] * 4, dtype='int32'),
....: 'E': 'foo'})
....:
In [12]: df2
Out[12]:
A B C D E
0 1.0 2013-01-02 1.0 3 foo
1 1.0 2013-01-02 1.0 3 foo
2 1.0 2013-01-02 1.0 3 foo
3 1.0 2013-01-02 1.0 3 foo
The columns of the resulting DataFrame have different dtypes:
In [13]: df2.dtypes
Out[13]:
A float64
B datetime64[ns]
C float32
D int32
E object
dtype: object
Viewing data
Here is how to view the top and bottom rows of the frame:
In [14]: df.head()
Out[14]:
A B C D
2013-01-01 0.224446 -1.164657 1.515325 -0.553190
2013-01-02 -0.490051 0.065176 -0.364318 -0.729461
2013-01-03 -0.561802 -2.508903 -0.540759 -1.184117
2013-01-04 -1.266986 1.598020 -0.151260 -1.528504
2013-01-05 0.057634 0.830278 -0.654922 1.041385
In [15]: df.tail(3)
Out[15]:
A B C D
2013-01-04 -1.266986 1.598020 -0.151260 -1.528504
2013-01-05 0.057634 0.830278 -0.654922 1.041385
2013-01-06 0.674938 0.021462 -1.439848 1.046305
Display the index and columns:
In [16]: df.index
Out[16]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
In [17]: df.columns
Out[17]: Index(['A', 'B', 'C', 'D'], dtype='object')
DataFrame.to_numpy() gives an ndarray representation of the underlying data. Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between DataFrame and ndarray: ndarrays have one dtype for the entire array, while DataFrames have one dtype per column. When you call DataFrame.to_numpy(), xorbits.pandas will find the ndarray dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.
For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn't require copying data.
In [18]: df.to_numpy()
Out[18]:
array([[ 0.22444615, -1.16465735, 1.51532458, -0.55318981],
[-0.4900508 , 0.06517642, -0.36431766, -0.7294606 ],
[-0.56180194, -2.5089035 , -0.54075921, -1.18411743],
[-1.26698561, 1.59801965, -0.15125973, -1.52850352],
[ 0.05763376, 0.83027825, -0.654922 , 1.04138489],
[ 0.67493784, 0.02146243, -1.43984801, 1.04630517]])
For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive.
In [19]: df2.to_numpy()
Out[19]:
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'foo']],
dtype=object)
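As a practical aside (a minimal sketch using plain pandas, whose API xorbits.pandas follows): selecting only the numeric columns before converting lets NumPy choose a numeric common dtype instead of falling back to object.

```python
import numpy as np
import pandas as pd  # xorbits.pandas mirrors the pandas API

df2 = pd.DataFrame({'C': pd.Series(1, index=range(4), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32')})
# with only numeric columns, the common dtype is numeric (float64),
# so no per-value casting to Python objects is needed
arr = df2[['C', 'D']].to_numpy()
print(arr.dtype)  # float64
```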
Note
DataFrame.to_numpy() does not include the index or column labels in the output.
describe() shows a quick statistic summary of your data:
In [20]: df.describe()
Out[20]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.226970 -0.193104 -0.272630 -0.317930
std 0.687193 1.461166 0.979832 1.108995
min -1.266986 -2.508903 -1.439848 -1.528504
25% -0.543864 -0.868127 -0.626381 -1.070453
50% -0.216209 0.043319 -0.452538 -0.641325
75% 0.182743 0.639003 -0.204524 0.642741
max 0.674938 1.598020 1.515325 1.046305
Sorting by an axis:
In [21]: df.sort_index(axis=1, ascending=False)
Out[21]:
D C B A
2013-01-01 -0.553190 1.515325 -1.164657 0.224446
2013-01-02 -0.729461 -0.364318 0.065176 -0.490051
2013-01-03 -1.184117 -0.540759 -2.508903 -0.561802
2013-01-04 -1.528504 -0.151260 1.598020 -1.266986
2013-01-05 1.041385 -0.654922 0.830278 0.057634
2013-01-06 1.046305 -1.439848 0.021462 0.674938
Sorting by values:
In [22]: df.sort_values(by='B')
Out[22]:
A B C D
2013-01-03 -0.561802 -2.508903 -0.540759 -1.184117
2013-01-01 0.224446 -1.164657 1.515325 -0.553190
2013-01-06 0.674938 0.021462 -1.439848 1.046305
2013-01-02 -0.490051 0.065176 -0.364318 -0.729461
2013-01-05 0.057634 0.830278 -0.654922 1.041385
2013-01-04 -1.266986 1.598020 -0.151260 -1.528504
Selection
Note
While standard Python expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized xorbits.pandas data access methods, .at, .iat, .loc and .iloc.
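To make that recommendation concrete, here is a small sketch (plain pandas shown; xorbits.pandas mirrors this API) contrasting chained [] indexing with the optimized .at accessor:

```python
import pandas as pd  # xorbits.pandas mirrors the pandas API

df = pd.DataFrame({'A': [1.0, 2.0]}, index=['x', 'y'])
# chained indexing: two separate lookups, and ambiguous when setting values
v1 = df['A']['x']
# optimized label-based scalar access: a single lookup
v2 = df.at['x', 'A']
print(v1, v2)  # 1.0 1.0
```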
Getting
Selecting a single column, which yields a Series, equivalent to df.A:
In [23]: df['A']
Out[23]:
2013-01-01 0.224446
2013-01-02 -0.490051
2013-01-03 -0.561802
2013-01-04 -1.266986
2013-01-05 0.057634
2013-01-06 0.674938
Freq: D, Name: A, dtype: float64
Selecting via [], which slices the rows:
In [24]: df[0:3]
Out[24]:
A B C D
2013-01-01 0.224446 -1.164657 1.515325 -0.553190
2013-01-02 -0.490051 0.065176 -0.364318 -0.729461
2013-01-03 -0.561802 -2.508903 -0.540759 -1.184117
In [25]: df['20130102':'20130104']
Out[25]:
A B C D
2013-01-02 -0.490051 0.065176 -0.364318 -0.729461
2013-01-03 -0.561802 -2.508903 -0.540759 -1.184117
2013-01-04 -1.266986 1.598020 -0.151260 -1.528504
Selection by label
For getting a cross section using a label:
In [26]: df.loc['20130101']
Out[26]:
A 0.224446
B -1.164657
C 1.515325
D -0.553190
Name: 2013-01-01 00:00:00, dtype: float64
Selecting on a multi-axis by label:
In [27]: df.loc[:, ['A', 'B']]
Out[27]:
A B
2013-01-01 0.224446 -1.164657
2013-01-02 -0.490051 0.065176
2013-01-03 -0.561802 -2.508903
2013-01-04 -1.266986 1.598020
2013-01-05 0.057634 0.830278
2013-01-06 0.674938 0.021462
Showing label slicing, both endpoints are included:
In [28]: df.loc['20130102':'20130104', ['A', 'B']]
Out[28]:
A B
2013-01-02 -0.490051 0.065176
2013-01-03 -0.561802 -2.508903
2013-01-04 -1.266986 1.598020
Reduction in the dimensions of the returned object:
In [29]: df.loc['20130102', ['A', 'B']]
Out[29]:
A -0.490051
B 0.065176
Name: 2013-01-02 00:00:00, dtype: float64
For getting a scalar value:
In [30]: df.loc['20130101', 'A']
Out[30]: 0.22444615122186354
For getting fast access to a scalar (equivalent to the prior method):
In [31]: df.at['20130101', 'A']
Out[31]: 0.22444615122186354
Selection by position
Select via the position of the passed integers:
In [32]: df.iloc[3]
Out[32]:
A -1.266986
B 1.598020
C -0.151260
D -1.528504
Name: 2013-01-04 00:00:00, dtype: float64
By integer slices, acting similarly to Python:
In [33]: df.iloc[3:5, 0:2]
Out[33]:
A B
2013-01-04 -1.266986 1.598020
2013-01-05 0.057634 0.830278
By lists of integer position locations, similar to the Python style:
In [34]: df.iloc[[1, 2, 4], [0, 2]]
Out[34]:
A C
2013-01-02 -0.490051 -0.364318
2013-01-03 -0.561802 -0.540759
2013-01-05 0.057634 -0.654922
For slicing rows explicitly:
In [35]: df.iloc[1:3, :]
Out[35]:
A B C D
2013-01-02 -0.490051 0.065176 -0.364318 -0.729461
2013-01-03 -0.561802 -2.508903 -0.540759 -1.184117
For slicing columns explicitly:
In [36]: df.iloc[:, 1:3]
Out[36]:
B C
2013-01-01 -1.164657 1.515325
2013-01-02 0.065176 -0.364318
2013-01-03 -2.508903 -0.540759
2013-01-04 1.598020 -0.151260
2013-01-05 0.830278 -0.654922
2013-01-06 0.021462 -1.439848
For getting a value explicitly:
In [37]: df.iloc[1, 1]
Out[37]: 0.06517641626685941
For getting fast access to a scalar (equivalent to the prior method):
In [38]: df.iat[1, 1]
Out[38]: 0.06517641626685941
Boolean indexing
Using a single column’s values to select data.
In [39]: df[df['A'] > 0]
Out[39]:
A B C D
2013-01-01 0.224446 -1.164657 1.515325 -0.553190
2013-01-05 0.057634 0.830278 -0.654922 1.041385
2013-01-06 0.674938 0.021462 -1.439848 1.046305
Selecting values from a DataFrame where a boolean condition is met.
In [40]: df[df > 0]
Out[40]:
A B C D
2013-01-01 0.224446 NaN 1.515325 NaN
2013-01-02 NaN 0.065176 NaN NaN
2013-01-03 NaN NaN NaN NaN
2013-01-04 NaN 1.598020 NaN NaN
2013-01-05 0.057634 0.830278 NaN 1.041385
2013-01-06 0.674938 0.021462 NaN 1.046305
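Conditions can also be combined, as in this brief sketch (plain pandas shown; xorbits.pandas mirrors this API) using & and the isin() method:

```python
import pandas as pd  # xorbits.pandas mirrors the pandas API

df = pd.DataFrame({'A': [1, -2, 3], 'B': ['x', 'y', 'x']})
# combine conditions with & and |, wrapping each one in parentheses
sel = df[(df['A'] > 0) & (df['B'] == 'x')]
# isin() keeps rows whose value is in a given collection
sel2 = df[df['B'].isin(['y'])]
print(len(sel), len(sel2))  # 2 1
```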
Operations
Stats
Operations in general exclude missing data.
Performing a descriptive statistic:
In [41]: df.mean()
Out[41]:
A -0.226970
B -0.193104
C -0.272630
D -0.317930
dtype: float64
Same operation on the other axis:
In [42]: df.mean(1)
Out[42]:
2013-01-01 0.005481
2013-01-02 -0.379663
2013-01-03 -1.198896
2013-01-04 -0.337182
2013-01-05 0.318594
2013-01-06 0.075714
Freq: D, dtype: float64
Operating with objects that have different dimensionality and need alignment. In addition, xorbits.pandas automatically broadcasts along the specified dimension:
In [43]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
In [44]: s
Out[44]:
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
In [45]: df.sub(s, axis='index')
Out[45]:
A B C D
2013-01-01 NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN
2013-01-03 -1.561802 -3.508903 -1.540759 -2.184117
2013-01-04 -4.266986 -1.401980 -3.151260 -4.528504
2013-01-05 -4.942366 -4.169722 -5.654922 -3.958615
2013-01-06 NaN NaN NaN NaN
Apply
Applying functions to the data:
In [46]: df.apply(lambda x: x.max() - x.min())
Out[46]:
A 1.941923
B 4.106923
C 2.955173
D 2.574809
dtype: float64
String Methods
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them).
In [47]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
In [48]: s.str.lower()
Out[48]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
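Since pattern-matching in str defaults to regular expressions, anchors and character classes work out of the box. A small sketch (plain pandas shown; xorbits.pandas mirrors this API):

```python
import pandas as pd  # xorbits.pandas mirrors the pandas API

s = pd.Series(['Aaba', 'Baca', 'CABA'])
# str.contains interprets its pattern as a regex by default,
# so '^A' anchors the match at the start of each string
mask = s.str.contains('^A')
print(mask.tolist())  # [True, False, False]
```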
Merge
Concat
xorbits.pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
Concatenating xorbits.pandas objects together with concat():
In [49]: df = pd.DataFrame(np.random.randn(10, 4))
In [50]: df
Out[50]:
0 1 2 3
0 -0.630711 -1.055283 0.700147 -1.388102
1 0.890356 -1.709177 0.078013 0.607967
2 1.873063 0.569527 -2.301107 -0.740584
3 -0.360962 2.270846 1.031973 -0.476972
4 2.418310 -0.686323 -1.147042 -0.292606
5 1.089239 0.343140 -0.156534 -0.349143
6 0.742573 -1.403164 0.381083 -1.586600
7 -1.108405 -0.700347 -0.433201 -0.261067
8 -1.478137 0.795965 0.778158 -0.998803
9 -1.553243 0.210058 -2.481455 -0.581690
# break it into pieces
In [51]: pieces = [df[:3], df[3:7], df[7:]]
In [52]: pd.concat(pieces)
Out[52]:
0 1 2 3
0 -0.630711 -1.055283 0.700147 -1.388102
1 0.890356 -1.709177 0.078013 0.607967
2 1.873063 0.569527 -2.301107 -0.740584
3 -0.360962 2.270846 1.031973 -0.476972
4 2.418310 -0.686323 -1.147042 -0.292606
5 1.089239 0.343140 -0.156534 -0.349143
6 0.742573 -1.403164 0.381083 -1.586600
7 -1.108405 -0.700347 -0.433201 -0.261067
8 -1.478137 0.795965 0.778158 -0.998803
9 -1.553243 0.210058 -2.481455 -0.581690
Join
SQL style merges:
In [53]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
In [54]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
In [55]: left
Out[55]:
key lval
0 foo 1
1 foo 2
In [56]: right
Out[56]:
key rval
0 foo 4
1 foo 5
In [57]: pd.merge(left, right, on='key')
Out[57]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
Another example:
In [58]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
In [59]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
In [60]: left
Out[60]:
key lval
0 foo 1
1 bar 2
In [61]: right
Out[61]:
key rval
0 foo 4
1 bar 5
In [62]: pd.merge(left, right, on='key')
Out[62]:
key lval rval
0 foo 1 4
1 bar 2 5
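Both merges above are inner joins, the default. The how parameter selects other join types; for instance (a sketch with plain pandas, whose API xorbits.pandas follows), an outer join keeps keys from both sides:

```python
import pandas as pd  # xorbits.pandas mirrors the pandas API

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})
# how='outer' keeps keys present in either frame,
# filling missing values with NaN
merged = pd.merge(left, right, on='key', how='outer')
print(sorted(merged['key']))  # ['bar', 'baz', 'foo']
```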
Grouping
By “group by” we are referring to a process involving one or more of the following steps:
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
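The three steps can be sketched by hand and compared with the one-step groupby (plain pandas shown; xorbits.pandas mirrors this API):

```python
import pandas as pd  # xorbits.pandas mirrors the pandas API

df = pd.DataFrame({'A': ['foo', 'bar', 'foo'], 'C': [1.0, 2.0, 3.0]})
# split into groups, apply sum to each, combine into a dict
manual = {key: grp['C'].sum() for key, grp in df.groupby('A')}
# groupby performs split-apply-combine in one step
auto = df.groupby('A')['C'].sum().to_dict()
print(manual == auto)  # True
```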
In [63]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
....: 'foo', 'bar', 'foo', 'foo'],
....: 'B': ['one', 'one', 'two', 'three',
....: 'two', 'two', 'one', 'three'],
....: 'C': np.random.randn(8),
....: 'D': np.random.randn(8)})
....:
In [64]: df
Out[64]:
A B C D
0 foo one 0.224976 -0.990328
1 bar one 0.740208 -0.199152
2 foo two -0.693955 1.684258
3 bar three -0.658850 0.431106
4 foo two 1.023329 -0.316393
5 bar two 1.409466 1.184248
6 foo one 1.429536 1.139899
7 foo three 1.271735 0.270696
Grouping and then applying the sum() function to the resulting groups:
In [65]: df.groupby('A').sum()
Out[65]:
B C D
A
bar onethreetwo 1.490823 1.416202
foo onetwotwoonethree 3.255622 1.788130
Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function.
In [66]: df.groupby(['A', 'B']).sum()
Out[66]:
C D
A B
bar one 0.740208 -0.199152
three -0.658850 0.431106
two 1.409466 1.184248
foo one 1.654512 0.149571
three 1.271735 0.270696
two 0.329375 1.367864
Plotting
We use the standard convention for referencing the matplotlib API:
In [67]: import matplotlib.pyplot as plt
In [68]: plt.close('all')
In [69]: ts = pd.Series(np.random.randn(1000),
....: index=pd.date_range('1/1/2000', periods=1000))
....:
In [70]: ts = ts.cumsum()
In [71]: ts.plot()
Out[71]: <Axes: >

On a DataFrame, the plot() method is a convenience to plot all of the columns with labels:
In [72]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
....: columns=['A', 'B', 'C', 'D'])
....:
In [73]: df = df.cumsum()
In [74]: plt.figure()
Out[74]: <Figure size 640x480 with 0 Axes>
In [75]: df.plot()
Out[75]: <Axes: >
In [76]: plt.legend(loc='best')
Out[76]: <matplotlib.legend.Legend at 0x7f068e01ce50>

Getting data in/out
CSV
Writing to a CSV file:
In [77]: df.to_csv('foo.csv')
Out[77]:
Empty DataFrame
Columns: []
Index: []
Reading from a CSV file:
In [78]: pd.read_csv('foo.csv')
Out[78]:
Unnamed: 0 A B C D
0 2000-01-01 1.178291 0.364987 -0.840177 -0.332956
1 2000-01-02 2.057435 0.296857 -0.003532 -1.758008
2 2000-01-03 0.953813 0.271327 -0.756990 -1.198440
3 2000-01-04 0.589310 -0.392329 0.681356 -1.951529
4 2000-01-05 0.904952 1.599681 1.959410 -3.059426
.. ... ... ... ... ...
995 2002-09-22 -12.510332 39.213884 26.434486 59.426481
996 2002-09-23 -13.653644 38.167130 25.543169 58.687596
997 2002-09-24 -16.472875 37.242324 24.860751 56.588386
998 2002-09-25 -15.509081 37.077324 24.493449 56.980079
999 2002-09-26 -16.018922 36.578095 26.079306 57.032693
[1000 rows x 5 columns]
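The "Unnamed: 0" column above is the saved index read back as ordinary data. Passing index_col restores it as the index, as in this sketch (plain pandas with a temporary file; xorbits.pandas mirrors this API):

```python
import os
import tempfile

import pandas as pd  # xorbits.pandas mirrors the pandas API

df = pd.DataFrame({'A': [1, 2]}, index=['x', 'y'])
path = os.path.join(tempfile.mkdtemp(), 'foo.csv')
df.to_csv(path)
# index_col=0 restores the saved index instead of leaving it
# as an 'Unnamed: 0' column
out = pd.read_csv(path, index_col=0)
print(out.index.tolist())  # ['x', 'y']
```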