10 minutes to xorbits.pandas#
This is a short introduction to xorbits.pandas, adapted from the pandas quickstart.
Customarily, we import and initialize as follows:
In [1]: import xorbits
In [2]: import xorbits.numpy as np
In [3]: import xorbits.pandas as pd
In [4]: xorbits.init()
Object creation#
Creating a Series by passing a list of values, letting xorbits.pandas create a default integer index:
In [5]: s = pd.Series([1, 3, 5, np.nan, 6, 8])
In [6]: s
Out[6]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
Creating a DataFrame by passing an array, with a datetime index and labeled columns:
In [7]: dates = pd.date_range('20130101', periods=6)
In [8]: dates
Out[8]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
In [9]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
In [10]: df
Out[10]:
A B C D
2013-01-01 -0.895370 2.126507 0.630915 1.413850
2013-01-02 -0.171440 -2.010562 -0.353420 0.429268
2013-01-03 0.142932 0.468562 0.218185 -0.220462
2013-01-04 0.943198 0.313965 1.366018 0.965330
2013-01-05 0.608089 1.220280 1.379540 -1.271964
2013-01-06 0.189339 -0.450788 0.504877 -2.264019
Creating a DataFrame by passing a dict of objects that can be converted to a series-like structure:
In [11]: df2 = pd.DataFrame({'A': 1.,
....: 'B': pd.Timestamp('20130102'),
....: 'C': pd.Series(1, index=list(range(4)), dtype='float32'),
....: 'D': np.array([3] * 4, dtype='int32'),
....: 'E': 'foo'})
....:
In [12]: df2
Out[12]:
A B C D E
0 1.0 2013-01-02 1.0 3 foo
1 1.0 2013-01-02 1.0 3 foo
2 1.0 2013-01-02 1.0 3 foo
3 1.0 2013-01-02 1.0 3 foo
The columns of the resulting DataFrame have different dtypes:
In [13]: df2.dtypes
Out[13]:
A float64
B datetime64[s]
C float32
D int32
E object
dtype: object
Viewing data#
Here is how to view the top and bottom rows of the frame:
In [14]: df.head()
Out[14]:
A B C D
2013-01-01 -0.895370 2.126507 0.630915 1.413850
2013-01-02 -0.171440 -2.010562 -0.353420 0.429268
2013-01-03 0.142932 0.468562 0.218185 -0.220462
2013-01-04 0.943198 0.313965 1.366018 0.965330
2013-01-05 0.608089 1.220280 1.379540 -1.271964
In [15]: df.tail(3)
Out[15]:
A B C D
2013-01-04 0.943198 0.313965 1.366018 0.965330
2013-01-05 0.608089 1.220280 1.379540 -1.271964
2013-01-06 0.189339 -0.450788 0.504877 -2.264019
Display the index and columns:
In [16]: df.index
Out[16]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
In [17]: df.columns
Out[17]: Index(['A', 'B', 'C', 'D'], dtype='object')
DataFrame.to_numpy() gives an ndarray representation of the underlying data. Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between DataFrame and ndarray: ndarrays have one dtype for the entire array, while DataFrames have one dtype per column. When you call DataFrame.to_numpy(), xorbits.pandas will find the ndarray dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.
For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn’t require copying data.
In [18]: df.to_numpy()
Out[18]:
array([[-0.89536986, 2.12650669, 0.63091484, 1.41384962],
[-0.17143963, -2.01056185, -0.35342023, 0.42926772],
[ 0.14293214, 0.46856241, 0.21818546, -0.22046165],
[ 0.94319762, 0.31396544, 1.36601814, 0.96533004],
[ 0.60808868, 1.2202805 , 1.37953977, -1.27196401],
[ 0.18933859, -0.45078809, 0.50487679, -2.26401882]])
For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive.
In [19]: df2.to_numpy()
Out[19]:
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'foo']],
dtype=object)
Note
DataFrame.to_numpy() does not include the index or column labels in the output.
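As a quick check of that note, here is a minimal sketch written against the plain pandas API, which xorbits.pandas mirrors (swap the import for import xorbits.pandas as pd to run it distributed):

```python
import numpy as np
import pandas as pd  # xorbits.pandas mirrors this API

df = pd.DataFrame(np.ones((2, 3)), index=['r0', 'r1'], columns=['x', 'y', 'z'])
arr = df.to_numpy()

# Only the values survive the conversion; the labels stay on the DataFrame.
assert arr.shape == (2, 3)
assert not hasattr(arr, 'columns')
```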
describe() shows a quick statistic summary of your data:
In [20]: df.describe()
Out[20]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.136125 0.277994 0.624352 -0.158000
std 0.638055 1.422050 0.671670 1.396968
min -0.895370 -2.010562 -0.353420 -2.264019
25% -0.092847 -0.259600 0.289858 -1.009088
50% 0.166135 0.391264 0.567896 0.104403
75% 0.503401 1.032351 1.182242 0.831314
max 0.943198 2.126507 1.379540 1.413850
Sorting by an axis:
In [21]: df.sort_index(axis=1, ascending=False)
Out[21]:
D C B A
2013-01-01 1.413850 0.630915 2.126507 -0.895370
2013-01-02 0.429268 -0.353420 -2.010562 -0.171440
2013-01-03 -0.220462 0.218185 0.468562 0.142932
2013-01-04 0.965330 1.366018 0.313965 0.943198
2013-01-05 -1.271964 1.379540 1.220280 0.608089
2013-01-06 -2.264019 0.504877 -0.450788 0.189339
Sorting by values:
In [22]: df.sort_values(by='B')
Out[22]:
A B C D
2013-01-02 -0.171440 -2.010562 -0.353420 0.429268
2013-01-06 0.189339 -0.450788 0.504877 -2.264019
2013-01-04 0.943198 0.313965 1.366018 0.965330
2013-01-03 0.142932 0.468562 0.218185 -0.220462
2013-01-05 0.608089 1.220280 1.379540 -1.271964
2013-01-01 -0.895370 2.126507 0.630915 1.413850
Selection#
Note
While standard Python expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized xorbits.pandas data access methods: .at, .iat, .loc and .iloc.
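A minimal sketch of how the four accessors relate (plain pandas shown; xorbits.pandas exposes the same accessors): .at and .iat answer a single label or position pair, while .loc and .iloc also accept slices and lists.

```python
import numpy as np
import pandas as pd  # xorbits.pandas exposes the same accessors

df = pd.DataFrame(np.arange(6.0).reshape(2, 3),
                  index=['r0', 'r1'], columns=['a', 'b', 'c'])

# Label-based: .at is the fast single-cell path, .loc the general one.
assert df.at['r1', 'b'] == df.loc['r1', 'b'] == 4.0
# Position-based: .iat mirrors .iloc for a single cell.
assert df.iat[1, 1] == df.iloc[1, 1] == 4.0
```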
Getting#
Selecting a single column, which yields a Series, equivalent to df.A:
In [23]: df['A']
Out[23]:
2013-01-01 -0.895370
2013-01-02 -0.171440
2013-01-03 0.142932
2013-01-04 0.943198
2013-01-05 0.608089
2013-01-06 0.189339
Freq: D, Name: A, dtype: float64
Selecting via [], which slices the rows:
In [24]: df[0:3]
Out[24]:
A B C D
2013-01-01 -0.895370 2.126507 0.630915 1.413850
2013-01-02 -0.171440 -2.010562 -0.353420 0.429268
2013-01-03 0.142932 0.468562 0.218185 -0.220462
In [25]: df['20130102':'20130104']
Out[25]:
A B C D
2013-01-02 -0.171440 -2.010562 -0.353420 0.429268
2013-01-03 0.142932 0.468562 0.218185 -0.220462
2013-01-04 0.943198 0.313965 1.366018 0.965330
Selection by label#
For getting a cross section using a label:
In [26]: df.loc['20130101']
Out[26]:
A -0.895370
B 2.126507
C 0.630915
D 1.413850
Name: 2013-01-01 00:00:00, dtype: float64
Selecting on a multi-axis by label:
In [27]: df.loc[:, ['A', 'B']]
Out[27]:
A B
2013-01-01 -0.895370 2.126507
2013-01-02 -0.171440 -2.010562
2013-01-03 0.142932 0.468562
2013-01-04 0.943198 0.313965
2013-01-05 0.608089 1.220280
2013-01-06 0.189339 -0.450788
Showing label slicing, both endpoints are included:
In [28]: df.loc['20130102':'20130104', ['A', 'B']]
Out[28]:
A B
2013-01-02 -0.171440 -2.010562
2013-01-03 0.142932 0.468562
2013-01-04 0.943198 0.313965
Reduction in the dimensions of the returned object:
In [29]: df.loc['20130102', ['A', 'B']]
Out[29]:
A -0.171440
B -2.010562
Name: 2013-01-02 00:00:00, dtype: float64
For getting a scalar value:
In [30]: df.loc['20130101', 'A']
Out[30]: -0.8953698607067291
For getting fast access to a scalar (equivalent to the prior method):
In [31]: df.at['20130101', 'A']
Out[31]: -0.8953698607067291
Selection by position#
Select via the position of the passed integers:
In [32]: df.iloc[3]
Out[32]:
A 0.943198
B 0.313965
C 1.366018
D 0.965330
Name: 2013-01-04 00:00:00, dtype: float64
By integer slices, acting similarly to Python:
In [33]: df.iloc[3:5, 0:2]
Out[33]:
A B
2013-01-04 0.943198 0.313965
2013-01-05 0.608089 1.220280
By lists of integer position locations, similar to the Python style:
In [34]: df.iloc[[1, 2, 4], [0, 2]]
Out[34]:
A C
2013-01-02 -0.171440 -0.353420
2013-01-03 0.142932 0.218185
2013-01-05 0.608089 1.379540
For slicing rows explicitly:
In [35]: df.iloc[1:3, :]
Out[35]:
A B C D
2013-01-02 -0.171440 -2.010562 -0.353420 0.429268
2013-01-03 0.142932 0.468562 0.218185 -0.220462
For slicing columns explicitly:
In [36]: df.iloc[:, 1:3]
Out[36]:
B C
2013-01-01 2.126507 0.630915
2013-01-02 -2.010562 -0.353420
2013-01-03 0.468562 0.218185
2013-01-04 0.313965 1.366018
2013-01-05 1.220280 1.379540
2013-01-06 -0.450788 0.504877
For getting a value explicitly:
In [37]: df.iloc[1, 1]
Out[37]: -2.010561845234837
For getting fast access to a scalar (equivalent to the prior method):
In [38]: df.iat[1, 1]
Out[38]: -2.010561845234837
Boolean indexing#
Using a single column’s values to select data:
In [39]: df[df['A'] > 0]
Out[39]:
A B C D
2013-01-03 0.142932 0.468562 0.218185 -0.220462
2013-01-04 0.943198 0.313965 1.366018 0.965330
2013-01-05 0.608089 1.220280 1.379540 -1.271964
2013-01-06 0.189339 -0.450788 0.504877 -2.264019
Selecting values from a DataFrame where a boolean condition is met:
In [40]: df[df > 0]
Out[40]:
A B C D
2013-01-01 NaN 2.126507 0.630915 1.413850
2013-01-02 NaN NaN NaN 0.429268
2013-01-03 0.142932 0.468562 0.218185 NaN
2013-01-04 0.943198 0.313965 1.366018 0.965330
2013-01-05 0.608089 1.220280 1.379540 NaN
2013-01-06 0.189339 NaN 0.504877 NaN
Operations#
Stats#
Operations in general exclude missing data.
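A minimal sketch of that rule (plain pandas shown; xorbits.pandas behaves the same): missing values are skipped, not propagated or treated as zero.

```python
import numpy as np
import pandas as pd  # same behaviour in xorbits.pandas

s = pd.Series([1.0, np.nan, 3.0])

# The NaN is excluded: the mean is (1 + 3) / 2, not (1 + 0 + 3) / 3.
assert s.mean() == 2.0
assert s.count() == 2                   # count() also ignores missing values
assert np.isnan(s.mean(skipna=False))   # opt back in with skipna=False
```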
Performing a descriptive statistic:
In [41]: df.mean()
Out[41]:
A 0.136125
B 0.277994
C 0.624352
D -0.158000
dtype: float64
Same operation on the other axis:
In [42]: df.mean(1)
Out[42]:
2013-01-01 0.818975
2013-01-02 -0.526538
2013-01-03 0.152305
2013-01-04 0.897128
2013-01-05 0.483986
2013-01-06 -0.505148
Freq: D, dtype: float64
Operating with objects that have different dimensionality and need alignment. In addition, xorbits.pandas automatically broadcasts along the specified dimension.
In [43]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
In [44]: s
Out[44]:
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
In [45]: df.sub(s, axis='index')
Out[45]:
A B C D
2013-01-01 NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN
2013-01-03 -0.857068 -0.531438 -0.781815 -1.220462
2013-01-04 -2.056802 -2.686035 -1.633982 -2.034670
2013-01-05 -4.391911 -3.779720 -3.620460 -6.271964
2013-01-06 NaN NaN NaN NaN
Apply#
Applying functions to the data:
In [46]: df.apply(lambda x: x.max() - x.min())
Out[46]:
A 1.838567
B 4.137069
C 1.732960
D 3.677868
dtype: float64
String Methods#
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them).
In [47]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
In [48]: s.str.lower()
Out[48]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
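To illustrate the regex default mentioned above, a small sketch (plain pandas; the .str accessor in xorbits.pandas works the same way): str.contains() interprets the pattern as a regular expression unless regex=False is passed.

```python
import pandas as pd  # xorbits.pandas provides the same .str accessor

s = pd.Series(['Aaba', 'Baca', 'cat'])

# 'a.a' as a regex: any character between two 'a's.
assert s.str.contains('a.a').tolist() == [True, True, False]
# 'a.a' as a literal string: no element contains a dot.
assert s.str.contains('a.a', regex=False).tolist() == [False, False, False]
```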
Merge#
Concat#
xorbits.pandas provides various facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational-algebra functionality in the case of join / merge-type operations.
Concatenating xorbits.pandas objects together with concat():
In [49]: df = pd.DataFrame(np.random.randn(10, 4))
In [50]: df
Out[50]:
0 1 2 3
0 -2.222102 0.089420 0.444085 0.497213
1 -2.419492 0.274049 -0.304995 -1.175190
2 -0.067044 -0.525800 -1.264446 0.960053
3 -0.143600 -0.340298 0.385199 -1.351128
4 2.121553 0.623916 0.647778 -0.646439
5 -1.113658 -2.215303 -0.083893 0.259849
6 -0.182575 0.161977 1.115416 0.809280
7 1.028048 0.685183 -0.740458 1.099633
8 0.026282 0.749932 0.451062 1.211574
9 0.635577 0.958059 -0.381735 1.822400
# break it into pieces
In [51]: pieces = [df[:3], df[3:7], df[7:]]
In [52]: pd.concat(pieces)
Out[52]:
0 1 2 3
0 -2.222102 0.089420 0.444085 0.497213
1 -2.419492 0.274049 -0.304995 -1.175190
2 -0.067044 -0.525800 -1.264446 0.960053
3 -0.143600 -0.340298 0.385199 -1.351128
4 2.121553 0.623916 0.647778 -0.646439
5 -1.113658 -2.215303 -0.083893 0.259849
6 -0.182575 0.161977 1.115416 0.809280
7 1.028048 0.685183 -0.740458 1.099633
8 0.026282 0.749932 0.451062 1.211574
9 0.635577 0.958059 -0.381735 1.822400
Join#
SQL-style merges:
In [53]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
In [54]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
In [55]: left
Out[55]:
key lval
0 foo 1
1 foo 2
In [56]: right
Out[56]:
key rval
0 foo 4
1 foo 5
In [57]: pd.merge(left, right, on='key')
Out[57]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
Another example:
In [58]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
In [59]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
In [60]: left
Out[60]:
key lval
0 foo 1
1 bar 2
In [61]: right
Out[61]:
key rval
0 foo 4
1 bar 5
In [62]: pd.merge(left, right, on='key')
Out[62]:
key lval rval
0 foo 1 4
1 bar 2 5
Grouping#
By “group by” we are referring to a process involving one or more of the following steps:
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
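The three steps can be sketched in a single expression (plain pandas shown; xorbits.pandas has the same groupby semantics):

```python
import pandas as pd  # same split-apply-combine semantics in xorbits.pandas

df = pd.DataFrame({'key': ['a', 'b', 'a'], 'val': [1, 2, 3]})

# Split by 'key', apply sum() to each group, combine into one row per group.
out = df.groupby('key')['val'].sum()

assert out['a'] == 4 and out['b'] == 2
```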
In [63]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
....: 'foo', 'bar', 'foo', 'foo'],
....: 'B': ['one', 'one', 'two', 'three',
....: 'two', 'two', 'one', 'three'],
....: 'C': np.random.randn(8),
....: 'D': np.random.randn(8)})
....:
In [64]: df
Out[64]:
A B C D
0 foo one -0.006975 -0.563461
1 bar one -1.596581 -0.019810
2 foo two -0.079158 0.409955
3 bar three -0.376174 0.189913
4 foo two -1.685014 -0.473401
5 bar two -0.317097 0.925201
6 foo one -0.674133 -0.987325
7 foo three -1.948111 0.476968
Grouping and then applying the sum() function to the resulting groups:
In [65]: df.groupby('A').sum()
Out[65]:
B C D
A
bar onethreetwo -2.289852 1.095304
foo onetwotwoonethree -4.393390 -1.137264
Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function:
In [66]: df.groupby(['A', 'B']).sum()
Out[66]:
C D
A B
bar one -1.596581 -0.019810
three -0.376174 0.189913
two -0.317097 0.925201
foo one -0.681107 -1.550786
three -1.948111 0.476968
two -1.764172 -0.063446
Plotting#
We use the standard convention for referencing the matplotlib API:
In [67]: import matplotlib.pyplot as plt
In [68]: plt.close('all')
In [69]: ts = pd.Series(np.random.randn(1000),
....: index=pd.date_range('1/1/2000', periods=1000))
....:
In [70]: ts = ts.cumsum()
In [71]: ts.plot()
Out[71]: <Axes: >
On a DataFrame, the plot() method is a convenience to plot all of the columns with labels:
In [72]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
....: columns=['A', 'B', 'C', 'D'])
....:
In [73]: df = df.cumsum()
In [74]: plt.figure()
Out[74]: <Figure size 640x480 with 0 Axes>
In [75]: df.plot()
Out[75]: <Axes: >
In [76]: plt.legend(loc='best')
Out[76]: <matplotlib.legend.Legend at 0x7f6c45045fa0>
Getting data in/out#
CSV#
Writing to a CSV file:
In [77]: df.to_csv('foo.csv')
Out[77]:
Empty DataFrame
Columns: []
Index: []
Reading from a CSV file:
In [78]: pd.read_csv('foo.csv')
Out[78]:
Unnamed: 0 A B C D
0 2000-01-01 -0.771442 1.000700 0.878125 2.056192
1 2000-01-02 -0.693299 -0.224015 -0.576945 -0.247457
2 2000-01-03 0.392360 -1.431820 1.180861 1.259017
3 2000-01-04 2.264389 -0.085838 -0.307007 0.538263
4 2000-01-05 3.226983 0.824098 1.046063 -0.515959
.. ... ... ... ... ...
995 2002-09-22 27.254041 6.937970 18.681139 -18.979443
996 2002-09-23 27.339197 5.966404 19.230130 -19.203582
997 2002-09-24 28.336219 6.241372 17.442813 -19.330066
998 2002-09-25 29.551700 5.160639 17.915526 -18.414204
999 2002-09-26 28.983247 5.327610 18.110839 -17.842912
[1000 rows x 5 columns]
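Note the Unnamed: 0 column above: to_csv() wrote the index out as an ordinary column, and read_csv() did not know to restore it. A minimal round-trip sketch (plain pandas; xorbits.pandas offers the same to_csv/read_csv pair), using a hypothetical roundtrip.csv in the temp directory, shows how index_col=0 recovers the saved index:

```python
import os
import tempfile

import numpy as np
import pandas as pd  # xorbits.pandas offers the same to_csv/read_csv pair

path = os.path.join(tempfile.gettempdir(), 'roundtrip.csv')

df = pd.DataFrame(np.eye(2), index=['r0', 'r1'], columns=['a', 'b'])
df.to_csv(path)

# index_col=0 restores the saved index instead of adding an 'Unnamed: 0' column.
back = pd.read_csv(path, index_col=0)
assert list(back.columns) == ['a', 'b']
assert list(back.index) == ['r0', 'r1']
```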