10 minutes to xorbits.pandas#

This is a short introduction to xorbits.pandas, adapted from the pandas quickstart.

Customarily, we import and initialize as follows:

In [1]: import xorbits

In [2]: import xorbits.numpy as np

In [3]: import xorbits.pandas as pd

In [4]: xorbits.init()

Object creation#

Creating a Series by passing a list of values, letting it create a default integer index:

In [5]: s = pd.Series([1, 3, 5, np.nan, 6, 8])

In [6]: s
Out[6]: 
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

In [7]: dates = pd.date_range('20130101', periods=6)

In [8]: dates
Out[8]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [9]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

In [10]: df
Out[10]: 
                   A         B         C         D
2013-01-01  0.132367 -1.060137 -0.063151 -1.426271
2013-01-02 -0.777725 -0.525146 -0.140125  1.745446
2013-01-03  1.097986  1.190745  1.436745  1.016605
2013-01-04 -0.357783 -1.185216 -1.941686 -0.800746
2013-01-05  1.445575  1.705857 -2.031742  0.684557
2013-01-06 -0.939175  0.182649 -0.381172 -0.101025

Creating a DataFrame by passing a dict of objects that can be converted to a series-like structure:

In [11]: df2 = pd.DataFrame({'A': 1.,
   ....:                     'B': pd.Timestamp('20130102'),
   ....:                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
   ....:                     'D': np.array([3] * 4, dtype='int32'),
   ....:                     'E': 'foo'})
   ....: 

In [12]: df2
Out[12]: 
     A          B    C  D    E
0  1.0 2013-01-02  1.0  3  foo
1  1.0 2013-01-02  1.0  3  foo
2  1.0 2013-01-02  1.0  3  foo
3  1.0 2013-01-02  1.0  3  foo

The columns of the resulting DataFrame have different dtypes.

In [13]: df2.dtypes
Out[13]: 
A          float64
B    datetime64[s]
C          float32
D            int32
E           object
dtype: object
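
If an inferred dtype is not what you want, a column can be cast explicitly with astype(). A minimal sketch, written against plain pandas (whose API xorbits.pandas mirrors, so the same call works after swapping the import):

```python
import pandas as pd

# Hypothetical single-column frame; 'D' is inferred as an integer dtype
df2 = pd.DataFrame({'D': [3, 3, 3, 3]})

# astype() casts the column to an explicit dtype
df2['D'] = df2['D'].astype('float32')
print(df2.dtypes['D'])  # float32
```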

Viewing data#

Here is how to view the top and bottom rows of the frame:

In [14]: df.head()
Out[14]: 
                   A         B         C         D
2013-01-01  0.132367 -1.060137 -0.063151 -1.426271
2013-01-02 -0.777725 -0.525146 -0.140125  1.745446
2013-01-03  1.097986  1.190745  1.436745  1.016605
2013-01-04 -0.357783 -1.185216 -1.941686 -0.800746
2013-01-05  1.445575  1.705857 -2.031742  0.684557

In [15]: df.tail(3)
Out[15]: 
                   A         B         C         D
2013-01-04 -0.357783 -1.185216 -1.941686 -0.800746
2013-01-05  1.445575  1.705857 -2.031742  0.684557
2013-01-06 -0.939175  0.182649 -0.381172 -0.101025

Display the index and columns:

In [16]: df.index
Out[16]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [17]: df.columns
Out[17]: Index(['A', 'B', 'C', 'D'], dtype='object')

DataFrame.to_numpy() gives an ndarray representation of the underlying data. Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between DataFrame and ndarray: ndarrays have one dtype for the entire ndarray, while DataFrames have one dtype per column. When you call DataFrame.to_numpy(), xorbits.pandas will find the ndarray dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.

For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn’t require copying data.

In [18]: df.to_numpy()
Out[18]: 
array([[ 0.13236676, -1.0601369 , -0.06315081, -1.42627082],
       [-0.77772537, -0.52514563, -0.14012538,  1.74544637],
       [ 1.09798598,  1.19074521,  1.43674542,  1.01660493],
       [-0.35778252, -1.18521622, -1.94168551, -0.80074602],
       [ 1.44557542,  1.70585707, -2.03174231,  0.68455716],
       [-0.93917526,  0.18264922, -0.38117187, -0.10102539]])

For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive.

In [19]: df2.to_numpy()
Out[19]: 
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'foo']],
      dtype=object)

Note

DataFrame.to_numpy() does not include the index or column labels in the output.
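
To make the note concrete, here is a small sketch (plain pandas, same API as xorbits.pandas) showing that only the values survive the conversion:

```python
import pandas as pd

# A tiny labeled frame: 2 rows, 1 column
df = pd.DataFrame({'A': [1.0, 2.0]}, index=['r1', 'r2'])

arr = df.to_numpy()
# The index ('r1', 'r2') and the column label ('A') are dropped
print(arr.shape)     # (2, 1)
print(arr.tolist())  # [[1.0], [2.0]]
```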

describe() shows a quick statistic summary of your data:

In [20]: df.describe()
Out[20]: 
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.100208  0.051459 -0.520188  0.186428
std    0.986656  1.195801  1.304431  1.185395
min   -0.939175 -1.185216 -2.031742 -1.426271
25%   -0.672740 -0.926389 -1.551557 -0.625816
50%   -0.112708 -0.171248 -0.260649  0.291766
75%    0.856581  0.938721 -0.082394  0.933593
max    1.445575  1.705857  1.436745  1.745446
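
describe() also accepts a custom list of percentiles. A hedged sketch with synthetic data, written against plain pandas (xorbits.pandas mirrors this API):

```python
import numpy as np
import pandas as pd

# Synthetic 6x2 frame of floats
df = pd.DataFrame(np.arange(12.0).reshape(6, 2), columns=['A', 'B'])

# Request the 10th and 90th percentiles; the median (50%) is always included
summary = df.describe(percentiles=[0.1, 0.9])
print(summary.index.tolist())
# ['count', 'mean', 'std', 'min', '10%', '50%', '90%', 'max']
```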

Sorting by an axis:

In [21]: df.sort_index(axis=1, ascending=False)
Out[21]: 
                   D         C         B         A
2013-01-01 -1.426271 -0.063151 -1.060137  0.132367
2013-01-02  1.745446 -0.140125 -0.525146 -0.777725
2013-01-03  1.016605  1.436745  1.190745  1.097986
2013-01-04 -0.800746 -1.941686 -1.185216 -0.357783
2013-01-05  0.684557 -2.031742  1.705857  1.445575
2013-01-06 -0.101025 -0.381172  0.182649 -0.939175

Sorting by values:

In [22]: df.sort_values(by='B')
Out[22]: 
                   A         B         C         D
2013-01-04 -0.357783 -1.185216 -1.941686 -0.800746
2013-01-01  0.132367 -1.060137 -0.063151 -1.426271
2013-01-02 -0.777725 -0.525146 -0.140125  1.745446
2013-01-06 -0.939175  0.182649 -0.381172 -0.101025
2013-01-03  1.097986  1.190745  1.436745  1.016605
2013-01-05  1.445575  1.705857 -2.031742  0.684557
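
sort_values() also takes multiple keys and per-key sort directions. A small sketch under the same pandas-compatible API, with invented sample data:

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'y', 'x'], 'B': [3, 1, 2]})

# Sort by 'A' ascending, breaking ties with 'B' descending
out = df.sort_values(by=['A', 'B'], ascending=[True, False])
print(out['B'].tolist())  # [3, 2, 1]
```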

Selection#

Note

While standard Python expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized xorbits.pandas data access methods, .at, .iat, .loc and .iloc.

Getting#

Selecting a single column, which yields a Series, equivalent to df.A:

In [23]: df['A']
Out[23]: 
2013-01-01    0.132367
2013-01-02   -0.777725
2013-01-03    1.097986
2013-01-04   -0.357783
2013-01-05    1.445575
2013-01-06   -0.939175
Freq: D, Name: A, dtype: float64

Selecting via [], which slices the rows:

In [24]: df[0:3]
Out[24]: 
                   A         B         C         D
2013-01-01  0.132367 -1.060137 -0.063151 -1.426271
2013-01-02 -0.777725 -0.525146 -0.140125  1.745446
2013-01-03  1.097986  1.190745  1.436745  1.016605

In [25]: df['20130102':'20130104']
Out[25]: 
                   A         B         C         D
2013-01-02 -0.777725 -0.525146 -0.140125  1.745446
2013-01-03  1.097986  1.190745  1.436745  1.016605
2013-01-04 -0.357783 -1.185216 -1.941686 -0.800746

Selection by label#

For getting a cross section using a label:

In [26]: df.loc['20130101']
Out[26]: 
A    0.132367
B   -1.060137
C   -0.063151
D   -1.426271
Name: 2013-01-01 00:00:00, dtype: float64

Selecting on a multi-axis by label:

In [27]: df.loc[:, ['A', 'B']]
Out[27]: 
                   A         B
2013-01-01  0.132367 -1.060137
2013-01-02 -0.777725 -0.525146
2013-01-03  1.097986  1.190745
2013-01-04 -0.357783 -1.185216
2013-01-05  1.445575  1.705857
2013-01-06 -0.939175  0.182649

Showing label slicing, both endpoints are included:

In [28]: df.loc['20130102':'20130104', ['A', 'B']]
Out[28]: 
                   A         B
2013-01-02 -0.777725 -0.525146
2013-01-03  1.097986  1.190745
2013-01-04 -0.357783 -1.185216

Reduction in the dimensions of the returned object:

In [29]: df.loc['20130102', ['A', 'B']]
Out[29]: 
A   -0.777725
B   -0.525146
Name: 2013-01-02 00:00:00, dtype: float64

For getting a scalar value:

In [30]: df.loc['20130101', 'A']
Out[30]: 0.13236675637964368

For getting fast access to a scalar (equivalent to the prior method):

In [31]: df.at['20130101', 'A']
Out[31]: 0.13236675637964368

Selection by position#

Select via the position of the passed integers:

In [32]: df.iloc[3]
Out[32]: 
A   -0.357783
B   -1.185216
C   -1.941686
D   -0.800746
Name: 2013-01-04 00:00:00, dtype: float64

By integer slices, acting similarly to Python:

In [33]: df.iloc[3:5, 0:2]
Out[33]: 
                   A         B
2013-01-04 -0.357783 -1.185216
2013-01-05  1.445575  1.705857

By lists of integer position locations, similar to the NumPy/Python style:

In [34]: df.iloc[[1, 2, 4], [0, 2]]
Out[34]: 
                   A         C
2013-01-02 -0.777725 -0.140125
2013-01-03  1.097986  1.436745
2013-01-05  1.445575 -2.031742

For slicing rows explicitly:

In [35]: df.iloc[1:3, :]
Out[35]: 
                   A         B         C         D
2013-01-02 -0.777725 -0.525146 -0.140125  1.745446
2013-01-03  1.097986  1.190745  1.436745  1.016605

For slicing columns explicitly:

In [36]: df.iloc[:, 1:3]
Out[36]: 
                   B         C
2013-01-01 -1.060137 -0.063151
2013-01-02 -0.525146 -0.140125
2013-01-03  1.190745  1.436745
2013-01-04 -1.185216 -1.941686
2013-01-05  1.705857 -2.031742
2013-01-06  0.182649 -0.381172

For getting a value explicitly:

In [37]: df.iloc[1, 1]
Out[37]: -0.5251456316161774

For getting fast access to a scalar (equivalent to the prior method):

In [38]: df.iat[1, 1]
Out[38]: -0.5251456316161774

Boolean indexing#

Using a single column’s values to select data.

In [39]: df[df['A'] > 0]
Out[39]: 
                   A         B         C         D
2013-01-01  0.132367 -1.060137 -0.063151 -1.426271
2013-01-03  1.097986  1.190745  1.436745  1.016605
2013-01-05  1.445575  1.705857 -2.031742  0.684557

Selecting values from a DataFrame where a boolean condition is met.

In [40]: df[df > 0]
Out[40]: 
                   A         B         C         D
2013-01-01  0.132367       NaN       NaN       NaN
2013-01-02       NaN       NaN       NaN  1.745446
2013-01-03  1.097986  1.190745  1.436745  1.016605
2013-01-04       NaN       NaN       NaN       NaN
2013-01-05  1.445575  1.705857       NaN  0.684557
2013-01-06       NaN  0.182649       NaN       NaN
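
A boolean mask can also come from membership tests. A sketch using isin() with made-up labels, written against plain pandas (the call is identical under xorbits.pandas):

```python
import pandas as pd

df = pd.DataFrame({'E': ['one', 'one', 'two', 'three', 'four', 'three']})

# isin() returns True where a value appears in the given list
mask = df['E'].isin(['two', 'four'])
print(df[mask]['E'].tolist())  # ['two', 'four']
```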

Operations#

Stats#

Operations in general exclude missing data.

Performing a descriptive statistic:

In [41]: df.mean()
Out[41]: 
A    0.100208
B    0.051459
C   -0.520188
D    0.186428
dtype: float64

Same operation on the other axis:

In [42]: df.mean(1)
Out[42]: 
2013-01-01   -0.604298
2013-01-02    0.075612
2013-01-03    1.185520
2013-01-04   -1.071358
2013-01-05    0.451062
2013-01-06   -0.309681
Freq: D, dtype: float64

Operating with objects that have different dimensionality and need alignment. In addition, xorbits.pandas automatically broadcasts along the specified dimension.

In [43]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)

In [44]: s
Out[44]: 
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

In [45]: df.sub(s, axis='index')
Out[45]: 
                   A         B         C         D
2013-01-01       NaN       NaN       NaN       NaN
2013-01-02       NaN       NaN       NaN       NaN
2013-01-03  0.097986  0.190745  0.436745  0.016605
2013-01-04 -3.357783 -4.185216 -4.941686 -3.800746
2013-01-05 -3.554425 -3.294143 -7.031742 -4.315443
2013-01-06       NaN       NaN       NaN       NaN
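
Broadcasting works along the other axis as well: with axis='columns', the Series labels align with the column labels and the operation is broadcast down every row. A minimal sketch with invented values (plain pandas syntax, mirrored by xorbits.pandas):

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [10.0, 20.0]})
row = pd.Series({'A': 1.0, 'B': 10.0})

# Subtract the Series from every row, aligning on column labels
out = df.sub(row, axis='columns')
print(out.values.tolist())  # [[0.0, 0.0], [1.0, 10.0]]
```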

Apply#

Applying functions to the data:

In [46]: df.apply(lambda x: x.max() - x.min())
Out[46]: 
A    2.384751
B    2.891073
C    3.468488
D    3.171717
dtype: float64
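
apply() also accepts NumPy functions, which run column-wise by default. A short sketch with toy data, written against plain pandas (same call under xorbits.pandas):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})

# np.cumsum is applied down each column
out = df.apply(np.cumsum)
print(out['A'].tolist())  # [1, 3, 6]
```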

String Methods#

Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them).

In [47]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

In [48]: s.str.lower()
Out[48]: 
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
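
Since str methods default to regular-expression matching, str.contains() can filter by pattern. A sketch reusing the strings above (plain pandas; xorbits.pandas exposes the same str accessor):

```python
import pandas as pd

s = pd.Series(['Aaba', 'Baca', 'CABA', 'dog'])

# Keep only strings that start with 'A' or 'B' (regex is the default;
# passed explicitly here for clarity)
mask = s.str.contains('^[AB]', regex=True)
print(s[mask].tolist())  # ['Aaba', 'Baca']
```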

Merge#

Concat#

xorbits.pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

Concatenating xorbits.pandas objects together with concat():

In [49]: df = pd.DataFrame(np.random.randn(10, 4))

In [50]: df
Out[50]: 
          0         1         2         3
0  0.461021  0.169237  0.665447  0.490070
1  2.098822  2.148103 -0.352036 -0.114643
2  0.034086 -0.978211  0.766791 -0.475893
3  1.764791  1.666357  0.489422 -0.980288
4 -1.322947 -1.043980  0.514517 -1.433020
5 -0.328488 -0.518575 -0.680450 -0.640676
6 -0.758504 -0.482748 -0.452888 -0.243717
7 -0.600254 -0.785388  1.179903  0.080592
8 -1.704476  0.753127 -0.155942  0.311720
9 -2.253653  0.678155 -0.508297  1.575723

# break it into pieces
In [51]: pieces = [df[:3], df[3:7], df[7:]]

In [52]: pd.concat(pieces)
Out[52]: 
          0         1         2         3
0  0.461021  0.169237  0.665447  0.490070
1  2.098822  2.148103 -0.352036 -0.114643
2  0.034086 -0.978211  0.766791 -0.475893
3  1.764791  1.666357  0.489422 -0.980288
4 -1.322947 -1.043980  0.514517 -1.433020
5 -0.328488 -0.518575 -0.680450 -0.640676
6 -0.758504 -0.482748 -0.452888 -0.243717
7 -0.600254 -0.785388  1.179903  0.080592
8 -1.704476  0.753127 -0.155942  0.311720
9 -2.253653  0.678155 -0.508297  1.575723
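
When the original row labels are not meaningful, concat() can renumber the result with ignore_index=True. A minimal sketch with invented frames, written in plain pandas syntax (which xorbits.pandas mirrors):

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

# ignore_index=True discards both frames' labels and renumbers 0..n-1
out = pd.concat([a, b], ignore_index=True)
print(out.index.tolist())  # [0, 1, 2, 3]
```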

Note

Adding a column to a DataFrame is relatively fast. However, adding a row requires a copy, and may be expensive. We recommend passing a pre-built list of records to the DataFrame constructor instead of building a DataFrame by iteratively appending records to it.

Join#

SQL style merges.

In [53]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})

In [54]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

In [55]: left
Out[55]: 
   key  lval
0  foo     1
1  foo     2

In [56]: right
Out[56]: 
   key  rval
0  foo     4
1  foo     5

In [57]: pd.merge(left, right, on='key')
Out[57]: 
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5

Another example:

In [58]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})

In [59]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

In [60]: left
Out[60]: 
   key  lval
0  foo     1
1  bar     2

In [61]: right
Out[61]: 
   key  rval
0  foo     4
1  bar     5

In [62]: pd.merge(left, right, on='key')
Out[62]: 
   key  lval  rval
0  foo     1     4
1  bar     2     5
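
merge() defaults to an inner join; the how parameter selects other SQL-style joins. A sketch with hypothetical keys, in plain pandas syntax (identical under xorbits.pandas):

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

# how='left' keeps every row of the left frame; unmatched keys get NaN
out = pd.merge(left, right, on='key', how='left')
print(out['key'].tolist())  # ['foo', 'bar']
```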

Grouping#

By “group by” we are referring to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria

  • Applying a function to each group independently

  • Combining the results into a data structure

In [63]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
   ....:                          'foo', 'bar', 'foo', 'foo'],
   ....:                    'B': ['one', 'one', 'two', 'three',
   ....:                          'two', 'two', 'one', 'three'],
   ....:                    'C': np.random.randn(8),
   ....:                    'D': np.random.randn(8)})
   ....: 

In [64]: df
Out[64]: 
     A      B         C         D
0  foo    one -0.421368 -0.554350
1  bar    one -1.354405 -1.466662
2  foo    two  0.516785  1.618462
3  bar  three -0.087463 -0.117263
4  foo    two  1.072735  0.569074
5  bar    two  0.269461 -0.278725
6  foo    one  0.393827  0.996956
7  foo  three  0.560126  0.067663

Grouping and then applying the sum() function to the resulting groups.

In [65]: df.groupby('A').sum()
Out[65]: 
                     B         C         D
A                                         
bar        onethreetwo -1.172407 -1.862651
foo  onetwotwoonethree  2.122105  2.697805

Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function.

In [66]: df.groupby(['A', 'B']).sum()
Out[66]: 
                  C         D
A   B                        
bar one   -1.354405 -1.466662
    three -0.087463 -0.117263
    two    0.269461 -0.278725
foo one   -0.027541  0.442606
    three  0.560126  0.067663
    two    1.589521  2.187536
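
Beyond a single reduction like sum(), agg() applies several functions per group in one pass. A small sketch with made-up values (plain pandas; the groupby API is the same in xorbits.pandas):

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar'], 'C': [1.0, 3.0, 5.0]})

# Compute sum and mean of column 'C' for each group in 'A'
out = df.groupby('A')['C'].agg(['sum', 'mean'])
print(out.loc['foo', 'sum'])  # 4.0
```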

Plotting#

We use the standard convention for referencing the matplotlib API:

In [67]: import matplotlib.pyplot as plt

In [68]: plt.close('all')

In [69]: ts = pd.Series(np.random.randn(1000),
   ....:                index=pd.date_range('1/1/2000', periods=1000))
   ....: 

In [70]: ts = ts.cumsum()

In [71]: ts.plot()
Out[71]: <Axes: >
[figure: series_plot_basic.png]

On a DataFrame, the plot() method is a convenience to plot all of the columns with labels:

In [72]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
   ....:                   columns=['A', 'B', 'C', 'D'])
   ....: 

In [73]: df = df.cumsum()

In [74]: plt.figure()
Out[74]: <Figure size 640x480 with 0 Axes>

In [75]: df.plot()
Out[75]: <Axes: >

In [76]: plt.legend(loc='best')
Out[76]: <matplotlib.legend.Legend at 0x7f3956077e50>
[figure: frame_plot_basic.png]

Getting data in/out#

CSV#

Writing to a CSV file:

In [77]: df.to_csv('foo.csv')
Out[77]: 
Empty DataFrame
Columns: []
Index: []

Reading from a CSV file:

In [78]: pd.read_csv('foo.csv')
Out[78]: 
     Unnamed: 0         A          B         C          D
0    2000-01-01 -2.770775  -0.199407  0.839147   0.575720
1    2000-01-02 -2.973869   0.851474  0.969097   2.756999
2    2000-01-03 -2.078455   1.380832  1.225448   2.559200
3    2000-01-04 -1.720549   1.411522  1.766471   2.492420
4    2000-01-05 -2.408572   1.021019 -0.911477   3.504701
..          ...       ...        ...       ...        ...
995  2002-09-22  8.449480 -17.054612  9.859169  16.593931
996  2002-09-23  7.168922 -16.559581  8.737754  16.761812
997  2002-09-24  6.719205 -17.056124  8.627867  16.992679
998  2002-09-25  5.934494 -17.205668  8.063046  16.620705
999  2002-09-26  4.833538 -16.478666  7.915041  15.991093

[1000 rows x 5 columns]
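
Note that the round trip above leaves the saved index in an "Unnamed: 0" column. read_csv() can restore it with index_col, and parse_dates rebuilds the DatetimeIndex. A sketch using an in-memory CSV (plain pandas; xorbits.pandas read_csv accepts the same parameters):

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for 'foo.csv'
csv_text = "date,A\n2000-01-01,1.5\n2000-01-02,2.5\n"

# index_col=0 restores the saved index; parse_dates=True parses it as dates
df = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)
print(type(df.index).__name__)  # DatetimeIndex
```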