TSDataset
- class TSDataset(df: pandas.core.frame.DataFrame, freq: str, df_exog: Optional[pandas.core.frame.DataFrame] = None, known_future: Union[Literal['all'], Sequence] = ())
Bases: object
TSDataset is the main class for handling your time series data. It prepares the series for exploratory analysis and implements feature generation with Transforms and generation of future points.
Notes
TSDataset supports custom indexing and slicing. This is done through the following interface:
TSDataset[timestamp, segment, column]
If the dataset contains NaN values at the start of the period, those timestamps are removed. During creation, segment names are cast to string type.
Examples
>>> from etna.datasets import TSDataset, generate_const_df
>>> df = generate_const_df(periods=30, start_time="2021-06-01", n_segments=2, scale=1)
>>> df_ts_format = TSDataset.to_dataset(df)
>>> ts = TSDataset(df_ts_format, "D")
>>> ts["2021-06-01":"2021-06-07", "segment_0", "target"]
timestamp
2021-06-01    1.0
2021-06-02    1.0
2021-06-03    1.0
2021-06-04    1.0
2021-06-05    1.0
2021-06-06    1.0
2021-06-07    1.0
Freq: D, Name: (segment_0, target), dtype: float64
>>> import pandas as pd
>>> from etna.datasets import TSDataset, generate_ar_df
>>> pd.options.display.float_format = '{:,.2f}'.format
>>> df_to_forecast = generate_ar_df(100, start_time="2021-01-01", n_segments=1)
>>> df_regressors = generate_ar_df(120, start_time="2021-01-01", n_segments=5)
>>> df_regressors = df_regressors.pivot(index="timestamp", columns="segment").reset_index()
>>> df_regressors.columns = ["timestamp"] + [f"regressor_{i}" for i in range(5)]
>>> df_regressors["segment"] = "segment_0"
>>> df_to_forecast = TSDataset.to_dataset(df_to_forecast)
>>> df_regressors = TSDataset.to_dataset(df_regressors)
>>> tsdataset = TSDataset(df=df_to_forecast, freq="D", df_exog=df_regressors, known_future="all")
>>> tsdataset.df.head(5)
segment      segment_0
feature    regressor_0 regressor_1 regressor_2 regressor_3 regressor_4 target
timestamp
2021-01-01        1.62       -0.02       -0.50       -0.56        0.52   1.62
2021-01-02        1.01       -0.80       -0.81        0.38       -0.60   1.01
2021-01-03        0.48        0.47       -0.81       -1.56       -1.37   0.48
2021-01-04       -0.59        2.44       -2.21       -1.21       -0.69  -0.59
2021-01-05        0.28        0.58       -3.07       -1.45        0.77   0.28
Init TSDataset.
- Parameters
df (pandas.core.frame.DataFrame) – dataframe with timeseries
freq (str) – frequency of timestamp in df
df_exog (Optional[pandas.core.frame.DataFrame]) – dataframe with exogenous data
known_future (Union[Literal['all'], typing.Sequence]) – columns in df_exog[known_future] that are regressors; if the value "all" is given, all columns are treated as regressors
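The known_future rule above can be sketched in plain Python. This is a minimal illustration, not etna's implementation: resolve_known_future is a hypothetical helper, and raising ValueError for unknown columns or for strings other than "all" is an assumption of the sketch.

```python
from typing import List, Sequence, Union


def resolve_known_future(
    exog_columns: Sequence[str],
    known_future: Union[str, Sequence[str]] = (),
) -> List[str]:
    """Hypothetical helper: decide which exogenous columns count as regressors."""
    available = sorted(set(exog_columns))
    if isinstance(known_future, str):
        if known_future == "all":
            # "all" means every exogenous column is a regressor.
            return available
        # Assumption: "all" is the only accepted string value.
        raise ValueError("the only accepted string value is 'all'")
    requested = sorted(set(known_future))
    unknown = [name for name in requested if name not in available]
    if unknown:
        # Assumption: requesting a column that is absent from df_exog is an error.
        raise ValueError(f"columns not found in df_exog: {unknown}")
    return requested


exog = ["regressor_1", "regressor_2", "holiday_flag"]
all_regressors = resolve_known_future(exog, "all")
some_regressors = resolve_known_future(exog, ["regressor_2"])
no_regressors = resolve_known_future(exog)  # default: empty sequence
```

With the default empty sequence no exogenous column is treated as a regressor, which matches the default value known_future=() in the class signature.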
Methods
describe([segments]) – Overview of the dataset that returns a DataFrame.
fit_transform(transforms) – Fit and apply given transforms to the data.
head([n_rows]) – Return the first n_rows rows.
info([segments]) – Overview of the dataset that prints the result.
inverse_transform() – Apply inverse transform method of transforms to the data.
isnull() – Return a dataframe of flags that mark whether the corresponding value in self.df is null.
make_future(future_steps) – Return new TSDataset with future steps.
plot([n_segments, column, segments, start, ...]) – Plot of random or chosen segments.
tail([n_rows]) – Return the last n_rows rows.
to_dataset(df) – Convert pandas dataframe to ETNA Dataset format.
to_flatten(df) – Return pandas DataFrame with flattened index.
to_pandas([flatten]) – Return pandas DataFrame.
train_test_split([train_start, train_end, ...]) – Split given df with train-test timestamp indices or size of test set.
transform(transforms) – Apply given transforms to the data.
Attributes
columns – Return columns of self.df.
idx
index – Return TSDataset timestamp index.
loc – Return self.df.loc method.
regressors – Get list of all regressors across all segments in dataset.
segments – Get list of all segments in dataset.
- describe(segments: Optional[Sequence[str]] = None) → pandas.core.frame.DataFrame
Overview of the dataset that returns a DataFrame.
The method describes the dataset segment by segment. Description columns:
start_timestamp: beginning of the segment, missing values in the beginning are ignored
end_timestamp: ending of the segment, missing values in the ending are ignored
length: length according to start_timestamp and end_timestamp
num_missing: number of missing values between start_timestamp and end_timestamp
num_segments: total number of segments, common for all segments
num_exogs: number of exogenous features, common for all segments
num_regressors: number of exogenous factors that are regressors, common for all segments
num_known_future: number of regressors that are known since creation, common for all segments
freq: frequency of the series, common for all segments
- Parameters
segments (Optional[Sequence[str]]) – segments to show in overview, if None all segments are shown.
- Returns
result_table – table with results of the overview
- Return type
pd.DataFrame
Examples
>>> import pandas as pd
>>> from etna.datasets import TSDataset, generate_const_df
>>> pd.options.display.expand_frame_repr = False
>>> df = generate_const_df(
...     periods=30, start_time="2021-06-01",
...     n_segments=2, scale=1
... )
>>> df_ts_format = TSDataset.to_dataset(df)
>>> regressors_timestamp = pd.date_range(start="2021-06-01", periods=50)
>>> df_regressors_1 = pd.DataFrame(
...     {"timestamp": regressors_timestamp, "regressor_1": 1, "segment": "segment_0"}
... )
>>> df_regressors_2 = pd.DataFrame(
...     {"timestamp": regressors_timestamp, "regressor_1": 2, "segment": "segment_1"}
... )
>>> df_exog = pd.concat([df_regressors_1, df_regressors_2], ignore_index=True)
>>> df_exog_ts_format = TSDataset.to_dataset(df_exog)
>>> ts = TSDataset(df_ts_format, df_exog=df_exog_ts_format, freq="D", known_future="all")
>>> ts.describe()
          start_timestamp end_timestamp  length  num_missing  num_segments  num_exogs  num_regressors  num_known_future freq
segments
segment_0      2021-06-01    2021-06-30      30            0             2          1               1                 1    D
segment_1      2021-06-01    2021-06-30      30            0             2          1               1                 1    D
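The per-segment columns (start_timestamp, end_timestamp, length, num_missing) can be illustrated with a small standalone sketch. It is not etna's implementation: describe_segment is a hypothetical helper, list positions stand in for timestamps, None stands in for NaN, and the handling of an all-missing segment is an assumption.

```python
from typing import Dict, List, Optional


def describe_segment(values: List[Optional[float]]) -> Dict[str, Optional[int]]:
    """Hypothetical helper: summarise one segment, ignoring leading/trailing gaps."""
    observed = [i for i, value in enumerate(values) if value is not None]
    if not observed:
        # Assumption: a segment with no observations has no start or end.
        return {"start": None, "end": None, "length": 0, "num_missing": 0}
    start, end = observed[0], observed[-1]
    length = end - start + 1              # positions from start to end, inclusive
    num_missing = length - len(observed)  # gaps strictly inside [start, end]
    return {"start": start, "end": end, "length": length, "num_missing": num_missing}


# Two leading gaps and one trailing gap are ignored; one inner gap is counted.
series = [None, None, 1.0, 2.0, None, 4.0, None]
stats = describe_segment(series)
```

Here the segment starts at position 2 and ends at position 5, so its length is 4 and exactly one value inside that range is missing.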
- fit_transform(transforms: Sequence[Transform])
Fit and apply given transforms to the data.
- Parameters
transforms (Sequence[Transform]) – transforms to fit and apply to the data
- head(n_rows: int = 5) → pandas.core.frame.DataFrame
Return the first n_rows rows.
Mimics pandas method.
This function returns the first n_rows rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.
For negative values of n_rows, this function returns all rows except the last |n_rows| rows, equivalent to df[:n_rows].
- Parameters
n_rows (int) – number of rows to select.
- Returns
the first n_rows rows or 5 by default.
- Return type
pd.DataFrame
- info(segments: Optional[Sequence[str]] = None) → None
Overview of the dataset that prints the result.
The method describes the dataset segment by segment.
Information about dataset in general:
num_segments: total number of segments
num_exogs: number of exogenous features
num_regressors: number of exogenous factors that are regressors
num_known_future: number of regressors that are known since creation
freq: frequency of the dataset
Information about individual segments:
start_timestamp: beginning of the segment, missing values in the beginning are ignored
end_timestamp: ending of the segment, missing values in the ending are ignored
length: length according to start_timestamp and end_timestamp
num_missing: number of missing values between start_timestamp and end_timestamp
- Parameters
segments (Optional[Sequence[str]]) – segments to show in overview, if None all segments are shown.
- Return type
None
Examples
>>> import pandas as pd
>>> from etna.datasets import TSDataset, generate_const_df
>>> df = generate_const_df(
...     periods=30, start_time="2021-06-01",
...     n_segments=2, scale=1
... )
>>> df_ts_format = TSDataset.to_dataset(df)
>>> regressors_timestamp = pd.date_range(start="2021-06-01", periods=50)
>>> df_regressors_1 = pd.DataFrame(
...     {"timestamp": regressors_timestamp, "regressor_1": 1, "segment": "segment_0"}
... )
>>> df_regressors_2 = pd.DataFrame(
...     {"timestamp": regressors_timestamp, "regressor_1": 2, "segment": "segment_1"}
... )
>>> df_exog = pd.concat([df_regressors_1, df_regressors_2], ignore_index=True)
>>> df_exog_ts_format = TSDataset.to_dataset(df_exog)
>>> ts = TSDataset(df_ts_format, df_exog=df_exog_ts_format, freq="D", known_future="all")
>>> ts.info()
<class 'etna.datasets.TSDataset'>
num_segments: 2
num_exogs: 1
num_regressors: 1
num_known_future: 1
freq: D
          start_timestamp end_timestamp  length  num_missing
segments
segment_0      2021-06-01    2021-06-30      30            0
segment_1      2021-06-01    2021-06-30      30            0
- inverse_transform()
Apply inverse transform method of transforms to the data.
The transforms are applied in reversed order.
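Why the order is reversed can be seen on two toy transforms. The sketch below does not use etna's Transform classes: AddConstant, Scale, apply_all and invert_all are hypothetical names introduced only for illustration.

```python
from typing import List, Sequence


class AddConstant:
    """Toy transform: shift every value by a constant."""

    def __init__(self, constant: float) -> None:
        self.constant = constant

    def transform(self, values: List[float]) -> List[float]:
        return [value + self.constant for value in values]

    def inverse_transform(self, values: List[float]) -> List[float]:
        return [value - self.constant for value in values]


class Scale:
    """Toy transform: multiply every value by a factor."""

    def __init__(self, factor: float) -> None:
        self.factor = factor

    def transform(self, values: List[float]) -> List[float]:
        return [value * self.factor for value in values]

    def inverse_transform(self, values: List[float]) -> List[float]:
        return [value / self.factor for value in values]


def apply_all(transforms: Sequence, values: List[float]) -> List[float]:
    # Forward pass: first transform first.
    for transform in transforms:
        values = transform.transform(values)
    return values


def invert_all(transforms: Sequence, values: List[float]) -> List[float]:
    # Backward pass: undo the last transform first.
    for transform in reversed(transforms):
        values = transform.inverse_transform(values)
    return values


original = [1.0, 2.0, 3.0]
pipeline = [AddConstant(10.0), Scale(2.0)]
transformed = apply_all(pipeline, original)
restored = invert_all(pipeline, transformed)

# Undoing the transforms in forward order does not recover the input.
wrong = transformed
for transform in pipeline:
    wrong = transform.inverse_transform(wrong)
```

Only the reversed pass returns the original values, because each inverse has to undo the most recently applied transform.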
- isnull() → pandas.core.frame.DataFrame
Return a dataframe of flags that mark whether the corresponding value in self.df is null.
- Returns
is_null dataframe
- Return type
pd.DataFrame
- make_future(future_steps: int) → etna.datasets.tsdataset.TSDataset
Return new TSDataset with future steps.
- Parameters
future_steps (int) – number of timestamps in the future to build features for.
- Returns
dataset with features in the future.
- Return type
TSDataset
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from etna.datasets import TSDataset, generate_const_df
>>> df = generate_const_df(
...     periods=30, start_time="2021-06-01",
...     n_segments=2, scale=1
... )
>>> df_regressors = pd.DataFrame({
...     "timestamp": list(pd.date_range("2021-06-01", periods=40))*2,
...     "regressor_1": np.arange(80), "regressor_2": np.arange(80) + 5,
...     "segment": ["segment_0"]*40 + ["segment_1"]*40
... })
>>> df_ts_format = TSDataset.to_dataset(df)
>>> df_regressors_ts_format = TSDataset.to_dataset(df_regressors)
>>> ts = TSDataset(
...     df_ts_format, "D", df_exog=df_regressors_ts_format, known_future="all"
... )
>>> ts.make_future(4)
segment      segment_0                      segment_1
feature    regressor_1 regressor_2 target regressor_1 regressor_2 target
timestamp
2021-07-01          30          35    NaN          70          75    NaN
2021-07-02          31          36    NaN          71          76    NaN
2021-07-03          32          37    NaN          72          77    NaN
2021-07-04          33          38    NaN          73          78    NaN
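The shape of the result above (unknown target, regressor values already known) can be reproduced with a standard-library sketch for one segment with daily frequency. It is an illustration only: make_future_rows is a hypothetical helper, and None stands in for NaN.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def make_future_rows(
    last_timestamp: date,
    future_steps: int,
    known_regressor: Dict[date, int],
) -> List[Dict[str, Optional[object]]]:
    """Hypothetical helper: build future rows for daily data."""
    rows = []
    for step in range(1, future_steps + 1):
        timestamp = last_timestamp + timedelta(days=step)
        rows.append(
            {
                "timestamp": timestamp,
                "target": None,  # the target is not known in the future
                # The regressor is known in advance, so it is looked up.
                "regressor_1": known_regressor.get(timestamp),
            }
        )
    return rows


# The regressor covers 40 days, while the target ends on 2021-06-30 (30 days).
start = date(2021, 6, 1)
regressor_1 = {start + timedelta(days=i): i for i in range(40)}
future = make_future_rows(date(2021, 6, 30), 4, regressor_1)
```

The regressor values for 2021-07-01 to 2021-07-04 come out as 30 to 33, which is the segment_0 / regressor_1 column of the example output.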
- plot(n_segments: int = 10, column: str = 'target', segments: Optional[Sequence[str]] = None, start: Optional[str] = None, end: Optional[str] = None, seed: int = 1, figsize: Tuple[int, int] = (10, 5))
Plot of random or chosen segments.
- Parameters
n_segments (int) – number of random segments to plot
column (str) – feature to plot
segments (Optional[Sequence[str]]) – segments to plot
seed (int) – seed for local random state
start (Optional[str]) – start plot from this timestamp
end (Optional[str]) – end plot at this timestamp
figsize (Tuple[int, int]) – size of the figure per subplot with one segment in inches
- tail(n_rows: int = 5) → pandas.core.frame.DataFrame
Return the last n_rows rows.
Mimics pandas method.
This function returns the last n_rows rows from the object based on position. It is useful for quickly verifying data, for example, after sorting or appending rows.
For negative values of n_rows, this function returns all rows except the first |n_rows| rows, equivalent to df[abs(n_rows):].
- Parameters
n_rows (int) – number of rows to select.
- Returns
the last n_rows rows or 5 by default.
- Return type
pd.DataFrame
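Since head and tail mimic the pandas methods, the behaviour for negative n_rows can be checked on a plain pandas DataFrame. The sketch below does not use TSDataset itself; df is an illustrative frame.

```python
import pandas as pd

# Ten rows with values 0..9 so the selected rows are easy to read off.
df = pd.DataFrame({"target": range(10)})

head_negative = df.head(-3)  # all rows except the last 3
tail_negative = df.tail(-3)  # all rows except the first 3

head_values = head_negative["target"].tolist()
tail_values = tail_negative["target"].tolist()
```

With n_rows = -3, head keeps the first seven rows (the same as df[:-3]) and tail keeps the last seven rows (the same as df[3:]).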
- static to_dataset(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame
Convert pandas dataframe to ETNA Dataset format.
Columns “timestamp” and “segment” are required.
- Parameters
df (pandas.core.frame.DataFrame) – DataFrame with columns [“timestamp”, “segment”]. Other columns are considered features.
- Return type
pandas.core.frame.DataFrame
Notes
During conversion, segment names are cast to string type.
Examples
>>> from etna.datasets import TSDataset, generate_const_df
>>> df = generate_const_df(
...     periods=30, start_time="2021-06-01",
...     n_segments=2, scale=1
... )
>>> df.head(5)
   timestamp    segment  target
0 2021-06-01  segment_0    1.00
1 2021-06-02  segment_0    1.00
2 2021-06-03  segment_0    1.00
3 2021-06-04  segment_0    1.00
4 2021-06-05  segment_0    1.00
>>> df_ts_format = TSDataset.to_dataset(df)
>>> df_ts_format.head(5)
segment    segment_0 segment_1
feature       target    target
timestamp
2021-06-01      1.00      1.00
2021-06-02      1.00      1.00
2021-06-03      1.00      1.00
2021-06-04      1.00      1.00
2021-06-05      1.00      1.00
>>> import numpy as np
>>> import pandas as pd
>>> df_regressors = pd.DataFrame({
...     "timestamp": pd.date_range("2021-01-01", periods=10),
...     "regressor_1": np.arange(10), "regressor_2": np.arange(10) + 5,
...     "segment": ["segment_0"]*10
... })
>>> TSDataset.to_dataset(df_regressors).head(5)
segment      segment_0
feature    regressor_1 regressor_2
timestamp
2021-01-01           0           5
2021-01-02           1           6
2021-01-03           2           7
2021-01-04           3           8
2021-01-05           4           9
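The wide layout shown in these examples (timestamps as the index, a two-level column index with levels segment and feature) can be reproduced with plain pandas. This is only a sketch of the layout, not etna's implementation of to_dataset; long_df and wide_df are illustrative names.

```python
import pandas as pd

# Long format: one row per (timestamp, segment) pair.
long_df = pd.DataFrame(
    {
        "timestamp": list(pd.date_range("2021-06-01", periods=3)) * 2,
        "segment": ["segment_0"] * 3 + ["segment_1"] * 3,
        "target": [1.0, 1.0, 1.0, 2.0, 2.0, 2.0],
    }
)

# Segment labels are cast to strings, as the note for to_dataset states.
long_df["segment"] = long_df["segment"].astype(str)

# Pivot to wide format; the columns come out as (feature, segment) pairs.
wide_df = long_df.pivot(index="timestamp", columns="segment")

# Put the segment level first and name both levels.
wide_df = wide_df.reorder_levels([1, 0], axis=1).sort_index(axis=1)
wide_df.columns.names = ["segment", "feature"]

# One column is addressed by a (segment, feature) pair.
segment_1_target = wide_df.loc[:, ("segment_1", "target")]
```

Every feature column of the long frame ends up once per segment in the wide frame, which is why the examples above show target repeated under segment_0 and segment_1.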
- static to_flatten(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame
Return pandas DataFrame with flattened index.
- Parameters
df (pandas.core.frame.DataFrame) – DataFrame in ETNA format.
- Returns
dataframe with TSDataset data
- Return type
pd.DataFrame
Examples
>>> from etna.datasets import TSDataset, generate_const_df
>>> df = generate_const_df(
...     periods=30, start_time="2021-06-01",
...     n_segments=2, scale=1
... )
>>> df.head(5)
   timestamp    segment  target
0 2021-06-01  segment_0    1.00
1 2021-06-02  segment_0    1.00
2 2021-06-03  segment_0    1.00
3 2021-06-04  segment_0    1.00
4 2021-06-05  segment_0    1.00
>>> df_ts_format = TSDataset.to_dataset(df)
>>> TSDataset.to_flatten(df_ts_format).head(5)
   timestamp  target    segment
0 2021-06-01     1.0  segment_0
1 2021-06-02     1.0  segment_0
2 2021-06-03     1.0  segment_0
3 2021-06-04     1.0  segment_0
4 2021-06-05     1.0  segment_0
- to_pandas(flatten: bool = False) → pandas.core.frame.DataFrame
Return pandas DataFrame.
- Parameters
flatten (bool) –
If False, return pd.DataFrame with multiindex
If True, return pd.DataFrame with flattened index
- Returns
dataframe with TSDataset data
- Return type
pd.DataFrame
Examples
>>> from etna.datasets import TSDataset, generate_const_df
>>> df = generate_const_df(
...     periods=30, start_time="2021-06-01",
...     n_segments=2, scale=1
... )
>>> df.head(5)
   timestamp    segment  target
0 2021-06-01  segment_0    1.00
1 2021-06-02  segment_0    1.00
2 2021-06-03  segment_0    1.00
3 2021-06-04  segment_0    1.00
4 2021-06-05  segment_0    1.00
>>> df_ts_format = TSDataset.to_dataset(df)
>>> ts = TSDataset(df_ts_format, "D")
>>> ts.to_pandas(True).head(5)
   timestamp  target    segment
0 2021-06-01     1.0  segment_0
1 2021-06-02     1.0  segment_0
2 2021-06-03     1.0  segment_0
3 2021-06-04     1.0  segment_0
4 2021-06-05     1.0  segment_0
>>> ts.to_pandas(False).head(5)
segment    segment_0 segment_1
feature       target    target
timestamp
2021-06-01      1.00      1.00
2021-06-02      1.00      1.00
2021-06-03      1.00      1.00
2021-06-04      1.00      1.00
2021-06-05      1.00      1.00
- train_test_split(train_start: Optional[Union[str, pandas._libs.tslibs.timestamps.Timestamp]] = None, train_end: Optional[Union[str, pandas._libs.tslibs.timestamps.Timestamp]] = None, test_start: Optional[Union[str, pandas._libs.tslibs.timestamps.Timestamp]] = None, test_end: Optional[Union[str, pandas._libs.tslibs.timestamps.Timestamp]] = None, test_size: Optional[int] = None) → Tuple[etna.datasets.tsdataset.TSDataset, etna.datasets.tsdataset.TSDataset]
Split given df with train-test timestamp indices or size of test set.
If test_size is inconsistent with (test_start, test_end), test_size is ignored.
- Parameters
train_start (Optional[Union[str, pandas._libs.tslibs.timestamps.Timestamp]]) – start timestamp of new train dataset; if None, the first timestamp is used
train_end (Optional[Union[str, pandas._libs.tslibs.timestamps.Timestamp]]) – end timestamp of new train dataset; if None, the timestamp preceding test_start is used
test_start (Optional[Union[str, pandas._libs.tslibs.timestamps.Timestamp]]) – start timestamp of new test dataset; if None, the timestamp following train_end is used
test_end (Optional[Union[str, pandas._libs.tslibs.timestamps.Timestamp]]) – end timestamp of new test dataset; if None, the last timestamp is used
test_size (Optional[int]) – number of timestamps to use in test set
- Returns
train, test – generated datasets
- Return type
Tuple[TSDataset, TSDataset]
Examples
>>> import pandas as pd
>>> from etna.datasets import TSDataset, generate_ar_df
>>> pd.options.display.float_format = '{:,.2f}'.format
>>> df = generate_ar_df(100, start_time="2021-01-01", n_segments=3)
>>> df = TSDataset.to_dataset(df)
>>> ts = TSDataset(df, "D")
>>> train_ts, test_ts = ts.train_test_split(
...     train_start="2021-01-01", train_end="2021-02-01",
...     test_start="2021-02-02", test_end="2021-02-07"
... )
>>> train_ts.df.tail(5)
segment    segment_0 segment_1 segment_2
feature       target    target    target
timestamp
2021-01-28     -2.06      2.03      1.51
2021-01-29     -2.33      0.83      0.81
2021-01-30     -1.80      1.69      0.61
2021-01-31     -2.49      1.51      0.85
2021-02-01     -2.89      0.91      1.06
>>> test_ts.df.head(5)
segment    segment_0 segment_1 segment_2
feature       target    target    target
timestamp
2021-02-02     -3.57     -0.32      1.72
2021-02-03     -4.42      0.23      3.51
2021-02-04     -5.09      1.02      3.39
2021-02-05     -5.10      0.40      2.15
2021-02-06     -6.22      0.92      0.97
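The defaults described for the four borders can be sketched with the standard library. This is not etna's implementation: split_bounds is a hypothetical helper working on a sorted list of dates, and two points are assumptions of the sketch, namely that train_end takes priority over test_size when test_start is missing, and that a ValueError is raised when none of train_end, test_start and test_size is given.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple


def split_bounds(
    timestamps: List[date],
    train_start: Optional[date] = None,
    train_end: Optional[date] = None,
    test_start: Optional[date] = None,
    test_end: Optional[date] = None,
    test_size: Optional[int] = None,
) -> Tuple[date, date, date, date]:
    """Hypothetical helper: resolve train/test borders on sorted timestamps."""
    if test_end is None:
        test_end = timestamps[-1]  # last timestamp
    if test_start is None:
        if train_end is not None:
            # The timestamp following train_end.
            test_start = min(t for t in timestamps if t > train_end)
        elif test_size is not None:
            # Count test_size timestamps back from test_end (inclusive).
            candidates = [t for t in timestamps if t <= test_end]
            test_start = candidates[-test_size]
        else:
            # Assumption: the split point must be given in some form.
            raise ValueError("set at least one of train_end, test_start or test_size")
    if train_end is None:
        # The timestamp preceding test_start.
        train_end = max(t for t in timestamps if t < test_start)
    if train_start is None:
        train_start = timestamps[0]  # first timestamp
    return train_start, train_end, test_start, test_end


days = [date(2021, 1, 1) + timedelta(days=i) for i in range(100)]
by_dates = split_bounds(days, train_end=date(2021, 2, 1), test_end=date(2021, 2, 7))
by_size = split_bounds(days, test_size=7)
```

In this sketch test_size is consulted only when test_start cannot be derived otherwise, which is one way to read the rule that test_size is ignored when it conflicts with the explicit borders.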
- transform(transforms: Sequence[Transform])
Apply given transforms to the data.
- Parameters
transforms (Sequence[Transform]) – transforms to apply to the data
- property columns: pandas.core.indexes.multi.MultiIndex
Return columns of self.df.
- Returns
multiindex of dataframe with target and features.
- Return type
pd.core.indexes.multi.MultiIndex
- property index: pandas.core.indexes.datetimes.DatetimeIndex
Return TSDataset timestamp index.
- Returns
timestamp index of TSDataset
- Return type
pd.core.indexes.datetimes.DatetimeIndex
- property loc: pandas.core.indexing._LocIndexer
Return self.df.loc method.
- Returns
dataframe with self.df.loc[…]
- Return type
pd.core.indexing._LocIndexer
- property regressors: List[str]
Get list of all regressors across all segments in dataset.
Examples
>>> import pandas as pd
>>> from etna.datasets import TSDataset, generate_const_df
>>> df = generate_const_df(
...     periods=30, start_time="2021-06-01",
...     n_segments=2, scale=1
... )
>>> df_ts_format = TSDataset.to_dataset(df)
>>> regressors_timestamp = pd.date_range(start="2021-06-01", periods=50)
>>> df_regressors_1 = pd.DataFrame(
...     {"timestamp": regressors_timestamp, "regressor_1": 1, "segment": "segment_0"}
... )
>>> df_regressors_2 = pd.DataFrame(
...     {"timestamp": regressors_timestamp, "regressor_1": 2, "segment": "segment_1"}
... )
>>> df_exog = pd.concat([df_regressors_1, df_regressors_2], ignore_index=True)
>>> df_exog_ts_format = TSDataset.to_dataset(df_exog)
>>> ts = TSDataset(
...     df_ts_format, df_exog=df_exog_ts_format, freq="D", known_future="all"
... )
>>> ts.regressors
['regressor_1']
- property segments: List[str]
Get list of all segments in dataset.
Examples
>>> from etna.datasets import TSDataset, generate_const_df
>>> df = generate_const_df(
...     periods=30, start_time="2021-06-01",
...     n_segments=2, scale=1
... )
>>> df_ts_format = TSDataset.to_dataset(df)
>>> ts = TSDataset(df_ts_format, "D")
>>> ts.segments
['segment_0', 'segment_1']