Preprocessing
- preprocessing.add_noise_to_series(series, noise_max=9e-05)[source]
Add uniform noise to a time series.
- Args:
series: The time series to which the noise is added.
noise_max: The upper limit of the amount of noise that can be added to a time series point.
- Return:
The time series with noise added
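A minimal usage sketch, assuming the module is importable as preprocessing and that the function accepts a pandas Series:

```python
import pandas as pd
import preprocessing

# A small one-dimensional series; the flat segments are typical of
# sensor data where consecutive readings repeat exactly.
series = pd.Series(
    [1.0, 1.0, 2.0, 2.0],
    index=pd.date_range("2021-01-01", periods=4, freq="30s"),
)

# Each point receives uniform noise of at most noise_max.
noisy = preprocessing.add_noise_to_series(series, noise_max=9e-05)
```

A noise ceiling this small leaves the signal essentially intact while breaking exact ties between repeated values.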
- preprocessing.add_noise_to_series_md(df, noise_max=9e-05)[source]
Add uniform noise to a multidimensional time series that is given as a pandas DataFrame.
- Args:
df: The DataFrame that contains the multidimensional time series.
noise_max: The upper limit of the amount of noise that can be added to a time series point.
- Return:
The DataFrame with noise added to all columns
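The multidimensional variant works the same way, one column at a time; a usage sketch under the same import assumption:

```python
import pandas as pd
import preprocessing

# One column per sensor channel.
df = pd.DataFrame(
    {"sensor_a": [1.0, 1.0, 2.0], "sensor_b": [5.0, 5.0, 5.0]},
    index=pd.date_range("2021-01-01", periods=3, freq="30s"),
)

# Noise is applied independently to every column.
noisy_df = preprocessing.add_noise_to_series_md(df, noise_max=9e-05)
```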
- preprocessing.change_granularity(df, granularity='30s', size=10000000, chunk=True)[source]
Change the sampling offset (granularity) of a time series. The DataFrame is divided into pieces that are interpolated separately via chunk_interpolate.
- Args:
df: DataFrame with a datetime index.
granularity: The offset to which the user wants to resample the time series.
size: The chunk size used to divide the DataFrame according to the global index of the set. The default value is 10 million.
chunk: If set to True, chunk_interpolate is applied after resampling.
- Return:
The interpolated DataFrame/time series
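A usage sketch, assuming a datetime-indexed DataFrame and the preprocessing import used above:

```python
import pandas as pd
import preprocessing

# An irregularly sampled series.
idx = pd.to_datetime([
    "2021-01-01 00:00:00", "2021-01-01 00:00:10",
    "2021-01-01 00:00:45", "2021-01-01 00:01:30",
])
df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Resample onto a regular 30-second grid; the gaps created by
# resampling are then filled chunk by chunk.
regular = preprocessing.change_granularity(df, granularity="30s", chunk=True)
```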
- preprocessing.chunk_interpolate(df, size=1000000, method='linear', axis=0, limit_direction='both', limit=1)[source]
After chunker divides the DataFrame into pieces according to the index, each piece is interpolated with the given pandas.interpolate() arguments and the pieces are merged back together. This makes complete interpolation possible without RAM problems, especially on large datasets.
- Args:
df: Date/Time DataFrame or any given DataFrame.
size: The chunk size used to divide the DataFrame according to the global index of the set. The default value is 1 million.
method, axis, limit_direction, limit: Passed through to pandas.interpolate().
- Return:
The interpolated DataFrame
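The mechanism can be sketched in plain pandas; this is an illustration of the idea, not the library's exact implementation:

```python
import numpy as np
import pandas as pd

def chunk_interpolate_sketch(df, size=1_000_000, **interp_kwargs):
    # Split by position, interpolate each piece independently,
    # then concatenate the pieces back together.
    pieces = (df.iloc[i:i + size] for i in range(0, len(df), size))
    return pd.concat(p.interpolate(**interp_kwargs) for p in pieces)

df = pd.DataFrame({"value": [1.0, np.nan, 3.0, np.nan, 5.0]})
filled = chunk_interpolate_sketch(
    df, size=2, method="linear", limit_direction="both", limit=1
)
```

Interpolating piecewise bounds peak memory at roughly one chunk, at the cost of not looking across chunk boundaries.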
- preprocessing.chunker(seq, size)[source]
Divide a file/DataFrame etc. into pieces for better handling of RAM.
- Args:
seq: Sequence, folder, Date/Time DataFrame or any given DataFrame.
size: The size of the chunks into which the Seq/Folder/DataFrame is divided.
- Return:
The divided groups
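A usage sketch, assuming the returned groups can be iterated over and that a DataFrame is split row-wise:

```python
import pandas as pd
import preprocessing

df = pd.DataFrame({"value": range(10)})

# Process the DataFrame four rows at a time instead of all at once.
for piece in preprocessing.chunker(df, size=4):
    print(len(piece))  # 4, 4, 2
```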
- preprocessing.enumerate2(start, end, step=1)[source]
Enumerate the values from start to end with a given step.
- Args:
start: The starting point.
end: The ending point.
step: The step of the process.
- Return:
The enumerated values between start and end
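The entry above is terse, so the following is purely a hypothetical reading: an enumerate-style generator that yields (index, value) pairs over a numeric range.

```python
import numpy as np

def enumerate2_sketch(start, end, step=1):
    # Hypothetical behaviour: like enumerate(), but over a numeric
    # range with an arbitrary (possibly fractional) step.
    for i, value in enumerate(np.arange(start, end, step)):
        yield i, value

for i, t in enumerate2_sketch(0, 2, step=0.5):
    print(i, t)  # (0, 0.0) (1, 0.5) (2, 1.0) (3, 1.5)
```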
- preprocessing.filter_col(df, col, less_than=None, bigger_than=None)[source]
Remove rows of the DataFrame whose values in a column are under and/or over given thresholds.
- Args:
df: DataFrame with a datetime index.
col: The column of the DataFrame to filter on.
less_than: Drop rows whose value in col is below this threshold.
bigger_than: Drop rows whose value in col is above this threshold.
- Return:
The filtered time series/DataFrame
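A usage sketch under the same preprocessing import assumption:

```python
import pandas as pd
import preprocessing

df = pd.DataFrame(
    {"value": [5.0, 50.0, 500.0]},
    index=pd.date_range("2021-01-01", periods=3, freq="30s"),
)

# Drop rows below 10 and above 100, keeping only the middle row.
filtered = preprocessing.filter_col(df, "value", less_than=10, bigger_than=100)
```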
- preprocessing.filter_dates(df, start, end)[source]
Remove rows of the DataFrame that are not in the [start, end] interval.
- Args:
df: DataFrame that has a datetime index.
start: Date that signifies the start of the interval.
end: Date that signifies the end of the interval.
- Return:
The filtered time series/DataFrame
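A usage sketch; whether start and end may be strings or must be datetime objects is not stated above, so the string form here is an assumption:

```python
import pandas as pd
import preprocessing

df = pd.DataFrame(
    {"value": range(5)},
    index=pd.date_range("2021-01-01", periods=5, freq="D"),
)

# Keep only the rows between January 2 and January 4, inclusive.
subset = preprocessing.filter_dates(df, start="2021-01-02", end="2021-01-04")
```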
- preprocessing.filter_df(df, filter_dict)[source]
Create a filtered DataFrame by filtering on multiple columns at once.
- Args:
df: Date/Time DataFrame or any given DataFrame.
filter_dict: A dictionary of the columns the user wants to filter.
- Return:
Filtered DataFrame
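The layout of filter_dict is not documented above; the sketch below assumes it maps column names to (less_than, bigger_than) bounds in the style of filter_col, which is a hypothetical reading:

```python
import pandas as pd
import preprocessing

df = pd.DataFrame({"temp": [10, 20, 30], "pressure": [1.0, 2.0, 3.0]})

# Hypothetical layout: column name -> (less_than, bigger_than).
filters = {"temp": (15, 25), "pressure": (0.5, 2.5)}

filtered = preprocessing.filter_df(df, filters)
```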
- preprocessing.filter_dispersed(df, window, eps)[source]
Look at windows of consecutive rows and calculate their mean and variance. For each window, if the index of dispersion (variance divided by mean) of the given column is within the given threshold, the last row of the window remains in the DataFrame.
- Args:
df: Date/Time DataFrame or any given DataFrame.
window: The number of consecutive rows in each window.
eps: A small value in order to avoid dividing by zero (see is_stable).
- Return:
The filtered DataFrame
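A usage sketch under the same import assumption:

```python
import pandas as pd
import preprocessing

df = pd.DataFrame(
    {"value": [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]},
    index=pd.date_range("2021-01-01", periods=6, freq="30s"),
)

# Keep the last row of each 3-row window whose variance-to-mean
# ratio is considered stable.
stable = preprocessing.filter_dispersed(df, window=3, eps=1e-6)
```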
- preprocessing.is_stable(*args, epsilon)[source]
Check whether the given columns are stable by dividing their variance by their mean.
- Args:
args: The columns/values to be checked.
epsilon: A small value in order to avoid dividing by zero.
- Return:
A boolean vector derived from the division of the variance by the mean of a column.
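A hypothetical sketch of the stability test as described above (variance over mean, guarded by epsilon); the cut-off value and exact call shape are assumptions, not the library's actual logic:

```python
import numpy as np

def is_stable_sketch(*args, epsilon=1e-6, threshold=0.01):
    # Hypothetical: one boolean per input vector, True when the
    # variance-to-mean ratio (index of dispersion) is small.
    return [np.var(a) / (np.mean(a) + epsilon) < threshold for a in args]

print(is_stable_sketch([1.0, 1.01, 0.99], [1.0, 10.0, 0.1]))  # [True, False]
```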