Loading data using TensorFlow:
One way to pass data to TensorFlow is the feed_dict mechanism, but it is slow and should be avoided. The correct way to load data into our models is with an input pipeline, and fortunately TensorFlow has a built-in API, tf.data, to accomplish this task. In this section we are going to create a pipeline to load the data.
There are three steps to using a Dataset; a short end-to-end sketch follows this list:
- Importing data.
- Creating an iterator.
- Consuming data.
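Before going step by step, here is a minimal sketch (assuming TensorFlow 1.x, where tf.Session is available, and NumPy) that puts the three steps together:
import numpy as np
import tensorflow as tf

# 1. importing data: build a Dataset from a numpy array
data = np.random.sample((5, 2))
dataset = tf.data.Dataset.from_tensor_slices(data)

# 2. creating an iterator over the Dataset
iter = dataset.make_one_shot_iterator()
el = iter.get_next()

# 3. consuming data: run the get_next op in a session
with tf.Session() as sess:
    print(sess.run(el))  # first row of the data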
Importing data:
First we need some data to put into a Dataset. There are several ways to create one.
Using NumPy:
We can create a Dataset from a NumPy array.
# first create a random vector
a = np.random.sample((100, 2))
# make a dataset from it
dataset = tf.data.Dataset.from_tensor_slices(a)
We can also pass more than one NumPy array; a typical example is data divided into features and labels.
features, labels = (np.random.sample((100, 2)),
                    np.random.sample((100, 1)))
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
Using tensors:
We can also initialize a Dataset directly from a tensor.
# using a tensor
dataset = tf.data.Dataset.from_tensor_slices(tf.random_uniform([100, 2]))
Using a placeholder:
When we want to change the data inside the Dataset dynamically, we can use a placeholder.
x = tf.placeholder(tf.float32, shape=[None, 2])
dataset = tf.data.Dataset.from_tensor_slices(x)
Using a generator:
Using a generator is useful when we have arrays of different lengths, for example a collection of sequences.
# from generator: a ragged list of sequences of different lengths
sequence = [[[1]], [[2], [3]], [[3], [4], [5]]]

def generator():
    for el in sequence:
        yield el

dataset = tf.data.Dataset.from_generator(generator,
                                         output_types=tf.int64,
                                         output_shapes=(tf.TensorShape([None, 1])))
iter = dataset.make_initializable_iterator()
el = iter.get_next()

with tf.Session() as sess:
    sess.run(iter.initializer)
    print(sess.run(el))
    print(sess.run(el))
    print(sess.run(el))
Output:
[[1]]
[[2]
 [3]]
[[3]
 [4]
 [5]]
Here we specify the output types and shapes of our data so that the Dataset can create the correct tensors.
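If you later want to batch these variable-length elements, note that a plain batch() call would fail on mismatched shapes; a minimal sketch using padded_batch (continuing from the generator Dataset above, with the default zero padding) is:
# pad each element to the longest sequence in the batch
padded = dataset.padded_batch(2, padded_shapes=tf.TensorShape([None, 1]))
iter = padded.make_initializable_iterator()
el = iter.get_next()

with tf.Session() as sess:
    sess.run(iter.initializer)
    print(sess.run(el))  # shorter sequences are padded with zeros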
Using a CSV file:
We can read a CSV file directly into a Dataset. For example, suppose I have a CSV file of tweets.
I can easily create a Dataset by calling tf.contrib.data.make_csv_dataset. Be aware that the iterator returns a dictionary whose keys are the column names and whose values are tensors with the row values.
# load a csv
CSV_PATH = './tweets.csv'
dataset = tf.contrib.data.make_csv_dataset(CSV_PATH, batch_size=32)
iter = dataset.make_one_shot_iterator()
next = iter.get_next()
print(next)  # next is a dict with key=column names and value=column data
inputs, labels = next['text'], next['sentiment']

with tf.Session() as sess:
    sess.run([inputs, labels])
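As a variation, make_csv_dataset also accepts a label_name argument, which splits the label column off so the iterator yields (features, label) pairs instead of a single dictionary; a sketch (assuming the file really has a 'sentiment' column) is:
# let make_csv_dataset separate features and label
dataset = tf.contrib.data.make_csv_dataset(
    CSV_PATH, batch_size=32, label_name='sentiment')
iter = dataset.make_one_shot_iterator()
features, labels = iter.get_next()

with tf.Session() as sess:
    print(sess.run(labels))  # a batch of 32 sentiment values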
Create an iterator:
So far we know how to create a Dataset, but how do we get the data back? For this we use an iterator, which gives us the ability to iterate through the Dataset and retrieve the actual values. There are four types of iterators.
1. One-shot.
It iterates once through the Dataset, and we cannot feed any value to it.
x = np.random.sample((100, 2))
# make a dataset from a numpy array
dataset = tf.data.Dataset.from_tensor_slices(x)
# create the iterator
iter = dataset.make_one_shot_iterator()
After that, call get_next() to obtain the tensor that will contain our data.
# create the iterator
iter = dataset.make_one_shot_iterator()
el = iter.get_next()
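Putting the one-shot pieces together, a minimal sketch of consuming the iterator in a session (note that running get_next past the end of the data raises tf.errors.OutOfRangeError):
x = np.random.sample((100, 2))
dataset = tf.data.Dataset.from_tensor_slices(x)
iter = dataset.make_one_shot_iterator()
el = iter.get_next()

with tf.Session() as sess:
    print(sess.run(el))  # first row
    print(sess.run(el))  # second row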
2. Initializable.
If the data needs to change dynamically, we can call the iterator's initializer directly and pass in new data with feed_dict. This lets us initialize a placeholder-backed Dataset using the familiar feed_dict mechanism.
# using a placeholder
x = tf.placeholder(tf.float32, shape=[None, 2])
dataset = tf.data.Dataset.from_tensor_slices(x)
data = np.random.sample((100, 2))
iter = dataset.make_initializable_iterator()  # create the iterator
el = iter.get_next()

with tf.Session() as sess:
    # feed the placeholder with data
    sess.run(iter.initializer, feed_dict={x: data})
    print(sess.run(el))
Output:
[ 0.52374458 0.71968478]
For example:
Suppose we have train data and test data:
train_data = (np.random.sample((100, 2)), np.random.sample((100, 1)))
test_data = (np.array([[1, 2]]), np.array([[0]]))
Here we train on the training data and then validate on the test data, which we do by re-initializing the iterator after training:
# initializable iterator to switch between datasets
EPOCHS = 10
x, y = tf.placeholder(tf.float32, shape=[None, 2]), tf.placeholder(tf.float32, shape=[None, 1])
dataset = tf.data.Dataset.from_tensor_slices((x, y))
train_data = (np.random.sample((100, 2)), np.random.sample((100, 1)))
test_data = (np.array([[1, 2]]), np.array([[0]]))
iter = dataset.make_initializable_iterator()
features, labels = iter.get_next()

with tf.Session() as sess:
    # initialize the iterator with train data
    sess.run(iter.initializer, feed_dict={x: train_data[0], y: train_data[1]})
    for _ in range(EPOCHS):
        sess.run([features, labels])
    # switch to test data
    sess.run(iter.initializer, feed_dict={x: test_data[0], y: test_data[1]})
    print(sess.run([features, labels]))
3. Reinitializable:
This is very similar to the initializable iterator, but instead of feeding new data to the same Dataset, we switch between different Datasets. It can be initialized from several Datasets, which is very useful when, for example, the training Dataset needs extra transformations such as shuffling that the test Dataset does not.
# making fake data using numpy
train_data = (np.random.sample((100, 2)), np.random.sample((100, 1)))
test_data = (np.random.sample((10, 2)), np.random.sample((10, 1)))
Then we create two Datasets:
# create two datasets, one for training and one for testing
train_dataset = tf.data.Dataset.from_tensor_slices(train_data)
test_dataset = tf.data.Dataset.from_tensor_slices(test_data)
Next, create one iterator with the correct output types and shapes, and two initialization operations:
iter = tf.data.Iterator.from_structure(train_dataset.output_types,
                                       train_dataset.output_shapes)
# create the initialization operations
train_init_op = iter.make_initializer(train_dataset)
test_init_op = iter.make_initializer(test_dataset)
We get the next element as before:
features, labels = iter.get_next()
Now we can run the two initialization operations in the session to switch between the Datasets:
# Reinitializable iterator to switch between Datasets
EPOCHS = 10
# making fake data using numpy
train_data = (np.random.sample((100, 2)), np.random.sample((100, 1)))
test_data = (np.random.sample((10, 2)), np.random.sample((10, 1)))
# create two datasets, one for training and one for testing
train_dataset = tf.data.Dataset.from_tensor_slices(train_data)
test_dataset = tf.data.Dataset.from_tensor_slices(test_data)
# create an iterator of the correct type and shape
iter = tf.data.Iterator.from_structure(train_dataset.output_types,
                                       train_dataset.output_shapes)
features, labels = iter.get_next()
# create the initialization operations
train_init_op = iter.make_initializer(train_dataset)
test_init_op = iter.make_initializer(test_dataset)

with tf.Session() as sess:
    sess.run(train_init_op)  # switch to train dataset
    for _ in range(EPOCHS):
        sess.run([features, labels])
    sess.run(test_init_op)  # switch to test dataset
    print(sess.run([features, labels]))
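Since the point of a reinitializable iterator is that the two Datasets can differ, here is a sketch (the buffer size of 100 is an arbitrary choice) of shuffling only the training Dataset; the output types and shapes are unchanged, so the same iterator structure still works:
# shuffle only the training data; the test data stays in order
train_dataset = tf.data.Dataset.from_tensor_slices(train_data).shuffle(buffer_size=100)
test_dataset = tf.data.Dataset.from_tensor_slices(test_data)
# the initialization operations are created exactly as before
train_init_op = iter.make_initializer(train_dataset)
test_init_op = iter.make_initializer(test_dataset)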
4. Feedable:
It is used to select which iterator to use. This is similar to the reinitializable iterator, but here we switch between iterators rather than between Datasets. Again we create two Datasets:
train_dataset = tf.data.Dataset.from_tensor_slices((x, y))
test_dataset = tf.data.Dataset.from_tensor_slices((x, y))
After this we create the iterators; here we use initializable iterators, but one-shot iterators would work as well. We also need a string handle placeholder to select between them:
train_iterator = train_dataset.make_initializable_iterator()
test_iterator = test_dataset.make_initializable_iterator()
handle = tf.placeholder(tf.string, shape=[])
We then create a generic iterator from the handle, using the output types and shapes of the Dataset:
iter = tf.data.Iterator.from_string_handle(handle,
                                           train_dataset.output_types,
                                           train_dataset.output_shapes)
Then we get the next elements as usual:
next_elements = iter.get_next()
To switch between the iterators we just run the next_elements op, passing the correct handle in the feed_dict; each handle is obtained with sess.run(iterator.string_handle()).
sess.run(next_elements, feed_dict={handle: train_handle})
If you are using initializable iterators, just remember to initialize them before starting:
sess.run(train_iterator.initializer, feed_dict={x: train_data[0], y: train_data[1]})
sess.run(test_iterator.initializer, feed_dict={x: test_data[0], y: test_data[1]})
This is how to load and retrieve data with Datasets and iterators.
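For completeness, here is a sketch of the whole feedable pattern end to end (the fake data mirrors the earlier examples):
import numpy as np
import tensorflow as tf

EPOCHS = 10
x, y = tf.placeholder(tf.float32, shape=[None, 2]), tf.placeholder(tf.float32, shape=[None, 1])
train_data = (np.random.sample((100, 2)), np.random.sample((100, 1)))
test_data = (np.random.sample((10, 2)), np.random.sample((10, 1)))

train_dataset = tf.data.Dataset.from_tensor_slices((x, y))
test_dataset = tf.data.Dataset.from_tensor_slices((x, y))
train_iterator = train_dataset.make_initializable_iterator()
test_iterator = test_dataset.make_initializable_iterator()

handle = tf.placeholder(tf.string, shape=[])
iter = tf.data.Iterator.from_string_handle(
    handle, train_dataset.output_types, train_dataset.output_shapes)
next_elements = iter.get_next()

with tf.Session() as sess:
    # get a string handle for each iterator
    train_handle = sess.run(train_iterator.string_handle())
    test_handle = sess.run(test_iterator.string_handle())
    # initializable iterators must be initialized before use
    sess.run(train_iterator.initializer, feed_dict={x: train_data[0], y: train_data[1]})
    sess.run(test_iterator.initializer, feed_dict={x: test_data[0], y: test_data[1]})
    for _ in range(EPOCHS):
        sess.run(next_elements, feed_dict={handle: train_handle})
    # switch to the test iterator by passing its handle
    print(sess.run(next_elements, feed_dict={handle: test_handle}))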
