TransformSpec using Pandas causes incompatibilities with other libraries for make_batch_reader #603
Hi @KamWithK: thank you for sharing the use case. Pretty interesting. Looping through the reader would make your code run on the main PyTorch process; if the processing is non-trivial, we would lose the benefits of the thread/process pools provided by Petastorm.

Can you please give an example of such a 'not flat' tensor? Are you getting back ragged/jagged tensors? A list of torch tensors? Something else? A small code snippet demonstrating these types would help.

One workaround for representing a list of variable-size arrays is to use two tensors: one with the data and another with indices into that data. For example, to represent:

```python
ragged_data = [["a", "bc"], ["d"]]
```

one could use:

```python
data = ["a", "bc", "d"]
data_index = [
    [0, 2],
    [2, 3],
]
```

Recovering the data from these two tensors would be: `ragged_data[i] == data[data_index[i, 0]:data_index[i, 1]]`.
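A runnable sketch of that two-tensor encoding (plain Python/NumPy for illustration; not Petastorm API):

```python
import numpy as np

ragged_data = [["a", "bc"], ["d"]]

# Flatten the ragged rows into one data list plus [start, stop) offsets.
data, index, offset = [], [], 0
for row in ragged_data:
    data.extend(row)
    index.append([offset, offset + len(row)])
    offset += len(row)
data_index = np.array(index)  # [[0, 2], [2, 3]]

# Row i is recovered by slicing the flat data with its index pair.
for i, row in enumerate(ragged_data):
    assert row == data[data_index[i, 0]:data_index[i, 1]]
```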
Hey, thanks for the feedback. Here's a small code snippet which illustrates what the output of simple tokenization (using Hugging Face Transformers) looks like:
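For illustration, a stand-in example that produces output of this kind (assuming a cased BERT tokenizer and made-up sentences, not the original snippet):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Padding makes every row the same length, so each field is a 2-D array.
batch = tokenizer(
    ["this is a cat", "let's go outside", "hi"],
    padding=True,
    return_tensors="np",
)

# batch behaves like a dict with three rectangular arrays:
# input_ids, token_type_ids, attention_mask.
print({name: array.shape for name, array in batch.items()})
```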
Okay, so I finally managed to get it to work half-decently with Transformers (although not AllenNLP yet; I may need to rewrite Petastorm's data loader for it). In the end, I added the following code just after

This isn't perfect, but it does work decently and
There is a longer-term solution that might be better (but requires a significant effort), which is to start using `torch.utils.data.DataLoader` for parallelization instead of Petastorm's custom thread/process pool. That way we would fit more natively into the PyTorch ecosystem, and these kinds of operations would be more natural. As for your solution, is it different from the one you originally considered:
@KamWithK, if I'm understanding correctly, you shouldn't need to tokenize "under-the-hood" in a way that requires modifications to

```python
import numpy as np
import pandas as pd

# Each record stores the full 2-D tokenizer output as array-valued cells.
rows = [{
    "input_ids": np.array(
        [
            [101, 1142, 1110, 170, 5855, 102, 0, 0, 0, 0, 0, 0],
            [101, 18348, 1301, 184, 10223, 102, 0, 0, 0, 0, 0, 0],
            [101, 5211, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [101, 194, 9717, 194, 9717, 194, 9717, 5005, 170, 5024, 22549, 102],
        ]
    ),
    "token_type_ids": np.array(
        [
            [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        ]
    ),
    "attention_mask": np.array(
        [
            [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
            [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
            [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        ]
    ),
}]

tokens_df = pd.DataFrame.from_records(rows)
```

It looks like the input would be
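Building on that, a minimal sketch of a `TransformSpec` function producing such a frame (the `text` column, the tokenizer choice, and the output columns are assumptions, not code from this thread):

```python
import numpy as np
import pandas as pd
from petastorm import TransformSpec
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_batch(df: pd.DataFrame) -> pd.DataFrame:
    # Tokenize the whole column at once; each resulting cell is a 1-D array.
    encoded = tokenizer(df["text"].tolist(), padding=True)
    return pd.DataFrame({
        "input_ids": [np.asarray(ids) for ids in encoded["input_ids"]],
        "attention_mask": [np.asarray(m) for m in encoded["attention_mask"]],
    })

# NB: a real reader would also need the output schema declared (TransformSpec's
# edit_fields/removed_fields arguments); that wiring is omitted in this sketch.
transform = TransformSpec(tokenize_batch)
```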
Also, @KamWithK, can you please clarify: are you using
Hmm, I'll give this a try now. It shouldn't be too tough to use. Normally you'd just pass in a PyTorch Dataset/IterableDataset with a few options. However, here we're using Petastorm's own Reader object, so I'm not sure whether it'll allow that (I'll check now though). The other problem is that I can't seem to tokenize within the Reader object because of the need to use Pandas DataFrames.
I'm using Petastorm's
I've run your code, @dmcguire81, but Pandas interprets those as objects, which the built-in data loaders will raise an error for. Or am I wrong about that? How would you go about specifying the

Here is my code:
@selitvin one quick way to achieve this would be, without having to modify or recreate the current

What really confuses me here, though, is where in the

By the way, I am aware that PyArrow can cause some trouble when using multiple threads or processes (I've read this online). This would be what stops us from simply iterating through the

Also, I'm trying to read through
Okay, thanks @selitvin, #605 looks like a decent solution for now. I've been reading through the docs to try and understand how you handle multiprocessing. Do you just use Parquet row groups? If so, this provides a very easy way to handle multiprocessing using PyTorch
Okay, I've created a PyTorch `IterableDataset`:

```python
from multiprocessing import Queue

import pyarrow.dataset as ds
from torch.utils.data import IterableDataset


class IterableParquetDataset(IterableDataset):
    def __init__(self, path, process_func):
        super().__init__()
        dataset = ds.dataset(path)
        self.process_func = process_func

        # Queue up every Arrow record batch as a unit of pending work.
        self.batches = Queue()
        for batch in dataset.to_batches():
            self.batches.put(batch)

    def __iter__(self):
        while True:
            if self.batches.empty():
                self.batches.close()
                break
            # Convert the Arrow batch to plain Python, then let the
            # user-supplied function add/overwrite columns (e.g. tokenization).
            batch = self.batches.get().to_pydict()
            batch.update(self.process_func(batch))
            yield batch
```

This is simple and works perfectly with Hugging Face Transformers (plus the minimum conversions necessary). My only worry is how I'm splitting up the work across the threads.
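On that worry: a common way to shard an iterable dataset like this across `DataLoader` workers is `torch.utils.data.get_worker_info()`. A minimal sketch (not Petastorm code; the class name and round-robin scheme are assumptions):

```python
import pyarrow.dataset as ds
from torch.utils.data import IterableDataset, get_worker_info


class ShardedParquetDataset(IterableDataset):
    """Each DataLoader worker streams a disjoint subset of record batches."""

    def __init__(self, path, process_func):
        super().__init__()
        self.path = path
        self.process_func = process_func

    def __iter__(self):
        info = get_worker_info()  # None when loading in the main process
        worker_id = info.id if info is not None else 0
        num_workers = info.num_workers if info is not None else 1

        for i, batch in enumerate(ds.dataset(self.path).to_batches()):
            if i % num_workers != worker_id:
                continue  # this batch belongs to another worker
            record = batch.to_pydict()
            record.update(self.process_func(record))
            yield record
```

Used with `DataLoader(dataset, num_workers=4, batch_size=None)`, each worker then processes its own slice of row-group batches instead of sharing a queue.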
Could something like what I did above be useful for Petastorm?
Hey guys, I'm trying to create a compatibility interface between Petastorm and a few PyTorch-based libraries (PyTorch Lightning, Hugging Face Transformers, and AllenNLP) which I'm trying to use in a project. So far I've managed to get PyTorch Lightning working (pretty much research-oriented Keras), but a few design choices within Petastorm seem to prevent usage with the NLP libraries.

My problem is that `TransformSpec` requires input and output as Pandas `DataFrame`s. This at first may seem decent, but commonplace NLP libraries like Hugging Face Transformers tokenize lists of strings (this transformation is easy) and directly output tensors. These tensors aren't flat, so they can't be converted to Pandas, meaning that processing textual data (despite being fairly straightforward) seems nearly impossible with Petastorm's built-in data loaders.

I've been working on this for a week, and these are the methods I've found to mitigate the problem:

1. Modify the `TransformSpec`/PyTorch `DataLoader`/PyArrow classes
2. Loop through the `Reader` object

I've been trying to interpret how I'd do the first option (by debugging the code), however it looks extremely complicated and (I think) it would require modifying a number of classes (what modifications are needed still eludes me). On the other hand, looping through a `Reader` seems reasonable, as we can still read in strings (see the sketch below). But would doing this forfeit any optimisations/performance-boosting code from Petastorm's loader (although shouldn't these be in `Reader`)?

So, would anyone be able to provide some advice on what you believe to be the best approach/course of action (or just what might have to be coded/modified)?
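For concreteness, the loop-through-the-`Reader` option amounts to something like the following sketch (the dataset URL and the `text` column are made-up placeholders):

```python
from petastorm import make_batch_reader

# Iterate row-group batches on the main process and tokenize there;
# strings come back as-is, so no TransformSpec is involved.
with make_batch_reader("file:///tmp/my_dataset") as reader:
    for batch in reader:
        # Each batch is a namedtuple of column arrays.
        texts = list(batch.text)
        # tokenized = tokenizer(texts, padding=True)  # e.g. Hugging Face
```

As noted in the discussion above, the trade-off is that any non-trivial processing here runs on the main process, outside Petastorm's worker pool.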
Thanks so much in advance!