Creating new "ground truth" for several datasets #19

Closed
isarandi opened this issue Oct 20, 2019 · 25 comments

@isarandi

Hi, thanks for this amazing work.

Do you have any plans for running your method on other datasets and releasing the resulting poses? This would be very beneficial for correcting many ground-truth errors. Specifically, I'm thinking of MPI-INF-3DHP (some annotations are wrong) and HumanEva-I (in some sequences the head ground truth is wrong), in addition to the already mentioned H3.6M (problems with S9) and CMU-Panoptic (ground truth unavailable for many sequences, e.g. dance, plus errors).

I think your results could be better than the original ground truth in many cases. So, using e.g. leave-one-subject-out training and testing, one could generate new, polished "ground truth" for each subject of a particular dataset (to avoid memorizing the training-set errors).

@karfly
Owner

karfly commented Oct 21, 2019

Hey, thank you for your interest.

The next step for us is to add CMU Panoptic dataset support. Then we can think about adding other multi-view datasets.

We have some vague plans about annotating/reannotating datasets. Maybe the community can help us with it? 😊

@dulibubai

If I want to train the model with the CMU Panoptic dataset, does that mean I should modify the dataset preparation code following the Human3.6M processing pipeline? Could you add CMU Panoptic dataset support? Could you also add more details about CMU Panoptic training to your paper? Thanks a lot!

@karfly
Owner

karfly commented Oct 23, 2019

@dulibubai
We’re going to add CMU Panoptic dataset support soon, but if you need it right now, you can implement it yourself using Human3.6M as a reference.

What exact details of CMU Panoptic dataset training are you interested in?

@dulibubai

When I downloaded the CMU dataset from its official website, I found that most of the sequences have no labels (only some do), and most of them contain multiple people, with only a few showing a single person. So when you trained with it, which sequences did you choose, and how did you split the train, val, and test sets? Thanks again.

@karfly
Owner

karfly commented Oct 23, 2019

@dulibubai
We use train/val splits provided by authors of the original paper "Monocular Total Capture: Posing Face, Body, and Hands in the Wild".

Each scene contains multiple recorded persons, so for each person an interval is provided in the format [start_frame, end_frame]. Here is the list of scene names split into train/val (a small usage sketch follows the list):

train:
    - 171026_pose3
      - [1000, 3000]

    - 171026_pose2
      - [1000, 7500]
      - [8000, 14000]

    - 171026_pose1
      - [380, 7300]
      - [7900, 14500]
      - [15400, 22400]

    - 171204_pose4
      - [500, 4300]
      - [4900, 8800]
      - [9400, 13200]
      - [14200, 17800]
      - [18700, 22500]
      - [23050, 27050]
      - [28000, 31600]

    - 171204_pose3
      - [500, 4400]
      - [5400, 9000]

    - 171204_pose2
      - [350, 4300]
      - [5000, 8800]
      - [9600, 13600]
      - [14300, 18500]
      - [19600, 23500]
      - [24200, 28200]
      - [28800, 32800]
      - [33500, 37700]

    - 171204_pose1
      - [300, 4100]
      - [4800, 8900]
      - [10000, 13600]
      - [14000, 18200]
      - [18500, 22900]
      - [23500, 27600]

val:
    - 171204_pose5
      - [400, 4300]
      - [5000,  8500]
      - [9500, 13400]
      - [14200, 18000]
      - [19000, 22600]
      - [23500, 27100]

    - 171204_pose6
      - [1000, 4500]
      - [5150, 9100]
      - [9830, 13800]
      - [14370, 18300]
      - [19000, 22900]

@dulibubai

Thanks sincerely for sharing! As shown in the CMU dataset, every sequence has 31 cameras. How did you split the images from the 31 cameras into train and val sets? Thanks again.

@karfly
Owner

karfly commented Oct 23, 2019

@dulibubai
We used val cameras: ["00_02", "00_13", "00_16", "00_18"]

@dulibubai

Yeah! Thanks a lot!

@dulibubai

Hi! I have another question: would it be convenient to provide the 2D bbox label files (extracted by an object detection net) for every camera image of CMU?

@karfly
Owner

karfly commented Oct 24, 2019

@dulibubai

I've uploaded our Mask R-CNN detections to the Google Drive.
The format of the detection is the same as in the Human3.6M dataset:

detection == (left, upper, right, lower, confidence)
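
If it helps, here is a hedged example of consuming that tuple format (the helper name and the confidence threshold are only illustrative, not part of the released files):

def best_detection(detections, min_confidence=0.5):
    # Keep boxes above the threshold and return the most confident one
    # as (left, upper, right, lower), or None if nothing passes.
    kept = [d for d in detections if d[4] >= min_confidence]
    if not kept:
        return None
    left, upper, right, lower, _ = max(kept, key=lambda d: d[4])
    return left, upper, right, lower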

@dulibubai

Thanks a lot!

@dulibubai

@karfly ,
1) In generate-labels-npy-multiview.py, what is the effect of the square_the_bbox(bbox) function?
def square_the_bbox(bbox):
    # bbox is given as (top, left, bottom, right); grow the shorter side
    # symmetrically around its center so the box becomes square.
    top, left, bottom, right = bbox
    width = right - left
    height = bottom - top
    if height < width:
        center = (top + bottom) * 0.5
        top = int(round(center - width * 0.5))
        bottom = top + width
    else:
        center = (left + right) * 0.5
        left = int(round(center - height * 0.5))
        right = left + height
    return top, left, bottom, right
2) In human36m.py, why do you convert the bbox information from TLBR to LTRB?
bbox = shot['bbox_by_camera_tlbr'][camera_idx][[1,0,3,2]] # TLBR to LTRB
Thanks a lot!

@shrubb
Collaborator

shrubb commented Oct 27, 2019

@dulibubai

  1. Object detectors output rectangular bounding boxes with arbitrary aspect ratios (this can be true even for ground-truth bounding boxes). However, since we are training a CNN, we'd like all input images to be of the same size and, obviously, the same aspect ratio. Therefore, we decided to adjust all bounding boxes to a 1:1 height-width ratio, i.e. make them square (we could have chosen some other ratio). This function does this for one box by growing its smaller side; see the short example after this list.

  2. That is nothing important, it is there just for convenience. I think when we were writing human36m.py, some functions would already require LTRB bboxes (like crop_image(), scale_bbox()), so we had to adapt to them.
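
To make (1) concrete, here is a quick illustrative check of square_the_bbox (the input values are chosen arbitrarily):

bbox = (0, 0, 40, 100)        # (top, left, bottom, right): width 100, height 40
print(square_the_bbox(bbox))  # -> (-30, 0, 70, 100), i.e. a 100x100 square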

@dulibubai

@shrubb , thanks a lot!
I have another question for you about generate-labels-npy-multiview.py:
1) Why do you transpose R? Was R in Human3.6M stored transposed?
2) Why don't you store 'T' directly in camera_retval['t']? Is the 'T' in Human3.6M's camera parameters file not the true 't'?
camera_retval['R'] = np.array(camera_params['R']).T
camera_retval['t'] = -camera_retval['R'] @ camera_params['T']
3) When I use my own external dataset, do I not need to transpose R like this?
If I can obtain t directly, can I store it in camera_retval['t'] as-is?

Thanks very much!

@shrubb
Collaborator

shrubb commented Oct 28, 2019

@dulibubai
In our code, the projection math (for all datasets) is handled by mvn/utils/multiview.py. There, we adopted an OpenCV-like convention for the camera-model formulae (I might be wrong here). Human3.6M's intrinsics and extrinsics came in a different format: they used different projection formulae and a different distortion model. So the code you quoted simply converts the R and T shipped with Human3.6M to our format.
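
As a sketch only (my reading of the two quoted lines, not official Human3.6M documentation), the conversion amounts to producing an OpenCV-style world-to-camera pair so that X_cam = R @ X_world + t:

import numpy as np

def h36m_to_opencv_extrinsics(R_h36m, T_h36m):
    # Assumption: Human3.6M ships the rotation transposed relative to our
    # convention and T as the camera position in world coordinates.
    R = np.asarray(R_h36m).T          # world-to-camera rotation
    t = -R @ np.asarray(T_h36m)       # t = -R * camera_center
    return R, t

# For a dataset that already provides OpenCV-style extrinsics
# (X_cam = R @ X_world + t), R and t can be stored as-is: no transpose,
# no sign flip.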

@dulibubai

@shrubb , when training the model with the CMU dataset, how should the following parameters be set?
n_objects_per_epoch:
n_epochs:
Also, how much GPU memory did you use? A single GPU or multiple GPUs?
Thanks again!

@karfly
Owner

karfly commented Oct 29, 2019

@dulibubai

We used the same parameters as for Human3.6M. The paper experiments were done with a single GPU, but you can use multiple GPUs to reduce training time.

@dulibubai

@karfly
Hi! When you train the volumetric model, do you use the ground-truth pelvis position or the one predicted by the Algebraic model? This is unclear from the volumetric results in your paper.
Thanks!

@karfly
Owner

karfly commented Nov 5, 2019

@dulibubai
Hey! We use the predictions of the Algebraic method.

@dulibubai

@karfly
Hi! If you train the Algebraic model on Human3.6M, and then train the Volumetric model on Human3.6M using the pelvis position predicted by that same Algebraic model (also trained on Human3.6M), that does not seem reasonable to me, because the pelvis predictions come from a model that was already trained on Human3.6M.

@karfly
Owner

karfly commented Nov 5, 2019

@dulibubai
Why? In such a scenario it's absolutely fair, and there is no data leak into the validation data, so I think it's reasonable.

@dulibubai

@karfly
Okay! I get it. Thanks.

@Samleo8

Samleo8 commented May 29, 2020

@dulibubai
We used val cameras: ["00_02", "00_13", "00_16", "00_18"]

Hi @karfly, what about the cameras for training? Did you just use all the other cameras? Also, in my own attempts to test/train (#75 #77), I found that the projection matrix data for cameras 25 and 29 were off.

@karfly
Owner

karfly commented May 29, 2020

Hi, @Samleo8!
Yes, we just used all the other cameras. I don't remember whether some of them were missing a projection matrix, but I think it's okay to remove such cameras from training; it shouldn't influence the result too much.
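
For illustration only (not our actual code; the "00_00" through "00_30" HD camera naming and the constant names are assumptions), the train camera list can be built by exclusion:

VAL_CAMERAS = ["00_02", "00_13", "00_16", "00_18"]
ALL_HD_CAMERAS = ["00_%02d" % i for i in range(31)]

# Train on every camera that is not in the validation set, optionally also
# dropping cameras whose projection matrices look wrong (e.g. 25 and 29,
# as reported above).
TRAIN_CAMERAS = [cam for cam in ALL_HD_CAMERAS
                 if cam not in VAL_CAMERAS and cam not in ["00_25", "00_29"]]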

@fxyQAQ

fxyQAQ commented Jun 25, 2023

Hello, could you upload that Google Drive file again?
