forked from opendatalab/MinerU
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
xu rui committed Dec 10, 2024
1 parent 959f986, commit 43a571c
Showing 12 changed files with 234 additions and 18 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
|
||
|
||
Convert Word | ||
============= | ||
|
||
.. admonition:: Warning | ||
    :class: tip | ||
|
||
    MS-Office files are first converted to PDF by third-party software before they are processed. | ||
|
||
    For some MS-Office files, the converted PDF may be of low quality, which degrades the final output. | ||
|
||
.. code:: python | ||
|
||
    import os | ||
|
||
    from magic_pdf.data.data_reader_writer import FileBasedDataWriter | ||
    from magic_pdf.data.read_api import read_local_office | ||
    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze | ||
|
||
    # Prepare the output directories and writers | ||
    local_image_dir, local_md_dir = "output/images", "output" | ||
    image_dir = os.path.basename(local_image_dir) | ||
    os.makedirs(local_image_dir, exist_ok=True) | ||
    image_writer = FileBasedDataWriter(local_image_dir) | ||
    md_writer = FileBasedDataWriter(local_md_dir) | ||
|
||
    # Create the dataset instance from the Office file | ||
    input_file = "some_doc.doc"  # replace with a real MS-Office file | ||
    input_file_name = os.path.splitext(os.path.basename(input_file))[0] | ||
    ds = read_local_office(input_file)[0] | ||
|
||
    # Run the analysis in OCR mode and write the Markdown output | ||
    ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md( | ||
        md_writer, f"{input_file_name}.md", image_dir | ||
    ) | ||
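The warning in this file says Office documents are converted to PDF by third-party software but does not name the tool. As a rough illustration only, the sketch below assumes LibreOffice's `soffice` command-line converter; the helper names `build_convert_command` and `convert_to_pdf` are hypothetical and are not part of `magic_pdf`.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_convert_command(input_file: str, out_dir: str) -> List[str]:
    """Build a LibreOffice command line that converts one Office file to PDF.

    Assumption: LibreOffice is the converter; the documentation does not say so.
    """
    return [
        "soffice", "--headless", "--convert-to", "pdf",
        "--outdir", out_dir, input_file,
    ]


def convert_to_pdf(input_file: str, out_dir: str) -> Optional[Path]:
    """Run the conversion if `soffice` is on PATH and return the expected PDF path."""
    if shutil.which("soffice") is None:
        return None  # converter not installed; the caller decides what to do
    subprocess.run(build_convert_command(input_file, out_dir), check=True)
    return Path(out_dir) / (Path(input_file).stem + ".pdf")
```

Converting a document by hand this way and opening the resulting PDF is one way to check whether poor output comes from the conversion step or from the later analysis.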
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
|
||
Convert DocX | ||
============= | ||
|
||
.. admonition:: Warning | ||
    :class: tip | ||
|
||
    MS-Office files are first converted to PDF by third-party software before they are processed. | ||
|
||
    For some MS-Office files, the converted PDF may be of low quality, which degrades the final output. | ||
|
||
|
||
.. code:: python | ||
|
||
    import os | ||
|
||
    from magic_pdf.data.data_reader_writer import FileBasedDataWriter | ||
    from magic_pdf.data.read_api import read_local_office | ||
    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze | ||
|
||
    # Prepare the output directories and writers | ||
    local_image_dir, local_md_dir = "output/images", "output" | ||
    image_dir = os.path.basename(local_image_dir) | ||
    os.makedirs(local_image_dir, exist_ok=True) | ||
    image_writer = FileBasedDataWriter(local_image_dir) | ||
    md_writer = FileBasedDataWriter(local_md_dir) | ||
|
||
    # Create the dataset instance from the Office file | ||
    input_file = "some_docx.docx"  # replace with a real MS-Office file | ||
    input_file_name = os.path.splitext(os.path.basename(input_file))[0] | ||
    ds = read_local_office(input_file)[0] | ||
|
||
    # Run the analysis in OCR mode and write the Markdown output | ||
    ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md( | ||
        md_writer, f"{input_file_name}.md", image_dir | ||
    ) | ||
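The Markdown file name in these examples is derived from the input file name. The sketch below uses only the standard library to show one way of doing that which also copes with directory components and extra dots, next to a naive `split(".")` for comparison. The helper name `output_stem` and the sample path `reports/q3.final.docx` are illustrative assumptions.

```python
import os


def output_stem(input_file: str) -> str:
    """Return the file name without its directory or final extension."""
    return os.path.splitext(os.path.basename(input_file))[0]


print(output_stem("some_docx.docx"))          # some_docx
print(output_stem("reports/q3.final.docx"))   # q3.final
print("reports/q3.final.docx".split(".")[0])  # reports/q3
```

The last line shows why splitting on the first dot is fragile: the directory stays in the name and everything after the first dot is lost.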
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
|
||
|
||
Convert Image | ||
=============== | ||
|
||
.. code:: python | ||
|
||
    import os | ||
|
||
    from magic_pdf.data.data_reader_writer import FileBasedDataWriter | ||
    from magic_pdf.data.read_api import read_local_images | ||
    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze | ||
|
||
    # Prepare the output directories and writers | ||
    local_image_dir, local_md_dir = "output/images", "output" | ||
    image_dir = os.path.basename(local_image_dir) | ||
    os.makedirs(local_image_dir, exist_ok=True) | ||
    image_writer = FileBasedDataWriter(local_image_dir) | ||
    md_writer = FileBasedDataWriter(local_md_dir) | ||
|
||
    # Create the dataset instance from the image file | ||
    input_file = "some_image.jpg"  # replace with a real image file | ||
    input_file_name = os.path.splitext(os.path.basename(input_file))[0] | ||
    ds = read_local_images(input_file)[0] | ||
|
||
    # Run the analysis in OCR mode and write the Markdown output | ||
    ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md( | ||
        md_writer, f"{input_file_name}.md", image_dir | ||
    ) |
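The example passes `image_dir`, the last component of `local_image_dir`, to `dump_md`. Assuming that argument is used as the prefix for image links inside the generated Markdown (the source does not state this), the sketch below shows why a relative prefix keeps the links valid when the Markdown file sits in `output` and the images in `output/images`. The image name `figure_0.jpg` is hypothetical.

```python
import posixpath

local_image_dir, local_md_dir = "output/images", "output"

# Last path component only, so links are relative to the Markdown file.
image_dir = posixpath.basename(local_image_dir)

md_path = posixpath.join(local_md_dir, "some_image.md")
image_ref = posixpath.join(image_dir, "figure_0.jpg")  # hypothetical image name

# Where a Markdown viewer would look for the image, relative to the .md file.
resolved = posixpath.normpath(
    posixpath.join(posixpath.dirname(md_path), image_ref)
)

print(image_ref)  # images/figure_0.jpg
print(resolved)   # output/images/figure_0.jpg
```

Because the link is relative, the whole `output` directory can be moved without breaking the image references.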
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
|
||
|
||
Convert PPTX | ||
================= | ||
|
||
.. admonition:: Warning | ||
    :class: tip | ||
|
||
    MS-Office files are first converted to PDF by third-party software before they are processed. | ||
|
||
    For some MS-Office files, the converted PDF may be of low quality, which degrades the final output. | ||
|
||
|
||
|
||
.. code:: python | ||
|
||
    import os | ||
|
||
    from magic_pdf.data.data_reader_writer import FileBasedDataWriter | ||
    from magic_pdf.data.read_api import read_local_office | ||
    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze | ||
|
||
    # Prepare the output directories and writers | ||
    local_image_dir, local_md_dir = "output/images", "output" | ||
    image_dir = os.path.basename(local_image_dir) | ||
    os.makedirs(local_image_dir, exist_ok=True) | ||
    image_writer = FileBasedDataWriter(local_image_dir) | ||
    md_writer = FileBasedDataWriter(local_md_dir) | ||
|
||
    # Create the dataset instance from the Office file | ||
    input_file = "some_pptx.pptx"  # replace with a real MS-Office file | ||
    input_file_name = os.path.splitext(os.path.basename(input_file))[0] | ||
    ds = read_local_office(input_file)[0] | ||
|
||
    # Run the analysis in OCR mode and write the Markdown output | ||
    ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md( | ||
        md_writer, f"{input_file_name}.md", image_dir | ||
    ) |
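The four examples in this commit differ only in the reader they call: `read_local_office` for `.doc`, `.docx` and `.pptx`, and `read_local_images` for `.jpg`. A minimal sketch of choosing the reader from the file extension follows. The helper `pick_reader_name` is hypothetical, and the extension sets contain only the types shown in these examples; the library may accept more.

```python
import os

# Extensions taken from the documentation examples only (an assumption,
# not an exhaustive list of what magic_pdf supports).
OFFICE_EXTS = {".doc", ".docx", ".pptx"}
IMAGE_EXTS = {".jpg"}


def pick_reader_name(input_file: str) -> str:
    """Return the name of the read_api function matching the file extension."""
    ext = os.path.splitext(input_file)[1].lower()
    if ext in OFFICE_EXTS:
        return "read_local_office"
    if ext in IMAGE_EXTS:
        return "read_local_images"
    raise ValueError(f"unsupported file type: {ext or '(none)'}")


print(pick_reader_name("some_pptx.pptx"))  # read_local_office
print(pick_reader_name("some_image.JPG"))  # read_local_images
```

Returning the function name as a string keeps the sketch runnable without `magic_pdf` installed; real code would return the imported function itself.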
This file was deleted.