Can images be extracted by parsing the PDF produced by a TRUMPF laser machine? (A record of the attempt)

Coding Life 2023. 4. 13. 12:53

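There is a part of my project that absolutely has to work, so I am documenting the process here.

I am investigating whether pdf.js can be used to extract the .geo files inside a particular PDF in the browser, or to save them as data.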

Question 1) Asking GPT

In the code above, when the images that make up the PDF have a .geo extension, the images do not show up. Is there a way to fix this?

GPT's answer:

Images stored inside a PDF file can be in a variety of image formats. However, the PDF.js library supports only JPEG, PNG, and GIF images by default. Images with a .geo extension are therefore not directly supported by PDF.js.

If an image has a .geo extension, one option is to convert it to another format and insert it back into the PDF file. For example, you can load the image file, convert it to a different image format, and then re-insert the converted image in a format that PDF.js supports. This requires a program that can convert it to another image format.

Another option is to insert the .geo image file directly into an HTML file and display the PDF that way. In this case the PDF has to be rendered into the Canvas element supported by HTML5, so instead of pdfjsLib.getDocument() you need to load the PDF file with a call of the form pdfjsLib.getDocument({ data: pdfData }).

I tried writing code along the lines of the answer above.

Existing code (it reports that images exist, but every one of them shows up as undefined)

const url = 'test.pdf';  // contains 10 images
// const url = 'test1_4.pdf';  // contains 4 images
const pdfjsLib = window['pdfjs-dist/build/pdf'];

pdfjsLib.GlobalWorkerOptions.workerSrc = '//mozilla.github.io/pdf.js/build/pdf.worker.js';
pdfjsLib.getDocument(url).promise.then(function(pdfDoc) {
  const imageList = [];

  for (let i = 1; i <= pdfDoc.numPages; i++) {
    pdfDoc.getPage(i).then(function(page) {
      const operatorList = page.getOperatorList();
      const opListPromise = operatorList.then(function(opList) {
        const fnArray = opList.fnArray;
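        // NOTE: find() returns the matching element (the opcode itself), not its position.
        const imageData = fnArray.find(function(elem, index) {
          return elem === pdfjsLib.OPS.paintImageXObject && fnArray[index - 1] === pdfjsLib.OPS.save;
        });
        if (imageData) {
          // NOTE: indexOf() therefore finds the first paintImageXObject, whether or not save precedes it.
          const imageIndex = fnArray.indexOf(imageData);
          // NOTE: operatorList is the Promise, not the resolved opList, so argsArray is undefined here.
          const imageRef = operatorList.argsArray[imageIndex][0];
          const imageObj = page.objs.get(imageRef);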
          imageList.push(imageObj);
        }
      });

      if (opListPromise) {
        opListPromise.then(function() {
          if (imageList.length === pdfDoc.numPages) {
            console.log('imageList');
            console.log(imageList);
          }
        });
      } else {
        if (imageList.length === pdfDoc.numPages) {
          console.log('imageList');
          console.log(imageList);
        }
      }
    });
  }
});

Result in the console

No images are output; it behaves as if there were none.
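
One way to narrow down why nothing is logged is to look at which operators a page actually contains. The sketch below is plain JavaScript with no pdf.js dependency. `collectImageNames` and `countOperators` are hypothetical helper names, and `opList` and `OPS` are assumed to have the shapes of the object resolved by `page.getOperatorList()` and of `pdfjsLib.OPS`. It collects the name argument of every `paintImageXObject` operator without requiring a preceding `save`, and counts operators by name.

```javascript
// Hypothetical helpers. opList is assumed to look like
// { fnArray: number[], argsArray: any[][] } and OPS like pdfjsLib.OPS
// (a map from operator name to numeric code).

// Return the first argument (the object name) of every paintImageXObject operator.
function collectImageNames(opList, OPS) {
  const names = [];
  opList.fnArray.forEach(function (fn, index) {
    if (fn === OPS.paintImageXObject) {
      names.push(opList.argsArray[index][0]);
    }
  });
  return names;
}

// Count how often each operator occurs, keyed by operator name.
function countOperators(opList, OPS) {
  const nameByCode = {};
  Object.keys(OPS).forEach(function (name) {
    nameByCode[OPS[name]] = name;
  });
  const counts = {};
  opList.fnArray.forEach(function (fn) {
    const name = nameByCode[fn] || String(fn); // fall back to the raw code if unknown
    counts[name] = (counts[name] || 0) + 1;
  });
  return counts;
}
```

In the browser these would be called as `collectImageNames(opList, pdfjsLib.OPS)` inside the `getOperatorList()` callback. If the list comes back empty while the counts are dominated by path-drawing operators, the drawing is probably stored as vector graphics rather than as an image, although that is only a guess until checked against the actual file.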

Next I will try building it from the answer ChatGPT gave.

Instead of pdfjsLib.getDocument(), the PDF file has to be loaded with a call of the form pdfjsLib.getDocument({ data: pdfData }).

Modified code

const pdfjsLib = window['pdfjs-dist/build/pdf'];
pdfjsLib.GlobalWorkerOptions.workerSrc = '//mozilla.github.io/pdf.js/build/pdf.worker.js';

const pdfData = new Uint8Array(data); // data holds the raw PDF bytes (e.g. an ArrayBuffer)
pdfjsLib.getDocument({ data: pdfData }).promise.then(function(pdfDoc) {
  console.log('PDF document loaded!');
  // do the required work here
}, function(error) {
  console.error('Error while loading the PDF document:', error);
});
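
Because `data` has to hold the raw bytes of the PDF already, a quick sanity check before handing it to `getDocument` can save some debugging. This is a minimal sketch, assuming `data` is an `ArrayBuffer` or a `Uint8Array`; `looksLikePdf` is a hypothetical helper name, not part of pdf.js. It only checks for the `%PDF-` signature that PDF files normally start with.

```javascript
// Hypothetical helper: returns true if the bytes start with the "%PDF-" signature.
function looksLikePdf(data) {
  // Accept either a Uint8Array or something a Uint8Array can wrap (e.g. an ArrayBuffer).
  const bytes = data instanceof Uint8Array ? data : new Uint8Array(data);
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  if (bytes.length < signature.length) {
    return false;
  }
  return signature.every(function (value, index) {
    return bytes[index] === value;
  });
}
```

If the response arrived as text instead of binary, this check returns false, which points at the request rather than at pdf.js.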

The same thing done with jQuery's ajax

const pdfUrl = 'path/to/pdf/file.pdf'; // path to the PDF file

$.ajax({
  url: pdfUrl,
  method: 'GET',
  // $.ajax has no top-level responseType option; set it on the underlying XHR instead
  xhrFields: { responseType: 'arraybuffer' },
}).done(function(data) {
  const pdfData = new Uint8Array(data); // convert the ArrayBuffer to a Uint8Array

  // load the PDF with pdf.js
  pdfjsLib.getDocument({ data: pdfData }).promise.then(function(pdfDoc) {
    const imageList = [];

    for (let i = 1; i <= pdfDoc.numPages; i++) {
      pdfDoc.getPage(i).then(function(page) {
        return page.getOperatorList().then(function(opList) {
          const fnArray = opList.fnArray;
          // findIndex returns the operator's position; find would only return the opcode
          const imageIndex = fnArray.findIndex(function(elem, index) {
            return elem === pdfjsLib.OPS.paintImageXObject && fnArray[index - 1] === pdfjsLib.OPS.save;
          });
          if (imageIndex === -1) {
            return undefined; // no matching image operator on this page
          }
          // argsArray belongs to the resolved opList, not to the promise
          const imageRef = opList.argsArray[imageIndex][0];
          // throws if the object has not been resolved yet
          return page.objs.get(imageRef);
        });
      }).then(function(image) {
        imageList.push(image); // one entry per page (undefined if no image was found)
        if (imageList.length === pdfDoc.numPages) {
          console.log('imageList');
          console.log(imageList);
        }
      }).catch(function(error) {
        console.error('Failed to read page ' + i + ':', error);
      });
    }
  });
}).fail(function(jqXHR, textStatus, errorThrown) {
  console.log('Could not load the PDF file.');
});
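
If an image object does come back, it still has to be turned into pixels that a canvas can draw. The sketch below assumes the object has the shape `{ width, height, kind, data }` with `data` holding tightly packed 24-bit RGB bytes. That shape is an assumption: it varies between pdf.js versions, and some builds may return a bitmap instead. `rgbToRgba` is a hypothetical helper name.

```javascript
// Hypothetical helper: expand tightly packed RGB bytes to RGBA with full opacity.
function rgbToRgba(rgb, width, height) {
  const pixelCount = width * height;
  if (rgb.length !== pixelCount * 3) {
    throw new Error('Expected ' + pixelCount * 3 + ' bytes of RGB data, got ' + rgb.length);
  }
  const rgba = new Uint8ClampedArray(pixelCount * 4);
  for (let i = 0, j = 0; i < rgb.length; i += 3, j += 4) {
    rgba[j] = rgb[i];         // red
    rgba[j + 1] = rgb[i + 1]; // green
    rgba[j + 2] = rgb[i + 2]; // blue
    rgba[j + 3] = 255;        // alpha: fully opaque
  }
  return rgba;
}
```

In the browser the result could be wrapped with `new ImageData(rgba, width, height)` and drawn with `ctx.putImageData(...)`, after which `canvas.toDataURL()` gives something that can be saved.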

I plan to keep adding further findings here as the investigation continues.

License: Attribution, NonCommercial, NoDerivatives
